Site Reliability Engineer at Geomotiv
ABOUT THE PROJECT:
The project is a cutting-edge healthcare technology provider dedicated to transforming patient care through real-time data insights. By leveraging advanced machine learning and programmatic automation, it empowers healthcare organizations to deeply understand patient data, ultimately accelerating and improving the accuracy of clinical diagnoses and solutions. The core mission is to elevate the predictive capabilities of modern healthcare through a comprehensive suite of analytics solutions.
*Working hours: overlap required until 1 or 2 pm EST.*
WHAT YOU’LL BE UP TO:
Architect, scale, and maintain the existing Kubernetes-based platform to ensure long-term stability, performance, and adaptability;
Gather and analyze requirements from development, data management, and SRE teams to continuously evolve the platform to meet business needs;
Develop and implement missing functionalities within the open-source tools that serve as our core K8s platform components;
Engage with open-source community maintainers to integrate our customizations upstream. For custom forks, manage and merge necessary bug fixes from upstream projects;
Build and manage CI/CD pipelines to run regression tests and securely release new versions of component images;
Rigorously test new platform component versions and orchestrate seamless, zero-downtime upgrades for production clusters;
Proactively identify system limitations, evaluate alternative platform components, and stage proof-of-concept migrations to validate performance benefits;
Partner with the SRE team to design and automate cluster diagnostics and rapid issue recovery workflows;
Guide and assist development teams in configuring their own CI/CD pipelines and successfully implementing GitOps practices for their specific workloads.
WHAT YOU’LL NEED TO HAVE:
5+ years of experience running, bootstrapping, and automating highly available, secure Kubernetes clusters in production environments;
Deep knowledge of Kubernetes concepts, APIs, and inner workings (both control plane and kubelet);
Proven ability to port legacy applications into a Kubernetes cluster;
Experience managing persistent state in a cluster using reliable storage tools;
Experience with bare-metal/on-premises environments;
2+ years of Golang development, specifically with async programming and TDD;
Proficiency with Golang’s Kubernetes client library and the ability to customize open-source Golang tools;
Strong grasp of GitOps principles and hands-on experience with tools like Argo CD or Flux;
Ability to automate infrastructure tasks and build CI/CD pipelines;
Strong networking knowledge, specifically involving LVS, BGP, and Linux firewalls;
Experience setting up external ingress into K8s clusters using Project Contour, Envoy, HAProxy, or Nginx reverse proxies;
Ability to configure monitoring/alerting using Prometheus, including writing custom metric exporters for legacy apps;
Experience running centralized log aggregation using Elasticsearch stacks or Loki;
Proven ability to implement issue detection and recovery mechanisms using K8s;
Ability to troubleshoot and resolve production issues under pressure;
Strong cross-functional communication skills to collect requirements and deliver solutions for SRE, network engineering, and application development teams;
English B2+.
NICE TO HAVE:
Ability to proactively detect or automatically resolve production issues before they impact revenue;
Active contributions to Golang-based Kubernetes open-source projects;
Experience running workloads in GCP or AWS;
Familiarity with Kube-router, Kubeadm, and Puppet;
Knowledge of specific storage solutions like Rook and Ceph;
Experience packaging applications as Helm charts;
Familiarity with Java and Kubernetes-native Java frameworks (e.g., Quarkus).
INTERVIEW STEPS:
HR interview + English check;
Technical interview (2 rounds);
Additional interview with headquarters.