Site Reliability Engineer / Tech for Good / Kubernetes, Terraform, IaC / Series A at TrustIn
This health tech company is looking for an SRE who’ll own the entire production environment and improve reliability at every layer, from Kubernetes infrastructure to application-level observability and performance. This is a rare role that demands both deep infrastructure expertise and strong backend engineering ability.
WHAT YOU'LL DO:
• Design, implement, and maintain infrastructure at scale — including 500+ machine deployments — using Kubernetes, Helm, and Terraform IaC. You're accountable for 99.9%+ uptime.
• Set reliability targets for every service in partnership with engineering and product, then monitor them, respond to violations, and drive durable fixes rather than symptom management.
• Roll out consistent metrics, OpenTelemetry traces, and structured logging across services — with dashboards and alerts that surface what matters before it pages you.
• Use trace and metrics data to identify bottlenecks — slow endpoints, expensive queries, queue backlogs — and ship fixes in application code and configuration, not just infra-level workarounds.
• Lead post-incident reviews with a bias toward prevention — shipping instrumentation, tests, and guardrails that mean the same incident doesn't happen twice.
• Optimise CI/CD pipelines across TypeScript and Python/ML stacks, streamline developer workflows, and make deploying fast, safe, and boring.
WHAT WE'RE LOOKING FOR:
✓ Several years as a backend software engineer: you can read, write, and improve application code, not just configure infrastructure around it
✓ Deep Kubernetes expertise: you've run it at scale, managed production incidents, and understand the operational edge cases
✓ Hands-on Terraform experience: you've owned IaC projects, not just made small changes to existing configs
✓ Strong observability instincts: OpenTelemetry, metrics, structured logging, and SLO-based alerting feel like natural tools to you
✓ Comfortable with database performance: indexing, query optimisation, connection pooling, and caching aren't foreign territory