Senior Infrastructure Engineer

at Cosine

Posted on November 23rd, 2025

Job Description

Senior Infrastructure Engineer at Cosine (W23)
£75K - £100K GBP
Fully Agentic SWE
London, England, GB
Full-time
US citizen/visa only
3+ years

About Cosine

Cosine is a fully agentic SWE that allows you to instantly work on every single ticket at once, asynchronously.

About the Role

Skills: Kubernetes, Amazon Web Services (AWS)

We’re looking for a Senior Platform / Infra Engineer to own the core infrastructure that powers Cosine’s products, from Kubernetes and deployment pipelines to networking and platform services. You’ll design and run the “paved road” that our engineers, researchers, and customers build on: reliable Kubernetes clusters, fast and safe CI/CD, solid observability, and hardened environments for demanding enterprise and on-prem deployments.

You’ll also wear a classic “DevOps/SRE” hat: thinking in SLOs, running incident response, and keeping us up even as we move quickly.

This is a high-ownership role at a fast-paced, venture-backed Silicon Valley startup. You’ll work directly with founding engineers and leadership, and your decisions will materially shape how we build and ship products.

What You’ll Do

Own core infrastructure
Design, operate, and evolve our Kubernetes-based platform (EKS or similar), including cluster topology, node groups, autoscaling, and multi-environment isolation.
Manage supporting cloud resources: container registries, load balancers, queues, caches, and data infra needed to run our APIs and agents.

Build the deployment & tooling layer
Design and maintain CI/CD pipelines for image builds and infra rollouts (e.g. Pulumi/Terraform + Helm/Docker).
Implement safe rollout strategies (blue/green, canary, staged rollouts) and fast rollback paths.
Build internal tools and abstractions that make it easy for product teams to self-serve infra safely.

Own reliability & operations (SRE-ish)
Define and track SLOs/SLIs for key services (latency, error rates, availability).
Improve our observability stack (metrics, logs, traces, alerts) so issues are obvious, actionable, and debuggable.
Participate in the on-call rotation, lead incident response when needed, and drive blameless post-mortems and fixes.

Shape networking & security
Design and maintain networking: VPCs, subnets, ingress/egress, service meshes / L7 routing, DNS, and TLS.
Implement least-privilege access via IAM, secure secret management, and hardened configurations for multi-tenant and isolated customer environments.
Help design patterns for secure enterprise and on-prem / regulated deployments.

Partner with product & research
Work closely with application, ML, and research teams to understand their needs and translate them into reusable infra building blocks.
Provide guidance on “how to run this in production”: capacity planning, failure modes, and operational readiness reviews.

You Might Be a Great Fit If You

Have strong experience
5+ years building and operating production infrastructure on a major cloud (AWS, GCP, or Azure).
Significant hands-on experience running Kubernetes in production (EKS/GKE/AKS or self-managed):
  Cluster upgrades, autoscaling, node group design, and multi-env setups.
  Helm or similar for packaging services.

Think in infrastructure-as-code
Deep experience with IaC tools (Pulumi, Terraform, CDK, or similar).
Comfortable managing infra changes via code review, CI, and automated rollouts.

Care deeply about reliability
Have owned the uptime and performance of user-facing systems.
Comfortable participating in (and improving) on-call rotations and incident management.
Experience setting up / tuning observability (Prometheus, Grafana, CloudWatch, OpenTelemetry, etc.).

Build great tooling & abstractions
You’ve built internal tools, libraries, or platforms on top of cloud providers so product teams can move faster with fewer foot-guns.
You think about developer experience and “golden paths,” not just raw infra.

Are comfortable in code
Strong scripting and programming skills in at least one modern language (e.g. TypeScript, Go, Python).
Happy to dive into app code when needed to debug a production issue or improve an integration.

Have the startup mindset
Enjoy working in a fast-moving environment with evolving priorities and incomplete specs.
Bias toward pragmatic solutions: ship something small, measure, iterate.
Communicate clearly, give/receive direct feedback, and collaborate across functions.

Nice to Have (Not Required)

Experience with:
  AWS primitives like EKS, ECS/Fargate, ECR, SQS, ElastiCache/Redis.
  Argo CD or other GitOps tools for Kubernetes.
  On-prem, air-gapped, or regulated industry deployments (e.g. finance, healthcare).
  AI/ML infrastructure (GPU workloads, model hosting, feature stores).
Prior experience as an early infra / platform hire at a startup.

Location

London

England

Salary

£75K - £100K GBP

Experience

3+ Years