VP, Site Reliability Engineer at Galaxy Digital

Who You Are:
You are a Senior SRE specializing in AWS and containerized infrastructure. You thrive working hands-on, tackling migration from legacy VMs to container ecosystems with a focus on EKS, automation, and reliability.

What You’ll Do:

Reliability Engineering:
Architect, deploy, and maintain robust, scalable, secure AWS-based infrastructure.
Drive adoption and optimization of EKS and Kubernetes for containerized workloads.
Support migration initiatives, moving workloads from legacy VMs to containers in AWS.
Implement and fine-tune SLOs, SLAs, and error budgets to balance innovation and stability.
Collaborate on best practices with Security and Engineering teams for workload reliability.

Automation & Infrastructure as Code:
Build Infrastructure as Code (IaC) with Terraform; maintain compliant, repeatable environments.
Enhance CI/CD pipelines for efficient, secure, and reliable cloud delivery.
Develop and refine automated solutions for autoscaling, failover, and disaster recovery.

Observability & Incident Response:
Design and implement metrics, logging, and tracing tools (Datadog, OpenTelemetry).
Set up robust monitoring and alerting to proactively detect and address failures.
Lead incident analysis and post-mortems; drive improvements in operational playbooks.

AWS & Cloud SME:
Serve as a subject matter expert for AWS, EKS, and cloud-native tooling within the SRE team.
Optimize AWS resources, cost management, and resiliency best practices.
Ensure secure key management and regulatory compliance for decentralized workloads.

What We’re Looking For:
8+ years in SRE, DevOps, or Infrastructure Engineering (IC capacity preferred).
Deep hands-on expertise in AWS, Kubernetes/EKS, and containerization.
Extensive IaC experience (Terraform) and cloud-native automation.
Proven track record migrating VM-based workloads to containers in AWS at scale.
Strong experience with observability stacks (Datadog, Prometheus, Grafana, OpenTelemetry).
Excellent analytical, problem-solving, and incident management abilities.
Clear communicator who thrives in team environments, collaborating cross-functionally.

Bonus Points:
Experience supporting blockchain infrastructure is a strong plus.