Founding Platform Engineer, Data & ML Systems

Posted on April 16th, 2026

Job Description

About CellType

CellType is building foundation models and agent systems for biology. We believe the next major advances in biotech AI will come from rich biological data, strong model systems, and reliable infrastructure working together. We work with pharma and biotech partners on problems such as preclinical-to-clinical translation, response prediction, biomarker discovery, and scientific reasoning across complex biological datasets. We are building the core intelligence layer for biology, and that requires a world-class data and ML platform.

About the role

We are hiring a Founding Platform Engineer to build the infrastructure backbone behind our training, evaluation, and inference stack. We are looking for someone who can build the systems that make biological data usable for model development at speed and at scale: ingestion, indexing, search, retrieval, dataset interfaces, reproducibility, validation, orchestration, observability, and distributed performance. You will work on the full path from raw data to training-ready datasets to reliable production workflows. The right person will make it dramatically easier for the rest of the team to build, evaluate, and ship models.

What you'll do

- Build and maintain data infrastructure for model training, evaluation, and inference
- Design and scale high-performance inference serving systems for biological foundation models
- Design standardized dataset interfaces so biological data is consistent, discoverable, and easy to use across the team
- Build ingestion and processing pipelines for public, proprietary, and customer datasets
- Build indexing, search, and retrieval systems that make large datasets queryable and useful in practice
- Establish safeguards and validation systems so datasets are reproducible, versioned, and trustworthy once standardized
- Improve throughput, latency, and reliability of distributed data loading and ML pipelines
- Profile and eliminate performance bottlenecks across GPU, networking, and storage layers
- Automate fault detection and recovery for serving and training systems
- Build internal tools for dataset inspection, debugging, quality control, and operational visibility
- Partner closely with ML engineers and researchers so the platform fits real workflows rather than abstract platform ideals
- Help define how we handle permissions, privacy, compliance boundaries, and operational rigor for sensitive biological and customer data

You may be a fit if you

- Have deep experience in backend, infrastructure, distributed systems, or data platform engineering
- Have built scalable data pipelines or stateful distributed systems in production
- Have experience building or operating large-scale inference or training systems
- Have a deep understanding of GPU execution constraints, memory trade-offs, and data-loading bottlenecks around training workloads
- Have experience with dataset infrastructure for large-scale ML systems, training pipelines, or inference-adjacent systems
- Have worked with multimodal or very large datasets that cannot simply fit in memory
- Have hands-on experience with data indexing, search, or retrieval infrastructure, and understand how to make large datasets discoverable, queryable, and usable in practice
- Can reason about system-level trade-offs between latency, throughput, and cost
- Have experience working with privacy-sensitive or compliance-sensitive data systems
- Have built internal developer tools for ML or data teams
- Have a track record of owning critical production infrastructure
- Are comfortable designing APIs, modular abstractions, and internal platform interfaces with strong attention to user experience
- Have strong instincts around reliability, reproducibility, and operational simplicity
- Are comfortable with cloud infrastructure, containers, Kubernetes, Infrastructure-as-Code, CI/CD, and observability
- Produce maintainable code and make pragmatic architecture decisions under time pressure
- Thrive in a small team where ownership is broad and priorities can change quickly

We'd be especially excited if you also have

- Experience with biological, genomic, or scientific data formats and workflows
- Contributions to open-source data or ML infrastructure projects
- Experience building streaming or real-time data systems
- Background in database internals, storage engines, or query optimization
- Experience designing systems that serve both batch training and low-latency inference workloads

At CellType, the quality of our data and ML platform directly determines research speed, model quality, and customer trust. The right person will make the entire company faster and will shape the foundation we build on for years. If you want to build the systems layer behind frontier AI for biology, we'd love to talk.

Location

New York

Salary

$145K - $250K

Experience

3+ Years