About CellType
CellType is building foundation models and agent systems for biology.
We believe the next major advances in biotech AI will come from rich biological data, strong model systems, and reliable infrastructure working together. We work with pharma and biotech partners on problems such as preclinical-to-clinical translation, response prediction, biomarker discovery, and scientific reasoning across complex biological datasets.
We are building the core intelligence layer for biology, and that requires a world-class data and ML platform.
About the role
We are hiring a Founding Platform Engineer to build the infrastructure backbone behind our training, evaluation, and inference stack.
We are looking for someone who can build the systems that make biological data usable for model development at speed and at scale. That covers ingestion, indexing, search, retrieval, dataset interfaces, reproducibility, validation, orchestration, observability, and distributed performance.
You will work on the full path from raw data to training-ready datasets to reliable production workflows. The right person will make it dramatically easier for the rest of the team to build, evaluate, and ship models.
What you'll do
Build and maintain data infrastructure for model training, evaluation, and inference
Design and scale high-performance inference serving systems for biological foundation models
Design standardized dataset interfaces so biological data is consistent, discoverable, and easy to use across the team
Build ingestion and processing pipelines for public, proprietary, and customer datasets
Build indexing, search, and retrieval systems that make large datasets queryable and useful in practice
Establish safeguards and validation systems so that standardized datasets stay reproducible, versioned, and trustworthy
Improve throughput, latency, and reliability of distributed data loading and ML pipelines
Profile and eliminate performance bottlenecks across GPU, networking, and storage layers
Automate fault detection and recovery for serving and training systems
Build internal tools for dataset inspection, debugging, quality control, and operational visibility
Partner closely with ML engineers and researchers so the platform fits real workflows rather than abstract platform ideals
Help define how we handle permissions, privacy, compliance boundaries, and operational rigor for sensitive biological and customer data
You may be a fit if you
Have deep experience in backend, infrastructure, distributed systems, or data platform engineering
Have built scalable data pipelines or stateful distributed systems in production
Have experience building or operating large-scale inference or training systems
Have a deep understanding of GPU execution constraints, memory trade-offs, and data-loading bottlenecks in training workloads
Have experience with dataset infrastructure for large-scale ML systems, training pipelines, or inference-adjacent systems
Have worked with multimodal or very large datasets that are too large to fit in memory
Have hands-on experience with data indexing, search, or retrieval infrastructure, and understand how to make large datasets discoverable, queryable, and usable in practice
Can reason about system-level trade-offs between latency, throughput, and cost
Have experience working with privacy-sensitive or compliance-sensitive data systems
Have built internal developer tools for ML or data teams
Have a track record of owning critical production infrastructure
Are comfortable designing APIs, modular abstractions, and internal platform interfaces with strong attention to user experience
Have strong instincts around reliability, reproducibility, and operational simplicity
Are comfortable with cloud infrastructure, containers, Kubernetes, Infrastructure-as-Code, CI/CD, and observability
Produce maintainable code and make pragmatic architecture decisions under time pressure
Thrive in a small team where ownership is broad and priorities can change quickly
We'd be especially excited if you also have
Experience with biological, genomic, or scientific data formats and workflows
Contributions to open-source data or ML infrastructure projects
Experience building streaming or real-time data systems
Background in database internals, storage engines, or query optimization
Experience designing systems that serve both batch training and low-latency inference workloads
At CellType, the quality of our data and ML platform directly determines research speed, model quality, and customer trust. The right person will make the entire company faster and will shape the foundation we build on for years.
If you want to build the systems layer behind frontier AI for biology, we'd love to talk.