About us
Infer is building the operating system for insurance agencies. We make AI agents (including voice agents) that handle the work agencies have always done by hand: qualifying inbound leads, helping producers during live calls, auditing calls afterward, running renewals, and winning back churned customers.
Our long bet is that AI eventually sells insurance directly. Agencies are the wedge because that is where the work, the data, and the customer relationships actually live. Get good there, and the rest follows.
We are a YC company and have raised from Stellaris Venture Partners and others. The founders are Vaibhav, Urvin, and Suneel. Vaibhav was an architect and AI researcher (at Purdue) and is now a licensed insurance agent. Urvin worked at BCG and is a surfer with six-pack abs. Suneel is an IITian and a philomath.
A few reasons to join us:
We like pushing each other as a team to test our limits, because that's when you rediscover yourself.
We're paranoid about making customers succeed (we challenge what's already good).
We love people who question, challenge, and build.
We're highly transparent founders to work with, and we love getting challenged.
Finally, we love people who are interdisciplinary.
About the role
You'll own the model quality bar for our voice AI platform, building the evals that tell us if we're getting better, and driving real, measurable improvements in transcription accuracy and TTS quality. This role sits at the intersection of applied ML, audio, and rigorous experimentation: if you ship a change, you'll know exactly what it bought us.
What you'll do
Build and maintain the eval framework that scores voice agent quality end-to-end: transcription, response quality, TTS, and full-conversation outcomes (a minimal sketch of how scoring could be wired appears after this list)
Design voice agent behavior: system prompts, tool use, conversation flow, error recovery, and guardrails for real-time interactions
Drive transcription accuracy improvements across STT providers and configurations (Deepgram, Whisper, AssemblyAI, Nvidia, etc.)
Drive TTS quality improvements: voice selection, latency vs. fidelity tradeoffs, prosody, and edge cases
Curate and grow our evaluation datasets, including hard-case mining from production traffic
Run rigorous A/B experiments and report results that the team can actually act on
Partner with backend engineers to wire eval signals into CI so regressions get caught before they ship
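To make the eval framework concrete, here is a minimal sketch of how transcription accuracy could be scored across STT providers. It assumes the jiwer package for word error rate (WER) and treats every provider as a plain "audio in, transcript out" callable; the Utterance dataclass and provider stubs are illustrative, not our actual interfaces.

```python
# Minimal sketch: compare STT providers on a shared eval set by word error rate (WER).
# Provider clients (Deepgram, Whisper, AssemblyAI, ...) are stubbed behind one interface.
from dataclasses import dataclass
from typing import Callable
from jiwer import wer  # pip install jiwer

@dataclass
class Utterance:
    audio_path: str   # path to the audio clip
    reference: str    # ground-truth transcript

# A provider is just a function from an audio path to a transcript string.
Transcriber = Callable[[str], str]

def evaluate(provider: Transcriber, dataset: list[Utterance]) -> float:
    """Average WER for one provider over the eval set (lower is better)."""
    scores = [wer(u.reference, provider(u.audio_path)) for u in dataset]
    return sum(scores) / len(scores)

def compare(providers: dict[str, Transcriber], dataset: list[Utterance]) -> dict[str, float]:
    """Score every provider on the same clips so results are directly comparable."""
    return {name: evaluate(fn, dataset) for name, fn in providers.items()}
```

The same pattern extends to response quality, TTS checks, and full-conversation outcomes by swapping the metric behind evaluate, which is also how eval signals get wired into CI.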
Must-haves
ML engineering experience shipping production systems
Strong Python and a working ML stack (PyTorch, Huggingface, pandas, scikit-learn)
Hands-on experience designing LLM-based agents: prompting, tool/function calling, multi-turn state, structured outputs
Hands-on experience building evals or eval frameworks for ML, LLM, or voice systems; you've built LLM-as-judge eval pipelines and know their failure modes (see the sketch after this list)
Practical experience with ASR/STT: comparing providers, fine-tuning, or running open models like Whisper
Practical experience with TTS systems (ElevenLabs or open models)
Comfortable working with audio data: sample rates, codecs, noise, alignment
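As a taste of the LLM-as-judge work, here is a hedged sketch of a pairwise response-quality judge with one guard against a common failure mode (position bias). The judge callable and JUDGE_PROMPT are placeholders for illustration, not our production prompt or model client.

```python
# Minimal sketch of an LLM-as-judge pairwise check with a position-bias guard:
# the judge is asked twice with candidate order swapped, and only consistent
# verdicts count. `judge` is any callable that maps a prompt to a JSON string.
import json
from typing import Callable

JUDGE_PROMPT = """You are grading two candidate replies from an insurance voice agent.
Conversation so far:
{context}

Candidate A: {a}
Candidate B: {b}

Return JSON: {{"winner": "A" | "B" | "tie", "reason": "<one sentence>"}}"""

def pairwise_judge(judge: Callable[[str], str], context: str, a: str, b: str) -> str:
    """Return 'A', 'B', or 'tie'; order-inconsistent verdicts count as 'tie'."""
    first = json.loads(judge(JUDGE_PROMPT.format(context=context, a=a, b=b)))["winner"]
    # Swap candidate order and re-ask; a position-biased judge tends to flip here.
    second = json.loads(judge(JUDGE_PROMPT.format(context=context, a=b, b=a)))["winner"]
    second = {"A": "B", "B": "A"}.get(second, "tie")  # map back to original labels
    return first if first == second else "tie"
```

Similar counter-checks apply to other known judge failure modes, such as verbosity bias.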
Nice-to-haves
Designed voice agents specifically: handling barge-in, interruption recovery, disfluencies, and natural turn-taking at the prompt/behavior layer
Experience with diarization, VAD, or endpointing models
Audio dataset curation, labeling, or annotation pipelines
Trained or fine-tuned ASR or TTS models from scratch or on domain audio
Experience with active learning or data-flywheel patterns over production traffic
Open-source contributions to AI/ML frameworks
Familiarity with cost/latency tradeoffs across model providers for real-time voice