Founding Engineer, Voice AI

Posted on May 26th, 2026

Job Description

Founding Engineer, Voice AI at Kosha AI

Thesis

Kosha is building the voice-native operating system for the physical world of commerce. We believe that over the next decade, voice will become the primary user interface. Today's advanced voice-enabled workflows are designed for knowledge workers, people who sit at desks with access to their laptops, not for field workers. We see a massive whitespace opportunity for those in the field: people constantly on the go, having in-person conversations, for whom talking is already the natural way they work.

Our first wedge is CPG field sales. Every day, hundreds of thousands of field reps walk into stores, have conversations, and make judgment calls, and almost none of that intelligence is ever captured. The desktop CRM was never built for someone standing in an aisle. We think the interface for that person is not a screen - it's their voice.

We capture in-person conversations (today with debriefs, tomorrow with ambient voice capture, and in the future with wearables) and turn them into structured CRM data and market intelligence, building, visit by visit, a proprietary dataset of what is actually happening on shelves that incumbents like Nielsen and Circana cannot see. The result is a real-time knowledge graph of what's happening on the ground, an entirely new dataset for brands, distributors, and investors alike.

We are a pre-seed team that includes the CEO (ex H.I.G. Capital, J.P. Morgan, Harvard Business School) and CTO (ex Field Sales Team at Salesforce, Optimism / OP Labs, U.C. Berkeley Computer Science). We are looking for our founding engineer to own the layer that makes all of this work.

What you'll do:

- Own Kosha's conversational speech stack end to end, including capture, transcription, diarization, structuring, and the latency and accuracy tradeoffs that make voice feel effortless
- Build with and around modern speech and conversational-AI tooling such as Deepgram, ElevenLabs, Sesame, and Vapi, and decide what to buy, what to wrap, and what to build ourselves
- Turn messy real-world audio, including noisy stores, accents, jargon, and interruptions, into clean, structured intelligence
- Work directly with early pilot customers and ship based on what you hear
- Shape architecture, tooling, and engineering culture from day one

You might be a fit if you:

- Have built production systems with conversational speech models or voice-agent infrastructure such as Deepgram, ElevenLabs, Sesame, Vapi, Whisper, Retell, or LiveKit
- Understand the real engineering problems of voice, including latency, streaming, turn-taking, and transcription accuracy in hostile audio environments
- Have shipped independently and can move fast without a spec
- Want the upside, ownership, and defining influence of a true founding role

The best voice products feel like magic and are brutally hard underneath. If that gap is the kind of problem you want to live in, let's talk. This will be a fun ride - we will learn a lot.

This role is in-person in NYC or remote.

Location

New York City (Onsite or remote)

Salary

$120k – $200k

Experience

4+ years