Founding Engineer, Voice AI at Kosha AI
Thesis
Kosha is building the voice-native operating system for the physical world of commerce. We believe that over the next decade, voice will become the primary user interface. Today's advanced voice-enabled workflows are designed for knowledge workers, people who sit at desks with access to their laptops, not for field workers. We see a massive whitespace opportunity in the field: people who are constantly on the go, having in-person conversations, and for whom talking is already the natural way they work.
Our first wedge is CPG field sales. Every day, hundreds of thousands of field reps walk into stores, have conversations, and make judgment calls, and almost none of that intelligence is ever captured. The desktop CRM was never built for someone standing in an aisle. We think the interface for that person is not a screen; it's their voice. We capture in-person conversations (today with debriefs, tomorrow with ambient voice capture, and in the future with wearables) and turn them into structured CRM data and market intelligence. Visit by visit, this builds a proprietary dataset of what is actually happening on shelves that incumbents like Nielsen and Circana cannot see: a real-time knowledge graph of what's happening on the ground, and an entirely new dataset for brands, distributors, and investors alike.
We are a pre-seed team that includes our CEO (ex H.I.G. Capital, J.P. Morgan, Harvard Business School) and our CTO (ex Field Sales Team at Salesforce, Optimism / OP Labs, U.C. Berkeley Computer Science).
We are looking for our founding engineer to own the layer that makes all of this work.
What you'll do:
Own Kosha's conversational speech stack end to end, including capture, transcription, diarization, structuring, and the latency and accuracy tradeoffs that make voice feel effortless
Build with and around modern speech and conversational-AI tooling such as Deepgram, ElevenLabs, Sesame, and Vapi, and decide what to buy, what to wrap, and what to build ourselves
Turn messy real-world audio, including noisy stores, accents, jargon, and interruptions, into clean, structured intelligence
Work directly with early pilot customers and ship based on what you hear
Shape architecture, tooling, and engineering culture from day one
You might be a fit if you:
Have built production systems with conversational speech models or voice-agent infrastructure such as Deepgram, ElevenLabs, Sesame, Vapi, Whisper, Retell, LiveKit, or similar
Understand the real engineering problems of voice, including latency, streaming, turn-taking, and transcription accuracy in hostile audio environments
Have shipped independently and can move fast without a spec
Want the upside, ownership, and defining influence of a true founding role
The best voice products feel like magic and are brutally hard underneath. If that gap is the kind of problem you want to live in, let's talk. This will be a fun ride, and we will learn a lot.
This role is in-person in NYC or remote.