Tonari Tutor: The Idea
Why I built a voice-based interview practice tool on ElevenLabs Conversational AI, and what the first commit looked like.
I was deep in interview prep for Apple Japan and Uber when I realized every practice tool I could find was a text chatbot. You type an answer, GPT types back “great answer, but consider…” and you learn absolutely nothing about how you actually sound in a conversation. The gap between typing a STAR-format response and delivering one out loud, under pressure, in real time — that gap is where interviews are won or lost.
I wanted something that would talk to me. A voice agent that could read my resume, read the job description, and run a realistic interview loop. Not a recording-and-playback tool. A live conversation where I could stumble, recover, and get feedback on the whole thing.
Why ElevenLabs Conversational AI
I looked at a few options. Most voice AI APIs at the time were either text-to-speech wrappers (you still handle the conversation logic yourself) or closed platforms with no customization. ElevenLabs had just shipped its Conversational AI product, and three things made it the obvious choice:
- Real-time voice agents over WebSocket. Sub-second latency, actual back-and-forth conversation. Not request-response — a persistent connection where the agent listens and speaks naturally.
- Multilingual TTS. I was prepping for roles in Japan. I needed an agent that could switch to Japanese mid-conversation without sounding like Google Translate.
- Dynamic variables. You can inject context — resume text, job description, company name — into the agent’s system prompt at connection time. This meant I could build one agent and parameterize it per interview, rather than creating a new agent for every job application.
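To make the dynamic-variable point concrete, here's a minimal sketch of the first message a client sends over the Conversational AI WebSocket. It follows the `conversation_initiation_client_data` shape from ElevenLabs' WebSocket docs, but the variable names (`resume`, `job_description`, `company_name`) are illustrative placeholders I chose, not the tool's actual ones — check the current API reference before relying on any field:

```python
import json

def build_initiation_message(resume_text: str, jd_text: str, company: str) -> str:
    """Build the first client message for an ElevenLabs Conversational AI
    WebSocket session, injecting per-interview context as dynamic variables.
    The agent's system prompt would reference these via {{resume}},
    {{job_description}}, {{company_name}} placeholders (names are illustrative).
    """
    payload = {
        # Message type per the Conversational AI WebSocket protocol.
        "type": "conversation_initiation_client_data",
        # Dynamic variables are substituted into the agent's prompt server-side,
        # which is what lets one agent definition serve every interview.
        "dynamic_variables": {
            "resume": resume_text,
            "job_description": jd_text,
            "company_name": company,
        },
    }
    return json.dumps(payload)

# One agent, parameterized for one specific application:
msg = build_initiation_message("10 yrs iOS...", "Senior SWE, Tokyo...", "Apple Japan")
```

Sending this as the first frame after connecting is the whole trick: the agent itself stays generic, and everything job-specific rides in at connection time.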
The First Commit
December 13, 2025. Three commits in one day. The initial prototype was embarrassingly simple: a static page with an ElevenLabs embed widget, a resume upload flow (PDF with drag-and-drop), and a text area to paste the job description. Upload your resume, paste the JD, click start, and the widget would open a voice session with an interviewer agent that had your full context.
By the third commit that day I’d added multi-format support for the JD upload — PDF, DOCX, TXT, Markdown — because it turns out recruiters send job descriptions in every format imaginable.
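The multi-format upload boils down to an extension dispatch before anything reaches the agent. A rough sketch of that step — the post doesn't say which parsing libraries the project uses, so `pypdf` and `python-docx` below are stand-ins I picked:

```python
from pathlib import Path

def extract_text(path: str) -> str:
    """Return plain text from a job-description file, dispatching on extension.
    pypdf and python-docx are assumed dependencies, not necessarily what
    Tonari Tutor actually uses."""
    suffix = Path(path).suffix.lower()
    if suffix in (".txt", ".md"):
        # Markdown passes through as-is; the agent prompt tolerates the markup.
        return Path(path).read_text(encoding="utf-8")
    if suffix == ".pdf":
        from pypdf import PdfReader  # assumed dependency
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if suffix == ".docx":
        from docx import Document  # python-docx, assumed dependency
        return "\n".join(p.text for p in Document(path).paragraphs)
    raise ValueError(f"Unsupported job description format: {suffix}")
```

Whatever format the recruiter sends, the output is the same: a plain-text blob ready to be injected into the agent's context.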
A week later I was already fixing the widget display. The default ElevenLabs embed didn’t play well with the page layout, so I switched to a floating button pattern with clearer UI instructions. Small thing, but it was the first sign that the UX around the voice session mattered as much as the voice session itself.
Where It Was Heading
Even in that first week I had a rough roadmap in my head: structured scoring rubrics so the agent could grade your answers on frameworks like STAR, a post-interview feedback panel that breaks down what you said and where you lost the thread, and multi-language sessions where the agent interviews you in Japanese or Portuguese depending on the role.
The prototype worked. I used it to prep for real interviews. It was rough, but talking to an AI interviewer that actually knew my resume and the job I was applying for — that was a different experience from anything else I’d tried.
The next post covers what happened when I started building the scoring system and realized the agent needed a lot more structure than “just be an interviewer.”