Teaching an AI to Interview in Four Languages
Why one multilingual agent didn't work and how separate per-language agents with tuned TTS configs solved the problem.
The plan that didn’t work
The original idea was clean: one ElevenLabs conversational agent, one prompt, one knowledge base. A dynamic variable would tell the agent which language the user selected, and it would conduct the mock interview in that language. Simple routing, no duplication.
It fell apart at the TTS layer.
ElevenLabs’ turbo_v2 model is fast and sounds great in English. It does not handle Japanese well. The cadence is wrong, the pitch contours flatten out, and it occasionally drops particles. Spanish and Mandarin had similar issues — comprehensible, but clearly an English model doing its best impression.
Worse, the Japanese output started wrapping every response in XML name tags. Every single reply came back as <Hatake Kohei>The interview question here.</Hatake Kohei>. The TTS would then read the tags aloud. It took me longer than I’d like to admit to figure out why the agent kept saying “less-than Hatake Kohei greater-than” before each question.
Separate agents, shared knowledge
The fix was straightforward once I accepted the duplication: a separate agent for each of EN, JA, ES, and ZH. Each one gets the correct TTS model and voice for its language. The prompt and knowledge base are identical across all four; only the voice and TTS configuration differ.
The frontend routes to the correct agent ID based on the user’s language selection. The language presets live in a simple config object. No conditional logic in the prompt, no dynamic variable switching, no hoping the model remembers which language it’s supposed to be speaking.
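A minimal sketch of what that config object and routing might look like. The agent IDs, model names, and voice IDs below are placeholders, not real ElevenLabs values, and the names `LANGUAGE_AGENTS` and `resolveAgent` are illustrative:

```typescript
// Hypothetical per-language agent presets. Agent IDs, model names, and
// voice IDs are placeholders, not actual ElevenLabs values.
type LanguageCode = "en" | "ja" | "es" | "zh";

interface AgentPreset {
  agentId: string;  // ElevenLabs conversational agent ID
  ttsModel: string; // TTS model tuned for this language
  voiceId: string;  // language-appropriate voice
}

const LANGUAGE_AGENTS: Record<LanguageCode, AgentPreset> = {
  en: { agentId: "agent_en_xxx", ttsModel: "turbo_v2", voiceId: "voice_en" },
  ja: { agentId: "agent_ja_xxx", ttsModel: "multilingual_model", voiceId: "voice_ja" },
  es: { agentId: "agent_es_xxx", ttsModel: "multilingual_model", voiceId: "voice_es" },
  zh: { agentId: "agent_zh_xxx", ttsModel: "multilingual_model", voiceId: "voice_zh" },
};

// Resolve the preset for the user's selection, falling back to English.
function resolveAgent(selected: string): AgentPreset {
  return LANGUAGE_AGENTS[selected as LanguageCode] ?? LANGUAGE_AGENTS.en;
}
```

Because the presets are plain data, adding a fifth language is a one-line config change plus a new agent in the ElevenLabs dashboard.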
Prompt engineering for voice
The prompt work was more involved than expected. A conversational AI agent that speaks its output has different formatting requirements than one that writes text. Rules I had to add through trial and error:
- Never use bullet points, bold text, or markdown formatting (TTS reads asterisks)
- Never wrap output in XML tags or any markup (this one took multiple iterations to stick)
- Keep responses to 2-3 sentences — long monologues sound unnatural when spoken
- Use natural speech patterns: contractions, filler acknowledgments, conversational transitions
That XML tag rule was the hardest to enforce. Adding “never use XML tags” wasn’t enough. I had to be specific: “Never wrap your output in tags like <Name> or </Name>. Your response must be plain spoken text with no markup of any kind.” Even then, the Japanese agent needed a separate iteration with the rule restated in the system prompt header.
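A sketch of how those voice-formatting rules might be assembled into the system prompt. The constant name, exact rule wording, and the `buildSystemPrompt` helper are illustrative, not the shipped prompt:

```typescript
// Illustrative system-prompt fragment encoding the voice-formatting rules.
// The wording in the real prompt differs; this only shows the shape.
const VOICE_FORMAT_RULES = [
  "Never use bullet points, bold text, or markdown formatting.",
  "Never wrap your output in tags like <Name> or </Name>.",
  "Your response must be plain spoken text with no markup of any kind.",
  "Keep responses to 2-3 sentences.",
  "Use contractions and natural conversational transitions.",
].join("\n");

function buildSystemPrompt(basePrompt: string): string {
  // Restating the no-markup rules at the top of the prompt, before the
  // base instructions, is the "rule in the header" trick described above.
  return `${VOICE_FORMAT_RULES}\n\n${basePrompt}`;
}
```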
Knowledge base: direct injection, not RAG
The agent’s knowledge base is six documents: an interview contract defining the agent’s role and boundaries, a scoring rubric, a role signal map, question banks for different interview types, and follow-up probe templates. Total size is around 11KB.
At that scale, RAG is overhead for no benefit. All six documents are injected directly into the agent’s context. The model sees everything on every turn. No retrieval latency, no relevance ranking failures, no chunking artifacts. When the total knowledge fits comfortably in context, just put it in context.
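Direct injection can be as simple as concatenating the documents with delimiters. The types and function below are a sketch under that assumption, not the actual implementation:

```typescript
// Direct injection: at ~11KB total, all six knowledge documents fit in
// context, so every turn sees everything. Names here are illustrative.
interface KnowledgeDoc {
  title: string;
  body: string;
}

function buildKnowledgeContext(docs: KnowledgeDoc[]): string {
  // Plain delimited concatenation: no retrieval, ranking, or chunking.
  return docs.map((d) => `=== ${d.title} ===\n${d.body}`).join("\n\n");
}
```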
What shipped
Four language-specific agents with identical knowledge, language-appropriate TTS configs, and a frontend that routes between them. The tag-stripping regex on the client side stayed in as a safety net — if the model ever regresses on the XML rule, the UI silently strips it before display.
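That client-side safety net could look like the following. The function name and exact regex are a sketch, assuming the failure mode is a matched `<Name>...</Name>` wrapper like the one described earlier:

```typescript
// Client-side safety net: strip a name-tag wrapper (or any stray
// angle-bracket tags) before display, in case the model regresses
// on the no-XML rule.
function stripNameTags(text: string): string {
  // Remove a matched wrapper like <Hatake Kohei>...</Hatake Kohei>.
  const wrapped = text.match(/^<([^<>]+)>([\s\S]*)<\/\1>$/);
  if (wrapped) return wrapped[2].trim();
  // Otherwise drop any remaining angle-bracket tags.
  return text.replace(/<\/?[^<>]+>/g, "").trim();
}
```

Because it runs on every message, the regression is invisible to the user even if the prompt rule fails again.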
The whole multilingual flow went from broken prototype to stable across EN/JA/ES/ZH in about a week. Most of that time was spent on prompt iteration, not code.