The widget problem

When I first wired up Tonari Tutor, I used the official ElevenLabs widget embed. Drop in a script tag, pass your agent ID, get a working voice conversation. It worked. For about two weeks.

The widget is a black box. You get a floating button, their UI, their layout. I couldn’t style it, couldn’t add audio visualization, couldn’t gate access behind auth. And when I wanted to let users choose between two interviewer voices — Kohei and Ai — there was no clean way to swap the agent config mid-session.

As described in the first Tutor post, the whole point is a custom interview experience. A generic embedded widget defeats the purpose.

Ripping it out

On January 1st I started the migration by adding interviewer voice selection — the ability to pick between Kohei and Ai before starting a session. This was the first multi-voice feature, and it immediately required more control than the widget offered. I needed to dynamically set the voice ID based on the user’s selection, which meant initializing the connection myself.
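
The shape of that problem looks roughly like this. The voice IDs and the config field names below are placeholders of mine, not real ElevenLabs identifiers — the point is only that the selection has to be resolved into session config before the connection opens, which the widget never let me do.

```javascript
// Hypothetical interviewer registry; the voiceId values are placeholders,
// not real ElevenLabs voice IDs.
const INTERVIEWERS = {
  kohei: { label: "Kohei", voiceId: "voice-kohei-placeholder" },
  ai:    { label: "Ai",    voiceId: "voice-ai-placeholder" },
};

// Resolve the user's pick into per-session config. The `overrides` shape
// is illustrative, not the SDK's documented options object.
function sessionConfigFor(selection) {
  const interviewer = INTERVIEWERS[selection];
  if (!interviewer) throw new Error(`unknown interviewer: ${selection}`);
  return { overrides: { tts: { voiceId: interviewer.voiceId } } };
}
```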

The ElevenLabs WebSocket SDK gives you exactly that. You open a WebSocket connection, send audio frames from the user’s mic, and receive audio frames back. No UI opinions. No hidden iframe. Just a bidirectional audio stream and lifecycle events.
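
A sketch of that raw shape, stripped to the essentials. The message fields (`user_audio_chunk`, `audio`) are illustrative assumptions, not the documented ElevenLabs wire schema — the SDK handles the real framing for you:

```javascript
// Pure helper: wrap a raw PCM chunk in a JSON text frame. Node's Buffer
// handles base64 here; in the browser you'd use btoa or a FileReader.
function encodeAudioFrame(pcmBytes) {
  const b64 = Buffer.from(pcmBytes).toString("base64");
  return JSON.stringify({ user_audio_chunk: b64 }); // hypothetical field name
}

// Browser-side wiring (not executed here): stream mic chunks up the
// socket, hand returned audio to the caller for playback.
function connect(agentUrl, onAudio) {
  const ws = new WebSocket(agentUrl);
  ws.onmessage = (ev) => {
    const msg = JSON.parse(ev.data);
    if (msg.audio) onAudio(msg.audio); // hypothetical field name
  };
  return {
    sendChunk: (pcmBytes) => ws.send(encodeAudioFrame(pcmBytes)),
    close: () => ws.close(),
  };
}
```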

What the migration actually involved

More than I expected. The widget had been hiding a lot of complexity:

Audio stream management. The browser’s MediaRecorder API gives you audio chunks, but you need to handle the format, sample rate, and chunk size that the SDK expects. Getting this wrong produces silence or garbled audio with no useful error message.
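
The core of the format problem: Web Audio hands you Float32 samples in [-1, 1], while speech endpoints typically want 16-bit integer PCM (16 kHz mono is a common convention, though I'm not claiming it's ElevenLabs' exact requirement). A minimal conversion looks like this:

```javascript
// Convert Float32 audio samples in [-1, 1] to 16-bit signed PCM.
function floatTo16BitPCM(float32Samples) {
  const out = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp first: out-of-range floats would otherwise wrap around and
    // produce exactly the garbled audio described above.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

The asymmetric scaling (0x8000 for negatives, 0x7fff for positives) matters because signed 16-bit range is -32768 to 32767; scaling both sides by the same constant either clips the top or wastes the bottom.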

Connection lifecycle. Open, authenticate, handle disconnects, handle the server ending the conversation, handle the user closing the tab mid-sentence. Each of these is a distinct state that the widget had been managing invisibly.
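
One way to make those states explicit is a small transition table — the state and event names below are mine, not the SDK's, but this is the general shape of what the widget had been doing invisibly:

```javascript
// Connection lifecycle as an explicit state machine. Each entry maps an
// event to the state it leads to; anything not listed is illegal.
const TRANSITIONS = {
  idle:         { start: "connecting" },
  connecting:   { open: "live", error: "failed" },
  live:         { serverEnd: "ended", userStop: "ended", dropped: "reconnecting" },
  reconnecting: { open: "live", giveUp: "failed" },
  ended:        { start: "connecting" },
  failed:       { start: "connecting" },
};

function nextState(state, event) {
  const next = TRANSITIONS[state] && TRANSITIONS[state][event];
  if (!next) throw new Error(`illegal event "${event}" in state "${state}"`);
  return next;
}
```

The payoff of the table is that "user closed the tab mid-sentence" and "server ended the conversation" stop being special cases scattered through callbacks and become two named events with defined destinations.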

Custom UI. With the widget gone, I needed to build the entire conversation interface — start/stop controls, connection status, and the interviewer selection panel.

Audio visualization. This was the feature that motivated the migration more than anything. On January 24th I added real-time waveform display from the mic input using the Web Audio API’s AnalyserNode. Watching the waveform respond as you speak makes the interview feel alive in a way a static widget button never could.
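
The waveform tap is small once you separate the pure part from the browser wiring. AnalyserNode's getByteTimeDomainData fills a Uint8Array with samples centered on 128, so normalizing to [-1, 1] is a plain function; a sketch of the whole path:

```javascript
// Pure step: map AnalyserNode's byte samples (centered on 128) to [-1, 1].
function normalizeWaveform(bytes) {
  const out = new Float32Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) out[i] = (bytes[i] - 128) / 128;
  return out;
}

// Browser-only wiring (not executed here): tap the mic stream and poll
// the analyser once per animation frame, handing normalized samples to
// whatever draws the waveform.
function startWaveform(stream, draw) {
  const ctx = new AudioContext();
  const analyser = ctx.createAnalyser();
  analyser.fftSize = 2048;
  ctx.createMediaStreamSource(stream).connect(analyser);
  const buf = new Uint8Array(analyser.fftSize);
  (function tick() {
    analyser.getByteTimeDomainData(buf);
    draw(normalizeWaveform(buf));
    requestAnimationFrame(tick);
  })();
}
```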

Mobile browser guards. navigator.mediaDevices doesn’t exist in all mobile contexts. I had to add explicit guards to avoid crashing on browsers that don’t support it — a small commit on February 13, but the kind of thing that only surfaces after real users hit it.
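
The guard itself is tiny: feature-detect before touching getUserMedia, because on older mobile browsers (and on any non-secure origin) mediaDevices is undefined and the call throws a TypeError instead of failing gracefully. Something like:

```javascript
// Feature-detect mic capture support. Takes the navigator object as a
// parameter so the check is easy to exercise outside a browser.
function canUseMic(nav) {
  return !!(nav && nav.mediaDevices && typeof nav.mediaDevices.getUserMedia === "function");
}

// Browser-only (not executed here): request the mic, with a readable
// error instead of a crash on unsupported browsers.
async function requestMic(nav) {
  if (!canUseMic(nav)) {
    throw new Error("Microphone capture is not supported in this browser");
  }
  return nav.mediaDevices.getUserMedia({ audio: true });
}
```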

Timeline

The migration took about six weeks of on-and-off work, from the first voice selection commit on January 1st to the final SDK migration cleanup on February 13th. Most of that calendar time was other work — the actual implementation was probably two focused weeks. The January 12 commit was the big one: the full ElevenLabs SDK integration with custom UI, replacing the widget entirely.

Was it worth it?

Without question. The widget was faster to ship initially, but it was a ceiling. Every feature I’ve added since — voice selection, visualization, access gating, custom styling — required owning the audio pipeline. The WebSocket SDK is more work upfront, but it’s work that compounds. Every new feature builds on the same connection and audio infrastructure instead of fighting an embed’s constraints.