# Managing ElevenLabs Agents Before MCP
The manual workflow of managing four voice agents through a web dashboard, and why automation became necessary.
## The dashboard loop
Before I built any tooling, every change to a Tonari Tutor agent went through the ElevenLabs web dashboard. The workflow looked like this: open the agent config page, scroll to the system prompt text box, make edits, save, call the agent on my phone to test, hang up, navigate to the conversation history, click into the specific conversation, read the transcript, decide what to change, go back to the config page. Repeat.
It’s a fine workflow for one agent. I had four.
As I wrote about in the multilingual agents post, each supported language (EN, JA, ES, ZH) has its own agent with identical prompts and knowledge base documents. Every prompt tweak, every KB doc update, every voice config adjustment had to be made four times. There’s no “apply to all” concept. You’re editing text boxes one at a time.
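The "apply to all" step that the dashboard lacks is conceptually just a loop over agent IDs. A minimal sketch of what that would look like against the ElevenLabs REST API — the endpoint path, payload shape, and agent IDs below are my assumptions for illustration, not verified config:

```python
import json
import urllib.request

API_KEY = "xi-..."  # hypothetical; load from an env var in practice
# Hypothetical agent IDs, one per language agent
AGENT_IDS = {"EN": "agent_en", "JA": "agent_ja", "ES": "agent_es", "ZH": "agent_zh"}

def build_prompt_update(prompt: str) -> dict:
    # Payload shape is an assumption: the system prompt appears to live
    # under conversation_config.agent.prompt.prompt in the agent config.
    return {"conversation_config": {"agent": {"prompt": {"prompt": prompt}}}}

def apply_to_all(prompt: str) -> None:
    # One PATCH per language agent -- the "apply to all" the dashboard never gave me.
    for lang, agent_id in AGENT_IDS.items():
        req = urllib.request.Request(
            f"https://api.elevenlabs.io/v1/convai/agents/{agent_id}",
            data=json.dumps(build_prompt_update(prompt)).encode(),
            headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
            method="PATCH",
        )
        with urllib.request.urlopen(req):
            print(f"{lang}: updated")
```

Eight lines of loop versus four rounds of copy-paste into text boxes.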
## The testing problem
Verifying that changes actually worked was worse than making them. The feedback scoring system uses ElevenLabs’ `data_collection` fields to evaluate candidate responses against a rubric. To check whether scoring was working correctly after a prompt change, I had to:
- Call the agent and run through a mock interview
- Find that conversation in the dashboard’s conversation list
- Click into it, navigate to the analysis tab
- Manually read each `data_collection` field
- Compare the values against the scoring rubric document
- Do this for each language agent to confirm parity
There’s no way to query conversation data in bulk. No way to diff scoring results across conversations. No way to verify that all four agents are producing consistent evaluations. Just clicking and reading.
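For reference, the bulk check I wanted isn’t complicated. Something like this sketch — where the conversation-detail endpoint and the `data_collection_results` field name are assumptions from my reading of the API docs, and the rubric comparison is deliberately simplified:

```python
import json
import urllib.request

API_KEY = "xi-..."  # hypothetical

def fetch_analysis(conversation_id: str) -> dict:
    # Endpoint path and response shape are assumptions, not verified.
    req = urllib.request.Request(
        f"https://api.elevenlabs.io/v1/convai/conversations/{conversation_id}",
        headers={"xi-api-key": API_KEY},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body.get("analysis", {}).get("data_collection_results", {})

def check_against_rubric(results: dict, rubric: dict) -> list:
    # Return the fields whose collected value is missing or outside the
    # rubric's allowed values. Pure logic, so it can be tested offline.
    failures = []
    for field, allowed in rubric.items():
        value = results.get(field, {}).get("value")
        if value not in allowed:
            failures.append(field)
    return failures
```

With that in place, diffing scoring results across the four language agents is a loop and a set comparison instead of an afternoon of clicking.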
## No version history
The dashboard has no prompt versioning. When you edit and save, the previous version is gone. If a prompt change broke scoring or introduced weird agent behavior, I had to reconstruct the old version from memory or hope I’d pasted it into a notes file somewhere.
I started keeping a local markdown file with timestamped prompt snapshots. That helped, but it was another manual step in an already manual process. And it didn’t cover KB doc changes or voice settings — just the system prompt.
## The real cost
The friction compounded. I’d notice something the agent should handle differently — a follow-up probe that was too aggressive, a scoring criterion that needed rewording — and I’d mentally file it away instead of fixing it. The edit-test cycle was slow enough that small improvements weren’t worth the context switch.
That’s the worst outcome for any development workflow: when the cost of making a change exceeds your motivation to make it. The agents stagnated not because I ran out of ideas, but because the tooling made iteration expensive.
I needed a way to manage prompts, KB docs, and conversation data programmatically. The official ElevenLabs MCP connector exists — but it doesn’t cover this API surface. More on that next.