Patch Pilot: Reverse Sound Design — Starting Over
Why we shelved the v1, what we learned, and how we're approaching the research reboot for an audio-to-synth-parameter tool.
What is Patch Pilot?
Patch Pilot is a reverse sound design tool. You give it audio — a synth snippet, a sample from a track, even a text description like “that gritty Billie Eilish bassline” — and it outputs synthesizer parameter settings that would recreate a similar timbre. Oscillator type, ADSR envelope, filter cutoff, FX chain. It never outputs musical information such as notes or melody; only synthesis parameters.
The copyright-safe angle is the whole point: it never reproduces music, just tells you how to build a sound.
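To make “parameters, not music” concrete, here is a rough sketch of the kind of output the tool targets. The field names and types are illustrative only, not a finalized schema:

```python
from dataclasses import dataclass, field

# Hypothetical output schema. Field names are ours for illustration,
# not a committed Patch Pilot API.
@dataclass
class PatchEstimate:
    oscillator: str                 # e.g. "saw", "square", "wavetable"
    adsr: tuple                     # (attack_s, decay_s, sustain_level, release_s)
    filter_cutoff_hz: float
    filter_resonance: float         # 0.0 .. 1.0
    fx_chain: list = field(default_factory=list)  # ordered FX names

estimate = PatchEstimate(
    oscillator="saw",
    adsr=(0.01, 0.2, 0.7, 0.4),
    filter_cutoff_hz=800.0,
    filter_resonance=0.3,
    fx_chain=["distortion", "chorus"],
)
```

Nothing in that structure encodes notes, rhythm, or melody — which is the copyright-safe property in a nutshell.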
What v1 looked like
The first version was a Python project using:
- Librosa + Essentia for audio feature extraction
- Demucs for source separation (isolate the synth layer from a full mix)
- CLAP / OpenL3 / VGGish / Wav2Vec2 for audio embeddings
- DistilBERT for text-to-embedding (the “describe a sound” input path)
- PyTorch + CUDA on an RTX 4090
We got the input layer working: text processing, audio file loading, YouTube URL ingestion via yt-dlp. The audio analysis module was written but never synced from the Windows dev machine (a git reset wiped the local commit before push — lesson learned).
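For a flavor of what the analysis module computed, here is one of the classic timbre features, spectral centroid (the magnitude-weighted mean frequency, a rough “brightness” measure). The v1 code used librosa/Essentia for this; the version below is plain NumPy so it runs standalone:

```python
import numpy as np

def spectral_centroid(signal: np.ndarray, sr: int) -> float:
    """Magnitude-weighted mean frequency of the signal's spectrum.
    Equivalent in spirit to the librosa/Essentia features v1 used."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return float(np.sum(freqs * spectrum) / np.sum(spectrum))

sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)   # pure 440 Hz sine, 1 second
centroid = spectral_centroid(tone, sr)  # ~440 Hz for a pure tone
```

A pure sine has all its energy at one frequency, so the centroid sits at the fundamental; a bright, distorted bassline would pull it far above the fundamental.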
Why we stopped
The honest assessment: the embedding approach — extract audio → CLAP embedding → compare to a database of analyzed presets — was the right idea, but the project never actually had the synth parameter database or the matching model. We were building a pipeline to feed data into a system that didn’t exist yet. The hardest part was never started.
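The missing matching step is, at its core, nearest-neighbor retrieval in embedding space. A minimal sketch follows, with random vectors standing in for CLAP embeddings and each database row standing in for an analyzed preset:

```python
import numpy as np

def nearest_preset(query: np.ndarray, db: np.ndarray) -> int:
    """Return the index of the DB row with the highest cosine
    similarity to the query embedding."""
    q = query / np.linalg.norm(query)
    d = db / np.linalg.norm(db, axis=1, keepdims=True)
    return int(np.argmax(d @ q))

# Stand-in data: 100 "presets" with 512-dim embeddings. In the real
# system these rows would come from CLAP runs over rendered presets.
rng = np.random.default_rng(0)
preset_db = rng.standard_normal((100, 512))
query = preset_db[42] + 0.01 * rng.standard_normal(512)  # near preset 42
best = nearest_preset(query, preset_db)  # expect index 42
```

The retrieval math is the easy part; the hard part — the one v1 never started — is populating `preset_db` with real [parameters ↔ rendered audio ↔ embedding] entries.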
Also, the ML landscape has moved significantly since this started. DDSP (Differentiable Digital Signal Processing) from Google Magenta, newer audio embedding models like AudioMAE, and the emergence of actual synth parameter datasets all suggest the v1 architecture may have been heading in the wrong direction.
The research reboot
Before writing any new code, we’re doing a proper survey of what exists now. The research is structured around five questions, run across Claude, ChatGPT, and Gemini in parallel for cross-reference:
- State of the art — What open-source projects or papers tackle audio-to-synth-parameter estimation? Is this a solved problem for specific synth architectures?
- Embeddings — Which audio embedding model is best for timbre similarity specifically? CLAP, AudioMAE, MuLan, EnCodec — which wins for this use case?
- The dataset problem — Do open datasets of [synth preset parameters ↔ rendered audio] pairs exist? Is synthetic generation via Surge XT or FAUST feasible?
- Architecture — Given what exists, what’s the realistic MVP? Retrieval-based? Differentiable synthesis? Fine-tuned foundation model?
- Python synth rendering — Is surgepy viable as a headless synth renderer for dataset generation?
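If the survey answers come back favorable, dataset generation reduces to a loop like the one below: sample random parameter settings, render each to audio, store the pair. The renderer here is a deliberate placeholder (a crude additive saw), because whether surgepy can fill that role headlessly is exactly the open question:

```python
import numpy as np

def render_patch(params: dict, sr: int = 44100, dur: float = 1.0) -> np.ndarray:
    """Placeholder renderer: a few-harmonic saw approximation.
    A real version would drive Surge XT (via surgepy) or FAUST
    with the sampled parameters instead."""
    t = np.arange(int(sr * dur)) / sr
    f = params["pitch_hz"]
    audio = sum(np.sin(2 * np.pi * f * k * t) / k for k in range(1, 6))
    return (audio * params["gain"]).astype(np.float32)

rng = np.random.default_rng(1)
dataset = []
for _ in range(10):
    params = {
        "pitch_hz": float(rng.uniform(50.0, 500.0)),
        "gain": float(rng.uniform(0.1, 1.0)),
    }
    dataset.append((params, render_patch(params)))  # (parameters, audio) pair
```

Scale the loop up, run each rendered clip through the chosen embedding model, and the preset database the retrieval architecture needs falls out of it.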
Findings will be posted here as they come in.
This is post #1 in an ongoing research log. The goal is to document the process transparently — including the dead ends — so other builders in similar territory can learn from it.