The five approaches

Our cross-model research (Claude, ChatGPT, Gemini) surfaced five distinct ML approaches to the synth parameter estimation problem. Each has real trade-offs.

1. Supervised parameter regression (black-box)

The conceptually simplest approach. A neural network (CNN or Audio Spectrogram Transformer) maps an input mel-spectrogram directly to a parameter vector. Training minimizes MSE between predicted and ground-truth parameters from synthetically generated data.

Strengths: Simple, scalable, deterministic output, fast inference. The DAFx24 paper demonstrates that an AST backbone outperforms CNNs on a 1M-sample synthetic dataset.

Weaknesses: Fails on out-of-domain audio. Can’t handle categorical choices (waveform type) gracefully. Forces a single “correct” answer on a problem that has many valid ones — often averaging multiple solutions into noise.

This is our planned v2 architecture once the retrieval baseline is validated.
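The regression loop above can be sketched in a few lines. This is a toy stand-in, not the real pipeline: a linear model replaces the CNN/AST backbone, random vectors replace mel-spectrograms, and all dimensions are made up. What it shows is the essential shape of the approach: features in, parameter vector out, plain MSE as the training signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: flattened "spectrogram" features -> synth parameter vector.
n_samples, n_feats, n_params = 256, 64, 8
X = rng.normal(size=(n_samples, n_feats))   # fake spectrogram features
W_true = rng.normal(size=(n_feats, n_params))
Y = X @ W_true                              # ground-truth parameters

# A linear "network" trained with plain MSE, as the regression approach prescribes.
W = np.zeros((n_feats, n_params))
lr = 0.05
for _ in range(500):
    pred = X @ W
    grad = X.T @ (pred - Y) / n_samples     # gradient of the MSE loss
    W -= lr * grad

mse = float(np.mean((X @ W - Y) ** 2))
print(f"final parameter MSE: {mse:.5f}")
```

The averaging failure mode described above lives in this loss: when two different parameter sets produce the same audio, MSE pulls the prediction toward their midpoint, which may produce neither sound.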

2. Differentiable synthesis (DDSP and derivatives)

Instead of treating the synth as a black box, differentiable synthesis makes audio generation itself differentiable. Gradients flow from audio output back to parameter inputs, enabling training directly on audio similarity rather than parameter error.

Strengths: More perceptually aligned loss. Can adapt to out-of-domain audio via spectral loss fine-tuning.

Weaknesses: DDSP’s canonical implementation (harmonic + noise model) is not isomorphic to real synth UIs. Complex modulation, FM, and polyphony are poorly handled. See the DDSP deep dive for the full assessment.
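The core idea, loss computed on rendered audio with gradients flowing back into the synth parameters, can be shown with a one-parameter oscillator. This sketch writes the gradient analytically; DDSP and similar frameworks get it from autodiff. The oscillator, frequency, and target gain are all invented for illustration.

```python
import numpy as np

# The "synth": a sine oscillator whose gain is the trainable parameter.
sr, dur, freq = 8000, 0.1, 440.0
t = np.arange(int(sr * dur)) / sr

def render(gain):
    return gain * np.sin(2 * np.pi * freq * t)

target = render(0.7)   # reference audio; 0.7 is the "unknown" true gain
gain = 0.1             # initial parameter guess
lr = 0.5
for _ in range(200):
    audio = render(gain)
    # Loss is audio-domain MSE; its gradient flows *through* the oscillator.
    # An autodiff framework (PyTorch, JAX) would compute this automatically.
    grad = float(np.mean(2 * (audio - target) * np.sin(2 * np.pi * freq * t)))
    gain -= lr * grad

print(f"recovered gain: {gain:.3f}")
```

The point is that the loss never mentions the parameter directly; it compares audio. That is exactly what breaks once the synth contains non-differentiable pieces such as hard waveform switches or complex modulation routing.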

3. Retrieval-based matching

Rather than predicting parameters directly, build a large indexed library of (patch params, rendered audio embedding) pairs, then find the nearest neighbor to the query audio in embedding space.

Strengths: No training loop required. Works immediately once a patch library is built. Handles multiple valid answers naturally by returning top-K candidates. Incrementally improvable by adding more patches.

Weaknesses: Can’t return parameters for sounds outside the library. Quality bounded by library size and embedding quality.

This is our v1 architecture — the fastest path to a working demo.
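The retrieval loop reduces to nearest-neighbor search over precomputed embeddings. In this sketch, random unit vectors stand in for CLAP/OpenL3 embeddings and brute-force cosine similarity stands in for Faiss; the library size and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy patch library: (parameter vector, audio embedding) pairs.
n_patches, emb_dim, n_params = 1000, 128, 16
embeddings = rng.normal(size=(n_patches, emb_dim))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
params = rng.uniform(size=(n_patches, n_params))

def top_k(query_emb, k=5):
    """Return the k (params, similarity) pairs closest to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = embeddings @ q                    # cosine similarity on unit vectors
    idx = np.argsort(sims)[::-1][:k]
    return [(params[i], float(sims[i])) for i in idx]

# Query with a slightly perturbed copy of patch 42: it should rank first.
query = embeddings[42] + 0.02 * rng.normal(size=emb_dim)
matches = top_k(query, k=3)
```

Returning top-K rather than a single hit is how this approach sidesteps the many-valid-answers problem: the user auditions candidates instead of trusting one point estimate.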

4. Reinforcement learning (SynthRL)

A 2025 approach (SynthRL, IJCAI 2025) frames synth matching as a contextual bandit problem: the target sound is the state, the predicted parameter set is the action, and perceptual similarity between the rendered and target audio is the reward. Because RL doesn't require a differentiable environment, it can optimize parameters for any synth, including closed-source VSTs.

Best used as a Phase 3 fine-tuning technique after the basic pipeline is proven.
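The key property, optimizing parameters without gradients through the synth, can be illustrated without reproducing SynthRL itself. Here a cross-entropy-method loop (a simple black-box policy search, used as a hedged stand-in for SynthRL's training) treats the synth and the reward as opaque functions; the quadratic "perceptual similarity" and target parameters are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

target_params = np.array([0.3, 0.8, 0.5])   # hidden ground truth

def reward(p):
    # Stand-in for perceptual similarity between rendered and target audio.
    # Note: called as a black box; no gradients required.
    return -float(np.sum((p - target_params) ** 2))

mean, std = np.full(3, 0.5), np.full(3, 0.3)
for _ in range(50):
    actions = rng.normal(mean, std, size=(64, 3))   # sample parameter guesses
    rewards = np.array([reward(a) for a in actions])
    elite = actions[np.argsort(rewards)[-8:]]       # keep the 8 best actions
    mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3

print("recovered parameters:", mean.round(2))
```

Because `reward` is only ever *called*, never differentiated, the same loop works when the renderer is a closed-source VST, which is the whole appeal of the RL framing.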

5. Generative / probabilistic models

The most advanced direction models the distribution of valid parameter sets rather than predicting a single point estimate. A 2025 ISMIR paper uses Continuous Normalizing Flows (CNFs) equivariant to synthesizer symmetries, tested specifically on Surge XT.

This is the research frontier — not a starting point.

Our decision

All of the research sources converge on the same answer: retrieval-first (CLAP/OpenL3 embeddings + Faiss nearest-neighbor search) over a synthetically generated Surge XT patch library, with optional black-box local search refinement.

This isn’t a compromise — it’s what the research literature demonstrates actually works, and it’s the fastest path to a usable demo without training a large model. The AST regression model becomes v2 once retrieval validates the concept.
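The "black-box local search refinement" step mentioned above could look like the following sketch: start from the retrieved patch and greedily hill-climb on an audio distance, no gradients needed. The distance function, parameter range, and step schedule are all assumptions for illustration; in the real pipeline the distance would compare rendered Surge XT audio to the target in embedding space.

```python
import numpy as np

rng = np.random.default_rng(3)

def audio_distance(p, target):
    # Placeholder for a spectral or embedding distance between rendered clips.
    return float(np.sum((p - target) ** 2))

target = rng.uniform(size=8)                  # hidden "ideal" parameters
current = target + rng.normal(0.0, 0.2, 8)    # retrieval lands us nearby
best = audio_distance(current, target)

for step in range(2000):
    sigma = 0.05 * 0.998 ** step              # shrink perturbations over time
    candidate = np.clip(current + rng.normal(0.0, sigma, 8), 0.0, 1.0)
    d = audio_distance(candidate, target)
    if d < best:                              # greedy: keep only improvements
        current, best = candidate, d

print(f"refined distance: {best:.4f}")
```

Because retrieval already provides a good starting point, even this naive search can close most of the remaining gap, which is why refinement is optional rather than load-bearing.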


Part of the Patch Pilot research series. See also: DDSP deep dive · Audio embeddings · MVP build plan.