Architecture overview

Both research tracks converge on the same answer: retrieval-first — CLAP/OpenL3 embeddings + Faiss nearest-neighbor search over a synthetically generated Surge XT patch library, with optional black-box local search refinement.

The scope for MVP: one-shot audio in → top-5 Surge XT patch suggestions. No FX chain, no song stems, no polyphony.

The build plan

Step A: Lock down Surge XT

Install Surge XT VST3 on Windows and confirm it appears at the standard VST3 path. The GPL-3.0 license covers the code, not the rendered output, so our generated audio is ours to use and distribute freely.

Step B: Accept the WSL2 reality

This is the most likely place the project stalls. Windows VST3 plugins are native Windows binaries; a Linux Python process inside WSL2 cannot load them, so any attempt to host Surge XT directly from WSL2 will fail.

The solution is a two-process architecture: Windows Python renders audio via Pedalboard, WSL2 Python handles ML. Communication over localhost HTTP (FastAPI on Windows side, requests on WSL2 side). Alternatively, do everything in Windows Python — PyTorch, Faiss, and all ML libraries run natively on Windows with CUDA support.
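The render contract between the two processes is small enough to sketch without FastAPI at all. Below is a dependency-free stdlib sketch of the same idea; the `/render` path and the `render_patch` stub are illustrative assumptions, and the real Windows side would call Pedalboard instead of returning summary stats.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_patch(params: dict) -> dict:
    # Stand-in for the real Pedalboard render; returns summary stats only.
    return {"num_params": len(params), "duration_s": 2.0}

class RenderHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/render":
            self.send_error(404)
            return
        length = int(self.headers["Content-Length"])
        params = json.loads(self.rfile.read(length))
        body = json.dumps(render_patch(params)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Suppress per-request logging noise.
        pass

def serve_render(port: int = 0) -> HTTPServer:
    # port=0 lets the OS pick a free port; useful for local testing.
    server = HTTPServer(("127.0.0.1", port), RenderHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The WSL2 side would POST a JSON parameter dict to this endpoint; swapping in FastAPI plus requests keeps the same contract with less boilerplate.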

Step C: Build the renderer service

Use Pedalboard (Spotify) to load Surge XT VST3, set parameters, send MIDI, capture audio. Use Pluginary to enumerate all parameter names, IDs, and ranges — cache to SQLite. Define a small “solvable” parameter schema: continuous params only (no discrete waveform type, no mod matrix). Output schema: parameter_name, raw_value [0-1], unit label.
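The cached parameter schema might look like the following minimal sqlite3 sketch. The table layout mirrors the three-column output schema above; the table and helper names are hypothetical, and the actual Pedalboard/Pluginary enumeration calls are omitted.

```python
import sqlite3

# Hypothetical cache table for the "solvable" continuous parameter subset:
# (parameter_name, raw 0-1 value, unit label).
SCHEMA = """
CREATE TABLE IF NOT EXISTS params (
    parameter_name TEXT PRIMARY KEY,
    raw_value REAL CHECK (raw_value BETWEEN 0.0 AND 1.0),
    unit TEXT
)
"""

def cache_params(conn, rows):
    # rows: iterable of (name, raw_value, unit) tuples from enumeration.
    conn.execute(SCHEMA)
    conn.executemany("INSERT OR REPLACE INTO params VALUES (?, ?, ?)", rows)
    conn.commit()

def load_params(conn):
    # Returns {name: (raw_value, unit)} for downstream rendering code.
    return {
        name: (value, unit)
        for name, value, unit in conn.execute(
            "SELECT parameter_name, raw_value, unit FROM params")
    }
```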

Step D: Generate the patch library

Render N patches from Surge XT with randomly sampled continuous parameters. Fixed performance: C4 MIDI note, standard velocity, 2-4 seconds. Reject silence (RMS below -60 dB). Process-isolate renders. Validate determinism. Target: 50k-200k patches for a meaningful retrieval library.
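The sampling and silence-gate logic can be sketched in a few lines of numpy; the -60 dB threshold is from the plan, while the render call itself is omitted.

```python
import numpy as np

SILENCE_DB = -60.0  # reject patches quieter than this (RMS, dB re full scale)

def rms_db(audio: np.ndarray) -> float:
    # RMS level in dB relative to full scale; -inf for pure silence.
    rms = np.sqrt(np.mean(np.square(audio, dtype=np.float64)))
    return -np.inf if rms == 0.0 else 20.0 * np.log10(rms)

def sample_patch(rng: np.random.Generator, n_params: int) -> np.ndarray:
    # Uniform random values in [0, 1] for the continuous parameter subset.
    return rng.uniform(0.0, 1.0, size=n_params)

def keep_patch(audio: np.ndarray) -> bool:
    # Silence gate: drop renders whose RMS falls below the threshold.
    return rms_db(audio) >= SILENCE_DB
```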

Step E: Embed audio and build search index

Compute CLAP or OpenL3 embedding for each rendered patch. Store (embedding, parameter vector) pairs. Build a Faiss index over embeddings using GPU index for the RTX 4090.
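Conceptually, the index just answers nearest-neighbor queries over embeddings. A brute-force numpy sketch of that search is below; in the real pipeline a Faiss `IndexFlatIP` over L2-normalized vectors computes the same thing at scale.

```python
import numpy as np

def build_index(embeddings: np.ndarray) -> np.ndarray:
    # L2-normalize rows so inner product equals cosine similarity,
    # mirroring a Faiss inner-product index over normalized embeddings.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, 1e-12)

def search(index: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    # Return indices of the k most similar stored embeddings.
    q = query / max(np.linalg.norm(query), 1e-12)
    sims = index @ q
    return np.argsort(-sims)[:k]
```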

Step F: Inference pipeline

Trim and loudness-normalize input audio. Compute embedding. Query Faiss for top-K nearest neighbors. Return parameter vectors for top-K patches.
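A minimal numpy sketch of the preprocessing steps, assuming a simple amplitude-threshold trim and a -20 dB RMS target; both values are illustrative choices, not fixed by the plan.

```python
import numpy as np

def trim_silence(audio: np.ndarray, threshold: float = 1e-3) -> np.ndarray:
    # Drop leading/trailing samples below the amplitude threshold.
    voiced = np.flatnonzero(np.abs(audio) >= threshold)
    if voiced.size == 0:
        return audio[:0]
    return audio[voiced[0]:voiced[-1] + 1]

def normalize_rms(audio: np.ndarray, target_db: float = -20.0) -> np.ndarray:
    # Scale to a fixed RMS level so embedding distance ignores input gain.
    rms = np.sqrt(np.mean(np.square(audio, dtype=np.float64)))
    if rms == 0.0:
        return audio
    target = 10.0 ** (target_db / 20.0)
    return audio * (target / rms)
```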

Step G (optional): Black-box local search refinement

Start from the best retrieved patch. Mutate a subset of continuous parameters via CMA-ES or simple hill climbing. Evaluate by re-rendering and measuring spectral distance. Use this path instead of gradient descent — gradients are unreliable at synth discontinuities.
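The refinement loop reduces to black-box minimization over the [0, 1] parameter cube. Below is a minimal (1+1) hill-climb sketch; the real pipeline would use pycma and a rendered spectral distance, both of which are stubbed out here.

```python
import numpy as np

def hillclimb(start, objective, rng, iters=500, step=0.05):
    # Simple (1+1) mutation search over continuous params in [0, 1];
    # a stand-in for CMA-ES via pycma in the real pipeline.
    best = np.clip(start, 0.0, 1.0)
    best_score = objective(best)
    for _ in range(iters):
        cand = np.clip(best + rng.normal(0.0, step, size=best.shape), 0.0, 1.0)
        score = objective(cand)
        if score < best_score:  # lower spectral distance = better match
            best, best_score = cand, score
    return best, best_score
```

In practice `objective` would re-render the candidate patch and compare spectra against the target audio; here a cheap surrogate suffices to exercise the loop.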

The four make-or-break experiments

These tell us within 1-2 weeks whether the approach is viable, before committing to months of work.

Experiment 1: Programmatic Surge XT control. Load VST3, print parameter keys, set 10-20 params, render MIDI, confirm waveform changes audibly. If this fails → stop and switch renderer strategy.

Experiment 2: Render 5,000 patches without crashing. Less than 1% crash rate with auto-restart. Same parameter seed → same audio hash. No state leakage. If this fails → fix process isolation before proceeding.
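The "same seed → same audio hash" check reduces to fingerprinting render output. A small sketch; `render_fn` here is a stand-in for the actual Pedalboard render call.

```python
import hashlib
import numpy as np

def audio_hash(audio: np.ndarray) -> str:
    # Stable fingerprint of a render: hash the raw float32 sample bytes.
    return hashlib.sha256(
        np.ascontiguousarray(audio, dtype=np.float32).tobytes()).hexdigest()

def is_deterministic(render_fn, params, trials: int = 3) -> bool:
    # The same parameter set must give the same audio hash every time.
    hashes = {audio_hash(render_fn(params)) for _ in range(trials)}
    return len(hashes) == 1
```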

Experiment 3: Closed-world retrieval accuracy. Generate 10k patches, hold out 500. Top-1 and top-10 retrieval rates meaningfully above random. If this fails → investigate embedding choice and pitch handling.
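The retrieval metric itself is easy to sketch. In this toy version, held-out queries are modeled as noise-perturbed copies of library embeddings, a stand-in for embeddings of re-rendered held-out patches.

```python
import numpy as np

def topk_accuracy(queries, library, true_ids, k: int) -> float:
    # Fraction of queries whose ground-truth patch appears in the top-k
    # nearest neighbors (cosine similarity, as in the Faiss index).
    def normalize(x):
        return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12)
    sims = normalize(queries) @ normalize(library).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [true_ids[i] in topk[i] for i in range(len(queries))]
    return float(np.mean(hits))
```

The random baseline for comparison is simply k / library_size; the experiment passes only if measured top-1 and top-10 rates sit well above it.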

Experiment 4: Out-of-domain sanity. 20 real sounds (10 from other synths, 10 acoustic). Top-K results not random nonsense — at least a few perceptually plausible matches. If this fails → accept the domain gap and scope to “synth-only input” for v1.

Weekend 1 fail-fast prototype

Before running the full experiment battery: predict only 3 parameters (oscillator waveform, filter cutoff, envelope attack) using surgepy to render 10k samples and a simple ResNet-18 regressor. If the model can’t accurately predict filter cutoff on this trivial case, the data pipeline is broken. If it can, scale to AST and full parameter sets.

Tech stack

Target synth: Surge XT (GPL-3.0, hybrid engine, Python-controllable)
VST rendering: Spotify Pedalboard (primary), DawDreamer (alternative)
Plugin discovery: Pluginary (parameter enumeration → SQLite)
Audio embeddings: OpenL3-music-512 (stage 1), CLAP style embeddings (stage 2 re-ranking)
Vector search: Faiss with GPU (cuVS integration, up to 8x on 4090)
ML framework: PyTorch + HuggingFace Transformers (AST for v2)
Black-box optimization: CMA-ES via pycma
System glue: FastAPI (Windows renderer) + requests (WSL2 ML)

What to avoid

Don’t host VST3 in WSL2.
Don’t train a large model before proving retrieval works.
Don’t target full Surge XT parameter inference (all params + mod matrix + FX) as v1 — restrict to 16-32 continuous params.
Don’t use genetic algorithms for primary optimization (too slow for interactive use).
Don’t use Wav2Vec2 for audio embeddings (speech objective, misaligned).
Don’t use DDSP as the core architecture.
Don’t target multiple synths in v1.

Phased roadmap

v1 — Retrieval MVP: Working demo, 1-shot audio → top-5 Surge XT patch suggestions. Pedalboard renderer + CLAP/OpenL3 + Faiss. Success: the four experiments pass.

v2 — Supervised Model: AST regressor trained on 100k+ patches. Better accuracy than pure retrieval. Success: top-1 beats embedding baseline on held-out test set.

v2.5 — Refinement: CMA-ES local search improves match quality. Success: subjective listening test prefers refined over retrieval-only.

v3 — Out-of-Domain: Handle real-world audio (mixed, processed, from any source). SynthRL fine-tuning + optional source separation. Success: meaningful results on 10 real producer audio samples.

v4 — Agentic UI: Conversational refinement (“make it darker,” “add more movement”). LLM orchestrator + parameter adjustment layer. This aligns directly with the Tonari Labs vision of Claude-integrated audio tools.


This post concludes the initial Patch Pilot v2 research compilation. The full PDF with references is available on request. Key papers: DAFx24 AST paper, InverSynth II, ISMIR 2025 flow-matching. See the complete research series for background.