Existing paired datasets

Paired datasets exist but have significant constraints. “Paired” often means (features, params) rather than full-waveform audio, and datasets from commercial synths are typically not redistributable.

Multi-Task ASP Preset Dataset (Faronbi, NYU / Zenodo) — ~12.9 GB mel NPY files from Serum, Diva, and TyrellN6. No raw audio, but mel spectrograms + param arrays are ready to use for a quick supervised prototype without rendering.
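Because the dataset ships as pre-computed mel NPY arrays, a supervised prototype only needs to load the paired files. A minimal sketch, assuming hypothetical file names and shapes (the actual Zenodo archive layout may differ):

```python
import numpy as np

def load_pair(mel_path: str, param_path: str):
    """Load a (features, params) pair: mel spectrograms as model inputs,
    synth parameter arrays as regression targets.
    File names and shapes below are illustrative, not the archive's."""
    mels = np.load(mel_path)      # e.g. shape (N, n_mels, n_frames)
    params = np.load(param_path)  # e.g. shape (N, n_params)
    assert mels.shape[0] == params.shape[0], "feature/target count mismatch"
    return mels, params
```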

DAFx24 Massive Dataset (Bruford et al.) — ~22 GB, 1M samples from NI Massive. 64-bin mels + 16 continuous params. Not publicly available (commercial synth), but a template for building our own pipeline.

Sound2Synth / Preset-Gen-VAE — ~30k examples each from Dexed (DX7 FM). 155 and 144 params respectively. FM-only, narrow scope but well-documented.

synth1B1 (torchsynth) — 1 billion 4-second sounds, GPU-rendered from a generic “Voice” synth. Massive scale but not a real commercial/open-source synth. Good for pretraining.

SynthCAT — 3M monophonic samples from Xfer Serum. 250 timbres × 120 ADSR configurations. Excellent for learning how envelopes shape wavetable sounds.

Why we chose Surge XT

Licensing is a real constraint that affects whether we can publish work, share datasets, or distribute a tool. With commercial synths (Serum, Diva, Massive), all generated data must be treated as non-redistributable by default, even though the audio output may technically be yours.

Surge XT is explicitly free and open-source under GPL3. The Surge Synth Team FAQ confirms: GPL3 governs modifications and distribution of the software code; audio output is ours. This means we can generate, publish, and share our patch-audio dataset freely.

It’s also a hybrid engine — subtractive, FM, and wavetable — making it a genuine “universal” starting point. And it’s Python-controllable via Pedalboard and surgepy.

Synthetic data generation strategy

Generating our own dataset is not only feasible — it’s the standard approach in published research. The DAFx24 paper generated 1 million samples in ~24 hours on a 2018 Mac Mini using Pedalboard with multiple parallel plugin instances.

Primary tool: Pedalboard (Spotify) — Loads VST3 instruments on Windows, supports MIDI-driven rendering, exposes plugin parameters programmatically. Used in the DAFx24 pipeline.
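A minimal sketch of what one render might look like with Pedalboard, assuming a plugin path and parameter names that are placeholders (not verified against Surge XT's actual parameter list); the MIDI events are kept as plain tuples so the helper stays dependency-free until render time:

```python
def note_events(note=60, velocity=100, hold_seconds=2.0):
    # (type, note, velocity, time-offset-in-seconds) tuples; converted
    # to mido Messages only when we actually render.
    return [("note_on", note, velocity, 0.0),
            ("note_off", note, 0, hold_seconds)]

def render_patch(plugin_path, params, duration=4.0, sample_rate=44100):
    """Sketch of a single render. `plugin_path` and the names in `params`
    are placeholders; Pedalboard exposes a loaded plugin's parameters
    as Python attributes."""
    from mido import Message
    from pedalboard import load_plugin
    synth = load_plugin(plugin_path)  # e.g. path to the Surge XT .vst3
    for name, value in params.items():
        setattr(synth, name, value)
    midi = [Message(kind, note=n, velocity=v, time=t) if kind == "note_on"
            else Message(kind, note=n, time=t)
            for kind, n, v, t in note_events(hold_seconds=duration / 2)]
    return synth(midi, duration=duration, sample_rate=sample_rate)
```

Parallelizing this across multiple plugin instances (as the DAFx24 pipeline did) is what makes million-sample scales reachable on modest hardware.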

Alternative: DawDreamer — More complex graph support, better for multi-effect chains.

GPU-native: SynthAX / torchsynth — Up to 90,000x real-time on GPU. Use when we need millions of samples quickly but don’t need a specific VST.

Discovery: Pluginary — Scans VST3 plugins, extracts parameter metadata, caches to SQLite. Essential for reliably enumerating Surge XT’s full parameter list.
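The caching idea can be sketched with the standard-library sqlite3 module; this is illustrative only and does not reproduce Pluginary's actual schema or API:

```python
import sqlite3

def cache_params(db_path, plugin_name, params):
    """Cache a plugin's parameter names and defaults to SQLite so the
    full list can be enumerated without loading the plugin every run.
    (Schema is hypothetical, not Pluginary's.)"""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS params
                   (plugin TEXT, name TEXT, default_value REAL,
                    PRIMARY KEY (plugin, name))""")
    con.executemany("INSERT OR REPLACE INTO params VALUES (?, ?, ?)",
                    [(plugin_name, n, v) for n, v in params.items()])
    con.commit()
    return con

def list_params(con, plugin_name):
    """Return the cached {name: default} mapping for one plugin."""
    rows = con.execute("SELECT name, default_value FROM params "
                       "WHERE plugin = ? ORDER BY name", (plugin_name,))
    return dict(rows.fetchall())
```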

Critical requirements for our pipeline

Fixed MIDI note/velocity/duration for each render (C4, standard velocity, 2–4 seconds).

Silence rejection below -60 dB RMS.

Parameter sampling from preset distributions rather than uniform, biasing toward "usable" sounds.

State reset between renders to prevent leakage.

Process isolation per plugin instance, with automatic restart on crash.

Determinism checks: same parameter seed → same audio hash.
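Two of these checks, silence rejection and the determinism hash, are simple enough to sketch directly with numpy (the -60 dB threshold is the one stated above; function names are ours):

```python
import hashlib

import numpy as np

def is_silent(audio, threshold_db=-60.0):
    """Reject renders whose RMS level falls below the threshold."""
    rms = np.sqrt(np.mean(np.square(audio)))
    if rms == 0.0:
        return True
    return 20.0 * np.log10(rms) < threshold_db

def audio_hash(audio):
    """Stable hash of a rendered buffer for determinism checks: the same
    parameter seed must reproduce the same hash across runs."""
    buf = np.asarray(audio, dtype=np.float32).tobytes()
    return hashlib.sha256(buf).hexdigest()
```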

Difficulty by synthesis paradigm

Not all synthesis types are equally amenable to ML estimation. This directly affects our scoping:

Simple subtractive (virtual analog) — LOW difficulty. Filter cutoff, resonance, ADSR are visually distinct in spectrograms. Effectively solved for in-domain matching.

Complex subtractive (Surge XT, Serum) — MEDIUM. Unison detune, hard sync, wavetable scanning, modulation routing increase parameter count and interdependence. Workable with a constrained parameter subset (16–32 params).

Wavetable — MEDIUM-HIGH. Wavetable index selection is a retrieval problem in itself (thousands of frames).

FM (Dexed, DX7) — HIGH. Highly non-linear: small parameter changes cause massive spectral shifts. Operator permutation symmetry. Requires specialist models.

Additive / Granular — VERY HIGH. Research frontier. A 16-oscillator additive synth has 16! (>20 trillion) equivalent parameter configurations.
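One common mitigation for permutation symmetry (a standard trick, not something prescribed above) is to canonicalize the parameter vector before training, e.g. by sorting partials into a fixed order so every permutation of the same patch maps to one target. A sketch with a hypothetical oscillator representation:

```python
def canonicalize(oscillators):
    """Collapse the n! permutation symmetry of an additive patch by
    sorting partials into a canonical order (ascending frequency,
    then amplitude). Any reordering of the same partials yields the
    same target vector for the model."""
    return sorted(oscillators, key=lambda osc: (osc["freq"], osc["amp"]))
```

For scale: 16! = 20,922,789,888,000, which is the ">20 trillion" equivalent configurations cited above.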

We’re starting with Surge XT in subtractive mode, restricting to continuous parameters only. This is the “solved-ish” regime that gets a working system up quickly.


Part of the Patch Pilot research series. Previous: Audio embeddings. Next: MVP architecture & build plan.