DDSP: Why We're Not Using It as the Core
An honest assessment of Google Magenta's DDSP for subtractive synth parameter inference — what it can do, what breaks, and how we'll use it instead.
What DDSP actually is
DDSP (Differentiable Digital Signal Processing) from Google Magenta integrates classic DSP elements — oscillators, envelopes, filters, reverb — into an automatic differentiation framework. The canonical DDSP autoencoder uses a harmonic additive synthesizer (sums sinusoids at harmonic multiples of a detected F0), a subtractive noise synthesizer (time-varying filtered noise), and optional reverb. Training uses multi-scale spectral reconstruction loss.
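That reconstruction objective is easy to state concretely. Here is a minimal NumPy sketch of a multi-scale spectral loss: L1 distance between magnitude and log-magnitude spectrograms summed over several FFT resolutions. Function names are mine, and the real DDSP implementation is in TensorFlow with more options (loss weights, additional distance types), so treat this as an illustration of the idea rather than the library's API.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT via simple Hann-windowed framing."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def multiscale_spectral_loss(x, y, fft_sizes=(2048, 1024, 512, 256)):
    """Compare two signals at several spectral resolutions: large FFTs
    resolve fine harmonic structure, small FFTs resolve fast transients."""
    loss = 0.0
    for n_fft in fft_sizes:
        X = stft_mag(x, n_fft, n_fft // 4)
        Y = stft_mag(y, n_fft, n_fft // 4)
        loss += np.mean(np.abs(X - Y))                              # linear term
        loss += np.mean(np.abs(np.log(X + 1e-6) - np.log(Y + 1e-6)))  # log term
    return loss
```

The multi-resolution sum is the key design choice: no single FFT size can capture both pitch accuracy and attack sharpness, which is also why the loss rewards the smooth, harmonic sounds DDSP is good at.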
Its published strength is monophonic, pitched, harmonic sources — solo instruments, voice, single notes.
Where it falls apart for our use case
For subtractive/wavetable parameter inference specifically:
What can work: Monophonic synth tones that are approximately harmonic can be approximated by the harmonic + noise representation. Feature extraction (F0, loudness, harmonic envelopes) as inputs to a downstream parameter predictor is a legitimate use.
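Both features are cheap to compute. A rough sketch of the two extractors follows; the real DDSP pipeline uses the CREPE model for F0 and A-weighted loudness, so this plain autocorrelation tracker and RMS level are simplified stand-ins, with names of my own choosing.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=80.0, fmax=1000.0):
    """Crude autocorrelation pitch estimate (stand-in for CREPE-based F0).
    Picks the lag with maximum self-similarity inside the allowed pitch range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

def loudness_db(frame, eps=1e-9):
    """Frame level in dB from plain RMS (DDSP uses A-weighted power)."""
    return 20.0 * np.log10(np.sqrt(np.mean(frame ** 2)) + eps)
```

Run per frame, these give the (F0, loudness) conditioning sequence that a downstream parameter predictor can consume alongside spectral features.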
What breaks: DDSP’s canonical controls (harmonic distributions, noise band magnitudes) are NOT the same as oscillator wave selection, unison detune, filter topology, envelope shape, or modulation routing. The representation mismatch is fundamental — DDSP speaks a different language than real synth UIs.
Wavetable aliasing: Magenta’s own tutorial notebook notes audible aliasing from linear interpolation in DDSP’s wavetable module. This is a red flag for modern bright wavetable patches.
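The aliasing is easy to reproduce. Below is a toy demonstration with a naive linearly interpolated wavetable oscillator of my own (illustrative sample rate and pitch, not Magenta's module): any table harmonic whose playback frequency exceeds Nyquist folds back into the audible band.

```python
import numpy as np

def wavetable_osc(table, f0, sr, n_samples):
    """Naive wavetable playback: phase accumulation + linear interpolation,
    with no band-limiting, so harmonics above Nyquist alias."""
    L = len(table)
    phase = (f0 * L / sr) * np.arange(n_samples) % L
    i0 = phase.astype(int)
    frac = phase - i0
    return (1 - frac) * table[i0] + frac * table[(i0 + 1) % L]

sr, f0, n = 16000, 4321, 16000
table = np.linspace(-1.0, 1.0, 2048, endpoint=False)  # one sawtooth cycle
x = wavetable_osc(table, f0, sr, n)
mag = np.abs(np.fft.rfft(x)) / n  # 1 Hz bins: index == frequency

# The 2nd harmonic (8642 Hz) exceeds Nyquist (8000 Hz) and folds to 7358 Hz:
alias_bin, fund_bin = sr - 2 * f0, f0
print(mag[alias_bin] / mag[fund_bin])  # roughly 0.5: loud inharmonic alias
```

For a dark pad at low pitch this is inaudible; for a bright sawtooth lead played high, the folded partials land at inharmonic frequencies at substantial level, which is exactly the failure mode that matters for modern wavetable patches.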
The specific limitations
Monophonic assumption. The model tracks a single F0, so polyphonic input produces unstable, unusable results. Polyphonic DDSP mixture models exist but add substantial complexity. We’ll enforce monophonic input in v1.
Harmonic+noise model. FM sidebands, oscillator sync, complex modulations, and transients are not well-represented. The model deliberately smooths control envelopes, so plucks, percussion, and aggressive attacks are poorly handled.
Parameter mismatch. DDSP controls ≠ real synth knob labels. A user can’t apply DDSP output directly to Surge XT.
FM/inharmonic sounds. Spectral losses are highly non-convex with respect to frequency-ratio parameters, so spectral loss refinement converges poorly for FM synthesis. Multiple papers explicitly report this failure mode.
Our verdict
We won’t use DDSP as the core of Patch Pilot v2. Instead, we’ll apply the idea — differentiable rendering + audio loss — to a differentiable synthesizer whose parameterization actually matches the target synth’s UI (like DiffMoog’s approach). DDSP proper will serve as an auxiliary feature extractor (F0, loudness, harmonic envelopes) feeding a downstream parameter estimator, or as a creative transformation mode separate from the main matching pipeline.
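The shape of that alternative is worth sketching. Below is a toy version of the "differentiable rendering + audio loss" loop in which the optimized variable is an actual knob, a lowpass filter cutoff, rather than per-harmonic amplitudes. Everything here is illustrative and of my own construction, not DiffMoog's code: `render` is a one-parameter stand-in synth, and central finite differences stand in for automatic differentiation.

```python
import numpy as np

def render(cutoff_hz, f0=110.0, n_harm=40):
    """Toy 'synth': sawtooth harmonic amplitudes shaped by a one-pole
    lowpass magnitude response. The parameter is a real knob (cutoff)."""
    k = np.arange(1, n_harm + 1)
    saw = 1.0 / k                                        # sawtooth rolloff
    lp = 1.0 / np.sqrt(1.0 + (k * f0 / cutoff_hz) ** 2)  # filter response
    return saw * lp

def spectral_loss(a, b):
    return np.mean((a - b) ** 2)

target = render(cutoff_hz=2000.0)   # stands in for the user's reference sound
log_fc = np.log(500.0)              # initial guess, optimized in log-Hz
lr, eps = 300.0, 1e-3
for _ in range(300):
    # central finite difference as a stand-in for autodiff
    g = (spectral_loss(render(np.exp(log_fc + eps)), target)
         - spectral_loss(render(np.exp(log_fc - eps)), target)) / (2 * eps)
    log_fc -= lr * g
cutoff = np.exp(log_fc)
print(round(cutoff))                # lands near the true cutoff of 2000 Hz
```

The point of the sketch: because the optimized variable is the knob itself, the converged value can be written straight into a patch. That is precisely what DDSP's harmonic and noise-band controls cannot offer, and why the parameterization, not the differentiability, is the decision driver.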
Part of the Patch Pilot research series. Previous: ML approaches compared. Next: Audio embeddings for timbre.