The idea

Patch Pilot started as a question most producers have asked at some point: “How do I make that sound?”

The concept: give the tool audio — a synth snippet, a YouTube link, a text description like “warm detuned supersaw” — and get back synthesizer parameter settings that recreate the timbre. Not the notes. Not the melody. Just the recipe for the sound itself. Copyright-safe by design.

Four-layer architecture

The system was designed as a pipeline with four distinct layers, each feeding into the next.

Layer 1 — Input processing

The InputHandler auto-detects what you throw at it and routes accordingly:

  • Text descriptions go through a TextProcessor with a built-in timbre vocabulary — it maps words like “gritty,” “warm,” “plucky” to synthesis parameter ranges using NLP embeddings (DistilBERT).
  • Audio files get loaded via AudioLoader (librosa) and resampled to 22,050 Hz.
  • YouTube URLs are handled by YouTubeAudioFetcher using yt-dlp — you can specify a start time and duration to isolate a specific sound.
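The routing logic is simple in principle. A minimal sketch of the auto-detection step (the function name and extension set here are illustrative, not the project's actual API):

```python
from pathlib import Path

AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".aiff"}

def detect_input_type(value: str) -> str:
    """Classify raw input as 'youtube', 'audio', or 'text'. Hypothetical helper."""
    if value.startswith(("http://", "https://")) and (
        "youtube.com" in value or "youtu.be" in value
    ):
        return "youtube"
    path = Path(value)
    # Treat it as an audio file only if the extension matches and it exists on disk
    if path.suffix.lower() in AUDIO_EXTENSIONS and path.exists():
        return "audio"
    # Everything else falls through to the text-description path
    return "text"
```

Anything ambiguous falls through to text, which is the safest default: a misclassified text prompt produces a bad result, while a misclassified file path produces a crash.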

This layer was built in August 2025 with Claude Code, and it was the cleanest piece of the project — 398 lines, 27 tests, well-typed. It stayed stable through every subsequent change.

Layer 2 — Audio analysis

This is where it gets interesting. The audio analysis engine has four sub-modules:

  • Source separation via Meta’s Demucs — isolates the target sound from a full mix by splitting into stems (vocals, drums, bass, other).
  • SpectralAnalyzer — extracts timbre characteristics: spectral centroid, bandwidth, rolloff, flatness, contrast. These tell you what kind of sound you’re dealing with (bright vs dark, noisy vs tonal).
  • PitchAnalyzer — uses the PYIN algorithm to detect fundamental frequency, voiced ratio, and vibrato characteristics.
  • EnvelopeAnalyzer — estimates ADSR envelope parameters from the amplitude contour. How does the sound attack? How long does it sustain? How quickly does it release?

An AudioAnalyzer orchestrator ties these together and produces synthesis_hints — high-level suggestions like “this sounds like a sawtooth with a low-pass filter” based on the spectral profile.
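To make the spectral features concrete, here is a plain-NumPy sketch of the spectral centroid, the single most useful brightness measure (librosa's `spectral_centroid` computes a framewise version of the same idea):

```python
import numpy as np

def spectral_centroid(signal: np.ndarray, sr: int) -> float:
    """Center of mass of the magnitude spectrum, in Hz.
    Bright sounds (saws, hi-hats) score high; dark sounds (sines, subs) score low."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return float(np.sum(freqs * spectrum) / np.sum(spectrum))

sr = 22050
t = np.arange(sr) / sr
sine = np.sin(2 * np.pi * 440 * t)                 # pure tone: centroid near 440 Hz
noise = np.random.default_rng(0).normal(size=sr)   # white noise: centroid in the kHz range
```

A 440 Hz sine yields a centroid near 440 Hz; white noise yields one far above it. That single number already separates "sub bass" from "open hi-hat."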

Layer 3 — Synth matching

The SynthMatcher takes those analysis results and maps them to a UniversalSynthParameters schema — a normalized representation of synth settings that isn’t tied to any specific synthesizer:

  • Oscillator settings (waveform type, level, detune, phase, unison count)
  • Filter configuration (type, cutoff in Hz, resonance, drive, envelope amount)
  • ADSR envelopes for both amplitude and filter
  • LFO parameters (shape, rate, amount, sync mode)
  • Effects chain (reverb, delay, distortion amounts)
  • Voice settings (polyphony, portamento, master volume)
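A schema like this is naturally expressed as nested dataclasses. The field names and defaults below are illustrative, not the project's actual definitions:

```python
from dataclasses import dataclass, field

@dataclass
class OscillatorParams:
    waveform: str = "saw"        # sine | saw | square | triangle | noise
    level: float = 1.0           # 0.0-1.0
    detune_cents: float = 0.0
    unison: int = 1

@dataclass
class ADSR:
    attack_s: float = 0.01
    decay_s: float = 0.2
    sustain: float = 0.7         # sustain level, 0.0-1.0
    release_s: float = 0.3

@dataclass
class FilterParams:
    filter_type: str = "lowpass"
    cutoff_hz: float = 2000.0
    resonance: float = 0.1
    env_amount: float = 0.0      # how much the filter envelope modulates cutoff

@dataclass
class UniversalSynthParameters:
    oscillators: list[OscillatorParams] = field(
        default_factory=lambda: [OscillatorParams()]
    )
    filter: FilterParams = field(default_factory=FilterParams)
    amp_env: ADSR = field(default_factory=ADSR)
    filter_env: ADSR = field(default_factory=ADSR)
```

The payoff of a normalized schema is that each downstream output format (WAV rendering, markdown instructions) reads from one structure instead of knowing about any particular synth.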

The mapping is rule-based — spectral centroid maps to filter cutoff, voiced ratio determines oscillator type suggestions, percussiveness maps to filter envelope amount, vibrato maps to LFO rate. Not ML, just thoughtful heuristics.
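Two of those heuristics, sketched with made-up constants to show the shape of the mapping (the thresholds and scaling factor are illustrative, not the project's tuned values):

```python
def centroid_to_cutoff(centroid_hz: float) -> float:
    """Heuristic: open the low-pass filter roughly in proportion to brightness,
    clamped to an audible range. The 1.5 factor and bounds are illustrative."""
    return max(200.0, min(centroid_hz * 1.5, 18000.0))

def suggest_waveform(voiced_ratio: float, flatness: float) -> str:
    """Mostly unvoiced or spectrally flat material suggests a noise source;
    otherwise default to a harmonically rich sawtooth."""
    if voiced_ratio < 0.3 or flatness > 0.5:
        return "noise"
    return "saw"
```

Rules like these are transparent and debuggable: when a suggestion is wrong, you can point at the exact threshold that fired.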

Layer 4 — Output

Two output formats survived to the final version:

  • Audio WAV — the AudioSynthesizer actually renders a one-shot sample from the parameters. Basic waveform generation (sine, saw, square, triangle, noise), Butterworth IIR filters, ADSR envelopes. Pitched sounds get transposed to C4 (261.63 Hz); unpitched sounds use the spectral centroid. Output: 44.1 kHz, 16-bit PCM, mono WAV.
  • Markdown instructions — human-readable step-by-step instructions for recreating the sound manually. Formatted with proper units (Hz/kHz, ms/s, percentages) and conditionally includes/excludes sections based on parameter relevance.
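A stripped-down version of that render path, assuming NumPy and SciPy (the function name, defaults, and the simple attack/release envelope are illustrative; the real synthesizer implements full ADSR):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def render_saw(freq: float = 261.63, dur: float = 1.0, sr: int = 44100,
               cutoff: float = 2000.0, attack: float = 0.01,
               release: float = 0.3) -> np.ndarray:
    """Render a low-pass-filtered one-shot sawtooth with a linear
    attack/release envelope. Sketch only, not the project's API."""
    t = np.arange(int(dur * sr)) / sr
    saw = 2.0 * (t * freq % 1.0) - 1.0                    # naive (aliased) sawtooth
    sos = butter(4, cutoff, btype="low", fs=sr, output="sos")  # 4th-order Butterworth
    filtered = sosfilt(sos, saw)
    env = np.ones_like(t)
    a, r = int(attack * sr), int(release * sr)
    env[:a] = np.linspace(0.0, 1.0, a)                    # fade in
    env[-r:] = np.linspace(1.0, 0.0, r)                   # fade out
    return (filtered * env).astype(np.float32)

sample = render_saw()  # one second of C4 as float32 samples
```

Writing the result out as 16-bit PCM is then a matter of scaling to int16 and handing it to the standard-library `wave` module.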

A full CLI (tonari-pp analyze) tied everything together with format selection, device choice (CPU/CUDA), and output directory configuration.

The December sprint

The most striking thing about this project’s history is the pace.

The input layer was built in August 2025. Then the project sat dormant for three and a half months.

On December 3, 2025, in a single session, all of Phase 2-2 (audio analysis — five modules), Phase 2-3 (synth matching — five components), and Phase 3 (CLI) were implemented. That’s roughly 3,500 lines of code and 86 tests, all co-authored with Claude Code. The git log shows commits from 10:31 AM to 2:32 PM — four hours.

Five days later, the audio synthesis layer was added (waveform generation, envelopes, filtering, full integration). Then the Vital preset format — which had been built as part of the synth matcher — was completely removed because it never actually produced working .vital files. The complex JSON structure required by Vital’s preset format was never properly reverse-engineered.

By December 12, the codebase was clean: audio + markdown output, E2E tests passing, dead code removed.

Then it went dormant again.

What worked and what didn’t

What worked well:

  • The four-layer pipeline design was sound. Each layer had clear inputs and outputs.
  • Claude Code was remarkably effective for building the analysis and matching layers — the kind of systematic, well-typed Python where you can specify behavior precisely.
  • The test coverage was good (86+ tests before audio synthesis, 137+ after).
  • The rule-based synth matching was a reasonable first approach.

What didn’t work:

  • The Vital preset generator was a waste of effort — reverse-engineering proprietary preset formats is a losing game.
  • The embedding models (CLAP, VGGish, Wav2Vec2) were set up but never actually used in the pipeline. The matching layer used rule-based heuristics instead.
  • The audio synthesis output, while functional, was extremely basic — simple waveforms with Butterworth filters can’t reproduce the complex timbres that make reverse sound design interesting.
  • Pre-commit hooks (especially mypy strict mode) created friction that contributed to the project stalling. Several commits were blocked by type annotation issues that had nothing to do with correctness.

The fundamental gap: the system could analyze a sound and guess at parameters, but there was no validation loop — no way to answer the question "does the output actually sound like the input?" That's the hard problem, and it was never addressed.
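For a sense of what even a crude closed loop could look like, compare average log-magnitude spectra of the input and the rendered output, and use the distance to accept or adjust a parameter guess. This is a deliberately naive sketch; a serious loop would use perceptual features such as MFCCs or CLAP embeddings:

```python
import numpy as np

def spectral_distance(a: np.ndarray, b: np.ndarray, n_fft: int = 2048) -> float:
    """Crude timbre distance: mean absolute difference between the
    time-averaged log-magnitude spectra of two signals. Illustrative only."""
    def avg_log_spec(x: np.ndarray) -> np.ndarray:
        # Chop into windowed frames, FFT each, average the log magnitudes
        frames = [x[i:i + n_fft] for i in range(0, len(x) - n_fft, n_fft)]
        mags = np.abs(np.fft.rfft(np.stack(frames) * np.hanning(n_fft), axis=1))
        return np.log1p(mags).mean(axis=0)
    return float(np.mean(np.abs(avg_log_spec(a) - avg_log_spec(b))))
```

Even this blunt metric would have caught the worst failures: a rendered patch whose distance to the input is larger than a sine wave's is clearly not a match.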


Next: Why we’re starting over — and what the research reboot looks like.