Why most embeddings are the wrong tool

“Timbre similarity” is a perceptual concept: two sounds are close if their perceived tonal quality is similar, controlling for pitch, loudness, and duration. Most popular audio embedding models were NOT trained for this. They were trained for semantic alignment (audio-text matching), event classification (what is this sound?), or speech representation — all of which can collapse “how it sounds” into “what it is,” losing the fine-grained timbral detail needed for patch retrieval.

Model-by-model breakdown

CLAP (LAION) — Trained for audio-text contrastive alignment. Medium timbre suitability as-is, but high with style embedding extraction — treating internal transformer block outputs as feature maps and computing channel-wise statistics across early layers. A 2025 ISMIR study found this outperforms other representations on human timbre similarity ratings.

OpenL3 (music variant) — Self-supervised audio-visual correspondence. Good for attribute ranking — a 2025 TU Berlin study found it the best model without handcrafted features for synth parameter-attribute consistency. 512-D or 6144-D embeddings; the music variant is preferred over the environmental one.

EfficientAT — AudioSet classification (MobileNetV3). Best overall for synth-parameter attribute ranking in the TU Berlin study. Our recommended stage-1 fast embedding.

PaSST — AudioSet via patchout spectrogram transformer. Strong performer in attribute ranking; patchout keeps its attention efficient. Good alternative to EfficientAT.

AST — AudioSet classification. Best backbone for supervised parameter regression (not retrieval). Used in the DAFx24 paper that’s our primary blueprint.

VGGish — AudioSet event classification. 128-D, 16 kHz sample-rate ceiling, semantically oriented toward event classes. Misses high-frequency synth content. Not recommended.

Wav2Vec2 — Speech SSL. Very poor for musical timbre. Speech objective is fundamentally misaligned. Don’t start here.

AudioMAE — Reconstruction (masked autoencoder). Moderate — preserves “how it sounds” better than semantic models, but not a timbre metric by design.

EnCodec — Audio compression codec. Poor as a distance metric: codec latents don’t reliably reflect perceptual timbre distance. Related codec representations (DAC, M2L) also perform poorly in attribute ranking.

2024–2025 best options

For human-perceptual timbre similarity: CLAP with style embedding extraction. A 2025 ISMIR paper (Queen Mary / Sony CSL) benchmarks this against 2,614 human ratings and finds it outperforms all other representations.
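To make the style-embedding idea concrete, here’s a minimal numpy sketch of channel-wise statistics pooling over intermediate activations. The hook points, layer count, and pooling details here are illustrative assumptions, not the ISMIR paper’s exact recipe:

```python
import numpy as np

def style_embedding(layer_activations):
    """Channel-wise mean/std statistics over early-layer activations.

    layer_activations: list of (time, channels) arrays captured from
    early transformer blocks. Shapes and hook points are hypothetical;
    the paper's actual layer choice and pooling may differ.
    """
    stats = []
    for act in layer_activations:
        stats.append(act.mean(axis=0))  # per-channel mean over time
        stats.append(act.std(axis=0))   # per-channel std over time
    return np.concatenate(stats)

# Dummy activations from two early blocks, 64 channels each
acts = [np.random.randn(100, 64) for _ in range(2)]
emb = style_embedding(acts)
# 2 layers x (mean + std) x 64 channels = 256-D
```

The intuition is the same as style features in image style transfer: first- and second-order channel statistics capture “how it sounds,” while the class-oriented final layers capture “what it is.”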

For synth parameter-attribute consistency: EfficientAT (best overall) or OpenL3-music-6144 (best without handcrafted features), per the 2025 TU Berlin study, which evaluates whether embedding distances track monotonic synth parameter sweeps.
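A rough sketch of how a sweep-consistency check can work: render a monotonic sweep of one synth parameter, embed each step, and rank-correlate embedding distance from the sweep’s start with the parameter value. This is our reading of the protocol, not the study’s exact metric; `sweep_consistency` is a hypothetical helper:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation via Pearson correlation on ranks."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / (np.linalg.norm(rx) * np.linalg.norm(ry)))

def sweep_consistency(embeddings, param_values):
    """Rank-correlate distance-from-sweep-start with the swept parameter."""
    d = np.linalg.norm(embeddings - embeddings[0], axis=1)
    return spearman(param_values, d)

# Toy sweep: embeddings drift linearly with the parameter, plus noise
np.random.seed(0)
params = np.linspace(0.0, 1.0, 20)
embs = np.outer(params, np.ones(8)) + 0.01 * np.random.randn(20, 8)
score = sweep_consistency(embs, params)  # near 1.0 for a consistent model
```

A model whose distances rise monotonically with, say, filter cutoff scores near 1.0; a model that collapses the sweep into one class scores near 0.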

Purpose-built (2025): A new contrastive timbre model trained specifically for instrument and synthesizer retrieval reports 81.7% top-1 / 95.7% top-5 accuracy. This is the ideal long-term direction.

Our two-stage retrieval architecture

We’re using a two-stage embedding strategy to balance speed and accuracy:

Stage 1 — Candidate generation (fast): OpenL3-music-512 or EfficientAT. Retrieve top-100 to top-1000 candidates. Cheap and stable.

Stage 2 — Re-ranking (perceptual): CLAP style embeddings on the candidate set. More expensive but best alignment with human timbre judgment.
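In numpy terms, the two-stage pipeline looks roughly like this. Random vectors stand in for real embeddings, and the dimensions and index sizes are illustrative only:

```python
import numpy as np

def cosine_top_k(query, matrix, k):
    """Indices of the k nearest rows by cosine similarity."""
    qn = query / np.linalg.norm(query)
    mn = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return np.argsort(-(mn @ qn))[:k]

def two_stage_search(q_fast, q_perc, fast_index, perc_index, n_cand=100, k=10):
    """Stage 1: cheap embedding shortlist; stage 2: perceptual re-rank."""
    cand = cosine_top_k(q_fast, fast_index, n_cand)      # fast, whole library
    rerank = cosine_top_k(q_perc, perc_index[cand], k)   # expensive, 100 items
    return cand[rerank]

rng = np.random.default_rng(0)
fast_index = rng.standard_normal((5000, 512))    # stand-in: OpenL3-music-512
perc_index = rng.standard_normal((5000, 1024))   # stand-in: CLAP style embeddings
q_fast, q_perc = fast_index[42], perc_index[42]  # query = a known patch
hits = two_stage_search(q_fast, q_perc, fast_index, perc_index)
```

Because the expensive perceptual embedding only scores the shortlist, stage-2 cost is fixed (~100 comparisons) regardless of library size.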

Vector search: Faiss with GPU acceleration via cuVS integration (up to 8x faster on the RTX 4090). IVF-PQ or CAGRA/HNSW for larger libraries.

Pitch handling is critical — timbre similarity is meaningless if pitch confounds dominate. We’ll use pitch-conditioned indexing: estimate F0, compare only against patches rendered near that F0, or render each patch at multiple pitches.
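A toy sketch of the pitch-conditioned indexing idea, bucketing patches by rendered F0 and searching only nearby buckets. The bucket width and neighbor-expansion policy are assumptions we’d tune, and `f0_to_bucket` is a hypothetical helper:

```python
import numpy as np

def f0_to_bucket(f0_hz, width_semitones=3):
    """Map an F0 estimate to a coarse pitch bucket (illustrative scheme)."""
    midi = 69 + 12 * np.log2(f0_hz / 440.0)
    return int(round(midi / width_semitones))

class PitchConditionedIndex:
    """Patches grouped by rendered pitch; queries search only nearby buckets."""
    def __init__(self):
        self.buckets = {}  # bucket id -> list of (patch_id, embedding)

    def add(self, patch_id, f0_hz, embedding):
        self.buckets.setdefault(f0_to_bucket(f0_hz), []).append((patch_id, embedding))

    def search(self, f0_hz, query, k=5):
        b = f0_to_bucket(f0_hz)
        pool = [p for nb in (b - 1, b, b + 1) for p in self.buckets.get(nb, [])]
        pool.sort(key=lambda pe: np.linalg.norm(pe[1] - query))
        return [pid for pid, _ in pool[:k]]

idx = PitchConditionedIndex()
rng = np.random.default_rng(0)
for i, f0 in enumerate([110, 220, 221, 440, 445, 880]):
    idx.add(i, f0, rng.standard_normal(16))
near_a3 = idx.search(220.0, rng.standard_normal(16))  # only ~A3 patches compete
```

Rendering each patch at multiple pitches fits the same structure: each rendering gets its own bucket entry under the same patch id.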


Part of the Patch Pilot research series. Previous: DDSP verdict. Next: The dataset problem.