Chin-Yun Yu 1*, Marco A. Martínez-Ramírez 2, Junghyun Koo 2, Ben Hayes 1, Wei-Hsiang Liao 2, György Fazekas 1, Yuki Mitsufuji 23
1 Centre for Digital Music, Queen Mary University of London
2 Sony AI
3 Sony Group Corporation
* Work done during an internship at Sony AI.
This study introduces a novel and interpretable model, DiffVox, for matching vocal effects in music production. DiffVox, short for "Differentiable Vocal Fx", integrates parametric equalisation, dynamic range control, delay, and reverb with efficient differentiable implementations to enable gradient-based optimisation for parameter estimation. Vocal presets are retrieved from two datasets, comprising 70 tracks from MedleyDB and 365 tracks from a private collection. Analysis of parameter correlations reveals strong relationships between effects and parameters, such as the high-pass and low-shelf filters often working together to shape the low end, and the delay time correlating with the intensity of the delayed signals. Principal component analysis reveals connections to McAdams' timbre dimensions, where the most crucial component modulates the perceived spaciousness while the secondary components influence spectral brightness. Statistical testing confirms the non-Gaussian nature of the parameter distribution, highlighting the complexity of the vocal effects space. These initial findings on the parameter distributions set the foundation for future research in vocal effects modelling and automatic mixing.
Figure 1: The proposed model (upper left) and individual effects for vocal effects processing.
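To make the idea of gradient-based parameter estimation concrete, here is a minimal numpy sketch. This is not the actual DiffVox implementation (which uses differentiable implementations of EQ, dynamics, delay, and FDN reverb over real vocal audio); it only illustrates the principle with a toy effect whose gain and wet-delay mix are recovered by gradient descent on an MSE loss against a target rendering. All signals and parameter values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)   # stand-in for a dry vocal signal
D = 100                         # delay length in samples (fixed here for simplicity)

def delay(sig, d):
    """Shift the signal by d samples, zero-padding the start."""
    out = np.zeros_like(sig)
    out[d:] = sig[:-d]
    return out

# "Target" rendered with hidden parameters: dry gain 0.8, wet delay mix 0.3.
y = 0.8 * x + 0.3 * delay(x, D)

g, w = 1.0, 0.0                 # initial parameter guesses
lr = 0.1
xd = delay(x, D)
for _ in range(200):
    y_hat = g * x + w * xd
    err = y_hat - y
    # Analytic gradients of the MSE loss w.r.t. each parameter.
    g -= lr * 2 * np.mean(err * x)
    w -= lr * 2 * np.mean(err * xd)
```

After optimisation, `g` and `w` converge to the hidden target values; in the full model, automatic differentiation plays the role of the hand-derived gradients above.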
This section contains the listening samples selected from the vocal tracks of the MedleyDB dataset. No effects refers to the original vocal recording, Target refers to the vocal track processed with the target effects, DiffVox refers to the vocal track processed with the DiffVox model, and w/o Approximation refers to replacing the approximated delay and reverb effects in DiffVox with their real-time implementations.
Pay attention to the tails of the reverbs and delays when comparing DiffVox with w/o Approximation: the latter has slightly longer tails because the approximated FIRs are replaced with IIRs.
Sample | No effects | Target | DiffVox | w/o Approximation |
---|---|---|---|---|
AClassicEducation_NightOwl | ||||
Auctioneer_OurFutureFaces | ||||
BigTroubles_Phantom | ||||
BrandonWebster_DontHearAThing | ||||
BrandonWebster_YesSirICanFly | ||||
CatMartino_IPromise | ||||
ClaraBerryAndWooldog_WaltzForMyVictims | ||||
FamilyBand_Again | ||||
HopAlong_SisterCities | ||||
LizNelson_Coldwar | ||||
LizNelson_ImComingHome | ||||
LizNelson_Rainfall | ||||
MatthewEntwistle_DontYouEver | ||||
MidnightBlue_StarsAreScreaming |
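The tail difference described above can be illustrated with a toy feedback comb filter. This sketch is not the DiffVox delay or reverb; it only shows why truncating an infinite (IIR) impulse response to a finite (FIR) one shortens the tail. The delay length, feedback gain, and FIR length below are illustrative assumptions.

```python
import numpy as np

D, g = 20, 0.7   # delay length in samples and feedback gain (illustrative)
N = 200          # length of the FIR approximation

def iir_comb_impulse(length):
    """Impulse response of the IIR feedback delay y[n] = x[n] + g * y[n - D]."""
    h = np.zeros(length)
    h[0] = 1.0
    for n in range(D, length):
        h[n] += g * h[n - D]
    return h

h_iir = iir_comb_impulse(1000)   # stands in for the real-time IIR response
h_fir = h_iir[:N]                # truncated FIR approximation

# The IIR keeps decaying echoes beyond the FIR length; the FIR is zero there
# by construction, which is why its tail sounds slightly shorter.
tail_energy = np.sum(h_iir[N:] ** 2)
```

Here `tail_energy` is strictly positive: the echoes at multiples of `D` decay geometrically with `g` but never reach zero, so any finite truncation discards some audible tail.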
The following table shows the different effects configurations used in the paper. Target is the target effects chain that we want to match, DiffVox is the full DiffVox model, w/o DLY removes the delay effect, w/o FDN removes the FDN reverb effect, and w/o DLY & FDN removes both. The audio samples are selected from the vocal tracks of the MedleyDB dataset.
Sample | Target | DiffVox | w/o DLY | w/o FDN | w/o DLY & FDN |
---|---|---|---|---|---|
MusicDelta_80sRock | |||||
MusicDelta_Beatles | |||||
MusicDelta_Britpop | |||||
MusicDelta_Country1 | |||||
MusicDelta_Disco | |||||
MusicDelta_Gospel | |||||
MusicDelta_Grunge | |||||
MusicDelta_Punk | |||||
MusicDelta_Rockabilly | |||||
MutualBenefit_NotForNothing | |||||
PortStWillow_StayEven | |||||
StevenClark_Bounty | |||||
StrandOfOaks_Spacestation | |||||
TheScarletBrand_LesFleursDuMal | |||||
Torres_NewSkin |
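As a rough illustration of how such ablations can be organised, the sketch below toggles the delay and FDN stages of a simple serial chain. The individual effects here are hypothetical placeholders, not the paper's differentiable operators; only the on/off structure mirrors the configurations in the table.

```python
import numpy as np

def peq(x):
    return x                                   # placeholder equaliser (identity)

def drc(x):
    return np.tanh(x)                          # placeholder dynamics stand-in

def dly(x):
    out = x.copy()
    out[50:] += 0.3 * x[:-50]                  # single echo as a toy delay
    return out

def fdn(x):
    # Toy diffusion via a short moving average, standing in for FDN reverb.
    return x + 0.1 * np.convolve(x, np.ones(8) / 8, mode="same")

def render(x, use_dly=True, use_fdn=True):
    """Apply the chain, optionally skipping the delay and/or FDN stages."""
    x = drc(peq(x))
    if use_dly:
        x = dly(x)
    if use_fdn:
        x = fdn(x)
    return x
```

With `use_dly=False, use_fdn=False` this reduces to the EQ-plus-dynamics front end, matching the w/o DLY & FDN configuration in spirit.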
@inproceedings{ycy2025diffvox,
title={DiffVox: A Differentiable Model for Capturing and Analysing Vocal Effects Distributions},
author={Chin-Yun Yu and Marco A. Martínez-Ramírez and Junghyun Koo and Ben Hayes and Wei-Hsiang Liao and György Fazekas and Yuki Mitsufuji},
year={2025},
booktitle={Proc. DAFx},
}