Chin-Yun Yu 1*, Marco A. Martínez-Ramírez 2, Junghyun Koo 2, Ben Hayes 1, Wei-Hsiang Liao 2, György Fazekas 1, Yuki Mitsufuji 23
1 Centre for Digital Music, Queen Mary University of London
2 Sony AI
3 Sony Group Corporation
* Work done during an internship at Sony AI.
This study introduces a novel and interpretable model, DiffVox, for matching vocal effects in music production. DiffVox, short for "Differentiable Vocal Fx", integrates parametric equalisation, dynamic range control, delay, and reverb with efficient differentiable implementations to enable gradient-based optimisation for parameter estimation. Vocal presets are retrieved from two datasets, comprising 70 tracks from MedleyDB and 365 tracks from a private collection. Analysis of parameter correlations reveals strong relationships between effects and parameters, such as the high-pass and low-shelf filters often working together to shape the low end, and the delay time correlating with the intensity of the delayed signals. Principal component analysis reveals connections to McAdams' timbre dimensions, where the most crucial component modulates the perceived spaciousness while the secondary components influence spectral brightness. Statistical testing confirms the non-Gaussian nature of the parameter distribution, highlighting the complexity of the vocal effects space. These initial findings on the parameter distributions set the foundation for future research in vocal effects modelling and automatic mixing.
Figure 1: The proposed model (upper left) and individual effects for vocal effects processing.
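To make the idea of gradient-based parameter estimation concrete, here is a minimal numpy sketch. This is not the actual DiffVox implementation (which uses differentiable implementations of EQ, dynamics, delay, and FDN reverb over real vocal audio); it only illustrates the principle with a toy effect whose gain and wet-delay mix are recovered by gradient descent on an MSE loss against a target rendering. All signals and parameter values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)   # stand-in for a dry vocal signal
D = 100                         # delay length in samples (fixed here for simplicity)

def delay(sig, d):
    """Shift the signal by d samples, zero-padding the start."""
    out = np.zeros_like(sig)
    out[d:] = sig[:-d]
    return out

# "Target" rendered with hidden parameters: dry gain 0.8, wet delay mix 0.3.
y = 0.8 * x + 0.3 * delay(x, D)

g, w = 1.0, 0.0                 # initial parameter guesses
lr = 0.1
xd = delay(x, D)
for _ in range(200):
    y_hat = g * x + w * xd
    err = y_hat - y
    # Analytic gradients of the MSE loss w.r.t. each parameter.
    g -= lr * 2 * np.mean(err * x)
    w -= lr * 2 * np.mean(err * xd)
```

After optimisation, `g` and `w` converge to the hidden target values; in the full model, automatic differentiation plays the role of the hand-derived gradients above.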
This section contains the listening samples selected from the vocal tracks of the MedleyDB dataset. No effects refers to the original vocal recording, Target refers to the vocal track processed with the target effects, DiffVox refers to the vocal track processed with the DiffVox model, and w/o Approximation refers to replacing the approximated delay and reverb effects in DiffVox with their real-time implementations.
Pay attention to the tails of the reverbs and delays when comparing DiffVox with w/o Approximation: the latter has slightly longer tails because the approximated FIRs are replaced with IIRs.
Sample | No effects | Target | DiffVox | w/o Approximation |
---|---|---|---|---|
AClassicEducation_NightOwl | ||||
Auctioneer_OurFutureFaces | ||||
BigTroubles_Phantom | ||||
BrandonWebster_DontHearAThing | ||||
BrandonWebster_YesSirICanFly | ||||
CatMartino_IPromise | ||||
ClaraBerryAndWooldog_WaltzForMyVictims | ||||
FamilyBand_Again | ||||
HopAlong_SisterCities | ||||
LizNelson_Coldwar | ||||
LizNelson_ImComingHome | ||||
LizNelson_Rainfall | ||||
MatthewEntwistle_DontYouEver | ||||
MidnightBlue_StarsAreScreaming |
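The tail difference described above can be illustrated with a toy feedback comb filter. This sketch is not the DiffVox delay or reverb; it only shows why truncating an infinite (IIR) impulse response to a finite (FIR) one shortens the tail. The delay length, feedback gain, and FIR length below are illustrative assumptions.

```python
import numpy as np

D, g = 20, 0.7   # delay length in samples and feedback gain (illustrative)
N = 200          # length of the FIR approximation

def iir_comb_impulse(length):
    """Impulse response of the IIR feedback delay y[n] = x[n] + g * y[n - D]."""
    h = np.zeros(length)
    h[0] = 1.0
    for n in range(D, length):
        h[n] += g * h[n - D]
    return h

h_iir = iir_comb_impulse(1000)   # stands in for the real-time IIR response
h_fir = h_iir[:N]                # truncated FIR approximation

# The IIR keeps decaying echoes beyond the FIR length; the FIR is zero there
# by construction, which is why its tail sounds slightly shorter.
tail_energy = np.sum(h_iir[N:] ** 2)
```

Here `tail_energy` is strictly positive: the echoes at multiples of `D` decay geometrically with `g` but never reach zero, so any finite truncation discards some audible tail.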
The following table shows the different effects configurations used in the paper. Target is the target effects chain that we want to match, DiffVox is the full DiffVox model, w/o DLY removes the delay effect, w/o FDN removes the FDN reverb effect, and w/o DLY & FDN removes both. The audio samples are selected from the vocal tracks of the MedleyDB dataset.
Sample | Target | DiffVox | w/o DLY | w/o FDN | w/o DLY & FDN |
---|---|---|---|---|---|
MusicDelta_80sRock | |||||
MusicDelta_Beatles | |||||
MusicDelta_Britpop | |||||
MusicDelta_Country1 | |||||
MusicDelta_Disco | |||||
MusicDelta_Gospel | |||||
MusicDelta_Grunge | |||||
MusicDelta_Punk | |||||
MusicDelta_Rockabilly | |||||
MutualBenefit_NotForNothing | |||||
PortStWillow_StayEven | |||||
StevenClark_Bounty | |||||
StrandOfOaks_Spacestation | |||||
TheScarletBrand_LesFleursDuMal | |||||
Torres_NewSkin |
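As a rough illustration of how such ablations can be organised, the sketch below toggles the delay and FDN stages of a simple serial chain. The individual effects here are hypothetical placeholders, not the paper's differentiable operators; only the on/off structure mirrors the configurations in the table.

```python
import numpy as np

def peq(x):
    return x                                   # placeholder equaliser (identity)

def drc(x):
    return np.tanh(x)                          # placeholder dynamics stand-in

def dly(x):
    out = x.copy()
    out[50:] += 0.3 * x[:-50]                  # single echo as a toy delay
    return out

def fdn(x):
    # Toy diffusion via a short moving average, standing in for FDN reverb.
    return x + 0.1 * np.convolve(x, np.ones(8) / 8, mode="same")

def render(x, use_dly=True, use_fdn=True):
    """Apply the chain, optionally skipping the delay and/or FDN stages."""
    x = drc(peq(x))
    if use_dly:
        x = dly(x)
    if use_fdn:
        x = fdn(x)
    return x
```

With `use_dly=False, use_fdn=False` this reduces to the EQ-plus-dynamics front end, matching the w/o DLY & FDN configuration in spirit.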
@inproceedings{ycy2025diffvox,
title={DiffVox: A Differentiable Model for Capturing and Analysing Vocal Effects Distributions},
author={Chin-Yun Yu and Marco A. Martínez-Ramírez and Junghyun Koo and Ben Hayes and Wei-Hsiang Liao and György Fazekas and Yuki Mitsufuji},
year={2025},
booktitle={Proc. DAFx},
}