Posts by Collection

publications

Multi-layered Cepstrum for Instantaneous Frequency Estimation

Published in IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2018

We propose the multi-layered cepstrum (MLC) method for estimating multiple fundamental frequencies (MF0) of a signal under challenging contamination such as high-pass filtered noise. By applying the cepstrum operations (i.e., Fourier transform, filtering, and nonlinear activation) recursively, the MLC is shown to be an efficient method for enhancing MF0 saliency in a step-by-step manner. Evaluation on a real-world polyphonic music dataset under both normal and low-fidelity conditions demonstrates the potential of the MLC.
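
The recursive structure described above lends itself to a compact implementation. Below is a minimal sketch, assuming each layer applies a magnitude Fourier transform followed by rectified log scaling; the exact filtering and nonlinear activation used in the paper may differ.

```python
import numpy as np

def mlc(x, n_layers=3, eps=1e-8):
    """Sketch of a multi-layered cepstrum: each layer alternates the
    signal between the frequency and quefrency domains, suppressing
    non-periodic components a little more each time."""
    z = np.asarray(x, dtype=float)
    for _ in range(n_layers):
        z = np.abs(np.fft.rfft(z))           # Fourier transform (magnitude)
        z = np.maximum(np.log(z + eps), 0)   # nonlinear activation on a log scale
    return z
```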

Recommended citation: Chin-Yun Yu and Li Su, "Multi-layered Cepstrum for Instantaneous Frequency Estimation", IEEE Global Conference on Signal and Information Processing, November 2018.
Download Paper

Harmonic Preserving Neural Networks for Efficient and Robust Multipitch Estimation

Published in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2020

Multi-pitch estimation (MPE) is a fundamental yet challenging task in audio processing. Recent MPE techniques based on deep learning have shown improved performance but are computationally demanding and relatively sensitive to data variations such as noise contamination and cross-domain data. In this paper, we present the harmonic preserving neural network (HPNN), a model that incorporates deep learning and domain knowledge in signal processing to improve the efficiency and robustness of MPE. The proposed method starts from the multi-layered cepstrum (MLC), a feature representation that utilizes repeated Fourier transforms and nonlinear scaling to suppress the non-periodic components in signals. Following the combined frequency and periodicity (CFP) principle, the last two layers of the MLC are integrated to suppress the harmonics of pitches in the spectrum and enhance the components of true fundamental frequencies. A convolutional neural network (CNN) is then applied to further optimize the pitch activation. The whole system is constructed as an end-to-end learning scheme. Improved time efficiency and robustness to noise and cross-domain data are demonstrated with experiments on polyphonic music at various noise levels and on multi-talker speech.
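
The CFP principle behind combining the last two MLC layers can be illustrated with a toy salience function: a frequency-domain layer and a lag-domain layer are multiplied after mapping each candidate frequency to its period, so only true fundamentals, supported in both domains, survive. This is a hand-written sketch of the principle, not the HPNN implementation.

```python
import numpy as np

def cfp_salience(frame, sr, eps=1e-8):
    # frequency-domain layer: rectified log-magnitude spectrum
    spec = np.maximum(np.log(np.abs(np.fft.rfft(frame)) + eps), 0)
    # lag-domain layer: rectified cepstrum of the spectrum
    ceps = np.maximum(np.fft.irfft(spec), 0)
    freqs = np.fft.rfftfreq(len(frame), 1 / sr)
    sal = np.zeros_like(spec)
    for i, f in enumerate(freqs):
        if f > 0:
            lag = int(round(sr / f))          # period of candidate f0, in samples
            if lag < len(ceps):
                sal[i] = spec[i] * ceps[lag]  # must be salient in both domains
    return freqs, sal
```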

Recommended citation: Chin-Yun Yu, Jing-Hua Lin, and Li Su, "Harmonic Preserving Neural Networks for Efficient and Robust Multipitch Estimation", Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, December 2020.
Download Paper

Danna-Sep: Unite to separate them all

Published in Music Demixing Workshop, ISMIR, 2021

Deep learning-based music source separation has gained a lot of interest in the last decade. Most existing methods operate on either spectrograms or waveforms: spectrogram-based models learn masks that separate the magnitude spectrogram into different sources, while waveform-based models directly generate the waveforms of individual sources. The two types of models have complementary strengths; the former is superior for harmonic sources such as vocals, while the latter demonstrates better results for percussion and bass instruments. In this work, we improved upon the state-of-the-art (SoTA) models and successfully combined the best of both worlds. The backbones of the proposed framework, dubbed Danna-Sep, are two spectrogram-based models, a modified X-UMX and a U-Net, and an enhanced Demucs as the waveform-based model. Given an input mixture, we linearly combine the respective outputs of the three models to obtain the final result. We show in our experiments that, despite its simplicity, Danna-Sep surpasses the SoTA models by a large margin in terms of Source-to-Distortion Ratio.
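
The final blending step is a plain weighted sum of the three models' outputs per source. A minimal sketch, with illustrative weights rather than the tuned per-source values used in the submission:

```python
import numpy as np

def blend(outputs, weights):
    """Linearly combine separated stems from several separators.
    Each element of `outputs` has shape (n_sources, n_channels, n_samples)."""
    return sum(w * y for w, y in zip(weights, outputs))

# illustrative only; the actual weights can differ per source
# blended = blend([xumx_out, unet_out, demucs_out], [0.2, 0.3, 0.5])
```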

Recommended citation: Chin-Yun Yu and Kin-Wai Cheuk, "Danna-Sep: Unite to separate them all", The ISMIR Workshop on Music Source Separation, November 2021.
Download Paper

Music Demixing Challenge 2021

Published in Frontiers in Signal Processing, 2022

Music source separation has been intensively studied in the last decade, and tremendous progress has been observed with the advent of deep learning. Evaluation campaigns such as MIREX and SiSEC connected state-of-the-art models with their corresponding papers, helping researchers integrate the best practices into their own models. In recent years, the widely used MUSDB18 dataset has played an important role in measuring the performance of music source separation systems. While the dataset made a considerable contribution to the advancement of the field, it is also subject to several biases resulting from a focus on Western pop music and the limited number of mixing engineers involved. To address these issues, we designed the Music Demixing Challenge on a crowd-based machine learning competition platform, where the task is to separate stereo songs into four instrument stems (Vocals, Drums, Bass, Other). The main differences compared with past challenges are that 1) the competition is designed to more easily allow machine learning practitioners from other disciplines to participate, 2) evaluation is done on a hidden test set created by music professionals exclusively for the challenge to ensure its transparency, i.e., the test set is not accessible to anyone except the challenge organizers, and 3) the dataset covers a wider range of music genres and involves a greater number of mixing engineers. In this paper, we provide the details of the datasets, baselines, evaluation metrics, evaluation results, and technical challenges for future competitions.

Recommended citation: Yuki Mitsufuji, Giorgio Fabbro, Stefan Uhlich, Fabian-Robert Stöter, Alexandre Défossez, Minseok Kim, Woosung Choi, Chin-Yun Yu, and Kin-Wai Cheuk, "Music Demixing Challenge 2021", Frontiers in Signal Processing, 1:808395, January 2022.
Download Paper

Conditioning and Sampling in Variational Diffusion Models for Speech Super-Resolution

Published in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

Recently, diffusion models (DMs) have been increasingly used in audio processing tasks, including speech super-resolution (SR), which aims to restore high-frequency content given low-resolution speech utterances. This is commonly achieved by conditioning the noise predictor network on low-resolution audio. In this paper, we propose a novel sampling algorithm that communicates the information of the low-resolution audio via the reverse sampling process of DMs. The proposed method can be a drop-in replacement for the vanilla sampling process and can significantly improve the performance of existing works. Moreover, by coupling the proposed sampling method with an unconditional DM, i.e., a DM with no auxiliary inputs to its noise predictor, we can generalize it to a wide range of SR setups. We also attain state-of-the-art results on the VCTK Multi-Speaker benchmark with this novel formulation.
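
The idea of communicating the observation through the reverse process can be sketched generically: after each vanilla reverse step, the low-frequency band of the running sample is overwritten with an appropriately noised version of the low-resolution observation. This illustrates observation-guided sampling under assumed helper functions (`denoise_step`, `q_sample`, `lowpass`); it is not the exact update rule derived in the paper.

```python
import torch

@torch.no_grad()
def guided_sampling(x_T, y_low, n_steps, denoise_step, q_sample, lowpass):
    """x_T: initial noise; y_low: upsampled low-resolution observation.
    denoise_step(x, t): one vanilla reverse-diffusion update (assumed).
    q_sample(y, t): draw y_t ~ q(y_t | y_0) at noise level t (assumed).
    lowpass(x): projection onto the observed low-frequency band (assumed)."""
    x = x_T
    for t in reversed(range(n_steps)):
        x = denoise_step(x, t)             # unconditional reverse update
        y_t = q_sample(y_low, t)           # noise the observation to level t
        x = x - lowpass(x) + lowpass(y_t)  # keep generated highs, observed lows
    return x
```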

Recommended citation: Chin-Yun Yu, Sung-Lin Yeh, György Fazekas, and Hao Tang, "Conditioning and Sampling in Variational Diffusion Models for Speech Super-Resolution", IEEE International Conference on Acoustics, Speech and Signal Processing, June 2023.
Download Paper

Zero-Shot Duet Singing Voices Separation with Diffusion Models

Published in Sound Demixing Workshop, ISMIR, 2023

In recent studies, diffusion models have shown promise as priors for solving audio inverse problems, including source separation. These models allow us to sample from the posterior distribution of a target signal given an observed signal by manipulating the diffusion process. However, when separating audio sources of the same type, such as duet singing voices, the prior learned by the diffusion process may not be sufficient to maintain a consistent source identity in the separated audio; for example, the separated voice may switch from one singer to another over time. Tackling this problem would be useful for separating sources in a choir, or a mixture of multiple instruments with similar timbre, without acquiring large amounts of paired data. In this paper, we examine this problem in the context of duet singing voice separation and propose a method that enforces the coherence of singer identity by splitting the mixture into overlapping segments and performing posterior sampling in an auto-regressive manner, conditioning on the previous segment. We evaluate the proposed method on the MedleyVox dataset with different overlap ratios and show that it outperforms the naive posterior sampling baseline.
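
The segment-wise auto-regressive scheme can be sketched as a simple loop. Here `posterior_sample(segment, prev_overlap)` stands in for the diffusion-based posterior sampler, an assumed routine returning the two separated voices for one segment, conditioned on the previously separated overlapping tail to keep singer identities consistent.

```python
import numpy as np

def separate_duet(mix, seg_len, overlap, posterior_sample):
    """Overlapped auto-regressive posterior sampling (sketch).
    mix: 1-D mixture; returns array of shape (2, len(mix))."""
    hop = seg_len - overlap
    out = np.zeros((2, len(mix)))
    prev = None                                 # no conditioning for the first segment
    for start in range(0, len(mix) - seg_len + 1, hop):
        voices = posterior_sample(mix[start:start + seg_len], prev)
        out[:, start:start + seg_len] = voices  # later segments overwrite the overlap
        prev = voices[:, -overlap:]             # condition the next segment on this tail
    return out
```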

Recommended citation: Chin-Yun Yu, Emilian Postolache, Emanuele Rodolà, and György Fazekas, "Zero-Shot Duet Singing Voices Separation with Diffusion Models", The ISMIR Sound Demixing Workshop, November 2023.
Download Paper

Singing Voice Synthesis Using Differentiable LPC and Glottal-Flow-Inspired Wavetables

Published in International Society for Music Information Retrieval Conference (ISMIR), 2023

This paper introduces GlOttal-flow LPC Filter (GOLF), a novel method for singing voice synthesis (SVS) that exploits the physical characteristics of the human voice using differentiable digital signal processing. GOLF employs a glottal model as the harmonic source and IIR filters to simulate the vocal tract, resulting in an interpretable and efficient approach. We show it is competitive with state-of-the-art singing voice vocoders, requiring fewer synthesis parameters and less memory to train, and runs an order of magnitude faster for inference. Additionally, we demonstrate that GOLF can model the phase components of the human voice, which has immense potential for rendering and analysing singing voices in a differentiable manner. Our results highlight the effectiveness of incorporating the physical properties of the human voice mechanism into SVS and underscore the advantages of signal-processing-based approaches, which offer greater interpretability and efficiency in synthesis.
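
The source-filter structure of GOLF can be pictured with a toy, non-differentiable rendering: a glottal-flow wavetable read at the fundamental frequency, shaped by an all-pole (LPC) vocal-tract filter. The wavetable contents and the time-invariant coefficients below are placeholders; GOLF learns these components and runs them differentiably with time-varying parameters.

```python
import numpy as np
from scipy.signal import lfilter

def source_filter_synth(f0, sr, wavetable, lpc_coeffs, n_samples):
    phase = np.cumsum(np.full(n_samples, f0 / sr)) % 1.0       # instantaneous phase
    source = wavetable[(phase * len(wavetable)).astype(int)]   # glottal-flow source
    a = np.concatenate(([1.0], lpc_coeffs))                    # all-pole denominator
    return lfilter([1.0], a, source)                           # vocal-tract filtering
```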

Recommended citation: Chin-Yun Yu and György Fazekas, "Singing Voice Synthesis Using Differentiable LPC and Glottal-Flow-Inspired Wavetables", International Society for Music Information Retrieval Conference, November 2023.
Download Paper

Time-of-arrival estimation and phase unwrapping of head-related transfer functions with integer linear programming

Published in Audio Engineering Society 156th Convention, 2024

In binaural audio synthesis, aligning head-related impulse responses (HRIRs) in time has been an important pre-processing step, enabling accurate spatial interpolation and efficient data compression. The maximum-correlation time delay between spatially nearby HRIRs has previously been used to obtain accurate and smooth alignment by solving a matrix equation in which the solution has the minimum Euclidean distance to the time delays. However, the Euclidean criterion can lead to an over-smoothed solution in practice. In this paper, we address this issue by formulating the task as an integer linear programming problem, equivalent to minimising an L1-norm. Moreover, we incorporate 1) the cross-correlation of inter-aural HRIRs and 2) the correlation of HRIRs with their minimum-phase responses to obtain more reference measurements for the optimisation. We show that the proposed method achieves more accurate alignment than the Euclidean-based method by comparing the spectral reconstruction loss of time-aligned HRIRs under a spherical harmonics representation on seven HRIR datasets of human and dummy heads. The extra correlation features and the L1-norm are also beneficial in extremely noisy conditions. In addition, the method can be applied to phase unwrapping of head-related transfer functions, where the unwrapped phase can serve as a compact feature for downstream tasks.
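
Minimising an L1-norm over integer unknowns can be cast as an ILP by introducing slack variables that bound the absolute residuals. A generic sketch using SciPy, where `A` is a hypothetical difference matrix over pairs of nearby directions and `b` holds the measured delay differences:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def l1_integer_delays(A, b):
    """Solve min_x ||A x - b||_1 with integer x, via slacks t:
    -t <= A x - b <= t, minimise sum(t)."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(m)])                     # minimise sum of slacks
    upper = LinearConstraint(np.hstack([A, -np.eye(m)]), -np.inf, b)  # A x - t <= b
    lower = LinearConstraint(np.hstack([A, np.eye(m)]), b, np.inf)    # A x + t >= b
    integrality = np.concatenate([np.ones(n), np.zeros(m)])           # x integer, t continuous
    bounds = Bounds(np.concatenate([np.full(n, -np.inf), np.zeros(m)]),
                    np.full(n + m, np.inf))
    res = milp(c, constraints=[upper, lower],
               integrality=integrality, bounds=bounds)
    return np.round(res.x[:n]).astype(int)                            # the integer delays
```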

Recommended citation: Chin-Yun Yu, Johan Pauwels, and György Fazekas, "Time-of-arrival estimation and phase unwrapping of head-related transfer functions with integer linear programming", Audio Engineering Society Convention 156, June 2024.
Download Paper

Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-Synthesis

Published in INTERSPEECH, 2024

Training the linear prediction (LP) operator end-to-end for audio synthesis in modern deep learning frameworks is slow due to its recursive formulation. In addition, frame-wise approximation as an acceleration method cannot generalise well to test-time conditions where the LP is computed sample-wise. Efficient differentiable sample-wise LP for end-to-end training is the key to removing this barrier. We generalise the efficient time-invariant LP implementation from the GOLF vocoder to time-varying cases. Combining this with the classic source-filter model, we show that the improved GOLF learns LP coefficients and reconstructs the voice better than its frame-wise counterparts. Moreover, in our listening test, synthesised outputs from GOLF scored higher in quality ratings than those from the state-of-the-art differentiable WORLD vocoder.
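
The sample-wise, time-varying LP synthesis filter that the paper accelerates is just the following recursion, written out naively here for reference; this direct loop is exactly what is slow to train end-to-end in autodiff frameworks.

```python
import numpy as np

def tv_lp_synthesis(excitation, a):
    """Time-varying LP synthesis: y[n] = e[n] - sum_k a[n, k] * y[n - k].
    a has shape (n_samples, order), one coefficient set per sample."""
    n_samples, order = a.shape
    y = np.zeros(n_samples)
    for n in range(n_samples):
        acc = excitation[n]
        for k in range(1, order + 1):
            if n - k >= 0:
                acc -= a[n, k - 1] * y[n - k]
        y[n] = acc
    return y
```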

Recommended citation: Chin-Yun Yu and György Fazekas, "Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-Synthesis", Proc. Interspeech, September 2024.
Download Paper

Differentiable All-pole Filters for Time-varying Audio Systems

Published in International Conference on Digital Audio Effects (DAFx), 2024

Infinite impulse response filters are an essential building block of many time-varying audio systems, such as audio effects and synthesisers. However, their recursive structure impedes end-to-end training of these systems using automatic differentiation. Although non-recursive filter approximations like frequency sampling and frame-based processing have been proposed and widely used in previous works, they cannot accurately reflect the gradient of the original system. We alleviate this difficulty by re-expressing a time-varying all-pole filter to backpropagate the gradients through itself, so the filter implementation is not bound to the technical limitations of automatic differentiation frameworks. This implementation can be employed within audio systems containing filters with poles for efficient gradient evaluation. We demonstrate its training efficiency and expressive capabilities for modelling real-world dynamic audio systems on a phaser, a time-varying subtractive synthesiser, and a feed-forward compressor. We make our code and audio samples available online and provide the trained audio effect and synth models in a VST plugin.
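
The core trick of letting the filter backpropagate through itself can be sketched for the time-invariant case: the adjoint of an LTI all-pole filter applied to the incoming gradient is the same filter run on the time-reversed signal. The paper handles the time-varying case and the gradients with respect to the coefficients; `filt` below is a naive reference recursion, not the efficient implementation, and the coefficient gradient is omitted.

```python
import torch

def filt(x, a):
    """Naive all-pole recursion: y[n] = x[n] - sum_k a[k] * y[n - k]."""
    y = torch.zeros_like(x)
    for n in range(x.shape[-1]):
        acc = x[..., n]
        for k in range(1, a.shape[-1] + 1):
            if n - k >= 0:
                acc = acc - a[..., k - 1] * y[..., n - k]
        y[..., n] = acc
    return y

class AllPole(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, a):
        ctx.save_for_backward(a)
        return filt(x, a)          # forward pass is the filter itself

    @staticmethod
    def backward(ctx, grad_y):
        (a,) = ctx.saved_tensors
        # adjoint of an LTI all-pole filter: the same filter, time-reversed
        grad_x = filt(grad_y.flip(-1), a).flip(-1)
        return grad_x, None        # coefficient gradient omitted in this sketch
```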

Recommended citation: Chin-Yun Yu, Christopher Mitcheltree, Alistair Carson, Stefan Bilbao, Joshua Reiss and György Fazekas, "Differentiable All-pole Filters for Time-varying Audio Systems", International Conference on Digital Audio Effects, September 2024.
Download Paper

talks

NMF in PyTorch

I was invited by Dr. Lin, the head of MEIL, to give a talk on the development process and design of my personal project, PyTorch NMF. It was a wonderful experience to share my findings with experts in other fields and to discuss them together.
