Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-Synthesis

Abstract

Training the linear prediction (LP) operator end-to-end for audio synthesis in modern deep learning frameworks is slow due to its recursive formulation. In addition, frame-wise approximation as an acceleration method cannot generalise well to test time conditions where the LP is computed sample-wise. Efficient differentiable sample-wise LP for end-to-end training is the key to removing this barrier. We generalise the efficient time-invariant LP implementation from the GOLF vocoder to time-varying cases. Combining this with the classic source-filter model, we show that the improved GOLF learns LP coefficients and reconstructs the voice better than its frame-wise counterparts. Moreover, in our listening test, synthesised outputs from GOLF scored higher in quality ratings than the state-of-the-art differentiable WORLD vocoder.

Figure 1: Flow diagram of the proposed end-to-end analysis-by-synthesis experiment. S(z) is the speech, G(z) is the harmonic oscillator, C(z) is the noise filter, and H(z) is the vocal tract filter.

Listening Samples

This section contains the listening samples from the VCTK Corpus used for subjective evaluation in the paper. Reference is the ground truth audio, and the other columns are the copy-synthesis speech using the same clip with different models, which are the traditional linear predictive coding (LPC), the neural homomorphic vocoder (NHV), the differentiable WORLD vocoder (▽World), and the proposed source-filter GOLF vocoder (GOLF).

p361

Test clip	Reference	LPC	NHV	▽World	GOLF
p361_004
p361_006
p361_008
p361_011
p361_024

p360

Test clip	Reference	LPC	NHV	▽World	GOLF
p360_003
p360_005
p360_007
p360_009
p360_019

GOLF variants

This section compares the following variants of the GOLF synthesiser: the original GOLF (GOLF-v1), the source-filter GOLF trained with frame-wise LPC and synthesise with frame-wise (GOLF-ff) or sample-wise (GOLF-fs) LPC, and the proposed source-filter GOLF with sample-wise LPC both used in training and evaluation (GOLF-ss).

Test clip	Reference	GOLF-v1	GOLF-ff	GOLF-fs	GOLF-ss
p361_004
p361_006
p361_008
p361_011
p361_024
p360_003
p360_005
p360_007
p360_009
p360_019

Citation


@inproceedings{ycy2024golf,
    title={Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-Synthesis},
    author={Chin-Yun Yu and Gy{\"o}rgy Fazekas},
    booktitle={INTERSPEECH},
    year={2024}
}