Deep Hidden Semi-Markov Model-Based Speech Synthesis

Yoshihiko Nankaku, Takato Fujimoto, Takenori Yoshimura, Shinji Takaki, Kei Hashimoto, Keiichiro Oura, and Keiichi Tokuda
Nagoya Institute of Technology, Japan

Abstract: This paper proposes a speech synthesis technique based on a neural sequence-to-sequence (Seq2Seq) model that incorporates the structure of hidden semi-Markov models (HSMMs). While Seq2Seq models with attention mechanisms have achieved high-quality speech synthesis, they suffer from alignment instability and lack explicit duration modeling, which makes direct duration control difficult. To address these challenges, recent approaches have explored models that replace attention mechanisms with explicit alignment and duration representations. However, these methods have yet to achieve the consistent treatment of duration that traditional HSMM-based synthesis provides. The proposed model is a theoretically well-grounded deep generative model that integrates the HSMM structure into a variational autoencoder (VAE). It performs a probabilistic full-space alignment search that takes duration probabilities into account, and its training algorithm is derived purely from maximization of the evidence lower bound (ELBO), without relying on heuristic assumptions or auxiliary criteria. A key contribution of this work is the clarification of an essential two-stage approximation required by the proposed model: (i) a conjugate posterior distribution with an HSMM structure, and (ii) a subsequent mean-field approximation for the VAE decoder. Furthermore, interpreting the proposed model as a Seq2Seq model with an HSMM-structured attention mechanism establishes a theoretical connection between attention mechanisms and explicit alignment modeling. Experiments on a Japanese speech database demonstrate that the proposed method synthesizes higher-quality speech than conventional neural network-based acoustic models while maintaining high modeling efficiency even with limited training data.
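The abstract refers to a probabilistic full-space alignment search that weighs every possible state segmentation by explicit duration probabilities. As a rough, self-contained illustration of the underlying HSMM machinery (a toy numpy sketch, not the paper's actual code; hsmm_forward, log_obs, and log_dur are names chosen here), the snippet below computes the forward log-likelihood of a left-to-right HSMM by marginalizing over all segment boundaries:

    import numpy as np

    def hsmm_forward(log_obs, log_dur):
        # Log-domain forward pass of a left-to-right HSMM: sums over every
        # segmentation of T frames into N consecutive state segments, with
        # each segment weighted by an explicit duration log-probability.
        #   log_obs: (T, N) array, log p(frame t | state n)
        #   log_dur: (N, D) array, log p(state n lasts d+1 frames)
        T, N = log_obs.shape
        D = log_dur.shape[1]
        # cum[t, n] = sum of log_obs[0:t, n]; gives O(1) segment scores
        cum = np.vstack([np.zeros(N), np.cumsum(log_obs, axis=0)])
        # log_alpha[t, n]: log-prob that state n ends exactly at frame t
        log_alpha = np.full((T + 1, N), -np.inf)
        for j in range(N):
            for t in range(1, T + 1):
                cands = []
                for d in range(1, min(D, t) + 1):
                    if j == 0:
                        # the first state's segment must start at frame 0
                        prev = 0.0 if t - d == 0 else -np.inf
                    else:
                        prev = log_alpha[t - d, j - 1]
                    if prev == -np.inf:
                        continue
                    seg = cum[t, j] - cum[t - d, j]  # frames t-d .. t-1
                    cands.append(prev + log_dur[j, d - 1] + seg)
                if cands:
                    log_alpha[t, j] = np.logaddexp.reduce(np.array(cands))
        return log_alpha[T, N - 1]

    # Toy usage with random emission and duration log-probabilities.
    rng = np.random.default_rng(0)
    T, N, D = 20, 4, 8
    log_obs = np.log(rng.dirichlet(np.ones(N), size=T))
    log_dur = np.log(rng.dirichlet(np.ones(D), size=N))
    print(hsmm_forward(log_obs, log_dur))

In the proposed model, a marginalization of this kind would sit inside the ELBO, with neural networks supplying the emission and duration terms; see the paper and the repository linked below for the actual formulation.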

Code: https://github.com/sp-nitech/DHSMM-TTS


Audio samples

Five sample utterances (fn1–fn5) are provided for each system below.

XIMERA corpus (Full-set: 9.5 hours, WaveGrad)

Natural
AS
Tacotron 2
FastSpeech 2
Glow-TTS
DHSMM-TTS VEM
DHSMM-TTS ELBO

XIMERA corpus (Full-set: 9.5 hours, HiFi-GAN)

Natural
AS
Tacotron 2
FastSpeech 2
Glow-TTS
DHSMM-TTS VEM
DHSMM-TTS ELBO

XIMERA corpus (Small-set: 0.55 hours, WaveGrad)

AS
FastSpeech 2
DHSMM-TTS VEM_Viterbi
DHSMM-TTS VEM_sg_γ
DHSMM-TTS VEM
DHSMM-TTS ELBO

Duration control demo (Full-set: 9.5 hours, WaveGrad); a sketch of rate-based duration scaling follows the sample list.

r_d=1.0
r_d=0.8
r_d=1.2
Modified
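For reference, if the rate r_d above multiplies each predicted state duration (a common convention in explicit-duration synthesis; the paper's exact definition may differ), global speaking-rate control can be sketched as follows. scale_durations is a hypothetical helper written for this page, not part of the released code:

    def scale_durations(durations, r_d):
        # Rescale predicted per-state frame counts by a global rate r_d,
        # carrying the rounding error forward so the total length stays
        # close to r_d * sum(durations).  Under this convention, r_d < 1
        # yields faster speech and r_d > 1 slower speech.
        scaled, carry = [], 0.0
        for d in durations:
            target = d * r_d + carry
            q = max(1, round(target))
            carry = target - q
            scaled.append(q)
        return scaled

    print(scale_durations([7, 12, 5, 9], 0.8))  # -> [6, 9, 4, 7]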