Deep Hidden Semi-Markov Model-Based Speech Synthesis
Yoshihiko Nankaku, Takato Fujimoto, Takenori Yoshimura, Shinji Takaki, Kei Hashimoto, Keiichiro Oura, and Keiichi Tokuda
Nagoya Institute of Technology, Japan
Abstract:
This paper proposes a speech synthesis technique based on a neural
sequence-to-sequence (Seq2Seq) model that incorporates the structure
of hidden semi-Markov models (HSMMs).
While Seq2Seq models with attention mechanisms have achieved
high-quality speech synthesis, they suffer from alignment instability
and the absence of explicit duration modeling, making direct duration
control difficult.
To address these challenges, recent approaches have explored models
that incorporate explicit alignment and duration representations
instead of attention mechanisms.
However, these methods have yet to fully achieve the consistency in
duration handling that traditional HSMM-based synthesis provides.
The proposed model is a theoretically well-grounded deep generative
model that integrates HSMM structure into a variational autoencoder
(VAE).
It performs a probabilistic alignment search over the full alignment space
while accounting for duration probabilities, and its training algorithm is
derived purely from maximization of the evidence lower bound (ELBO),
without heuristic assumptions or auxiliary criteria.
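In schematic form, with notation assumed here rather than taken from the
paper's exact definitions: for an acoustic feature sequence $\boldsymbol{x}$,
linguistic input $\boldsymbol{l}$, continuous latent variables
$\boldsymbol{z}$, and a discrete HSMM state-and-duration sequence
$\boldsymbol{s}$, such an ELBO takes the standard form
\[
\log p(\boldsymbol{x} \mid \boldsymbol{l}) \;\geq\;
\mathbb{E}_{q(\boldsymbol{z}, \boldsymbol{s} \mid \boldsymbol{x}, \boldsymbol{l})}
\left[ \log \frac{p(\boldsymbol{x}, \boldsymbol{z}, \boldsymbol{s} \mid \boldsymbol{l})}
{q(\boldsymbol{z}, \boldsymbol{s} \mid \boldsymbol{x}, \boldsymbol{l})} \right],
\]
where summing over all admissible state sequences $\boldsymbol{s}$, weighted
by their duration probabilities, corresponds to the full-space alignment
search described above.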
A key contribution of this work is clarifying the two-stage approximation
essential to the proposed model: (i) a conjugate posterior distribution with
an HSMM structure, and (ii) a subsequent mean-field approximation for the
VAE decoder.
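As an illustrative sketch of the second stage (the factorization shown is an
assumption of this summary, not a verbatim statement of the model), the
mean-field step decouples the discrete alignment from the continuous latent
code:
\[
q(\boldsymbol{z}, \boldsymbol{s} \mid \boldsymbol{x}, \boldsymbol{l})
\;\approx\;
q(\boldsymbol{s} \mid \boldsymbol{x}, \boldsymbol{l})\,
q(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{l}),
\]
so that the HSMM-structured factor $q(\boldsymbol{s} \mid \cdot)$ remains
tractable via the HSMM forward-backward algorithm while the VAE decoder
operates on expectations taken under it.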
Furthermore, interpreting the proposed model as a Seq2Seq model with
an HSMM-structured attention mechanism establishes a theoretical
connection between attention mechanisms and explicit alignment
modeling.
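Concretely, one plausible reading of this connection (sketched here under
assumed notation) is that the learned attention weights are replaced by HSMM
posterior state-occupancy probabilities:
\[
\alpha_t(j) = p(s_t = j \mid \boldsymbol{x}, \boldsymbol{l}),
\qquad
\boldsymbol{c}_t = \sum_{j} \alpha_t(j)\, \boldsymbol{h}_j,
\]
where $\boldsymbol{h}_j$ is the encoder output for the $j$-th linguistic unit
and $\boldsymbol{c}_t$ the context vector at acoustic frame $t$; unlike
softmax attention, these weights are constrained to monotonic,
duration-aware alignments.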
Experiments on a Japanese speech database demonstrate that the proposed
method synthesizes higher-quality speech than conventional neural
network-based acoustic models, while maintaining high modeling efficiency
even with limited training data.
Code: https://github.com/sp-nitech/DHSMM-TTS
Audio samples