vc

Functions

int main(int argc, char *argv[])

vc [ option ] gmmfile [ infile ]

  • -l int

    • length of source vector \((1 \le M_1 + 1)\)

  • -m int

    • order of source vector \((0 \le M_1)\)

  • -L int

    • length of target vector \((1 \le M_2 + 1)\)

  • -M int

    • order of target vector \((0 \le M_2)\)

  • -k int

    • number of mixtures \((1 \le K)\)

  • -f

    • use full or block covariance matrices instead of diagonal ones

  • -d double+

    • delta coefficients

  • -D str

    • filename of double-type delta coefficients

  • -r int+

    • width of 1st (and 2nd) regression coefficients

  • -magic double

    • magic number

  • gmmfile str

    • double-type GMM parameters

  • infile str

    • double-type source static+dynamic vector sequence

  • stdout

    • double-type target static vector sequence

In the following example, the converted 4-th order target vectors corresponding to data.source are obtained using the trained 2-mixture GMM data.gmm.

delta -l 5 -d -0.5 0.0 0.5 data.source | \
  vc -k 2 -l 5 data.gmm > data.target
Parameters:
  • argc[in] Number of arguments.

  • argv[in] Argument vector.

Returns:

0 on success, 1 on failure.

See also

gmm gmmp

class GaussianMixtureModelBasedConversion

Perform GMM-based voice conversion.

The input is the \((D+1)(M_1+1)\)-length source vectors:

\[ \begin{array}{cccc} \boldsymbol{X}_0, & \boldsymbol{X}_1, & \ldots, & \boldsymbol{X}_{T-1}, \end{array} \]
where
\[ \boldsymbol{X}_t = \left[ \begin{array}{cccc} \boldsymbol{x}_t^{\mathsf{T}} & \Delta^{(1)} \boldsymbol{x}_t^{\mathsf{T}} & \cdots & \Delta^{(D)} \boldsymbol{x}_t^{\mathsf{T}} \end{array} \right]^{\mathsf{T}}. \]
The output is the \((M_2+1)\)-length target vectors:
\[ \begin{array}{cccc} \boldsymbol{y}_0, & \boldsymbol{y}_1, & \ldots, & \boldsymbol{y}_{T-1}. \end{array} \]
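As a concrete sketch of how \(\boldsymbol{X}_t\) is formed, the following helper (hypothetical, not SPTK code) applies delta windows to a scalar static sequence (\(M_1 = 0\)) and stacks the static and dynamic parts. Edge frames are handled by replication here, which is a simplifying assumption:

```cpp
#include <cstddef>
#include <vector>

// Build the joint vector X_t = [x_t, delta^(1) x_t, ..., delta^(D) x_t]
// for a scalar static sequence. Each window (e.g. {-0.5, 0.0, 0.5} for the
// first-order delta) is applied around frame t; out-of-range frames are
// replicated from the nearest edge.
std::vector<double> MakeJointVector(
    const std::vector<double>& statics, std::size_t t,
    const std::vector<std::vector<double>>& windows) {
  std::vector<double> joint(1, statics[t]);  // static part x_t
  for (const auto& w : windows) {
    const int half = static_cast<int>(w.size()) / 2;
    double d = 0.0;
    for (int i = 0; i < static_cast<int>(w.size()); ++i) {
      int u = static_cast<int>(t) + i - half;
      if (u < 0) u = 0;  // replicate left edge (simplifying assumption)
      if (u >= static_cast<int>(statics.size()))
        u = static_cast<int>(statics.size()) - 1;  // replicate right edge
      d += w[i] * statics[u];
    }
    joint.push_back(d);  // dynamic part delta^(k) x_t
  }
  return joint;
}
```

With the first-order window above and statics {1, 2, 4}, frame 1 yields the joint vector {2, 1.5}.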

The optimal target vectors can be derived in a maximum likelihood sense:

\[\begin{split}\begin{eqnarray} \hat{\boldsymbol{y}} &=& \mathop{\mathrm{argmax}}_{\boldsymbol{y}} \, p(\boldsymbol{Y} \,|\, \boldsymbol{X}, \boldsymbol{\lambda}) \\ &=& \mathop{\mathrm{argmax}}_{\boldsymbol{y}} \sum_{\boldsymbol{m}} p(\boldsymbol{m} \,|\, \boldsymbol{X}, \boldsymbol{\lambda}) \, p(\boldsymbol{Y} \,|\, \boldsymbol{X}, \boldsymbol{m}, \boldsymbol{\lambda}), \end{eqnarray}\end{split}\]
where \(\boldsymbol{\lambda}\) is the GMM that models the joint vectors. The mean vector and the covariance matrix of the \(m\)-th mixture component are written as
\[\begin{split} \boldsymbol{\mu}_m = \left[ \begin{array}{c} \boldsymbol{\mu}_m^{(X)} \\ \boldsymbol{\mu}_m^{(Y)} \end{array} \right] \end{split}\]
and
\[\begin{split} \boldsymbol{\varSigma}_m = \left[ \begin{array}{cc} \boldsymbol{\varSigma}_m^{(XX)} & \boldsymbol{\varSigma}_m^{(XY)} \\ \boldsymbol{\varSigma}_m^{(YX)} & \boldsymbol{\varSigma}_m^{(YY)} \end{array} \right]. \end{split}\]
To easily compute the ML estimate, the maximum a posteriori approximation is applied:
\[\begin{split}\begin{eqnarray} \hat{\boldsymbol{y}} &=& \mathop{\mathrm{argmax}}_{\boldsymbol{y}} \, p(\hat{\boldsymbol{m}} \,|\, \boldsymbol{X}, \boldsymbol{\lambda}) \, p(\boldsymbol{Y} \,|\, \boldsymbol{X}, \hat{\boldsymbol{m}}, \boldsymbol{\lambda}) \\ &=& \mathop{\mathrm{argmax}}_{\boldsymbol{y}} \prod_{t=0}^{T-1} p(\hat{m}_t \,|\, \boldsymbol{X}_t, \boldsymbol{\lambda}) \, p(\boldsymbol{Y}_t \,|\, \boldsymbol{X}_t, \hat{m}_t, \boldsymbol{\lambda}), \end{eqnarray}\end{split}\]
where
\[ p(\hat{m}_t \,|\, \boldsymbol{X}_t, \boldsymbol{\lambda}) = \max_m \, \frac{w_m \mathcal{N} \left( \boldsymbol{X}_t \,\big|\, \boldsymbol{\mu}_m^{(X)}, \boldsymbol{\varSigma}_m^{(XX)} \right)} {\sum_{n=1}^K w_n \mathcal{N} \left( \boldsymbol{X}_t \,\big|\, \boldsymbol{\mu}_n^{(X)}, \boldsymbol{\varSigma}_n^{(XX)} \right)}, \]
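The mixture selection above can be sketched for one-dimensional source features. This toy helper (not SPTK code; names are hypothetical) evaluates the posterior numerators and returns the argmax; the shared denominator does not change which mixture wins, so it is omitted:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Univariate Gaussian density N(x | mean, variance).
double GaussianDensity(double x, double mean, double variance) {
  const double kPi = 3.141592653589793;
  const double z = (x - mean) * (x - mean) / variance;
  return std::exp(-0.5 * z) / std::sqrt(2.0 * kPi * variance);
}

// Returns argmax_m of w_m N(x | mu_m, var_m). The posterior's common
// denominator (the sum over all mixtures) does not affect the argmax.
int SelectMixture(double x, const std::vector<double>& weights,
                  const std::vector<double>& means,
                  const std::vector<double>& variances) {
  int best = 0;
  double best_score = -1.0;
  for (std::size_t m = 0; m < weights.size(); ++m) {
    const double score =
        weights[m] * GaussianDensity(x, means[m], variances[m]);
    if (score > best_score) {
      best_score = score;
      best = static_cast<int>(m);
    }
  }
  return best;
}
```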
and
\[\begin{split}\begin{eqnarray} p(\boldsymbol{Y}_t \,|\, \boldsymbol{X}_t, m, \boldsymbol{\lambda}) &=& \mathcal{N} \left( \boldsymbol{Y}_t \,\big|\, \boldsymbol{E}_{m,t}^{(Y)}, \boldsymbol{D}_{m,t}^{(Y)} \right), \\ \boldsymbol{E}_{m,t}^{(Y)} &=& \boldsymbol{\mu}_m^{(Y)} + \boldsymbol{\varSigma}_m^{(YX)} \boldsymbol{\varSigma}_m^{(XX)^{-1}} \left( \boldsymbol{X}_t - \boldsymbol{\mu}_m^{(X)} \right), \\ \boldsymbol{D}_{m,t}^{(Y)} &=& \boldsymbol{\varSigma}_m^{(YY)} - \boldsymbol{\varSigma}_m^{(YX)} \boldsymbol{\varSigma}_m^{(XX)^{-1}} \boldsymbol{\varSigma}_m^{(XY)}. \end{eqnarray}\end{split}\]
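For intuition, in the scalar case (one-dimensional \(X\) and \(Y\)) the conditional mean and covariance above reduce to the familiar formulas \(E = \mu^{(Y)} + \sigma^{(YX)} (x - \mu^{(X)}) / \sigma^{(XX)}\) and \(D = \sigma^{(YY)} - (\sigma^{(YX)})^2 / \sigma^{(XX)}\). The following is a sketch under that assumption, not the library's implementation:

```cpp
#include <utility>

// Condition a 2-D joint Gaussian over (X, Y) on an observation x.
// Returns {E, D}: the conditional mean and variance of Y given X = x.
std::pair<double, double> ConditionYGivenX(double x, double mu_x, double mu_y,
                                           double s_xx, double s_xy,
                                           double s_yy) {
  const double e = mu_y + s_xy / s_xx * (x - mu_x);  // E_{m,t}^{(Y)}
  const double d = s_yy - s_xy * s_xy / s_xx;        // D_{m,t}^{(Y)}
  return {e, d};
}
```

Note that the conditional variance D does not depend on the observed x, only on the covariance blocks, which is why it can be precomputed per mixture.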
The converted static vector sequence \(\hat{\boldsymbol{y}}\) under the constraint between static and dynamic components is obtained by the maximum likelihood parameter generation algorithm.

[1] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222-2235, 2007.

Public Functions

GaussianMixtureModelBasedConversion(int num_source_order, int num_target_order, const std::vector<std::vector<double>> &window_coefficients, const std::vector<double> &weights, const std::vector<std::vector<double>> &mean_vectors, const std::vector<SymmetricMatrix> &covariance_matrices, bool use_magic_number, double magic_number = 0.0)
Parameters:
  • num_source_order[in] Order of source vector, \(M_1\).

  • num_target_order[in] Order of target vector, \(M_2\).

  • window_coefficients[in] Window coefficients, e.g., { {-0.5, 0.0, 0.5}, {1.0, -2.0, 1.0} }.

  • weights[in] \(K\) mixture weights.

  • mean_vectors[in] \(K\) mean vectors. The shape is \([K, (D+1)(M_1+M_2+2)]\).

  • covariance_matrices[in] \(K\) covariance matrices. The shape is \([K, (D+1)(M_1+M_2+2), (D+1)(M_1+M_2+2)]\).

  • use_magic_number[in] Whether to use magic number.

  • magic_number[in] A magic number that represents a discrete symbol.

inline int GetNumSourceOrder() const
Returns:

Order of source vector.

inline int GetNumTargetOrder() const
Returns:

Order of target vector.

inline bool IsValid() const
Returns:

True if this object is valid.

bool Run(const std::vector<std::vector<double>> &source_vectors, std::vector<std::vector<double>> *target_vectors) const
Parameters:
  • source_vectors[in] \(M_1\)-th order source vectors containing dynamic components. The shape is \([T, (D+1)(M_1+1)]\).

  • target_vectors[out] \(M_2\)-th order target vectors. The shape is \([T, (M_2+1)]\).

Returns:

True on success, false on failure.