plp

diffsptk.PLP: alias of PerceptualLinearPredictiveCoefficientsAnalysis

class diffsptk.PerceptualLinearPredictiveCoefficientsAnalysis(*, fft_length: int, plp_order: int, n_channel: int, sample_rate: int, compression_factor: float = 0.33, lifter: int = 1, f_min: float = 0, f_max: float | None = None, floor: float = 1e-05, gamma: float = 0, scale: str = 'htk', erb_factor: float | None = None, n_fft: int = 512, out_format: str | int = 'y', learnable: bool = False, device: device | None = None, dtype: dtype | None = None)

See this page for details.

Parameters:

fft_length (int >= 2): The number of FFT bins, \(L\).
plp_order (int >= 1): The order of the PLP, \(M\).
n_channel (int >= 1): The number of mel filter banks, \(C\).
sample_rate (int >= 1): The sample rate in Hz.
compression_factor (float > 0): The amplitude compression factor.
lifter (int >= 1): The liftering coefficient.
f_min (float >= 0): The minimum frequency in Hz.
f_max (float <= sample_rate // 2): The maximum frequency in Hz.
floor (float > 0): The minimum mel filter bank output in linear scale.
gamma (float in [-1, 1]): The parameter of the generalized logarithmic function.
scale (['htk', 'mel', 'inverted-mel', 'bark', 'linear']): The type of auditory scale used to construct the filter bank.
erb_factor (float > 0 or None): The scale factor for the ERB scale, referred to as the E-factor. If not None, the filter bandwidths are adjusted according to the scaled ERB scale.
n_fft (int >> M): The number of FFT bins used for the conversion from LPC to cepstrum. Accurate conversion requires a large value.
out_format (['y', 'yE', 'yc', 'ycE']): y is PLP, c is C0, and E is energy.
learnable (bool): Whether to make the mel basis learnable.
device (torch.device or None): The device of this module.
dtype (torch.dtype or None): The data type of this module.
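
The scale, n_channel, f_min, and f_max parameters together determine where the filter bank channels sit on the frequency axis. The following is a minimal pure-Python sketch of that idea for scale='htk', using the mel warping from the HTK Book. The helper names (hz_to_mel_htk, mel_to_hz_htk, filter_center_frequencies) are hypothetical and not part of diffsptk, and treating f_max=None as the Nyquist frequency is an assumption rather than documented behavior.

```python
import math


def hz_to_mel_htk(f_hz: float) -> float:
    """HTK mel warping: 1127 * ln(1 + f / 700)."""
    return 1127.0 * math.log1p(f_hz / 700.0)


def mel_to_hz_htk(mel: float) -> float:
    """Inverse of hz_to_mel_htk."""
    return 700.0 * math.expm1(mel / 1127.0)


def filter_center_frequencies(n_channel, sample_rate, f_min=0.0, f_max=None):
    """Return n_channel center frequencies in Hz, equally spaced on the mel axis.

    Hypothetical helper for illustration only; it is not a diffsptk function.
    """
    if f_max is None:
        f_max = sample_rate / 2  # assumption: None means the Nyquist frequency
    mel_lo = hz_to_mel_htk(f_min)
    mel_hi = hz_to_mel_htk(f_max)
    # n_channel interior points between the two band edges.
    step = (mel_hi - mel_lo) / (n_channel + 1)
    return [mel_to_hz_htk(mel_lo + step * (c + 1)) for c in range(n_channel)]


# Same settings as the example on this page: 8 channels at 8 kHz.
centers = filter_center_frequencies(n_channel=8, sample_rate=8000)
print([round(f, 1) for f in centers])
```

The centers are evenly spaced in mel, so their spacing in Hz widens toward f_max; this is why low frequencies are resolved more finely than high ones.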

References

[1] Young et al., “The HTK Book,” Cambridge University Press, 2006.

forward(x: Tensor) → Tensor

Compute the PLP from the power spectrum.

Parameters:

x (Tensor [shape=(…, L/2+1)]): The power spectrum.

Returns:

y (Tensor [shape=(…, M)]): The PLP without C0.
E (Tensor [shape=(…, 1)], optional): The energy.
c (Tensor [shape=(…, 1)], optional): The C0.
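
The shapes listed above imply how large the output is for each out_format string. The sketch below computes the total number of values per frame, assuming the selected parts are combined along the last axis; that layout is an assumption here, and plp_output_size is a hypothetical helper, not a diffsptk function.

```python
def plp_output_size(plp_order: int, out_format: str = "y") -> int:
    """Number of output values per frame for a given out_format string.

    Hypothetical helper: y contributes M values, c (C0) and E (energy)
    contribute one value each, following the shapes documented above.
    """
    if out_format not in ("y", "yE", "yc", "ycE"):
        raise ValueError(f"unsupported out_format: {out_format!r}")
    sizes = {"y": plp_order, "c": 1, "E": 1}
    return sum(sizes[part] for part in out_format)


for fmt in ("y", "yE", "yc", "ycE"):
    print(fmt, plp_output_size(4, fmt))
```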

Examples

>>> x = diffsptk.ramp(19)
>>> stft = diffsptk.STFT(frame_length=10, frame_period=10, fft_length=32)
>>> plp = diffsptk.PLP(
...     fft_length=32, plp_order=4, n_channel=8, sample_rate=8000
... )
>>> y = plp(stft(x))
>>> y
tensor([[-0.2896, -0.2356, -0.0586, -0.0387],
        [ 0.4468, -0.5820,  0.0104, -0.0505]])

diffsptk.functional.plp(x: Tensor, plp_order: int, n_channel: int, sample_rate: int, compression_factor: float = 0.33, lifter: int = 1, f_min: float = 0, f_max: float | None = None, floor: float = 1e-05, gamma: float = 0, scale: str = 'htk', erb_factor: float | None = None, n_fft: int = 512, out_format: str = 'y') → Tensor

Compute the PLP from the power spectrum.

Parameters:

x (Tensor [shape=(…, L/2+1)]): The power spectrum.
plp_order (int >= 1): The order of the PLP, \(M\).
n_channel (int >= 1): The number of mel filter banks, \(C\).
sample_rate (int >= 1): The sample rate in Hz.
compression_factor (float > 0): The amplitude compression factor.
lifter (int >= 1): The liftering coefficient.
f_min (float >= 0): The minimum frequency in Hz.
f_max (float <= sample_rate // 2): The maximum frequency in Hz.
floor (float > 0): The minimum mel filter bank output in linear scale.
gamma (float in [-1, 1]): The parameter of the generalized logarithmic function.
scale (['htk', 'mel', 'inverted-mel', 'bark', 'linear']): The type of auditory scale used to construct the filter bank.
erb_factor (float > 0 or None): The scale factor for the ERB scale, referred to as the E-factor. If not None, the filter bandwidths are adjusted according to the scaled ERB scale.
n_fft (int >> M): The number of FFT bins used for the conversion from LPC to cepstrum. Accurate conversion requires a large value.
out_format (['y', 'yE', 'yc', 'ycE']): y is PLP, c is C0, and E is energy.
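
Two of the scalar parameters above, gamma and lifter, can be illustrated with plain arithmetic. The sketch below uses the generalized logarithm as it is usually defined in the SPTK literature, and the sinusoidal lifter from the HTK Book. Whether diffsptk applies exactly these formulas inside plp is an assumption, and both function names are hypothetical.

```python
import math


def generalized_log(x: float, gamma: float = 0.0) -> float:
    """Generalized logarithm: log(x) if gamma == 0, else (x**gamma - 1) / gamma.

    Assumed definition; it approaches log(x) as gamma goes to 0.
    """
    if gamma == 0:
        return math.log(x)
    return (x ** gamma - 1.0) / gamma


def lifter_weights(plp_order: int, lifter: int = 1) -> list:
    """HTK-style lifter weights 1 + (L / 2) * sin(pi * m / L) for m = 1..M.

    Assumed form, taken from the HTK Book rather than from diffsptk itself.
    """
    return [
        1.0 + (lifter / 2.0) * math.sin(math.pi * m / lifter)
        for m in range(1, plp_order + 1)
    ]


print(generalized_log(2.0, 0.0), generalized_log(2.0, 1.0))
print(lifter_weights(4, lifter=1))
print(lifter_weights(4, lifter=22))
```

Under these assumed formulas, the default lifter=1 gives weights of 1 up to floating-point rounding (no liftering), and the default gamma=0 gives the ordinary natural logarithm.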

Returns:

y (Tensor [shape=(…, M)]): The PLP without C0.
E (Tensor [shape=(…, 1)], optional): The energy.
c (Tensor [shape=(…, 1)], optional): The C0.