fbank
- diffsptk.FBANK
alias of MelFilterBankAnalysis
- class diffsptk.MelFilterBankAnalysis(*, fft_length: int, n_channel: int, sample_rate: int, f_min: float = 0, f_max: float | None = None, floor: float = 1e-05, gamma: float = 0, scale: str = 'htk', erb_factor: float | None = None, use_power: bool = False, out_format: str | int = 'y', learnable: bool = False)
See this page for details.
- Parameters:
- fft_length : int >= 2
The number of FFT bins, \(L\).
- n_channel : int >= 1
The number of mel filter banks, \(C\).
- sample_rate : int >= 1
The sample rate in Hz.
- f_min : float >= 0
The minimum frequency in Hz.
- f_max : float <= sample_rate // 2
The maximum frequency in Hz.
- floor : float > 0
The minimum mel filter bank output in linear scale.
- gamma : float in [-1, 1]
The parameter of the generalized logarithmic function.
- scale : ['htk', 'mel', 'inverted-mel', 'bark', 'linear']
The type of auditory scale used to construct the filter bank.
- erb_factor : float or None
The scale factor for the ERB scale, referred to as the E-factor. If not None, the filter bandwidths are adjusted according to the scaled ERB scale.
- use_power : bool
If True, use the power spectrum instead of the amplitude spectrum.
- out_format : ['y', 'yE', 'y,E']
y is the mel filter bank output and E is the energy. If this is 'yE', the two output tensors are concatenated and returned as a single tensor instead of a tuple.
- learnable : bool
Whether to make the basis learnable.
References
[1]S. Young et al., “The HTK Book Version 3.4,” Cambridge University Press, 2006.
[2]T. Ganchev et al., “Comparative evaluation of various MFCC implementations on the speaker verification task,” Proceedings of SPECOM, vol. 1, pp. 191-194, 2005.
[3]M. D. Skowronski et al., “Exploiting independent filter bandwidth of human factor cepstral coefficients in automatic speech recognition,” The Journal of the Acoustical Society of America, vol. 116, no. 3, pp. 1774-1780, 2004.
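As an illustration of how scale, f_min, f_max, and n_channel interact, the following is a minimal sketch of placing filter centers on the HTK-style mel axis. It is not the library's code: the helper names (hz_to_mel, mel_to_hz, center_frequencies) are hypothetical, the formula is the commonly cited HTK mel mapping, and treating f_max=None as the Nyquist frequency is an assumption.

```python
import math
from typing import List, Optional


def hz_to_mel(f_hz: float) -> float:
    """Commonly cited HTK-style mel mapping: 1127 * ln(1 + f / 700)."""
    return 1127.0 * math.log1p(f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * math.expm1(mel / 1127.0)


def center_frequencies(
    n_channel: int,
    sample_rate: int,
    f_min: float = 0.0,
    f_max: Optional[float] = None,
) -> List[float]:
    """Hypothetical helper: centers equally spaced on the mel axis."""
    if f_max is None:
        f_max = sample_rate / 2  # assumption: None means the Nyquist frequency
    mel_lo, mel_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    # n_channel triangular filters need n_channel + 2 edge points,
    # so the interior points are spaced by 1 / (n_channel + 1) of the range.
    step = (mel_hi - mel_lo) / (n_channel + 1)
    return [mel_to_hz(mel_lo + step * (c + 1)) for c in range(n_channel)]


# Same settings as a 4-channel filter bank at 8 kHz.
centers = center_frequencies(n_channel=4, sample_rate=8000)
print([round(f, 1) for f in centers])
```

Because the centers are equally spaced in mel rather than in Hz, the gap between neighboring centers widens toward higher frequencies.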
- forward(x: Tensor) -> Tensor | tuple[Tensor, Tensor]
Apply mel filter banks to the STFT.
- Parameters:
- x : Tensor [shape=(…, L/2+1)]
The power spectrum.
- Returns:
- y : Tensor [shape=(…, C)]
The mel filter bank output.
- E : Tensor [shape=(…, 1)] (optional)
The energy.
Examples
>>> x = diffsptk.ramp(19)
>>> stft = diffsptk.STFT(frame_length=10, frame_period=10, fft_length=32)
>>> fbank = diffsptk.MelFilterBankAnalysis(
...     fft_length=32, n_channel=4, sample_rate=8000
... )
>>> y = fbank(stft(x))
>>> y
tensor([[0.1214, 0.4825, 0.6072, 0.3589],
        [3.3640, 3.4518, 2.7717, 0.5088]])
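To make the roles of use_power, floor, and gamma concrete, here is a minimal single-channel sketch in plain Python. It is an assumption-laden illustration, not the library's implementation: the helper names are hypothetical, the order of operations (optional square root, weighted sum, flooring, generalized logarithm) is inferred from the parameter descriptions, and the generalized logarithm is assumed to be (x^gamma - 1) / gamma, which reduces to the natural logarithm at gamma = 0.

```python
import math
from typing import Sequence


def generalized_log(x: float, gamma: float) -> float:
    """Assumed definition: (x**gamma - 1) / gamma, with ln(x) at gamma == 0."""
    if gamma == 0:
        return math.log(x)
    return (x ** gamma - 1.0) / gamma


def channel_output(
    power_spectrum: Sequence[float],
    weights: Sequence[float],
    floor: float = 1e-5,
    gamma: float = 0.0,
    use_power: bool = False,
) -> float:
    """Hypothetical single-channel filter bank output for one frame."""
    # Assumption: the input is a power spectrum, so the amplitude
    # spectrum is obtained by taking the square root.
    if use_power:
        spectrum = list(power_spectrum)
    else:
        spectrum = [math.sqrt(p) for p in power_spectrum]
    # Weighted sum of the spectrum under one (triangular) filter.
    y = sum(w * s for w, s in zip(weights, spectrum))
    # `floor` bounds the output in the linear domain, before the logarithm.
    y = max(y, floor)
    return generalized_log(y, gamma)


power = [4.0, 9.0, 16.0]
weights = [0.5, 1.0, 0.5]  # illustrative triangular weights

y_amp = channel_output(power, weights)                  # amplitude: 1 + 3 + 2
y_pow = channel_output(power, weights, use_power=True)  # power: 2 + 9 + 8
y_silent = channel_output([0.0, 0.0, 0.0], weights)     # clipped at floor
```

Under these assumptions, flooring keeps the logarithm finite for silent frames, which is why floor must be strictly positive.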
- diffsptk.functional.fbank(x: Tensor, n_channel: int, sample_rate: int, f_min: float = 0, f_max: float | None = None, floor: float = 1e-05, gamma: float = 0, scale: str = 'htk', erb_factor: float | None = None, use_power: bool = False, out_format: str = 'y') -> tuple[Tensor, Tensor] | Tensor
Apply mel filter banks to the STFT.
- Parameters:
- x : Tensor [shape=(…, L/2+1)]
The power spectrum.
- n_channel : int >= 1
The number of mel filter banks, \(C\).
- sample_rate : int >= 1
The sample rate in Hz.
- f_min : float >= 0
The minimum frequency in Hz.
- f_max : float <= sample_rate // 2
The maximum frequency in Hz.
- floor : float > 0
The minimum mel filter bank output in linear scale.
- gamma : float in [-1, 1]
The parameter of the generalized logarithmic function.
- scale : ['htk', 'mel', 'inverted-mel', 'bark', 'linear']
The type of auditory scale used to construct the filter bank.
- erb_factor : float or None
The scale factor for the ERB scale, referred to as the E-factor. If not None, the filter bandwidths are adjusted according to the scaled ERB scale.
- use_power : bool
If True, use the power spectrum instead of the amplitude spectrum.
- out_format : ['y', 'yE', 'y,E']
y is the mel filter bank output and E is the energy. If this is 'yE', the two output tensors are concatenated and returned as a single tensor instead of a tuple.
- Returns:
- y : Tensor [shape=(…, C)]
The mel filter bank output.
- E : Tensor [shape=(…, 1)] (optional)
The energy.
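The three out_format options differ only in how y and E are packaged. The sketch below mimics that packaging for a single frame using plain lists instead of tensors. The helper format_output is hypothetical, and the concatenation order (y first, then E) is an assumption suggested by the option name 'yE'.

```python
from typing import List, Tuple, Union

Frame = List[float]


def format_output(
    y: Frame, energy: float, out_format: str = "y"
) -> Union[Frame, Tuple[Frame, Frame]]:
    """Hypothetical helper mirroring the documented out_format options."""
    if out_format == "y":
        # Filter bank output only: length C.
        return y
    if out_format == "yE":
        # One concatenated sequence: length C + 1 (order assumed: y, then E).
        return y + [energy]
    if out_format == "y,E":
        # Two separate outputs: lengths C and 1.
        return y, [energy]
    raise ValueError(f"unsupported out_format: {out_format!r}")


y_frame = [0.1, 0.4, 0.6, 0.3]  # illustrative values with C = 4
energy = 2.5                    # illustrative energy value

only_y = format_output(y_frame, energy, "y")
joined = format_output(y_frame, energy, "yE")
pair = format_output(y_frame, energy, "y,E")
```

With C = 4, 'y' yields 4 values, 'yE' yields 5 values in one sequence, and 'y,E' yields a pair whose parts have 4 and 1 values.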