Speech Features

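Speech is a very high-dimensional signal. For instance, at a sampling frequency of 16 kHz we have 16,000 samples every second. From a machine learning perspective, working with such high-dimensional data can be challenging. The goal of feature extraction is to find more compact ways to represent speech.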

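Some years ago, the design of suitable speech features was a very active research area. With the advent of deep learning, however, the trend is to feed neural networks with simple features and let the network discover higher-level representations on its own.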

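In this tutorial, we describe the two most popular speech features:

Filter banks (FBANKs)
Mel-frequency cepstral coefficients (MFCCs)

We then cover common techniques for adding context information.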

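1. Filter Banks (FBANKs)

FBANKs are time-frequency representations computed by applying a set of filters to the spectrogram of a speech signal. Please see this tutorial for a detailed overview of Fourier transforms and spectrograms.

First, let's download a speech signal and install SpeechBrain: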

%%capture
# Installing SpeechBrain via pip
BRANCH = 'develop'
!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH

# Clone SpeechBrain repository
!git clone https://github.com/speechbrain/speechbrain/
%cd /content/speechbrain/

%%capture
!wget https://www.dropbox.com/s/u8qyvuyie2op286/spk1_snt1.wav

Now let's compute the spectrogram of the speech signal:

import torch
import matplotlib.pyplot as plt
from speechbrain.dataio.dataio import read_audio
from speechbrain.processing.features import STFT

signal = read_audio('spk1_snt1.wav').unsqueeze(0) # [batch, time]

# win_length and hop_length are given in milliseconds
compute_STFT = STFT(sample_rate=16000, win_length=25, hop_length=10, n_fft=400)
signal_STFT = compute_STFT(signal)

spectrogram = signal_STFT.pow(2).sum(-1) # Power spectrogram
spectrogram = spectrogram.squeeze(0).transpose(0,1)
spectrogram = torch.log(spectrogram)

plt.imshow(spectrogram, cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()

One way to compress the signal is to average the spectrogram along the frequency axis. This is done with a set of filters.

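From the spectrogram, we can see that most of the energy is concentrated in the lower part of the spectrum. It is therefore convenient to allocate more filters to the low frequencies and fewer to the high frequencies. This is what mel filters do.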

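Each filter is triangular and has a response of 1 at its center frequency. The response decreases linearly to 0 until it reaches the center frequencies of the two adjacent filters (see figure). There is therefore some overlap between adjacent filters.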
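
The triangular shape described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `triangular_response` and the three center frequencies are hypothetical and are not taken from SpeechBrain.

```python
def triangular_response(freq, left, center, right):
    """Response of one triangular filter at `freq`.

    `left` and `right` are the center frequencies of the two adjacent
    filters, where the response reaches 0. All values share one unit.
    """
    if freq <= left or freq >= right:
        return 0.0
    if freq <= center:
        return (freq - left) / (center - left)   # rising edge
    return (right - freq) / (right - center)     # falling edge


# Hypothetical neighbouring center frequencies in Hz, for illustration only.
left, center, right = 300.0, 500.0, 800.0

for f in (300.0, 400.0, 500.0, 650.0, 800.0):
    print(f, triangular_response(f, left, center, right))
```

Because each filter only reaches 0 at its neighbours' centers, any frequency between two centers is covered by two filters at once, which is the overlap mentioned above.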

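The filters are designed to be equally spaced in the mel frequency domain. It is possible to go from the linear frequency domain to the mel one (and vice versa) with the following non-linear transformations: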

\( m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \)

\( f = 700 \left(10^{m/2595} - 1\right) \),

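where \(m\) is the mel frequency component and \(f\) is the standard frequency component (in Hz). The mel frequency domain is compressed logarithmically. As a result, filters that are equally spaced in the mel domain are not equally spaced in the linear domain: we get more filters in the lower part of the spectrum and fewer in the upper part, as desired.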
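
As a minimal sketch of the two transformations, the snippet below converts between Hz and mel and places filter centers at equal distances in the mel domain. The helper names and the 0 to 8000 Hz range (the Nyquist band for 16 kHz audio) are assumptions made for illustration; SpeechBrain's own implementation may differ in detail.

```python
import math


def hz_to_mel(f):
    """Linear frequency (Hz) to mel."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Mel back to linear frequency (Hz)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min=0.0, f_max=8000.0):
    """Center frequencies (Hz) of filters equally spaced in the mel domain."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_filters)]


centers = mel_center_frequencies(40)
spacings = [b - a for a, b in zip(centers, centers[1:])]

# The spacing in Hz grows with frequency: dense filters at the bottom
# of the spectrum, sparse filters at the top.
print("first spacing (Hz):", spacings[0])
print("last spacing (Hz):", spacings[-1])
```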

Now let's compute FBANKs with SpeechBrain:

from speechbrain.processing.features import spectral_magnitude
from speechbrain.processing.features import Filterbank

compute_fbanks = Filterbank(n_mels=40)

# Named `stft` to avoid shadowing the imported STFT class
stft = compute_STFT(signal)
mag = spectral_magnitude(stft)
fbanks = compute_fbanks(mag)

print(stft.shape)
print(mag.shape)
print(fbanks.shape)

plt.imshow(fbanks.squeeze(0).t(), cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()

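Usually, 40 or 80 FBANKs are computed. As you can see from the shapes, the dimensionality of the time axis is unchanged, while the dimensionality of the frequency axis is reduced. You can think of FBANKs as a simple way to compress the rich information embedded in the spectrogram.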
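
The shapes can also be estimated by hand. The sketch below assumes the settings used earlier (16 kHz audio, a 10 ms hop, n_fft=400) and assumes frames are centered with padding at the edges; the helper name `stft_output_shape` is hypothetical, and the exact frame count depends on the padding convention of the STFT in use.

```python
def stft_output_shape(n_samples, sample_rate=16000, hop_ms=10, n_fft=400):
    """Approximate (n_frames, n_bins) for a one-sided STFT.

    Assumes centered frames with edge padding, so one frame is produced
    every `hop` samples starting at sample 0.
    """
    hop = int(round(sample_rate * hop_ms / 1000.0))  # 160 samples
    n_frames = 1 + n_samples // hop
    n_bins = n_fft // 2 + 1                          # one-sided spectrum
    return n_frames, n_bins


n_frames, n_bins = stft_output_shape(16000)  # one second of audio
n_mels = 40

print("spectrogram values per second:", n_frames * n_bins)
print("FBANK values per second:", n_frames * n_mels)
```

Under these assumptions the time axis keeps the same number of frames, while each frame shrinks from `n_bins` spectral values to `n_mels` filter outputs.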

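The SpeechBrain filter bank implementation is designed to support filters of different shapes (triangular, rectangular, Gaussian). Moreover, when freeze=False, the filters are not frozen and can be tuned during training.

To simplify the computation of FBANKs, we provide a lobe that performs all the necessary steps in a single function: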

from speechbrain.lobes.features import Fbank
fbank_maker = Fbank()
fbanks = fbank_maker(signal)

plt.imshow(fbanks.squeeze(0).t(), cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()

# Zoom of first 80 steps
plt.imshow(fbanks.squeeze(0).t()[:,0:80], cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()

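2. Mel-Frequency Cepstral Coefficients (MFCCs)

MFCCs are computed by applying a Discrete Cosine Transform (DCT) on top of the FBANKs. The DCT is a transform that decorrelates the features and can be used to compress them further.

To make the computation of MFCCs easier, we provide a lobe for that as well: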
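
To make the DCT step concrete, here is a minimal, unnormalized DCT-II written in plain Python and applied to a single frame. It is a sketch only: the function name `dct_ii` and the toy input are made up for illustration, and library implementations usually apply an extra scaling factor that is omitted here.

```python
import math


def dct_ii(x, n_coeffs):
    """First `n_coeffs` coefficients of the unnormalized DCT-II of `x`."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
        for k in range(n_coeffs)
    ]


# Toy "log filter bank" frame with 40 values (made-up numbers).
log_fbank_frame = [math.log(1.0 + i) for i in range(40)]

# Keeping only the first 13 coefficients compresses 40 values into 13.
mfcc_frame = dct_ii(log_fbank_frame, n_coeffs=13)
print(len(log_fbank_frame), "->", len(mfcc_frame))
```

Truncating the output to the first few coefficients is what provides the additional compression: the low-order coefficients capture the smooth spectral envelope of the frame.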

from speechbrain.lobes.features import MFCC
mfcc_maker = MFCC(n_mfcc=13, deltas=False, context=False)
mfccs = mfcc_maker(signal)

plt.imshow(mfccs.squeeze(0).t(), cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()

#Zoom of the first 25 steps
plt.imshow(mfccs.squeeze(0).t()[:,0:25], cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()

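In the past, working with decorrelated features was crucial. Earlier machine learning techniques such as Gaussian Mixture Models (GMMs) were not well suited to modeling correlated data. Deep neural networks, by contrast, work well even with correlated data, so FBANKs are now the preferred choice.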

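3. Context Information

Proper management of local context is crucial for most speech processing tasks. In the past, the dominant solution was to build a "hand-crafted" context with the following approaches:

Derivatives
Context windows

3.1 Derivatives

The idea behind derivatives is to introduce local context by simply computing the difference with adjacent features. Derivatives are usually computed on the MFCC coefficients: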

from speechbrain.lobes.features import MFCC
mfcc_maker = MFCC(n_mfcc=13, deltas=True, context=False)
mfccs_with_deltas = mfcc_maker(signal)

print(mfccs.shape)
print(mfccs_with_deltas.shape)

plt.imshow(mfccs_with_deltas.squeeze(0).t(), cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()

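The first and second-order derivatives are called delta and delta-delta coefficients and are concatenated with the static coefficients. In the example, the dimensionality is therefore 39 (13 static, 13 delta, 13 delta-delta).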
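
One common way to compute such derivatives is a small regression over neighbouring frames. The sketch below uses that textbook formula with a window of two frames on each side and repeats the edge frames at the boundaries. The function name `compute_deltas`, the window size, and the edge handling are assumptions for illustration, not a description of SpeechBrain's implementation.

```python
def compute_deltas(frames, width=2):
    """Regression-based deltas for a list of feature frames.

    d_t = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2), n = 1..width.
    Frames beyond the edges are replaced by the first or last frame.
    """
    n_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    deltas = []
    for t in range(n_frames):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                nxt = frames[min(t + n, n_frames - 1)][d]
                prv = frames[max(t - n, 0)][d]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


# Toy input: 6 frames of 13 static coefficients, each rising by 1 per frame.
static = [[float(t + c) for c in range(13)] for t in range(6)]

delta = compute_deltas(static)
delta_delta = compute_deltas(delta)

# Concatenate static, delta, and delta-delta coefficients frame by frame.
features = [s + d + dd for s, d, dd in zip(static, delta, delta_delta)]
print(len(features), len(features[0]))
```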

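3.2 Context Windows

Context windows add local context by simply concatenating multiple consecutive features. The result is a larger feature vector that is better "aware" of the local information.

Let's see an example: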

from speechbrain.lobes.features import MFCC
mfcc_maker = MFCC(n_mfcc=13,
                  deltas=True,
                  context=True,
                  left_frames=5,
                  right_frames=5)
mfccs_with_context = mfcc_maker(signal)

print(mfccs.shape)
print(mfccs_with_deltas.shape)
print(mfccs_with_context.shape)

In this case, we concatenate the current frame with 5 past frames and 5 future frames. The total dimensionality is therefore \(39 \times (5+5+1) = 429\).
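
The concatenation itself is simple to write out. The following sketch stacks each frame with its 5 left and 5 right neighbours and fills positions beyond the edges of the utterance with zeros. The function name `add_context` and the zero-padding choice are assumptions made for illustration.

```python
def add_context(frames, left=5, right=5):
    """Concatenate each frame with `left` past and `right` future frames."""
    n_frames = len(frames)
    zero = [0.0] * len(frames[0])  # padding frame used outside the utterance
    out = []
    for t in range(n_frames):
        stacked = []
        for offset in range(-left, right + 1):
            idx = t + offset
            stacked.extend(frames[idx] if 0 <= idx < n_frames else zero)
        out.append(stacked)
    return out


# Toy input: 20 frames of 39 features (13 static + 13 delta + 13 delta-delta).
frames = [[float(t)] * 39 for t in range(20)]
with_context = add_context(frames, left=5, right=5)

print(len(with_context), len(with_context[0]))
```

The number of frames is unchanged; only the size of each feature vector grows.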

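Rather than relying on the solutions above, the current trend is to use static features and progressively add a learnable context through the receptive field of a Convolutional Neural Network (CNN). CNNs are often used in the early layers of a neural speech processing system to derive robust and context-aware representations.

4. Other Features

A recent trend is to feed neural networks with raw data. It has become quite common to feed them directly with the **spectrogram** or even the Short-Time Fourier Transform (STFT). It is also possible to feed neural networks with the raw time-domain samples. This is made easier by properly designed networks such as SincNet, which uses a parametrized convolutional layer called SincConv to learn from the raw samples. SincNet is described in [this tutorial](add link).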
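
To get a feel for how much context a stack of convolutions covers, the receptive field can be computed layer by layer. The sketch below uses the standard recurrence for 1D convolutions; the function name `receptive_field` and the layer configurations are hypothetical examples, not an architecture taken from SpeechBrain.

```python
def receptive_field(layers):
    """Receptive field (in input frames) of stacked 1D convolutions.

    `layers` is a list of (kernel_size, stride, dilation) tuples ordered
    from the input to the output of the network.
    """
    rf, jump = 1, 1  # `jump` is the distance in input frames between outputs
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


# Five hypothetical layers with kernel size 3, stride 1, no dilation.
print(receptive_field([(3, 1, 1)] * 5))

# A stride of 2 in the middle layer widens the context seen by later layers.
print(receptive_field([(3, 1, 1), (3, 2, 1), (3, 1, 1)]))
```

With five such layers, each output already depends on 11 input frames, the same span as a hand-crafted window of 5 past and 5 future frames, except that the way the frames are combined is learned.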

Citing SpeechBrain

If you use SpeechBrain in your research or business, please cite it using the following BibTeX entries:

@misc{speechbrainV1,
  title={Open-Source Conversational AI with {SpeechBrain} 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}