
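Fourier Transforms and Spectrograms

In speech and audio processing, the time-domain signal is often transformed into another domain. But why do we need to transform an audio signal at all?

Some speech characteristics or patterns of the signal (e.g., pitch, formants) may not be evident when looking at the audio in the time domain. With a properly designed transform, it can be easier to extract the needed information from the signal.

The most popular transform is the Fourier transform, which converts a time-domain signal into an equivalent representation in the frequency domain. The following sections describe the Fourier transform along with related representations such as the Short-Time Fourier Transform (STFT) and the spectrogram.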

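1. Fourier Transform

The Fourier transform of a discrete-time sequence \(f[n] = \{f[0], f[1], \ldots, f[N-1]\}\) is called the Discrete Fourier Transform (DFT). Writing \(f_n\) for \(f[n]\), it is defined as follows: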

\(F_{k} = \sum_{n=0}^{N-1} f_{n} e^{-j\frac{2\pi}{N}kn}\)

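The inverse transform, called the Inverse Discrete Fourier Transform (IDFT), maps the frequency-domain signal \(F_k\) back to the time-domain signal \(f_n\):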

\(f_{n} = \frac{1}{N}\sum_{k=0}^{N-1} F_{k} e^{j\frac{2\pi}{N}kn}\)

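The two representations are equivalent, and no information is lost when moving between them. They are simply two different ways of representing the same signal.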
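To make the equivalence concrete, here is a minimal sketch that implements the two sums directly in plain Python. The helper names `dft` and `idft` and the toy sequence are assumptions made for illustration only; the inverse divides by N so that the round trip returns the original samples.

```python
import cmath

def dft(f):
    """Naive O(N^2) DFT: F_k = sum_n f_n * exp(-j*2*pi*k*n/N)."""
    N = len(f)
    return [
        sum(f[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]

def idft(F):
    """Naive O(N^2) inverse DFT, including the 1/N normalization."""
    N = len(F)
    return [
        sum(F[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
        for n in range(N)
    ]

f = [0.5, -1.0, 2.0, 0.25, 0.0, 1.5]   # arbitrary toy sequence
F = dft(f)                              # frequency-domain representation
f_rec = idft(F)                         # back to the time domain

# Largest sample-wise difference between the original and the round trip
max_err = max(abs(a - b) for a, b in zip(f, f_rec))
print(max_err)
```

The printed error should be on the order of floating-point round-off, which is what "no information is lost" means in practice.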

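What is the intuition?

The idea behind the Fourier transform is to represent the signal as a weighted sum of complex sinusoids with increasing frequency. The complex exponential \(e^{j\frac{2\pi}{N}kn}\), for instance, determines the frequency of this "complex sinusoid":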

\(e^{j\frac{2\pi}{N}kn} = \cos\left(\frac{2\pi}{N}kn\right) + j \sin\left(\frac{2\pi}{N}kn\right)\).

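The term \(F_{k}\), instead, is another complex number that determines the amplitude and the shift (phase) of the frequency component. It can be shown that with N complex sinusoids of proper amplitude and phase we can model any sequence of N samples. In other words, the complex sinusoids are the basic building blocks of the signal. If you combine them properly, like bricks in a LEGO building, you can create any signal you want (periodic or non-periodic).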
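As a small illustration of how \(F_k\) encodes amplitude and phase, the sketch below builds a cosine that sits exactly on one DFT bin and reads its parameters back from a single coefficient. The values of `N`, `k0`, `A`, and `phi` are arbitrary choices for this example.

```python
import cmath
import math

N, k0 = 16, 3                 # sequence length and the bin the cosine sits on
A, phi = 2.0, math.pi / 4     # amplitude and phase we want to recover

# Real cosine with k0 full cycles over the N samples
x = [A * math.cos(2 * math.pi * k0 * n / N + phi) for n in range(N)]

# DFT coefficient at bin k0 only
F_k0 = sum(x[n] * cmath.exp(-2j * math.pi * k0 * n / N) for n in range(N))

# A real cosine splits its energy between bins k0 and N - k0,
# so each of the two coefficients has magnitude A * N / 2
amplitude = 2 * abs(F_k0) / N
phase = cmath.phase(F_k0)
print(amplitude, phase)
```

The recovered amplitude and phase match `A` and `phi` up to round-off because the tone falls exactly on a bin; for frequencies between bins the energy spreads over neighboring coefficients.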

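The transform has \(O(N^2)\) complexity because, for each element k of the frequency representation \(F_k\), we have to loop over all N elements of the sequence. This makes computing the DFT and IDFT of long sequences prohibitively expensive.

Fortunately, there are algorithms called Fast Fourier Transform (FFT) that compute it with \(O(N \log N)\) complexity. The FFT splits the input sequence into smaller chunks and combines their DFTs.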
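The divide-and-combine idea can be sketched with a recursive radix-2 Cooley-Tukey FFT. This is one classic variant, shown only for illustration; it is not necessarily the algorithm that `torch.fft` uses internally, and it assumes the input length is a power of two. The helper names are hypothetical.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])    # DFT of the odd-indexed samples
    out = [0j] * N
    for k in range(N // 2):
        # Twiddle factor rotates the odd half before the two halves are merged
        twiddle = cmath.exp(-2j * cmath.pi * k / N) * odd[k]
        out[k] = even[k] + twiddle
        out[k + N // 2] = even[k] - twiddle
    return out

def dft_naive(x):
    """Direct O(N^2) evaluation of the DFT sum, used as a reference."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]

x = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 3.0, -2.0]   # toy sequence of length 8
max_diff = max(abs(a - b) for a, b in zip(fft_radix2(x), dft_naive(x)))
print(max_diff)
```

Each recursion level does O(N) work and there are about log2(N) levels, which is where the \(O(N \log N)\) cost comes from.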

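The concept of a "complex sinusoid" can be quite difficult to grasp. Still, you can find plenty of excellent material online, full of cool graphical animations, to help you with it (see the tutorials in the references). For now, let's just think of the Fourier transform as a linear transformation that maps a real-valued sequence into a complex-valued one.

Before computing some DFTs, let's download a speech signal and install SpeechBrain: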

%%capture
!wget https://www.dropbox.com/s/u8qyvuyie2op286/spk1_snt1.wav

%%capture
# Installing SpeechBrain
BRANCH = 'develop'
!git clone https://github.com/speechbrain/speechbrain.git -b $BRANCH
%cd /content/speechbrain/
!python -m pip install .

import torch
import matplotlib.pyplot as plt
from speechbrain.dataio.dataio import read_audio

signal = read_audio('/content/spk1_snt1.wav')
print(signal.shape)

# fft computation
fft = torch.fft.fft(signal.squeeze(), dim=0)
print(fft)
print(fft.shape)

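As you can see, the input signal is real (so its imaginary part is zero). The DFT is a complex-valued tensor that holds the real and imaginary parts of the transform.

Let's now compute the magnitude and phase of the DFT and plot them: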

# Real and Imaginary parts
real_fft = fft.real
img_fft = fft.imag

mag = torch.sqrt(torch.pow(real_fft,2) + torch.pow(img_fft,2))
phase = torch.atan2(img_fft, real_fft) # four-quadrant phase; avoids dividing by a zero real part

plt.subplot(211)
x_axis = torch.linspace(0, 16000, mag.shape[0])
plt.plot(x_axis, mag)

plt.subplot(212)
plt.plot(x_axis, phase)
plt.xlabel('Freq [Hz]')

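A few interesting things can be noticed from the plots:

The magnitude plot is symmetric. The last element of the x-axis corresponds to the sampling frequency \(f_s\), which in this case is 16 kHz. Because of this symmetry, it is enough to plot the magnitude from 0 to \(f_s/2\). This frequency is called the Nyquist frequency.
The phase plot is very noisy. This is also expected: the phase is notoriously hard to interpret and estimate.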
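The symmetry can be checked numerically. The sketch below uses a random real-valued toy signal (an assumption, not the speech file) and verifies that bin N-k is the complex conjugate of bin k. It also shows `torch.fft.rfft`, which returns only the non-redundant half of the spectrum.

```python
import torch

torch.manual_seed(0)
x = torch.randn(16)            # toy real-valued signal, N = 16
X = torch.fft.fft(x)

# Compare bins 1..N-1 with the same bins read in reverse order (N-1..1)
forward = X[1:]
mirrored = X[1:].flip(0)

# Conjugate symmetry: equal real parts, opposite imaginary parts
symmetric = (
    torch.allclose(forward.real, mirrored.real, atol=1e-5)
    and torch.allclose(forward.imag, -mirrored.imag, atol=1e-5)
)

# One-sided FFT keeps bins 0 .. N/2 only
X_half = torch.fft.rfft(x)
print(symmetric, X_half.shape)
```

Because the magnitude depends only on the squared real and imaginary parts, conjugate symmetry is exactly what makes the magnitude plot mirror around \(f_s/2\).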

Let's now plot the magnitude from 0 to the Nyquist frequency:

half_point = mag.shape[0] // 2
x_axis = torch.linspace(0, 8000, half_point)
plt.plot(x_axis, mag[0:half_point])
plt.xlabel('Frequency')

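We can see that most of the energy of a speech signal is concentrated in the lower part of the spectrum. Many important phonemes, such as vowels, do have most of their energy in this part of the spectrum.

Moreover, we can notice some peaks in the magnitude spectrum. Let's zoom in to see them more clearly: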

plt.plot(mag[0:4000])
plt.xlabel('Frequency bin') # the x-axis here is the DFT bin index, not Hz

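The peaks correspond to pitch (i.e., the frequency at which our vocal cords vibrate) and formants (which correspond to the resonant frequencies of our vocal tract).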
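When reading peaks off a plot indexed by DFT bin, it helps to convert bin indices to Hz. A minimal sketch follows; the helper names and the 2-second signal length are assumptions for illustration and do not describe the actual recording.

```python
def bin_to_hz(k, n_points, sample_rate):
    """Center frequency in Hz of DFT bin k for an n_points-long transform."""
    return k * sample_rate / n_points

def hz_to_bin(freq_hz, n_points, sample_rate):
    """Index of the DFT bin closest to freq_hz."""
    return round(freq_hz * n_points / sample_rate)

sample_rate = 16000
n_points = 32000          # hypothetical 2-second signal at 16 kHz

upper_hz = bin_to_hz(4000, n_points, sample_rate)    # frequency of bin 4000
pitch_bin = hz_to_bin(100, n_points, sample_rate)    # bin closest to 100 Hz
print(upper_hz, pitch_bin)
```

The spacing between adjacent bins is `sample_rate / n_points`, so longer signals give finer frequency grids.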

Let's now try to go back to the time domain:

signal_rec = torch.fft.ifft(fft, dim=0)
signal_rec = signal_rec.real # keep the real part; the imaginary part is only round-off
signal_orig = signal

# Plots
plt.subplot(211)
plt.plot(signal_orig)

plt.subplot(212)
plt.plot(signal_rec)
plt.xlabel('Time')

print(signal_orig[0:10])
print(signal_rec[0:10])

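As the plot shows, the signal can be reconstructed in the time domain. Because of numerical round-off errors, the two signals are very similar but not identical (see the printout of the first 10 samples).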

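2. Short-Time Fourier Transform (STFT)

Speech is a "dynamic" signal that evolves over time. It therefore makes sense to introduce a mixed time-frequency representation that shows how the frequency components of speech change over time. Such a representation is called the Short-Time Fourier Transform.

The STFT is computed in this way:

Split the time signal into multiple chunks using overlapped sliding windows (e.g., Hamming, Hann, Blackman).
Compute the DFT of each chunk.
Combine all the DFTs into a single representation.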
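The three steps above can be sketched by hand with plain PyTorch. This is a simplified illustration on a synthetic 1 kHz tone, not the SpeechBrain implementation: the function `manual_stft` is hypothetical, it applies no padding or centering, and library implementations may frame the signal differently.

```python
import math
import torch

def manual_stft(x, win_samples, hop_samples, n_fft):
    """Frame a 1-D signal, window each frame, and take a one-sided DFT per frame."""
    # Step 1: overlapping chunks, shape [n_frames, win_samples]
    frames = x.unfold(0, win_samples, hop_samples)
    window = torch.hamming_window(win_samples, periodic=False)
    # Steps 2 and 3: DFT of every windowed chunk, stacked along the frame axis
    return torch.fft.rfft(frames * window, n=n_fft, dim=-1)

sample_rate = 16000
t = torch.arange(sample_rate) / sample_rate      # 1 second of time stamps
x = torch.sin(2 * math.pi * 1000 * t)            # synthetic 1 kHz tone

# 25 ms window (400 samples) and 10 ms hop (160 samples) at 16 kHz
S = manual_stft(x, win_samples=400, hop_samples=160, n_fft=400)

# Average magnitude over frames and locate the strongest frequency bin
peak_bin = S.abs().mean(dim=0).argmax().item()
peak_hz = peak_bin * sample_rate / 400
print(S.shape, peak_hz)
```

The result has one row per frame and one column per frequency bin, and the strongest bin lands on the tone frequency, as expected for a stationary sinusoid.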

Let's now compute the STFT of a speech signal:

from speechbrain.processing.features import STFT

signal = read_audio('/content/spk1_snt1.wav').unsqueeze(0) # [batch, time]

compute_STFT = STFT(sample_rate=16000, win_length=25, hop_length=10, n_fft=400) # 25 ms, 10 ms
signal_STFT = compute_STFT(signal)

print(signal.shape)
print(signal_STFT.shape)

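The first dimension of the STFT representation is the batch axis (SpeechBrain expects it because it is designed to process multiple signals in parallel).
The second is the time axis (the number of frames).
The third is the frequency resolution. It corresponds to half of the FFT points plus one (\(n_{fft}/2 + 1\)) because, as we saw earlier, the FFT of a real signal is symmetric.
The last dimension gathers the real and imaginary parts of the STFT representation.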
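The numbers behind these dimensions can be worked out with simple arithmetic. The sketch below assumes, as the code comment indicates, that the window and hop lengths are given in milliseconds; the variable names are chosen for this example only.

```python
sample_rate = 16000
win_length_ms, hop_length_ms, n_fft = 25, 10, 400

# Milliseconds to samples
win_samples = int(round(sample_rate * win_length_ms / 1000))   # 400
hop_samples = int(round(sample_rate * hop_length_ms / 1000))   # 160

# One-sided spectrum of a real signal: bins 0 .. n_fft/2
n_freq_bins = n_fft // 2 + 1                                   # 201

print(win_samples, hop_samples, n_freq_bins)
```

The number of frames along the time axis depends on the signal length, the hop, and the padding strategy of the implementation, so it is not derived here.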

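Like the Fourier transform, the STFT has an inverse, called the Inverse Short-Time Fourier Transform (ISTFT). With properly designed windows, the original signal can be perfectly reconstructed: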

from speechbrain.processing.features import ISTFT

compute_ISTFT = ISTFT(sample_rate=16000, win_length=25, hop_length=10)
signal_rec = compute_ISTFT(signal_STFT)
signal_rec = signal_rec.squeeze() # remove batch axis for plotting

# Plots
plt.subplot(211)
plt.plot(signal_orig)

plt.subplot(212)
plt.plot(signal_rec)
plt.xlabel('Time')

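3. Spectrogram

As we have seen before, the magnitude of the Fourier transform is more informative than the phase. We can thus take the magnitude of the STFT representation and obtain the so-called spectrogram. The spectrogram is one of the most popular speech representations.

Let's see what a spectrogram looks like: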

spectrogram = signal_STFT.pow(2).sum(-1) # power spectrogram
spectrogram = spectrogram.squeeze(0).transpose(0,1)

spectrogram_log = torch.log(spectrogram) # for graphical convenience

plt.imshow(spectrogram_log.squeeze(0), cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()

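The spectrogram is a 2D representation that can be plotted as an image (yellow areas correspond to time-frequency points with high magnitude). From the spectrogram, you can see how the frequency components evolve over time. For instance, you can clearly distinguish vowels (whose frequency pattern is characterized by multiple lines corresponding to pitch and formants) from fricatives (characterized by continuous high-frequency components). Normally, we plot the power spectrogram, which corresponds to the squared magnitude of the STFT.

The time and frequency resolution of the spectrogram depends on the length of the window used to compute the STFT.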
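A rough rule of thumb makes this trade-off quantitative: a window of duration T seconds can separate frequency components that are about 1/T Hz apart. The sketch below applies it to the three window lengths used in this tutorial. The function name is hypothetical, and the exact resolution also depends on the window shape, so these figures are approximations.

```python
def approx_freq_resolution_hz(win_length_ms):
    """Approximate frequency resolution (Hz) of a window lasting win_length_ms."""
    return 1000.0 / win_length_ms

# Window length in ms -> approximate frequency resolution in Hz
resolutions = {w: approx_freq_resolution_hz(w) for w in (5, 25, 50)}

for win_ms, res_hz in resolutions.items():
    print(f"window {win_ms:>2} ms -> time span {win_ms} ms, about {res_hz:.0f} Hz per resolvable band")
```

A longer window averages over more time, so short events blur together, while a shorter window localizes events in time but smears nearby frequencies.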

For instance, if we increase the length of the window, we get a higher resolution in frequency (at the price of a lower resolution in time):

signal = read_audio('/content/spk1_snt1.wav').unsqueeze(0) # [batch, time]

compute_STFT = STFT(sample_rate=16000, win_length=50, hop_length=10, n_fft=800)
signal_STFT = compute_STFT(signal)

spectrogram = signal_STFT.pow(2).sum(-1)
spectrogram = spectrogram.squeeze(0).transpose(0,1)
spectrogram = torch.log(spectrogram)

plt.imshow(spectrogram.squeeze(0), cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()

Vice versa, we can get a higher time resolution at the price of a lower frequency resolution:

signal = read_audio('/content/spk1_snt1.wav').unsqueeze(0) # [batch, time]

compute_STFT = STFT(sample_rate=16000, win_length=5, hop_length=5, n_fft=800)
signal_STFT = compute_STFT(signal)

spectrogram = signal_STFT.pow(2).sum(-1)
spectrogram = spectrogram.squeeze(0).transpose(0,1)
spectrogram = torch.log(spectrogram)

plt.imshow(spectrogram.squeeze(0), cmap='hot', interpolation='nearest', origin='lower')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.show()

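Despite being very informative, the spectrogram is not invertible. When computing it, in fact, we only use the magnitude of the STFT and discard the phase.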
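To see why discarding the phase loses information, consider a single frame: two different signals can share exactly the same DFT magnitude. The sketch below uses a random toy signal and a circularly shifted copy of it (both assumptions for illustration); a circular shift only changes the phase of each DFT coefficient.

```python
import torch

torch.manual_seed(0)
x = torch.randn(64)                    # toy real-valued frame
x_shifted = torch.roll(x, shifts=5)    # circular shift by 5 samples

mag_x = torch.fft.fft(x).abs()
mag_shifted = torch.fft.fft(x_shifted).abs()

# Same magnitudes, different waveforms
same_magnitude = torch.allclose(mag_x, mag_shifted, atol=1e-4)
same_signal = torch.allclose(x, x_shifted)
print(same_magnitude, same_signal)
```

Since the magnitude alone cannot tell these two frames apart, recovering a waveform from a spectrogram requires estimating the missing phase by other means.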

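The spectrogram is the starting point for computing some popular speech features, such as filter banks (FBANKs) and Mel-Frequency Cepstral Coefficients (MFCCs), which are the subject of another tutorial.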

References

[1] L. R. Rabiner, R. W. Schafer, "Digital Processing of Speech Signals", Prentice-Hall, 1978

[2] S. K. Mitra, "Digital Signal Processing: A Computer-Based Approach", slides

[3] https://betterexplained.com/articles/an-interactive-guide-to-the-fourier-transform/

[4] https://sites.northwestern.edu/elannesscohn/2019/07/30/developing-an-intuition-for-fourier-transforms/

Citing SpeechBrain

If you use SpeechBrain in your research or business, please cite it using the following BibTeX entries:

@misc{speechbrainV1,
  title={Open-Source Conversational AI with {SpeechBrain} 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}