Complex and Quaternion Neural Networks

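This tutorial shows how to use the complex-valued and quaternion-valued neural networks implemented in SpeechBrain for speech technologies. It covers the basics of these high-dimensional representations and the associated neural layers: linear, convolutional, recurrent, and normalization.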

Introduction and Background

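Complex numbers: Complex numbers extend the real numbers to a two-dimensional space. A complex number has a real part and an imaginary part and is usually written z = r + ix, where r is the real part and ix is the imaginary part. This extension provides a powerful algebraic framework for two-dimensional concepts such as rotations, translations, and phase-related operations. Complex numbers represent speech signals naturally: the Fourier transform is a prominent example, since it operates in the complex domain and captures both magnitude and phase.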
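As a small illustration of the magnitude/phase view, the sketch below uses only Python's standard `cmath` module. The value of `z` is an arbitrary example and not taken from any speech signal.

```python
import cmath
import math

# A single spectral coefficient, written as z = r + ix.
z = complex(3.0, 4.0)

# Polar form: magnitude |z| and phase angle of z.
magnitude, phase = cmath.polar(z)

# Multiplying by a unit-magnitude complex number rotates z in the plane:
# the magnitude is unchanged and the phase shifts by the rotation angle.
rotation = cmath.rect(1.0, math.pi / 2)  # 90 degrees
rotated = z * rotation
rotated_magnitude, rotated_phase = cmath.polar(rotated)

print(magnitude)              # magnitude of 3 + 4i
print(rotated_phase - phase)  # phase shift introduced by the rotation
```

This is the sense in which complex algebra handles rotations and phase: one multiplication changes the phase and leaves the magnitude untouched.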

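Quaternions: Quaternions generalize complex numbers with one real part (r) and an imaginary part that is a three-dimensional vector (ix + jy + kz), so a quaternion q is written q = r + ix + jy + kz. In practice, quaternions define three-dimensional rotations and are widely used in physics, computer science, computer graphics, and robotics. They provide a stable and natural framework for describing and interpreting motion in three-dimensional space.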

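Connection to Neural Networks:

As modern deep learning gained momentum, researchers explored integrating complex numbers and quaternions into neural networks for specific tasks. Complex-valued neural networks (CVNNs) can directly process the output of a fast Fourier transform (FFT), while quaternion neural networks (QNNs) can be used to generate realistic robot movements.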

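Beyond being a natural fit for certain representations, CVNNs and QNNs share a compelling property: weight sharing. The algebraic rules governing complex numbers and quaternions differ from those of the real numbers, and this changes how multiplication works. As a result, Q-CVNNs have a distinctive weight-sharing mechanism that differs from the usual dot product of real-valued networks. This mechanism has proven useful for learning representations of multidimensional inputs while preserving the internal relations between signal components, such as the magnitude and phase of a complex number.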
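One visible consequence of weight sharing is the number of free weights. The helper below is a hypothetical illustration (`weight_count` is not a SpeechBrain function) that counts the weights of a linear map with the same real input and output sizes, ignoring the bias.

```python
def weight_count(n_real_in, n_real_out, parts):
    """Number of free weights in a linear map (bias excluded).

    parts = 1 for real, 2 for complex, 4 for quaternion. Each of the `parts`
    weight components has shape [n_real_in / parts, n_real_out / parts].
    """
    if n_real_in % parts or n_real_out % parts:
        raise ValueError("feature sizes must be divisible by `parts`")
    return parts * (n_real_in // parts) * (n_real_out // parts)


n_in, n_out = 32, 64
real_count = weight_count(n_in, n_out, parts=1)
complex_count = weight_count(n_in, n_out, parts=2)
quaternion_count = weight_count(n_in, n_out, parts=4)
print(real_count, complex_count, quaternion_count)
```

For the same real feature sizes, the complex map uses half as many weights as the real one and the quaternion map a quarter, because each weight component is reused across several input/output parts.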

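Because these properties are so broad, this tutorial does not go into every detail. Instead, it aims to be a practical guide to implementing and using CVNNs and QNNs effectively in SpeechBrain.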

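SpeechBrain Representation of Complex Numbers and Quaternions

In SpeechBrain, the algebraic operations are abstracted inside the neural layers, so users do not need to worry about the initial representation. You manipulate real-valued tensors and never have to declare a dedicated complex or quaternion tensor type. The underlying operations are expressed as tensor/matrix operations, which integrate seamlessly with modern GPU architectures.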

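In practice, any PyTorch tensor produced in your recipe can be interpreted as a complex-valued or quaternion-valued tensor, depending on the layer that processes it. For example:

If it is processed by a torch.nn.Linear layer, the tensor is real-valued.
If it is processed by an nnet.complex_networks.c_linear.CLinear layer, the tensor is complex-valued.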

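How are tensors interpreted and built?

Let's illustrate with an example. Suppose we want a tensor containing 3 complex numbers or 3 quaternions. The different parts of the numbers are concatenated as follows:

For a complex tensor (c_tensor): [r, r, r, x, x, x]

For a quaternion tensor (q_tensor): [r, r, r, x, x, x, y, y, y, z, z, z]

Thanks to this layout, any tensor declared in your code can be treated as a complex or quaternion tensor when it is processed by a {C/Q}-Layer in SpeechBrain, as long as the feature dimension is divisible by 2 (complex) or by 4 (quaternion).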
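The layout above can be reproduced with plain PyTorch calls. The sketch below packs three numbers from PyTorch's native complex dtype into the concatenated layout and splits them back. It uses `torch.cat` and `torch.chunk` only, not SpeechBrain's own helpers, and the values are arbitrary.

```python
import torch

# Three complex numbers in PyTorch's native complex dtype.
z = torch.complex(torch.tensor([1.0, 2.0, 3.0]), torch.tensor([10.0, 20.0, 30.0]))

# Pack them into the concatenated real layout: [r, r, r, x, x, x].
c_tensor = torch.cat([z.real, z.imag], dim=-1)

# Unpack again by splitting the feature axis into two equal halves.
real, imag = torch.chunk(c_tensor, 2, dim=-1)

# A quaternion tensor uses four equal blocks: [r..., x..., y..., z...].
q_tensor = torch.arange(12.0)
q_r, q_x, q_y, q_z = torch.chunk(q_tensor, 4, dim=-1)

print(c_tensor)
print(q_r, q_x, q_y, q_z)
```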

To explore this further, let's install SpeechBrain.

%%capture
# Installing SpeechBrain via pip
BRANCH = 'develop'
!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH

!git clone https://github.com/speechbrain/speechbrain.git

Now let's manipulate a few tensors to get a better feel for the formalism. We start by instantiating a tensor containing 8 real numbers.

import torch

T = torch.rand((1,8))
print(T)

Then we use the SpeechBrain library to manipulate complex numbers and simply display the different parts (real and imaginary).

from speechbrain.nnet.complex_networks.c_ops import get_real, get_imag

print(get_real(T))
print(get_imag(T))

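As you can see, the initial tensor is simply split into two halves. The same happens with quaternions, except that the tensor is split into 4 parts.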

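Complex and Quaternion Products

The core of QNNs and CVNNs is the product. Other specific components exist as well, such as weight initialization, dedicated normalization, and activation functions. The basic product, however, is at the heart of every neural layer: a weight matrix multiplies the input vector.

A useful fact is that a complex number a + ib can be represented as a real-valued matrix: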

\begin{equation} \left(\begin{array}{rr} a & -b \\ b & a \end{array}\right). \tag{1} \end{equation}

The same holds for a quaternion a + ib + jc + kd:

\begin{equation} \left(\begin{array}{rrrr} a & -b & -c & -d \\ b & a & -d & c \\ c & d & a & -b \\ d & -c & b & a \end{array}\right). \tag{2} \end{equation}

More interestingly, multiplying two such matrices gives the product of the corresponding algebra. For example, the product of two complex numbers is:

\begin{equation} \left(\begin{array}{rr} a & -b \\ b & a \end{array}\right)\left(\begin{array}{rr} c & -d \\ d & c \end{array}\right)=\left(\begin{array}{cc} a c-b d & -a d-b c \\ b c+a d & -b d+a c \end{array}\right), \tag{3} \end{equation}

which is equivalent to the formal definition:

\begin{equation} (a+\mathrm{i} b)(c+\mathrm{i} d)=(a c-b d)+\mathrm{i}(a d+b c). \tag{4} \end{equation}
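The equivalence between the matrix product and the formal definition can be checked numerically. The sketch below uses only built-in Python; `as_matrix` and `matmul_2x2` are illustrative helper names, and the two complex numbers are arbitrary.

```python
def as_matrix(z):
    """Real 2x2 matrix representation of the complex number z = a + ib."""
    a, b = z.real, z.imag
    return [[a, -b], [b, a]]


def matmul_2x2(m, n):
    """Plain 2x2 matrix product."""
    return [[sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


z1 = complex(1.0, 2.0)   # a + ib
z2 = complex(3.0, -1.0)  # c + id

# Multiply the matrix representations, then compare with the matrix
# representation of the ordinary complex product z1 * z2.
product_matrix = matmul_2x2(as_matrix(z1), as_matrix(z2))
expected_matrix = as_matrix(z1 * z2)
print(product_matrix)
print(expected_matrix)
```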

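So, how is this implemented in SpeechBrain?

Every layer you can call from the complex or quaternion libraries follows two steps:

init(): defines the complex/quaternion weights as torch.Parameters and initializes them with an adapted scheme.
forward(): calls the corresponding operation that implements the specific product. For example, a complex linear layer calls complex_linear_op() from speechbrain.nnet.complex_networks.c_ops.

In practice, the speechbrain.nnet.complex_networks.c_ops.complex_linear_op function simply:

Takes the weights of the layer and builds the corresponding real-valued matrix.
Applies a product between the input and this matrix to simulate the complex/quaternion product.

Example: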

def complex_linear_op(input, real_weight, imag_weight, bias):
    """
    Applies a complex linear transformation to the incoming data.

    Arguments
    ---------
    input : torch.Tensor
        Complex input tensor to be transformed.
    real_weight : torch.Parameter
        Real part of the complex weight matrix of this layer.
    imag_weight : torch.Parameter
        Imaginary part of the complex weight matrix of this layer.
    bias : torch.Parameter
        Bias of this layer; it is only added when bias.requires_grad is True.
    """

    # Here we build the real-valued matrix as defined by the equations!
    cat_real = torch.cat([real_weight, -imag_weight], dim=0)
    cat_imag = torch.cat([imag_weight, real_weight], dim=0)
    cat_complex = torch.cat([cat_real, cat_imag], dim=1)

    # Multiply the input by the constructed matrix to simulate the complex product.
    # A 2D input is already flattened to [batch*time, N], so mm/addmm can be used.

    if input.dim() == 2:
        if bias.requires_grad:
            return torch.addmm(bias, input, cat_complex)
        else:
            return torch.mm(input, cat_complex)
    else:
        output = torch.matmul(input, cat_complex)
        if bias.requires_grad:
            return output + bias
        else:
            return output

# We create a single complex number
complex_input = torch.rand(1, 2)

# We create two tensors (not parameters here because we don't need to store gradients).
# These tensors are the real and imaginary parts of the weight matrix.
# Each part has shape [nb_real_inputs // 2, nb_real_outputs // 2],
# i.e. [nb_complex_inputs, nb_complex_outputs].
# Hence, for a layer with 1 complex input and 2 complex outputs:
r_weight = torch.rand((1,2))
i_weight = torch.rand((1,2))

# 2 complex outputs = 4 real values. Note that this plain tensor has
# requires_grad=False, so complex_linear_op skips the bias here.
bias = torch.ones(4)

# and we forward propagate!
print(complex_linear_op(complex_input, r_weight, i_weight, bias).shape)
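As a sanity check, the block-matrix product can be compared against PyTorch's native complex matrix product. The sketch below rebuilds the same block matrix on its own so that it runs without SpeechBrain. The sizes and random values are arbitrary.

```python
import torch

torch.manual_seed(0)
n_in, n_out, batch = 3, 2, 4  # counts of complex numbers

# Real and imaginary parts of the weights and of the input.
w_r, w_i = torch.rand(n_in, n_out), torch.rand(n_in, n_out)
x_r, x_i = torch.rand(batch, n_in), torch.rand(batch, n_in)

# Same block matrix as in complex_linear_op: shape [2 * n_in, 2 * n_out].
cat_real = torch.cat([w_r, -w_i], dim=0)
cat_imag = torch.cat([w_i, w_r], dim=0)
cat_complex = torch.cat([cat_real, cat_imag], dim=1)

# Input in the concatenated layout [real..., imag...].
x = torch.cat([x_r, x_i], dim=-1)
out = x @ cat_complex

# Reference: native complex matrix product, repacked into the same layout.
ref = torch.complex(x_r, x_i) @ torch.complex(w_r, w_i)
ref_cat = torch.cat([ref.real, ref.imag], dim=-1)

print(torch.allclose(out, ref_cat, atol=1e-5))
```

The first half of `out` holds the real parts (x_r w_r - x_i w_i) and the second half the imaginary parts (x_r w_i + x_i w_r), which is equation (4) applied to every input/weight pair and summed over the inputs.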

Note that the quaternion implementation follows exactly the same approach.
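A minimal sketch of what "the same approach" looks like for quaternions is given below. The signs follow matrix (2), transposed because the input is a row vector. The function name `quaternion_linear_sketch` is illustrative and the code is not copied from SpeechBrain; it is checked against the Hamilton product written out by hand.

```python
import torch


def quaternion_linear_sketch(x, w_r, w_i, w_j, w_k):
    """Hamilton product between the input and the weights via one real matmul.

    x has layout [r..., x..., y..., z...] on the last axis; every weight
    component has shape [n_quaternions_in, n_quaternions_out].
    """
    # Each column block produces one output component (r, i, j, k).
    col_r = torch.cat([w_r, -w_i, -w_j, -w_k], dim=0)
    col_i = torch.cat([w_i, w_r, -w_k, w_j], dim=0)
    col_j = torch.cat([w_j, w_k, w_r, -w_i], dim=0)
    col_k = torch.cat([w_k, -w_j, w_i, w_r], dim=0)
    return x @ torch.cat([col_r, col_i, col_j, col_k], dim=1)


# One input quaternion q = 1 + 2i + 3j + 4k and one weight w = a + bi + cj + dk.
q = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
a, b, c, d = 0.5, -1.0, 2.0, 0.25
w = [torch.tensor([[v]]) for v in (a, b, c, d)]
out = quaternion_linear_sketch(q, *w)

# Reference: Hamilton product w * q written out component by component.
r, x, y, z = 1.0, 2.0, 3.0, 4.0
expected = torch.tensor([[
    a * r - b * x - c * y - d * z,
    a * x + b * r + c * z - d * y,
    a * y - b * z + c * r + d * x,
    a * z + b * y - c * x + d * r,
]])
print(out, expected)
```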

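Complex-Valued Neural Networks

Once you are familiar with the formalism, you can easily derive any complex-valued neural building block provided in speechbrain.nnet.complex_networks:

1D and 2D convolutions.
Batch and layer normalization.
Linear layers.
Recurrent cells (LSTM, LiGRU, RNN).

Following the literature, most complex and quaternion neural networks rely on split activation functions (any real-valued activation function applied to a complex/quaternion-valued signal). For now, SpeechBrain follows this approach and does not offer any fully complex or quaternion activation function.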
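With the concatenated layout, a split activation needs no special code: applying a real-valued activation element-wise to the tensor already acts on the real and imaginary parts independently. A short sketch with arbitrary values, using ReLU as the example activation:

```python
import torch

# Two complex numbers in the concatenated layout [r, r, x, x].
c_tensor = torch.tensor([[1.0, -2.0, -3.0, 4.0]])

# Split activation: the real-valued ReLU acts on every component independently.
split_out = torch.relu(c_tensor)

# The same thing written with PyTorch's native complex dtype, for comparison.
real, imag = torch.chunk(c_tensor, 2, dim=-1)
z = torch.complex(real, imag)
ref = torch.complex(torch.relu(z.real), torch.relu(z.imag))
ref_cat = torch.cat([ref.real, ref.imag], dim=-1)

print(split_out)
```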

Convolution Layers

First, let's define a batch of inputs (this could be the output of an FFT, for example).

from speechbrain.nnet.complex_networks.c_CNN import CConv1d, CConv2d

# [batch, time, features]
T = torch.rand((8, 10, 32))

# We define our layer and we want 12 complex numbers as output.
cnn_1d = CConv1d( input_shape=T.shape, out_channels=12, kernel_size=3)

out_tensor = cnn_1d(T)
print(out_tensor.shape)

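As we can see, we applied a complex 1D convolution to the input tensor and obtained an output tensor whose feature dimension equals 24. Indeed, we requested 12 out_channels, which corresponds to 24 real values. Remember: we always work with real numbers, and the algebra is abstracted inside the layer itself!

The same can be done with a 2D convolution.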

# [batch, time, fea, Channel]
T = torch.rand([10, 16, 30, 30])

cnn_2d = CConv2d( input_shape=T.shape, out_channels=12, kernel_size=3)

out_tensor = cnn_2d(T)
print(out_tensor.shape)

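Note that the 2D convolution is applied over the time and feature axes. The channel axis is the one interpreted as real and imaginary parts: [10, 16, 30, 0:15] = real and [10, 16, 30, 15:30] = imag.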

Linear Layer

Just as for the convolution layers, we only need to instantiate the right module and use it!

from speechbrain.nnet.complex_networks.c_linear import CLinear

# [batch, time, features]
T = torch.rand((8, 10, 32))

# We define our layer and we want 12 complex numbers as output.
lin = CLinear(12, input_shape=T.shape, init_criterion='glorot', weight_init='complex')

out_tensor = lin(T)
print(out_tensor.shape)

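Note that we added the init_criterion and weight_init arguments. These two arguments exist in all complex and quaternion layers and define how the weights are initialized. Indeed, complex and quaternion-valued weights need a careful initialization process, as detailed in Deep Complex Networks by Chiheb Trabelsi et al. and Quaternion Recurrent Neural Networks by Titouan Parcollet et al.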
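To give an idea of what such an initialization looks like, here is a rough sketch in the spirit of the polar-form scheme described in Deep Complex Networks: magnitudes drawn from a Rayleigh distribution with a Glorot-style scale, and phases drawn uniformly. The function name `complex_glorot_init_sketch` is hypothetical. This is an assumption-laden illustration, not SpeechBrain's actual initializer.

```python
import math
import torch


def complex_glorot_init_sketch(n_in, n_out):
    """Polar-form complex initialization with a Glorot-style scale.

    Magnitudes follow a Rayleigh distribution with scale
    sigma = 1 / sqrt(n_in + n_out); phases are uniform in [-pi, pi).
    """
    sigma = 1.0 / math.sqrt(n_in + n_out)
    # Inverse-CDF sampling of a Rayleigh variable from a uniform one in (0, 1].
    u = 1.0 - torch.rand(n_in, n_out)
    magnitude = sigma * torch.sqrt(-2.0 * torch.log(u))
    phase = (2.0 * torch.rand(n_in, n_out) - 1.0) * math.pi
    # Convert from polar form to real and imaginary weight components.
    return magnitude * torch.cos(phase), magnitude * torch.sin(phase)


torch.manual_seed(0)
real_weight, imag_weight = complex_glorot_init_sketch(16, 12)
print(real_weight.shape, imag_weight.shape)
```

The point of sampling in polar form is that the magnitude and the phase of each weight are controlled separately, instead of initializing the real and imaginary parts as two unrelated real matrices.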

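Normalization Layers

A set of complex numbers (for example, the output of a complex-valued layer) is not normalized in the same way as a set of real values. Because the procedure is fairly involved, this tutorial does not go into the details. Note that the code is fully available in the corresponding SpeechBrain library and strictly follows the description first given by Chiheb Trabelsi et al. in the paper Deep Complex Networks.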
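For readers who want the gist anyway: the key difference is that the real and imaginary parts are whitened jointly with a 2x2 covariance matrix instead of being scaled independently. The sketch below shows only that whitening step on a flat batch, using a closed-form inverse square root. It leaves out the learnable scaling and shift as well as running statistics, and `complex_whiten_sketch` is an illustrative name, not SpeechBrain's implementation.

```python
import torch


def complex_whiten_sketch(real, imag, eps=1e-5):
    """Whiten a batch of complex numbers given as separate real/imag vectors."""
    # Center both parts.
    real = real - real.mean()
    imag = imag - imag.mean()
    # Entries of the 2x2 covariance matrix V = [[v_rr, v_ri], [v_ri, v_ii]].
    v_rr = (real * real).mean() + eps
    v_ii = (imag * imag).mean() + eps
    v_ri = (real * imag).mean()
    # Closed-form inverse square root of a 2x2 symmetric positive-definite matrix.
    s = torch.sqrt(v_rr * v_ii - v_ri * v_ri)
    t = torch.sqrt(v_rr + v_ii + 2.0 * s)
    scale = 1.0 / (s * t)
    w_rr, w_ii, w_ri = (v_ii + s) * scale, (v_rr + s) * scale, -v_ri * scale
    return w_rr * real + w_ri * imag, w_ri * real + w_ii * imag


torch.manual_seed(0)
r = torch.randn(10000)
x = 0.8 * r + 0.3 * torch.randn(10000)  # imaginary part correlated with the real part
wr, wx = complex_whiten_sketch(r, x)
print((wr * wr).mean(), (wx * wx).mean(), (wr * wx).mean())
```

After whitening, both parts have approximately unit variance and their cross-covariance is approximately zero, which plain per-component normalization would not guarantee.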

SpeechBrain supports complex batch and layer normalization:

from speechbrain.nnet.complex_networks.c_normalization import CBatchNorm,CLayerNorm

inp_tensor = torch.rand([10, 16, 30])

# Note that by default the complex axis is the last one, but it can be specified.
CBN = CBatchNorm(input_shape=inp_tensor.shape)
CLN = CLayerNorm(input_shape=inp_tensor.shape)

out_bn_tensor = CBN(inp_tensor)
out_ln_tensor = CLN(inp_tensor)

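Recurrent Neural Networks

Recurrent neural cells are nothing more than several linear layers with a time connection. SpeechBrain therefore provides complex variants of the LSTM, RNN, and LiGRU. In fact, these models are strictly equivalent to their real-valued counterparts, except that the linear layers are replaced with CLinear layers!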

from speechbrain.nnet.complex_networks.c_RNN import CLiGRU, CLSTM, CRNN

inp_tensor = torch.rand([10, 16, 40])

lstm = CLSTM(hidden_size=12, input_shape=inp_tensor.shape, weight_init='complex', bidirectional=True)
rnn = CRNN(hidden_size=12, input_shape=inp_tensor.shape, weight_init='complex', bidirectional=True)
ligru = CLiGRU(hidden_size=12, input_shape=inp_tensor.shape, weight_init='complex', bidirectional=True)

# RNNs return (output, hidden), so we keep only the output.
print(lstm(inp_tensor)[0].shape)
print(rnn(inp_tensor)[0].shape)
print(ligru(inp_tensor)[0].shape)

Note that the output dimension is 48, since we have 12 complex numbers (24 values) times 2 directions (bidirectional RNN).
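The size bookkeeping used throughout this tutorial can be summarized in a small helper. `real_output_size` is a hypothetical function written for this illustration. It only encodes the arithmetic stated in the text.

```python
def real_output_size(n_units, parts, bidirectional=False):
    """Real feature size produced by a layer with `n_units` hyper-complex units.

    parts = 2 for complex layers and 4 for quaternion layers; a bidirectional
    recurrent layer concatenates both directions, doubling the size.
    """
    directions = 2 if bidirectional else 1
    return n_units * parts * directions


complex_feedforward = real_output_size(12, parts=2)
complex_bidirectional = real_output_size(12, parts=2, bidirectional=True)
quaternion_feedforward = real_output_size(12, parts=4)
print(complex_feedforward, complex_bidirectional, quaternion_feedforward)
```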

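Quaternion Neural Networks

Fortunately, QNNs in SpeechBrain follow exactly the same formalism. You can therefore easily derive any quaternion-valued neural network from the building blocks provided in speechbrain.nnet.quaternion_networks:

1D and 2D convolutions.
Batch and layer normalization.
Linear and spinor layers.
Recurrent cells (LSTM, LiGRU, RNN).

As with the complex layers, SpeechBrain relies on split activation functions here and does not offer any fully quaternion-valued activation function.

Everything we have just seen for complex neural networks still applies, so we can summarize it all in a single code snippet: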

from speechbrain.nnet.quaternion_networks.q_CNN import QConv1d, QConv2d
from speechbrain.nnet.quaternion_networks.q_linear import QLinear
from speechbrain.nnet.quaternion_networks.q_RNN import QLiGRU, QLSTM, QRNN

# [batch, time, features]
T = torch.rand((8, 10, 40))

# [batch, time, fea, Channel]
T_4d = torch.rand([10, 16, 30, 40])

# We define our layers and we want 12 quaternion numbers as output (12x4 = 48 output real-values).
cnn_1d = QConv1d( input_shape=T.shape, out_channels=12, kernel_size=3)
cnn_2d = QConv2d( input_shape=T_4d.shape, out_channels=12, kernel_size=3)

lin = QLinear(12, input_shape=T.shape, init_criterion='glorot', weight_init='quaternion')

lstm = QLSTM(hidden_size=12, input_shape=T.shape, weight_init='quaternion', bidirectional=True)
rnn = QRNN(hidden_size=12, input_shape=T.shape, weight_init='quaternion', bidirectional=True)
ligru = QLiGRU(hidden_size=12, input_shape=T.shape, weight_init='quaternion', bidirectional=True)

print(cnn_1d(T).shape)
print(cnn_2d(T_4d).shape)
print(lin(T).shape)
print(lstm(T)[0].shape) # RNNs return output + hidden so we need to filter !
print(ligru(T)[0].shape) # RNNs return output + hidden so we need to filter !
print(rnn(T)[0].shape) # RNNs return output + hidden so we need to filter !

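Quaternion Spinor Neural Networks

Introduction: Quaternion spinor neural networks (SNNs) are a special category of quaternion-valued neural networks (QNNs). As mentioned above, quaternions are used to represent rotations. In QNN layers, the basic operation is the Hamilton product (inputs x weights), where inputs and weights are sets of quaternions. This product essentially creates a new rotation equivalent to the composition of the two rotations.

Rotation composition: Multiplying two quaternions gives a rotation that combines the individual rotations they represent. For example, given q3 = q1 x q2, q3 is a single rotation equivalent to applying both rotations in sequence; with the convention of the rotation equation given later in this section (v -> q v q^{-1}), q2 is applied first, followed by q1. In the context of spinor neural networks, this idea is used to compose new rotations, not to physically rotate objects but to predict successive rotations. For example, predicting the next movement of a robot means taking the previous movement (represented as a quaternion) as input and producing a new quaternion as output that captures the expected next movement.

Modeling rotations with SNNs: Spinor neural networks (SNNs) are specifically designed to model rotations. In scenarios such as robot motion, an SNN takes the 3D coordinates (x, y, z) of an object before the movement as input and predicts its coordinates after the movement as output.

Formal rotation equation: To achieve this, the standard product in all layers of the network is replaced with the following equation: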

\begin{equation} \vec{v}_{output} = q_{weight} \, \vec{v}_{input} \, q^{-1}_{weight}. \tag{5} \end{equation}

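This equation formally defines the rotation of a vector \(\vec{v}\) by a unit quaternion \(q_{weight}\) (norm equal to 1), where \(q^{-1}\) is the inverse of the quaternion, which coincides with its conjugate for a unit quaternion. Both the left and right products in this equation are Hamilton products.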
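The rotation equation can be tried out with a few lines of plain Python. The helpers below (`hamilton`, `conjugate`, `rotate`, `axis_angle`) are illustrative names, not SpeechBrain functions. Quaternions are (r, x, y, z) tuples and the 3D vector is embedded as a quaternion with zero real part.

```python
import math


def hamilton(p, q):
    """Hamilton product of two quaternions given as (r, x, y, z) tuples."""
    a, b, c, d = p
    r, x, y, z = q
    return (
        a * r - b * x - c * y - d * z,
        a * x + b * r + c * z - d * y,
        a * y - b * z + c * r + d * x,
        a * z + b * y - c * x + d * r,
    )


def conjugate(q):
    """Conjugate of a quaternion; equals the inverse for a unit quaternion."""
    r, x, y, z = q
    return (r, -x, -y, -z)


def rotate(q, v):
    """Rotate the 3D vector v by the unit quaternion q: q * (0, v) * conj(q)."""
    rotated = hamilton(hamilton(q, (0.0, *v)), conjugate(q))
    return rotated[1:]


def axis_angle(axis, angle):
    """Unit quaternion for a rotation of `angle` radians about a unit axis."""
    s = math.sin(angle / 2.0)
    return (math.cos(angle / 2.0), axis[0] * s, axis[1] * s, axis[2] * s)


q_z = axis_angle((0.0, 0.0, 1.0), math.pi / 2)  # 90 degrees about z
q_x = axis_angle((1.0, 0.0, 0.0), math.pi / 2)  # 90 degrees about x
v = (1.0, 0.0, 0.0)

v_z = rotate(q_z, v)                        # the x-axis is sent to the y-axis
v_composed = rotate(hamilton(q_x, q_z), v)  # one product: q_z first, then q_x
v_sequential = rotate(q_x, v_z)             # the same two rotations, one by one
print(v_z, v_composed, v_sequential)
```

The last two results agree, which shows that a single Hamilton product of two unit quaternions encodes the composition of their rotations.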

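In summary, quaternion spinor neural networks are dedicated to modeling rotations, which makes them particularly suitable for applications where predicting successive rotations or movements is essential, such as robotics or animation.

So, how is this implemented in SpeechBrain?

In exactly the same way as the standard Hamilton product! In fact, such a rotation can also be expressed as a matrix product, where a, b, c, d are the components of the unit quaternion: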

\begin{equation} \left(\begin{array}{ccc} a^{2}+b^{2}-c^{2}-d^{2} & 2 b c-2 a d & 2 a c+2 b d \\ 2 a d+2 b c & a^{2}-b^{2}+c^{2}-d^{2} & 2 c d-2 a b \\ 2 b d-2 a c & 2 a b+2 c d & a^{2}-b^{2}-c^{2}+d^{2} \end{array}\right). \tag{6} \end{equation}

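Hence, we just need to define the quaternion_op that follows the same usual process:

Compose a real-valued matrix from the different weight components.
Apply a matrix product between the input and this rotation matrix!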

Check the code!
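For a single quaternion, the two steps above can be sketched in plain Python as follows. This mirrors matrix (6) entry by entry. `rotation_matrix` and `apply` are illustrative helpers rather than the SpeechBrain operation, which works on whole weight tensors.

```python
import math


def rotation_matrix(q):
    """3x3 rotation matrix of the unit quaternion q = (a, b, c, d)."""
    a, b, c, d = q
    return [
        [a*a + b*b - c*c - d*d, 2*b*c - 2*a*d,         2*a*c + 2*b*d],
        [2*a*d + 2*b*c,         a*a - b*b + c*c - d*d, 2*c*d - 2*a*b],
        [2*b*d - 2*a*c,         2*a*b + 2*c*d,         a*a - b*b - c*c + d*d],
    ]


def apply(matrix, v):
    """Matrix-vector product between a 3x3 matrix and a 3D vector."""
    return tuple(sum(matrix[i][k] * v[k] for k in range(3)) for i in range(3))


half = math.pi / 4
q = (math.cos(half), 0.0, 0.0, math.sin(half))  # 90 degrees about the z-axis
rotated = apply(rotation_matrix(q), (1.0, 0.0, 0.0))
print(rotated)
```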

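Turning a Quaternion Layer into a Spinor Layer

Spinor layers can be activated with a boolean argument available in all quaternion layers. Here are a few examples: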

from speechbrain.nnet.quaternion_networks.q_CNN import QConv1d
from speechbrain.nnet.quaternion_networks.q_linear import QLinear

# [batch, time, features]
T = torch.rand((8, 80, 16))

#
# NOTE: in this case the real components must be zero as spinor neural networks
# only input and output 3D vectors ! We don't do it here for the sake of compactness
#

# We define our layers and we want 12 quaternion numbers as output (12x4 = 48 output real-values).
cnn_1d = QConv1d( input_shape=T.shape, out_channels=12, kernel_size=3, spinor=True, vector_scale=True)
lin = QLinear(12, input_shape=T.shape, spinor=True, vector_scale=True)

print(cnn_1d(T).shape)
print(lin(T).shape)

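Two remarks on spinor layers:

We need to set vector_scale to train deep models. The vector scale is just another set of torch.Parameters that scales down the output of each spinor layer. Indeed, the output of an SNN layer is a set of 3D vectors, each being a sum of rotated 3D vectors. A quaternion rotation does not change the magnitude of the rotated vector. Hence, by repeatedly summing rotated 3D vectors, we may quickly end up with very large values (i.e. the training will explode).
You may want to use weight_init='unitary'. Indeed, quaternion rotations are only valid if the quaternion is a unit quaternion. Starting from unit weights may therefore ease the learning phase!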
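To make the notion of unit weights concrete, the sketch below normalizes randomly drawn quaternion weight components so that every weight quaternion has norm 1. It only illustrates the idea; how weight_init='unitary' is implemented inside SpeechBrain is not shown here, and the sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
n_in, n_out = 4, 3  # counts of quaternions

# Four weight components, one per quaternion part.
r, i, j, k = (torch.randn(n_in, n_out) for _ in range(4))

# Divide every component by the per-quaternion norm to obtain unit quaternions.
norm = torch.sqrt(r**2 + i**2 + j**2 + k**2) + 1e-8
r, i, j, k = r / norm, i / norm, j / norm, k / norm

unit_norms = torch.sqrt(r**2 + i**2 + j**2 + k**2)
print(unit_norms)
```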

Putting It All Together!

We provide a minimal example for both complex and quaternion neural networks:

speechbrain/tests/integration/ASR_CTC/example_asr_ctc_experiment_complex_net.yaml.
speechbrain/tests/integration/ASR_CTC/example_asr_ctc_experiment_quaternion_net.yaml.

If we look at one of these YAML parameter files, we can easily see how the model is built from the different blocks!

yaml_params = """
model: !new:speechbrain.nnet.containers.Sequential
    input_shape: [!ref <N_batch>, null, 660]  # input_size
    conv1: !name:speechbrain.nnet.quaternion_networks.q_CNN.QConv1d
        out_channels: 16
        kernel_size: 3
    act1: !ref <activation>
    conv2: !name:speechbrain.nnet.quaternion_networks.q_CNN.QConv1d
        out_channels: 32
        kernel_size: 3
    nrm2: !name:speechbrain.nnet.quaternion_networks.q_CNN.QConv1d
    act2: !ref <activation>
    pooling: !new:speechbrain.nnet.pooling.Pooling1d
        pool_type: "avg"
        kernel_size: 3
    RNN: !name:speechbrain.nnet.quaternion_networks.q_RNN.QLiGRU
        hidden_size: 64
        bidirectional: True
    linear: !name:speechbrain.nnet.linear.Linear
        n_neurons: 43  # 42 phonemes + 1 blank
        bias: False
    softmax: !new:speechbrain.nnet.activations.Softmax
        apply_log: True
        """

Here we have a very basic quaternion-valued CNN-LiGRU model that can be used to perform end-to-end CTC ASR!

%cd /content/speechbrain/tests/integration/ASR_CTC/
!python example_asr_ctc_experiment.py example_asr_ctc_experiment_quaternion_net.yaml

Citing SpeechBrain

If you use SpeechBrain in your research or business, please cite it using the following BibTeX entries:

@misc{speechbrainV1,
  title={Open-Source Conversational AI with {SpeechBrain} 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}