
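Word Embeddings: Encoding Lexical Semantics

Created On: Apr 08, 2017 | Last Updated: Sep 14, 2021 | Last Verified: Nov 05, 2024

Word embeddings are dense vectors of real numbers, one per word in your vocabulary. In NLP, it is almost always the case that your features are words! But how should you represent a word in a computer? You could store its ASCII character representation, but that only tells you what the word is; it says little about what it means (you might be able to derive its part of speech from its affixes, or some properties from its capitalization, but not much). Even more, in what sense could you combine these representations? We often want dense outputs from our neural networks, where the inputs are \(|V|\) dimensional, where \(V\) is our vocabulary, but the outputs often have only a few dimensions (if we are only predicting a handful of labels, for instance). How do we get from a massive dimensional space to a smaller dimensional space?

How about instead of ASCII representations, we use a one-hot encoding? That is, we represent the word \(w\) by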

\[\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements} \]

where the 1 is in a location unique to \(w\). Any other word will have a 1 in some other location, and a 0 everywhere else.
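As a minimal sketch of this encoding, the snippet below builds a one-hot vector with `torch.nn.functional.one_hot`. The three-word vocabulary and the `word_to_ix` mapping here are hypothetical, chosen only for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy vocabulary; each word's index is its "unique location".
vocab = ["mathematician", "physicist", "store"]
word_to_ix = {word: i for i, word in enumerate(vocab)}

# F.one_hot expects an integer (long) tensor of indices.
indices = torch.tensor([word_to_ix["physicist"]], dtype=torch.long)
one_hot = F.one_hot(indices, num_classes=len(vocab))
print(one_hot)  # tensor([[0, 1, 0]])
```

With a realistic vocabulary, each such vector would have \(|V|\) entries, all but one of them zero.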

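There is an enormous drawback to this representation, besides just how huge it is. It basically treats all words as independent entities with no relation to each other. What we really want is some notion of similarity between words. Why? Let's see an example.

Suppose we are building a language model. Suppose we have seen the sentences

The mathematician ran to the store.
The physicist ran to the store.
The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before seen in our training data:

The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn't it be much better if we could use the following two facts:

We have seen mathematician and physicist in the same role in a sentence. Somehow they have a semantic relation.
We have seen mathematician in the same role in this new unseen sentence as we are now seeing physicist.

and then infer that physicist is actually a good fit in the new unseen sentence? This is what we mean by a notion of similarity: we mean semantic similarity, not simply having similar orthographic representations. It is a technique to combat the sparsity of linguistic data, by connecting the dots between what we have seen and what we haven't. This example of course relies on a fundamental linguistic assumption: that words appearing in similar contexts are related to each other semantically. This is called the distributional hypothesis.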
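To make the distributional hypothesis concrete, here is a small pure-Python sketch that collects the (previous word, next word) contexts of each word in the three training sentences and checks which contexts "mathematician" and "physicist" share. The lower-casing, the dropped punctuation, and the `<s>`/`</s>` boundary markers are simplifying assumptions, not part of the tutorial.

```python
from collections import defaultdict

# English renderings of the three training sentences, lower-cased and
# stripped of punctuation for simplicity.
sentences = [
    "the mathematician ran to the store",
    "the physicist ran to the store",
    "the mathematician solved the open problem",
]

# For every word, collect the set of (previous word, next word) contexts.
contexts = defaultdict(set)
for sentence in sentences:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev_word, word, next_word in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((prev_word, next_word))

shared = contexts["mathematician"] & contexts["physicist"]
print(shared)  # {('the', 'ran')}
```

The shared context is the kind of evidence a model could use to treat the two words as related, even though the new sentence itself never appeared in training.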

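Getting Dense Word Embeddings

How can we solve this problem? That is, how could we actually encode semantic similarity in words? Maybe we think up some semantic attributes. For example, we see that both mathematicians and physicists can run, so maybe we give these words a high score for the "is able to run" semantic attribute. Think of some other attributes, and imagine what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector, like this: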

\[ q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run}, \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]\]

\[ q_\text{physicist} = \left[ \overbrace{2.5}^\text{can run}, \overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{majored in Physics}, \dots \right]\]

Then we can get a measure of similarity between these words by doing:

\[\text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician} \]

Although it is more common to normalize by the lengths:

\[ \text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}} {\| q_\text{physicist} \| \| q_\text{mathematician} \|} = \cos (\phi)\]

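where \(\phi\) is the angle between the two vectors. That way, extremely similar words (words whose embeddings point in the same direction) will have similarity 1. Extremely dissimilar words should have similarity -1.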
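As a worked example, the sketch below evaluates both similarity measures on the hand-made vectors. It uses only the three attribute scores that are written out above; the elided dimensions are ignored, so the numbers are purely illustrative.

```python
import torch
import torch.nn.functional as F

# Only the three listed attribute scores (can run, likes coffee,
# majored in Physics); the elided dimensions are left out.
q_mathematician = torch.tensor([2.3, 9.4, -5.5])
q_physicist = torch.tensor([2.5, 9.1, 6.4])

# Unnormalized similarity: the dot product.
dot = torch.dot(q_physicist, q_mathematician)

# Normalized similarity: divide by the product of the vector lengths.
cosine = dot / (q_physicist.norm() * q_mathematician.norm())

# F.cosine_similarity computes the same normalized quantity.
cosine_builtin = F.cosine_similarity(q_physicist, q_mathematician, dim=0)
print(dot.item(), cosine.item(), cosine_builtin.item())
```

On these three dimensions the dot product is about 56.09 and the cosine is about 0.44: the two words agree on the first two attributes, but the opposite signs on "majored in Physics" pull the similarity down.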

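You can think of the sparse one-hot vectors from the beginning of this section as a special case of these new vectors we have defined, where every pair of distinct words has similarity 0, and we gave each word some unique semantic attribute. These new vectors are dense, which is to say their entries are (typically) non-zero.

But these new vectors are a big pain: you could think of thousands of different semantic attributes that might be relevant to determining similarity, and how on earth would you set the values of the different attributes? Central to the idea of deep learning is that the neural network learns representations of the features, rather than requiring the programmer to design them herself. So why not just let the word embeddings be parameters in our model, and then be updated during training? This is exactly what we will do. We will have some latent semantic attributes that the network can, in principle, learn. Note that the word embeddings will probably not be interpretable. That is, although with our hand-crafted vectors above we can see that mathematicians and physicists are similar in that they both like coffee, if we allow a neural network to learn the embeddings and see that both mathematicians and physicists have a large value in the second dimension, it is not clear what that means. They are similar in some latent semantic dimension, but this probably has no interpretation to us.

In summary, word embeddings are a representation of the *semantics* of a word, efficiently encoding semantic information that might be relevant to the task at hand. You can embed other things too: part of speech tags, parse trees, anything! The idea of feature embeddings is central to the field.

Word Embeddings in Pytorch

Before we get to a worked example and an exercise, a few quick notes about how to use embeddings in Pytorch and in deep learning programming in general. Similar to how we defined a unique index for each word when making one-hot vectors, we also need to define an index for each word when using embeddings. These will be keys into a lookup table. That is, embeddings are stored as a \(|V| \times D\) matrix, where \(D\) is the dimensionality of the embeddings, such that the word assigned index \(i\) has its embedding stored in the \(i\)'th row of the matrix. In all of my code, the mapping from words to indices is a dictionary named word_to_ix.

The module that allows you to use embeddings is torch.nn.Embedding, which takes two arguments: the vocabulary size, and the dimensionality of the embeddings.

To index into this table, you must use an integer tensor such as torch.LongTensor (dtype torch.long), since the indices are integers, not floats.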

# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator object at 0x7f8848d25b30>

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)
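The lookup-table description can be checked directly. The sketch below, which reuses the sizes from the example above, looks up several indices at once and confirms that the result matches the corresponding rows of `embeds.weight`, and also matches multiplying one-hot rows by that matrix. The particular index values are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)

vocab_size, embedding_dim = 2, 5  # same sizes as the example above
embeds = nn.Embedding(vocab_size, embedding_dim)

# Look up several indices at once; the result has one row per index.
indices = torch.tensor([0, 1, 0], dtype=torch.long)
looked_up = embeds(indices)
print(looked_up.shape)  # torch.Size([3, 5])

# Row i of the weight matrix is the embedding of index i.
assert torch.equal(looked_up[1], embeds.weight[1])

# Multiplying one-hot rows by the weight matrix selects the same rows.
one_hot = F.one_hot(indices, num_classes=vocab_size).float()
via_matmul = one_hot @ embeds.weight
print(torch.allclose(looked_up, via_matmul))  # True
```

In this sense an embedding layer behaves like a linear layer applied to one-hot inputs, implemented as a row lookup so that the sparse \(|V|\)-dimensional vector never has to be built.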

An Example: N-Gram Language Modeling

Recall that in an n-gram language model, given a sequence of words \(w\), we want to compute

\[P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} ) \]

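where \(w_i\) is the ith word of the sequence.

In this example, we will compute the loss function on some training examples and update the parameters with backpropagation.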

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.
# Each tuple is ([ word_i-CONTEXT_SIZE, ..., word_i-1 ], target word)
ngrams = [
    (
        [test_sentence[i - j - 1] for j in range(CONTEXT_SIZE)],
        test_sentence[i]
    )
    for i in range(CONTEXT_SIZE, len(test_sentence))
]
# Print the first 3, just so you can see what they look like.
print(ngrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in ngrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

# To get the embedding of a particular word, e.g. "beauty"
print(model.embeddings.weight[word_to_ix["beauty"]])

[(['forty', 'When'], 'winters'), (['winters', 'forty'], 'shall'), (['shall', 'winters'], 'besiege')]
[518.2337408065796, 515.4880437850952, 512.7639863491058, 510.0590965747833, 507.3713676929474, 504.69813990592957, 502.03808975219727, 499.3909890651703, 496.7594289779663, 494.1408133506775]
tensor([-1.3866, -0.1536, -1.1605,  2.7781,  0.4523,  1.0135, -0.0849, -1.2198,
        -1.7912,  0.7016], grad_fn=<SelectBackward0>)
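The training loop above pairs `F.log_softmax` in the model with `nn.NLLLoss` as the loss. As a small sanity check, the sketch below shows on made-up scores that this combination is minus the log-probability of the target word, and that it agrees with `F.cross_entropy` applied to the raw scores. The scores, vocabulary size, and target index are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical raw scores for one example over a 6-word vocabulary.
scores = torch.randn(1, 6)
target = torch.tensor([2], dtype=torch.long)

log_probs = F.log_softmax(scores, dim=1)
nll = nn.NLLLoss()(log_probs, target)

# The NLL loss is minus the log-probability assigned to the target word.
manual = -log_probs[0, target.item()]

# F.cross_entropy applies log_softmax and the NLL loss in one call.
ce = F.cross_entropy(scores, target)
print(nll.item(), manual.item(), ce.item())
```

All three printed values should coincide up to floating-point error.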

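Exercise: Computing Word Embeddings: Continuous Bag-of-Words

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep learning. It is a model that tries to predict words given the context of a few words before and a few words after the target word. This is distinct from language modeling, since CBOW is not sequential and does not have to be probabilistic. Typically, CBOW is used to quickly train word embeddings, and these embeddings are used to initialize the embeddings of some more complicated model. Usually, this is referred to as pretraining embeddings. It almost always helps performance by a couple of percent.

The CBOW model is as follows. Given a target word \(w_i\) and an \(N\) context window on each side, \(w_{i-1}, \dots, w_{i-N}\) and \(w_{i+1}, \dots, w_{i+N}\), referring to all context words collectively as \(C\), CBOW tries to minimize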

\[-\log p(w_i | C) = -\log \text{Softmax}\left(A(\sum_{w \in C} q_w) + b\right) \]

where \(q_w\) is the embedding for word \(w\).
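To connect the formula to tensor operations, here is a minimal numeric sketch of the objective for a single example. The sizes and index values are hypothetical, and `nn.Linear` is assumed here as one way to supply \(A\) and \(b\). It deliberately stops short of wrapping the computation in the `CBOW` module, which is left as the exercise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embedding_dim = 20, 10  # hypothetical sizes

embeddings = nn.Embedding(vocab_size, embedding_dim)  # rows are q_w
linear = nn.Linear(embedding_dim, vocab_size)         # computes A x + b

# Hypothetical indices for the context words C and the target word w_i.
context_idxs = torch.tensor([3, 7, 11, 5], dtype=torch.long)
target_idx = torch.tensor([9], dtype=torch.long)

summed = embeddings(context_idxs).sum(dim=0)  # shape: (embedding_dim,)
scores = linear(summed).view(1, -1)           # shape: (1, vocab_size)
log_probs = F.log_softmax(scores, dim=1)
loss = F.nll_loss(log_probs, target_idx)      # -log p(w_i | C)
print(loss.item())
```

Note the shapes: summing over the context dimension collapses the four context embeddings into one vector, and `.view(1, -1)` restores the batch dimension that `F.nll_loss` expects alongside a target of shape `(1,)`.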

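Implement this model in Pytorch by filling in the class below. Some tips:

Think about which parameters you need to define.
Make sure you know what shape each operation expects. Use .view() if you need to reshape.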

CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = (
        [raw_text[i - j - 1] for j in range(CONTEXT_SIZE)]
        + [raw_text[i + j + 1] for j in range(CONTEXT_SIZE)]
    )
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):

    def __init__(self):
        pass

    def forward(self, inputs):
        pass

# Create your model and train. Here are some functions to help you make
# the data ready for use by your module.


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


make_context_vector(data[0][0], word_to_ix)  # example

[(['are', 'We', 'to', 'study'], 'about'), (['about', 'are', 'study', 'the'], 'to'), (['to', 'about', 'the', 'idea'], 'study'), (['study', 'to', 'idea', 'of'], 'the'), (['the', 'study', 'of', 'a'], 'idea')]

tensor([21, 12, 38, 24])
