Document embedding using UMAP

This is a tutorial on embedding text with UMAP (though it can be extended to any collection of tokens). We will use the 20 newsgroups dataset, a collection of forum posts labelled by topic. We will embed these documents and see that similar documents (i.e. posts in the same subforum) end up close together. You can use this embedding for other downstream tasks, such as visualizing your corpus or running a clustering algorithm (e.g. HDBSCAN). We will use a bag-of-words model and apply UMAP to the count vectors as well as to the TF-IDF vectors.

First let's load the relevant libraries. This requires UMAP version >= 0.4.0.

import pandas as pd
import umap
import umap.plot

# Used to get the data
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Some plotting libraries
import matplotlib.pyplot as plt
%matplotlib notebook
from bokeh.plotting import show, save, output_notebook, output_file
from bokeh.resources import INLINE
output_notebook(resources=INLINE)

Next let's download and explore the 20 newsgroups dataset.

%%time
dataset = fetch_20newsgroups(subset='all',
                             shuffle=True, random_state=42)
CPU times: user 280 ms, sys: 52 ms, total: 332 ms
Wall time: 460 ms

Let's look at the size of the corpus:

print(f'{len(dataset.data)} documents')
print(f'{len(dataset.target_names)} categories')
18846 documents
20 categories

Here are the categories of documents. As you can see, many of them are closely related to one another (e.g. 'comp.sys.ibm.pc.hardware' and 'comp.sys.mac.hardware'), but they are not all related (e.g. 'sci.med' and 'rec.sport.baseball').

dataset.target_names
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Let's look at a few sample documents:

for idx, document in enumerate(dataset.data[:3]):
    category = dataset.target_names[dataset.target[idx]]

    print(f'Category: {category}')
    print('---------------------------')
    # Print the first 500 characters of the post
    print(document[:500])
    print('---------------------------')
Category: rec.sport.hockey
---------------------------
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killin
---------------------------
Category: comp.sys.ibm.pc.hardware
---------------------------
From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson)
Subject: Which high-performance VLB video card?
Summary: Seek recommendations for VLB video card
Nntp-Posting-Host: midway.ecn.uoknor.edu
Organization: Engineering Computer Network, University of Oklahoma, Norman, OK, USA
Keywords: orchid, stealth, vlb
Lines: 21

  My brother is in the market for a high-performance video card that supports
VESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:

  - Diamond Stealth Pro Local
---------------------------
Category: talk.politics.mideast
---------------------------
From: hilmi-er@dsv.su.se (Hilmi Eren)
Subject: Re: ARMENIA SAYS IT COULD SHOOT DOWN TURKISH PLANES (Henrik)
Lines: 95
Nntp-Posting-Host: viktoria.dsv.su.se
Reply-To: hilmi-er@dsv.su.se (Hilmi Eren)
Organization: Dept. of Computer and Systems Sciences, Stockholm University




|>The student of "regional killings" alias Davidian (not the Davidian religios sect) writes:


|>Greater Armenia would stretch from Karabakh, to the Black Sea, to the
|>Mediterranean, so if you use the term "Greater Armenia
---------------------------

Now we will create a dataframe with the target labels to be used for plotting. This will let us see the newsgroup for each point when we hover over it (if we use an interactive plot), and will help us (visually) evaluate how good the embedding is.

category_labels = [dataset.target_names[x] for x in dataset.target]
hover_df = pd.DataFrame(category_labels, columns=['category'])

Using raw counts

Next, we are going to take a bag-of-words approach (i.e. word order does not matter) and construct a word-document matrix. In this matrix each row corresponds to a document (i.e. a post) and each column corresponds to a particular word. The values are the counts of how many times a given word appeared in a particular document.

We will do this with sklearn's CountVectorizer, along with a few extra preprocessing steps:

  1. Split the text into tokens (i.e. words) on whitespace

  2. Remove English stop words (the, and, etc.)

  3. Remove all words that appear fewer than 5 times in the entire corpus (via the min_df parameter)

vectorizer = CountVectorizer(min_df=5, stop_words='english')
word_doc_matrix = vectorizer.fit_transform(dataset.data)

This gives us an 18846x34880 matrix: 18846 documents (the same as above) and 34880 unique tokens. The matrix is sparse, since most words do not appear in most documents.

word_doc_matrix
<18846x34880 sparse matrix of type '<class 'numpy.int64'>'
    with 1939023 stored elements in Compressed Sparse Row format>
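
If you want to see which token a given column corresponds to, you can inspect the vectorizer's vocabulary. A minimal sketch (get_feature_names_out is available in recent versions of sklearn; older versions use get_feature_names instead):

# Each entry is the token associated with one column of word_doc_matrix
print(vectorizer.get_feature_names_out()[:10])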

Now we are going to use UMAP for dimensionality reduction, taking the matrix from 34880 dimensions down to 2 (since n_components=2). We need a distance metric, and we will use Hellinger distance, which measures the similarity between two probability distributions. Each document has a set of counts generated by a multinomial distribution, and Hellinger distance lets us measure how similar those distributions are.
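
For reference, Hellinger distance between two count vectors treats each one as a probability distribution. A minimal, illustrative sketch (the toy vectors and the helper name are not part of the tutorial):

import numpy as np

# H(p, q) = sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)) for distributions p and q
def hellinger(counts_a, counts_b):
    p = counts_a / counts_a.sum()  # normalize counts to a distribution
    q = counts_b / counts_b.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Two toy "documents" with counts over a 4-word vocabulary
doc_a = np.array([3.0, 1.0, 0.0, 2.0])
doc_b = np.array([2.0, 2.0, 1.0, 1.0])
print(hellinger(doc_a, doc_b))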

%%time
embedding = umap.UMAP(n_components=2, metric='hellinger').fit(word_doc_matrix)
CPU times: user 2min 24s, sys: 1.18 s, total: 2min 25s
Wall time: 2min 3s

Now we have an embedding of shape 18846x2.

embedding.embedding_.shape
(18846, 2)

Let's plot the embedding. If you are running this in a notebook, you should use the interactive plotting method, since it lets you hover over points and see which category they belong to.

# For interactive plotting use
# f = umap.plot.interactive(embedding, labels=dataset.target, hover_data=hover_df, point_size=1)
# show(f)
f = umap.plot.points(embedding, labels=hover_df['category'])
[Figure: UMAP projection of the count vectors with Hellinger distance - 20newsgroups_hellinger_counts.png]

Using TF-IDF

To do TF-IDF weighting we will use sklearn's TfidfVectorizer with the same parameters as the CountVectorizer above.
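
For reference, with its default settings TfidfVectorizer multiplies each raw count by a smoothed inverse document frequency and then L2-normalizes each row. A minimal sketch of that idf term (the document counts below are only illustrative):

import numpy as np

# sklearn's smoothed idf: idf(t) = ln((1 + n_documents) / (1 + document_frequency)) + 1
def smoothed_idf(n_documents, document_frequency):
    return np.log((1 + n_documents) / (1 + document_frequency)) + 1

# A rare term (appearing in 100 of 18846 documents) is weighted up...
print(smoothed_idf(18846, 100))
# ...while a very common term (appearing in 15000 documents) gets a weight close to 1
print(smoothed_idf(18846, 15000))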

tfidf_vectorizer = TfidfVectorizer(min_df=5, stop_words='english')
tfidf_word_doc_matrix = tfidf_vectorizer.fit_transform(dataset.data)

We get a matrix of the same size as before:

tfidf_word_doc_matrix
<18846x34880 sparse matrix of type '<class 'numpy.float64'>'
    with 1939023 stored elements in Compressed Sparse Row format>

%%time
tfidf_embedding = umap.UMAP(metric='hellinger').fit(tfidf_word_doc_matrix)
CPU times: user 2min 19s, sys: 1.27 s, total: 2min 20s
Wall time: 1min 57s

# For interactive plotting use
# fig = umap.plot.interactive(tfidf_embedding, labels=dataset.target, hover_data=hover_df, point_size=1)
# show(fig)
fig = umap.plot.points(tfidf_embedding, labels=hover_df['category'])
[Figure: UMAP projection of the TF-IDF vectors with Hellinger distance - 20newsgroups_hellinger_tfidf.png]

The result looks very similar to before, but this can be a useful trick to keep in your toolbox.

Potential applications

  • Explore/visualize your corpus to identify topics/trends

  • Cluster the embedding to find groups of related documents (see the sketch after this list)

  • Look for nearest neighbours to find related documents (also in the sketch below)

  • Look for anomalous documents
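
As an illustration of the clustering and nearest-neighbour ideas above, here is a minimal sketch that reuses the tfidf_embedding fitted earlier. It assumes the hdbscan package is installed, and min_cluster_size is just a placeholder value you would tune for your own corpus:

import hdbscan
from sklearn.neighbors import NearestNeighbors

coords = tfidf_embedding.embedding_  # the 18846x2 embedding from above

# Cluster the 2D embedding to find groups of related documents
clusterer = hdbscan.HDBSCAN(min_cluster_size=50)
cluster_labels = clusterer.fit_predict(coords)
print(f'{cluster_labels.max() + 1} clusters found (label -1 means noise)')

# Find the 5 documents closest to the first document in the embedding space
nn = NearestNeighbors(n_neighbors=6).fit(coords)
_, neighbor_idx = nn.kneighbors(coords[:1])
for idx in neighbor_idx[0][1:]:  # skip the query document itself
    print(hover_df['category'][idx])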