Document embedding using UMAP

This is a tutorial on embedding text with UMAP (though it can be extended to any collection of tokens). We will use the 20 newsgroups dataset, a collection of forum posts labelled by topic. We will embed these documents and see that similar documents (i.e. posts in the same subforum) end up close together. You can use this embedding for other downstream tasks, such as visualizing your corpus or running a clustering algorithm (e.g. HDBSCAN). We will use a bag-of-words model and apply UMAP to the count vectors as well as to the TF-IDF vectors.

First let's load the relevant libraries. This requires UMAP version >= 0.4.0.

import pandas as pd
import umap
import umap.plot

# Used to get the data
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Some plotting libraries
import matplotlib.pyplot as plt
%matplotlib notebook
from bokeh.plotting import show, save, output_notebook, output_file
from bokeh.resources import INLINE
output_notebook(resources=INLINE)

Next let's download and explore the 20 newsgroups dataset.

%%time
dataset = fetch_20newsgroups(subset='all',
                             shuffle=True, random_state=42)
CPU times: user 280 ms, sys: 52 ms, total: 332 ms
Wall time: 460 ms

Let's look at the size of the corpus:

print(f'{len(dataset.data)} documents')
print(f'{len(dataset.target_names)} categories')
18846 documents
20 categories

Here are the categories of documents. As you can see, many of them are closely related to one another (e.g. 'comp.sys.ibm.pc.hardware' and 'comp.sys.mac.hardware'), but they are not all related (e.g. 'sci.med' and 'rec.sport.baseball').

dataset.target_names
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Let's look at a few sample documents:

for idx, document in enumerate(dataset.data[:3]):
    category = dataset.target_names[dataset.target[idx]]

    print(f'Category: {category}')
    print('---------------------------')
    # Print the first 500 characters of the post
    print(document[:500])
    print('---------------------------')
Category: rec.sport.hockey
---------------------------
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killin
---------------------------
Category: comp.sys.ibm.pc.hardware
---------------------------
From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson)
Subject: Which high-performance VLB video card?
Summary: Seek recommendations for VLB video card
Nntp-Posting-Host: midway.ecn.uoknor.edu
Organization: Engineering Computer Network, University of Oklahoma, Norman, OK, USA
Keywords: orchid, stealth, vlb
Lines: 21

  My brother is in the market for a high-performance video card that supports
VESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:

  - Diamond Stealth Pro Local
---------------------------
Category: talk.politics.mideast
---------------------------
From: hilmi-er@dsv.su.se (Hilmi Eren)
Subject: Re: ARMENIA SAYS IT COULD SHOOT DOWN TURKISH PLANES (Henrik)
Lines: 95
Nntp-Posting-Host: viktoria.dsv.su.se
Reply-To: hilmi-er@dsv.su.se (Hilmi Eren)
Organization: Dept. of Computer and Systems Sciences, Stockholm University




|>The student of "regional killings" alias Davidian (not the Davidian religios sect) writes:


|>Greater Armenia would stretch from Karabakh, to the Black Sea, to the
|>Mediterranean, so if you use the term "Greater Armenia
---------------------------

Now we will create a dataframe with the target labels to be used for plotting. This will let us see the newsgroup for each point when we hover over it (if we use an interactive plot), and will help us (visually) evaluate how good the embedding is.

category_labels = [dataset.target_names[x] for x in dataset.target]
hover_df = pd.DataFrame(category_labels, columns=['category'])

Using raw counts

Next, we are going to take a bag-of-words approach (i.e. word order does not matter) and construct a word-document matrix. In this matrix each row corresponds to a document (i.e. a post) and each column corresponds to a particular word. The values are the counts of how many times a given word appeared in a particular document.

We will do this with sklearn's CountVectorizer, along with a few extra preprocessing steps:

  1. Split the text into tokens (i.e. words) on whitespace

  2. Remove English stop words (the, and, etc.)

  3. Remove all words that appear fewer than 5 times in the entire corpus (via the min_df parameter)

vectorizer = CountVectorizer(min_df=5, stop_words='english')
word_doc_matrix = vectorizer.fit_transform(dataset.data)

This gives us an 18846x34880 matrix: 18846 documents (the same as above) and 34880 unique tokens. The matrix is sparse, since most words do not appear in most documents.

word_doc_matrix
<18846x34880 sparse matrix of type '<class 'numpy.int64'>'
    with 1939023 stored elements in Compressed Sparse Row format>
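
If you want to see which token a given column corresponds to, you can inspect the vectorizer's vocabulary. A minimal sketch (get_feature_names_out is available in recent versions of sklearn; older versions use get_feature_names instead):

# Each entry is the token associated with one column of word_doc_matrix
print(vectorizer.get_feature_names_out()[:10])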

Now we are going to use UMAP for dimensionality reduction, taking the matrix from 34880 dimensions down to 2 (since n_components=2). We need a distance metric, and we will use Hellinger distance, which measures the similarity between two probability distributions. Each document has a set of counts generated by a multinomial distribution, and Hellinger distance lets us measure how similar those distributions are.
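
For reference, Hellinger distance between two count vectors treats each one as a probability distribution. A minimal, illustrative sketch (the toy vectors and the helper name are not part of the tutorial):

import numpy as np

# H(p, q) = sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)) for distributions p and q
def hellinger(counts_a, counts_b):
    p = counts_a / counts_a.sum()  # normalize counts to a distribution
    q = counts_b / counts_b.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Two toy "documents" with counts over a 4-word vocabulary
doc_a = np.array([3.0, 1.0, 0.0, 2.0])
doc_b = np.array([2.0, 2.0, 1.0, 1.0])
print(hellinger(doc_a, doc_b))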

%%time
embedding = umap.UMAP(n_components=2, metric='hellinger').fit(word_doc_matrix)
CPU times: user 2min 24s, sys: 1.18 s, total: 2min 25s
Wall time: 2min 3s

Now we have an embedding of shape 18846x2.

embedding.embedding_.shape
(18846, 2)

Let's plot the embedding. If you are running this in a notebook, you should use the interactive plotting method, since it lets you hover over points and see which category they belong to.

# For interactive plotting use
# f = umap.plot.interactive(embedding, labels=dataset.target, hover_data=hover_df, point_size=1)
# show(f)
f = umap.plot.points(embedding, labels=hover_df['category'])
[Figure: UMAP projection of the count vectors with Hellinger distance - 20newsgroups_hellinger_counts.png]

Using TF-IDF

To do TF-IDF weighting we will use sklearn's TfidfVectorizer with the same parameters as the CountVectorizer above.
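
For reference, with its default settings TfidfVectorizer multiplies each raw count by a smoothed inverse document frequency and then L2-normalizes each row. A minimal sketch of that idf term (the document counts below are only illustrative):

import numpy as np

# sklearn's smoothed idf: idf(t) = ln((1 + n_documents) / (1 + document_frequency)) + 1
def smoothed_idf(n_documents, document_frequency):
    return np.log((1 + n_documents) / (1 + document_frequency)) + 1

# A rare term (appearing in 100 of 18846 documents) is weighted up...
print(smoothed_idf(18846, 100))
# ...while a very common term (appearing in 15000 documents) gets a weight close to 1
print(smoothed_idf(18846, 15000))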

tfidf_vectorizer = TfidfVectorizer(min_df=5, stop_words='english')
tfidf_word_doc_matrix = tfidf_vectorizer.fit_transform(dataset.data)

We get a matrix of the same size as before:

tfidf_word_doc_matrix
<18846x34880 sparse matrix of type '<class 'numpy.float64'>'
    with 1939023 stored elements in Compressed Sparse Row format>

%%time
tfidf_embedding = umap.UMAP(metric='hellinger').fit(tfidf_word_doc_matrix)
CPU times: user 2min 19s, sys: 1.27 s, total: 2min 20s
Wall time: 1min 57s

# For interactive plotting use
# fig = umap.plot.interactive(tfidf_embedding, labels=dataset.target, hover_data=hover_df, point_size=1)
# show(fig)
fig = umap.plot.points(tfidf_embedding, labels=hover_df['category'])
[Figure: UMAP projection of the TF-IDF vectors with Hellinger distance - 20newsgroups_hellinger_tfidf.png]

The result looks very similar to before, but this can be a useful trick to keep in your toolbox.

Potential applications

  • Explore/visualize your corpus to identify topics/trends

  • Cluster the embedding to find groups of related documents (see the sketch after this list)

  • Look for nearest neighbours to find related documents (also in the sketch below)

  • Look for anomalous documents
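
As an illustration of the clustering and nearest-neighbour ideas above, here is a minimal sketch that reuses the tfidf_embedding fitted earlier. It assumes the hdbscan package is installed, and min_cluster_size is just a placeholder value you would tune for your own corpus:

import hdbscan
from sklearn.neighbors import NearestNeighbors

coords = tfidf_embedding.embedding_  # the 18846x2 embedding from above

# Cluster the 2D embedding to find groups of related documents
clusterer = hdbscan.HDBSCAN(min_cluster_size=50)
cluster_labels = clusterer.fit_predict(coords)
print(f'{cluster_labels.max() + 1} clusters found (label -1 means noise)')

# Find the 5 documents closest to the first document in the embedding space
nn = NearestNeighbors(n_neighbors=6).fit(coords)
_, neighbor_idx = nn.kneighbors(coords[:1])
for idx in neighbor_idx[0][1:]:  # skip the query document itself
    print(hover_df['category'][idx])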