Document embedding using UMAP
This is a tutorial on embedding text using UMAP (though it extends to any collection of tokens). We will use the 20 newsgroups dataset, a collection of forum posts labelled by topic. We will embed these documents and see that similar documents (i.e. posts in the same subforum) end up close together. You can use this embedding for other downstream tasks, such as visualizing your corpus or running a clustering algorithm (e.g. HDBSCAN). We will use a bag-of-words model and run UMAP on the count vectors as well as on the TF-IDF vectors.
First, let's load the relevant libraries. This requires UMAP version >= 0.4.0.
import pandas as pd
import umap
import umap.plot
# Used to get the data
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Some plotting libraries
import matplotlib.pyplot as plt
%matplotlib notebook
from bokeh.plotting import show, save, output_notebook, output_file
from bokeh.resources import INLINE
output_notebook(resources=INLINE)
Next, let's download and explore the 20 newsgroups dataset.
%%time
dataset = fetch_20newsgroups(subset='all',
                             shuffle=True, random_state=42)
CPU times: user 280 ms, sys: 52 ms, total: 332 ms
Wall time: 460 ms
Let's look at the size of the corpus:
print(f'{len(dataset.data)} documents')
print(f'{len(dataset.target_names)} categories')
18846 documents
20 categories
These are the categories of the documents. As you can see, many of the categories are related (e.g. 'comp.sys.ibm.pc.hardware' and 'comp.sys.mac.hardware'), but they are not all related (e.g. 'sci.med' and 'rec.sport.baseball').
dataset.target_names
['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']
Let's look at a few example documents:
for idx, document in enumerate(dataset.data[:3]):
    category = dataset.target_names[dataset.target[idx]]
    print(f'Category: {category}')
    print('---------------------------')
    # Print the first 500 characters of the post
    print(document[:500])
    print('---------------------------')
Category: rec.sport.hockey
---------------------------
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu

I am sure some bashers of Pens fans are pretty confused about the lack of any kind of posts about the recent Pens massacre of the Devils. Actually, I am bit puzzled too and a bit relieved. However, I am going to put an end to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they are killin
---------------------------
Category: comp.sys.ibm.pc.hardware
---------------------------
From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson)
Subject: Which high-performance VLB video card?
Summary: Seek recommendations for VLB video card
Nntp-Posting-Host: midway.ecn.uoknor.edu
Organization: Engineering Computer Network, University of Oklahoma, Norman, OK, USA
Keywords: orchid, stealth, vlb
Lines: 21

My brother is in the market for a high-performance video card that supports VESA local bus with 1-2MB RAM. Does anyone have suggestions/ideas on: - Diamond Stealth Pro Local
---------------------------
Category: talk.politics.mideast
---------------------------
From: hilmi-er@dsv.su.se (Hilmi Eren)
Subject: Re: ARMENIA SAYS IT COULD SHOOT DOWN TURKISH PLANES (Henrik)
Lines: 95
Nntp-Posting-Host: viktoria.dsv.su.se
Reply-To: hilmi-er@dsv.su.se (Hilmi Eren)
Organization: Dept. of Computer and Systems Sciences, Stockholm University

|>The student of "regional killings" alias Davidian (not the Davidian religios sect) writes:
|>Greater Armenia would stretch from Karabakh, to the Black Sea, to the
|>Mediterranean, so if you use the term "Greater Armenia
---------------------------
Now we will create a dataframe containing the target labels, which we will use for plotting. This lets us see the newsgroup when hovering over a plotted point (if using interactive plotting), and will help us evaluate (visually) how good the embedding is.
category_labels = [dataset.target_names[x] for x in dataset.target]
hover_df = pd.DataFrame(category_labels, columns=['category'])
Using raw counts
Next we will use a bag-of-words approach (i.e. word order does not matter) and construct a word-document matrix. In this matrix each row corresponds to a document (i.e. a post) and each column corresponds to a particular word. The values are the counts of how many times a given word appears in a particular document.
We will do this using sklearn's CountVectorizer function, along with a few other preprocessing steps:
Split the text into tokens (i.e. words) by splitting on whitespace
Remove English stop words (the, and, etc.)
Remove all words that occur fewer than 5 times across the entire corpus (via the min_df parameter)
vectorizer = CountVectorizer(min_df=5, stop_words='english')
word_doc_matrix = vectorizer.fit_transform(dataset.data)
This gives us an 18846x34880 matrix: 18846 documents (same as above) and 34880 unique tokens. The matrix is sparse because most words do not appear in most documents.
word_doc_matrix
<18846x34880 sparse matrix of type '<class 'numpy.int64'>'
with 1939023 stored elements in Compressed Sparse Row format>
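As a quick sanity check (not part of the original tutorial), we can compute how sparse this matrix is from the attributes scipy sparse matrices expose; with 1,939,023 stored elements out of 18846 x 34880 possible entries, only about 0.3% of the cells are non-zero:
density = word_doc_matrix.nnz / (word_doc_matrix.shape[0] * word_doc_matrix.shape[1])
print(f'{density:.4%} of the entries are non-zero')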
Now we will do dimensionality reduction with UMAP, reducing the matrix from 34880 dimensions down to 2 dimensions (since n_components=2). We need a distance metric and will use Hellinger distance, which measures the similarity between two probability distributions. Each document has a set of counts generated by a multinomial distribution, and we can use Hellinger distance to measure the similarity of those distributions.
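As a brief illustration (this snippet is not part of the original pipeline), the Hellinger distance between two discrete distributions p and q is H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2); the toy word-count vectors below are made up purely for demonstration:
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete probability distributions
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Hypothetical word-count vectors for two tiny 'documents', normalized to distributions
doc_a = np.array([3., 0., 1., 2.])
doc_b = np.array([1., 1., 0., 4.])
print(hellinger(doc_a / doc_a.sum(), doc_b / doc_b.sum()))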
%%time
embedding = umap.UMAP(n_components=2, metric='hellinger').fit(word_doc_matrix)
CPU times: user 2min 24s, sys: 1.18 s, total: 2min 25s
Wall time: 2min 3s
Now we have an embedding of size 18846x2.
embedding.embedding_.shape
(18846, 2)
Let's plot the embedding. If you are running this in a notebook, you should use the interactive plotting method since it lets you hover over points and see which category they belong to.
# For interactive plotting use
# f = umap.plot.interactive(embedding, labels=dataset.target, hover_data=hover_df, point_size=1)
# show(f)
f = umap.plot.points(embedding, labels=hover_df['category'])
Using TF-IDF
To do the TF-IDF weighting we will use sklearn's TfidfVectorizer with the same parameters as the CountVectorizer above.
tfidf_vectorizer = TfidfVectorizer(min_df=5, stop_words='english')
tfidf_word_doc_matrix = tfidf_vectorizer.fit_transform(dataset.data)
We get a matrix of the same size as before:
tfidf_word_doc_matrix
<18846x34880 sparse matrix of type '<class 'numpy.float64'>'
with 1939023 stored elements in Compressed Sparse Row format>
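As an aside (not shown in the original tutorial), TfidfVectorizer is equivalent to running CountVectorizer followed by TfidfTransformer, so the same weighting could also have been applied directly to the count matrix from the previous section:
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Apply the default TF-IDF weighting to the raw count matrix
tfidf_from_counts = TfidfTransformer().fit_transform(word_doc_matrix)
# The first 100 rows should match the TfidfVectorizer output above
print(np.allclose(tfidf_from_counts[:100].toarray(), tfidf_word_doc_matrix[:100].toarray()))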
%%time
tfidf_embedding = umap.UMAP(metric='hellinger').fit(tfidf_word_doc_matrix)
CPU times: user 2min 19s, sys: 1.27 s, total: 2min 20s
Wall time: 1min 57s
# For interactive plotting use
# fig = umap.plot.interactive(tfidf_embedding, labels=dataset.target, hover_data=hover_df, point_size=1)
# show(fig)
fig = umap.plot.points(tfidf_embedding, labels=hover_df['category'])
The results look very similar to before, but this can be a useful trick to have in your toolbox.
Potential applications
Explore/visualize your corpus to identify topics/trends
Cluster the embedding to find groups of related documents (a minimal sketch follows this list)
Find nearest neighbours of a document to find related documents
Look for outlying/anomalous documents
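As a minimal sketch of the clustering use case (assuming the hdbscan package is installed; the n_components=5 and min_cluster_size=50 values are arbitrary choices for illustration, not recommendations from the original tutorial), one could do something like:
import hdbscan

# For clustering it often helps to embed into a few more dimensions than the 2 used for plotting
cluster_embedding = umap.UMAP(n_components=5, metric='hellinger').fit_transform(word_doc_matrix)
clusterer = hdbscan.HDBSCAN(min_cluster_size=50)
cluster_labels = clusterer.fit_predict(cluster_embedding)
print(f'Found {cluster_labels.max() + 1} clusters (label -1 marks noise points)')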