Build Your First Semantic Search Engine in 5 Minutes
| Time: 5 - 15 min | Level: Beginner |
|---|---|
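Overview
If you are new to vector databases, this tutorial is for you. In 5 minutes you will build a semantic search engine for science fiction books. Once it is set up, you will ask the engine about an impending alien threat, and it will recommend books to help you prepare for a potential attack from space.
Before you begin, you need a recent version of Python installed. If you don't know how to run this code in a virtual environment, first follow the "Creating Virtual Environments" section of the Python documentation.
This tutorial assumes you are using the bash shell. Activate the virtual environment as described in the Python documentation, with a command such as: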
source tutorial-env/bin/activate
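1. Installation
You need to process your data so that the search engine can work with it. The Sentence Transformers framework gives you access to common large language models that turn raw data into embeddings.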
pip install -U sentence-transformers
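Once encoded, this data needs to be kept somewhere. Qdrant lets you store data as embeddings and run search queries against it. This means you can ask the engine for relevant answers that go well beyond keyword matching.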
pip install -U qdrant-client
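Import the models
With both packages installed, you need to specify the exact model this engine will use. Before you do, start the Python prompt (>>>) by running the python command.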
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer
The Sentence Transformers framework contains many embedding models. This tutorial uses all-MiniLM-L6-v2 because it is a fast encoder.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
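2. Add the dataset
all-MiniLM-L6-v2 will encode the data you provide. Here you list all the science fiction books in your library. Each book has metadata: a name, an author, a publication year, and a short description.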
documents = [
    {
        "name": "The Time Machine",
        "description": "A man travels through time and witnesses the evolution of humanity.",
        "author": "H.G. Wells",
        "year": 1895,
    },
    {
        "name": "Ender's Game",
        "description": "A young boy is trained to become a military leader in a war against an alien race.",
        "author": "Orson Scott Card",
        "year": 1985,
    },
    {
        "name": "Brave New World",
        "description": "A dystopian society where people are genetically engineered and conditioned to conform to a strict social hierarchy.",
        "author": "Aldous Huxley",
        "year": 1932,
    },
    {
        "name": "The Hitchhiker's Guide to the Galaxy",
        "description": "A comedic science fiction series following the misadventures of an unwitting human and his alien friend.",
        "author": "Douglas Adams",
        "year": 1979,
    },
    {
        "name": "Dune",
        "description": "A desert planet is the site of political intrigue and power struggles.",
        "author": "Frank Herbert",
        "year": 1965,
    },
    {
        "name": "Foundation",
        "description": "A mathematician develops a science to predict the future of humanity and works to save civilization from collapse.",
        "author": "Isaac Asimov",
        "year": 1951,
    },
    {
        "name": "Snow Crash",
        "description": "A futuristic world where the internet has evolved into a virtual reality metaverse.",
        "author": "Neal Stephenson",
        "year": 1992,
    },
    {
        "name": "Neuromancer",
        "description": "A hacker is hired to pull off a near-impossible hack and gets pulled into a web of intrigue.",
        "author": "William Gibson",
        "year": 1984,
    },
    {
        "name": "The War of the Worlds",
        "description": "A Martian invasion of Earth throws humanity into chaos.",
        "author": "H.G. Wells",
        "year": 1898,
    },
    {
        "name": "The Hunger Games",
        "description": "A dystopian society where teenagers are forced to fight to the death in a televised spectacle.",
        "author": "Suzanne Collins",
        "year": 2008,
    },
    {
        "name": "The Andromeda Strain",
        "description": "A deadly virus from outer space threatens to wipe out humanity.",
        "author": "Michael Crichton",
        "year": 1969,
    },
    {
        "name": "The Left Hand of Darkness",
        "description": "A human ambassador is sent to a planet where the inhabitants are genderless and can change gender at will.",
        "author": "Ursula K. Le Guin",
        "year": 1969,
    },
    {
        "name": "The Three-Body Problem",
        "description": "Humans encounter an alien civilization that lives in a dying system.",
        "author": "Liu Cixin",
        "year": 2008,
    },
]
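3. Define the storage location
You need to tell Qdrant where to store the embeddings. This is a basic demo, so your local computer will use its memory as temporary storage.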
client = QdrantClient(":memory:")
4. Create a collection
All data in Qdrant is organized into collections. In this case you are storing books, so the collection is called my_books.
client.create_collection(
    collection_name="my_books",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # Vector size is defined by used model
        distance=models.Distance.COSINE,
    ),
)
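The size parameter defines the size of the vectors in a specific collection. If two vectors have different sizes, the distance between them cannot be calculated. 384 is the output dimensionality of this encoder; the code above reads it from encoder.get_sentence_embedding_dimension() instead of hard-coding it, so it stays correct if you swap in a different model. The distance parameter lets you specify the function used to measure the distance between two points.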
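To make the role of the vector size and the distance function more concrete, here is a minimal pure-Python sketch of cosine similarity. It is only an illustration of the underlying math, not Qdrant's implementation; the function name cosine_similarity and the sample vectors are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of the same length."""
    # Vectors of different sizes cannot be compared at all.
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} != {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Illustrative toy vectors; real embeddings from this encoder have 384 dimensions.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Vectors that point the same way score close to 1 regardless of their length, and unrelated (orthogonal) vectors score 0, which is why this measure suits comparing text embeddings.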
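5. Upload data to the collection
Tell the database to upload documents to the my_books collection. Each record gets an ID, a vector encoded from its description, and a payload. The payload is simply the metadata from the dataset.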
client.upload_points(
    collection_name="my_books",
    points=[
        models.PointStruct(
            id=idx, vector=encoder.encode(doc["description"]).tolist(), payload=doc
        )
        for idx, doc in enumerate(documents)
    ],
)
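The shape of each uploaded point can be sketched with plain dictionaries. This is an illustration only: placeholder_vector is a hypothetical stand-in for encoder.encode(...).tolist() and does not produce a real embedding, and sample_documents is a shortened, assumed version of the dataset.

```python
# Two shortened sample records; the tutorial itself uses the full `documents` list.
sample_documents = [
    {"name": "Dune", "year": 1965},
    {"name": "Foundation", "year": 1951},
]


def placeholder_vector(text):
    """Hypothetical stand-in for the encoder; NOT a real embedding."""
    return [float(len(text)), float(text.count(" "))]


# enumerate() supplies the ID; the whole record is carried along as the payload.
points = [
    {"id": idx, "vector": placeholder_vector(doc["name"]), "payload": doc}
    for idx, doc in enumerate(sample_documents)
]

for point in points:
    print(point["id"], point["payload"]["name"], point["vector"])
```

IDs are simply the positions in the list (0, 1, ...), and the payload is the original record unchanged, so a search hit can return the full book metadata.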
6. Ask the engine a question
Now that the data is stored in Qdrant, you can ask it questions and receive semantically relevant results.
hits = client.query_points(
    collection_name="my_books",
    query=encoder.encode("alien invasion").tolist(),
    limit=3,
).points
for hit in hits:
    print(hit.payload, "score:", hit.score)
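Response:
The search engine returns the three results most likely to be about an alien invasion. Each result is assigned a score that shows how close it is to the original query.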
{'name': 'The War of the Worlds', 'description': 'A Martian invasion of Earth throws humanity into chaos.', 'author': 'H.G. Wells', 'year': 1898} score: 0.570093257022374
{'name': "The Hitchhiker's Guide to the Galaxy", 'description': 'A comedic science fiction series following the misadventures of an unwitting human and his alien friend.', 'author': 'Douglas Adams', 'year': 1979} score: 0.5040468703143637
{'name': 'The Three-Body Problem', 'description': 'Humans encounter an alien civilization that lives in a dying system.', 'author': 'Liu Cixin', 'year': 2008} score: 0.45902943411768216
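Conceptually, a query like this scores stored vectors against the query vector and keeps the best few. The sketch below shows that idea as a brute-force loop over hand-made 2-D vectors. It is an assumption-laden illustration: the names top_k and toy_points are invented here, and a real vector database typically uses an index rather than comparing against every point.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, points, limit):
    """Score every point against the query and return the `limit` best, highest first."""
    scored = [(cosine_similarity(query_vector, p["vector"]), p["payload"]) for p in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# Hand-made vectors, purely illustrative.
toy_points = [
    {"vector": [0.9, 0.1], "payload": {"name": "A"}},
    {"vector": [0.1, 0.9], "payload": {"name": "B"}},
    {"vector": [0.7, 0.3], "payload": {"name": "C"}},
]

results = top_k([1.0, 0.0], toy_points, limit=2)
for score, payload in results:
    print(payload, "score:", score)
```

With the query pointing along the first axis, "A" and "C" rank ahead of "B", mirroring how limit=3 in the tutorial keeps only the three closest books.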
Narrow down the query
What about only books published in 2000 or later?
hits = client.query_points(
    collection_name="my_books",
    query=encoder.encode("alien invasion").tolist(),
    query_filter=models.Filter(
        must=[models.FieldCondition(key="year", range=models.Range(gte=2000))]
    ),
    limit=1,
).points
for hit in hits:
    print(hit.payload, "score:", hit.score)
Response:
The query has been narrowed down to a single result from 2008.
{'name': 'The Three-Body Problem', 'description': 'Humans encounter an alien civilization that lives in a dying system.', 'author': 'Liu Cixin', 'year': 2008} score: 0.45902943411768216
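The effect of the filter can be sketched in plain Python: discard candidates whose payload fails the condition, then keep the best-scoring ones. This is a simplified illustration rather than how Qdrant evaluates filters internally. The helper names matches_range and filtered_search are hypothetical, and the candidate list contains only the three books whose scores appear in the earlier response (rounded).

```python
# Scores copied (rounded) from the earlier response; only these three are known.
candidates = [
    {"payload": {"name": "The War of the Worlds", "year": 1898}, "score": 0.5701},
    {"payload": {"name": "The Hitchhiker's Guide to the Galaxy", "year": 1979}, "score": 0.5040},
    {"payload": {"name": "The Three-Body Problem", "year": 2008}, "score": 0.4590},
]


def matches_range(payload, key, gte):
    """True when payload[key] exists and is >= gte (a 'must' range condition)."""
    value = payload.get(key)
    return value is not None and value >= gte


def filtered_search(candidates, key, gte, limit):
    """Drop candidates that fail the condition, then keep the best-scoring ones."""
    allowed = [c for c in candidates if matches_range(c["payload"], key, gte)]
    return sorted(allowed, key=lambda c: c["score"], reverse=True)[:limit]


hits = filtered_search(candidates, key="year", gte=2000, limit=1)
for hit in hits:
    print(hit["payload"], "score:", hit["score"])
```

Note that the score of the surviving book is unchanged; the filter only decides which points are eligible, not how similar they are to the query.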
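Next steps
Congratulations, you have just created your very first search engine! Trust us, the rest of Qdrant is not that complicated either. For your next tutorial, try building an actual neural search service with a complete API and dataset.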
Return to the bash shell
To return to the bash prompt:
- Press Ctrl+D to exit the Python prompt (>>>).
- Enter the deactivate command to deactivate the virtual environment.
