Using MyScale for embeddings search

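This notebook takes you through a simple flow: download some data, embed it, and then index and search it using a chosen vector database. This is a common requirement for customers who want to store and search our embeddings alongside their own data in a secure environment, in support of production use cases such as chatbots and topic modelling.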

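What is a Vector Database

A vector database is a database built to store, manage, and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video, and more) as vectors for consumption by machine-learning models has exploded in recent years, as AI has become increasingly effective at solving use cases involving natural language, image recognition, and other unstructured data. Vector databases have emerged as an effective way for enterprises to deliver and scale these use cases.

Why use a Vector Database

Vector databases let enterprises take many of the embeddings use cases shared in this repo (question answering, chatbots, and recommendation services, for example) and run them in a secure, scalable environment. Many of our customers have used embeddings to solve their problems at small scale, but performance and security hold them back from going into production. We see vector databases as a key component in solving that. This guide walks through the basics of embedding text data, storing it in a vector database, and using it for semantic search.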

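Demo Flow

The demo flow is:

Setup: Import packages and set any required variables
Load data: Load a dataset and embed it using OpenAI embeddings
MyScale
- Setup: Set up the MyScale Python client. See the MyScale documentation for more details.
- Index Data: We'll create a table and index its content.
- Search Data: Run a few example queries with different goals in mind.

Once you've run through this notebook, you should have a basic understanding of how to set up and use a vector database, and can move on to more complex use cases that make use of our embeddings.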

    import openai

    from typing import List, Iterator
    import pandas as pd
    import numpy as np
    import os
    import wget
    from ast import literal_eval

    # MyScale's client library for Python
    import clickhouse_connect

    # I've set this to our new embeddings model, this can be changed to the embedding model of your choice
    EMBEDDING_MODEL = "text-embedding-3-small"

    # Ignore unclosed SSL socket warnings - optional in case you get these errors
    import warnings

    warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
    warnings.filterwarnings("ignore", category=DeprecationWarning)

Load data

In this section we'll load embedded data that we prepared before this session.

	id	url	title	text	title_vector	content_vector	vector_id
0	1	https://simple.wikipedia.org/wiki/April	April	April is the fourth month of the Gregorian calendar year...	[0.001009464613161981, -0.020700545981526375, ...	[-0.011253940872848034, -0.013491976074874401,...	0
1	2	https://simple.wikipedia.org/wiki/August	August	August (Aug.) is the eighth month of the year...	[0.0009286514250561595, 0.000820168002974242, ...	[0.0003609954728744924, 0.007262262050062418, ...	1
2	6	https://simple.wikipedia.org/wiki/Art	Art	Art is a creative activity that expresses imagination...	[0.003393713850528002, 0.0061537534929811954, ...	[-0.004959689453244209, 0.015772193670272827, ...	2
3	8	https://simple.wikipedia.org/wiki/A	A	A or a is the first letter of the English alphabet...	[0.0153952119871974, -0.013759135268628597, 0....	[0.024894846603274345, -0.022186409682035446, ...	3
4	9	https://simple.wikipedia.org/wiki/Air	Air	Air refers to the Earth's atmosphere. Air is a...	[0.02224554680287838, -0.02044147066771984, -...	[0.021524671465158463, 0.018522677943110466, -...	4

    # Read vectors from strings back into a list
    article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
    article_df['content_vector'] = article_df.content_vector.apply(literal_eval)

    # Set vector_id to be a string
    article_df['vector_id'] = article_df['vector_id'].apply(str)

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 25000 entries, 0 to 24999
    Data columns (total 7 columns):
     #   Column          Non-Null Count  Dtype
    ---  ------          --------------  -----
     0   id              25000 non-null  int64
     1   url             25000 non-null  object
     2   title           25000 non-null  object
     3   text            25000 non-null  object
     4   title_vector    25000 non-null  object
     5   content_vector  25000 non-null  object
     6   vector_id       25000 non-null  object
    dtypes: int64(1), object(6)
    memory usage: 1.3+ MB
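The `literal_eval` step exists because a CSV file stores each embedding as text. As a minimal sketch of that round trip, the following reads a tiny in-memory CSV with the standard library and parses its vector column back into Python lists. The two rows and their values are made up for illustration and are not taken from the real dataset.

```python
import csv
import io
from ast import literal_eval

# Hypothetical two-row stand-in for the embedded-articles CSV (values are made up).
csv_text = (
    "id,title,content_vector,vector_id\n"
    '1,April,"[0.1, -0.2, 0.3]",0\n'
    '2,August,"[0.4, 0.5, -0.6]",1\n'
)

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Straight out of the CSV, every field is a string, including the vector.
raw_vector = rows[0]["content_vector"]

# literal_eval parses the string as a Python literal without executing code.
for row in rows:
    row["content_vector"] = literal_eval(row["content_vector"])

parsed_vector = rows[0]["content_vector"]
print(type(raw_vector).__name__, "->", type(parsed_vector).__name__, parsed_vector)
```

Applying `literal_eval` to a DataFrame column, as the notebook does, performs the same per-row conversion.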

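MyScale

The next vector database we'll consider is MyScale.

MyScale is a database built on ClickHouse that combines vector search and SQL analytics to offer a high-performance, streamlined, and fully managed experience. It is designed for joint queries and analyses over both structured and vector data, with full SQL support for all data processing.

Using the MyScale console, you can deploy and run SQL-based vector search on your cluster in about two minutes.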

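Connect to MyScale

Follow the Connection Details section to retrieve the cluster host, username, and password from the MyScale console, then use them to create a connection to your cluster.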
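A minimal sketch of creating that connection with `clickhouse_connect`, the client library imported during setup, might look like the following. The environment variable names and the default port of 443 are assumptions made for this example; use the host, port, and credentials shown in your own console.

```python
import os


def myscale_connection_kwargs(env=os.environ):
    """Collect connection settings from environment variables.

    The variable names below are hypothetical; use whatever your deployment provides.
    """
    return {
        "host": env["MYSCALE_HOST"],
        # Assumption: the cluster is served over HTTPS on port 443.
        # Use the port shown in your console's connection details.
        "port": int(env.get("MYSCALE_PORT", "443")),
        "username": env["MYSCALE_USERNAME"],
        "password": env["MYSCALE_PASSWORD"],
    }


def connect_to_myscale(env=os.environ):
    """Create a client with clickhouse_connect, the library imported earlier."""
    import clickhouse_connect  # imported lazily so the helper above has no dependency

    return clickhouse_connect.get_client(**myscale_connection_kwargs(env))
```

With the variables set, `client = connect_to_myscale()` yields the `client` object used in the rest of this notebook. Reading credentials from the environment keeps them out of the notebook itself.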

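Index data

We will create an SQL table called articles in MyScale to store the embeddings data. The table will include a vector index that uses the cosine distance metric, along with a constraint on the length of the embeddings. Use the following code to create the articles table and insert data into it: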

    # create articles table with vector index
    embedding_len = len(article_df['content_vector'][0])  # 1536

    client.command(f"""
    CREATE TABLE IF NOT EXISTS default.articles
    (
        id UInt64,
        url String,
        title String,
        text String,
        content_vector Array(Float32),
        CONSTRAINT cons_vector_len CHECK length(content_vector) = {embedding_len},
        VECTOR INDEX article_content_index content_vector TYPE HNSWFLAT('metric_type=Cosine')
    )
    ENGINE = MergeTree ORDER BY id
    """)

    # insert data into the table in batches
    from tqdm.auto import tqdm

    batch_size = 100
    total_records = len(article_df)

    # we only need subset of columns
    article_df = article_df[['id', 'url', 'title', 'text', 'content_vector']]

    # upload data in batches
    data = article_df.to_records(index=False).tolist()
    column_names = article_df.columns.tolist()

    for i in tqdm(range(0, total_records, batch_size)):
        i_end = min(i + batch_size, total_records)
        client.insert("default.articles", data[i:i_end], column_names=column_names)
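Because the table carries a CHECK constraint on vector length, an insert that contains a vector of any other length would be expected to fail. One way to catch this early is to validate lengths locally before uploading. The helper below is a hypothetical illustration and not part of the notebook; the sample vectors are made up.

```python
def find_bad_vector_rows(vectors, expected_len):
    """Return the positions of vectors whose length differs from expected_len."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_len]


# Made-up sample: the second vector is one element short.
sample_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5], [0.6, 0.7, 0.8]]
bad_rows = find_bad_vector_rows(sample_vectors, expected_len=3)
print(bad_rows)
```

In the notebook's terms, this could be called as `find_bad_vector_rows(article_df['content_vector'], embedding_len)` before the upload loop runs.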

Before searching, we need to check the build status of the vector index, because it is built automatically in the background.

    # check count of inserted data
    print(f"articles count: {client.command('SELECT count(*) FROM default.articles')}")

    # check the status of the vector index, make sure vector index is ready with 'Built' status
    get_index_status = "SELECT status FROM system.vector_indices WHERE name='article_content_index'"
    print(f"index build status: {client.command(get_index_status)}")
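Since the index builds in the background, a single status check may run too early. A minimal sketch of polling until the status reads 'Built' follows. The helper name, timeout, and polling interval are assumptions for illustration; it relies only on the `client.command` call and the `system.vector_indices` query already used above.

```python
import time


def wait_for_vector_index(client, index_name, timeout_s=600.0, poll_s=5.0):
    """Poll system.vector_indices until the index reports 'Built' or time runs out.

    Returns True if the index was built within the timeout, False otherwise.
    index_name is interpolated into the SQL, so pass only trusted values.
    """
    query = f"SELECT status FROM system.vector_indices WHERE name='{index_name}'"
    deadline = time.monotonic() + timeout_s
    while True:
        status = client.command(query)
        if status == "Built":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A call such as `wait_for_vector_index(client, "article_content_index")` would block until the index is ready or ten minutes have passed.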

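Search data

Once indexed in MyScale, we can perform vector search to find similar content. First, we use the OpenAI API to generate an embedding for our query. Then, we perform the vector search using MyScale.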

    query = "Famous battles in Scottish history"

    # creates embedding vector from user query
    # note: openai.Embedding.create is the interface of openai SDK versions before 1.0
    embed = openai.Embedding.create(
        input=query,
        model=EMBEDDING_MODEL,
    )["data"][0]["embedding"]

    # query the database to find the top K similar content to the given query
    top_k = 10
    results = client.query(f"""
    SELECT id, url, title, distance(content_vector, {embed}) as dist
    FROM default.articles
    ORDER BY dist
    LIMIT {top_k}
    """)

    # display results
    for i, r in enumerate(results.named_results()):
        print(i + 1, r['title'])
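To make the ranking concrete, here is a minimal pure-Python sketch of brute-force top-k search by cosine distance. It assumes the `dist` values for a Cosine index behave like 1 minus cosine similarity, which is an assumption about the metric and not a description of MyScale's internals. The labels and 2-dimensional vectors are made up.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vec, rows, k):
    """Rank (title, vector) rows by ascending cosine distance and keep the first k."""
    scored = [(title, cosine_distance(query_vec, vec)) for title, vec in rows]
    return sorted(scored, key=lambda pair: pair[1])[:k]


toy_rows = [
    ("same direction", [2.0, 0.0]),
    ("orthogonal", [0.0, 1.0]),
    ("opposite", [-1.0, 0.0]),
]
ranked = top_k([1.0, 0.0], toy_rows, k=2)
print([title for title, _ in ranked])
```

Smaller distances mean closer matches, which is why the SQL query sorts by `dist` in ascending order. A brute-force scan like this compares the query against every row; the vector index exists to avoid that cost at scale.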

June 28, 2023
