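Vectorizers
Supported vectorizers
In this document, you will learn how to use RedisVL to create embeddings with its built-in text embedding vectorizers. RedisVL supports: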
- OpenAI
- HuggingFace
- Vertex AI
- Cohere
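Before you begin, make sure of the following:
- You have installed RedisVL and activated that environment.
- You have a running Redis instance with the Redis Query Engine capability.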
# import necessary modules
import os
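Create text embeddings
This example shows how to create embeddings from three simple sentences using several different text vectorizers in RedisVL.
- "That is a happy dog"
- "That is a happy person"
- "Today is a sunny day"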
OpenAI
OpenAITextVectorizer makes it easy to use RedisVL with OpenAI's embedding models. To use it, you need to install openai.
pip install openai
import getpass
# setup the API Key
api_key = os.environ.get("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")
from redisvl.utils.vectorize import OpenAITextVectorizer
# create a vectorizer
oai = OpenAITextVectorizer(
    model="text-embedding-ada-002",
    api_config={"api_key": api_key},
)
test = oai.embed("This is a test sentence.")
print("Vector dimensions: ", len(test))
test[:10]
Vector dimensions: 1536
[-0.001025049015879631,
-0.0030993607360869646,
0.0024536605924367905,
-0.004484387580305338,
-0.010331203229725361,
0.012700922787189484,
-0.005368996877223253,
-0.0029411641880869865,
-0.0070833307690918446,
-0.03386051580309868]
# Create many embeddings at once
sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day"
]
embeddings = oai.embed_many(sentences)
embeddings[0][:10]
[-0.01747742109000683,
-5.228330701356754e-05,
0.0013870716793462634,
-0.025637786835432053,
-0.01985435001552105,
0.016117358580231667,
-0.0037306349258869886,
0.0008945261361077428,
0.006577865686267614,
-0.025091219693422318]
# OpenAI also supports asynchronous requests, which you can use to speed up the vectorization process.
embeddings = await oai.aembed_many(sentences)
print("Number of Embeddings:", len(embeddings))
Number of Embeddings: 3
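The bare await above works in a notebook, where an event loop is already running. In a plain Python script, the usual pattern is to wrap the call in a coroutine and start it with asyncio.run. The sketch below illustrates that pattern only: StubVectorizer is a hypothetical stand-in (not part of RedisVL) that returns dummy vectors so the example runs without an API key, and embed_all is an illustrative helper name. In real code you would pass the oai object created above instead of the stub.

```python
import asyncio


class StubVectorizer:
    """Hypothetical stand-in for a vectorizer exposing an async aembed_many method."""

    async def aembed_many(self, texts):
        # Simulate network latency, then return one dummy 4-dimensional vector per text.
        await asyncio.sleep(0.01)
        return [[float(len(t)), 0.0, 0.0, 0.0] for t in texts]


async def embed_all(vectorizer, texts):
    # Await the vectorizer's async batch method inside a coroutine.
    return await vectorizer.aembed_many(texts)


sentences = [
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day",
]

# asyncio.run starts an event loop, runs the coroutine to completion, and closes the loop.
embeddings = asyncio.run(embed_all(StubVectorizer(), sentences))
print("Number of Embeddings:", len(embeddings))
```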
Huggingface
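Huggingface is a popular natural language processing (NLP) platform with many pre-trained models that you can use out of the box. RedisVL supports using Huggingface "Sentence Transformers" to create embeddings from text. To use Huggingface, you need to install the sentence-transformers library.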
pip install sentence-transformers
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from redisvl.utils.vectorize import HFTextVectorizer
# create a vectorizer
# choose your model from the huggingface website
hf = HFTextVectorizer(model="sentence-transformers/all-mpnet-base-v2")
# embed a sentence
test = hf.embed("This is a test sentence.")
test[:10]
[0.00037810884532518685,
-0.05080341175198555,
-0.03514723479747772,
-0.02325104922056198,
-0.044158220291137695,
0.020487844944000244,
0.0014617963461205363,
0.031261757016181946,
0.05605152249336243,
0.018815357238054276]
# You can also create many embeddings at once
embeddings = hf.embed_many(sentences, as_buffer=True)
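Passing as_buffer=True asks the vectorizer to return each embedding as raw bytes rather than a list of floats, which is the form the hash-backed index used later in this guide expects. As a rough illustration of what such a buffer looks like, the sketch below packs a list of floats into 32-bit floats with the standard struct module and unpacks them again. The little-endian float32 layout and the helper names to_buffer and from_buffer are assumptions made for this illustration; they are not RedisVL functions.

```python
import struct


def to_buffer(vector):
    # Pack each value as a little-endian 32-bit float (4 bytes per dimension).
    return struct.pack(f"<{len(vector)}f", *vector)


def from_buffer(buffer):
    # Recover the float values; the dimension count follows from the byte length.
    count = len(buffer) // 4
    return list(struct.unpack(f"<{count}f", buffer))


vector = [0.5, -0.25, 0.125]
buffer = to_buffer(vector)
print(len(buffer))          # 12, i.e. 4 bytes per dimension
print(from_buffer(buffer))  # [0.5, -0.25, 0.125]
```

The values chosen here are exactly representable as 32-bit floats, so the round trip is lossless; real embedding values may lose a little precision when narrowed to float32.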
VertexAI
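VertexAI is GCP's fully featured AI platform, which includes a number of pretrained LLMs. RedisVL supports using VertexAI to create embeddings from these models. To use VertexAI, you first need to install the google-cloud-aiplatform library.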
pip install "google-cloud-aiplatform>=1.26"
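Then you need access to a Google Cloud Project and must provide access credentials. You do this by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of a JSON key file downloaded from your service account on GCP.
Finally, you need to find your project ID and the geographic region for VertexAI.
Make sure the following environment variables are set: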
GOOGLE_APPLICATION_CREDENTIALS=<path to your gcp JSON creds>
GCP_PROJECT_ID=<your gcp project id>
GCP_LOCATION=<your gcp geo region for vertex ai>
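The code that follows falls back to interactive prompts when one of these variables is missing, which is awkward in non-interactive jobs. A small check up front can report what is missing instead. The following is a minimal sketch; missing_env_vars and REQUIRED_VARS are hypothetical names introduced for this example and are not part of RedisVL.

```python
import os

REQUIRED_VARS = [
    "GOOGLE_APPLICATION_CREDENTIALS",
    "GCP_PROJECT_ID",
    "GCP_LOCATION",
]


def missing_env_vars(names, env=None):
    # Return the names that are unset or empty in the given mapping
    # (defaults to the current process environment).
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All Vertex AI environment variables are set.")
```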
from redisvl.utils.vectorize import VertexAITextVectorizer
# create a vectorizer
vtx = VertexAITextVectorizer(api_config={
    "project_id": os.environ.get("GCP_PROJECT_ID") or getpass.getpass("Enter your GCP Project ID: "),
    "location": os.environ.get("GCP_LOCATION") or getpass.getpass("Enter your GCP Location: "),
    "google_application_credentials": os.environ.get("GOOGLE_APPLICATION_CREDENTIALS") or getpass.getpass("Enter your Google App Credentials path: ")
})
# embed a sentence
test = vtx.embed("This is a test sentence.")
test[:10]
[0.04373306408524513,
-0.05040992051362991,
-0.011946038343012333,
-0.043528858572244644,
0.021510830149054527,
0.028604144230484962,
0.014770914800465107,
-0.01610461436212063,
-0.0036560404114425182,
0.013746795244514942]
Cohere
Cohere allows you to implement language AI in your product. CohereTextVectorizer makes it simple to use RedisVL with Cohere's embedding models. To use it, you need to install cohere.
pip install cohere
import getpass
# set up the API Key
api_key = os.environ.get("COHERE_API_KEY") or getpass.getpass("Enter your Cohere API key: ")
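Pay special attention to the input_type parameter of each embed call. For example, to embed queries, set input_type='search_query'. To embed documents, set input_type='search_document'. The Cohere documentation has more information on this parameter.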
from redisvl.utils.vectorize import CohereTextVectorizer
# create a vectorizer
co = CohereTextVectorizer(
    model="embed-english-v3.0",
    api_config={"api_key": api_key},
)
# embed a search query
test = co.embed("This is a test sentence.", input_type='search_query')
print("Vector dimensions: ", len(test))
print(test[:10])
# embed a document
test = co.embed("This is a test sentence.", input_type='search_document')
print("Vector dimensions: ", len(test))
print(test[:10])
Vector dimensions: 1024
[-0.010856628, -0.019683838, -0.0062179565, 0.003545761, -0.047943115, 0.0009365082, -0.005924225, 0.016174316, -0.03289795, 0.049194336]
Vector dimensions: 1024
[-0.009712219, -0.016036987, 2.8073788e-05, -0.022491455, -0.041259766, 0.002281189, -0.033294678, -0.00057029724, -0.026260376, 0.0579834]
To learn more about using RedisVL and Cohere together, see the dedicated user guide.
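Search with provider embeddings
Now that you have created your embeddings, you can use them to search for similar sentences. You will use the same three sentences from above as the data to search.
First, create the schema for your index.
Here is what the schema for this example looks like in YAML for the HuggingFace vectorizer: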
version: '0.1.0'
index:
  name: vectorizers
  prefix: doc
  storage_type: hash
fields:
  - name: text
    type: text
  - name: embedding
    type: vector
    attrs:
      dims: 768
      algorithm: flat
      distance_metric: cosine
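The dims value in the schema has to match the length of the vectors you load. The value 768 corresponds to the sentence-transformers/all-mpnet-base-v2 model used above, whereas the OpenAI and Cohere examples earlier printed 1536 and 1024 dimensions. A quick sanity check before loading can catch a mismatch early. The sketch below is one way to do that; vector_length and mismatched_dims are hypothetical helpers (not RedisVL functions), and the 4-bytes-per-dimension rule assumes float32 buffers.

```python
def vector_length(vector):
    # A bytes buffer of 32-bit floats holds 4 bytes per dimension (assumed layout).
    if isinstance(vector, (bytes, bytearray)):
        return len(vector) // 4
    return len(vector)


def mismatched_dims(expected_dims, vectors):
    # Return (position, actual_length) for every vector whose length differs.
    return [
        (i, vector_length(v))
        for i, v in enumerate(vectors)
        if vector_length(v) != expected_dims
    ]


# Dummy vectors standing in for real embeddings.
vectors = [[0.0] * 768, bytes(768 * 4), [0.0] * 1536]
print(mismatched_dims(768, vectors))  # → [(2, 1536)]
```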
from redisvl.index import SearchIndex
# construct a search index from the schema
index = SearchIndex.from_yaml("./schema.yaml")
# connect to local redis instance
index.connect("redis://localhost:6379")
# create the index (no data yet)
index.create(overwrite=True)
# use the CLI to see the created index
!rvl index listall
22:02:27 [RedisVL] INFO Indices:
22:02:27 [RedisVL] INFO 1. vectorizers
# load expects an iterable of dictionaries where
# the vector is stored as a bytes buffer
data = [{"text": t,
         "embedding": v}
        for t, v in zip(sentences, embeddings)]
index.load(data)
['doc:17c401b679ce43cb82f3ab2280ad02f2',
'doc:3fc0502bec434b17a3f06e20824b2e59',
'doc:199f17b0e5d24dcaa1fd4fb41558150c']
from redisvl.query import VectorQuery
# use the HuggingFace vectorizer again to create a query embedding
query_embedding = hf.embed("That is a happy cat")
query = VectorQuery(
    vector=query_embedding,
    vector_field_name="embedding",
    return_fields=["text"],
    num_results=3
)
results = index.query(query)
for doc in results:
    print(doc["text"], doc["vector_distance"])
That is a happy dog 0.160862326622
That is a happy person 0.273598492146
Today is a sunny day 0.744559407234
# cleanup
index.delete()
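The vector_distance values printed by the query above are cosine distances, as configured by distance_metric: cosine in the schema, so smaller values mean more similar sentences. Cosine distance is one minus cosine similarity. A minimal pure-Python sketch of that calculation follows; cosine_distance is an illustrative helper, and the toy two-dimensional vectors are not real embeddings.

```python
import math


def cosine_distance(a, b):
    # Cosine distance = 1 - (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction → 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal → 1.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction → 2.0
```

Read against the results above, "That is a happy dog" at roughly 0.16 points in nearly the same direction as the query "That is a happy cat", while "Today is a sunny day" at roughly 0.74 is much further away.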