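Semantic search
Semantic search is a search method that helps you find data based on the intent and contextual meaning of a search query, rather than on matching query terms (lexical search).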
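Elasticsearch provides a range of semantic search capabilities that use natural language processing (NLP) and vector search. NLP models let you extract text embeddings from text. Embeddings are vectors that provide a numeric representation of a text. Pieces of content with similar meaning have similar representations.
There are several options for using NLP models in the Elastic Stack:
- the semantic_text workflow (recommended)
- the inference API workflow
- deploying models directly in Elasticsearch
Refer to the "Choosing a semantic search workflow" section to choose your workflow.
You can also store your own embeddings in Elasticsearch. Refer to the "Using the right query" section for guidance on which query type to use for semantic search.
At query time, Elasticsearch can use the same NLP model to convert a query into embeddings, which lets you find documents with similar text embeddings.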
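As a rough illustration of "similar meaning, similar representation", the sketch below ranks a few hand-made vectors by cosine similarity to a query vector. The vectors, their labels, and the helper names are invented for this example; real embeddings are produced by an NLP model and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (not produced by any real model).
toy_embeddings = {
    "sore muscles after a run": [0.9, 0.1, 0.0],
    "aching legs from jogging": [0.8, 0.2, 0.1],
    "calculate fuel cost": [0.0, 0.1, 0.9],
}

def rank_by_similarity(query_vector, embeddings):
    """Return the texts ordered from most to least similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in embeddings.items()]
    return [text for _, text in sorted(scored, reverse=True)]

# Hypothetical embedding of a running-related query.
query_vector = [0.85, 0.15, 0.05]
ranking = rank_by_similarity(query_vector, toy_embeddings)
print(ranking)
```

The two running-related texts rank ahead of the unrelated one, even though the query shares no exact terms with them in this toy setup.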
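Choosing a semantic search workflow
semantic_text workflow
The simplest way to use NLP models in the Elastic Stack is through the semantic_text workflow.
We recommend this approach because it abstracts away much of the manual work.
All you need to do is create an inference endpoint and an index mapping to start ingesting, embedding, and querying data.
There is no need to define model-related settings and parameters, or to create inference ingest pipelines.
Refer to the create an inference endpoint API documentation for a list of supported services.
The "Semantic search with semantic_text" tutorial walks you through the whole process.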
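Inference API workflow
The inference API workflow is more complex but offers greater control over the inference endpoint configuration. You need to create an inference endpoint, provide various model-related settings and parameters, define an index mapping, and set up an inference ingest pipeline with the appropriate settings.
The "Semantic search with the inference API" tutorial walks you through the whole process.
Model deployment workflow
You can also deploy NLP models in Elasticsearch manually, without using an inference endpoint. This is the most complex and labor-intensive workflow for performing semantic search in the Elastic Stack. You need to select an NLP model from the list of supported dense and sparse vector models, deploy it using the Eland client, create an index mapping, and set up a suitable ingest pipeline to start ingesting and querying data.
The "Semantic search with a model deployed in Elasticsearch" tutorial walks you through the whole process.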
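Using the right query
Crafting the right query is crucial for semantic search.
Which query you use and which field you target in your queries depend on the workflow you chose.
If you are using the semantic_text workflow, this is quite simple.
If not, it depends on which type of embeddings you are working with.
| Field type to query | Query to use | Notes |
|---|---|---|
| semantic_text | semantic | Embeddings are generated for you at index time and at query time. |
| sparse_vector | sparse_vector | Use when you bring your own sparse embeddings. |
| dense_vector | knn | Use when you bring your own dense embeddings. |
If you want Elasticsearch to generate embeddings at both index and query time, use the semantic_text field and the semantic query.
If you want to bring your own embeddings, use the sparse_vector or dense_vector field type and the associated query, depending on the NLP model you used to generate the embeddings.
To learn about the easiest way to perform semantic search in the Elastic Stack, refer to the semantic_text end-to-end tutorial.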
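Learn more
- Interactive examples:
  - The elasticsearch-labs repository contains a number of interactive semantic search examples in the form of executable Python notebooks that use the Elasticsearch Python client
  - Semantic search with ELSER using the model deployment workflow
  - Semantic search with semantic_text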
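Tutorial: semantic search with semantic_text
This functionality is in beta and is subject to change. The design and code are less mature than official GA features and are provided as-is with no warranties. Beta features are not subject to the support SLA of official GA features.
This tutorial shows you how to use the semantic text feature to perform semantic search on your data.
Semantic text simplifies the inference workflow by providing inference at ingestion time and setting sensible defaults automatically. You don't need to define model-related settings and parameters, or create inference ingest pipelines.
The recommended way to use semantic search in the Elastic Stack is to follow the semantic_text workflow.
When you need more control over indexing and query settings, you can still use the complete inference workflow (refer to the inference API tutorial to review the process).
This tutorial uses the elser service for demonstration, but you can use any service and its supported models offered by the inference API.
Requirements
To use the semantic_text field type, you must have an inference endpoint deployed in your cluster, created with the create inference API.
Create the inference endpoint
Create an inference endpoint by using the create inference API: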
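PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 3,
      "max_number_of_allocations": 10
    },
    "num_threads": 1
  }
}
- The task type is sparse_embedding, as given in the path.
- The elser service is used in this example.
- The adaptive_allocations setting enables and configures adaptive allocations. Adaptive allocations let ELSER scale resources up or down automatically, based on the current load on the process.
When using the Kibana Console, you might see a 502 Bad Gateway error in the response.
This error usually just reflects a timeout while the model downloads in the background.
You can check the download progress in the Machine Learning UI.
If you use the Python client, you can set the timeout parameter to a higher value.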
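Create the index mapping
The mapping of the destination index, that is, the index that contains the embeddings that the inference endpoint generates from your input text, must be created.
The destination index must have a field with the semantic_text field type to index the output of the inference endpoint.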
PUT semantic-embeddings
{
"mappings": {
"properties": {
"content": {
"type": "semantic_text",
"inference_id": "my-elser-endpoint"
}
}
}
}
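- content is the name of the field that contains the generated embeddings.
- The field that contains the embeddings is a semantic_text field.
- inference_id is the inference endpoint that generates the embeddings, here the my-elser-endpoint endpoint created in the previous step.
If you use a web crawler or a connector to generate indices, you have to update the index mappings for these indices to include the semantic_text field. Once the mapping is updated, you need to run a full web crawl or a full connector sync. This ensures that all existing documents are reprocessed and updated with the new semantic embeddings, which enables semantic search on the updated data.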
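Load data
In this step, you load the data that you later use to create embeddings.
Use the msmarco-passagetest2019-top1000 data set, which is a subset of the MS MARCO Passage Ranking data set. It consists of 200 queries, each accompanied by a list of relevant text passages. All unique passages, along with their IDs, have been extracted from that data set and compiled into a tsv file.
Download the file and upload it to your cluster using the Data Visualizer in the Machine Learning UI.
After your data is analyzed, click Override settings.
Under Edit field names, assign id to the first column and content to the second.
Click Apply, then Import.
Name the index test-data, and click Import.
After the upload is complete, you will see an index named test-data with 182,469 documents.
Reindex the data
Create the embeddings from the text by reindexing the data from the test-data index to the semantic-embeddings index.
The data in the content field is reindexed into the content semantic text field of the destination index.
The reindexed data is processed by the inference endpoint associated with the content semantic text field.
This step uses the reindex API to simulate data ingestion. If you are working with data that has already been indexed, rather than using the test-data set, reindexing is required to ensure that the data is processed by the inference endpoint and the necessary embeddings are generated.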
POST _reindex?wait_for_completion=false
{
"source": {
"index": "test-data",
"size": 10
},
"dest": {
"index": "semantic-embeddings"
}
}
The call returns a task ID that you can use to monitor progress:
GET _tasks/<task_id>
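Reindexing large data sets can take a long time. You can test this workflow using only a subset of the data set. To do so, cancel the reindexing process and generate embeddings only for the subset that was reindexed. The following API request cancels the reindexing task: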
POST _tasks/<task_id>/_cancel
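Semantic search
After the data set has been enriched with the embeddings, you can query the data using semantic search.
Provide the semantic_text field name and the query text in a semantic query.
The inference endpoint used to generate the embeddings for the semantic_text field is also used to process the query text.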
GET semantic-embeddings/_search
{
"query": {
"semantic": {
"field": "content",
"query": "How to avoid muscle soreness while running?"
}
}
}
As a result, you receive the top 10 documents from the semantic-embeddings index that are closest in meaning to the query:
"hits": [
{
"_index": "semantic-embeddings",
"_id": "Jy5065EBBFPLbFsdh_f9",
"_score": 21.487484,
"_source": {
"id": 8836652,
"content": {
"text": "There are a few foods and food groups that will help to fight inflammation and delayed onset muscle soreness (both things that are inevitable after a long, hard workout) when you incorporate them into your postworkout eats, whether immediately after your run or at a meal later in the day. Advertisement. Advertisement.",
"inference": {
"inference_id": "my-elser-endpoint",
"model_settings": {
"task_type": "sparse_embedding"
},
"chunks": [
{
"text": "There are a few foods and food groups that will help to fight inflammation and delayed onset muscle soreness (both things that are inevitable after a long, hard workout) when you incorporate them into your postworkout eats, whether immediately after your run or at a meal later in the day. Advertisement. Advertisement.",
"embeddings": {
(...)
}
}
]
}
}
}
},
{
"_index": "semantic-embeddings",
"_id": "Ji5065EBBFPLbFsdh_f9",
"_score": 18.211695,
"_source": {
"id": 8836651,
"content": {
"text": "During Your Workout. There are a few things you can do during your workout to help prevent muscle injury and soreness. According to personal trainer and writer for Iron Magazine, Marc David, doing warm-ups and cool-downs between sets can help keep muscle soreness to a minimum.",
"inference": {
"inference_id": "my-elser-endpoint",
"model_settings": {
"task_type": "sparse_embedding"
},
"chunks": [
{
"text": "During Your Workout. There are a few things you can do during your workout to help prevent muscle injury and soreness. According to personal trainer and writer for Iron Magazine, Marc David, doing warm-ups and cool-downs between sets can help keep muscle soreness to a minimum.",
"embeddings": {
(...)
}
}
]
}
}
}
},
{
"_index": "semantic-embeddings",
"_id": "Wi5065EBBFPLbFsdh_b9",
"_score": 13.089405,
"_source": {
"id": 8800197,
"content": {
"text": "This is especially important if the soreness is due to a weightlifting routine. For this time period, do not exert more than around 50% of the level of effort (weight, distance and speed) that caused the muscle groups to be sore.",
"inference": {
"inference_id": "my-elser-endpoint",
"model_settings": {
"task_type": "sparse_embedding"
},
"chunks": [
{
"text": "This is especially important if the soreness is due to a weightlifting routine. For this time period, do not exert more than around 50% of the level of effort (weight, distance and speed) that caused the muscle groups to be sore.",
"embeddings": {
(...)
}
}
]
}
}
}
}
]
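Further examples and reading
Tutorial: hybrid search with semantic_text
This tutorial demonstrates how to perform hybrid search, combining semantic search with traditional full-text search.
In hybrid search, semantic search retrieves results based on the meaning of the text, while full-text search focuses on exact word matches. By combining both methods, hybrid search delivers more relevant results, particularly in cases where relying on a single approach may not be sufficient.
The recommended way to use hybrid search in the Elastic Stack is to follow the semantic_text workflow. This tutorial uses the elser service for demonstration, but you can use any service and its supported models offered by the inference API.
Create the inference endpoint
Create an inference endpoint by using the create inference API: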
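PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 3,
      "max_number_of_allocations": 10
    },
    "num_threads": 1
  }
}
- The task type is sparse_embedding, as given in the path.
- The elser service is used in this example.
- The adaptive_allocations setting enables and configures adaptive allocations. Adaptive allocations let ELSER scale resources up or down automatically, based on the current load on the process.
When using the Kibana Console, you might see a 502 Bad Gateway error in the response. This error usually just reflects a timeout while the model downloads in the background. You can check the download progress in the Machine Learning UI.
Create an index mapping for hybrid search
The destination index will contain both the embeddings for semantic search and the original text field for full-text search. This structure makes it possible to combine semantic search and full-text search.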
PUT semantic-embeddings
{
"mappings": {
"properties": {
"semantic_text": {
"type": "semantic_text",
"inference_id": "my-elser-endpoint"
},
"content": {
"type": "text",
"copy_to": "semantic_text"
}
}
}
}
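- semantic_text is the name of the field that stores the generated embeddings for semantic search.
- inference_id is the identifier of the inference endpoint that generates the embeddings from the input text.
- content is the name of the field that contains the original text for lexical search.
- The text stored in the content field is copied to the semantic_text field through copy_to and processed by the inference endpoint.
If you want to run a search on indices that were populated by web crawlers or connectors, you have to update the index mappings for these indices to include the semantic_text field. Once the mapping is updated, you need to run a full web crawl or a full connector sync. This ensures that all existing documents are reprocessed and updated with the new semantic embeddings, which enables hybrid search on the updated data.
Load data
In this step, you load the data that you later use to create embeddings.
Use the msmarco-passagetest2019-top1000 data set, which is a subset of the MS MARCO Passage Ranking data set. It consists of 200 queries, each accompanied by a list of relevant text passages. All unique passages, along with their IDs, have been extracted from that data set and compiled into a tsv file.
Download the file and upload it to your cluster using the Data Visualizer in the Machine Learning UI. After your data is analyzed, click Override settings. Under Edit field names, assign id to the first column and content to the second. Click Apply, then Import. Name the index test-data, and click Import. After the upload is complete, you will see an index named test-data with 182,469 documents.
Reindex the data for hybrid search
Reindex the data from the test-data index into the semantic-embeddings index.
The data in the content field of the source index is copied into the content field of the destination index.
The copy_to parameter set in the index mapping ensures that the content is copied into the semantic_text field. The data is processed by the inference endpoint at ingest time to generate embeddings.
This step uses the reindex API to simulate data ingestion. If you are working with data that has already been indexed, rather than using the test-data set, reindexing is still required to ensure that the data is processed by the inference endpoint and the necessary embeddings are generated.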
POST _reindex?wait_for_completion=false
{
"source": {
"index": "test-data",
"size": 10
},
"dest": {
"index": "semantic-embeddings"
}
}
The call returns a task ID that you can use to monitor progress:
GET _tasks/<task_id>
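Reindexing large data sets can take a long time. You can test this workflow using only a subset of the data set.
To cancel the reindexing process and generate embeddings only for the subset that was reindexed: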
POST _tasks/<task_id>/_cancel
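Perform hybrid search
After reindexing the data into the semantic-embeddings index, you can perform hybrid search by using reciprocal rank fusion (RRF). RRF is a technique that merges the rankings from the semantic and lexical queries, giving more weight to results that rank high in either search. This ensures that the final results are balanced and relevant.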
GET semantic-embeddings/_search
{
"retriever": {
"rrf": {
"retrievers": [
{
"standard": {
"query": {
"match": {
"content": "How to avoid muscle soreness while running?"
}
}
}
},
{
"standard": {
"query": {
"semantic": {
"field": "semantic_text",
"query": "How to avoid muscle soreness while running?"
}
}
}
}
]
}
}
}
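After performing the hybrid search, the query returns the top 10 documents that match both semantic and lexical search criteria. The results include detailed information about each document: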
{
"took": 107,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 473,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "semantic-embeddings",
"_id": "wv65epIBEMBRnhfTsOFM",
"_score": 0.032786883,
"_rank": 1,
"_source": {
"semantic_text": {
"inference": {
"inference_id": "my-elser-endpoint",
"model_settings": {
"task_type": "sparse_embedding"
},
"chunks": [
{
"text": "What so many out there do not realize is the importance of what you do after you work out. You may have done the majority of the work, but how you treat your body in the minutes and hours after you exercise has a direct effect on muscle soreness, muscle strength and growth, and staying hydrated. Cool Down. After your last exercise, your workout is not over. The first thing you need to do is cool down. Even if running was all that you did, you still should do light cardio for a few minutes. This brings your heart rate down at a slow and steady pace, which helps you avoid feeling sick after a workout.",
"embeddings": {
"exercise": 1.571044,
"after": 1.3603843,
"sick": 1.3281639,
"cool": 1.3227621,
"muscle": 1.2645415,
"sore": 1.2561599,
"cooling": 1.2335974,
"running": 1.1750668,
"hours": 1.1104802,
"out": 1.0991782,
"##io": 1.0794281,
"last": 1.0474665,
(...)
}
}
]
}
},
"id": 8408852,
"content": "What so many out there do not realize is the importance of (...)"
}
}
]
}
}
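To make the fusion step more concrete, here is a minimal sketch of reciprocal rank fusion in plain Python. It is an illustration of the general formula, not Elasticsearch's implementation: each document receives 1 / (rank_constant + rank) from every ranked list it appears in. The function name and document IDs are hypothetical, and the rank constant of 60 is assumed here as the commonly documented default. Under that assumption, a document ranked first by both retrievers scores 2/61 ≈ 0.0327869, which is consistent with the _score of the top hit in the response above.

```python
def rrf_scores(rankings, rank_constant=60):
    """Fuse ranked lists of document IDs with reciprocal rank fusion.

    Each document receives 1 / (rank_constant + rank) from every list
    it appears in, where rank starts at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical document IDs returned by the two retrievers.
lexical_ranking = ["doc-a", "doc-b", "doc-c"]
semantic_ranking = ["doc-a", "doc-c", "doc-d"]

fused = rrf_scores([lexical_ranking, semantic_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 6))
```

Documents that appear in both lists (doc-a, doc-c) end up ahead of documents that appear in only one, which is the balancing effect described for hybrid search.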
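Tutorial: semantic search with the inference API
The instructions in this tutorial show you how to use the inference API workflow with various services to perform semantic search on your data.
To learn about the easiest way to perform semantic search in the Elastic Stack, refer to the semantic_text end-to-end tutorial.
The following examples use:
- the embed-english-v3.0 model for Cohere
- the all-mpnet-base-v2 model from HuggingFace
- the text-embedding-ada-002 second generation embedding model for OpenAI
- models available through Azure AI Studio or Azure OpenAI
- the text-embedding-004 model for Google Vertex AI
- the mistral-embed model for Mistral
- the amazon.titan-embed-text-v1 model for Amazon Bedrock
- the ops-text-embedding-zh-001 model for AlibabaCloud AI
You can use any Cohere and OpenAI models; they are all supported by the inference API. For a list of recommended models available on HuggingFace, refer to the supported model list.
Each step below lists the instructions for the individual services in turn.
Requirements
Cohere: a Cohere account is required to use the inference API with the Cohere service.
ELSER: ELSER is a model trained by Elastic. If you have an Elasticsearch deployment, there are no further requirements for using the inference API with the elser service.
HuggingFace: a HuggingFace account is required to use the inference API with the HuggingFace service.
OpenAI: an OpenAI account is required to use the inference API with the OpenAI service.
Azure OpenAI:
- An Azure subscription
- Access granted to Azure OpenAI in the desired Azure subscription. You can apply for access to Azure OpenAI by completing the form at https://aka.ms/oai/access.
- An embedding model deployed in Azure OpenAI Studio.
Azure AI Studio:
- An Azure subscription
- Access to Azure AI Studio
- A deployed embeddings or chat completion model.
Google Vertex AI:
- A Google Cloud account
- A project in Google Cloud
- The Vertex AI API enabled in your project
- A valid service account for the Google Vertex AI API
- The service account must have the Vertex AI User role and the aiplatform.endpoints.predict permission.
Mistral:
- A Mistral account on La Plateforme
- An API key generated for your account
Amazon Bedrock:
- An AWS account with Amazon Bedrock access
- A pair of access and secret keys used to access Amazon Bedrock
Create an inference endpoint
Create an inference endpoint by using the create inference API: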
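PUT _inference/text_embedding/cohere_embeddings
{
  "service": "cohere",
  "service_settings": {
    "api_key": "<api_key>",
    "model_id": "embed-english-v3.0",
    "embedding_type": "byte"
  }
}
- The task type is text_embedding, as given in the path.
- api_key is the API key of your Cohere account. You can find your API keys in the API keys section of your Cohere dashboard. You need to provide your API key only once. The get inference API does not return your API key.
- model_id is the name of the embedding model to use. Refer to the Cohere documentation for the list of Cohere embedding models.
When using this model, the recommended similarity measure to use in the dense_vector field mapping is dot_product. In the case of Cohere models, the embeddings are normalized to unit length, in which case the dot_product and the cosine measures are equivalent.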
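The equivalence of the two measures for unit-length vectors can be checked with a few lines of plain Python. This is only a numerical illustration with made-up vectors and hypothetical helper names; it does not call any embedding service.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up vectors standing in for embeddings.
raw_a = [3.0, 4.0, 0.0]
raw_b = [1.0, 2.0, 2.0]
unit_a = normalize(raw_a)
unit_b = normalize(raw_b)

# For unit-length vectors the two measures coincide.
print(dot(unit_a, unit_b), cosine(unit_a, unit_b))
# For vectors that are not normalized they differ.
print(dot(raw_a, raw_b), cosine(raw_a, raw_b))
```

Because both lengths are 1 after normalization, the denominator in the cosine formula is 1, so the division can be skipped without changing the result.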
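PUT _inference/sparse_embedding/elser_embeddings
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}
You don't need to download and deploy the ELSER model upfront; the API request above downloads the model if it has not been downloaded yet and then deploys it.
When using the Kibana Console, you might see a 502 Bad Gateway error in the response.
This error usually just reflects a timeout while the model downloads in the background.
You can check the download progress in the Machine Learning UI.
If you use the Python client, you can set the timeout parameter to a higher value.
First, you need to create a new inference endpoint on the Hugging Face endpoint page to get an endpoint URL. Select the all-mpnet-base-v2 model on the new endpoint creation page, then select the Sentence Embeddings task under the Advanced configuration section. Create the endpoint. Copy the URL generated once the endpoint initialization has finished; you need it in the following inference API call.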
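PUT _inference/text_embedding/hugging_face_embeddings
{
  "service": "hugging_face",
  "service_settings": {
    "api_key": "<access_token>",
    "url": "<url_endpoint>"
  }
}
- The task type is text_embedding, as given in the path.
- api_key is a valid HuggingFace access token. You can find it on the settings page of your account.
- url is the URL of the inference endpoint you created on Hugging Face.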
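PUT _inference/text_embedding/openai_embeddings
{
  "service": "openai",
  "service_settings": {
    "api_key": "<api_key>",
    "model_id": "text-embedding-ada-002"
  }
}
- The task type is text_embedding, as given in the path.
- api_key is the API key of your OpenAI account. You can find your OpenAI API keys in the API keys section of your OpenAI account. You need to provide your API key only once. The get inference API does not return your API key.
- model_id is the name of the embedding model to use. Refer to the OpenAI documentation for the list of OpenAI embedding models.
When using this model, the recommended similarity measure to use in the dense_vector field mapping is dot_product. In the case of OpenAI models, the embeddings are normalized to unit length, in which case the dot_product and the cosine measures are equivalent.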
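PUT _inference/text_embedding/azure_openai_embeddings
{
  "service": "azureopenai",
  "service_settings": {
    "api_key": "<api_key>",
    "resource_name": "<resource_name>",
    "deployment_id": "<deployment_id>",
    "api_version": "2024-02-01"
  }
}
- The task type is text_embedding, as given in the path.
- api_key is the API key for accessing your Azure OpenAI services. Alternatively, you can provide a different credential here.
- resource_name is the name of your Azure resource.
- deployment_id is the ID of your deployed model.
It may take a few minutes for your model's deployment to become available after it is created. If you try to create the model as above and receive a 404 error message, wait a few minutes and try again. Also, when using this model, the recommended similarity measure to use in the dense_vector field mapping is dot_product. In the case of Azure OpenAI models, the embeddings are normalized to unit length, in which case the dot_product and the cosine measures are equivalent.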
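PUT _inference/text_embedding/azure_ai_studio_embeddings
{
  "service": "azureaistudio",
  "service_settings": {
    "api_key": "<api_key>",
    "target": "<target_uri>",
    "provider": "<provider>",
    "endpoint_type": "<endpoint_type>"
  }
}
- The task type is text_embedding, as given in the path.
- api_key is the API key for accessing your Azure AI Studio deployed model. You can find it on the overview page of your model deployment.
- target is the target URI for accessing your Azure AI Studio deployed model. You can find it on the overview page of your model deployment.
- provider is the model provider.
- endpoint_type is the type of the deployed endpoint.
It may take a few minutes for your model's deployment to become available after it is created. If you try to create the model as above and receive a 404 error message, wait a few minutes and try again. Also, when using this model, the recommended similarity measure to use in the dense_vector field mapping is dot_product.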
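PUT _inference/text_embedding/google_vertex_ai_embeddings
{
  "service": "googlevertexai",
  "service_settings": {
    "service_account_json": "<service_account_json>",
    "model_id": "text-embedding-004",
    "location": "<location>",
    "project_id": "<project_id>"
  }
}
- The task type is text_embedding, as given in the path.
- service_account_json is a valid service account in JSON format for the Google Vertex AI API.
- model_id is the model to use. For a list of the available models, refer to the Text embeddings API page.
- location is the name of the location to use for the inference task. Refer to Generative AI on Vertex AI locations for the available locations.
- project_id is the name of the project to use for the inference task.
Create the index mapping
The mapping of the destination index, that is, the index that contains the embeddings that the model creates from your input text, must be created.
To index the output of the model used, the destination index must have a field with the dense_vector field type for most models, or the sparse_vector field type for sparse vector models such as the one behind the elser service.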
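The choice between the two field types can be summarized in a small helper that builds the mapping body as a Python dictionary. This is a sketch only: the function names are hypothetical, and the field names content_embedding and content simply mirror the mappings shown in this step.

```python
def embedding_field_mapping(kind, dims=None, element_type="float", similarity=None):
    """Return the mapping of an embedding field.

    kind is "dense" (dense_vector) or "sparse" (sparse_vector).
    In this sketch, dims is required for dense fields.
    """
    if kind == "sparse":
        return {"type": "sparse_vector"}
    if kind == "dense":
        if dims is None:
            raise ValueError("dims is required for a dense_vector field in this sketch")
        mapping = {"type": "dense_vector", "dims": dims, "element_type": element_type}
        if similarity is not None:
            mapping["similarity"] = similarity
        return mapping
    raise ValueError(f"unknown embedding kind: {kind!r}")

def index_mapping(embedding_mapping):
    """Wrap the embedding field and the original text field into an index mapping body."""
    return {
        "mappings": {
            "properties": {
                "content_embedding": embedding_mapping,
                "content": {"type": "text"},
            }
        }
    }

# Mirrors the Cohere mapping in this step (1024 dimensions, byte elements).
cohere_mapping = index_mapping(embedding_field_mapping("dense", dims=1024, element_type="byte"))
# A sparse model such as ELSER would use a sparse_vector field instead.
elser_mapping = index_mapping(embedding_field_mapping("sparse"))
print(cohere_mapping)
print(elser_mapping)
```

The resulting dictionaries could be sent as the body of the PUT request that creates the index, for example through a client of your choice.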
PUT cohere-embeddings
{
"mappings": {
"properties": {
"content_embedding": {
"type": "dense_vector",
"dims": 1024,
"element_type": "byte"
},
"content": {
"type": "text"
}
}
}
}
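- content_embedding is the name of the field that contains the generated embeddings. It must be referenced in the inference pipeline configuration in the next step.
- The field that contains the embeddings is a dense_vector field.
- dims is the output dimension of the model. Find this value in the Cohere documentation of the model you use.
- content is the name of the field from which to create the dense vector representation.
- The field type of content is text in this example.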
PUT hugging-face-embeddings
{
"mappings": {
"properties": {
"content_embedding": {
"type": "dense_vector",
"dims": 768,
"element_type": "float"
},
"content": {
"type": "text"
}
}
}
}
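- content_embedding is the name of the field that contains the generated embeddings. It must be referenced in the inference pipeline configuration in the next step.
- The field that contains the embeddings is a dense_vector field.
- dims is the output dimension of the model. Find this value in the HuggingFace model documentation.
- content is the name of the field from which to create the dense vector representation.
- The field type of content is text in this example.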
PUT azure-openai-embeddings
{
"mappings": {
"properties": {
"content_embedding": {
"type": "dense_vector",
"dims": 1536,
"element_type": "float",
"similarity": "dot_product"
},
"content": {
"type": "text"
}
}
}
}
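- content_embedding is the name of the field that contains the generated embeddings. It must be referenced in the inference pipeline configuration in the next step.
- The field that contains the embeddings is a dense_vector field.
- dims is the output dimension of the model. Find this value in the Azure OpenAI documentation of the model you use.
- For Azure OpenAI embeddings, the dot_product similarity should be used.
- content is the name of the field from which to create the dense vector representation.
- The field type of content is text in this example.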
PUT google-vertex-ai-embeddings
{
"mappings": {
"properties": {
"content_embedding": {
"type": "dense_vector",
"dims": 768,
"element_type": "float",
"similarity": "dot_product"
},
"content": {
"type": "text"
}
}
}
}
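- content_embedding is the name of the field that contains the generated embeddings. It must be referenced in the inference pipeline configuration in the next step.
- The field that contains the embeddings is a dense_vector field.
- dims is the output dimension of the model. This value can be found in the Google Vertex AI model reference. If dims is not specified, the inference API attempts to determine the output dimensions automatically.
- For Google Vertex AI embeddings, the dot_product similarity should be used.
- content is the name of the field from which to create the dense vector representation.
- The field type of content is text in this example.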
PUT mistral-embeddings
{
"mappings": {
"properties": {
"content_embedding": {
"type": "dense_vector",
"dims": 1024,
"element_type": "float",
"similarity": "dot_product"
},
"content": {
"type": "text"
}
}
}
}
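- content_embedding is the name of the field that contains the generated embeddings. It must be referenced in the inference pipeline configuration in the next step.
- The field that contains the embeddings is a dense_vector field.
- dims is the output dimension of the model. This value can be found in the Mistral model reference.
- For Mistral embeddings, the dot_product similarity should be used.
- content is the name of the field from which to create the dense vector representation.
- The field type of content is text in this example.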
PUT amazon-bedrock-embeddings
{
"mappings": {
"properties": {
"content_embedding": {
"type": "dense_vector",
"dims": 1024,
"element_type": "float",
"similarity": "dot_product"
},
"content": {
"type": "text"
}
}
}
}
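- content_embedding is the name of the field that contains the generated embeddings. It must be referenced in the inference pipeline configuration in the next step.
- The field that contains the embeddings is a dense_vector field.
- dims is the output dimension of the model. This value may differ depending on the underlying model used. Refer to the Amazon Titan model or the Cohere Embeddings model documentation.
- For Amazon Bedrock embeddings, the dot_product similarity should be used.
- content is the name of the field from which to create the dense vector representation.
- The field type of content is text in this example.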
PUT alibabacloud-ai-search-embeddings
{
"mappings": {
"properties": {
"content_embedding": {
"type": "dense_vector",
"dims": 1024,
"element_type": "float"
},
"content": {
"type": "text"
}
}
}
}
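- content_embedding is the name of the field that contains the generated embeddings. It must be referenced in the inference pipeline configuration in the next step.
- The field that contains the embeddings is a dense_vector field.
- dims is the output dimension of the model. This value may differ depending on the underlying model used. Refer to the AlibabaCloud AI Search embedding model documentation.
- content is the name of the field from which to create the dense vector representation.
- The field type of content is text in this example.
Create an ingest pipeline with an inference processor
Create an ingest pipeline with an inference processor, and use the model you created above to run inference on the data that is being ingested through the pipeline.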
PUT _ingest/pipeline/cohere_embeddings_pipeline
{
"processors": [
{
"inference": {
"model_id": "cohere_embeddings",
"input_output": {
"input_field": "content",
"output_field": "content_embedding"
}
}
}
]
}
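- model_id is the name of the inference endpoint you created by using the create inference API, where it is referred to as the inference_id.
- input_output is the configuration object that defines the input_field for the inference process and the output_field that will contain the inference results.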
PUT _ingest/pipeline/elser_embeddings_pipeline
{
"processors": [
{
"inference": {
"model_id": "elser_embeddings",
"input_output": {
"input_field": "content",
"output_field": "content_embedding"
}
}
}
]
}
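- model_id is the name of the inference endpoint you created by using the create inference API, where it is referred to as the inference_id.
- input_output is the configuration object that defines the input_field for the inference process and the output_field that will contain the inference results.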
PUT _ingest/pipeline/hugging_face_embeddings_pipeline
{
"processors": [
{
"inference": {
"model_id": "hugging_face_embeddings",
"input_output": {
"input_field": "content",
"output_field": "content_embedding"
}
}
}
]
}
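- model_id is the name of the inference endpoint you created by using the create inference API, where it is referred to as the inference_id.
- input_output is the configuration object that defines the input_field for the inference process and the output_field that will contain the inference results.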
PUT _ingest/pipeline/openai_embeddings_pipeline
{
"processors": [
{
"inference": {
"model_id": "openai_embeddings",
"input_output": {
"input_field": "content",
"output_field": "content_embedding"
}
}
}
]
}
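- model_id is the name of the inference endpoint you created by using the create inference API, where it is referred to as the inference_id.
- input_output is the configuration object that defines the input_field for the inference process and the output_field that will contain the inference results.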
PUT _ingest/pipeline/azure_openai_embeddings_pipeline
{
"processors": [
{
"inference": {
"model_id": "azure_openai_embeddings",
"input_output": {
"input_field": "content",
"output_field": "content_embedding"
}
}
}
]
}
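- model_id is the name of the inference endpoint you created by using the create inference API, where it is referred to as the inference_id.
- input_output is the configuration object that defines the input_field for the inference process and the output_field that will contain the inference results.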
PUT _ingest/pipeline/azure_ai_studio_embeddings_pipeline
{
"processors": [
{
"inference": {
"model_id": "azure_ai_studio_embeddings",
"input_output": {
"input_field": "content",
"output_field": "content_embedding"
}
}
}
]
}
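- model_id is the name of the inference endpoint you created by using the create inference API, where it is referred to as the inference_id.
- input_output is the configuration object that defines the input_field for the inference process and the output_field that will contain the inference results.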
PUT _ingest/pipeline/google_vertex_ai_embeddings_pipeline
{
"processors": [
{
"inference": {
"model_id": "google_vertex_ai_embeddings",
"input_output": {
"input_field": "content",
"output_field": "content_embedding"
}
}
}
]
}
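- model_id is the name of the inference endpoint you created by using the create inference API, where it is referred to as the inference_id.
- input_output is the configuration object that defines the input_field for the inference process and the output_field that will contain the inference results.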
PUT _ingest/pipeline/mistral_embeddings_pipeline
{
"processors": [
{
"inference": {
"model_id": "mistral_embeddings",
"input_output": {
"input_field": "content",
"output_field": "content_embedding"
}
}
}
]
}
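- model_id is the name of the inference endpoint you created by using the create inference API, where it is referred to as the inference_id.
- input_output is the configuration object that defines the input_field for the inference process and the output_field that will contain the inference results.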
PUT _ingest/pipeline/amazon_bedrock_embeddings_pipeline
{
"processors": [
{
"inference": {
"model_id": "amazon_bedrock_embeddings",
"input_output": {
"input_field": "content",
"output_field": "content_embedding"
}
}
}
]
}
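- model_id is the name of the inference endpoint you created by using the create inference API, where it is referred to as the inference_id.
- input_output is the configuration object that defines the input_field for the inference process and the output_field that will contain the inference results.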
PUT _ingest/pipeline/alibabacloud_ai_search_embeddings_pipeline
{
"processors": [
{
"inference": {
"model_id": "alibabacloud_ai_search_embeddings",
"input_output": {
"input_field": "content",
"output_field": "content_embedding"
}
}
}
]
}
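- model_id is the name of the inference endpoint you created by using the create inference API, where it is referred to as the inference_id.
- input_output is the configuration object that defines the input_field for the inference process and the output_field that will contain the inference results.
Load data
In this step, you load the data that you later use in the inference ingest pipeline to create embeddings.
Use the msmarco-passagetest2019-top1000 data set, which is a subset of the MS MARCO Passage Ranking data set. It consists of 200 queries, each accompanied by a list of relevant text passages. All unique passages, along with their IDs, have been extracted from that data set and compiled into a tsv file.
Download the file and upload it to your cluster using the Data Visualizer in the Machine Learning UI.
After your data is analyzed, click Override settings.
Under Edit field names, assign id to the first column and content to the second.
Click Apply, then Import.
Name the index test-data, and click Import.
After the upload is complete, you will see an index named test-data with 182,469 documents.
Ingest the data through the inference ingest pipeline
Create embeddings from the text by reindexing the data through the inference pipeline that uses your chosen model. This step uses the reindex API to simulate data ingestion through a pipeline.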
POST _reindex?wait_for_completion=false
{
"source": {
"index": "test-data",
"size": 50
},
"dest": {
"index": "cohere-embeddings",
"pipeline": "cohere_embeddings_pipeline"
}
}
The rate limit of your Cohere account may affect the throughput of the reindexing process.
POST _reindex?wait_for_completion=false
{
"source": {
"index": "test-data",
"size": 50
},
"dest": {
"index": "openai-embeddings",
"pipeline": "openai_embeddings_pipeline"
}
}
The rate limit of your OpenAI account may affect the throughput of the reindexing process. If this happens, change size to 3 or a value of similar magnitude.
POST _reindex?wait_for_completion=false
{
"source": {
"index": "test-data",
"size": 50
},
"dest": {
"index": "azure-openai-embeddings",
"pipeline": "azure_openai_embeddings_pipeline"
}
}
The rate limit of your Azure OpenAI account may affect the throughput of the reindexing process. If this happens, change size to 3 or a value of similar magnitude.
POST _reindex?wait_for_completion=false
{
"source": {
"index": "test-data",
"size": 50
},
"dest": {
"index": "azure-ai-studio-embeddings",
"pipeline": "azure_ai_studio_embeddings_pipeline"
}
}
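Your Azure AI Studio model deployment may have rate limits in place that affect the throughput of the reindexing process. If this happens, change size to 3 or a value of similar magnitude.
The call returns a task ID that you can use to monitor progress: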
GET _tasks/<task_id>
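Reindexing large data sets can take a long time. You can test this workflow using only a subset of the data set. To do so, cancel the reindexing process and generate embeddings only for the subset that was reindexed. The following API request cancels the reindexing task: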
POST _tasks/<task_id>/_cancel
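Semantic search
After the data set has been enriched with the embeddings, you can query the data using semantic search.
For dense vector models, pass a query_vector_builder to the k-nearest neighbor (kNN) vector search API, and provide the query text and the model you used to create the embeddings.
For a sparse vector model such as ELSER, use a sparse_vector query, and provide the query text together with the model you used to create the embeddings.
If you cancelled the reindexing process, your query runs against only part of the data, which affects the quality of your results.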
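The two request shapes used in the examples that follow can be produced by one small helper. The sketch below only builds the request body as a Python dictionary, copying the structure of the kNN and sparse_vector requests in this section; the function name and its parameters are hypothetical, and sending the body to the _search endpoint is left to the client of your choice.

```python
import json

def build_search_body(embedding_kind, field, inference_id, query_text, k=10, num_candidates=100):
    """Build a _search request body for a dense or sparse embedding field."""
    if embedding_kind == "dense":
        # kNN search: the query text is embedded by the named inference endpoint.
        return {
            "knn": {
                "field": field,
                "query_vector_builder": {
                    "text_embedding": {
                        "model_id": inference_id,
                        "model_text": query_text,
                    }
                },
                "k": k,
                "num_candidates": num_candidates,
            }
        }
    if embedding_kind == "sparse":
        # sparse_vector query: used for sparse models such as ELSER.
        return {
            "query": {
                "sparse_vector": {
                    "field": field,
                    "inference_id": inference_id,
                    "query": query_text,
                }
            }
        }
    raise ValueError(f"unknown embedding kind: {embedding_kind!r}")

dense_body = build_search_body(
    "dense", "content_embedding", "cohere_embeddings", "Muscles in human body"
)
sparse_body = build_search_body(
    "sparse", "content_embedding", "elser_embeddings", "How to avoid muscle soreness after running?"
)
print(json.dumps(dense_body, indent=2))
print(json.dumps(sparse_body, indent=2))
```

Keeping the endpoint name as a parameter makes it harder to query an index with a different model than the one that created its embeddings.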
GET cohere-embeddings/_search
{
"knn": {
"field": "content_embedding",
"query_vector_builder": {
"text_embedding": {
"model_id": "cohere_embeddings",
"model_text": "Muscles in human body"
}
},
"k": 10,
"num_candidates": 100
},
"_source": [
"id",
"content"
]
}
As a result, you receive the top 10 documents from the cohere-embeddings index that are closest in meaning to the query, sorted by their proximity to the query:
"hits": [
{
"_index": "cohere-embeddings",
"_id": "-eFWCY4BECzWLnMZuI78",
"_score": 0.737484,
"_source": {
"id": 1690948,
"content": "Oxygen is supplied to the muscles via red blood cells. Red blood cells carry hemoglobin which oxygen bonds with as the hemoglobin rich blood cells pass through the blood vessels of the lungs.The now oxygen rich blood cells carry that oxygen to the cells that are demanding it, in this case skeletal muscle cells.ther ways in which muscles are supplied with oxygen include: 1 Blood flow from the heart is increased. 2 Blood flow to your muscles in increased. 3 Blood flow from nonessential organs is transported to working muscles."
}
},
{
"_index": "cohere-embeddings",
"_id": "HuFWCY4BECzWLnMZuI_8",
"_score": 0.7176013,
"_source": {
"id": 1692482,
"content": "The thoracic cavity is separated from the abdominal cavity by the diaphragm. This is a broad flat muscle. (muscular) diaphragm The diaphragm is a muscle that separat…e the thoracic from the abdominal cavity. The pelvis is the lowest part of the abdominal cavity and it has no physical separation from it Diaphragm."
}
},
{
"_index": "cohere-embeddings",
"_id": "IOFWCY4BECzWLnMZuI_8",
"_score": 0.7154432,
"_source": {
"id": 1692489,
"content": "Muscular Wall Separating the Abdominal and Thoracic Cavities; Thoracic Cavity of a Fetal Pig; In Mammals the Diaphragm Separates the Abdominal Cavity from the"
}
},
{
"_index": "cohere-embeddings",
"_id": "C-FWCY4BECzWLnMZuI_8",
"_score": 0.695313,
"_source": {
"id": 1691493,
"content": "Burning, aching, tenderness and stiffness are just some descriptors of the discomfort you may feel in the muscles you exercised one to two days ago.For the most part, these sensations you experience after exercise are collectively known as delayed onset muscle soreness.urning, aching, tenderness and stiffness are just some descriptors of the discomfort you may feel in the muscles you exercised one to two days ago."
}
},
(...)
]
GET elser-embeddings/_search
{
"query":{
"sparse_vector":{
"field": "content_embedding",
"inference_id": "elser_embeddings",
"query": "How to avoid muscle soreness after running?"
}
},
"_source": [
"id",
"content"
]
}
As a result, you receive the top 10 documents from the elser-embeddings index that are closest in meaning to the query, sorted by their proximity to the query:
"hits": [
{
"_index": "elser-embeddings",
"_id": "ZLGc_pABZbBmsu5_eCoH",
"_score": 21.472063,
"_source": {
"id": 2258240,
"content": "You may notice some muscle aches while you are exercising. This is called acute soreness. More often, you may begin to feel sore about 12 hours after exercising, and the discomfort usually peaks at 48 to 72 hours after exercise. This is called delayed-onset muscle soreness.It is thought that, during this time, your body is repairing the muscle, making it stronger and bigger.You may also notice the muscles feel better if you exercise lightly. This is normal.his is called delayed-onset muscle soreness. It is thought that, during this time, your body is repairing the muscle, making it stronger and bigger. You may also notice the muscles feel better if you exercise lightly. This is normal."
}
},
{
"_index": "elser-embeddings",
"_id": "ZbGc_pABZbBmsu5_eCoH",
"_score": 21.421381,
"_source": {
"id": 2258242,
"content": "Photo Credit Jupiterimages/Stockbyte/Getty Images. That stiff, achy feeling you get in the days after exercise is a normal physiological response known as delayed onset muscle soreness. You can take it as a positive sign that your muscles have felt the workout, but the pain may also turn you off to further exercise.ou are more likely to develop delayed onset muscle soreness if you are new to working out, if you’ve gone a long time without exercising and start up again, if you have picked up a new type of physical activity or if you have recently boosted the intensity, length or frequency of your exercise sessions."
}
},
{
"_index": "elser-embeddings",
"_id": "ZrGc_pABZbBmsu5_eCoH",
"_score": 20.542095,
"_source": {
"id": 2258248,
"content": "They found that stretching before and after exercise has no effect on muscle soreness. Exercise might cause inflammation, which leads to an increase in the production of immune cells (comprised mostly of macrophages and neutrophils). Levels of these immune cells reach a peak 24-48 hours after exercise.These cells, in turn, produce bradykinins and prostaglandins, which make the pain receptors in your body more sensitive. Whenever you move, these pain receptors are stimulated.hey found that stretching before and after exercise has no effect on muscle soreness. Exercise might cause inflammation, which leads to an increase in the production of immune cells (comprised mostly of macrophages and neutrophils). Levels of these immune cells reach a peak 24-48 hours after exercise."
}
},
(...)
]
GET hugging-face-embeddings/_search
{
"knn": {
"field": "content_embedding",
"query_vector_builder": {
"text_embedding": {
"model_id": "hugging_face_embeddings",
"model_text": "What's margin of error?"
}
},
"k": 10,
"num_candidates": 100
},
"_source": [
"id",
"content"
]
}
As a result, you receive the top 10 documents from the hugging-face-embeddings index that are closest in meaning to the query, sorted by their proximity to the query:
"hits": [
{
"_index": "hugging-face-embeddings",
"_id": "ljEfo44BiUQvMpPgT20E",
"_score": 0.8522128,
"_source": {
"id": 7960255,
"content": "The margin of error can be defined by either of the following equations. Margin of error = Critical value x Standard deviation of the statistic. Margin of error = Critical value x Standard error of the statistic. If you know the standard deviation of the statistic, use the first equation to compute the margin of error. Otherwise, use the second equation. Previously, we described how to compute the standard deviation and standard error."
}
},
{
"_index": "hugging-face-embeddings",
"_id": "lzEfo44BiUQvMpPgT20E",
"_score": 0.7865497,
"_source": {
"id": 7960259,
"content": "1 y ou are told only the size of the sample and are asked to provide the margin of error for percentages which are not (yet) known. 2 This is typically the case when you are computing the margin of error for a survey which is going to be conducted in the future."
}
},
{
"_index": "hugging-face-embeddings1",
"_id": "DjEfo44BiUQvMpPgT20E",
"_score": 0.6229427,
"_source": {
"id": 2166183,
"content": "1. In general, the point at which gains equal losses. 2. In options, the market price that a stock must reach for option buyers to avoid a loss if they exercise. For a call, it is the strike price plus the premium paid. For a put, it is the strike price minus the premium paid."
}
},
{
"_index": "hugging-face-embeddings1",
"_id": "VzEfo44BiUQvMpPgT20E",
"_score": 0.6034223,
"_source": {
"id": 2173417,
"content": "How do you find the area of a circle? Can you measure the area of a circle and use that to find a value for Pi?"
}
},
(...)
]
GET openai-embeddings/_search
{
"knn": {
"field": "content_embedding",
"query_vector_builder": {
"text_embedding": {
"model_id": "openai_embeddings",
"model_text": "Calculate fuel cost"
}
},
"k": 10,
"num_candidates": 100
},
"_source": [
"id",
"content"
]
}
As a result, you receive the top 10 documents from the openai-embeddings index that are closest in meaning to the query, sorted by their proximity to the query:
"hits": [
{
"_index": "openai-embeddings",
"_id": "DDd5OowBHxQKHyc3TDSC",
"_score": 0.83704096,
"_source": {
"id": 862114,
"body": "How to calculate fuel cost for a road trip. By Tara Baukus Mello • Bankrate.com. Dear Driving for Dollars, My family is considering taking a long road trip to finish off the end of the summer, but I'm a little worried about gas prices and our overall fuel cost.It doesn't seem easy to calculate since we'll be traveling through many states and we are considering several routes.y family is considering taking a long road trip to finish off the end of the summer, but I'm a little worried about gas prices and our overall fuel cost. It doesn't seem easy to calculate since we'll be traveling through many states and we are considering several routes."
}
},
{
"_index": "openai-embeddings",
"_id": "ajd5OowBHxQKHyc3TDSC",
"_score": 0.8345704,
"_source": {
"id": 820622,
"body": "Home Heating Calculator. Typically, approximately 50% of the energy consumed in a home annually is for space heating. When deciding on a heating system, many factors will come into play: cost of fuel, installation cost, convenience and life style are all important.This calculator can help you estimate the cost of fuel for different heating appliances.hen deciding on a heating system, many factors will come into play: cost of fuel, installation cost, convenience and life style are all important. This calculator can help you estimate the cost of fuel for different heating appliances."
}
},
{
"_index": "openai-embeddings",
"_id": "Djd5OowBHxQKHyc3TDSC",
"_score": 0.8327426,
"_source": {
"id": 8202683,
"body": "Fuel is another important cost. This cost will depend on your boat, how far you travel, and how fast you travel. A 33-foot sailboat traveling at 7 knots should be able to travel 300 miles on 50 gallons of diesel fuel.If you are paying $4 per gallon, the trip would cost you $200.Most boats have much larger gas tanks than cars.uel is another important cost. This cost will depend on your boat, how far you travel, and how fast you travel. A 33-foot sailboat traveling at 7 knots should be able to travel 300 miles on 50 gallons of diesel fuel."
}
},
(...)
]
GET azure-openai-embeddings/_search
{
"knn": {
"field": "content_embedding",
"query_vector_builder": {
"text_embedding": {
"model_id": "azure_openai_embeddings",
"model_text": "Calculate fuel cost"
}
},
"k": 10,
"num_candidates": 100
},
"_source": [
"id",
"content"
]
}
As a result, you receive the top 10 documents from the azure-openai-embeddings index that are closest in meaning to the query, sorted by their proximity to the query:
"hits": [
{
"_index": "azure-openai-embeddings",
"_id": "DDd5OowBHxQKHyc3TDSC",
"_score": 0.83704096,
"_source": {
"id": 862114,
"body": "How to calculate fuel cost for a road trip. By Tara Baukus Mello • Bankrate.com. Dear Driving for Dollars, My family is considering taking a long road trip to finish off the end of the summer, but I'm a little worried about gas prices and our overall fuel cost.It doesn't seem easy to calculate since we'll be traveling through many states and we are considering several routes.y family is considering taking a long road trip to finish off the end of the summer, but I'm a little worried about gas prices and our overall fuel cost. It doesn't seem easy to calculate since we'll be traveling through many states and we are considering several routes."
}
},
{
"_index": "azure-openai-embeddings",
"_id": "ajd5OowBHxQKHyc3TDSC",
"_score": 0.8345704,
"_source": {
"id": 820622,
"body": "Home Heating Calculator. Typically, approximately 50% of the energy consumed in a home annually is for space heating. When deciding on a heating system, many factors will come into play: cost of fuel, installation cost, convenience and life style are all important.This calculator can help you estimate the cost of fuel for different heating appliances.hen deciding on a heating system, many factors will come into play: cost of fuel, installation cost, convenience and life style are all important. This calculator can help you estimate the cost of fuel for different heating appliances."
}
},
{
"_index": "azure-openai-embeddings",
"_id": "Djd5OowBHxQKHyc3TDSC",
"_score": 0.8327426,
"_source": {
"id": 8202683,
"body": "Fuel is another important cost. This cost will depend on your boat, how far you travel, and how fast you travel. A 33-foot sailboat traveling at 7 knots should be able to travel 300 miles on 50 gallons of diesel fuel.If you are paying $4 per gallon, the trip would cost you $200.Most boats have much larger gas tanks than cars.uel is another important cost. This cost will depend on your boat, how far you travel, and how fast you travel. A 33-foot sailboat traveling at 7 knots should be able to travel 300 miles on 50 gallons of diesel fuel."
}
},
(...)
]
GET azure-ai-studio-embeddings/_search
{
"knn": {
"field": "content_embedding",
"query_vector_builder": {
"text_embedding": {
"model_id": "azure_ai_studio_embeddings",
"model_text": "Calculate fuel cost"
}
},
"k": 10,
"num_candidates": 100
},
"_source": [
"id",
"content"
]
}
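As a result, you receive the top 10 documents that are closest in meaning to the query from the azure-ai-studio-embeddings index, sorted by their proximity to the query: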
"hits": [
{
"_index": "azure-ai-studio-embeddings",
"_id": "DDd5OowBHxQKHyc3TDSC",
"_score": 0.83704096,
"_source": {
"id": 862114,
"body": "How to calculate fuel cost for a road trip. By Tara Baukus Mello • Bankrate.com. Dear Driving for Dollars, My family is considering taking a long road trip to finish off the end of the summer, but I'm a little worried about gas prices and our overall fuel cost.It doesn't seem easy to calculate since we'll be traveling through many states and we are considering several routes.y family is considering taking a long road trip to finish off the end of the summer, but I'm a little worried about gas prices and our overall fuel cost. It doesn't seem easy to calculate since we'll be traveling through many states and we are considering several routes."
}
},
{
"_index": "azure-ai-studio-embeddings",
"_id": "ajd5OowBHxQKHyc3TDSC",
"_score": 0.8345704,
"_source": {
"id": 820622,
"body": "Home Heating Calculator. Typically, approximately 50% of the energy consumed in a home annually is for space heating. When deciding on a heating system, many factors will come into play: cost of fuel, installation cost, convenience and life style are all important.This calculator can help you estimate the cost of fuel for different heating appliances.hen deciding on a heating system, many factors will come into play: cost of fuel, installation cost, convenience and life style are all important. This calculator can help you estimate the cost of fuel for different heating appliances."
}
},
{
"_index": "azure-ai-studio-embeddings",
"_id": "Djd5OowBHxQKHyc3TDSC",
"_score": 0.8327426,
"_source": {
"id": 8202683,
"body": "Fuel is another important cost. This cost will depend on your boat, how far you travel, and how fast you travel. A 33-foot sailboat traveling at 7 knots should be able to travel 300 miles on 50 gallons of diesel fuel.If you are paying $4 per gallon, the trip would cost you $200.Most boats have much larger gas tanks than cars.uel is another important cost. This cost will depend on your boat, how far you travel, and how fast you travel. A 33-foot sailboat traveling at 7 knots should be able to travel 300 miles on 50 gallons of diesel fuel."
}
},
(...)
]
GET google-vertex-ai-embeddings/_search
{
"knn": {
"field": "content_embedding",
"query_vector_builder": {
"text_embedding": {
"model_id": "google_vertex_ai_embeddings",
"model_text": "Calculate fuel cost"
}
},
"k": 10,
"num_candidates": 100
},
"_source": [
"id",
"content"
]
}
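As a result, you receive the top 10 documents that are closest in meaning to the query from the google-vertex-ai-embeddings index, sorted by their proximity to the query: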
"hits": [
{
"_index": "google-vertex-ai-embeddings",
"_id": "Ryv0nZEBBFPLbFsdCbGn",
"_score": 0.86815524,
"_source": {
"id": 3041038,
"content": "For example, the cost of the fuel could be 96.9, the amount could be 10 pounds, and the distance covered could be 80 miles. To convert between Litres per 100KM and Miles Per Gallon, please provide a value and click on the required button.o calculate how much fuel you'll need for a given journey, please provide the distance in miles you will be covering on your journey, and the estimated MPG of your vehicle. To work out what MPG you are really getting, please provide the cost of the fuel, how much you spent on the fuel, and how far it took you."
}
},
{
"_index": "google-vertex-ai-embeddings",
"_id": "w4j0nZEBZ1nFq1oiHQvK",
"_score": 0.8676357,
"_source": {
"id": 1541469,
"content": "This driving cost calculator takes into consideration the fuel economy of the vehicle that you are travelling in as well as the fuel cost. This road trip gas calculator will give you an idea of how much would it cost to drive before you actually travel.his driving cost calculator takes into consideration the fuel economy of the vehicle that you are travelling in as well as the fuel cost. This road trip gas calculator will give you an idea of how much would it cost to drive before you actually travel."
}
},
{
"_index": "google-vertex-ai-embeddings",
"_id": "Hoj0nZEBZ1nFq1oiHQjJ",
"_score": 0.80510974,
"_source": {
"id": 7982559,
"content": "What's that light cost you? 1 Select your electric rate (or click to enter your own). 2 You can calculate results for up to four types of lights. 3 Select the type of lamp (i.e. 4 Select the lamp wattage (lamp lumens). 5 Enter the number of lights in use. 6 Select how long the lamps are in use (or click to enter your own; enter hours on per year). 7 Finally, ..."
}
},
(...)
]
GET mistral-embeddings/_search
{
"knn": {
"field": "content_embedding",
"query_vector_builder": {
"text_embedding": {
"model_id": "mistral_embeddings",
"model_text": "Calculate fuel cost"
}
},
"k": 10,
"num_candidates": 100
},
"_source": [
"id",
"content"
]
}
As a result, you receive the top 10 documents that are closest in meaning to the query from the mistral-embeddings index, sorted by their proximity to the query:
"hits": [
{
"_index": "mistral-embeddings",
"_id": "DDd5OowBHxQKHyc3TDSC",
"_score": 0.83704096,
"_source": {
"id": 862114,
"body": "How to calculate fuel cost for a road trip. By Tara Baukus Mello • Bankrate.com. Dear Driving for Dollars, My family is considering taking a long road trip to finish off the end of the summer, but I'm a little worried about gas prices and our overall fuel cost.It doesn't seem easy to calculate since we'll be traveling through many states and we are considering several routes.y family is considering taking a long road trip to finish off the end of the summer, but I'm a little worried about gas prices and our overall fuel cost. It doesn't seem easy to calculate since we'll be traveling through many states and we are considering several routes."
}
},
{
"_index": "mistral-embeddings",
"_id": "ajd5OowBHxQKHyc3TDSC",
"_score": 0.8345704,
"_source": {
"id": 820622,
"body": "Home Heating Calculator. Typically, approximately 50% of the energy consumed in a home annually is for space heating. When deciding on a heating system, many factors will come into play: cost of fuel, installation cost, convenience and life style are all important.This calculator can help you estimate the cost of fuel for different heating appliances.hen deciding on a heating system, many factors will come into play: cost of fuel, installation cost, convenience and life style are all important. This calculator can help you estimate the cost of fuel for different heating appliances."
}
},
{
"_index": "mistral-embeddings",
"_id": "Djd5OowBHxQKHyc3TDSC",
"_score": 0.8327426,
"_source": {
"id": 8202683,
"body": "Fuel is another important cost. This cost will depend on your boat, how far you travel, and how fast you travel. A 33-foot sailboat traveling at 7 knots should be able to travel 300 miles on 50 gallons of diesel fuel.If you are paying $4 per gallon, the trip would cost you $200.Most boats have much larger gas tanks than cars.uel is another important cost. This cost will depend on your boat, how far you travel, and how fast you travel. A 33-foot sailboat traveling at 7 knots should be able to travel 300 miles on 50 gallons of diesel fuel."
}
},
(...)
]
GET amazon-bedrock-embeddings/_search
{
"knn": {
"field": "content_embedding",
"query_vector_builder": {
"text_embedding": {
"model_id": "amazon_bedrock_embeddings",
"model_text": "Calculate fuel cost"
}
},
"k": 10,
"num_candidates": 100
},
"_source": [
"id",
"content"
]
}
As a result, you receive the top 10 documents that are closest in meaning to the query from the amazon-bedrock-embeddings index, sorted by their proximity to the query:
"hits": [
{
"_index": "amazon-bedrock-embeddings",
"_id": "DDd5OowBHxQKHyc3TDSC",
"_score": 0.83704096,
"_source": {
"id": 862114,
"body": "How to calculate fuel cost for a road trip. By Tara Baukus Mello • Bankrate.com. Dear Driving for Dollars, My family is considering taking a long road trip to finish off the end of the summer, but I'm a little worried about gas prices and our overall fuel cost.It doesn't seem easy to calculate since we'll be traveling through many states and we are considering several routes.y family is considering taking a long road trip to finish off the end of the summer, but I'm a little worried about gas prices and our overall fuel cost. It doesn't seem easy to calculate since we'll be traveling through many states and we are considering several routes."
}
},
{
"_index": "amazon-bedrock-embeddings",
"_id": "ajd5OowBHxQKHyc3TDSC",
"_score": 0.8345704,
"_source": {
"id": 820622,
"body": "Home Heating Calculator. Typically, approximately 50% of the energy consumed in a home annually is for space heating. When deciding on a heating system, many factors will come into play: cost of fuel, installation cost, convenience and life style are all important.This calculator can help you estimate the cost of fuel for different heating appliances.hen deciding on a heating system, many factors will come into play: cost of fuel, installation cost, convenience and life style are all important. This calculator can help you estimate the cost of fuel for different heating appliances."
}
},
{
"_index": "amazon-bedrock-embeddings",
"_id": "Djd5OowBHxQKHyc3TDSC",
"_score": 0.8327426,
"_source": {
"id": 8202683,
"body": "Fuel is another important cost. This cost will depend on your boat, how far you travel, and how fast you travel. A 33-foot sailboat traveling at 7 knots should be able to travel 300 miles on 50 gallons of diesel fuel.If you are paying $4 per gallon, the trip would cost you $200.Most boats have much larger gas tanks than cars.uel is another important cost. This cost will depend on your boat, how far you travel, and how fast you travel. A 33-foot sailboat traveling at 7 knots should be able to travel 300 miles on 50 gallons of diesel fuel."
}
},
(...)
]
GET alibabacloud-ai-search-embeddings/_search
{
"knn": {
"field": "content_embedding",
"query_vector_builder": {
"text_embedding": {
"model_id": "alibabacloud_ai_search_embeddings",
"model_text": "Calculate fuel cost"
}
},
"k": 10,
"num_candidates": 100
},
"_source": [
"id",
"content"
]
}
As a result, you receive the top 10 documents that are closest in meaning to the query from the alibabacloud-ai-search-embeddings index, sorted by their proximity to the query:
"hits": [
{
"_index": "alibabacloud-ai-search-embeddings",
"_id": "DDd5OowBHxQKHyc3TDSC",
"_score": 0.83704096,
"_source": {
"id": 862114,
"body": "How to calculate fuel cost for a road trip. By Tara Baukus Mello • Bankrate.com. Dear Driving for Dollars, My family is considering taking a long road trip to finish off the end of the summer, but I'm a little worried about gas prices and our overall fuel cost.It doesn't seem easy to calculate since we'll be traveling through many states and we are considering several routes.y family is considering taking a long road trip to finish off the end of the summer, but I'm a little worried about gas prices and our overall fuel cost. It doesn't seem easy to calculate since we'll be traveling through many states and we are considering several routes."
}
},
{
"_index": "alibabacloud-ai-search-embeddings",
"_id": "ajd5OowBHxQKHyc3TDSC",
"_score": 0.8345704,
"_source": {
"id": 820622,
"body": "Home Heating Calculator. Typically, approximately 50% of the energy consumed in a home annually is for space heating. When deciding on a heating system, many factors will come into play: cost of fuel, installation cost, convenience and life style are all important.This calculator can help you estimate the cost of fuel for different heating appliances.hen deciding on a heating system, many factors will come into play: cost of fuel, installation cost, convenience and life style are all important. This calculator can help you estimate the cost of fuel for different heating appliances."
}
},
{
"_index": "alibabacloud-ai-search-embeddings",
"_id": "Djd5OowBHxQKHyc3TDSC",
"_score": 0.8327426,
"_source": {
"id": 8202683,
"body": "Fuel is another important cost. This cost will depend on your boat, how far you travel, and how fast you travel. A 33-foot sailboat traveling at 7 knots should be able to travel 300 miles on 50 gallons of diesel fuel.If you are paying $4 per gallon, the trip would cost you $200.Most boats have much larger gas tanks than cars.uel is another important cost. This cost will depend on your boat, how far you travel, and how fast you travel. A 33-foot sailboat traveling at 7 knots should be able to travel 300 miles on 50 gallons of diesel fuel."
}
},
(...)
]
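The kNN requests above differ only in the index name and the inference endpoint ID. The following is a minimal Python sketch of a helper that builds such a request body. The function name `build_knn_search` is hypothetical, and sending the body with `client.search(index=..., body=body)` assumes a connected Elasticsearch Python client, so that call appears only as a comment.

```python
def build_knn_search(inference_id, query_text, k=10, num_candidates=100):
    """Build the body of a kNN search that embeds the query text at search time."""
    return {
        "knn": {
            "field": "content_embedding",
            "query_vector_builder": {
                "text_embedding": {
                    "model_id": inference_id,
                    "model_text": query_text,
                }
            },
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": ["id", "content"],
    }


body = build_knn_search("mistral_embeddings", "Calculate fuel cost")
# Assumption: with a connected client, the request could be sent as
# client.search(index="mistral-embeddings", body=body)
```

Keeping the body in one function makes it harder for the index name and the inference endpoint ID to drift apart when the same query is repeated for several services.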
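Interactive tutorials
You can also find tutorials in an interactive Colab notebook format using the Elasticsearch Python client.
Tutorial: semantic search with ELSER
Elastic Learned Sparse EncodeR, or ELSER, is an NLP model trained by Elastic that enables you to perform semantic search by using sparse vector representation. Instead of literal matching on search terms, semantic search retrieves results based on the intent and the contextual meaning of a search query.
The instructions in this tutorial show you how to use ELSER to perform semantic search on your data.
For the easiest way to perform semantic search in the Elastic Stack, refer to the semantic_text end-to-end tutorial.
Only the first 512 extracted tokens per field are considered during semantic search with ELSER. Refer to this page for more information.
Requirements
To perform semantic search by using ELSER, you must have the NLP model deployed in your cluster. Refer to the ELSER documentation to learn how to download and deploy the model.
The minimum dedicated ML node size for deploying and using the ELSER model is 4 GB in Elasticsearch Service if deployment autoscaling is turned off. Turning on autoscaling is recommended because it allows your deployment to dynamically adjust resources based on demand. Better performance can be achieved by using more allocations or more threads per allocation, which requires bigger ML nodes. Autoscaling provides bigger nodes when required. If autoscaling is turned off, you must provide suitably sized nodes yourself.
Create the index mapping
First, create the mapping of the destination index, which is the index that contains the tokens that the model creates from your text.
The destination index must have a field with the sparse_vector or rank_features field type to index the ELSER output.
Otherwise, Elasticsearch interprets the token-weight pairs as a massive number of fields in a document.
If you get an error similar to this: "Limit of total fields [1000] has been exceeded while adding new fields", then the ELSER output field is not mapped properly and it has a field type different from sparse_vector or rank_features.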
PUT my-index
{
"mappings": {
"properties": {
"content_embedding": {
"type": "sparse_vector"
},
"content": {
"type": "text"
}
}
}
}
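- content_embedding: the name of the field that contains the generated tokens. It must be referenced in the inference pipeline configuration in the next step.
- sparse_vector: the field that contains the tokens is a sparse_vector field.
- content: the name of the field from which to create the sparse vector representation. In this example, the name of the field is content.
- text: the field type, which is text in this example.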
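To learn how to optimize space, refer to the Saving disk space by excluding the ELSER tokens from document source section.
Create an ingest pipeline with an inference processor
Create an ingest pipeline with an inference processor to use ELSER to infer against the data that is being ingested in the pipeline.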
PUT _ingest/pipeline/elser-v2-test
{
"processors": [
{
"inference": {
"model_id": ".elser_model_2",
"input_output": [
{
"input_field": "content",
"output_field": "content_embedding"
}
]
}
}
]
}
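Load data
In this step, you load the data that you later use in the inference ingest pipeline to extract tokens from it.
Use the msmarco-passagetest2019-top1000 data set, which is a subset of the MS MARCO Passage Ranking data set.
It consists of 200 queries, each accompanied by a list of relevant text passages.
All unique passages, along with their IDs, have been extracted from that data set and compiled into a tsv file.
The msmarco-passagetest2019-top1000 data set was not used to train the model.
We use this sample data set in the tutorial because it is easily accessible for demonstration purposes.
You can use a different data set to test the workflow and become familiar with it.
Download the file and upload it to your cluster using the File Uploader in the UI.
After your data is analyzed, click Override settings.
Under Edit field names, assign id to the first column and content to the second.
Click Apply, then Import.
Name the index test-data, and click Import.
After the upload is complete, you will see an index named test-data with 182,469 documents.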
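As an alternative to the file uploader, the same TSV file could be indexed from Python. The sketch below assumes two tab-separated columns (a numeric ID, then the passage text) and no header row. `tsv_to_actions` is a hypothetical helper name, and the `helpers.bulk` call assumes a connected Elasticsearch Python client, so it is shown only as a comment.

```python
import csv
import io


def tsv_to_actions(tsv_text, index_name="test-data"):
    """Yield one bulk action per TSV row of the form <id><TAB><content>."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        yield {
            "_index": index_name,
            "_source": {"id": int(row[0]), "content": row[1]},
        }


# Two made-up rows standing in for the downloaded file.
sample = "1713868\tFor example, if you go for a run...\n42\tAnother passage\n"
actions = list(tsv_to_actions(sample))
# Assumption: with a connected client and the downloaded file at `path`,
# helpers.bulk(client, tsv_to_actions(open(path, encoding="utf-8").read()))
```

Because the function is a generator, the bulk helper can consume the actions without a second in-memory list of documents.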
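Ingest the data through the inference ingest pipeline
Create tokens from the text by reindexing the data through the inference pipeline that uses ELSER as the inference model.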
POST _reindex?wait_for_completion=false
{
"source": {
"index": "test-data",
"size": 50
},
"dest": {
"index": "my-index",
"pipeline": "elser-v2-test"
}
}
The call returns a task ID to monitor the progress:
GET _tasks/<task_id>
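You can also open the Trained Models UI and select the Pipelines tab under ELSER to follow the progress.
Reindexing large data sets can take a long time. You can test this workflow using only a subset of the data set. Do this by cancelling the reindexing process and generating embeddings only for the subset that was reindexed. The following API request cancels the reindexing task: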
POST _tasks/<task_id>/_cancel
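To decide when enough of the data set has been reindexed before cancelling, the task status can be turned into a progress figure. The sketch below assumes that the response of the task API for a reindex task carries `total`, `created`, `updated`, and `deleted` counters under `task.status`. The example response is written by hand for illustration and is not real cluster output, and `reindex_progress` is a hypothetical helper name.

```python
def reindex_progress(task_response):
    """Return the fraction of documents processed so far for a reindex task."""
    status = task_response["task"]["status"]
    done = status.get("created", 0) + status.get("updated", 0) + status.get("deleted", 0)
    total = status.get("total", 0)
    return done / total if total else 0.0


# Hand-written response for illustration only.
example = {
    "completed": False,
    "task": {"status": {"total": 1000, "created": 250, "updated": 0, "deleted": 0}},
}
print(f"{reindex_progress(example):.0%} reindexed")
# Assumption: with a connected client, the response could come from
# client.tasks.get(task_id="<task_id>")
```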
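Semantic search by using the sparse_vector query
To perform semantic search, use the sparse_vector query, and provide the query text and the inference ID associated with your ELSER model.
The example below uses the query text "How to avoid muscle soreness after running?"; the content_embedding field contains the generated ELSER output: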
GET my-index/_search
{
"query":{
"sparse_vector":{
"field": "content_embedding",
"inference_id": "my-elser-endpoint",
"query": "How to avoid muscle soreness after running?"
}
}
}
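The result is the top 10 documents from the my-index index that are closest in meaning to your query text, sorted by relevance.
The result also contains the extracted tokens for each of the relevant search results, with their weights.
Tokens are learned associations that capture relevance; they are not synonyms.
To learn more about what tokens are, refer to this page.
It is possible to exclude tokens from the source; refer to this section to learn more.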
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": 26.199875,
"hits": [
{
"_index": "my-index",
"_id": "FPr9HYsBag9jXmT8lEpI",
"_score": 26.199875,
"_source": {
"content_embedding": {
"muscular": 0.2821541,
"bleeding": 0.37929374,
"foods": 1.1718726,
"delayed": 1.2112266,
"cure": 0.6848574,
"during": 0.5886185,
"fighting": 0.35022718,
"rid": 0.2752442,
"soon": 0.2967024,
"leg": 0.37649947,
"preparation": 0.32974035,
"advance": 0.09652356,
(...)
},
"id": 1713868,
"model_id": ".elser_model_2",
"content": "For example, if you go for a run, you will mostly use the muscles in your lower body. Give yourself 2 days to rest those muscles so they have a chance to heal before you exercise them again. Not giving your muscles enough time to rest can cause muscle damage, rather than muscle development."
}
},
(...)
]
}
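Combining semantic search with other queries
You can combine sparse_vector with other queries in a compound query.
For example, use a filter clause in a Boolean query, or a full text query with the same (or different) query text as the sparse_vector query.
This enables you to combine the search results from both queries.
The search hits from the sparse_vector query tend to score higher than those from other Elasticsearch queries.
Those scores can be regularized by increasing or decreasing the relevance scores of each query with the boost parameter.
Recall on the sparse_vector query can be high where there is a long tail of less relevant results.
Use the min_score parameter to prune those less relevant documents.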
GET my-index/_search
{
"query": {
"bool": {
"should": [
{
"sparse_vector": {
"field": "content_embedding",
"inference_id": "my-elser-endpoint",
"query": "How to avoid muscle soreness after running?",
"boost": 1
}
},
{
"query_string": {
"query": "toxins",
"boost": 4
}
}
]
}
},
"min_score": 10
}
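- Both the sparse_vector and the query_string queries are in a should clause of a bool query.
- The boost value is 1 for the sparse_vector query, which is the default value.
- The boost value is 4 for the query_string query, which increases the relevance score of its results.
- Only the results with a score equal to or higher than 10 are displayed.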
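The interaction of boost and min_score can be illustrated with a small simulation. This is a simplified model, not the actual scoring code: it assumes that a bool query adds up the scores of the should clauses a document matches, that each clause score is multiplied by its boost, and that min_score is applied to the sum. The raw scores are made up, and `combine_scores` is a hypothetical name.

```python
def combine_scores(clause_scores, boosts, min_score=None):
    """Sum boosted clause scores per document, then drop documents below min_score.

    clause_scores maps a clause name to {doc_id: raw_score}; a document that is
    missing from a clause did not match that clause.
    """
    totals = {}
    for clause, scores in clause_scores.items():
        for doc_id, score in scores.items():
            totals[doc_id] = totals.get(doc_id, 0.0) + boosts.get(clause, 1.0) * score
    if min_score is not None:
        totals = {doc: total for doc, total in totals.items() if total >= min_score}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up raw scores for illustration only.
raw = {
    "sparse_vector": {"doc-a": 12.0, "doc-b": 8.0},
    "query_string": {"doc-b": 1.5, "doc-c": 2.0},
}
ranked = combine_scores(raw, {"sparse_vector": 1, "query_string": 4}, min_score=10)
print(ranked)  # [('doc-b', 14.0), ('doc-a', 12.0)]
```

In this toy data, doc-b overtakes doc-a only because the lexical clause is boosted, and doc-c is pruned because its boosted score of 8.0 stays below min_score.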
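Optimizing performance
Saving disk space by excluding the ELSER tokens from document source
The tokens generated by ELSER must be indexed for use in the sparse_vector query. However, it is not necessary to retain those terms in the document source. You can save disk space by using the source exclude mapping to remove the ELSER terms from the document source.
Reindex uses the document source to populate the destination index.
Once the ELSER terms have been excluded from the source, they cannot be recovered through reindexing.
Excluding the tokens from the source is a space-saving optimization that should only be applied if you are certain that reindexing will not be required in the future!
It is important to carefully consider this trade-off and make sure that excluding the ELSER terms from the source aligns with your specific requirements and use case.
Review the Disabling the _source field and Including / Excluding fields from _source sections carefully to learn more about the possible consequences of excluding the tokens from the _source.
The mapping that excludes content_embedding from the _source field can be created by the following API call: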
PUT my-index
{
"mappings": {
"_source": {
"excludes": [
"content_embedding"
]
},
"properties": {
"content_embedding": {
"type": "sparse_vector"
},
"content": {
"type": "text"
}
}
}
}
Depending on your data, the sparse_vector query may be faster with track_total_hits: false.
Further reading
Interactive examples
- The elasticsearch-labs repo contains an interactive example of running ELSER-powered semantic search using the Elasticsearch Python client.
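Tutorial: Using Cohere with Elasticsearch
The instructions in this tutorial show you how to compute embeddings with Cohere using the inference API and store them in Elasticsearch for efficient vector or hybrid search. This tutorial uses the Python Elasticsearch client to perform the operations.
You'll learn how to:
- create an inference endpoint for text embedding using the Cohere service,
- create the necessary index mapping for the Elasticsearch index,
- build an inference pipeline to ingest documents into the index together with the embeddings,
- perform hybrid search on the data,
- rerank search results by using Cohere's rerank model,
- design a RAG system with Cohere's Chat API.
This tutorial uses the SciFact data set.
Refer to Cohere's tutorial for an example using a different data set.
You can also review the Colab notebook version of this tutorial.
Requirements
- A paid Cohere account is required to use the inference API with the Cohere service, as the Cohere free trial API usage is limited,
- an Elastic Cloud account,
- Python 3.7 or higher.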
Install required packages
Install Elasticsearch and Cohere:
!pip install elasticsearch
!pip install cohere
Import the required packages:
from elasticsearch import Elasticsearch, helpers
import cohere
import json
import requests
Create the Elasticsearch client
To create your Elasticsearch client, you need the Cloud ID of your deployment and an API key:
ELASTICSEARCH_ENDPOINT = "elastic_endpoint"
ELASTIC_API_KEY = "elastic_api_key"

client = Elasticsearch(
    cloud_id=ELASTICSEARCH_ENDPOINT,
    api_key=ELASTIC_API_KEY
)

# Confirm the client has connected
print(client.info())
Create the inference endpoint
Create the inference endpoint first. In this example, the inference endpoint uses Cohere's embed-english-v3.0 model and the embedding_type is set to byte.
COHERE_API_KEY = "cohere_api_key"
client.inference.put_model(
task_type="text_embedding",
inference_id="cohere_embeddings",
body={
"service": "cohere",
"service_settings": {
"api_key": COHERE_API_KEY,
"model_id": "embed-english-v3.0",
"embedding_type": "byte"
}
},
)
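You can find your API keys in your Cohere dashboard under the API keys section.
Create the index mapping
Create the index mapping for the index that will contain your embeddings.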
client.indices.create(
index="cohere-embeddings",
settings={"index": {"default_pipeline": "cohere_embeddings"}},
mappings={
"properties": {
"text_embedding": {
"type": "dense_vector",
"dims": 1024,
"element_type": "byte",
},
"text": {"type": "text"},
"id": {"type": "integer"},
"title": {"type": "text"}
}
},
)
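Create the inference pipeline
Now you have an inference endpoint and an index ready to store embeddings. The next step is to create an ingest pipeline with an inference processor that creates the embeddings using the inference endpoint and stores them in the index.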
client.ingest.put_pipeline(
id="cohere_embeddings",
description="Ingest pipeline for Cohere inference.",
processors=[
{
"inference": {
"model_id": "cohere_embeddings",
"input_output": {
"input_field": "text",
"output_field": "text_embedding",
},
}
}
],
)
Prepare data and insert documents
This example uses the SciFact data set that you can find on HuggingFace.
url = 'https://huggingface.co/datasets/mteb/scifact/raw/main/corpus.jsonl'
# Fetch the JSONL data from the URL
response = requests.get(url)
response.raise_for_status() # Ensure noticing bad responses
# Split the content by new lines and parse each line as JSON
data = [json.loads(line) for line in response.text.strip().split('\n') if line]
# Now data is a list of dictionaries
# Change `_id` key to `id` as `_id` is a reserved key in Elasticsearch.
for item in data:
    if '_id' in item:
        item['id'] = item.pop('_id')

# Prepare the documents to be indexed
documents = []
for line in data:
    data_dict = line
    documents.append({
        "_index": "cohere-embeddings",
        "_source": data_dict,
    })
# Use the bulk endpoint to index
helpers.bulk(client, documents)
print("Data ingestion completed, text embeddings generated!")
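Your index is populated with the SciFact data and text embeddings for the text field.
Hybrid search
Let's start querying the index!
The code below performs a hybrid search. The kNN query computes the relevance of search results based on vector similarity using the text_embedding field, and the lexical search query uses BM25 retrieval to compute keyword similarity on the title and text fields.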
query = "What is biosimilarity?"
response = client.search(
index="cohere-embeddings",
size=100,
knn={
"field": "text_embedding",
"query_vector_builder": {
"text_embedding": {
"model_id": "cohere_embeddings",
"model_text": query,
}
},
"k": 10,
"num_candidates": 50,
},
query={
"multi_match": {
"query": query,
"fields": ["text", "title"]
}
}
)
raw_documents = response["hits"]["hits"]
# Display the first 10 results
for document in raw_documents[0:10]:
    print(f'Title: {document["_source"]["title"]}\nText: {document["_source"]["text"]}\n')

# Format the documents for ranking
documents = []
for hit in response["hits"]["hits"]:
    documents.append(hit["_source"]["text"])
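Rerank search results
To combine the results more effectively, use Cohere's Rerank v3 model through the inference API to provide a more precise semantic reranking of the results.
Create an inference endpoint with your Cohere API key and the name of the model to use as the model_id (rerank-english-v3.0 in this example).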
client.inference.put_model(
task_type="rerank",
inference_id="cohere_rerank",
body={
"service": "cohere",
"service_settings":{
"api_key": COHERE_API_KEY,
"model_id": "rerank-english-v3.0"
},
"task_settings": {
"top_n": 10,
},
}
)
Rerank the results using the new inference endpoint.
# Pass the query and the search results to the service
response = client.inference.inference(
inference_id="cohere_rerank",
body={
"query": query,
"input": documents,
"task_settings": {
"return_documents": False
}
}
)
# Reconstruct the input documents based on the index provided in the rerank response
ranked_documents = []
for document in response.body["rerank"]:
    ranked_documents.append({
        "title": raw_documents[int(document["index"])]["_source"]["title"],
        "text": raw_documents[int(document["index"])]["_source"]["text"]
    })

# Print the top 10 results
for document in ranked_documents[0:10]:
    print(f"Title: {document['title']}\nText: {document['text']}\n")
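The response is a list of documents in descending order of relevance. Each document has a corresponding index that reflects the order of the documents when they were sent to the inference endpoint.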
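The index-based reconstruction can be tried without a cluster by using stand-in data. In the sketch below, the hits and the rerank items are written by hand. The `relevance_score` key is an assumption about the response shape and is not read by the code, and `reorder_by_rerank` is a hypothetical helper name.

```python
def reorder_by_rerank(raw_hits, rerank_items):
    """Return title/text pairs in the order given by the rerank response."""
    ranked = []
    for item in rerank_items:
        # "index" points back at the position of the document in the input list.
        source = raw_hits[int(item["index"])]["_source"]
        ranked.append({"title": source["title"], "text": source["text"]})
    return ranked


# Hand-written stand-ins for the search hits and the rerank response.
hits = [
    {"_source": {"title": "A", "text": "first hit"}},
    {"_source": {"title": "B", "text": "second hit"}},
    {"_source": {"title": "C", "text": "third hit"}},
]
rerank = [{"index": 2, "relevance_score": 0.9}, {"index": 0, "relevance_score": 0.4}]
titles = [doc["title"] for doc in reorder_by_rerank(hits, rerank)]
print(titles)  # ['C', 'A']
```

Only the documents that the rerank response lists are returned, which is why hit B is absent from the output here.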
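Retrieval Augmented Generation (RAG) with Cohere and Elasticsearch
RAG is a method for generating text using additional information fetched from an external data source. With the ranked results, you can build a RAG system on top of what you previously created by using Cohere's Chat API.
Pass in the retrieved documents and the query to receive a grounded response using Cohere's newest generative model, Command R+.
Then pass in the query and the documents to the Chat API, and print out the response.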
# Create the Cohere client with the API key defined earlier
co = cohere.Client(COHERE_API_KEY)

response = co.chat(message=query, documents=ranked_documents, model='command-r-plus')

# Collect the IDs of the documents cited in the response
source_documents = []
for citation in response.citations:
    for document_id in citation.document_ids:
        if document_id not in source_documents:
            source_documents.append(document_id)

print(f"Query: {query}")
print(f"Response: {response.text}")
print("Sources:")
for document in response.documents:
    if document['id'] in source_documents:
        print(f"{document['title']}: {document['text']}")
The response will look similar to this:
Query: What is biosimilarity?
Response: Biosimilarity is based on the comparability concept, which has been used successfully for several decades to ensure close similarity of a biological product before and after a manufacturing change. Over the last 10 years, experience with biosimilars has shown that even complex biotechnology-derived proteins can be copied successfully.
Sources:
Interchangeability of Biosimilars: A European Perspective: (...)
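Tutorial: semantic search with a deployed model
- For the easiest way to perform semantic search in the Elastic Stack, refer to the semantic_text end-to-end tutorial.
- This tutorial was written before the inference endpoint and the semantic_text field type were introduced. Today there are simpler options for performing semantic search.
This guide shows you how to implement semantic search with models deployed in Elasticsearch: from selecting an NLP model to writing queries.
Select an NLP model
Elasticsearch offers the usage of a wide range of NLP models, including both dense and sparse vector models. Your choice of the language model is critical for implementing semantic search successfully.
While it is possible to bring your own text embedding model, achieving good search results through model tuning is challenging. Selecting an appropriate model from our third-party model list is the first step. Training the model on your own data is essential to ensure better search results than using only BM25. However, the model training process requires a team of data scientists and ML experts, making it expensive and time-consuming.
To address this issue, Elastic provides a pre-trained representational model called Elastic Learned Sparse EncodeR (ELSER). ELSER, currently available only for English, is an out-of-domain sparse vector model that does not require fine-tuning. This adaptability makes it suitable for various NLP use cases out of the box. Unless you have a team of ML specialists, it is highly recommended to use the ELSER model.
In the case of sparse vector representation, the vectors mostly consist of zero values, with only a small subset containing non-zero values. This representation is commonly used for textual data. In the case of ELSER, each document in an index and the query text itself are represented by high-dimensional sparse vectors. Each non-zero element of the vector corresponds to a term in the model vocabulary. The ELSER vocabulary contains around 30000 terms, so the sparse vectors created by ELSER contain about 30000 values, the majority of which are zero. Effectively, the ELSER model replaces the terms in the original query with other terms that it has learned to exist in the documents that best match the original search terms in a training data set, together with weights that control how important each term is.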
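A sparse vector is conveniently held as a mapping from term to weight, with all absent terms implicitly zero. The sketch below compares a query and a document by the dot product of their weights. This is a simplified illustration with made-up weights, and the way Elasticsearch scores a sparse vector query may differ in detail. `sparse_dot` is a hypothetical name.

```python
def sparse_dot(query_tokens, doc_tokens):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict; tokens absent from either side contribute zero.
    small, large = sorted((query_tokens, doc_tokens), key=len)
    return sum(weight * large[token] for token, weight in small.items() if token in large)


# Made-up weights for illustration; real ELSER output has many more tokens.
query = {"muscle": 1.5, "soreness": 2.0, "running": 1.0}
doc = {"muscle": 0.5, "rest": 1.0, "running": 2.0}
score = sparse_dot(query, doc)
print(score)  # 2.75
```

Only the terms shared by both sides ("muscle" and "running" here) contribute, so the cost depends on the number of non-zero entries and not on the size of the vocabulary.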
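Deploy the model
After you decide which model you want to use to implement semantic search, you need to deploy the model in Elasticsearch.
Map a field for the text embeddings
Before you start using the deployed model to generate embeddings based on your input text, you need to prepare your index mapping first. The mapping of the index depends on the type of model.
ELSER produces token-weight pairs as output from the input text and the query. The Elasticsearch sparse_vector field type can store these token-weight pairs as numeric feature vectors. The index must have a field with the sparse_vector field type to index the tokens that ELSER generates.
To create a mapping for your ELSER index, refer to the Create the index mapping section of the tutorial. The example shows how to create an index mapping for my-index that defines the my_embeddings.tokens field, which will contain the ELSER output, as a sparse_vector field.
The models compatible with Elasticsearch NLP generate dense vectors as output. The dense_vector field type is suitable for storing dense vectors of numeric values. The index must have a field with the dense_vector field type to index the embeddings that your chosen supported third-party model generates. Keep in mind that the model produces embeddings with a certain number of dimensions. The dense_vector field must be configured with the same number of dimensions using the dims option. Refer to the respective model documentation for the number of dimensions of the embeddings.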
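The requirement that dims matches the size of the model output can be checked before any data is indexed. The sketch below builds a mapping with a single dense_vector field and compares the length of a sample embedding against it. The field name, the value 384, and both function names are assumptions chosen for illustration; take the real dimension count from the documentation of your model.

```python
def dense_vector_mapping(field_name, dims):
    """Build an index mapping with one dense_vector field of the given size."""
    return {"properties": {field_name: {"type": "dense_vector", "dims": dims}}}


def check_dims(mapping, field_name, embedding):
    """Raise ValueError if the embedding length differs from the mapped dims."""
    expected = mapping["properties"][field_name]["dims"]
    if len(embedding) != expected:
        raise ValueError(f"expected {expected} dimensions, got {len(embedding)}")
    return True


# 384 is an assumed dimension count, used here only as an example.
mapping = dense_vector_mapping("my_embeddings", 384)
ok = check_dims(mapping, "my_embeddings", [0.0] * 384)
```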
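To review a mapping of an index for an NLP model, refer to the mapping code snippet in the Add the text embedding model to an ingest inference pipeline section of the tutorial. The example shows how to create an index mapping that defines the my_embeddings.predicted_value field, which will contain the model output, as a dense_vector field.
Generate text embeddings
Once you have created the mappings for the index, you can generate text embeddings from your input text. This can be done by using an ingest pipeline with an inference processor. The ingest pipeline processes the input data and indexes it into the destination index. At index time, the inference ingest processor uses the trained model to infer against the data ingested through the pipeline. After you have created the ingest pipeline with the inference processor, you can ingest your data through it to generate the model output.
This is how an ingest pipeline that uses the ELSER model is created: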
PUT _ingest/pipeline/my-text-embeddings-pipeline
{
"description": "Text embedding pipeline",
"processors": [
{
"inference": {
"model_id": ".elser_model_2",
"input_output": [
{
"input_field": "my_text_field",
"output_field": "my_tokens"
}
]
}
}
]
}
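To ingest data through the pipeline to generate tokens with ELSER, refer to the Ingest the data through the inference ingest pipeline section of the tutorial. After you successfully ingest documents by using the pipeline, your index will contain the tokens generated by ELSER. Tokens are learned associations that capture relevance; they are not synonyms. To learn more about what tokens are, refer to this page.
This is how an ingest pipeline that uses a text embedding model is created: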
PUT _ingest/pipeline/my-text-embeddings-pipeline
{
"description": "Text embedding pipeline",
"processors": [
{
"inference": {
"model_id": "sentence-transformers__msmarco-minilm-l-12-v3",
"target_field": "my_embeddings",
"field_map": {
"my_text_field": "text_field"
}
}
}
]
}
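To ingest data through the pipeline to generate text embeddings with your chosen model, refer to the Add the text embedding model to an inference ingest pipeline section. The example shows how to create the pipeline with the inference processor and reindex your data through the pipeline. After you successfully ingest documents by using the pipeline, your index will contain the text embeddings generated by the model.
Now it is time to perform semantic search!
Search the data
Depending on the type of model you have deployed, you can query rank features with a sparse vector query, or dense vectors with a kNN search.
ELSER text embeddings can be queried using a sparse vector query. The sparse vector query enables you to query a sparse vector field by providing the inference ID associated with the NLP model you want to use, and a query text: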
GET my-index/_search
{
"query":{
"sparse_vector": {
"field": "my_tokens",
"inference_id": "my-elser-endpoint",
"query": "the query string"
}
}
}
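Text embeddings produced by dense vector models can be queried using a kNN search. In the knn clause, provide the name of the dense vector field, and a query_vector_builder clause with the model ID and the query text.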
GET my-index/_search
{
"knn": {
"field": "my_embeddings.predicted_value",
"k": 10,
"num_candidates": 100,
"query_vector_builder": {
"text_embedding": {
"model_id": "sentence-transformers__msmarco-minilm-l-12-v3",
"model_text": "the query string"
}
}
}
}
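Beyond semantic search with hybrid search
In some situations, lexical search may perform better than semantic search, for example when searching for single words or IDs such as product numbers.
Combining semantic and lexical search into one hybrid search request using reciprocal rank fusion provides the best of both worlds. In addition, hybrid search using reciprocal rank fusion has been shown to perform better in general.
Hybrid search between a semantic and a lexical query can be achieved by using an rrf retriever as part of your search request. Provide a sparse_vector query and a full-text query as standard retrievers for the rrf retriever. The rrf retriever uses reciprocal rank fusion to rank the top documents.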
GET my-index/_search
{
"retriever": {
"rrf": {
"retrievers": [
{
"standard": {
"query": {
"match": {
"my_text_field": "the query string"
}
}
}
},
{
"standard": {
"query": {
"sparse_vector": {
"field": "my_tokens",
"inference_id": "my-elser-endpoint",
"query": "the query string"
}
}
}
}
]
}
}
}
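Hybrid search between a semantic and a lexical query can be achieved by providing:
- an rrf retriever to rank top documents using reciprocal rank fusion,
- a standard retriever as a child retriever with a query clause for the full-text query,
- a knn retriever as a child retriever with the kNN search that queries the dense vector field.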
GET my-index/_search
{
"retriever": {
"rrf": {
"retrievers": [
{
"standard": {
"query": {
"match": {
"my_text_field": "the query string"
}
}
}
},
{
"knn": {
"field": "text_embedding.predicted_value",
"k": 10,
"num_candidates": 100,
"query_vector_builder": {
"text_embedding": {
"model_id": "sentence-transformers__msmarco-minilm-l-12-v3",
"model_text": "the query string"
}
}
}
}
]
}
}
}
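To make the ranking step of the rrf retriever concrete, reciprocal rank fusion can be sketched in a few lines of Python. This is an illustration of the technique, not the Elasticsearch implementation: each document receives 1 / (rank_constant + rank) from every result list it appears in, with ranks starting at 1, and the contributions are added up. The value 60 for `rank_constant` and the document IDs are assumptions for this example.

```python
def rrf(rankings, rank_constant=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Made-up result lists from a lexical and a semantic retriever.
lexical = ["doc-a", "doc-b", "doc-c"]
semantic = ["doc-b", "doc-d", "doc-a"]
fused = rrf([lexical, semantic])
print(fused)  # ['doc-b', 'doc-a', 'doc-d', 'doc-c']
```

Because only ranks enter the formula, the lexical and semantic scores never need to be on the same scale, which is what makes the method convenient for hybrid search.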