Azure AI Search as a vector database for OpenAI embeddings

本笔记本提供了逐步指导，介绍如何将Azure AI搜索（原名Azure认知搜索）作为向量数据库与OpenAI嵌入结合使用。Azure AI搜索是一项云搜索服务，为开发者提供基础设施、API和工具，用于在Web、移动和企业应用程序中对私有异构内容构建丰富的搜索体验。

先决条件:

为了完成本次练习，您必须具备以下条件：

! pip install wget
! pip install azure-search-documents 
! pip install azure-identity
! pip install openai

导入所需库

import json  
import wget
import pandas as pd
import zipfile
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from azure.core.credentials import AzureKeyCredential  
from azure.search.documents import SearchClient, SearchIndexingBufferedSender  
from azure.search.documents.indexes import SearchIndexClient  
from azure.search.documents.models import (
    QueryAnswerType,
    QueryCaptionType,
    QueryType,
    VectorizedQuery,
)
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration,
    HnswParameters,
    SearchField,
    SearchableField,
    SearchFieldDataType,
    SearchIndex,
    SemanticConfiguration,
    SemanticField,
    SemanticPrioritizedFields,
    SemanticSearch,
    SimpleField,
    VectorSearch,
    VectorSearchAlgorithmKind,
    VectorSearchAlgorithmMetric,
    VectorSearchProfile,
)

配置OpenAI设置

本节将指导您为Azure OpenAI设置身份验证，使您能够使用Azure Active Directory (AAD)或API密钥安全地与服务交互。在继续之前，请确保您已准备好Azure OpenAI端点和凭据。有关使用Azure OpenAI设置AAD的详细说明，请参阅官方文档。

endpoint: str = "YOUR_AZURE_OPENAI_ENDPOINT"
api_key: str = "YOUR_AZURE_OPENAI_KEY"
api_version: str = "2023-05-15"
deployment = "YOUR_AZURE_OPENAI_DEPLOYMENT_NAME"
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(
    credential, "https://cognitiveservices.azure.com/.default"
)

# Set this flag to True if you are using Azure Active Directory
use_aad_for_aoai = True 

if use_aad_for_aoai:
    # Use Azure Active Directory (AAD) authentication
    client = AzureOpenAI(
        azure_endpoint=endpoint,
        api_version=api_version,
        azure_ad_token_provider=token_provider,
    )
else:
    # Use API key authentication
    client = AzureOpenAI(
        api_key=api_key,
        api_version=api_version,
        azure_endpoint=endpoint,
    )

配置Azure AI搜索向量存储设置

本节介绍如何设置Azure AI搜索客户端以集成向量存储功能。您可以在Azure门户中或通过搜索管理SDK以编程方式找到您的Azure AI搜索服务详细信息。

# Configuration
search_service_endpoint: str = "YOUR_AZURE_SEARCH_ENDPOINT"
search_service_api_key: str = "YOUR_AZURE_SEARCH_ADMIN_KEY"
index_name: str = "azure-ai-search-openai-cookbook-demo"

# Set this flag to True if you are using Azure Active Directory
use_aad_for_search = True  

if use_aad_for_search:
    # Use Azure Active Directory (AAD) authentication
    credential = DefaultAzureCredential()
else:
    # Use API key authentication
    credential = AzureKeyCredential(search_service_api_key)

# Initialize the SearchClient with the selected authentication method
search_client = SearchClient(
    endpoint=search_service_endpoint, index_name=index_name, credential=credential
)

加载数据

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)

'vector_database_wikipedia_articles_embedded.zip'

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref:
    zip_ref.extractall("../../data")

article_df = pd.read_csv("../../data/vector_database_wikipedia_articles_embedded.csv")

# Read vectors from strings back into a list using json.loads
article_df["title_vector"] = article_df.title_vector.apply(json.loads)
article_df["content_vector"] = article_df.content_vector.apply(json.loads)
article_df["vector_id"] = article_df["vector_id"].apply(str)
article_df.head()

	id	url	标题	文本内容	标题向量	内容向量	向量id
0	1	https://simple.wikipedia.org/wiki/April	四月	四月是公历年中第四个月份...	[0.001009464613161981, -0.020700545981526375, ...	[-0.011253940872848034, -0.013491976074874401,...	0
1	2	https://simple.wikipedia.org/wiki/August	August	八月(Aug.)是一年中的第八个月份...	[0.0009286514250561595, 0.000820168002974242, ...	[0.0003609954728744924, 0.007262262050062418, ...	1
2	6	https://simple.wikipedia.org/wiki/Art	艺术	艺术是一种表达想象力的创造性活动...	[0.003393713850528002, 0.0061537534929811954, ...	[-0.004959689453244209, 0.015772193670272827, ...	2
3	8	https://simple.wikipedia.org/wiki/A	A	A或a是英语字母表中的第一个字母...	[0.0153952119871974, -0.013759135268628597, 0....	[0.024894846603274345, -0.022186409682035446, ...	3
4	9	https://simple.wikipedia.org/wiki/Air	Air	空气指的是地球的大气层。空气是一种...	[0.02224554680287838, -0.02044147066771984, -...	[0.021524671465158463, 0.018522677943110466, -...	4

创建索引

这段代码示例展示了如何使用Azure AI搜索Python SDK中的SearchIndexClient来定义和创建搜索索引。该索引同时集成了向量搜索和语义排序功能。更多详情，请参阅我们关于创建向量索引的文档。

# Initialize the SearchIndexClient
index_client = SearchIndexClient(
    endpoint=search_service_endpoint, credential=credential
)

# Define the fields for the index
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String),
    SimpleField(name="vector_id", type=SearchFieldDataType.String, key=True),
    SimpleField(name="url", type=SearchFieldDataType.String),
    SearchableField(name="title", type=SearchFieldDataType.String),
    SearchableField(name="text", type=SearchFieldDataType.String),
    SearchField(
        name="title_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        vector_search_dimensions=1536,
        vector_search_profile_name="my-vector-config",
    ),
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        vector_search_dimensions=1536,
        vector_search_profile_name="my-vector-config",
    ),
]

# Configure the vector search configuration
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="my-hnsw",
            kind=VectorSearchAlgorithmKind.HNSW,
            parameters=HnswParameters(
                m=4,
                ef_construction=400,
                ef_search=500,
                metric=VectorSearchAlgorithmMetric.COSINE,
            ),
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="my-vector-config",
            algorithm_configuration_name="my-hnsw",
        )
    ],
)

# Configure the semantic search configuration
semantic_search = SemanticSearch(
    configurations=[
        SemanticConfiguration(
            name="my-semantic-config",
            prioritized_fields=SemanticPrioritizedFields(
                title_field=SemanticField(field_name="title"),
                keywords_fields=[SemanticField(field_name="url")],
                content_fields=[SemanticField(field_name="text")],
            ),
        )
    ]
)

# Create the search index with the vector search and semantic search configurations
index = SearchIndex(
    name=index_name,
    fields=fields,
    vector_search=vector_search,
    semantic_search=semantic_search,
)

# Create or update the index
result = index_client.create_or_update_index(index)
print(f"{result.name} created")

azure-ai-search-openai-cookbook-demo created

上传数据到Azure AI搜索索引

以下代码片段概述了将一批文档（特别是带有预计算嵌入向量的维基百科文章）从pandas DataFrame上传到Azure AI搜索索引的过程。如需了解数据导入策略和最佳实践的详细指南，请参阅Azure AI搜索中的数据导入。

from azure.core.exceptions import HttpResponseError

# Convert the 'id' and 'vector_id' columns to string so one of them can serve as our key field
article_df["id"] = article_df["id"].astype(str)
article_df["vector_id"] = article_df["vector_id"].astype(str)
# Convert the DataFrame to a list of dictionaries
documents = article_df.to_dict(orient="records")

# Create a SearchIndexingBufferedSender
batch_client = SearchIndexingBufferedSender(
    search_service_endpoint, index_name, credential
)

try:
    # Add upload actions for all documents in a single call
    batch_client.upload_documents(documents=documents)

    # Manually flush to send any remaining documents in the buffer
    batch_client.flush()
except HttpResponseError as e:
    print(f"An error occurred: {e}")
finally:
    # Clean up resources
    batch_client.close()

print(f"Uploaded {len(documents)} documents in total")

Uploaded 25000 documents in total

如果你的数据集尚未包含预计算的嵌入向量，你可以使用以下函数通过openai的Python库来创建嵌入向量。你会注意到相同的函数和模型也被用于生成查询嵌入向量以执行向量搜索。

# Example function to generate document embedding
def generate_embeddings(text, model):
    # Generate embeddings for the provided text using the specified model
    embeddings_response = client.embeddings.create(model=model, input=text)
    # Extract the embedding data from the response
    embedding = embeddings_response.data[0].embedding
    return embedding


first_document_content = documents[0]["text"]
print(f"Content: {first_document_content[:100]}")

content_vector = generate_embeddings(first_document_content, deployment)
print("Content vector generated")

Content: April is the fourth month of the year in the Julian and Gregorian calendars, and comes between March
Content vector generated

执行向量相似性搜索

# Pure Vector Search
query = "modern art in Europe"
  
search_client = SearchClient(search_service_endpoint, index_name, credential)  
vector_query = VectorizedQuery(vector=generate_embeddings(query, deployment), k_nearest_neighbors=3, fields="content_vector")
  
results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query], 
    select=["title", "text", "url"] 
)
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"URL: {result['url']}\n")

Title: Documenta
Score: 0.8599451
URL: https://simple.wikipedia.org/wiki/Documenta

Title: Museum of Modern Art
Score: 0.85260946
URL: https://simple.wikipedia.org/wiki/Museum%20of%20Modern%20Art

Title: Expressionism
Score: 0.852354
URL: https://simple.wikipedia.org/wiki/Expressionism

执行混合搜索

混合搜索结合了传统基于关键词的搜索和基于向量的相似性搜索的能力，以提供更相关且符合上下文的结果。这种方法在处理复杂查询时特别有用，因为这些查询能从理解文本背后的语义含义中受益。

提供的代码片段展示了如何执行混合搜索查询：

# Hybrid Search
query = "Famous battles in Scottish history"  
  
search_client = SearchClient(search_service_endpoint, index_name, credential)  
vector_query = VectorizedQuery(vector=generate_embeddings(query, deployment), k_nearest_neighbors=3, fields="content_vector")
  
results = search_client.search(  
    search_text=query,  
    vector_queries= [vector_query], 
    select=["title", "text", "url"],
    top=3
)
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"URL: {result['url']}\n")

Title: Wars of Scottish Independence
Score: 0.03306011110544205
URL: https://simple.wikipedia.org/wiki/Wars%20of%20Scottish%20Independence

Title: Battle of Bannockburn
Score: 0.022253260016441345
URL: https://simple.wikipedia.org/wiki/Battle%20of%20Bannockburn

Title: Scottish
Score: 0.016393441706895828
URL: https://simple.wikipedia.org/wiki/Scottish

执行混合搜索并重新排序（由Bing提供支持）

Semantic ranker 通过语言理解对搜索结果进行重新排序，显著提升搜索相关性。此外，您还可以获得提取式摘要、答案和高亮内容。

# Semantic Hybrid Search
query = "What were the key technological advancements during the Industrial Revolution?"

search_client = SearchClient(search_service_endpoint, index_name, credential)
vector_query = VectorizedQuery(
    vector=generate_embeddings(query, deployment),
    k_nearest_neighbors=3,
    fields="content_vector",
)

results = search_client.search(
    search_text=query,
    vector_queries=[vector_query],
    select=["title", "text", "url"],
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name="my-semantic-config",
    query_caption=QueryCaptionType.EXTRACTIVE,
    query_answer=QueryAnswerType.EXTRACTIVE,
    top=3,
)

semantic_answers = results.get_answers()
for answer in semantic_answers:
    if answer.highlights:
        print(f"Semantic Answer: {answer.highlights}")
    else:
        print(f"Semantic Answer: {answer.text}")
    print(f"Semantic Answer Score: {answer.score}\n")

for result in results:
    print(f"Title: {result['title']}")
    print(f"Reranker Score: {result['@search.reranker_score']}")
    print(f"URL: {result['url']}")
    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")

Semantic Answer: Advancements During the industrial revolution, new technology brought many changes. For example: Canals were built to allow heavy goods to be moved easily where they were needed. The steam engine became the main source of power. It replaced horses and human labor. Cheap iron and steel became mass-produced.
Semantic Answer Score: 0.90478515625

Title: Industrial Revolution
Reranker Score: 3.408700942993164
URL: https://simple.wikipedia.org/wiki/Industrial%20Revolution
Caption: Advancements During the industrial revolution, new technology brought many changes. For example: Canals were built to allow heavy goods to be moved easily where they were needed. The steam engine became the main source of power. It replaced horses and human labor. Cheap iron and steel became mass-produced.

Title: Printing
Reranker Score: 1.603400707244873
URL: https://simple.wikipedia.org/wiki/Printing
Caption: Machines to speed printing, cheaper paper, automatic stitching and binding all arrived in the 19th century during the industrial revolution. What had once been done by a few men by hand was now done by limited companies on huge machines. The result was much lower prices, and a much wider readership.

Title: Industrialisation
Reranker Score: 1.3238357305526733
URL: https://simple.wikipedia.org/wiki/Industrialisation
Caption: Industrialisation (or industrialization) is a process that happens in countries when they start to use machines to do work that was once done by people. Industrialisation changes the things people do. Industrialisation caused towns to grow larger. Many people left farming to take higher paid jobs in factories in towns.

2023年9月11日

Azure AI Search 作为 OpenAI 嵌入向量的向量数据库