Navigate Your Codebase with Semantic Search and Qdrant

Time: 45 min	Level: Intermediate

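You can enrich your own applications with Qdrant semantic search. In this tutorial, we describe how to use Qdrant to navigate a codebase and find relevant code snippets. As an example, we use the Qdrant source code itself, which is mostly written in Rust.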

The approach

We want to search a codebase with natural-language queries and to find code based on similar logic. You can set up both tasks with embeddings:

A general-purpose neural encoder for Natural Language Processing (NLP), in our case all-MiniLM-L6-v2 from the sentence-transformers library.
Specialized embeddings for code-to-code similarity search. We use the jina-embeddings-v2-base-code model.

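To prepare our code for all-MiniLM-L6-v2, we preprocess the code into text that more closely resembles natural language. The Jina embeddings model supports a variety of standard programming languages, so there is no need to preprocess the snippets. We can use the code as is.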

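NLP-based search works on function signatures, but code search may return smaller pieces, such as loops. So, if we receive a particular function signature from the NLP model and part of its implementation from the code model, we merge the results and highlight the overlap.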

Data preparation

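Chunking the application sources into smaller parts is a non-trivial task. In general, functions, class methods, structs, enums, and all the other language-specific constructs are good candidates for chunks. They are big enough to contain some meaningful information, but small enough to be processed by embedding models with a limited context window. You can also use docstrings, comments, and other metadata to enrich the chunks with additional information.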

Parsing the codebase

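While our examples use Rust, you can apply the same approach to any other language. You can use tools compatible with the Language Server Protocol (LSP) to parse the code, build a graph of the codebase, and then extract chunks. We did our work with rust-analyzer. We exported the parsed codebase into the LSIF format, a standard for code intelligence data. Next, we used the LSIF data to navigate the codebase and extract the chunks. For details, see our code search demo.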

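We then exported the chunks into JSON documents that contain not only the code itself but also context describing where the code is located in the project. For example, see the description of the await_ready_for_timeout function from the IsReady struct in the common module: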

{
   "name":"await_ready_for_timeout",
   "signature":"fn await_ready_for_timeout (& self , timeout : Duration) -> bool",
   "code_type":"Function",
   "docstring":"= \" Return `true` if ready, `false` if timed out.\"",
   "line":44,
   "line_from":43,
   "line_to":51,
   "context":{
      "module":"common",
      "file_path":"lib/collection/src/common/is_ready.rs",
      "file_name":"is_ready.rs",
      "struct_name":"IsReady",
      "snippet":"    /// Return `true` if ready, `false` if timed out.\n    pub fn await_ready_for_timeout(&self, timeout: Duration) -> bool {\n        let mut is_ready = self.value.lock();\n        if !*is_ready {\n            !self.condvar.wait_for(&mut is_ready, timeout).timed_out()\n        } else {\n            true\n        }\n    }\n"
   }
}

You can examine the Qdrant structures, parsed into JSON, in the structures.jsonl file in our Google Cloud Storage bucket. Download it and use it as the data source for our code search.

wget https://storage.googleapis.com/tutorial-attachments/code-search/structures.jsonl

Next, load the file and parse the lines into a list of dictionaries:

import json

structures = []
with open("structures.jsonl", "r") as fp:
    for row in fp:
        entry = json.loads(row)
        structures.append(entry)
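
Before transforming the chunks, it can help to check what kinds of constructs the file contains. The sketch below counts entries per `code_type`. The helper name `count_code_types` and the inline sample entries are illustrative assumptions, not part of the tutorial's dataset; with the real data you would pass the `structures` list instead.

```python
from collections import Counter
from typing import Any, Dict, List


def count_code_types(entries: List[Dict[str, Any]]) -> Counter:
    """Count how many chunks exist for each `code_type` value."""
    # Entries without a `code_type` key are grouped under "Unknown".
    return Counter(entry.get("code_type", "Unknown") for entry in entries)


# Hypothetical sample entries that mirror the JSON structure shown above.
sample = [
    {"name": "await_ready_for_timeout", "code_type": "Function"},
    {"name": "IsReady", "code_type": "Struct"},
    {"name": "count", "code_type": "Function"},
]
type_counts = count_code_types(sample)
print(type_counts.most_common())
```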

Code to natural language conversion

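Each programming language has its own syntax, which is not part of natural language. Thus, a general-purpose model probably does not understand the code as is. We can, however, normalize the data by removing code specifics and including additional context, such as module, class, function, and file name. We took the following steps: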

Extract the signature of the function, method, or other code construct.
Divide camel case and snake case names into separate words.
Take the docstring, comments, and other important metadata.
Build a sentence from the extracted data using a predefined template.
Remove the special characters and replace them with spaces.

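The input is expected to be a dictionary with the same structure as the JSON document shown earlier. Define a textify function to do the conversion. We use the inflection library to handle the different naming conventions.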

pip install inflection

Once all dependencies are installed, we define the textify function:

import inflection
import re

from typing import Dict, Any

def textify(chunk: Dict[str, Any]) -> str:
    # Get rid of all the camel case / snake case
    # - inflection.underscore changes the camel case to snake case
    # - inflection.humanize converts the snake case to human readable form
    name = inflection.humanize(inflection.underscore(chunk["name"]))
    signature = inflection.humanize(inflection.underscore(chunk["signature"]))

    # Check if docstring is provided
    docstring = ""
    if chunk["docstring"]:
        docstring = f"that does {chunk['docstring']} "

    # Extract the location of that snippet of code
    context = (
        f"module {chunk['context']['module']} "
        f"file {chunk['context']['file_name']}"
    )
    if chunk["context"]["struct_name"]:
        struct_name = inflection.humanize(
            inflection.underscore(chunk["context"]["struct_name"])
        )
        context = f"defined in struct {struct_name} {context}"

    # Combine all the bits and pieces together
    text_representation = (
        f"{chunk['code_type']} {name} "
        f"{docstring}"
        f"defined as {signature} "
        f"{context}"
    )

    # Remove any special characters and concatenate the tokens
    tokens = re.split(r"\W", text_representation)
    tokens = filter(lambda x: x, tokens)
    return " ".join(tokens)

Now we can use textify to convert all chunks into text representations:

text_representations = list(map(textify, structures))

This is what the description of the await_ready_for_timeout function looks like:

Function Await ready for timeout that does Return true if ready false if timed out defined as Fn await ready for timeout self timeout duration bool defined in struct Is ready module common file is_ready rs
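
To see the name-splitting step in isolation, here is a standard-library approximation of it. This is only a sketch: `split_identifier` is a hypothetical helper, not the inflection library's implementation, and it may differ from inflection on edge cases such as acronyms.

```python
import re


def split_identifier(name: str) -> str:
    """Split a CamelCase or snake_case identifier into lowercase words."""
    # Insert a space at lower-to-upper boundaries, e.g. "IsReady" -> "Is Ready".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Treat underscores as word separators and collapse repeated whitespace.
    return " ".join(spaced.replace("_", " ").split()).lower()


print(split_identifier("IsReady"))
print(split_identifier("await_ready_for_timeout"))
```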

Ingestion pipeline

Next, we build the code search engine: we vectorize the data and set up a semantic search mechanism for both embedding models.

Natural language embeddings

We can encode the text representations with the all-MiniLM-L6-v2 model from sentence-transformers. The following command installs sentence-transformers together with its dependencies:

pip install sentence-transformers optimum onnx

Then we can use the model to encode the text representations:

from sentence_transformers import SentenceTransformer

nlp_model = SentenceTransformer("all-MiniLM-L6-v2")
nlp_embeddings = nlp_model.encode(
    text_representations, show_progress_bar=True,
)

Code embeddings

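The jina-embeddings-v2-base-code model is a good candidate for this task. You can also load it through the sentence-transformers library, subject to some conditions. Visit the model page, accept the rules, and generate an access token in your account settings. Once you have the token, you can use the model as follows: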

HF_TOKEN = "THIS_IS_YOUR_TOKEN"

# Extract the code snippets from the structures to a separate list
code_snippets = [
    structure["context"]["snippet"] for structure in structures
]

code_model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-code",
    token=HF_TOKEN,
    trust_remote_code=True
)
code_model.max_seq_length = 8192  # increase the context length window
code_embeddings = code_model.encode(
    code_snippets, batch_size=4, show_progress_bar=True,
)

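Remember to set the trust_remote_code parameter to True. Otherwise, the model does not produce meaningful vectors. Setting this parameter allows the library to download and possibly run some code on your machine, so make sure you trust the source.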

With both the natural language and the code embeddings in hand, we can store them in a Qdrant collection.

Building the Qdrant collection

We use the qdrant-client library to interact with the Qdrant server. Let's install that client:

pip install qdrant-client

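Of course, we need a running Qdrant server for vector search. If you need one, you can use a local Docker container or deploy it with Qdrant Cloud. Either option works for this tutorial. Configure the connection parameters: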

QDRANT_URL = "https://my-cluster.cloud.qdrant.io:6333" # http://localhost:6333 for local instance
QDRANT_API_KEY = "THIS_IS_YOUR_API_KEY" # None for local instance

Then use the library to create a collection:

from qdrant_client import QdrantClient, models

client = QdrantClient(QDRANT_URL, api_key=QDRANT_API_KEY)
client.create_collection(
    "qdrant-sources",
    vectors_config={
        "text": models.VectorParams(
            size=nlp_embeddings.shape[1],
            distance=models.Distance.COSINE,
        ),
        "code": models.VectorParams(
            size=code_embeddings.shape[1],
            distance=models.Distance.COSINE,
        ),
    }
)
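
Both named vectors are configured with cosine distance. As a reminder of what that metric measures, here is a plain-Python sketch of cosine similarity. The function name is ours, and the code is only an illustration: Qdrant performs the actual computation on the server.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

With this metric, a higher score means the two vectors point in more similar directions.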

Our newly created collection is ready to accept the data. Let's upload the embeddings:

import uuid

points = [
    models.PointStruct(
        id=uuid.uuid4().hex,
        vector={
            "text": text_embedding,
            "code": code_embedding,
        },
        payload=structure,
    )
    for text_embedding, code_embedding, structure in zip(nlp_embeddings, code_embeddings, structures)
]

client.upload_points("qdrant-sources", points=points, batch_size=64)

The uploaded points are immediately available for search. Next, query the collection to find relevant code snippets.

Querying the codebase

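We use one of the models to search the collection. Start with the text embeddings and run the following query: "How do I count points in a collection?"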

query = "How do I count points in a collection?"

hits = client.query_points(
    "qdrant-sources",
    query=nlp_model.encode(query).tolist(),
    using="text",
    limit=5,
).points

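Now, review the results. The following table lists the module, the file name, and the score. Each row also includes a link to the signature, shown as a code block from the file.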

Module	File name	Score	Signature
toc	point_ops.rs	0.59448624	`pub async fn count`
operations	types.rs	0.5493385	`pub struct CountRequestInternal`
collection_manager	segments_updater.rs	0.5121002	`pub(crate) fn upsert_points<'a, T>`
collection	point_ops.rs	0.5063539	`pub async fn count`
map_index	mod.rs	0.49973983	`fn get_points_with_value_count`
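
A table like the one above could be produced from `hits` with a small helper. This is a minimal sketch: `hits_to_rows` is a hypothetical name, it assumes each hit exposes `score` and a `payload` with the JSON structure shown earlier, and which payload fields the original table was built from is an assumption. A `SimpleNamespace` stands in for a real hit so the example runs without a Qdrant server.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def hits_to_rows(hits: Iterable[Any]) -> List[str]:
    """Render each hit as a tab-separated row: module, file name, score, signature."""
    rows = ["module\tfile_name\tscore\tsignature"]
    for hit in hits:
        context = hit.payload["context"]
        rows.append(
            f"{context['module']}\t{context['file_name']}\t"
            f"{hit.score}\t{hit.payload['signature']}"
        )
    return rows


# Stand-in for a search hit; the score value is made up.
fake_hit = SimpleNamespace(
    score=0.5,
    payload={
        "signature": "fn await_ready_for_timeout (& self , timeout : Duration) -> bool",
        "context": {"module": "common", "file_name": "is_ready.rs"},
    },
)
for row in hits_to_rows([fake_hit]):
    print(row)
```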

It seems we were able to find some relevant code structures. Let's try the same with the code embeddings:

hits = client.query_points(
    "qdrant-sources",
    query=code_model.encode(query).tolist(),
    using="code",
    limit=5,
).points

Output:

Module	File name	Score	Signature
field_index	geo_index.rs	0.73278356	`fn count_indexed_points`
numeric_index	mod.rs	0.7254976	`fn count_indexed_points`
map_index	mod.rs	0.7124739	`fn count_indexed_points`
map_index	mod.rs	0.7124739	`fn count_indexed_points`
fixtures	payload_context_fixture.rs	0.706204	`fn total_point_count`

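While the scores retrieved by different models are not comparable, we can see that the results are different. Code and text embeddings can capture different aspects of the codebase. We can query the collection with both models in a single batch request and then combine the results to get the most relevant code snippets.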

responses = client.query_batch_points(
    "qdrant-sources",
    requests=[
        models.QueryRequest(
            query=nlp_model.encode(query).tolist(),
            using="text",
            with_payload=True,
            limit=5,
        ),
        models.QueryRequest(
            query=code_model.encode(query).tolist(),
            using="code",
            with_payload=True,
            limit=5,
        ),
    ]
)

results = [response.points for response in responses]

Output:

Module	File name	Score	Signature
toc	point_ops.rs	0.59448624	`pub async fn count`
operations	types.rs	0.5493385	`pub struct CountRequestInternal`
collection_manager	segments_updater.rs	0.5121002	`pub(crate) fn upsert_points<'a, T>`
collection	point_ops.rs	0.5063539	`pub async fn count`
map_index	mod.rs	0.49973983	`fn get_points_with_value_count`
field_index	geo_index.rs	0.73278356	`fn count_indexed_points`
numeric_index	mod.rs	0.7254976	`fn count_indexed_points`
map_index	mod.rs	0.7124739	`fn count_indexed_points`
map_index	mod.rs	0.7124739	`fn count_indexed_points`
fixtures	payload_context_fixture.rs	0.706204	`fn total_point_count`

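This is one example of how you can use different models and combine the results. In a real-world scenario, you might do some reranking and deduplication, as well as additional processing of the results.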
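
Because the scores from the two models are not comparable, one common way to combine the lists is rank-based fusion. The sketch below uses Reciprocal Rank Fusion (RRF). This is our choice for illustration and not necessarily what the tutorial's demo does; `k = 60` is a conventional default, and the point IDs are made up. An ID that appears in both lists ends up as a single fused entry, which also takes care of deduplication.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Tuple[Hashable, float]]:
    """Fuse ranked ID lists; each item scores sum(1 / (k + rank)) over the lists."""
    fused: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            fused[item_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties keep first-seen order (sort is stable).
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical point IDs, ordered best-first by each model.
text_ids = ["a", "b", "c"]
code_ids = ["c", "a", "d"]
for item_id, score in reciprocal_rank_fusion([text_ids, code_ids]):
    print(item_id, round(score, 5))
```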

Code search demo

Our code search demo uses the following process:

The user sends a query.
Both models vectorize that query simultaneously. We get two different vectors.
Both vectors are used in parallel to find relevant snippets. We expect 5 examples from the NLP search and 20 examples from the code search.
Once we retrieve results for both vectors, we merge them in one of the following scenarios:
1. If both methods return different results, we prefer the results from the general usage model (NLP).
2. If there is an overlap between the search results, we merge overlapping snippets.

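In the screenshot, we search for flush of wal. The result shows relevant code merged from both models. Note the highlighted code in lines 621-629. It is where both models agree.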

Now you see semantic code intelligence in action.

Grouping the results

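You can improve the search results by grouping them by payload properties. In our case, we can group the results by module. If we use code embeddings, we can see multiple results from the map_index module. Let's group the results and assume a single result per module: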

# Grouped variant of query_points, matching the query API used earlier
results = client.query_points_groups(
    "qdrant-sources",
    query=code_model.encode(query).tolist(),
    using="code",
    group_by="context.module",
    limit=5,
    group_size=1,
)

Output:

Module	File name	Score	Signature
field_index	geo_index.rs	0.73278356	`fn count_indexed_points`
numeric_index	mod.rs	0.7254976	`fn count_indexed_points`
map_index	mod.rs	0.7124739	`fn count_indexed_points`
fixtures	payload_context_fixture.rs	0.706204	`fn total_point_count`
hnsw_index	graph_links.rs	0.6998417	`fn num_points`

With the grouping feature, we get more diverse results.
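
To make the effect of grouping concrete, the sketch below applies the same idea on the client to the five ungrouped code-search rows shown earlier: keep only the best hit per module. The helper `top_hit_per_module` and the flat dictionary layout are illustrative assumptions, and this is not how Qdrant implements grouping internally.

```python
from typing import Any, Dict, Iterable, List


def top_hit_per_module(
    hits: Iterable[Dict[str, Any]], limit: int = 5
) -> List[Dict[str, Any]]:
    """Keep the best-scoring hit for each module, then return the top `limit`."""
    best: Dict[str, Dict[str, Any]] = {}
    for hit in hits:
        module = hit["module"]
        if module not in best or hit["score"] > best[module]["score"]:
            best[module] = hit
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    return ranked[:limit]


# Module and score columns copied from the ungrouped code-search table.
flat_hits = [
    {"module": "field_index", "score": 0.73278356},
    {"module": "numeric_index", "score": 0.7254976},
    {"module": "map_index", "score": 0.7124739},
    {"module": "map_index", "score": 0.7124739},
    {"module": "fixtures", "score": 0.706204},
]
for hit in top_hit_per_module(flat_hits):
    print(hit["module"], hit["score"])
```

Filtering five ungrouped hits on the client leaves only four rows, whereas the server-side grouping request fills all five groups, which is why the hnsw_index row appears in the grouped output.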

Summary

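This tutorial demonstrates how to use Qdrant to navigate a codebase. For an end-to-end implementation, review the code search notebook and the code search demo. You can also check out a running version of the code search demo, which exposes the Qdrant codebase for search through a web interface.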