Build a Multimodal Search System with Qdrant and FastEmbed

Time: 15 min	Level: Beginner	Output: GitHub

In this tutorial, you will set up a simple multimodal image and text search with Qdrant and FastEmbed.

Overview

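We often understand and share information more effectively by combining different types of data. For example, the taste of comfort food can trigger childhood memories. We might describe a song with just "pam pam clap" sounds instead of writing paragraphs. Sometimes we use emojis and stickers to express how we feel or to share complex ideas.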

Modalities of data such as text, images, video and audio, in various combinations, form valuable use cases for semantic search applications.

Vector databases are modality-agnostic, which makes them a good fit for building these applications.

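In this tutorial, we work with two modalities: image and text data. However, you can build a semantic search application for any combination of modalities, provided you choose an embedding model that bridges the semantic gap.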

The semantic gap refers to the difference between low-level features (e.g. brightness) and high-level concepts (e.g. cuteness).

For example, Meta AI's ImageBind model is said to bind all 4 mentioned modalities in one shared space.

Prerequisites

Note: The code for this tutorial can be found here

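To complete this tutorial, you will need either Docker to run a pre-built Docker image of Qdrant together with Python version ≥ 3.8, or a Google Colab notebook if you don't want to install anything locally.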

We showed how to run Qdrant in Docker in the "Create a Simple Neural Search Service" tutorial.

Setup

First, install the required libraries: qdrant-client, fastembed and Pillow. With the pip package manager, for example, this can be done as follows.

python3 -m pip install --upgrade qdrant-client fastembed Pillow

Dataset

To keep the demonstration simple, we created a tiny dataset of images and their captions for you.

Images can be downloaded from here. It is important to place them in a folder named images, located in the same folder as your code/notebook.

You can check what an image looks like as follows:

from PIL import Image

Image.open('images/lizard.jpg')
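Before embedding anything, it may help to confirm that the image files are where the later code expects them. The following is a minimal sketch using only the standard library; the helper name `missing_files` is hypothetical, and the file names are the ones used elsewhere in this tutorial.

```python
from pathlib import Path

def missing_files(paths, base_dir="."):
    """Return the entries of `paths` that are not existing files under `base_dir`."""
    base = Path(base_dir)
    return [p for p in paths if not (base / p).is_file()]

# File names used in this tutorial's code snippets.
expected = [
    "images/piggy.jpg",
    "images/coffee.jpg",
    "images/lizard.jpg",
    "images/piglet.jpg",
]

missing = missing_files(expected)
print("All images found" if not missing else f"Missing: {missing}")
```

If anything is reported as missing, check that the images folder sits next to your script or notebook.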

Vectorize data

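FastEmbed supports the Contrastive Language-Image Pre-training (CLIP) model, a classic (2021) of multimodal image-text machine learning. CLIP was one of the first models of this kind with zero-shot capabilities.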

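When using it for semantic search, it is important to remember that CLIP's text encoder was trained to process no more than 77 tokens, so CLIP is best suited to short texts.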
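As a rough illustration of the 77-token limit, the sketch below flags texts that are certainly too long. It does not use the real CLIP tokenizer: it relies on the observation that every whitespace-separated word needs at least one token, so the word count gives a lower bound. The helper names and the assumption that two special (start/end) tokens count toward the limit are ours, not part of FastEmbed.

```python
CLIP_CONTEXT_LENGTH = 77  # token limit mentioned above
SPECIAL_TOKENS = 2        # assumption: start and end markers count toward the limit

def min_token_estimate(text):
    """Lower bound on the token count: one token per whitespace-separated word."""
    return len(text.split()) + SPECIAL_TOKENS

def surely_too_long(text, limit=CLIP_CONTEXT_LENGTH):
    """True if even the lower bound exceeds the limit."""
    return min_token_estimate(text) > limit

for caption in ["A photo of a cute pig", "word " * 100]:
    print(min_token_estimate(caption), surely_too_long(caption))
```

A text that passes this check may still exceed the limit once tokenized, since long or rare words are usually split into several tokens. For an exact count, use the model's own tokenizer.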

Let's embed a very short selection of images and their captions in the shared embedding space with CLIP.

from fastembed import TextEmbedding, ImageEmbedding

# Each document pairs a short caption with the path to its image.
documents = [
    {"caption": "A photo of a cute pig", "image": "images/piggy.jpg"},
    {"caption": "A picture with a coffee cup", "image": "images/coffee.jpg"},
    {"caption": "A photo of a colourful lizard", "image": "images/lizard.jpg"},
]

text_model_name = "Qdrant/clip-ViT-B-32-text"  # CLIP text encoder
text_model = TextEmbedding(model_name=text_model_name)
# Embed the captions with the CLIP text encoder.
texts_embeded = list(text_model.embed([document["caption"] for document in documents]))
# Dimension of the text embeddings (512 for this model), read from the output
# instead of the private _get_model_description helper.
text_embeddings_size = len(texts_embeded[0])

image_model_name = "Qdrant/clip-ViT-B-32-vision"  # CLIP image encoder
image_model = ImageEmbedding(model_name=image_model_name)
# Embed the images with the CLIP image encoder.
images_embeded = list(image_model.embed([document["image"] for document in documents]))
# Dimension of the image embeddings (512 for this model).
image_embeddings_size = len(images_embeded[0])

Upload data to Qdrant

Create a client object for Qdrant.

from qdrant_client import QdrantClient, models

# Alternatively, use QdrantClient(":memory:") if you are on Google Colab.
# That option is suitable only for simple prototypes/demos with the Python client.
client = QdrantClient("http://localhost:6333")

Create a new collection for your images with captions.

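CLIP's weights were trained to maximize the scaled cosine similarity of truly corresponding image/caption pairs, so cosine is the distance metric we will choose for the named vectors in our collection.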
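For readers who want to see the metric itself, here is a minimal pure-Python sketch of cosine similarity. The function name and the toy 2-dimensional vectors are illustrative only; real CLIP embeddings in this tutorial have 512 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # perpendicular directions
```

Because the measure depends only on direction, vectors pointing the same way score 1 regardless of their length, and perpendicular vectors score 0.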

Using named vectors, we can easily showcase both text-to-image and image-to-text search (as well as image-to-image and text-to-text).

if not client.collection_exists("text_image"):  # create the collection only once
    client.create_collection(
        collection_name="text_image",
        vectors_config={  # named vectors
            "image": models.VectorParams(size=image_embeddings_size, distance=models.Distance.COSINE),
            "text": models.VectorParams(size=text_embeddings_size, distance=models.Distance.COSINE),
        },
    )
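To make the idea of named vectors concrete, the sketch below ranks a few toy points against one chosen named vector using a plain linear scan. It is a conceptual illustration only: the `points` data, the 2-dimensional vectors and the `brute_force_search` helper are hypothetical, and Qdrant itself relies on indexing, not on scanning every point.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy points: each one stores a vector per name, like "image" and "text" above.
points = [
    {"id": 0, "vector": {"image": [0.9, 0.1], "text": [0.8, 0.2]}},
    {"id": 1, "vector": {"image": [0.1, 0.9], "text": [0.2, 0.8]}},
]

def brute_force_search(points, query, vector_name, limit=1):
    """Rank points by cosine similarity between `query` and one named vector."""
    ranked = sorted(
        points,
        key=lambda p: cosine(query, p["vector"][vector_name]),
        reverse=True,
    )
    return [p["id"] for p in ranked[:limit]]

# The same query mechanism works against either named vector.
print(brute_force_search(points, [1.0, 0.0], vector_name="image"))
print(brute_force_search(points, [0.0, 1.0], vector_name="text"))
```

Since CLIP places text and image embeddings in one shared space, a text query can be compared with the "image" vectors and an image query with the "text" vectors, which is what the searches later in this tutorial do.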

Upload our images with captions to the Collection.

Each image with its caption will create one point in Qdrant.

client.upload_points(
    collection_name="text_image",
    points=[
        models.PointStruct(
            id=idx,  # unique id of the point, defined by the user
            vector={
                "text": texts_embeded[idx],  # embedded caption
                "image": images_embeded[idx],  # embedded image
            },
            payload=doc,  # the caption and the path to the original image
        )
        for idx, doc in enumerate(documents)
    ],
)

Search

Text-to-Image

Let's see what image we get for the query "What would make me energetic in the morning?"

from PIL import Image

# Embed the textual query so that it also becomes a vector.
find_image = text_model.embed(["What would make me energetic in the morning?"])

# query_points replaces the deprecated search method in recent qdrant-client releases.
Image.open(client.query_points(
    collection_name="text_image",  # search in our collection
    query=list(find_image)[0],  # the embedded textual query
    using="image",  # compare the query only against the image vectors
    with_payload=["image"],  # return the image path so we can see which image was found
    limit=1,  # top-1 result most similar to the query
).points[0].payload['image'])

Response:

Coffee Image

Image-to-Text

Now, let's do a reverse search with an image:

from PIL import Image

Image.open('images/piglet.jpg')

Piglet Image

Let's see what caption we get when searching with this piglet image, which, as you can check, is not in our collection.

# Embed the image query.
find_image = image_model.embed(['images/piglet.jpg'])

# query_points replaces the deprecated search method in recent qdrant-client releases.
client.query_points(
    collection_name="text_image",
    query=list(find_image)[0],  # the embedded image query
    using="text",  # compare the query only against the text vectors
    with_payload=["caption"],  # return the caption so we can see which one was found
    limit=1,
).points[0].payload['caption']

Response:

'A photo of a cute pig'

Next steps

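The use cases of image and text multimodal search alone are countless: e-commerce, media management, content recommendation, emotion recognition systems, biomedical image retrieval, spoken sign language transcription, and more.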

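Imagine a scenario: a user wants to find a product similar to a picture they have, but they also have a specific textual requirement, such as "in beige colour". You can search using just the text or just the image, and you can also combine their embeddings in a late fusion manner (summing and weighting can work surprisingly well).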
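One possible way to combine the two embeddings is a weighted sum of the normalized vectors, sketched below in plain Python. This is an assumption about how the fusion could be done, not a recipe prescribed by Qdrant or FastEmbed; the helper names, the `image_weight` parameter and the toy 2-dimensional vectors are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fuse(image_vec, text_vec, image_weight=0.5):
    """Weighted sum of two unit-length embeddings, scaled back to unit length."""
    a = l2_normalize(image_vec)
    b = l2_normalize(text_vec)
    mixed = [image_weight * x + (1.0 - image_weight) * y for x, y in zip(a, b)]
    return l2_normalize(mixed)

# Toy stand-ins for the embedding of a product photo and of the text "in beige colour".
fused_query = fuse([3.0, 0.0], [0.0, 2.0], image_weight=0.5)
print(fused_query)
```

The fused vector could then be used as the query against a named vector such as "image". Normalizing first keeps either modality from dominating merely because its vector is longer, and `image_weight` is a value to tune on your own data. The sketch does not handle the degenerate case where the weighted sum is the zero vector.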

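Moreover, by using Discovery Search with both modalities, you can provide users with information that is impossible to retrieve with a single modality!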

Join our Discord community, where we talk about vector search and similarity learning, run experiments, and have fun!