How to Generate ColBERT Multivectors with FastEmbed

ColBERT

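ColBERT is an embedding model that produces a matrix (multivector) representation of input text, generating one vector per token (a token is a unit of text that is meaningful to a machine learning model). This approach lets ColBERT capture more nuanced input semantics than many dense embedding models, which typically represent an entire input with a single vector. By producing more granular input representations, ColBERT becomes a strong retriever. However, this advantage comes at the cost of higher resource consumption, in both speed and memory, compared to traditional dense embedding models.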

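Although ColBERT is a powerful retriever, its speed limitations can make it less suitable for large-scale retrieval. We therefore generally recommend using ColBERT to rerank a small set of already retrieved examples rather than for first-stage retrieval. A simple dense retriever can fetch roughly 100-500 candidates, which ColBERT can then rerank to bring the most relevant results to the top.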
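The retrieve-then-rerank pattern can be sketched in a few lines. This is a minimal illustration, not FastEmbed or Qdrant code: `retrieve_then_rerank`, `word_overlap`, and `jaccard` are hypothetical names, and the two toy scorers only stand in for a dense retriever (cheap) and ColBERT (more precise, more expensive).

```python
def retrieve_then_rerank(query, documents, cheap_score, precise_score,
                         first_stage_limit=100, final_limit=10):
    """Two-stage search: shortlist with a cheap scorer, reorder with a precise one."""
    # Stage 1: score every document cheaply and keep only a shortlist.
    shortlist = sorted(
        documents, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:first_stage_limit]
    # Stage 2: run the expensive scorer on the shortlist only.
    reranked = sorted(
        shortlist, key=lambda doc: precise_score(query, doc), reverse=True
    )
    return reranked[:final_limit]


def word_overlap(query, doc):
    """Toy stand-in for a dense retriever: number of shared words."""
    return len(set(query.split()) & set(doc.split()))


def jaccard(query, doc):
    """Toy stand-in for a reranker: shared words relative to all words."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)


documents = [
    "a boy and a magical suit of armour",
    "a soldier fights aliens",
    "a magical boy",
    "thieves plan a heist",
]
result = retrieve_then_rerank(
    "magical boy", documents, word_overlap, jaccard,
    first_stage_limit=2, final_limit=1,
)
print(result)
```

The expensive scorer runs on `first_stage_limit` documents rather than the whole collection, which is why the slower model stays affordable in the second stage.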

ColBERT is a solid alternative to cross-encoder reranking models: thanks to its late interaction mechanism, it tends to be faster at inference time.

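How does late interaction work? A cross-encoder processes the query and the document together as a single input. The model splits this input into parts that are meaningful to it and examines how those parts relate to each other. All interaction between the query and the document therefore happens "early", inside the model. A late interaction model such as ColBERT does only the first part: it produces representations of the document and the query that are suitable for comparison. All interaction between these representations is expected to happen "later", outside the model.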
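A common way to carry out that "later" interaction is MaxSim scoring: each query token vector is matched with its most similar document token vector, and the best matches are summed. The sketch below uses plain Python and made-up 2-dimensional vectors; `cosine` and `max_sim` are illustrative helper names, not part of FastEmbed or Qdrant.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_sim(query_vectors, document_vectors):
    """Sum, over query token vectors, of the best match among document token vectors."""
    return sum(
        max(cosine(q, d) for d in document_vectors)
        for q in query_vectors
    )


# Toy token vectors, invented for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has a good match for both query tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]             # matches only the first query token

score_a = max_sim(query, doc_a)
score_b = max_sim(query, doc_b)
print(score_a, score_b)
```

Because the score is a sum over query tokens, it can exceed 1 even when the per-vector metric is cosine similarity.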

Using ColBERT in Qdrant

Qdrant supports multivector representations, so you can use any late interaction model, such as ColBERT or ColPali, in Qdrant without any additional pre- or post-processing.

This tutorial uses ColBERT as a first-stage retriever on a toy dataset. To see how to use ColBERT as a reranker, refer to our multi-stage queries documentation.

Setup

Install fastembed.

pip install fastembed

Import the late interaction model class for text embedding.

from fastembed import LateInteractionTextEmbedding

You can list the late interaction models supported in FastEmbed.

LateInteractionTextEmbedding.list_supported_models()

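This command displays the available models. The output shows details about each model, including the output embedding dimension, model description, model size, model sources, and model file.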

[{'model': 'colbert-ir/colbertv2.0',
  'dim': 128,
  'description': 'Late interaction model',
  'size_in_GB': 0.44,
  'sources': {'hf': 'colbert-ir/colbertv2.0'},
  'model_file': 'model.onnx'},
 {'model': 'answerdotai/answerai-colbert-small-v1',
  'dim': 96,
  'description': 'Text embeddings, Unimodal (text), Multilingual (~100 languages), 512 input tokens truncation, 2024 year',
  'size_in_GB': 0.13,
  'sources': {'hf': 'answerdotai/answerai-colbert-small-v1'},
  'model_file': 'vespa_colbert.onnx'}]

Now, load the model.

embedding_model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0")

The model files will be fetched and downloaded, with progress displayed.

Embedding data

We will vectorize a toy dataset of movie descriptions with ColBERT:

Movie description dataset
descriptions = ["In 1431, Jeanne d'Arc is placed on trial on charges of heresy. The ecclesiastical jurists attempt to force Jeanne to recant her claims of holy visions.",
 "A film projectionist longs to be a detective, and puts his meagre skills to work when he is framed by a rival for stealing his girlfriend's father's pocketwatch.",
 "A group of high-end professional thieves start to feel the heat from the LAPD when they unknowingly leave a clue at their latest heist.",
 "A petty thief with an utter resemblance to a samurai warlord is hired as the lord's double. When the warlord later dies the thief is forced to take up arms in his place.",
 "A young boy named Kubo must locate a magical suit of armour worn by his late father in order to defeat a vengeful spirit from the past.",
 "A biopic detailing the 2 decades that Punjabi Sikh revolutionary Udham Singh spent planning the assassination of the man responsible for the Jallianwala Bagh massacre.",
 "When a machine that allows therapists to enter their patients' dreams is stolen, all hell breaks loose. Only a young female therapist, Paprika, can stop it.",
 "An ordinary word processor has the worst night of his life after he agrees to visit a girl in Soho whom he met that evening at a coffee shop.",
 "A story that revolves around drug abuse in the affluent north Indian State of Punjab and how the youth there have succumbed to it en-masse resulting in a socio-economic decline.",
 "A world-weary political journalist picks up the story of a woman's search for her son, who was taken away from her decades ago after she became pregnant and was forced to live in a convent.",
 "Concurrent theatrical ending of the TV series Neon Genesis Evangelion (1995).",
 "During World War II, a rebellious U.S. Army Major is assigned a dozen convicted murderers to train and lead them into a mass assassination mission of German officers.",
 "The toys are mistakenly delivered to a day-care center instead of the attic right before Andy leaves for college, and it's up to Woody to convince the other toys that they weren't abandoned and to return home.",
 "A soldier fighting aliens gets to relive the same day over and over again, the day restarting every time he dies.",
 "After two male musicians witness a mob hit, they flee the state in an all-female band disguised as women, but further complications set in.",
 "Exiled into the dangerous forest by her wicked stepmother, a princess is rescued by seven dwarf miners who make her part of their household.",
 "A renegade reporter trailing a young runaway heiress for a big story joins her on a bus heading from Florida to New York, and they end up stuck with each other when the bus leaves them behind at one of the stops.",
 "Story of 40-man Turkish task force who must defend a relay station.",
 "Spinal Tap, one of England's loudest bands, is chronicled by film director Marty DiBergi on what proves to be a fateful tour.",
 "Oskar, an overlooked and bullied boy, finds love and revenge through Eli, a beautiful but peculiar girl."]

Vectorization is done with the embed generator function.

descriptions_embeddings = list(
    embedding_model.embed(descriptions)
)

Let's check the size of one of the produced embeddings.

descriptions_embeddings[0].shape

We get the following result:

(48, 128)

This means that the first description is represented by 48 vectors, each of length 128.
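To put that shape in terms of storage, here is a back-of-the-envelope calculation. It assumes each value is stored as a 32-bit float, and the single-vector comparison is a hypothetical dense model with the same dimension, used only to show the scale of the difference.

```python
num_tokens, dim = 48, 128  # shape reported above for the first description
bytes_per_value = 4        # assumption: 32-bit floats

multivector_bytes = num_tokens * dim * bytes_per_value
single_vector_bytes = dim * bytes_per_value  # hypothetical dense model, same dimension

print(multivector_bytes)                         # 24576
print(multivector_bytes // single_vector_bytes)  # 48
```

The multivector footprint grows linearly with the number of tokens, which is one source of the higher memory cost of late interaction models.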

Uploading embeddings to Qdrant

Install qdrant-client.

pip install qdrant-client

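Qdrant Client has a simple in-memory mode that lets you experiment locally on small data volumes. Alternatively, you can use a free cluster in Qdrant Cloud for experiments.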

from qdrant_client import QdrantClient, models

qdrant_client = QdrantClient(":memory:") # Qdrant is running from RAM.

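Now, let's create a small collection with our movie data. For that, we will use the multivector functionality supported in Qdrant. To configure a multivector collection, we need to specify:

  • the similarity metric between individual vectors;
  • the size of each vector (for ColBERT, it is 128);
  • the similarity metric between multivectors (matrices), for example, maximum: for each vector in matrix A we find the most similar vector in matrix B, and the sum of these maximum similarities is the matrix similarity score.
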
qdrant_client.create_collection(
    collection_name="movies",
    vectors_config=models.VectorParams(
        size=128,  # size of each vector produced by ColBERT
        distance=models.Distance.COSINE,  # similarity metric between individual vectors
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM  # similarity metric between multivectors (matrices)
        ),
    ),
)

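To make this collection human-readable, we will save the movie metadata (name, description in text form, and the movie's length) together with the embedded descriptions.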

Movie metadata
metadata = [{"movie_name": "The Passion of Joan of Arc", "movie_watch_time_min": 114, "movie_description": "In 1431, Jeanne d'Arc is placed on trial on charges of heresy. The ecclesiastical jurists attempt to force Jeanne to recant her claims of holy visions."},
{"movie_name": "Sherlock Jr.", "movie_watch_time_min": 45, "movie_description": "A film projectionist longs to be a detective, and puts his meagre skills to work when he is framed by a rival for stealing his girlfriend's father's pocketwatch."},
{"movie_name": "Heat", "movie_watch_time_min": 170, "movie_description": "A group of high-end professional thieves start to feel the heat from the LAPD when they unknowingly leave a clue at their latest heist."},
{"movie_name": "Kagemusha", "movie_watch_time_min": 162, "movie_description": "A petty thief with an utter resemblance to a samurai warlord is hired as the lord's double. When the warlord later dies the thief is forced to take up arms in his place."},
{"movie_name": "Kubo and the Two Strings", "movie_watch_time_min": 101, "movie_description": "A young boy named Kubo must locate a magical suit of armour worn by his late father in order to defeat a vengeful spirit from the past."},
{"movie_name": "Sardar Udham", "movie_watch_time_min": 164, "movie_description": "A biopic detailing the 2 decades that Punjabi Sikh revolutionary Udham Singh spent planning the assassination of the man responsible for the Jallianwala Bagh massacre."},
{"movie_name": "Paprika", "movie_watch_time_min": 90, "movie_description": "When a machine that allows therapists to enter their patients' dreams is stolen, all hell breaks loose. Only a young female therapist, Paprika, can stop it."},
{"movie_name": "After Hours", "movie_watch_time_min": 97, "movie_description": "An ordinary word processor has the worst night of his life after he agrees to visit a girl in Soho whom he met that evening at a coffee shop."},
{"movie_name": "Udta Punjab", "movie_watch_time_min": 148, "movie_description": "A story that revolves around drug abuse in the affluent north Indian State of Punjab and how the youth there have succumbed to it en-masse resulting in a socio-economic decline."},
{"movie_name": "Philomena", "movie_watch_time_min": 98, "movie_description": "A world-weary political journalist picks up the story of a woman's search for her son, who was taken away from her decades ago after she became pregnant and was forced to live in a convent."},
{"movie_name": "Neon Genesis Evangelion: The End of Evangelion", "movie_watch_time_min": 87, "movie_description": "Concurrent theatrical ending of the TV series Neon Genesis Evangelion (1995)."},
{"movie_name": "The Dirty Dozen", "movie_watch_time_min": 150, "movie_description": "During World War II, a rebellious U.S. Army Major is assigned a dozen convicted murderers to train and lead them into a mass assassination mission of German officers."},
{"movie_name": "Toy Story 3", "movie_watch_time_min": 103, "movie_description": "The toys are mistakenly delivered to a day-care center instead of the attic right before Andy leaves for college, and it's up to Woody to convince the other toys that they weren't abandoned and to return home."},
{"movie_name": "Edge of Tomorrow", "movie_watch_time_min": 113, "movie_description": "A soldier fighting aliens gets to relive the same day over and over again, the day restarting every time he dies."},
{"movie_name": "Some Like It Hot", "movie_watch_time_min": 121, "movie_description": "After two male musicians witness a mob hit, they flee the state in an all-female band disguised as women, but further complications set in."},
{"movie_name": "Snow White and the Seven Dwarfs", "movie_watch_time_min": 83, "movie_description": "Exiled into the dangerous forest by her wicked stepmother, a princess is rescued by seven dwarf miners who make her part of their household."},
{"movie_name": "It Happened One Night", "movie_watch_time_min": 105, "movie_description": "A renegade reporter trailing a young runaway heiress for a big story joins her on a bus heading from Florida to New York, and they end up stuck with each other when the bus leaves them behind at one of the stops."},
{"movie_name": "Nefes: Vatan Sagolsun", "movie_watch_time_min": 128, "movie_description": "Story of 40-man Turkish task force who must defend a relay station."},
{"movie_name": "This Is Spinal Tap", "movie_watch_time_min": 82, "movie_description": "Spinal Tap, one of England's loudest bands, is chronicled by film director Marty DiBergi on what proves to be a fateful tour."},
{"movie_name": "Let the Right One In", "movie_watch_time_min": 114, "movie_description": "Oskar, an overlooked and bullied boy, finds love and revenge through Eli, a beautiful but peculiar girl."}]

Upload each embedding together with its metadata as a point:

qdrant_client.upload_points(
    collection_name="movies",
    points=[
        models.PointStruct(
            id=idx,
            payload=metadata[idx],
            vector=vector
        )
        for idx, vector in enumerate(descriptions_embeddings)
    ],
)

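Querying

ColBERT uses two distinct methods for embedding documents and queries, and so does FastEmbed. However, we altered the query pre-processing used in ColBERT, so queries do not have to be truncated at 32 tokens; longer queries are ingested directly.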

query_embedding = list(
    embedding_model.query_embed("A movie for kids with fantasy elements and wonders")
)[0]  # query_embed returns a generator; take its single numpy.ndarray

qdrant_client.query_points(
    collection_name="movies",
    query=query_embedding,
    limit=1,  # number of closest movies to return
    # with_vectors=True,  # uncomment to also return the stored vectors
    with_payload=True,  # include the movie metadata in the output
)

The result is the following:

QueryResponse(points=[ScoredPoint(id=4, version=0, score=12.063469,
payload={'movie_name': 'Kubo and the Two Strings', 'movie_watch_time_min': 101, 
'movie_description': 'A young boy named Kubo must locate a magical suit of armour worn by his late father in order to defeat a vengeful spirit from the past.'},
vector=None, shard_key=None, order_value=None)])