BGE Series#

In this part, we will walk through the BGE series and introduce how to use the BGE embedding models.

1. BAAI General Embedding#

BGE stands for BAAI General Embedding, a series of embedding models developed and published by the Beijing Academy of Artificial Intelligence (BAAI).

Full API support and related usage for BGE are maintained in the FlagEmbedding project on GitHub.

Run the following cell to install FlagEmbedding in your environment.

%%capture
%pip install -U FlagEmbedding
import os 
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
# single GPU is better for small tasks
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

The full collection of BGE models is available in the Huggingface collection.

2. BGE Series Models#

2.1 BGE#

The very first version of BGE contains six models, with 'large', 'base', and 'small' versions for both English and Chinese.

| Model | Language | Parameters | Model Size | Description | Base Model |
|:------|:---------|:-----------|:-----------|:------------|:-----------|
| BAAI/bge-large-en | English | 335M | 1.34 GB | Embedding model that maps text to vectors | BERT |
| BAAI/bge-base-en | English | 109M | 438 MB | A base-scale model with similar ability to bge-large-en | BERT |
| BAAI/bge-small-en | English | 33.4M | 133 MB | A small-scale model with competitive performance | BERT |
| BAAI/bge-large-zh | Chinese | 326M | 1.3 GB | Embedding model that maps text to vectors | BERT |
| BAAI/bge-base-zh | Chinese | 102M | 409 MB | A base-scale model with similar ability to bge-large-zh | BERT |
| BAAI/bge-small-zh | Chinese | 24M | 95.8 MB | A small-scale model with competitive performance | BERT |

To do inference, simply import FlagModel from FlagEmbedding and initialize the model.

from FlagEmbedding import FlagModel

# Load BGE model
model = FlagModel(
    'BAAI/bge-base-en',
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
    query_instruction_format='{}{}',
)

queries = ["query 1", "query 2"]
corpus = ["passage 1", "passage 2"]

# encode the queries and corpus
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode_corpus(corpus)

# compute the similarity scores
scores = q_embeddings @ p_embeddings.T
print(scores)
[[0.84864    0.7946737 ]
 [0.760097   0.85449743]]
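BGE embeddings are L2-normalized (FlagModel's default behavior), so the matrix product above is exactly pairwise cosine similarity. A minimal sketch with made-up toy vectors, not real model outputs:

```python
import numpy as np

# Toy "embeddings": 2 queries and 2 passages in a 4-dim space (made-up numbers)
q = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 2.0]])
p = np.array([[2.0, 4.0, 0.0, 2.0],
              [1.0, 0.0, 1.0, 0.0]])

# L2-normalize each row, as the model does before returning embeddings
q = q / np.linalg.norm(q, axis=1, keepdims=True)
p = p / np.linalg.norm(p, axis=1, keepdims=True)

# Each entry is the cosine similarity of a (query, passage) pair
scores = q @ p.T
print(scores)
```

Since the first passage vector is a scalar multiple of the first query vector, their cosine similarity is exactly 1; all entries lie in [-1, 1].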

For general-purpose encoding, use encode():

FlagModel.encode(sentences, batch_size=256, max_length=512, convert_to_numpy=True)

or encode_corpus(), which directly calls encode():

FlagModel.encode_corpus(corpus, batch_size=256, max_length=512, convert_to_numpy=True)

The encode_queries() function concatenates query_instruction_for_retrieval with each input query to form new sentences, and then feeds them to encode().
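The concatenation itself is plain string formatting. A sketch of what encode_queries() assembles before encoding, using the instruction and the '{}{}' format shown in the cell above:

```python
# Instruction and format string as passed to FlagModel above
instruction = "Represent this sentence for searching relevant passages:"
instruction_format = '{}{}'  # "{instruction}{query}"

queries = ["query 1", "query 2"]

# Prepend the instruction to every query before encoding
full_queries = [instruction_format.format(instruction, q) for q in queries]
print(full_queries[0])
# → Represent this sentence for searching relevant passages:query 1
```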

FlagModel.encode_queries(queries, batch_size=256, max_length=512, convert_to_numpy=True)

2.2 BGE v1.5#

BGE 1.5 alleviates the issue of the similarity distribution and enhances retrieval ability without requiring an instruction.

| Model | Language | Parameters | Model Size | Description | Base Model |
|:------|:---------|:-----------|:-----------|:------------|:-----------|
| BAAI/bge-large-en-v1.5 | English | 335M | 1.34 GB | Version 1.5 with a more reasonable similarity distribution | BERT |
| BAAI/bge-base-en-v1.5 | English | 109M | 438 MB | Version 1.5 with a more reasonable similarity distribution | BERT |
| BAAI/bge-small-en-v1.5 | English | 33.4M | 133 MB | Version 1.5 with a more reasonable similarity distribution | BERT |
| BAAI/bge-large-zh-v1.5 | Chinese | 326M | 1.3 GB | Version 1.5 with a more reasonable similarity distribution | BERT |
| BAAI/bge-base-zh-v1.5 | Chinese | 102M | 409 MB | Version 1.5 with a more reasonable similarity distribution | BERT |
| BAAI/bge-small-zh-v1.5 | Chinese | 24M | 95.8 MB | Version 1.5 with a more reasonable similarity distribution | BERT |

You can use the BGE 1.5 models in exactly the same way as the BGE v1 models.

model = FlagModel(
    'BAAI/bge-base-en-v1.5',
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
    query_instruction_format='{}{}'
)

queries = ["query 1", "query 2"]
corpus = ["passage 1", "passage 2"]

# encode the queries and corpus
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode_corpus(corpus)

# compute the similarity scores
scores = q_embeddings @ p_embeddings.T
print(scores)
pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 2252.58it/s]
pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 3575.71it/s]
[[0.76   0.6714]
 [0.6177 0.7603]]

2.3 BGE M3#

BGE-M3 is the new version of BGE models, distinguished by its versatility in:

  • Multi-Functionality: simultaneously performs the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval.

  • Multi-Linguality: supports more than 100 working languages.

  • Multi-Granularity: handles inputs of different granularities, from short sentences to long documents of up to 8192 tokens.

For more details, feel free to check out the paper.

| Model | Language | Parameters | Model Size | Description | Base Model |
|:------|:---------|:-----------|:-----------|:------------|:-----------|
| BAAI/bge-m3 | Multilingual | 568M | 2.27 GB | Multi-functionality (dense retrieval, sparse retrieval, multi-vector (ColBERT)), multi-linguality, and multi-granularity (8192 tokens) | XLM-RoBERTa |

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences = ["What is BGE M3?", "Defination of BM25"]
Fetching 30 files: 100%|██████████| 30/30 [00:00<00:00, 194180.74it/s]
BGEM3FlagModel.encode(
    sentences, 
    batch_size=12, 
    max_length=8192, 
    return_dense=True, 
    return_sparse=False, 
    return_colbert_vecs=False
)

It returns a dictionary like:

{
    'dense_vecs':       # array of dense embeddings of the inputs if return_dense=True, otherwise None
    'lexical_weights':  # array of dictionaries mapping token ids to their corresponding weights if return_sparse=True, otherwise None
    'colbert_vecs':     # array of multi-vector embeddings of the inputs if return_colbert_vecs=True, otherwise None
}
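The lexical_weights can be scored directly: the lexical matching score of two texts is the sum, over the tokens they share, of the products of their weights (the idea behind BGEM3FlagModel's lexical matching score). A pure-Python sketch with made-up token ids and weights:

```python
# Made-up sparse representations: token id -> weight (not real model output)
q_weights = {'4865': 0.08, '11679': 0.25, '363': 0.27}
p_weights = {'11679': 0.30, '363': 0.10, '90017': 0.25}

def lexical_matching_score(w1, w2):
    # Sum of weight products over the tokens the two texts share
    return sum(w * w2[t] for t, w in w1.items() if t in w2)

print(lexical_matching_score(q_weights, p_weights))
# 0.25*0.30 + 0.27*0.10 = 0.102
```

Tokens that appear in only one of the two texts contribute nothing, which is what makes this a sparse, lexical-style signal.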
# If you don't need such a long length of 8192 input tokens, you can set max_length to a smaller value to speed up encoding.
embeddings = model.encode(
    sentences, 
    max_length=10,
    return_dense=True, 
    return_sparse=True, 
    return_colbert_vecs=True
)
pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 1148.18it/s]
print(f"dense embedding:\n{embeddings['dense_vecs']}")
print(f"sparse embedding:\n{embeddings['lexical_weights']}")
print(f"multi-vector:\n{embeddings['colbert_vecs']}")
dense embedding:
[[-0.03412  -0.04706  -0.00087  ...  0.04822   0.007614 -0.02957 ]
 [-0.01035  -0.04483  -0.02434  ... -0.008224  0.01497   0.011055]]
sparse embedding:
[defaultdict(<class 'int'>, {'4865': np.float16(0.0836), '83': np.float16(0.0814), '335': np.float16(0.1296), '11679': np.float16(0.2517), '276': np.float16(0.1699), '363': np.float16(0.2695), '32': np.float16(0.04077)}), defaultdict(<class 'int'>, {'262': np.float16(0.05014), '5983': np.float16(0.1367), '2320': np.float16(0.04517), '111': np.float16(0.0634), '90017': np.float16(0.2517), '2588': np.float16(0.3333)})]
multi-vector:
[array([[-8.68966337e-03, -4.89266850e-02, -3.03634931e-03, ...,
        -2.21243706e-02,  5.72856329e-02,  1.28355855e-02],
       [-8.92937183e-03, -4.67235669e-02, -9.52814799e-03, ...,
        -3.14785317e-02,  5.39088845e-02,  6.96671568e-03],
       [ 1.84195358e-02, -4.22310382e-02,  8.55499704e-04, ...,
        -1.97946690e-02,  3.84313315e-02,  7.71250250e-03],
       ...,
       [-2.55824160e-02, -1.65533274e-02, -4.21357416e-02, ...,
        -4.50234264e-02,  4.41286489e-02, -1.00052059e-02],
       [ 5.90990965e-07, -5.53734899e-02,  8.51499755e-03, ...,
        -2.29209941e-02,  6.04418293e-02,  9.39912070e-03],
       [ 2.57394509e-03, -2.92690992e-02, -1.89342294e-02, ...,
        -8.04431178e-03,  3.28964666e-02,  4.38723788e-02]], dtype=float32), array([[ 0.01724418,  0.03835401, -0.02309308, ...,  0.00141706,
         0.02995041, -0.05990082],
       [ 0.00996325,  0.03922409, -0.03849588, ...,  0.00591671,
         0.02722516, -0.06510868],
       [ 0.01781915,  0.03925728, -0.01710397, ...,  0.00801776,
         0.03987768, -0.05070014],
       ...,
       [ 0.05478653,  0.00755799,  0.00328444, ..., -0.01648209,
         0.02405782,  0.00363262],
       [ 0.00936953,  0.05028074, -0.02388872, ...,  0.02567679,
         0.00791224, -0.03257877],
       [ 0.01803976,  0.0133922 ,  0.00019365, ...,  0.0184015 ,
         0.01373822,  0.00315539]], dtype=float32)]
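The colbert_vecs are scored by late interaction: for each query token, take its maximum similarity over all passage tokens, then average over the query tokens (the idea behind ColBERT-style scoring in BGE-M3). A numpy sketch with toy per-token vectors, not real model outputs:

```python
import numpy as np

# Toy per-token embeddings (made-up): 3 query tokens, 3 passage tokens, dim 2
q_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
p_vecs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])

def colbert_score(q_vecs, p_vecs):
    sim = q_vecs @ p_vecs.T          # token-to-token similarity matrix
    return sim.max(axis=1).mean()    # max over passage tokens, mean over query tokens

print(colbert_score(q_vecs, p_vecs))
```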

2.4 BGE Multilingual Gemma2#

BGE Multilingual Gemma2 is an LLM-based multilingual embedding model.

| Model | Language | Parameters | Model Size | Description | Base Model |
|:------|:---------|:-----------|:-----------|:------------|:-----------|
| BAAI/bge-multilingual-gemma2 | Multilingual | 9.24B | 37 GB | LLM-based multilingual embedding model with state-of-the-art results on multilingual benchmarks | Gemma2-9B |

from FlagEmbedding import FlagLLMModel

queries = ["how much protein should a female eat", "summit define"]
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]

model = FlagLLMModel('BAAI/bge-multilingual-gemma2', 
                     query_instruction_for_retrieval="Given a web search query, retrieve relevant passages that answer the query.",
                     use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

embeddings_1 = model.encode_queries(queries)
embeddings_2 = model.encode_corpus(documents)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00,  6.34it/s]
pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 816.49it/s]
pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 718.33it/s]
[[0.559     0.01685  ]
 [0.0008683 0.5015   ]]

2.5 BGE ICL#

BGE ICL stands for in-context learning. By providing few-shot examples in the query, it can significantly enhance the model's ability to handle new tasks.

| Model | Language | Parameters | Model Size | Description | Base Model |
|:------|:---------|:-----------|:-----------|:------------|:-----------|
| BAAI/bge-en-icl | English | 7.11B | 28.5 GB | LLM-based English embedding model with excellent in-context learning ability | Mistral-7B |

documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]

examples = [
    {
        'instruct': 'Given a web search query, retrieve relevant passages that answer the query.',
        'query': 'what is a virtual interface',
        'response': "A virtual interface is a software-defined abstraction that mimics the behavior and characteristics of a physical network interface. It allows multiple logical network connections to share the same physical network interface, enabling efficient utilization of network resources. Virtual interfaces are commonly used in virtualization technologies such as virtual machines and containers to provide network connectivity without requiring dedicated hardware. They facilitate flexible network configurations and help in isolating network traffic for security and management purposes."
    },
    {
        'instruct': 'Given a web search query, retrieve relevant passages that answer the query.',
        'query': 'causes of back pain in female for a week',
        'response': "Back pain in females lasting a week can stem from various factors. Common causes include muscle strain due to lifting heavy objects or improper posture, spinal issues like herniated discs or osteoporosis, menstrual cramps causing referred pain, urinary tract infections, or pelvic inflammatory disease. Pregnancy-related changes can also contribute. Stress and lack of physical activity may exacerbate symptoms. Proper diagnosis by a healthcare professional is crucial for effective treatment and management."
    }
]
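Internally, each few-shot example is rendered into the query prompt as plain text using a template like `<instruct>{}\n<query>{}\n<response>{}` (the examples_instruction_format accepted by FlagICLModel). A hypothetical sketch of that rendering step; the model's exact prompt assembly may differ:

```python
# One abbreviated example in the same shape as the `examples` list above
examples = [
    {
        'instruct': 'Given a web search query, retrieve relevant passages that answer the query.',
        'query': 'what is a virtual interface',
        'response': 'A virtual interface is a software-defined abstraction ...',
    },
]

# Template in the style of FlagICLModel's examples_instruction_format
example_format = "<instruct>{}\n<query>{}\n<response>{}"

# Render every example and join them into a single few-shot prefix
rendered = "\n\n".join(
    example_format.format(e['instruct'], e['query'], e['response']) for e in examples
)
print(rendered)
```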

queries = ["how much protein should a female eat", "summit define"]
from FlagEmbedding import FlagICLModel
import os

model = FlagICLModel('BAAI/bge-en-icl', 
                     examples_for_task=examples,  # set `examples_for_task=None` to use model without examples
                    #  examples_instruction_format="<instruct>{}\n<query>{}\n<response>{}" # specify the format to use examples_for_task
                     )

embeddings_1 = model.encode_queries(queries)
embeddings_2 = model.encode_corpus(documents)
similarity = embeddings_1 @ embeddings_2.T

print(similarity)
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  6.55it/s]
pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 366.09it/s]
pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 623.69it/s]
[[0.6064 0.3018]
 [0.257  0.537 ]]