Apr 23, 2025

Robust Question Answering with Chroma and OpenAI


This notebook guides you step by step through using Chroma, an open-source embeddings database, along with OpenAI's text embeddings and chat completion APIs, to answer questions about a collection of documents.

Additionally, this notebook demonstrates some of the tradeoffs involved in making a question answering system more robust. As we shall see, simple querying doesn't always produce the best results!

Question Answering with LLMs

Large language models (LLMs) like OpenAI's ChatGPT can be used to answer questions about data that the model may not have been trained on, or have access to. For example:

  • Personal data like e-mails and notes
  • Highly specialized data like archival or legal documents
  • Newly created data like recent news stories

In order to overcome this limitation, we can use a data store that is amenable to querying in natural language, just like the LLM itself. An embeddings store like Chroma represents documents as embeddings, alongside the documents themselves.

By embedding a text query, Chroma can find relevant documents, which we can then pass to the LLM to answer our question. We'll show detailed examples and variants of this approach.

%pip install -qU openai chromadb pandas
Note: you may need to restart the kernel to use updated packages.

We use OpenAI's APIs throughout this notebook. You can get an API key from https://beta.openai.com/account/api-keys

You can add your API key as an environment variable by executing the command export OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx in a terminal. Note that you will need to reload the notebook if the environment variable wasn't set yet. Alternatively, you can set it in the notebook, as shown below.

import os
from openai import OpenAI

# Uncomment the following line to set the environment variable in the notebook
# os.environ["OPENAI_API_KEY"] = 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

api_key = os.getenv("OPENAI_API_KEY")

if api_key:
    client = OpenAI(api_key=api_key)
    print("OpenAI client is ready")
else:
    print("OPENAI_API_KEY environment variable not found")
OpenAI client is ready
# Set the model for all API calls
OPENAI_MODEL = "gpt-4o"

The Dataset

For this notebook, we use the SciFact dataset. This is a curated dataset of expert-annotated scientific claims, with an accompanying text corpus of paper titles and abstracts. Each claim may be supported, contradicted, or not have enough evidence either way, according to the documents in the corpus.

Having the corpus available as ground truth allows us to investigate how well the following approaches to LLM-assisted question answering perform.

# Load the claim dataset
import pandas as pd

data_path = '../../data'

claim_df = pd.read_json(f'{data_path}/scifact_claims.jsonl', lines=True)
claim_df.head()
	id	claim	evidence	cited_doc_ids
0	1	0-dimensional biomaterials show inductive prop...	{}	[31715818]
1	3	1,000 genomes project enables mapping of genet...	{'14717500': [{'sentences': [2, 5], 'label': '...	[14717500]
2	5	1/2000 in UK have abnormal PrP positivity.	{'13734012': [{'sentences': [4], 'label': 'SUP...	[13734012]
3	13	5% of perinatal mortality is due to low birth ...	{}	[1606628]
4	36	A deficiency of vitamin B12 increases blood le...	{}	[5152028, 11705328]

Just asking the model

ChatGPT was trained on a large amount of scientific information. As a baseline, we'd like to understand what the model already knows without any further context. This will allow us to calibrate overall performance.

We construct an appropriate prompt, with some example facts, then query the model with each claim in the dataset. We ask the model to assess a claim as 'True', 'False', or 'NEE' if there is not enough evidence one way or the other.

def build_prompt(claim):
    return [
        {"role": "system", "content": "I will ask you to assess a scientific claim. Output only the text 'True' if the claim is true, 'False' if the claim is false, or 'NEE' if there's not enough evidence."},
        {"role": "user", "content": f"""
Example:

Claim:
0-dimensional biomaterials show inductive properties.

Assessment:
False

Claim:
1/2000 in UK have abnormal PrP positivity.

Assessment:
True

Claim:
Aspirin inhibits the production of PGE2.

Assessment:
False

End of examples. Assess the following claim:

Claim:
{claim}

Assessment:
"""}
    ]


def assess_claims(claims):
    responses = []
    # Query the OpenAI API
    for claim in claims:
        response = client.chat.completions.create(
            model=OPENAI_MODEL,
            messages=build_prompt(claim),
            max_tokens=3,
        )
        # Strip any punctuation or whitespace from the response
        responses.append(response.choices[0].message.content.strip('., '))

    return responses

We sample 50 claims from the dataset.

# Let's take a look at 50 claims
samples = claim_df.sample(50)

claims = samples['claim'].tolist()

We evaluate the ground-truth according to the dataset. From the dataset description, each claim is either supported or contradicted by the evidence, or else there isn't enough evidence either way.

def get_groundtruth(evidence):
    groundtruth = []
    for e in evidence:
        # Evidence is empty
        if len(e) == 0:
            groundtruth.append('NEE')
        else:
            # In this dataset, all evidence for a given claim is consistent, either SUPPORT or CONTRADICT
            if list(e.values())[0][0]['label'] == 'SUPPORT':
                groundtruth.append('True')
            else:
                groundtruth.append('False')
    return groundtruth
evidence = samples['evidence'].tolist()
groundtruth = get_groundtruth(evidence)

We also output the confusion matrix, comparing the model's assessments with the ground truth, as an easy-to-read table.

def confusion_matrix(inferred, groundtruth):
    assert len(inferred) == len(groundtruth)
    confusion = {
        'True': {'True': 0, 'False': 0, 'NEE': 0},
        'False': {'True': 0, 'False': 0, 'NEE': 0},
        'NEE': {'True': 0, 'False': 0, 'NEE': 0},
    }
    for i, g in zip(inferred, groundtruth):
        confusion[i][g] += 1

    # Pretty print the confusion matrix
    print('\tGroundtruth')
    print('\tTrue\tFalse\tNEE')
    for i in confusion:
        print(i, end='\t')
        for g in confusion[i]:
            print(confusion[i][g], end='\t')
        print()

    return confusion

We ask the model to directly assess the claims, without additional context.

gpt_inferred = assess_claims(claims)
confusion_matrix(gpt_inferred, groundtruth)
	Groundtruth
	True	False	NEE
True	9	3	15	
False	0	3	2	
NEE	8	6	4	
{'True': {'True': 9, 'False': 3, 'NEE': 15},
 'False': {'True': 0, 'False': 3, 'NEE': 2},
 'NEE': {'True': 8, 'False': 6, 'NEE': 4}}

Results

From these results we see that the LLM is strongly biased to assess claims as true, even when they are false, and also tends to assess false claims as not having enough evidence. Note that 'not enough evidence' is with respect to the model's assessment of the claim in a vacuum, without additional context.
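
To summarize performance with a single number, we can also compute overall accuracy against the ground truth. This small helper is an addition for convenience, not part of the original evaluation; we reuse it later when tuning the distance threshold.

def accuracy(inferred, groundtruth):
    # Fraction of claims where the model's label matches the ground-truth label
    return sum(i == g for i, g in zip(inferred, groundtruth)) / len(groundtruth)

print(f"Accuracy without context: {accuracy(gpt_inferred, groundtruth):.2f}")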

Adding context

We now add the additional context available from the corpus of paper titles and abstracts. This section shows how to load a text corpus into Chroma, using OpenAI text embeddings.

First, we load the text corpus.

# Load the corpus into a dataframe
corpus_df = pd.read_json(f'{data_path}/scifact_corpus.jsonl', lines=True)
corpus_df.head()
	doc_id	title	abstract	structured
0	4983	Microstructural development of human newborn c...	[Alterations of the architecture of cerebral w...	False
1	5836	Induction of myelodysplasia by myeloid-derived...	[Myelodysplastic syndromes (MDS) are age-depen...	False
2	7912	BC1 RNA, the transcript from a master gene for...	[ID elements are short interspersed elements (...	False
3	18670	The DNA Methylome of Human Peripheral Blood Mo...	[DNA methylation plays an important role in bi...	False
4	19238	The human myelin basic protein gene is include...	[Two human Golli (for gene expressed in the ol...	False

Loading the corpus into Chroma

The next step is to load the corpus into Chroma. Given an embedding function, Chroma automatically handles embedding each document and stores it alongside its text and metadata, making it simple to query.

We instantiate an (ephemeral) Chroma client, and create a collection for the SciFact title and abstract corpus. Chroma can also be instantiated in a persisted configuration; learn more at the Chroma docs.
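
For example, a persistent client can be created as follows (a minimal sketch; the storage path is an arbitrary choice):

import chromadb

# A persisted Chroma client; collections are saved to and reloaded from disk.
# The path is arbitrary here - any writable directory works.
persistent_client = chromadb.PersistentClient(path="./chroma_data")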

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# We initialize an embedding function, and provide it to the collection.
embedding_function = OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"))

chroma_client = chromadb.Client() # Ephemeral by default
scifact_corpus_collection = chroma_client.create_collection(name='scifact_corpus', embedding_function=embedding_function)

Next we load the corpus into Chroma. Because this data loading is memory intensive, we recommend a batched loading scheme, in batches of 50-1000. For this example, loading the entire corpus should take just over a minute. It's being embedded automatically in the background, using the embedding_function we specified earlier.

batch_size = 100

for i in range(0, len(corpus_df), batch_size):
    batch_df = corpus_df[i:i+batch_size]
    scifact_corpus_collection.add(
        ids=batch_df['doc_id'].apply(lambda x: str(x)).tolist(), # Chroma takes string IDs.
        documents=(batch_df['title'] + '. ' + batch_df['abstract'].apply(lambda x: ' '.join(x))).to_list(), # We concatenate the title and abstract.
        metadatas=[{"structured": structured} for structured in batch_df['structured'].to_list()] # We also store the metadata, though we don't use it in this example.
    )

Retrieving context

Next, we retrieve documents from the corpus which may be relevant to each claim in our sample. We want to provide these as context to the LLM for evaluating the claims. We retrieve the 3 most relevant documents for each claim, according to the embedding distance.

claim_query_result = scifact_corpus_collection.query(query_texts=claims, include=['documents', 'distances'], n_results=3)

We create a new prompt, this time taking into account the additional context we retrieved from the corpus.

def build_prompt_with_context(claim, context):
    return [{'role': 'system', 'content': "I will ask you to assess a particular scientific claim, based on the evidence provided. Output only the text 'True' if the claim is true, 'False' if the claim is false, or 'NEE' if there's not enough evidence."},
            {'role': 'user', 'content': f"""
The evidence is the following:

{' '.join(context)}

Assess the following claim on the basis of the evidence. Output only the text 'True' if the claim is true, 'False' if the claim is false, or 'NEE' if there's not enough evidence. Do not output any other text.

Claim:
{claim}

Assessment:
"""}]


def assess_claims_with_context(claims, contexts):
    responses = []
    # Query the OpenAI API
    for claim, context in zip(claims, contexts):
        # If no evidence is provided, return NEE
        if len(context) == 0:
            responses.append('NEE')
            continue
        response = client.chat.completions.create(
            model=OPENAI_MODEL,
            messages=build_prompt_with_context(claim=claim, context=context),
            max_tokens=3,
        )
        # Strip any punctuation or whitespace from the response
        responses.append(response.choices[0].message.content.strip('., '))

    return responses

Then we ask the model to assess the claims with the retrieved context.

gpt_with_context_evaluation = assess_claims_with_context(claims, claim_query_result['documents'])
confusion_matrix(gpt_with_context_evaluation, groundtruth)
	Groundtruth
	True	False	NEE
True	13	1	4	
False	1	10	2	
NEE	3	1	15	
{'True': {'True': 13, 'False': 1, 'NEE': 4},
 'False': {'True': 1, 'False': 10, 'NEE': 2},
 'NEE': {'True': 3, 'False': 1, 'NEE': 15}}

Results

We see that the model performs better overall, and is now much better at correctly identifying false claims. Additionally, most of the NEE cases are now correctly identified as well.

Looking at the retrieved documents, we see that they are sometimes not relevant to the claim. This causes the model to be confused by the extra information, and it may decide that sufficient evidence is present even when the information is irrelevant. This happens because we always ask for the 3 'most' relevant documents, but beyond a certain point these documents may not be relevant at all.
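
We can see this directly by printing the retrieved documents and their distances for one of the sampled claims (an illustrative check; your sample, and hence the output, will differ):

# Show the retrieved documents and their embedding distances for the first claim.
# Larger distances indicate less relevant documents.
print(f"Claim: {claims[0]}")
for doc, distance in zip(claim_query_result['documents'][0], claim_query_result['distances'][0]):
    print(f"{distance:.4f}\t{doc[:100]}...")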

Filtering context on relevance

Along with the documents themselves, Chroma returns a distance score. We can try thresholding on distance, so that fewer irrelevant documents make it into the context we provide the model.

If, after filtering on the threshold, no context documents remain, we bypass the model and simply return that there is not enough evidence.

def filter_query_result(query_result, distance_threshold=0.25):
    # For each query result, retain only the documents whose distance is below the threshold
    for ids, docs, distances in zip(query_result['ids'], query_result['documents'], query_result['distances']):
        for i in range(len(ids)-1, -1, -1):
            if distances[i] > distance_threshold:
                ids.pop(i)
                docs.pop(i)
                distances.pop(i)
    return query_result
filtered_claim_query_result = filter_query_result(claim_query_result)

Now we assess the claims using this cleaner context.

gpt_with_filtered_context_evaluation = assess_claims_with_context(claims, filtered_claim_query_result['documents'])
confusion_matrix(gpt_with_filtered_context_evaluation, groundtruth)
	Groundtruth
	True	False	NEE
True	9	0	1	
False	0	7	0	
NEE	8	5	20	
{'True': {'True': 9, 'False': 0, 'NEE': 1},
 'False': {'True': 0, 'False': 7, 'NEE': 0},
 'NEE': {'True': 8, 'False': 5, 'NEE': 20}}

Results

The model now assesses many fewer claims as True or False when there is not enough evidence present. However, it is also now much more cautious, biasing away from certainty: most claims are now assessed as having not enough evidence, because a large fraction of the retrieved documents are filtered out by the distance threshold. It's possible to tune the distance threshold to find the optimal operating point, but this can be difficult, and is dataset and embedding model dependent.
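
As a rough illustration of what such tuning could look like, here is a sketch of a threshold sweep (an addition, not part of the original notebook). It reuses the accuracy helper defined earlier, and deep-copies the query result because filter_query_result mutates its input. Note that it makes one chat completion call per claim per threshold, so it costs API usage:

import copy

# Query once, then evaluate a few candidate thresholds against the ground truth.
unfiltered_result = scifact_corpus_collection.query(query_texts=claims, include=['documents', 'distances'], n_results=3)

for threshold in [0.20, 0.25, 0.30, 0.35]:
    # Filter a fresh copy each time, since filter_query_result modifies its argument in place
    filtered = filter_query_result(copy.deepcopy(unfiltered_result), distance_threshold=threshold)
    inferred = assess_claims_with_context(claims, filtered['documents'])
    print(f"threshold={threshold:.2f}: accuracy={accuracy(inferred, groundtruth):.2f}")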

Hypothetical Document Embeddings: Using hallucinations productively

We want to be able to retrieve relevant documents, without also retrieving less relevant ones which might confuse the model. One way to accomplish this is to improve the retrieval query.

Until now, we have queried the dataset using claims, which are single-sentence statements, while the corpus contains abstracts describing scientific papers. Intuitively, while these might be related, there are significant differences in their structure and meaning. These differences are encoded by the embedding model, and so influence the distances between the query and the most relevant results.

We can overcome this by using the power of LLMs to generate relevant text. While the generated facts might be hallucinated, the content and structure of the documents the model generates are more similar to the documents in our corpus than the query sentence is. This could lead to better queries, and hence better results.

This approach is called Hypothetical Document Embeddings (HyDE), and has been shown to perform quite well on retrieval tasks. It should help us bring more relevant information into the context, without polluting it.

TL;DR:

  • Embedding a whole abstract produces better matches than embedding a single sentence
  • But claims are single sentences
  • HyDE shows that expanding claims into hallucinated abstracts with GPT3, then searching with those abstracts (claim → abstract → results), works better than searching with the claims directly (claim → results)

First, we prompt the model with in-context examples to generate a document similar to those in the corpus, for each claim we want to assess.

def build_hallucination_prompt(claim):
    return [{'role': 'system', 'content': """I will ask you to write an abstract for a scientific paper which supports or refutes a given claim. It should be written in scientific language, include a title. Output only one abstract, then stop.

    An Example:

    Claim:
    A high microerythrocyte count raises vulnerability to severe anemia in homozygous alpha (+)- thalassemia trait subjects.

    Abstract:
    BACKGROUND The heritable haemoglobinopathy alpha(+)-thalassaemia is caused by the reduced synthesis of alpha-globin chains that form part of normal adult haemoglobin (Hb). Individuals homozygous for alpha(+)-thalassaemia have microcytosis and an increased erythrocyte count. Alpha(+)-thalassaemia homozygosity confers considerable protection against severe malaria, including severe malarial anaemia (SMA) (Hb concentration < 50 g/l), but does not influence parasite count. We tested the hypothesis that the erythrocyte indices associated with alpha(+)-thalassaemia homozygosity provide a haematological benefit during acute malaria.
    METHODS AND FINDINGS Data from children living on the north coast of Papua New Guinea who had participated in a case-control study of the protection afforded by alpha(+)-thalassaemia against severe malaria were reanalysed to assess the genotype-specific reduction in erythrocyte count and Hb levels associated with acute malarial disease. We observed a reduction in median erythrocyte count of approximately 1.5 x 10(12)/l in all children with acute falciparum malaria relative to values in community children (p < 0.001). We developed a simple mathematical model of the linear relationship between Hb concentration and erythrocyte count. This model predicted that children homozygous for alpha(+)-thalassaemia lose less Hb than children of normal genotype for a reduction in erythrocyte count of >1.1 x 10(12)/l as a result of the reduced mean cell Hb in homozygous alpha(+)-thalassaemia. In addition, children homozygous for alpha(+)-thalassaemia require a 10% greater reduction in erythrocyte count than children of normal genotype (p = 0.02) for Hb concentration to fall to 50 g/l, the cutoff for SMA. We estimated that the haematological profile in children homozygous for alpha(+)-thalassaemia reduces the risk of SMA during acute malaria compared to children of normal genotype (relative risk 0.52; 95% confidence interval [CI] 0.24-1.12, p = 0.09).
    CONCLUSIONS The increased erythrocyte count and microcytosis in children homozygous for alpha(+)-thalassaemia may contribute substantially to their protection against SMA. A lower concentration of Hb per erythrocyte and a larger population of erythrocytes may be a biologically advantageous strategy against the significant reduction in erythrocyte count that occurs during acute infection with the malaria parasite Plasmodium falciparum. This haematological profile may reduce the risk of anaemia by other Plasmodium species, as well as other causes of anaemia. Other host polymorphisms that induce an increased erythrocyte count and microcytosis may confer a similar advantage.

    End of example.

    """}, {'role': 'user', 'content': f""""
    Perform the task for the following claim.

    Claim:
    {claim}

    Abstract:
    """}]


def hallucinate_evidence(claims):
    responses = []
    # Query the OpenAI API
    for claim in claims:
        response = client.chat.completions.create(
            model=OPENAI_MODEL,
            messages=build_hallucination_prompt(claim),
        )
        responses.append(response.choices[0].message.content)
    return responses

We hallucinate a document for each claim.

NOTE: This can take a while, about 7 minutes for 100 claims. You can reduce the number of claims we want to assess to get results faster.

hallucinated_evidence = hallucinate_evidence(claims)
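
If the sequential loop is too slow, the API calls can be issued concurrently instead. Here is a minimal sketch using a thread pool, assuming your OpenAI rate limits allow a handful of simultaneous requests:

from concurrent.futures import ThreadPoolExecutor

def hallucinate_evidence_parallel(claims, max_workers=8):
    # Same request as hallucinate_evidence, but issued concurrently
    def hallucinate_one(claim):
        response = client.chat.completions.create(
            model=OPENAI_MODEL,
            messages=build_hallucination_prompt(claim),
        )
        return response.choices[0].message.content
    # pool.map preserves the input order of the claims
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(hallucinate_one, claims))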

We use the hallucinated documents as queries into the corpus, and filter the results using the same distance threshold.

hallucinated_query_result = scifact_corpus_collection.query(query_texts=hallucinated_evidence, include=['documents', 'distances'], n_results=3)
filtered_hallucinated_query_result = filter_query_result(hallucinated_query_result)
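
As a quick sanity check, we can compare the distances of the documents that survived the threshold for the first claim, when querying with the raw claim versus its hallucinated abstract (illustrative only; note that claim_query_result was also filtered in place earlier):

# Compare surviving retrieval distances under the two query styles.
print("Raw claim query distances:", claim_query_result['distances'][0])
print("Hallucinated abstract query distances:", filtered_hallucinated_query_result['distances'][0])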

We then ask the model to assess the claims, using the new context.

gpt_with_hallucinated_context_evaluation = assess_claims_with_context(claims, filtered_hallucinated_query_result['documents'])
confusion_matrix(gpt_with_hallucinated_context_evaluation, groundtruth)
	Groundtruth
	True	False	NEE
True	13	0	3	
False	1	10	1	
NEE	3	2	17	
{'True': {'True': 13, 'False': 0, 'NEE': 3},
 'False': {'True': 1, 'False': 10, 'NEE': 1},
 'NEE': {'True': 3, 'False': 2, 'NEE': 17}}

Results

Combining HyDE with a simple distance threshold leads to a significant improvement. The model is no longer biased toward assessing claims as True, nor toward there not being enough evidence. It also correctly assesses when there isn't enough evidence more often.

Conclusion

Equipping LLMs with context from a corpus of documents is a powerful technique for bringing the general reasoning and natural language interactions of LLMs to your own data. However, it's important to know that naive query and retrieval may not produce the best possible results! Ultimately, understanding the data itself is what allows you to get the most out of the retrieval-based approach to question answering.