October 14, 2024

Using a custom LLM-as-a-judge with Braintrust to detect hallucinations


Suppose you're building a customer service bot and trying to evaluate the quality of its responses. Consider a question like "What is your return policy?" If the correct answer is "You can return items within 30 days of purchase" but your bot answers "You can return it within 30 days", how would you evaluate whether this is a good response?

A heuristic like Levenshtein string distance would suggest the response is incorrect. A better approach is to use an LLM-as-a-judge, a technique that uses an LLM to score the quality of an answer. LLMs can reason about language beyond surface-level string comparisons, which allows them to assess answers more accurately.
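
To make this concrete, here is a minimal sketch (not part of the original recipe) that scores the return-policy example above with a plain Levenshtein similarity; the levenshtein helper and the example strings are illustrative assumptions.

# A standard dynamic-programming Levenshtein distance, written inline to avoid
# extra dependencies. Illustrative only.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(
                min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution
                )
            )
        prev = curr
    return prev[-1]


expected = "You can return items within 30 days of purchase"
output = "You can return it within 30 days"
similarity = 1 - levenshtein(expected, output) / max(len(expected), len(output))
print(round(similarity, 2))  # ~0.68, a middling score even though the answer is fine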

In this tutorial, we'll walk through building an LLM-as-a-judge scorer that detects hallucinations, using Braintrust, a third-party evaluation platform that is compatible with OpenAI's models.

Install dependencies

Let's install a few basic dependencies. We'll use the CoQA dataset (via DuckDB), Braintrust for evals, and OpenAI's models. Note that Braintrust is a third-party evaluation platform, and you should review its terms of service and privacy policy before proceeding.

%pip install autoevals duckdb braintrust openai --quiet
Note: you may need to restart the kernel to use updated packages.

Next, let's initialize the OpenAI client. We'll use the AsyncOpenAI client so we can parallelize our requests. The braintrust.wrap_openai function wraps the OpenAI client to log LLM calls to Braintrust, which we'll use to help with our evals later. Before proceeding, you'll need to sign up for a Braintrust account and set a valid BRAINTRUST_API_KEY environment variable.

import os

import braintrust
from openai import AsyncOpenAI

braintrust.login(api_key=os.environ["BRAINTRUST_API_KEY"])
client = braintrust.wrap_openai(AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"]))

Explore the dataset

We'll use the CoQA dataset, which contains a diverse set of passages, questions, and answers. Since CoQA is quite large, we'll just look at the first several passages. As with any public dataset, there's a chance the underlying LLMs have memorized aspects of it, so when developing your own scorers, it's best to test them on your own private data.

import duckdb

# DuckDB has an easy wrapper for loading datasets from Hugging Face.
con = duckdb.connect(":memory:")
full_result = con.query("""
    SELECT * FROM 'hf://datasets/stanfordnlp/coqa/data/validation-00000-of-00001.parquet'
        LIMIT 40
""").fetchall()

single_result = full_result[10]

print("Passage:")
print(single_result[1])

print("\nQuestion:")
print(single_result[2][0])

print("\nAnswer:")
print(single_result[3]["input_text"][0])
Passage:
(CNN)A chiseled boxer's Instagram feed shows him making constant references to the Bible and enjoying gospel singing with his wife. 

Another features his formidable opponent counting stacks of money, hanging out in strip clubs, and flashing diamond watches and Ferraris. 

Welcome to the world of boxing promotion, circa 2015. 

American Floyd Mayweather and Filipino Manny Pacquiao are set to officially announce their heavily anticipated boxing match at a press conference in Los Angeles Wednesday. 

With the combined purse for the May 2 bout in Las Vegas reported to touch $300 million pending viewership numbers, the incentives to self-promote could not be higher. 

"Nowadays you have to be on social media to launch the fight and to build hype," says boxing promoter Nisse Sauerland, CEO of Team Sauerland. "It couldn't be done without it." 

Thirty-eight year old Mayweather (47-0, 26 knockouts), who favors the moniker "The Money Man" or "TBE" (The Best Ever), boasts nearly five million Instagram followers, 5.65 million followers on Twitter and 9.2 million Facebook likes. 

He famously confirmed the fight via Shots, a photo sharing social media application that he's invested in, and displays links to his clothing brand, The Money Team, on all his accounts. 

Along with professing to the be the best fighter of all time, he could also stake a claim to be one of the greatest social media users in sports. 

"I think they're both playing their roles," says Sauerland, who promotes over 45 boxers. "You've got the bad guy and the good guy, really. You've got the guy who throws the money around (Mayweather), that's his image, and Pacquiao, he's the hope of a nation." 

Question:
Who are the two boxer featured in this article?

Answer:
Floyd Mayweather and Manny Pacquiao

The data contains a series of passages, each with several questions and answers. Let's flatten it into a list of (passage, question, answer) tuples.

from dataclasses import dataclass


@dataclass
class QuestionAnswer:
    passage: str
    question: str
    expected_answer: str
    generated_answer: str


qa_pairs = [
    QuestionAnswer(
        passage=r[1],
        question=question,
        generated_answer=r[3]["input_text"][i],
        expected_answer=r[3]["input_text"][i],
    )
    for r in full_result
    for (i, question) in enumerate(r[2])
]

print(len(qa_pairs))
629

Add hallucinations

Since our scorers are meant to detect hallucinations, we can use the question-answer pairs to generate known hallucinations. We'll create hallucinated answers by asking an LLM to confidently make up an answer to each question without consulting the passage.

import asyncio
import random

random.seed(42)


async def hallucinate_answer(qa):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """\
You are a helpful hallucinating assistant, who makes up fake answers to questions.

Answer the following question in 1 sentence. If you know the answer, then make up some fake
superfluous details that are not in the passage you have memorized.

Make sure to always answer it confidently, even if you don't know the answer. Do not use words
like "perhaps", "likely", "maybe", etc. or punctuation like "...".Do not admit that you cannot
or do not know the answer.""",
            },
            {"role": "user", "content": qa.question},
        ],
        temperature=1,
        max_tokens=100,
    )
    return response.choices[0].message.content


hallucinated_answers = await asyncio.gather(
    *[hallucinate_answer(qa) for qa in qa_pairs]
)


hallucinations = [
    QuestionAnswer(
        passage=qa.passage,
        question=qa.question,
        expected_answer=qa.expected_answer,
        generated_answer=hallucination,
    )
    for (qa, hallucination) in zip(qa_pairs, hallucinated_answers)
    # Crude filter for simple yes/no answers: skip any answer that contains "yes" or "no".
    if "yes" not in hallucination.lower() and "no" not in hallucination.lower()
]

print("Passage:")
print(hallucinations[0].passage)
print("\nQuestion:")
print(hallucinations[0].question)
print("\nExpected Answer:")
print(hallucinations[0].expected_answer)
print("\nGenerated Answer:")
print(hallucinations[0].generated_answer)

print("\n\nNumber of hallucinations:", len(hallucinations))
Passage:
Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton's mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer's orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing. 

"What are you doing, Cotton?!" 

"I only wanted to be more like you". 

Cotton's mommy rubbed her face on Cotton's and said "Oh Cotton, but your fur is so pretty and special, like you. We would never want you to be any other way". And with that, Cotton's mommy picked her up and dropped her into a big bucket of water. When Cotton came out she was herself again. Her sisters licked her face until Cotton's fur was all all dry. 

"Don't ever do that again, Cotton!" they all cried. "Next time you might mess up that pretty white fur of yours and we wouldn't want that!" 

Then Cotton thought, "I change my mind. I like being special".

Question:
Where did she live?

Expected Answer:
in a barn

Generated Answer:
She lived in a quaint cottage on the edge of the Misty Hollow Forest, where elves and talking owls often hosted moonlit storytelling festivals.


Number of hallucinations: 270

Create the scorers

We'll explore a few popular methods for creating an LLM-as-a-judge. For each one, we'll create a scorer and then "meta-evaluate" it to see how well it performs. Since we know the hallucinated answers are incorrect, we'll assess each scorer's quality by testing how often it gives a hallucinated answer a score of 0.

LLM-as-a-judge #1: Numeric rater

When creating an LLM-as-a-judge, a common initial instinct is to ask the LLM to rate the answer on a scale of 1 to 5. The advantage of this approach is that it's easy to convert the LLM's output into a numeric score.

We'll use a modified version of the Factuality template, but ask the LLM to rate the answer on a scale of 1 to 10.

import json

PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {input}
************
[Expert]: {expected}
************
[Submission]: {output}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
Rate the submission on a scale of 1 to 10.
"""


@braintrust.traced
async def numeric_rater(input, output, expected):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": PROMPT.format(input=input, output=output, expected=expected),
            }
        ],
        temperature=0,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "rate",
                    "description": "Rate the submission on a scale of 1 to 10.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "rating": {"type": "integer", "minimum": 1, "maximum": 10},
                        },
                        "required": ["rating"],
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "rate"}},
    )
    arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    # Map the 1-10 rating onto a 0-1 score.
    return (arguments["rating"] - 1) / 9


print(qa_pairs[10].question, "On a correct answer:", qa_pairs[10].generated_answer)
print(
    await numeric_rater(
        qa_pairs[10].question,
        qa_pairs[10].generated_answer,
        qa_pairs[10].expected_answer,
    )
)

print(
    hallucinations[10].question,
    "On a hallucinated answer:",
    hallucinations[10].generated_answer,
)
print(
    await numeric_rater(
        hallucinations[10].question,
        hallucinations[10].generated_answer,
        hallucinations[10].expected_answer,
    )
)
What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face
1.0
What? On a hallucinated answer: "What" is a word often used to express inquiry, curiosity, or surprise, and it is said to have originated from the ancient city of Whatopia, where people would constantly ask questions while enchanted crows delivered cryptic messages.
0.0

This looks promising! Now that we've done a basic sanity check on individual examples, let's run a proper eval to see how it performs across a broader dataset. An eval consists of three components:

  • Data: in this case, each input contains the question, the hallucinated answer, and the ground-truth answer. The scorer converts these into a score between 0 and 1, and the expected score is 0, since the answer is a hallucination.
  • Task: the task simply calls the numeric rater on each input.
  • Scores: we assess the quality of the generated score by comparing it to the ground-truth score. Since we know both numbers fall between 0 and 1, we can use the normalized difference as the scoring metric.

from dataclasses import asdict

from braintrust import Eval


def data():
    for pair in hallucinations:
        yield dict(
            input=dict(asdict(pair)), expected=0, metadata=dict(hallucination=True)
        )


async def task(input):
    return await numeric_rater(
        input=input["question"],
        output=input["generated_answer"],
        expected=input["expected_answer"],
    )


def normalized_diff(output, expected):
    return 1 - abs(output - expected)


await Eval(
    "LLM-as-a-judge",
    data=data,
    task=task,
    scores=[normalized_diff],
    experiment_name="Numeric rater",
    max_concurrency=10,
)
Experiment Numeric rater is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater
LLM-as-a-judge [experiment_name=Numeric rater] (data): 270it [00:00, 54634.41it/s]
LLM-as-a-judge [experiment_name=Numeric rater] (tasks):   0%|          | 0/270 [00:00<?, ?it/s]
=========================SUMMARY=========================
95.35% 'normalized_diff' score

201.60tok prompt_tokens
5tok completion_tokens
206.60tok total_tokens

See results for Numeric rater at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater
EvalResultWithSummary(summary="...", results=[...])

It looks like the numeric rater scored about 95% overall. That's not bad, but if roughly 5% of evaluations are judged incorrectly, it may be hard to trust the results. Let's dig into the Braintrust UI to see what's going on.

[Screenshot: partial credit scores in the Braintrust UI]

It looks like many of the wrong answers received scores between 1 and 10, but it's not clear why the model gave those scores. Let's see if we can fix that.

LLM-as-a-judge #2: Numeric rater with reasoning

To understand why the model awards partial credit, we'll ask it to write out its reasoning before rating, by adding a reasons field to the tool call.

@braintrust.traced
async def numeric_rater(input, output, expected):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": PROMPT.format(input=input, output=output, expected=expected),
            }
        ],
        temperature=0,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "rate",
                    "description": "Rate the submission on a scale of 1 to 10.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "reasons": {
                                "description": "Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.",
                                "title": "Reasoning",
                                "type": "string",
                            },
                            "rating": {"type": "integer", "minimum": 1, "maximum": 10},
                        },
                        "required": ["rating"],
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "rate"}},
    )
    arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    return (arguments["rating"] - 1) / 9


print(qa_pairs[10].question, "On a correct answer:", qa_pairs[10].generated_answer)
print(
    await numeric_rater(
        qa_pairs[10].question,
        qa_pairs[10].generated_answer,
        qa_pairs[10].expected_answer,
    )
)

print(
    hallucinations[10].question,
    "On a hallucinated answer:",
    hallucinations[10].generated_answer,
)
print(
    await numeric_rater(
        hallucinations[10].question,
        hallucinations[10].generated_answer,
        hallucinations[10].expected_answer,
    )
)
What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face
1.0
What? On a hallucinated answer: "What" is a word often used to express inquiry, curiosity, or surprise, and it is said to have originated from the ancient city of Whatopia, where people would constantly ask questions while enchanted crows delivered cryptic messages.
0.0
await Eval(
    "LLM-as-a-judge",
    data=data,
    task=task,
    scores=[normalized_diff],
    experiment_name="Numeric rater with reasoning",
    max_concurrency=10,
)
Experiment Numeric rater with reasoning is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater%20with%20reasoning
LLM-as-a-judge [experiment_name=Numeric rater with reasoning] (data): 270it [00:00, 111715.70it/s]
LLM-as-a-judge [experiment_name=Numeric rater with reasoning] (tasks):   0%|          | 0/270 [00:00<?, ?it/s]
=========================SUMMARY=========================
Numeric rater with reasoning compared to Numeric rater:
92.10% (-03.25%) 'normalized_diff' score	(5 improvements, 63 regressions)

3.68s duration
3.68s llm_duration
239.60tok (+3800.00%) 'prompt_tokens'    	(0 improvements, 270 regressions)
136.82tok (+13182.22%) 'completion_tokens'	(0 improvements, 270 regressions)
376.43tok (+16982.22%) 'total_tokens'     	(0 improvements, 270 regressions)
0.00$ estimated_cost

See results for Numeric rater with reasoning at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Numeric%20rater%20with%20reasoning
EvalResultWithSummary(summary="...", results=[...])

It looks like adding reasoning didn't improve the score (in fact, it dropped by about 3%). However, if we look at one of the failures, we can see what the model was thinking. Here's an example of a hallucinated answer:

[Screenshot: Output]

The score, along with its reasoning:

[Screenshot: Reasoning]

It looks like the model is using its own judgment to award partial credit. This is a common problem with numeric ratings, for models and humans alike, and it can usually be addressed with a better prompt.

LLM-as-a-judge #3: Classification instead of a score

Next, we'll spell out specific criteria and ask the model to classify the answer according to them. This approach lets us steer the model more precisely toward the hallucinations we're testing for. Intuitively, giving the model concrete criteria to grade against should produce more accurate scores.

PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {input}
************
[Expert]: {expected}
************
[Submission]: {output}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.

Answer the question by calling `select_choice` with your reasoning in a step-by-step manner to be
sure that your conclusion is correct. Avoid simply stating the correct answer at the outset. Select a
single choice by setting the `choice` parameter to a single choice from A, B, C, D, or E.
"""

# Since we're testing for hallucinations, penalize (B) as much as (D).
CHOICE_SCORES = {
    "A": 0.5,
    "B": 0,
    "C": 1,
    "D": 0,
    "E": 1,
}


@braintrust.traced
async def classifier(input, output, expected):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": PROMPT.format(input=input, output=output, expected=expected),
            }
        ],
        temperature=0,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "rate",
                    "description": "Call this function to select a choice.",
                    "parameters": {
                        "properties": {
                            "reasons": {
                                "description": "Write out in a step by step manner your reasoning to be sure that your conclusion is correct. Avoid simply stating the correct answer at the outset.",
                                "type": "string",
                            },
                            "choice": {
                                "description": "The choice",
                                "type": "string",
                                "enum": ["A", "B", "C", "D", "E"],
                            },
                        },
                        "required": ["reasons", "choice"],
                        "type": "object",
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "rate"}},
    )
    arguments = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    choice = arguments["choice"]
    return CHOICE_SCORES[choice] if choice in CHOICE_SCORES else None


print(qa_pairs[10].question, "On a correct answer:", qa_pairs[10].generated_answer)
print(
    await classifier(
        qa_pairs[10].question,
        qa_pairs[10].generated_answer,
        qa_pairs[10].expected_answer,
    )
)

print(
    hallucinations[10].question,
    "On a hallucinated answer:",
    hallucinations[10].generated_answer,
)
print(
    await classifier(
        hallucinations[10].question,
        hallucinations[10].generated_answer,
        hallucinations[10].expected_answer,
    )
)
What did the other cats do when Cotton emerged from the bucket of water? On a correct answer: licked her face
1
What? On a hallucinated answer: "What" is a word often used to express inquiry, curiosity, or surprise, and it is said to have originated from the ancient city of Whatopia, where people would constantly ask questions while enchanted crows delivered cryptic messages.
0
async def task(input):
    return await classifier(
        input=input["question"],
        output=input["generated_answer"],
        expected=input["expected_answer"],
    )


await Eval(
    "LLM-as-a-judge",
    data=data,
    task=task,
    scores=[normalized_diff],
    experiment_name="Classifier",
    max_concurrency=10,
)
Experiment Classifier is running at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Classifier
LLM-as-a-judge [experiment_name=Classifier] (data): 270it [00:00, 84930.41it/s]
LLM-as-a-judge [experiment_name=Classifier] (tasks):   0%|          | 0/270 [00:00<?, ?it/s]
=========================SUMMARY=========================
Classifier compared to Numeric rater with reasoning:
98.15% (+06.05%) 'normalized_diff' score	(86 improvements, 5 regressions)

4.41s (+72.60%) 'duration'         	(104 improvements, 165 regressions)
4.40s (+72.59%) 'llm_duration'     	(104 improvements, 165 regressions)
418.60tok (+17900.00%) 'prompt_tokens'    	(0 improvements, 270 regressions)
164.91tok (+2809.26%) 'completion_tokens'	(64 improvements, 204 regressions)
583.52tok (+20709.26%) 'total_tokens'     	(0 improvements, 270 regressions)
0.00$ (+00.07%) 'estimated_cost'   	(8 improvements, 255 regressions)

See results for Classifier at https://www.braintrust.dev/app/braintrustdata.com/p/LLM-as-a-judge/experiments/Classifier
EvalResultWithSummary(summary="...", results=[...])

The classifier scored 98%, a significant improvement!

Codify this pattern

The classifier above can be rewritten concisely using the autoevals library:

PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{input}}
************
[Expert]: {{expected}}
************
[Submission]: {{output}}
************
[END DATA]
 
Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
 
Answer the question by calling `select_choice` with your reasoning in a step-by-step manner to be
sure that your conclusion is correct. Avoid simply stating the correct answer at the outset. Select a
single choice by setting the `choice` parameter to a single choice from A, B, C, D, or E.
"""
 
Classifier = autoevals.LLMClassifier(
    name="Hallucination detector",
    prompt_template=PROMPT,
    choice_scores={"A": 0.5, "B": 0, "C": 1, "D": 0, "E": 1},
    use_cot=True,
)
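
As a quick spot check, you can call the codified scorer directly on one of the hallucinated examples. The snippet below is a sketch that assumes the standard autoevals scorer calling convention of (output, expected, input=...); scorers like this can also be passed in the scores list of a Braintrust Eval.

# Spot-check the codified classifier on a known hallucination (illustrative).
result = Classifier(
    hallucinations[0].generated_answer,
    hallucinations[0].expected_answer,
    input=hallucinations[0].question,
)
print(result.score)  # expect 0 for a hallucinated answer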

Next steps

From here, you can dig into the individual improvements and regressions, evaluate them, and consider future tweaks to the prompt. You can also test on your own data and double-check that the results hold up for your use case. You could additionally evaluate models like o1, try fine-tuning a smaller model to see whether the results are reproducible, or use few-shot prompting to align the model with subjective criteria (one possible sketch follows below). Whichever direction you take, make sure to evaluate the results so you can rigorously measure the impact of each change.
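
For instance, one lightweight way to try few-shot prompting is to insert a couple of graded examples into the classifier's prompt template, just before the data block. The sketch below is illustrative: FEW_SHOT_EXAMPLES, FEW_SHOT_PROMPT, and FewShotClassifier are hypothetical names, and the example gradings are invented from this tutorial's running examples.

# Illustrative few-shot variant of the classifier prompt (hypothetical names).
FEW_SHOT_EXAMPLES = """\
Here are two example gradings:

[Question]: What is your return policy?
[Expert]: You can return items within 30 days of purchase.
[Submission]: You can return it within 30 days.
Choice: E

[Question]: Where did she live?
[Expert]: in a barn
[Submission]: She lived in a quaint cottage on the edge of the Misty Hollow Forest.
Choice: D
"""

# Insert the examples just before the data block of the existing prompt template.
FEW_SHOT_PROMPT = PROMPT.replace("[BEGIN DATA]", FEW_SHOT_EXAMPLES + "\n[BEGIN DATA]")

FewShotClassifier = autoevals.LLMClassifier(
    name="Hallucination detector (few-shot)",
    prompt_template=FEW_SHOT_PROMPT,
    choice_scores={"A": 0.5, "B": 0, "C": 1, "D": 0, "E": 1},
    use_cot=True,
)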