Dec 20, 2023

Using logprobs

This notebook demonstrates the use of the logprobs parameter in the Chat Completions API. When logprobs is enabled, the API returns the log probabilities of each output token, along with a limited number of the most likely tokens at each token position and their log probabilities. The relevant request parameters are:

  • logprobs: Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message.
  • top_logprobs: An integer between 0 and 5 specifying the number of most likely tokens to return at each token position, each with an associated log probability. logprobs must be set to true if this parameter is used.

Logprobs of output tokens indicate the likelihood of each token occurring in the sequence given the context. To simplify, a logprob is log(p), where p = the probability of a token occurring at a specific position based on the previous tokens in the context. Some key points about logprobs:

  • Higher log probabilities suggest a higher likelihood of the token in that context. This allows users to gauge the model's confidence in its output or explore alternative responses the model considered.
  • A logprob can be any negative number or 0.0, with 0.0 corresponding to 100% probability.
  • Logprobs allow us to compute the joint probability of a sequence as the sum of the logprobs of the individual tokens. This is useful for scoring and ranking model outputs. Another common approach is to take the average per-token logprob of a sentence to choose the best generation; a short sketch of these calculations follows this list.
  • We can examine the logprobs assigned to different candidate tokens to understand what options the model considered plausible or implausible.
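
To make these points concrete, here is a minimal sketch (with made-up logprob values, not from the API) of the conversions described above:

# A minimal sketch showing how logprobs convert to linear probabilities
# and aggregate over a sequence. The values below are hypothetical.
import numpy as np

token_logprobs = [-0.01, -0.25, -1.20]  # hypothetical per-token logprobs

linear_probs = np.exp(token_logprobs)  # each in (0, 1]; a 0.0 logprob is 100%
joint_logprob = np.sum(token_logprobs)  # log of the sequence's joint probability
joint_prob = np.exp(joint_logprob)  # equals the product of per-token probabilities
mean_logprob = np.mean(token_logprobs)  # common score for ranking candidate outputs

print([f"{p:.1%}" for p in linear_probs])  # ['99.0%', '77.9%', '30.1%']
print(f"joint probability: {joint_prob:.1%}")  # 23.2%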

While there are a wide array of use cases for logprobs, this notebook will focus on its use for:

  1. Classification tasks
  • Large Language Models excel at many classification tasks, but accurately measuring the model's confidence in its outputs can be challenging. logprobs provide a probability associated with each class prediction, enabling users to set their own classification or confidence thresholds.
  2. Retrieval (Q&A) evaluation
  • logprobs can assist with self-evaluation in retrieval applications. In the Q&A example, the model outputs a contrived has_sufficient_context_for_answer boolean, which can serve as a confidence score of whether the answer is contained in the retrieved content. Evaluations of this type can reduce retrieval-based hallucinations and enhance accuracy.
  3. Autocomplete
  • logprobs could help us decide how to suggest words as a user is typing.
  4. Token highlighting and outputting bytes
  • Users can easily create a token highlighter using the built-in tokenization that comes with enabling logprobs. Additionally, the bytes parameter includes the ASCII encoding of each output character, which is particularly useful for reproducing emojis and special characters.
  5. Calculating perplexity
  • logprobs can be used to help us assess the model's overall confidence in a result and compare the confidence of results from different prompts.
from openai import OpenAI
from math import exp
import numpy as np
from IPython.display import display, HTML
import os

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))
def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-4",
    max_tokens=500,
    temperature=0,
    stop=None,
    seed=123,
    tools=None,
    logprobs=None,  # whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the content of message.
    top_logprobs=None,
) -> str:
    params = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": stop,
        "seed": seed,
        "logprobs": logprobs,
        "top_logprobs": top_logprobs,
    }
    if tools:
        params["tools"] = tools

    completion = client.chat.completions.create(**params)
    return completion

Let's say we want to create a system to classify news articles into a set of pre-defined categories. Without logprobs, we can use Chat Completions to do this, but it is much more difficult to assess how certain the model is in its classifications.

Now, with logprobs enabled, we can see exactly how confident the model is in its predictions, which is crucial for creating an accurate and trustworthy classifier. For example, if the log probability for the chosen category is high, this suggests the model is quite confident in its classification. If it's low, this suggests the model is less confident. This can be particularly useful in cases where the model's classification is not what you expected, or when the model's output needs to be reviewed or validated by a human.

We'll begin with a prompt that presents the model with four categories: Technology, Politics, Sports, and Art. The model is then tasked with classifying articles into those categories based solely on their headlines.

CLASSIFICATION_PROMPT = """You will be given a headline of a news article.
Classify the article into one of the following categories: Technology, Politics, Sports, and Art.
Return only the name of the category, and nothing else.
MAKE SURE your output is one of the four categories stated.
Article headline: {headline}"""

Let's look at three sample headlines, beginning with a standard Chat Completions output, without logprobs.

headlines = [
    "Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.",
    "Local Mayor Launches Initiative to Enhance Urban Public Transport.",
    "Tennis Champion Showcases Hidden Talents in Symphony Orchestra Debut",
]
for headline in headlines:
    print(f"\nHeadline: {headline}")
    API_RESPONSE = get_completion(
        [{"role": "user", "content": CLASSIFICATION_PROMPT.format(headline=headline)}],
        model="gpt-4o",
    )
    print(f"Category: {API_RESPONSE.choices[0].message.content}\n")
Headline: Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.
Category: Technology


Headline: Local Mayor Launches Initiative to Enhance Urban Public Transport.
Category: Politics


Headline: Tennis Champion Showcases Hidden Talents in Symphony Orchestra Debut
Category: Art

Here we can see the selected category for each headline. However, we have no visibility into how confident the model is in its predictions. Let's rerun the same prompt but with logprobs enabled, and top_logprobs set to 2 (this will show us the 2 most likely output tokens at each token position). Additionally, we can output the linear probability of each output token, in order to convert the log probability to the more easily interpretable 0-100% scale.

for headline in headlines:
    print(f"\nHeadline: {headline}")
    API_RESPONSE = get_completion(
        [{"role": "user", "content": CLASSIFICATION_PROMPT.format(headline=headline)}],
        model="gpt-4o-mini",
        logprobs=True,
        top_logprobs=2,
    )
    top_two_logprobs = API_RESPONSE.choices[0].logprobs.content[0].top_logprobs
    html_content = ""
    for i, logprob in enumerate(top_two_logprobs, start=1):
        html_content += (
            f"<span style='color: cyan'>Output token {i}:</span> {logprob.token}, "
            f"<span style='color: darkorange'>logprobs:</span> {logprob.logprob}, "
            f"<span style='color: magenta'>linear probability:</span> {np.round(np.exp(logprob.logprob)*100,2)}%<br>"
        )
    display(HTML(html_content))
    print("\n")
Headline: Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.
Output token 1: Technology, logprobs: 0.0, linear probability: 100.0%
Output token 2: Technology, logprobs: -18.75, linear probability: 0.0%


Headline: Local Mayor Launches Initiative to Enhance Urban Public Transport.
Output token 1: Politics, logprobs: -3.1281633e-07, linear probability: 100.0%
Output token 2: Polit, logprobs: -16.0, linear probability: 0.0%


Headline: Tennis Champion Showcases Hidden Talents in Symphony Orchestra Debut
Output token 1: Art, logprobs: -0.028133942, linear probability: 97.23%
Output token 2: Sports, logprobs: -4.278134, linear probability: 1.39%

As the first two headlines show, gpt-4o-mini is 100% confident in its classifications, as the content is clearly technology- and politics-focused respectively. The third headline, however, combines both sports- and art-related topics, so confidence dips slightly to 97%, while still demonstrating strong certainty in the classification.

logprobs are very useful for classification tasks. They allow us to set confidence thresholds, or to output several likely tokens when the log probability of the selected output is not sufficiently high. For instance, when building a recommendation engine to tag articles, we could automatically classify headlines crossing a certain threshold, and send the less certain ones to human review, as sketched below.
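
As a sketch of that routing idea (the route_classification helper and the 0.95 threshold are illustrative, not part of the original workflow):

# Illustrative: auto-accept a classification only when the top token clears
# a confidence threshold; otherwise flag it for manual review.
def route_classification(api_response, threshold=0.95):
    top_token = api_response.choices[0].logprobs.content[0]
    confidence = np.exp(top_token.logprob)  # linear probability of the chosen category
    destination = "auto" if confidence >= threshold else "human_review"
    return destination, top_token.token, confidence

For the first headline above, route_classification(API_RESPONSE) would return ("auto", "Technology", 1.0), while borderline headlines would be routed to human review.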

To reduce hallucinations and improve the performance of our RAG-based Q&A system, we can use logprobs to evaluate how confident the model is in its retrieval.

Let's say we have built a retrieval system using RAG for Q&A, but are struggling with hallucinated answers to our questions. Note: we will use a hardcoded article for this example; see other entries in the cookbook for tutorials on using RAG for Q&A.

# Article retrieved
ada_lovelace_article = """Augusta Ada King, Countess of Lovelace (née Byron; 10 December 1815 – 27 November 1852) was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognise that the machine had applications beyond pure calculation.
Ada Byron was the only legitimate child of poet Lord Byron and reformer Lady Byron. All Lovelace's half-siblings, Lord Byron's other children, were born out of wedlock to other women. Byron separated from his wife a month after Ada was born and left England forever. He died in Greece when Ada was eight. Her mother was anxious about her upbringing and promoted Ada's interest in mathematics and logic in an effort to prevent her from developing her father's perceived insanity. Despite this, Ada remained interested in him, naming her two sons Byron and Gordon. Upon her death, she was buried next to him at her request. Although often ill in her childhood, Ada pursued her studies assiduously. She married William King in 1835. King was made Earl of Lovelace in 1838, Ada thereby becoming Countess of Lovelace.
Her educational and social exploits brought her into contact with scientists such as Andrew Crosse, Charles Babbage, Sir David Brewster, Charles Wheatstone, Michael Faraday, and the author Charles Dickens, contacts which she used to further her education. Ada described her approach as "poetical science" and herself as an "Analyst (& Metaphysician)".
When she was eighteen, her mathematical talents led her to a long working relationship and friendship with fellow British mathematician Charles Babbage, who is known as "the father of computers". She was in particular interested in Babbage's work on the Analytical Engine. Lovelace first met him in June 1833, through their mutual friend, and her private tutor, Mary Somerville.
Between 1842 and 1843, Ada translated an article by the military engineer Luigi Menabrea (later Prime Minister of Italy) about the Analytical Engine, supplementing it with an elaborate set of seven notes, simply called "Notes".
Lovelace's notes are important in the early history of computers, especially since the seventh one contained what many consider to be the first computer program—that is, an algorithm designed to be carried out by a machine. Other historians reject this perspective and point out that Babbage's personal notes from the years 1836/1837 contain the first programs for the engine. She also developed a vision of the capability of computers to go beyond mere calculating or number-crunching, while many others, including Babbage himself, focused only on those capabilities. Her mindset of "poetical science" led her to ask questions about the Analytical Engine (as shown in her notes) examining how individuals and society relate to technology as a collaborative tool.
"""

# Questions that can be easily answered given the article
easy_questions = [
    "What nationality was Ada Lovelace?",
    "What was an important finding from Lovelace's seventh note?",
]

# Questions that are not fully covered in the article
medium_questions = [
    "Did Lovelace collaborate with Charles Dickens",
    "What concepts did Lovelace build with Charles Babbage",
]

Now, we can ask the model to answer the question, but also to evaluate its own response. Specifically, we will ask the model to output a boolean has_sufficient_context_for_answer. We can then evaluate the logprobs to see just how confident the model is that its answer is contained in the provided context.

PROMPT = """You retrieved this article: {article}. The question is: {question}.
Before even answering the question, consider whether you have sufficient information in the article to answer the question fully.
Your output should JUST be the boolean true or false, of if you have sufficient information in the article to answer the question.
Respond with just one word, the boolean true or false. You must output the word 'True', or the word 'False', nothing else.
"""
html_output = ""
html_output += "Questions clearly answered in article"

for question in easy_questions:
    API_RESPONSE = get_completion(
        [
            {
                "role": "user",
                "content": PROMPT.format(
                    article=ada_lovelace_article, question=question
                ),
            }
        ],
        model="gpt-4o-mini",
        logprobs=True,
    )
    html_output += f'<p style="color:green">Question: {question}</p>'
    for logprob in API_RESPONSE.choices[0].logprobs.content:
        html_output += f'<p style="color:cyan">has_sufficient_context_for_answer: {logprob.token}, <span style="color:darkorange">logprobs: {logprob.logprob}, <span style="color:magenta">linear probability: {np.round(np.exp(logprob.logprob)*100,2)}%</span></p>'

html_output += "Questions only partially covered in the article"

for question in medium_questions:
    API_RESPONSE = get_completion(
        [
            {
                "role": "user",
                "content": PROMPT.format(
                    article=ada_lovelace_article, question=question
                ),
            }
        ],
        model="gpt-4o",
        logprobs=True,
        top_logprobs=3,
    )
    html_output += f'<p style="color:green">Question: {question}</p>'
    for logprob in API_RESPONSE.choices[0].logprobs.content:
        html_output += f'<p style="color:cyan">has_sufficient_context_for_answer: {logprob.token}, <span style="color:darkorange">logprobs: {logprob.logprob}, <span style="color:magenta">linear probability: {np.round(np.exp(logprob.logprob)*100,2)}%</span></p>'

display(HTML(html_output))
Questions clearly answered in article

Question: What nationality was Ada Lovelace?

has_sufficient_context_for_answer: True, logprobs: -3.1281633e-07, linear probability: 100.0%

Question: What was an important finding from Lovelace's seventh note?

has_sufficient_context_for_answer: True, logprobs: -7.89631e-07, linear probability: 100.0%

Questions only partially covered in the article

Question: Did Lovelace collaborate with Charles Dickens

has_sufficient_context_for_answer: False, logprobs: -0.008654992, linear probability: 99.14%

Question: What concepts did Lovelace build with Charles Babbage

has_sufficient_context_for_answer: True, logprobs: -0.004082317, linear probability: 99.59%

For the first two questions, our model asserts with (near) 100% confidence that the article has sufficient context to answer the posed questions.

On the other hand, for the trickier questions which are less clearly answered in the article, the model is less confident that it has sufficient context. This is a great guardrail to help ensure our retrieved content is sufficient.

This self-evaluation can help reduce hallucinations, as you can restrict answers or re-prompt the user when the log probability for sufficient_context_for_answer is below a certain threshold. Methods like this have been shown to significantly reduce RAG hallucinations and errors in Q&A systems (example).
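
A minimal sketch of such a gate (the should_answer helper and the 0.98 threshold are illustrative):

# Illustrative guardrail: only proceed to answer when the model outputs True
# with high confidence; otherwise fall back to re-retrieval or a clarifying prompt.
def should_answer(api_response, threshold=0.98):
    token = api_response.choices[0].logprobs.content[0]
    linear_prob = np.exp(token.logprob)
    return token.token == "True" and linear_prob >= threshold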

Another use case for logprobs is autocomplete systems. Without building an entire autocomplete system end-to-end, let's demonstrate how logprobs could help us decide which words to suggest as a user is typing.

First, let's come up with a sample sentence: "My least favorite TV show is Breaking Bad." Let's say we want the next word or token to be dynamically recommended as we type the sentence, but only when the model is quite certain of what the next word will be. To demonstrate this, let's break the sentence into sequential components.

sentence_list = [
    "My",
    "My least",
    "My least favorite",
    "My least favorite TV",
    "My least favorite TV show",
    "My least favorite TV show is",
    "My least favorite TV show is Breaking Bad",
]

Now, we can ask gpt-4o-mini to act as an autocomplete engine given whatever context the model is provided. We can enable logprobs to see how confident the model is in its predictions.

high_prob_completions = {}
low_prob_completions = {}
html_output = ""

for sentence in sentence_list:
    PROMPT = """Complete this sentence. You are acting as auto-complete. Simply complete the sentence to the best of your ability, make sure it is just ONE sentence: {sentence}"""
    API_RESPONSE = get_completion(
        [{"role": "user", "content": PROMPT.format(sentence=sentence)}],
        model="gpt-4o-mini",
        logprobs=True,
        top_logprobs=3,
    )
    html_output += f'<p>Sentence: {sentence}</p>'
    first_token = True
    for token in API_RESPONSE.choices[0].logprobs.content[0].top_logprobs:
        html_output += f'<p style="color:cyan">Predicted next token: {token.token}, <span style="color:darkorange">logprobs: {token.logprob}, <span style="color:magenta">linear probability: {np.round(np.exp(token.logprob)*100,2)}%</span></p>'
        if first_token:
            if np.exp(token.logprob) > 0.95:
                high_prob_completions[sentence] = token.token
            if np.exp(token.logprob) < 0.60:
                low_prob_completions[sentence] = token.token
        first_token = False
    html_output += "<br>"

display(HTML(html_output))

Sentence: My

Predicted next token: My, logprobs: -0.08344023, linear probability: 91.99%

Predicted next token: dog, logprobs: -3.3334403, linear probability: 3.57%

Predicted next token: ap, logprobs: -3.5834403, linear probability: 2.78%


Sentence: My least

Predicted next token: My, logprobs: -0.1271426, linear probability: 88.06%

Predicted next token: favorite, logprobs: -2.1271427, linear probability: 11.92%

Predicted next token: My, logprobs: -9.127143, linear probability: 0.01%


Sentence: My least favorite

Predicted next token: My, logprobs: -0.052905332, linear probability: 94.85%

Predicted next token: food, logprobs: -4.0529056, linear probability: 1.74%

Predicted next token: color, logprobs: -5.0529056, linear probability: 0.64%


Sentence: My least favorite TV

Predicted next token: show, logprobs: -0.57662326, linear probability: 56.18%

Predicted next token: My, logprobs: -0.82662326, linear probability: 43.75%

Predicted next token: show, logprobs: -8.201623, linear probability: 0.03%


Sentence: My least favorite TV show

Predicted next token: is, logprobs: -0.70817715, linear probability: 49.25%

Predicted next token: My, logprobs: -0.70817715, linear probability: 49.25%

Predicted next token: was, logprobs: -4.833177, linear probability: 0.8%


Sentence: My least favorite TV show is

Predicted next token: My, logprobs: -0.47896808, linear probability: 61.94%

Predicted next token: one, logprobs: -1.7289681, linear probability: 17.75%

Predicted next token: the, logprobs: -2.9789681, linear probability: 5.08%


Sentence: My least favorite TV show is Breaking Bad

Predicted next token: because, logprobs: -0.034502674, linear probability: 96.61%

Predicted next token: ,, logprobs: -3.7845027, linear probability: 2.27%

Predicted next token: because, logprobs: -5.0345025, linear probability: 0.65%


Let's look at the high-confidence autocompletion suggestions:

high_prob_completions
{'My least favorite TV show is Breaking Bad': 'because'}

This looks reasonable! We can feel confident in this suggestion: after typing 'My least favorite TV show is Breaking Bad', the model is quite sure that 'because' comes next. Now let's look at the autocompletion suggestions the model was less sure about:

low_prob_completions
{'My least favorite TV': 'show', 'My least favorite TV show': 'is'}

These are logical as well. With just the prefix 'My least favorite', it's hard to know what the user will say next, and it's really anyone's guess what the author's least favorite TV show is.

So, using gpt-4o-mini, we have built the foundation of a dynamic autocomplete engine with logprobs!
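
Pulling that logic into a helper, a minimal sketch of such an engine might look like this (the suggest_next_word function and the 0.95 cutoff are illustrative):

# Illustrative: return the model's top next token only when its linear
# probability clears the cutoff; otherwise suggest nothing.
AUTOCOMPLETE_PROMPT = """Complete this sentence. You are acting as auto-complete. Simply complete the sentence to the best of your ability, make sure it is just ONE sentence: {sentence}"""

def suggest_next_word(prefix, cutoff=0.95):
    response = get_completion(
        [{"role": "user", "content": AUTOCOMPLETE_PROMPT.format(sentence=prefix)}],
        model="gpt-4o-mini",
        logprobs=True,
    )
    top_token = response.choices[0].logprobs.content[0]
    return top_token.token if np.exp(top_token.logprob) > cutoff else None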

Let's quickly touch on creating a simple token highlighter with logprobs and the bytes parameter. First, we can create a function that counts and highlights each token. While this doesn't use the log probabilities themselves, it uses the built-in tokenization that comes with enabling logprobs.

PROMPT = """What's the longest word in the English language?"""

API_RESPONSE = get_completion(
    [{"role": "user", "content": PROMPT}], model="gpt-4o", logprobs=True, top_logprobs=5
)


def highlight_text(api_response):
    colors = [
        "#FF00FF",  # Magenta
        "#008000",  # Green
        "#FF8C00",  # Dark Orange
        "#FF0000",  # Red
        "#0000FF",  # Blue
    ]
    tokens = api_response.choices[0].logprobs.content

    color_idx = 0  # Initialize color index
    html_output = ""  # Initialize HTML output
    for t in tokens:
        token_str = bytes(t.bytes).decode("utf-8")  # Decode bytes to string

        # Add colored token to HTML output
        html_output += f"<span style='color: {colors[color_idx]}'>{token_str}</span>"

        # Move to the next color
        color_idx = (color_idx + 1) % len(colors)
    display(HTML(html_output))  # Display HTML output
    print(f"Total number of tokens: {len(tokens)}")
highlight_text(API_RESPONSE)
The longest word in the English language is often considered to be "pneumonoultramicroscopicsilicovolcanoconiosis," a term referring to a type of lung disease caused by inhaling very fine silicate or quartz dust. However, it is worth noting that this word was coined more for its length than for practical use. There are also chemical names for proteins and other compounds that can be much longer, but they are not typically used in everyday language.
Total number of tokens: 95

Next, let's reconstruct a sentence using the bytes parameter. With logprobs enabled, we are given both each token and the ASCII (decimal utf-8) values of the token string. These ASCII values can be helpful when handling tokens containing emojis or special characters.

PROMPT = """Output the blue heart emoji and its name."""
API_RESPONSE = get_completion(
    [{"role": "user", "content": PROMPT}], model="gpt-4o", logprobs=True
)

aggregated_bytes = []
joint_logprob = 0.0

# Iterate over tokens, aggregate bytes and calculate joint logprob
for token in API_RESPONSE.choices[0].logprobs.content:
    print("Token:", token.token)
    print("Log prob:", token.logprob)
    print("Linear prob:", np.round(exp(token.logprob) * 100, 2), "%")
    print("Bytes:", token.bytes, "\n")
    aggregated_bytes += token.bytes
    joint_logprob += token.logprob

# Decode the aggregated bytes to text
aggregated_text = bytes(aggregated_bytes).decode("utf-8")

# Assert that the decoded text is the same as the message content
assert API_RESPONSE.choices[0].message.content == aggregated_text

# Print the results
print("Bytes array:", aggregated_bytes)
print(f"Decoded bytes: {aggregated_text}")
print("Joint prob:", np.round(exp(joint_logprob) * 100, 2), "%")
Token: Here
Log prob: -0.054242473
Linear prob: 94.72 %
Bytes: [72, 101, 114, 101] 

Token:  is
Log prob: -0.0044352207
Linear prob: 99.56 %
Bytes: [32, 105, 115] 

Token:  the
Log prob: -2.1008714e-06
Linear prob: 100.0 %
Bytes: [32, 116, 104, 101] 

Token:  blue
Log prob: -0.0013290489
Linear prob: 99.87 %
Bytes: [32, 98, 108, 117, 101] 

Token:  heart
Log prob: 0.0
Linear prob: 100.0 %
Bytes: [32, 104, 101, 97, 114, 116] 

Token:  emoji
Log prob: 0.0
Linear prob: 100.0 %
Bytes: [32, 101, 109, 111, 106, 105] 

Token:  and
Log prob: -0.038287632
Linear prob: 96.24 %
Bytes: [32, 97, 110, 100] 

Token:  its
Log prob: 0.0
Linear prob: 100.0 %
Bytes: [32, 105, 116, 115] 

Token:  name
Log prob: -1.569009e-05
Linear prob: 100.0 %
Bytes: [32, 110, 97, 109, 101] 

Token: :


Log prob: -0.11313002
Linear prob: 89.3 %
Bytes: [58, 10, 10] 

Token: \xf0\x9f\x92
Log prob: -0.09048584
Linear prob: 91.35 %
Bytes: [240, 159, 146] 

Token: \x99
Log prob: 0.0
Linear prob: 100.0 %
Bytes: [153] 

Token:  Blue
Log prob: -0.023958502
Linear prob: 97.63 %
Bytes: [32, 66, 108, 117, 101] 

Token:  Heart
Log prob: -6.2729996e-06
Linear prob: 100.0 %
Bytes: [32, 72, 101, 97, 114, 116] 

Bytes array: [72, 101, 114, 101, 32, 105, 115, 32, 116, 104, 101, 32, 98, 108, 117, 101, 32, 104, 101, 97, 114, 116, 32, 101, 109, 111, 106, 105, 32, 97, 110, 100, 32, 105, 116, 115, 32, 110, 97, 109, 101, 58, 10, 10, 240, 159, 146, 153, 32, 66, 108, 117, 101, 32, 72, 101, 97, 114, 116]
Decoded bytes: Here is the blue heart emoji and its name:

💙 Blue Heart
Joint prob: 72.19 %

Here, we see that while the first token was \xf0\x9f\x92, we can get its ASCII values and append them to a bytes array. Then, we can easily decode this array into a full sentence, and validate with our assert statement that the decoded bytes match our completion message!

Additionally, we can get the joint probability of the entire completion, which is the exponentiated sum of the log probabilities of each token. This tells us how likely this completion is given the prompt. Since our prompt is quite directive (asking for a certain emoji and its name), the joint probability of this output is high! If we asked for a random output, however, we would see a much lower joint probability. This can also be a good tactic for developers during prompt engineering.
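
As a quick sketch of that tactic (the joint_probability helper and the second prompt are illustrative):

# Illustrative: compare how likely a completion is under a directive prompt
# versus an open-ended one, using the joint probability of all output tokens.
def joint_probability(api_response):
    logprobs = [t.logprob for t in api_response.choices[0].logprobs.content]
    return np.exp(np.sum(logprobs))

directive = get_completion(
    [{"role": "user", "content": "Output the blue heart emoji and its name."}],
    model="gpt-4o", logprobs=True,
)
open_ended = get_completion(
    [{"role": "user", "content": "Output a random emoji and its name."}],
    model="gpt-4o", logprobs=True,
)
print(joint_probability(directive))  # relatively high, as seen above (~0.72)
print(joint_probability(open_ended))  # typically much lower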

5. Calculating perplexity

When looking to assess the model's confidence in a result, it can be useful to calculate perplexity, a measure of uncertainty. Perplexity can be calculated by exponentiating the negative of the average of the logprobs. Generally, a higher perplexity indicates a more uncertain result, and a lower perplexity indicates a more confident result. As such, perplexity can be used both to assess the result of an individual model run and to compare the relative confidence of results between model runs. While a high confidence does not guarantee result accuracy, it can be a helpful signal to pair with other evaluation metrics to better understand the behavior of your prompts.
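
In symbols, for a completion of N tokens with log probabilities log p_1, ..., log p_N:

perplexity = exp(-(1/N) * sum_i log p_i)

which is exactly the np.exp(-np.mean(logprobs)) computed in the code below.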

For example, let's say I want to use gpt-4o-mini to learn more about artificial intelligence. I could ask a question about recent history and a question about the future:

prompts = [
    "In a short sentence, has artifical intelligence grown in the last decade?",
    "In a short sentence, what are your thoughts on the future of artificial intelligence?",
]

for prompt in prompts:
    API_RESPONSE = get_completion(
        [{"role": "user", "content": prompt}],
        model="gpt-4o-mini",
        logprobs=True,
    )

    logprobs = [token.logprob for token in API_RESPONSE.choices[0].logprobs.content]
    response_text = API_RESPONSE.choices[0].message.content
    response_text_tokens = [token.token for token in API_RESPONSE.choices[0].logprobs.content]
    max_starter_length = max(len(s) for s in ["Prompt:", "Response:", "Tokens:", "Logprobs:", "Perplexity:"])
    max_token_length = max(len(s) for s in response_text_tokens)
    

    formatted_response_tokens = [s.rjust(max_token_length) for s in response_text_tokens]
    formatted_lps = [f"{lp:.2f}".rjust(max_token_length) for lp in logprobs]

    perplexity_score = np.exp(-np.mean(logprobs))
    print("Prompt:".ljust(max_starter_length), prompt)
    print("Response:".ljust(max_starter_length), response_text, "\n")
    print("Tokens:".ljust(max_starter_length), " ".join(formatted_response_tokens))
    print("Logprobs:".ljust(max_starter_length), " ".join(formatted_lps))
    print("Perplexity:".ljust(max_starter_length), perplexity_score, "\n")
Prompt:     In a short sentence, has artifical intelligence grown in the last decade?
Response:   Yes, artificial intelligence has grown significantly in the last decade, advancing in capabilities and applications across various fields. 

Tokens:                Yes              ,     artificial   intelligence            has          grown  significantly             in            the           last         decade              ,      advancing             in   capabilities            and   applications         across        various         fields              .
Logprobs:            -0.00           0.00          -0.00           0.00          -0.00          -0.73          -0.00          -0.01          -0.02          -0.00           0.00          -0.02          -0.66          -0.03          -0.62          -0.47          -0.02          -0.39          -0.01          -0.20          -0.00
Perplexity: 1.1644170003987546 

Prompt:     In a short sentence, what are your thoughts on the future of artificial intelligence?
Response:   The future of artificial intelligence holds immense potential for transformative advancements across various sectors, but it also requires careful consideration of ethical and societal impacts. 

Tokens:                 The          future              of      artificial    intelligence           holds         immense       potential             for  transformative    advancements          across         various         sectors               ,             but              it            also        requires         careful   consideration              of         ethical             and        societal         impacts               .
Logprobs:             -0.02           -0.00            0.00           -0.00            0.00           -0.05           -0.35           -0.01           -0.02           -0.64           -0.43           -0.25           -0.16           -0.51           -0.02           -0.43           -0.08           -0.07           -0.97           -0.02           -0.48           -0.00           -0.00           -0.48           -0.01           -0.58           -0.00
Perplexity: 1.2292170270768858 

In this example, gpt-4o-mini returned a lower perplexity score for the more deterministic question about recent history, and a higher perplexity score for the more speculative assessment of the near future. Again, while these differences don't guarantee accuracy, they help point the way for our interpretation of the model's results and our future use of them.

Nice! We were able to use the logprobs parameter to build a more robust classifier, evaluate our retrieval for a Q&A system, and encode and decode each 'byte' of our tokens! logprobs add useful information and signal to our completions output, and we are excited to see how developers incorporate them to improve applications.

There are many other use cases for logprobs that are not covered in this cookbook. We can use logprobs for:

  • Moderation
  • Keyword selection
  • Improving prompts and the interpretability of outputs
  • Token healing
  • and more!