Techniques to improve reliability

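When GPT-3 fails on a task, what should you do?

Search for a better prompt that elicits more reliable answers?
Invest in thousands of examples to fine-tune a custom model?
Assume the model is incapable of the task, and move on?

There is no simple answer; it depends. However, if your task involves logical reasoning or complexity, consider trying the techniques in this article to build more reliable, high-performing prompts.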

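Why GPT-3 fails on complex tasks

If you were asked to multiply 13 by 17, would the answer pop immediately into your mind? For most of us, probably not. Yet that doesn't mean humans are incapable of two-digit multiplication. With a few seconds and some pen and paper, it's not too taxing to work out that 13 x 17 = 130 + 70 + 21 = 221.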
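
Written out one step at a time, the pen-and-paper arithmetic above looks like this (one possible decomposition; others work equally well):

```latex
\begin{align*}
13 \times 17 &= 13 \times 10 + 13 \times 7 \\
             &= 130 + (10 \times 7 + 3 \times 7) \\
             &= 130 + 70 + 21 \\
             &= 221
\end{align*}
```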

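Similarly, if you give GPT-3 a task that's too complex to do in the time it takes to calculate its next token, it may confabulate an incorrect guess. Yet, akin to humans, that doesn't necessarily mean the model is incapable of the task. With some time and space to reason things out, the model still may be able to answer reliably.

As an example, if you ask gpt-3.5-turbo-instruct the following math problem about juggling balls, it answers incorrectly: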

Q: A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there?
A:

There are 8 blue golf balls.

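Does this mean that GPT-3 cannot do simple math problems? No; in fact, it turns out that by prompting the model with Let's think step by step, the model solves the problem reliably: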

Q: A juggler has 16 balls. Half of the balls are golf balls and half of the golf balls are blue. How many blue golf balls are there?
A: Let's think step by step.

There are 16 balls in total.
Half of the balls are golf balls.
That means that there are 8 golf balls.
Half of the golf balls are blue.
That means that there are 4 blue golf balls.

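Of course, it's hard to tell from only a single example whether this Let's think step by step trick actually works in general or just got lucky on this particular problem. But it really does work. On a benchmark of word math problems, the Let's think step by step trick raised GPT-3's solve rate massively, from a worthless 18% to a decent 79%!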
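
To make the trick concrete, here is a minimal sketch of how the two prompts above could be assembled in code. The function name `build_step_by_step_prompt` and its `use_trigger` flag are illustrative assumptions; sending the prompt to a model is left out.

```python
def build_step_by_step_prompt(question: str, use_trigger: bool = True) -> str:
    """Format a question as a Q/A completion prompt.

    With use_trigger=True the answer is prefixed with the reasoning
    trigger, which nudges the model to write out intermediate steps.
    """
    prompt = f"Q: {question.strip()}\nA:"
    if use_trigger:
        prompt += " Let's think step by step."
    return prompt


question = (
    "A juggler has 16 balls. Half of the balls are golf balls and half of "
    "the golf balls are blue. How many blue golf balls are there?"
)
print(build_step_by_step_prompt(question))
```

Keeping the trigger behind a flag makes it easy to compare both prompt variants on the same set of questions.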

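Model capabilities depend on context

When learning to work with GPT-3, one common conceptual mistake is to believe that its capabilities are fixed across all contexts. E.g., if GPT-3 gets a simple logic question wrong, then it must be incapable of simple logic.

But as the Let's think step by step example illustrates, apparent failures of GPT-3 can sometimes be remedied with a better prompt that helps the model steer itself toward the correct output.

How to improve reliability on complex tasks

The rest of this article shares techniques for improving the reliability of large language models on complex tasks. Although some of the techniques are specific to certain types of problems, many of them are built upon general principles that can be applied to a wide range of tasks, e.g.:

Give clearer instructions
Split complex tasks into simpler subtasks
Structure the instruction to keep the model on task
Prompt the model to explain before answering
Ask for justifications of many possible answers, and then synthesize
Generate many outputs, and then use the model to pick the best one
Fine-tune custom models to maximize performance

Split complex tasks into simpler tasks

One way to give a model more time and space to think is to break tasks into simpler pieces.

As an example, consider a task where we ask the model a multiple-choice question about some text; in this case, a game of Clue. When asked directly, gpt-3.5-turbo-instruct isn't able to put clues 3 and 5 together, and answers incorrectly: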

Use the following clues to answer the following multiple-choice question.

Clues:
1. Miss Scarlett was the only person in the lounge.
2. The person with the pipe was in the kitchen.
3. Colonel Mustard was the only person in the observatory.
4. Professor Plum was not in the library nor the billiard room.
5. The person with the candlestick was in the observatory.

Question: Was Colonel Mustard in the observatory with the candlestick?
(a) Yes; Colonel Mustard was in the observatory with the candlestick
(b) No; Colonel Mustard was not in the observatory with the candlestick
(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick

Solution:

(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick

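Although clues 3 and 5 establish that Colonel Mustard was the only person in the observatory and that the person in the observatory had the candlestick, the model fails to combine them into the correct answer of (a) Yes.

However, instead of asking for the answer directly, we can split the task into three pieces:

First, go through the clues one by one and consider whether the clue is potentially relevant
Second, combine the relevant clues to reason out the answer to the question
Third, write the final answer: either (a), (b), or (c)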

Use the following clues to answer the following multiple-choice question, using the following procedure:
(1) First, go through the clues one by one and consider whether the clue is potentially relevant
(2) Second, combine the relevant clues to reason out the answer to the question
(3) Third, map the answer to one of the multiple choice answers: either (a), (b), or (c)

Clues:
1. Miss Scarlett was the only person in the lounge.
2. The person with the pipe was in the kitchen.
3. Colonel Mustard was the only person in the observatory.
4. Professor Plum was not in the library nor the billiard room.
5. The person with the candlestick was in the observatory.

Question: Was Colonel Mustard in the observatory with the candlestick?
(a) Yes; Colonel Mustard was in the observatory with the candlestick
(b) No; Colonel Mustard was not in the observatory with the candlestick
(c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick

Solution:
(1) First, go through the clues one by one and consider whether the clue is potentially relevant:

Clue 1: Miss Scarlett was the only person in the lounge. This clue is not relevant.
Clue 2: The person with the pipe was in the kitchen. This clue is not relevant.
Clue 3: Colonel Mustard was the only person in the observatory. This clue is relevant.
Clue 4: Professor Plum was not in the library nor the billiard room. This clue is not relevant.
Clue 5: The person with the candlestick was in the observatory. This clue is relevant.

(2) Second, combine the relevant clues to reason out the answer to the question:
Clue 3 and Clue 5 both say that the person with the candlestick was in the observatory. Colonel Mustard was the only person in the observatory, so Colonel Mustard was the person with the candlestick.

(3) Third, write the final answer:
(a) Yes; Colonel Mustard was in the observatory with the candlestick

By giving the model more time and space to think, and guiding it along a reasoning plan, it's able to figure out the correct answer of (a) Yes.
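
The procedure prompt above follows a fixed template, so it can be generated from data. The sketch below is one possible way to do that; `PROCEDURE` and `build_procedure_prompt` are hypothetical names, and the prompt deliberately ends with the first step so that the model continues from there.

```python
from typing import List, Sequence, Tuple

# The three-step reasoning plan, copied from the example prompt.
PROCEDURE: List[str] = [
    "First, go through the clues one by one and consider whether the clue is potentially relevant",
    "Second, combine the relevant clues to reason out the answer to the question",
    "Third, map the answer to one of the multiple choice answers: either (a), (b), or (c)",
]


def build_procedure_prompt(
    clues: Sequence[str], question: str, choices: Sequence[Tuple[str, str]]
) -> str:
    """Assemble a prompt that states the procedure, then the clues and question."""
    steps = "\n".join(f"({i}) {step}" for i, step in enumerate(PROCEDURE, start=1))
    clue_lines = "\n".join(f"{i}. {clue}" for i, clue in enumerate(clues, start=1))
    choice_lines = "\n".join(f"({label}) {text}" for label, text in choices)
    return (
        "Use the following clues to answer the following multiple-choice "
        "question, using the following procedure:\n"
        f"{steps}\n\n"
        f"Clues:\n{clue_lines}\n\n"
        f"Question: {question}\n{choice_lines}\n\n"
        # Ending on step (1) invites the model to start executing the plan.
        f"Solution:\n(1) {PROCEDURE[0]}:"
    )
```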

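Another benefit of splitting complex instructions into smaller subtasks is that it can help keep the model focused on each subtask.

For example, if we ask gpt-3.5-turbo-instruct to summarize a text in its original language, the model can lapse back into English: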

Summarize the text using the original language of the text. The summary should be one sentence long.

Text:
"""
La estadística (la forma femenina del término alemán Statistik, derivado a su vez del italiano statista, "hombre de Estado") es una ciencia que estudia la variabilidad, colección, organización, análisis, interpretación, y presentación de los datos, así como el proceso aleatorio que los genera siguiendo las leyes de la probabilidad. La estadística es una ciencia formal deductiva, con un conocimiento propio, dinámico y en continuo desarrollo obtenido a través del método científico formal. En ocasiones, las ciencias fácticas necesitan utilizar técnicas estadísticas durante su proceso de investigación factual, con el fin de obtener nuevos conocimientos basados en la experimentación y en la observación. En estos casos, la aplicación de la estadística permite el análisis de datos provenientes de una muestra representativa, que busca explicar las correlaciones y dependencias de un fenómeno físico o natural, de ocurrencia en forma aleatoria o condicional.
"""

Summary:

The text explains that statistics is a science that studies the variability, collection, organization, analysis, interpretation, and presentation of data, as well as the random process that generates them following the laws of probability.

However, if we first ask the model to identify the language of the text, and then summarize the text, it becomes more reliable:

First, identify the language of the text. Second, summarize the text using the original language of the text. The summary should be one sentence long.

Text:
"""
La estadística (la forma femenina del término alemán Statistik, derivado a su vez del italiano statista, "hombre de Estado") es una ciencia que estudia la variabilidad, colección, organización, análisis, interpretación, y presentación de los datos, así como el proceso aleatorio que los genera siguiendo las leyes de la probabilidad. La estadística es una ciencia formal deductiva, con un conocimiento propio, dinámico y en continuo desarrollo obtenido a través del método científico formal. En ocasiones, las ciencias fácticas necesitan utilizar técnicas estadísticas durante su proceso de investigación factual, con el fin de obtener nuevos conocimientos basados en la experimentación y en la observación. En estos casos, la aplicación de la estadística permite el análisis de datos provenientes de una muestra representativa, que busca explicar las correlaciones y dependencias de un fenómeno físico o natural, de ocurrencia en forma aleatoria o condicional.
"""

Language:

Spanish

La estadística es una ciencia que estudia la variabilidad, colección, organización, análisis, interpretación, y presentación de los datos, así como el proceso aleatorio que los genera siguiendo las leyes de la probabilidad.
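
The example above puts both steps into a single prompt. A variant, sketched here as an assumption rather than as the method used above, issues two separate model calls and feeds the detected language into the second one. The `complete` argument is a hypothetical callable that takes a prompt string and returns the model's text.

```python
from typing import Callable


def _quote(text: str) -> str:
    """Wrap the text in triple quotes, as in the prompts above."""
    return f'Text:\n"""\n{text}\n"""'


def summarize_in_original_language(text: str, complete: Callable[[str], str]) -> str:
    """Identify the language first, then request a summary in that language."""
    # Subtask 1: language identification only.
    language = complete(
        "Identify the language of the text. Answer with the language name only.\n\n"
        f"{_quote(text)}\n\nLanguage:"
    ).strip()
    # Subtask 2: summarization, with the language stated explicitly.
    return complete(
        f"Summarize the text in {language}. The summary should be one sentence long.\n\n"
        f"{_quote(text)}\n\nSummary:"
    ).strip()
```

Two calls cost more than one, but each prompt carries a single instruction and the intermediate result (the language name) can be logged or checked.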

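Prompt the model to explain before answering

Another powerful technique for improving the reliability of answers is to prompt the model to gradually reason out the answer rather than jumping immediately to the final answer. By "thinking aloud," the model can be far more likely to arrive at the correct answer.

Zero-shot

Method

Published by Takeshi Kojima et al. in 2022, the easiest way to prompt a model to reason out the answer is to simply prepend answers with Let's think step by step. Figure 2 illustrates an example: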

zero-shot reasoning example
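Source: Large Language Models are Zero-Shot Reasoners by Takeshi Kojima et al. (2022).

Results

Applying this simple trick to the MultiArith math dataset, the authors found Let's think step by step quadrupled the accuracy, from 18% to 79%!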

zero-shot reasoning example
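Source: Large Language Models are Zero-Shot Reasoners by Takeshi Kojima et al. (2022).

Implications

Although the Let's think step by step trick works well on math problems, it's not effective on all tasks. The authors found that it was most helpful for multi-step arithmetic problems, symbolic reasoning problems, strategy problems, and other reasoning problems. It didn't help with simple math problems or common sense questions, and presumably wouldn't help with many other non-reasoning tasks either.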

zero-shot reasoning example
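Source: Large Language Models are Zero-Shot Reasoners by Takeshi Kojima et al. (2022).

To learn more, read the full paper.

If you apply this technique to your own tasks, don't be afraid to experiment with customizing the instruction. Let's think step by step is rather generic, so you may find better performance with instructions that hew to a stricter format customized to your use case. For example, you can try more structured variants like First, think step by step about why X might be true. Second, think step by step about why Y might be true. Third, think step by step about whether X or Y makes more sense. And you can even give the model an example format to help keep it on track, e.g.: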

Using the IRS guidance below, answer the following questions using this format:
(1) For each criterion, determine whether it is met by the vehicle purchase
- {Criterion} Let's think step by step. {explanation} {yes or no, or if the question does not apply then N/A}.
(2) After considering each criterion in turn, phrase the final answer as "Because of {reasons}, the answer is likely {yes or no}."

IRS guidance:
"""
You may be eligible for a federal tax credit under Section 30D if you purchased a car or truck that meets the following criteria:
- Does the vehicle have at least four wheels?
- Does the vehicle weigh less than 14,000 pounds?
- Does the vehicle draw energy from a battery with at least 4 kilowatt hours that may be recharged from an external source?
- Was the vehicle purchased in a year before 2022?
  - If so, has the manufacturer sold less than 200,000 qualifying vehicles? (Tesla and GM have sold more than 200,000 qualifying vehicles.)
- Was the vehicle purchased in a year after 2022?
  - If so, is the vehicle present in the following list of North American-assembled vehicles? (The only electric vehicles assembled in North America are the Audi Q5, BMW 330e, BMW X5, Chevrolet Bolt EUV, Chevrolet Bolt EV, Chrysler Pacifica PHEV, Ford Escape PHEV, Ford F Series, Ford Mustang MACH E, Ford Transit Van, GMC Hummer Pickup, GMC Hummer SUV, Jeep Grand Cherokee PHEV, Jeep Wrangler PHEV, Lincoln Aviator PHEV, Lincoln Corsair Plug-in, Lucid Air, Nissan Leaf, Rivian EDV, Rivian R1S, Rivian R1T, Tesla Model 3, Tesla Model S, Tesla Model X, Tesla Model Y, Volvo S60, BMW 330e, Bolt EV, Cadillac Lyriq, Mercedes EQS SUV, and Nissan Leaf.)
"""

Question: Can I claim a federal tax credit for my Toyota Prius Prime bought in 2021?

Solution:

(1) For each criterion, determine whether it is met by the vehicle purchase
- Does the vehicle have at least four wheels? Let's think step by step.

 The Toyota Prius Prime has four wheels, so the answer is yes.
- Does the vehicle weigh less than 14,000 pounds? Let's think step by step. The Toyota Prius Prime weighs less than 14,000 pounds, so the answer is yes.
- Does the vehicle draw energy from a battery with at least 4 kilowatt hours that may be recharged from an external source? Let's think step by step. The Toyota Prius Prime has a battery with at least 4 kilowatt hours that may be recharged from an external source, so the answer is yes.
- Was the vehicle purchased in a year before 2022? Let's think step by step. The Toyota Prius Prime was purchased in 2021, which is before 2022, so the answer is yes.
- Was the vehicle purchased in a year after 2022? N/A
- If so, is the vehicle present in the following list of North American-assembled vehicles? N/A

(2) After considering each criterion in turn, phrase the final answer as "Because of {reasons}, the answer is likely {yes or no}."

Because the Toyota Prius Prime meets all of the criteria for a federal tax credit, the answer is likely yes.
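
One practical benefit of a strict answer format is that the verdict can be extracted programmatically. The sketch below assumes the completion ends with the sentence format requested in the prompt; `extract_final_answer` is a hypothetical helper name.

```python
import re
from typing import Optional

# Matches the closing sentence format requested in the prompt. The format
# placeholder "{yes or no}" does not match, because of the opening brace.
FINAL_ANSWER = re.compile(r"the answer is likely (yes|no)\b", re.IGNORECASE)


def extract_final_answer(completion: str) -> Optional[str]:
    """Return 'yes' or 'no' from the last verdict sentence, or None if absent."""
    matches = FINAL_ANSWER.findall(completion)
    return matches[-1].lower() if matches else None


completion = (
    '(2) After considering each criterion in turn, phrase the final answer as '
    '"Because of {reasons}, the answer is likely {yes or no}."\n\n'
    "Because the Toyota Prius Prime meets all of the criteria for a federal "
    "tax credit, the answer is likely yes."
)
print(extract_final_answer(completion))
```

Returning None when no verdict is found lets the caller decide whether to retry or flag the output, instead of silently guessing.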

Few-shot examples

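Method

Prompting the model to reason out its answers can be done in many ways. One way is to demonstrate with a few examples ('few-shot'), as studied by Jason Wei and Denny Zhou et al. from Google. Here's an example few-shot chain-of-thought prompt: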

chain of thought example
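Source: Chain of Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei and Denny Zhou et al. (2022)

More demonstrations of reasoning chains written by human labelers: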

chain of thought example
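Source: Chain of Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei and Denny Zhou et al. (2022)

(Note that it has been called into question whether pears actually float)

Results

Testing on grade school math problems, the authors found that chain-of-thought prompting tripled the solve rate, from 18% to 57%.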

chain of thought example
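Source: Chain of Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei and Denny Zhou et al. (2022)

In addition to math problems, chain-of-thought prompting also lifted performance on questions related to sports understanding, coin flip tracking, and last letter concatenation. In most cases, not many examples were needed to saturate the performance gains (fewer than 8 or so).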

chain of thought example
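Source: Chain of Thought Prompting Elicits Reasoning in Large Language Models by Jason Wei and Denny Zhou et al. (2022)

To learn more, read the full paper.

Implications

One advantage of the few-shot example-based approach relative to the Let's think step by step technique is that you can more easily specify the format, length, and style of reasoning that you want the model to perform before landing on its final answer. This can be particularly helpful in cases where the model isn't initially reasoning in the right way or depth.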
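
As a rough illustration, a few-shot chain-of-thought prompt can be assembled from worked examples like this. The function name and the "The answer is ..." closing phrase are assumptions chosen for the sketch; the demonstration reuses the juggler problem from earlier in this article.

```python
from typing import Sequence, Tuple

# Each demonstration is (question, written-out reasoning, final answer).
Demo = Tuple[str, str, str]


def build_few_shot_cot_prompt(examples: Sequence[Demo], question: str) -> str:
    """Concatenate worked examples, then the new question with an open answer."""
    blocks = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in examples
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


demos = [
    (
        "A juggler has 16 balls. Half of the balls are golf balls and half of "
        "the golf balls are blue. How many blue golf balls are there?",
        "There are 16 balls in total. Half of the balls are golf balls, so "
        "there are 8 golf balls. Half of the golf balls are blue, so there "
        "are 4 blue golf balls.",
        "4",
    )
]
print(build_few_shot_cot_prompt(demos, "A baker has 12 eggs and uses a third of them. How many are left?"))
```

Because the demonstrations fix the reasoning style and the closing phrase, the model's final answer tends to appear in a predictable place, which simplifies parsing.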

Fine-tuned

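Method

In general, to eke out maximum performance on a task, you'll need to fine-tune a custom model. However, fine-tuning a model using explanations may take thousands of example explanations, which are costly to write.

In 2022, Eric Zelikman and Yuhuai Wu et al. published a clever procedure for using a few-shot prompt to generate a dataset of explanations that could be used to fine-tune a model. The idea is to use a few-shot prompt to generate candidate explanations, and only keep the explanations that produce the correct answer. Then, to get additional explanations for some of the incorrect answers, retry the few-shot prompt but with correct answers given as part of the question. The authors called their procedure STaR (Self-taught Reasoner):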

STaR procedure
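Source: STaR: Bootstrapping Reasoning With Reasoning by Eric Zelikman and Yuhuai Wu et al. (2022)

With this technique, you can combine the benefits of fine-tuning with the benefits of chain-of-thought prompting without needing to write thousands of example explanations.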
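
The data-collection part of the procedure can be sketched as a filter-and-retry loop. This is a simplified single pass under stated assumptions, not the authors' implementation: `generate` is a hypothetical callable wrapping the few-shot prompt, the retry is applied to every wrong answer rather than a subset, and the fine-tuning step itself is omitted.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# generate(question, hint) returns (explanation, final_answer).
# hint is None on the first try, or the correct answer on the retry.
Generate = Callable[[str, Optional[str]], Tuple[str, str]]


def collect_star_examples(
    dataset: Iterable[Tuple[str, str]], generate: Generate
) -> List[Tuple[str, str, str]]:
    """Keep (question, explanation, answer) triples whose answer is correct."""
    kept = []
    for question, correct in dataset:
        explanation, answer = generate(question, None)
        if answer != correct:
            # Retry with the correct answer given as part of the question.
            explanation, answer = generate(question, correct)
        if answer == correct:
            kept.append((question, explanation, correct))
    return kept
```

The returned triples would then serve as fine-tuning data; questions the model still gets wrong after the hint are simply dropped.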

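Results

When the authors applied this technique to a Common Sense Q&A dataset, they found that STaR outperformed both chain-of-thought prompting alone (73% > 37%) and fine-tuning alone (73% > 60%):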

STaR results
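Source: STaR: Bootstrapping Reasoning With Reasoning by Eric Zelikman and Yuhuai Wu et al. (2022)

To learn more, read the full paper.

Implications

Using a few-shot prompt to extend or modify a fine-tuning dataset is an idea that can be generalized beyond explanation writing. For example, if you have large quantities of unstructured text that you want to train on, you may find opportunities to use a prompt to extract a structured dataset from your unstructured text, and then fine-tune a custom model on that structured dataset.

Extensions to chain-of-thought prompting

A number of extensions of chain-of-thought prompting have been published as well.

Selection-inference prompting

Method

Published by Antonia Creswell et al., one extension of the chain-of-thought technique is to split the single prompt for generating explanations and answers into smaller parts. First, a prompt selects a relevant subset of facts from the text ('selection prompt'). Then, a second prompt infers a conclusion from the selected facts ('inference prompt'). These prompts are then alternated in a loop to generate multiple steps of reasoning and eventually land on a final answer. The authors illustrate the idea in the following figure: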

Selection-inference prompting
Source: Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning by Antonia Creswell et al. (2022)
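
The alternating loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: `select` and `infer` are hypothetical callables standing in for the selection prompt and the inference prompt, and the number of steps is fixed in advance.

```python
from typing import Callable, List


def selection_inference(
    facts: List[str],
    question: str,
    select: Callable[[List[str], str], List[str]],
    infer: Callable[[List[str]], str],
    steps: int,
) -> List[str]:
    """Alternate selection and inference, returning the chain of conclusions."""
    known = list(facts)  # copy so the caller's list is not modified
    trace: List[str] = []
    for _ in range(steps):
        chosen = select(known, question)  # selection prompt: pick relevant facts
        conclusion = infer(chosen)        # inference prompt: derive one new fact
        trace.append(conclusion)
        known.append(conclusion)          # new fact is available to later steps
    return trace
```

Note that the inference call only sees the selected facts, not the question or the full context, which keeps each step small.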

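Results

When applied to a 7B-parameter model, the authors found that selection-inference prompting substantially improved performance relative to chain-of-thought prompting on the bAbI and ProofWriter benchmark tasks (both of which require longer sequences of reasoning steps). The best performance they achieved combined selection-inference prompting with fine-tuning.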

Selection-inference prompting
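Source: Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning by Antonia Creswell et al. (2022)

Implications

Although the gains on these benchmarks were large, these benchmarks were specifically chosen because they required longer sequences of reasoning. On problems that don't require reasoning with many steps, the gains are likely smaller.

The results highlight a couple of general lessons for working with large language models. One, splitting up complex tasks into smaller tasks is a great way to improve reliability and performance; the more atomic the task, the less room there is for the model to err. Two, getting maximum performance often means combining fine-tuning with whatever approach you've chosen.

To learn more, read the full paper.

Faithful reasoning architecture

A few months after publishing the selection-inference prompting technique, the authors extended the technique in a follow-up paper, with ideas for:

figuring out when the selection-inference cycle should stop or continue
adding a value function to help search over multiple reasoning paths
reducing hallucination of fake facts by fine-tuning models to reason about sentence labels (e.g., sen1) rather than writing out the sentences themselves

Method

In the original selection-inference technique, specialized 'selection' and 'inference' prompts are alternated to select facts and make inferences from those facts, combining to generate a sequence of reasoning steps.

The authors extend this technique with two additional components.

First, the authors add a 'halter' model that, after each inference step, is asked whether the inferences thus far are sufficient to answer the question. If yes, then the model generates a final answer.

The halter model brings a couple of advantages:

it can tell the selection-inference process to stop or keep going, as necessary
if the process never halts, you'll get no answer, which is often preferable to a hallucinated guess

Source: Faithful Reasoning Using Large Language Models by Antonia Creswell et al. (2022)

Second, the authors add a value function, which is used to assess the quality of reasoning steps and search over multiple reasoning trajectories. This echoes a common theme for increasing reliability; instead of generating a single answer from the model, generate a set of answers and then use some type of value function / discriminator / verifier model to pick the best one.

Source: Faithful Reasoning Using Large Language Models by Antonia Creswell et al. (2022)

In addition to these two extensions, the authors also use a trick to reduce hallucination of fake facts. Rather than asking the model to write out factual sentences, they fine-tune a model to work with sentence labels (e.g., sen1) instead. This helps prevent the model from hallucinating fake facts not mentioned in the prompt context.

Source: Faithful Reasoning Using Large Language Models by Antonia Creswell et al. (2022)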
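
The halter idea can be sketched as a bounded loop that returns no answer when the halter never fires. Here `step` and `halter` are hypothetical callables (one reasoning step, and a check that returns a final answer or None); the value function and the sentence-label fine-tuning are not covered by this sketch.

```python
from typing import Callable, List, Optional


def reason_with_halter(
    step: Callable[[List[str]], str],
    halter: Callable[[List[str]], Optional[str]],
    max_steps: int,
) -> Optional[str]:
    """Run reasoning steps until the halter produces an answer or steps run out."""
    trace: List[str] = []
    for _ in range(max_steps):
        trace.append(step(trace))  # one selection-inference step
        answer = halter(trace)     # are the inferences so far sufficient?
        if answer is not None:
            return answer
    # The halter never fired: abstain instead of guessing.
    return None
```

Returning None makes the abstention explicit, so the caller can treat "no answer" differently from a wrong answer.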

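Results

The authors evaluated their technique on two benchmarks: the ProofWriter task (not shown) and EntailmentBankQA (shown). The technique increased accuracy substantially, especially on harder reasoning problems.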

Faithful reasoning
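Source: Faithful Reasoning Using Large Language Models by Antonia Creswell et al. (2022) https://arxiv.org/abs/2208.14271

In addition, their sentence label manipulation trick essentially eliminated hallucination!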

Faithful reasoning
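Source: Faithful Reasoning Using Large Language Models by Antonia Creswell et al. (2022) https://arxiv.org/abs/2208.14271

Implications

This paper illustrates a number of helpful lessons for improving the reliability of large language models:

Split complex tasks into smaller, more reliable subtasks
Generate your answer in a step-by-step fashion, evaluating it along the way
Generate many possible answers and use another model or function to pick the ones that look best
Reduce hallucination by constraining what the model can say (e.g., by using sentence labels instead of sentences)
Maximize performance of models by fine-tuning them on specialized tasks

To learn more, read the full paper.

Least-to-most prompting

In addition to doing poorly on long reasoning chains (where selection-inference shines), chain-of-thought prompting can especially struggle when the examples are short but the task is long.

Method

Least-to-most prompting is another technique that splits up reasoning tasks into smaller, more reliable subtasks. The idea is to elicit a subtask from the model by prompting it with something like To solve {question}, we need to first solve: ". Then, with that subtask in hand, the model can generate a solution. The solution is appended to the original question and the process is repeated until a final answer is produced.

Least-to-most prompting
Source: Least-to-most Prompting Enables Complex Reasoning in Large Language Models by Denny Zhou et al. (2022)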
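
The loop described above might be sketched like this. Everything beyond the quoted prompt prefix is an assumption made for illustration: `complete` is a hypothetical prompt-to-text callable, an empty reply is taken to mean "no further subtask", and `max_subproblems` caps the loop.

```python
from typing import Callable


def least_to_most(
    question: str, complete: Callable[[str], str], max_subproblems: int = 5
) -> str:
    """Elicit and solve subproblems one at a time, then answer the question."""
    context = question
    for _ in range(max_subproblems):
        # Ask the model which subproblem should be solved first.
        sub = complete(
            f'{context}\n\nTo solve "{question}", we need to first solve: "'
        ).strip().rstrip('"')
        if not sub or sub == question:
            break  # assumed stop signal: nothing left to decompose
        solution = complete(f"{context}\n\nQ: {sub}\nA:").strip()
        # Append the solved subproblem so later steps can build on it.
        context += f"\n\nQ: {sub}\nA: {solution}"
    return complete(f"{context}\n\nQ: {question}\nA:").strip()
```

Each call sees all previously solved subproblems, so the final prompt contains the full chain of intermediate results.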

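Results

When applied to benchmarks involving long reasoning chains using code-davinci-002 (which is optimized for code but can still understand text), the authors measured gains as large as 16% -> 99.7%!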

Least-to-most prompting results on DROP numerical reasoning
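Source: Least-to-most Prompting Enables Complex Reasoning in Large Language Models by Denny Zhou et al. (2022)

Implications

Although the above gains from least-to-most prompting are impressive, they are measured on a very narrow set of tasks that require long reasoning chains.

Still, they illustrate a common theme: increase reliability by (a) breaking complex tasks into smaller subtasks and (b) giving the model more time and space to work out the answer.

To learn more, read the full paper.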

Maieutic prompting

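Method

In contrast to the previous techniques, which try to maximize the likelihood of correct answers, another approach is to use GPT-3 to generate a tree of possible explanations (both correct and incorrect), and then analyze their relationships to guess at which set is correct. This technique was coined maieutic prompting by Jaehun Jung et al. in May 2022 (maieutic means relating to the Socratic method of asking questions to elicit ideas).

The method is complicated, and works as follows:

First, build a maieutic tree, where each node is a statement that could be true or false:
- Start with a multiple-choice question or true/false statement (e.g. War cannot have a tie)
- For each possible answer to the question, use the model to generate a corresponding explanation (with a prompt like War cannot have a tie? True, because)
- Then, prompt the model with the question and the generated explanation, and ask it to produce the answer. If reversing the explanation (with a prefix like It is wrong to say that {explanation}) reverses the answer, then the explanation is considered 'logically integral.'
- If an explanation is not logically integral, then repeat the above process recursively, with each explanation turned into a True or False question, and generate more explanations for each new question.
- After all of the recursive explaining is done, you end up with a tree of explanations, where each leaf on the tree has the property that reversing the explanation reverses the model's answer.
Second, convert the tree into a graph of relations:
- For each node in the tree, calculate the model's relative belief in each node (inferred from the probability of getting an answer of True given an explanation)
- For each pair of nodes in the tree, use the model to identify whether they are entailed (implied) or contradicted
Third, find the most consistent set of beliefs and take those to be true:
- Specifically, using the strength of belief in each node and the logical relationships between them, formulate the problem as a weighted maximum satisfiability problem (MAX-SAT)
- Use a solver to find the most self-consistent set of beliefs, and take those as true

Source: Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations by Jaehun Jung et al. (2022)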
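
To give a feel for the third step, here is a toy sketch that replaces the MAX-SAT solver with brute-force search over all truth assignments, which is only feasible for a handful of nodes. The scoring scheme is an assumption made for illustration: each node carries a signed weight (positive favors True), and each satisfied relation adds its weight.

```python
from itertools import product
from typing import Dict, List, Tuple

# (node_a, node_b, kind, weight); kind is "entails" or "contradicts".
Relation = Tuple[str, str, str, float]


def most_consistent_beliefs(
    confidence: Dict[str, float], relations: List[Relation]
) -> Dict[str, bool]:
    """Return the truth assignment with the highest total satisfied weight."""
    nodes = list(confidence)
    best: Dict[str, bool] = {}
    best_score = float("-inf")
    for values in product([True, False], repeat=len(nodes)):
        assign = dict(zip(nodes, values))
        # Belief term: reward agreeing with the model's signed confidence.
        score = sum(w if assign[n] else -w for n, w in confidence.items())
        for a, b, kind, w in relations:
            if kind == "entails":
                satisfied = (not assign[a]) or assign[b]
            else:  # "contradicts": both cannot be true together
                satisfied = not (assign[a] and assign[b])
            if satisfied:
                score += w
        if score > best_score:
            best, best_score = assign, score
    return best


# Q is weakly believed true, E strongly believed true, and E contradicts Q.
print(most_consistent_beliefs({"Q": 0.1, "E": 0.8}, [("E", "Q", "contradicts", 1.0)]))
```

In this toy case the strongly held explanation wins, so the weakly held statement Q is judged false even though the model leaned toward True when asked directly.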

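Results

Maieutic prompting results
Source: Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations by Jaehun Jung et al. (2022)

Implications

Beyond the complexity, one limitation of this method is that it appears to only apply to questions that can be posed as multiple-choice.

To learn more, read the full paper.

Extensions

Self-consistency

Method

For tasks with a discrete set of answers, one simple way to improve reliability is to sample multiple explanations and answers from the model (using a positive temperature) and then pick the final answer that appears most often.

Self-consistency method
Source: Self-Consistency Improves Chain of Thought Reasoning in Language Models by Xuezhi Wang et al. (2022)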
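
The voting step is short enough to sketch directly. Here `sample` is a hypothetical callable that runs one sampled generation (at a positive temperature) and returns only the extracted final answer; how the answer is parsed from the explanation is left to the caller.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistent_answer(
    sample: Callable[[str], str], prompt: str, n: int = 10
) -> Tuple[str, float]:
    """Sample n answers and return the most common one with its vote share."""
    answers = [sample(prompt) for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n
```

The vote share doubles as a rough confidence signal: a narrow majority suggests the question may deserve more samples or a manual check.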

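Results

This technique lifted accuracies by anywhere from 1 to 24 percentage points on a suite of math and reasoning benchmarks. (Plotted below are results from Google's LaMDA model; using Google's larger PaLM model, the baselines were higher but the gains were a bit smaller.)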

Self-consistency results
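Source: Self-Consistency Improves Chain of Thought Reasoning in Language Models by Xuezhi Wang et al. (2022)

Implications

Although this technique is simple to implement, it can be costly. Generating a set of 10 answers will increase your costs by 10x.

Also, as with many of these techniques, it applies only to tasks with a limited set of answers. For open-ended tasks where each answer is unique (such as writing a poem), it's not obvious what it would mean to pick the most common answer.

Lastly, this technique ought to be most beneficial when there are multiple paths or phrasings to reach an answer; if there's only one path, then the technique may not help at all. An extreme example: if the task was to generate a single token answer, then taking the most common token from 100 generations would be no different than taking the token with the highest logprobs (which you can get with a single generation at temperature=0).

Verifiers

Another key technique for improving task performance is to train a verifier or discriminator model to evaluate the outputs of the main generative model. If the discriminator rejects the output, then you can resample the generative model until you get an acceptable output. In many cases, it's easier to judge an answer than it is to create an answer, which helps explain the power of this method.

Method

In 2021, OpenAI researchers applied this technique to grade school math problems, using the following procedure:

First, they fine-tuned a model on questions and solutions
For each problem in the training set, they generated 100 solutions
Each of those 100 solutions was automatically labeled as either correct or incorrect, based on whether the final answer was correct
Using those solutions, with some labeled correct and some labeled incorrect, they fine-tuned a verifier model to classify whether a question and candidate solution was correct or incorrect
Finally, at test time, the generative model creates 100 solutions to each problem, and the one with the highest score according to the verifier model is picked as the final answer

Verifier method
Source: Training Verifiers to Solve Math Word Problems by Karl Cobbe et al. (2021)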
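
The test-time step of this procedure amounts to best-of-n selection. In the sketch below, `generate` and `score` are hypothetical callables standing in for the fine-tuned generator and the verifier; training either model is outside the scope of the example.

```python
from typing import Callable


def best_of_n(
    question: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 100,
) -> str:
    """Generate n candidate solutions and return the one the verifier rates highest."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda candidate: score(question, candidate))
```

Compared with majority voting, this selection also works when answers are free-form, since the verifier scores each candidate on its own.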

Results

With a 175B GPT-3 model and 8,000 training examples, this technique substantially lifted grade school math accuracy from ~33% to ~55%.

Verifier results
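Source: Training Verifiers to Solve Math Word Problems by Karl Cobbe et al. (2021)

Implications

Similar to the self-consistency technique, this method can get expensive, as generating, say, 100 solutions per task will increase your costs by roughly 100x.

Theories of reliability

Although the techniques above vary in their approach, they all share the goal of improving reliability on complex tasks. Mainly they do this by:

decomposing unreliable operations into smaller, more reliable operations (e.g., selection-inference prompting)
using multiple steps or multiple relationships to make the system's reliability greater than any individual component (e.g., maieutic prompting)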
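
A back-of-the-envelope calculation shows why both ideas matter. The sketch below rests on strong simplifying assumptions that real models do not satisfy exactly: every step or sample is correct independently with the same probability p, answers are either right or wrong, and n is odd.

```python
from math import comb


def chain_reliability(p: float, steps: int) -> float:
    """Probability that every step of a chain is correct (independent steps)."""
    return p ** steps


def majority_vote_reliability(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# Five chained steps at 90% each are right only about 59% of the time ...
print(round(chain_reliability(0.9, 5), 5))
# ... while voting over three 80%-reliable samples is right about 90% of the time.
print(round(majority_vote_reliability(0.8, 3), 3))
```

Under these assumptions, long chains erode reliability quickly, which is why making each step more reliable pays off, and combining several imperfect samples can beat any single one.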

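Lesson	Paper	Date
Break complex tasks into simpler subtasks (and consider exposing the intermediate outputs to users)	AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts	Oct 2021
You can improve output by generating many candidates, and then picking the one that looks best	Training Verifiers to Solve Math Word Problems	Oct 2021
On reasoning tasks, models do better when they reason step-by-step before answering	Chain of Thought Prompting Elicits Reasoning in Large Language Models	Jan 2022
You can improve step-by-step reasoning by generating many explanation-answer outputs, and picking the most popular answer	Self-Consistency Improves Chain of Thought Reasoning in Language Models	Mar 2022
If you want to fine-tune a step-by-step reasoner, you can do it with multiple-choice question and answer data alone	STaR: Bootstrapping Reasoning With Reasoning	Mar 2022
The step-by-step reasoning method works great even with zero examples	Large Language Models are Zero-Shot Reasoners	May 2022
You can do better than step-by-step reasoning by alternating a 'selection' prompt and an 'inference' prompt	Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning	May 2022
On long reasoning problems, you can improve step-by-step reasoning by splitting the problem into pieces to solve incrementally	Least-to-most Prompting Enables Complex Reasoning in Large Language Models	May 2022
You can have the model analyze both good and bogus explanations to figure out which set of explanations is most consistent	Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations	May 2022
You can think about these techniques in terms of probabilistic programming, where systems comprise unreliable components	Language Model Cascades	Jul 2022
You can eliminate hallucination with sentence label manipulation, and you can reduce wrong answers with a 'halter' prompt	Faithful Reasoning Using Large Language Models	Aug 2022

Sep 12, 2022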