Achieve More, Pay Less - Use GPT-4 Smartly

May 18, 2023 · 8 min read

Principal Researcher at Microsoft Research

An adaptive way of using GPT-3.5 and GPT-4 outperforms GPT-4 in both coding success rate and inference cost

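TL;DR:

A case study on the HumanEval benchmark shows that adaptively using multiple GPT models for coding achieves both higher accuracy (from 68% to 90%) and lower inference cost (by 18%) than using GPT-4 alone.

GPT-4 is a major upgrade in foundation model capability, for example in code and math, and it comes with a much higher price per token than GPT-3.5-Turbo (more than 10x). On HumanEval, a code completion benchmark developed by OpenAI, GPT-4 solves 68% of the tasks while GPT-3.5-Turbo solves 46%. The success rate of GPT-4 can be pushed higher by generating multiple responses or making multiple calls. However, that further increases the cost, which is already nearly 20 times that of GPT-3.5-Turbo and comes with a stricter API rate limit. Can we achieve more with less?

In this blog post, we explore a creative, adaptive way of using GPT models that leads to a big leap forward.

Observations

GPT-3.5-Turbo can already solve 40%-50% of the tasks. If we never use GPT-4 for these tasks, we can save nearly 40%-50% of the cost.
If we spend the saved cost on generating more GPT-4 responses for the remaining unsolved tasks, it is possible to solve more of them while keeping the amortized cost down.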
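
The first observation can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only: it assumes a GPT-4 call costs 20 units for every 1 unit of a GPT-3.5-Turbo call (the "nearly 20 times" figure quoted above), that GPT-3.5-Turbo solves 46% of the tasks, and that a perfect filter tells us when the cheap answer is correct. The function name and the unit costs are hypothetical.

```python
# Rough cost model, in made-up units (assumptions, not measured prices):
# one GPT-3.5-Turbo call = 1 unit, one GPT-4 call = 20 units.

def expected_cost_per_task(p_cheap_solves, cheap_cost=1.0, expensive_cost=20.0):
    """Always try the cheap model first; call the expensive one only on failure."""
    return cheap_cost + (1.0 - p_cheap_solves) * expensive_cost

gpt4_only = 20.0
cascade = expected_cost_per_task(p_cheap_solves=0.46)
saving = 1.0 - cascade / gpt4_only

print(f"cascade: {cascade:.1f} units, saving vs GPT-4 only: {saving:.0%}")
# prints: cascade: 11.8 units, saving vs GPT-4 only: 41%
```

Under these assumptions the saving lands in the 40%-50% range mentioned above, and that budget can be spent on extra GPT-4 responses for the hard tasks.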

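The obstacle to leveraging these observations is that we do not know in advance which tasks can be solved by the cheaper model, which can be solved by the more expensive model, and which can only be solved by paying even more for the expensive model.

To overcome that obstacle, one may want to predict which task requires which model and how many responses each task needs. Let's look at an example code completion task: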

def vowels_count(s):
    """Write a function vowels_count which takes a string representing
    a word as input and returns the number of vowels in the string.
    Vowels in this case are 'a', 'e', 'i', 'o', 'u'. Here, 'y' is also a
    vowel, but only when it is at the end of the given word.

    Example:
    >>> vowels_count("abcde")
    2
    >>> vowels_count("ACEDY")
    3
    """

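Can we predict whether GPT-3.5-Turbo can solve this task, or do we need GPT-4? My first guess was that GPT-3.5-Turbo would get it right because the instruction is fairly straightforward. Yet it turns out that GPT-3.5-Turbo does not consistently get it right if we give it only one chance. It is not obvious how to predict the performance without actually trying (but it is an interesting research question!).

What else can we do? We notice that it is "easier" to verify a given solution than to find a correct solution from scratch.

The docstring provides some simple example test cases. If we already have a response generated by a model, we can use those test cases to filter out wrong implementations, and then either use a more powerful model or generate more responses until the result passes the example test cases. Moreover, this step can be automated by asking GPT-3.5-Turbo to generate assertion statements from the examples given in the docstring (a simpler task that we can bet on) and executing the code.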
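
To make the filtering step concrete, here is a minimal sketch. It differs from the approach described above in one important way: instead of asking GPT-3.5-Turbo to write the assertions, it extracts them with Python's standard `doctest` parser, which only works when the docstring uses well-formed `>>>` examples. The helper names `assertions_from_docstring` and `passes_filter` are hypothetical, and the two candidate implementations are hand-written stand-ins for model responses.

```python
import doctest

def assertions_from_docstring(docstring):
    """Turn each `>>> call` / expected-output pair into an assert statement."""
    examples = doctest.DocTestParser().get_examples(docstring)
    return [
        f"assert {ex.source.strip()} == {ex.want.strip()}"
        for ex in examples
        if ex.want.strip()  # skip examples without an expected value
    ]

def passes_filter(candidate_code, assertions):
    """Run the candidate, then the assertions; any exception rejects it."""
    namespace = {}
    try:
        # NOTE: model-generated code is untrusted; run it in a sandbox in practice.
        exec(candidate_code, namespace)
        for statement in assertions:
            exec(statement, namespace)
    except Exception:
        return False
    return True

DOCSTRING = '''
    >>> vowels_count("abcde")
    2
    >>> vowels_count("ACEDY")
    3
'''

# Stand-in for a response that forgets the trailing-'y' rule.
NAIVE = '''
def vowels_count(s):
    return sum(c in "aeiou" for c in s.lower())
'''

# Stand-in for a response that handles the trailing 'y'.
FIXED = '''
def vowels_count(s):
    s = s.lower()
    return sum(c in "aeiou" for c in s) + s.endswith("y")
'''

checks = assertions_from_docstring(DOCSTRING)
print(passes_filter(NAIVE, checks), passes_filter(FIXED, checks))
# prints: False True
```

A rejected response does not tell us which model is needed, only that it is worth trying again with more samples or a stronger model.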

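Solution

Combining these observations, we can design a solution with two intuitive ideas:

Make use of auto-generated feedback, i.e., code execution results, to filter responses.
Try inference configurations one by one, until one response passes the filter.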

Design

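This solution works adaptively, without knowing or predicting which task fits which configuration. It simply tries multiple configurations one by one, starting from the cheapest. Note that one configuration can generate multiple responses (by setting the inference parameter n larger than 1), and different configurations can use the same model with different inference parameters such as n and temperature. Only one response is returned and evaluated per task.

An implementation of this solution is provided in autogen. It uses the following sequence of configurations: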

GPT-3.5-Turbo, n=1, temperature=0
GPT-3.5-Turbo, n=7, temperature=1, stop=["\nclass", "\ndef", "\nif", "\nprint"]
GPT-4, n=1, temperature=0
GPT-4, n=2, temperature=1, stop=["\nclass", "\ndef", "\nif", "\nprint"]
GPT-4, n=1, temperature=1, stop=["\nclass", "\ndef", "\nif", "\nprint"]
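
The control flow over this sequence can be sketched in a few lines. This is not the autogen implementation: `solve_adaptively`, `generate`, and `accept` are hypothetical names, `generate` stands in for whatever function calls the model and returns a list of n responses, and returning the last response when nothing passes is an assumption made for illustration.

```python
STOP = ["\nclass", "\ndef", "\nif", "\nprint"]

# The configuration sequence listed above, cheapest first.
CONFIGS = [
    {"model": "gpt-3.5-turbo", "n": 1, "temperature": 0},
    {"model": "gpt-3.5-turbo", "n": 7, "temperature": 1, "stop": STOP},
    {"model": "gpt-4", "n": 1, "temperature": 0},
    {"model": "gpt-4", "n": 2, "temperature": 1, "stop": STOP},
    {"model": "gpt-4", "n": 1, "temperature": 1, "stop": STOP},
]

def solve_adaptively(prompt, generate, accept, configs=CONFIGS):
    """Return (response, index of the configuration that produced it)."""
    last_response = None
    for index, config in enumerate(configs):
        for response in generate(prompt, config):
            last_response = response
            if accept(response):
                return response, index  # stop at the first response that passes
    # Nothing passed the filter: fall back to the last response generated.
    return last_response, len(configs) - 1

# Toy stand-in for a model call: only "gpt-4" produces an acceptable answer.
def fake_generate(prompt, config):
    answer = "good" if config["model"] == "gpt-4" else "bad"
    return [answer] * config["n"]

response, used = solve_adaptively("def f(): ...", fake_generate, lambda r: r == "good")
print(response, used)
# prints: good 2
```

Because the loop returns as soon as one response passes, easy tasks never reach the GPT-4 configurations.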

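Experiment Results

The first figure in this blog post shows the success rate and average inference cost of the adaptive solution compared with default GPT-4. The inference cost includes the cost of generating the assertions in our solution. The generated assertions are not always correct, and programs that pass or fail the generated assertions are not always right or wrong. Despite that, the adaptive solution increases the success rate (referred to as pass@1 in the literature) from 68% to 90%, while reducing the cost by 18%.

Here are a few examples of function definitions that are solved by different configurations in the portfolio.

Solved by GPT-3.5-Turbo, n=1, temperature=0: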

def compare(game,guess):
    """I think we all remember that feeling when the result of some long-awaited
    event is finally known. The feelings and thoughts you have at that moment are
    definitely worth noting down and comparing.
    Your task is to determine if a person correctly guessed the results of a number of matches.
    You are given two arrays of scores and guesses of equal length, where each index shows a match.
    Return an array of the same length denoting how far off each guess was. If they have guessed correctly,
    the value is 0, and if not, the value is the absolute difference between the guess and the score.


    example:

    compare([1,2,3,4,5,1],[1,2,3,4,2,-2]) -> [0,0,0,0,3,3]
    compare([0,5,0,0,0,4],[4,1,1,0,0,-2]) -> [4,4,1,0,0,6]
    """

Solved by GPT-3.5-Turbo, n=7, temperature=1, stop=["\nclass", "\ndef", "\nif", "\nprint"]: the vowels_count function presented earlier.
Solved by GPT-4, n=1, temperature=0:

def string_xor(a: str, b: str) -> str:
    """ Input are two strings a and b consisting only of 1s and 0s.
    Perform binary XOR on these inputs and return result also as a string.
    >>> string_xor('010', '110')
    '100'
    """

Solved by GPT-4, n=2, temperature=1, stop=["\nclass", "\ndef", "\nif", "\nprint"]:

def is_palindrome(string: str) -> bool:
    """ Test if given string is a palindrome """
    return string == string[::-1]


def make_palindrome(string: str) -> str:
    """ Find the shortest palindrome that begins with a supplied string.
    Algorithm idea is simple:
    - Find the longest postfix of supplied string that is a palindrome.
    - Append to the end of the string reverse of a string prefix that comes before the palindromic suffix.
    >>> make_palindrome('')
    ''
    >>> make_palindrome('cat')
    'catac'
    >>> make_palindrome('cata')
    'catac'
    """

Solved by GPT-4, n=1, temperature=1, stop=["\nclass", "\ndef", "\nif", "\nprint"]:

def sort_array(arr):
    """
    In this Kata, you have to sort an array of non-negative integers according to
    number of ones in their binary representation in ascending order.
    For similar number of ones, sort based on decimal value.

    It must be implemented like this:
    >>> sort_array([1, 5, 2, 3, 4]) == [1, 2, 3, 4, 5]
    >>> sort_array([-2, -3, -4, -5, -6]) == [-6, -5, -4, -3, -2]
    >>> sort_array([1, 0, 2, 3, 4]) [0, 1, 2, 3, 4]
    """

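The last problem is an example whose original definition contains wrong example test cases. It misleads the adaptive solution: a correct implementation is regarded as wrong, so more trials are made. The last configuration in the sequence returns the right implementation, even though it does not pass the auto-generated assertions. This example demonstrates that:

Our adaptive solution has a certain degree of fault tolerance.
The success rate and inference cost of the adaptive solution can be further improved if correct example test cases are used.

It is worth noting that the reduced inference cost is the amortized cost over all the tasks. For each individual task, the cost can be either larger or smaller than directly using GPT-4. This is the nature of the adaptive solution: the cost is in general larger for difficult tasks than for easy tasks.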
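
A small illustration of why per-task cost varies: a task pays for every configuration tried before one passes. The numbers below are made-up relative units, assuming one GPT-4 response costs 20 times one GPT-3.5-Turbo response and that cost scales linearly with n (a simplification, since the prompt is only charged once per call). `CONFIG_COSTS` and `task_cost` are hypothetical names.

```python
# Hypothetical relative cost of each configuration in the five-step sequence:
# GPT-3.5-Turbo n=1, GPT-3.5-Turbo n=7, GPT-4 n=1, GPT-4 n=2, GPT-4 n=1.
CONFIG_COSTS = [1, 7, 20, 40, 20]
GPT4_ONLY_COST = 20  # a single GPT-4 response, in the same made-up units

def task_cost(passing_index):
    """Sum the cost of every configuration tried, up to the one that passed."""
    return sum(CONFIG_COSTS[: passing_index + 1])

easy = task_cost(0)  # solved by the first, cheapest configuration
hard = task_cost(3)  # only solved by the fourth configuration

print(easy, hard, GPT4_ONLY_COST)
# prints: 1 68 20
```

In this toy setting the easy task costs far less than a single GPT-4 call and the hard task costs more, so whether the average drops depends on how many tasks are solved early.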

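An example notebook to run this experiment can be found at: https://github.com/microsoft/FLAML/blob/v1.2.1/notebook/research/autogen_code.ipynb. The experiment was run when AutoGen was a subpackage of FLAML.

Discussion

Our solution is quite simple to implement using the generic interface offered in autogen, yet the result is quite encouraging.

While the specific way of generating assertions is application-specific, the main ideas are general in LLM operations:

Generate multiple responses to select from - especially useful when selecting a good response is relatively easier than generating a good response in one shot.
Consider multiple configurations to generate responses - especially useful when:
- The choice of model and other inference parameters affects the utility-cost tradeoff; or
- Different configurations have complementary effects.

A previous blog post provides evidence that these ideas are relevant to solving math problems too. autogen uses a technique called EcoOptiGen to support inference parameter tuning and model selection.

There are many directions for extension in research and development:

Generalize the way feedback is provided.
Automate the process of optimizing the configurations.
Build adaptive agents for different applications.

Do you find this approach applicable to your use case? Do you have any other challenges to share about LLM applications? Would you like to see more support or research on LLM optimization or automation? Please join our Discord server for discussion.

For Further Reading

Documentation about autogen and Research paper.
Blog post about a related study for math.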
