AI SQL 准确性：测试不同的 LLM 和上下文策略以最大化 SQL 生成准确性

2023-08-17

太长不看

拥有一个能够回答业务用户简单英语问题的自主AI代理的承诺是一个吸引人但迄今为止难以实现的命题。许多人尝试过让ChatGPT写作，但成功有限。失败的主要原因是LLM缺乏对其被要求查询的特定数据集的知识。

在本文中， 我们展示了上下文的重要性，通过正确的上下文，我们可以将准确率从约3%提高到约80% 。我们探讨了三种不同的上下文策略，并展示了一个明显的赢家——我们将模式定义、文档和先前的SQL查询与相关性搜索结合起来。

我们还比较了几种不同的LLM - 包括Google Bison、GPT 3.5、GPT 4，以及Llama 2的简要尝试。虽然 GPT 4在生成SQL方面夺得了最佳整体LLM的桂冠 ，但在提供足够上下文的情况下，Google的Bison大致相当。

最后，我们展示了如何使用这里演示的方法为您的数据库生成SQL。

这是我们关键发现的总结 -

为什么使用AI生成SQL？

许多组织现在已经采用了某种形式的数据仓库或数据湖——一个存储了大量组织关键数据的存储库，这些数据可以用于分析目的。这片数据的海洋充满了潜在的洞察力，但企业中只有一小部分人具备利用这些数据所需的两项技能——

A solid comprehension of advanced SQL , and
A comprehensive knowledge of the organization’s unique data structure & schema

同时具备上述两种条件的人数不仅极其稀少，而且很可能与提出大多数问题的人群并不重合。

那么组织内部实际上发生了什么？ 业务用户，如产品经理、销售经理和高管，有数据问题，这些问题将影响业务决策和战略。他们会首先检查仪表板，但大多数问题都是临时的和具体的，答案并不容易获得，因此他们会询问数据分析师或工程师——那些具备上述技能组合的人。这些人很忙，需要一段时间来处理请求，一旦他们得到答案，业务用户就会有后续问题。

这个过程对双方都是痛苦的 对于业务用户（获取答案的周期长）和分析师（分散了他们主要项目的注意力）来说，这导致了许多潜在的洞察被丢失。

生成式AI登场！ LLMs（大型语言模型）可能为商业用户提供用简单英语查询数据库的机会（由LLMs进行SQL翻译），我们已经从数十家公司听说，这将为他们的数据团队甚至整个业务带来革命性的变化。

关键挑战是为复杂且混乱的数据库生成准确的SQL 。我们与许多人交谈过，他们尝试使用ChatGPT来编写SQL，但效果有限且过程痛苦。许多人已经放弃，并回到了手动编写SQL的传统方式。在最好的情况下，ChatGPT有时是分析师用来确保语法正确的有用副驾驶。

但是还有希望！ 我们过去几个月一直沉浸在这个问题中，尝试了各种模型、技术和方法来提高由LLM生成的SQL的准确性。在本文中，我们展示了各种LLM的表现，以及如何通过向LLM提供上下文相关的正确SQL的策略，使LLM能够 达到极高的准确性 。

设置测试架构

首先，我们需要定义测试的架构。下面是一个粗略的概述，分为五个步骤，并附有 伪代码 -

Question - We start with the business question.

   question = "how many clients are there in germany"

Prompt - We create the prompt to send to the LLM.

   prompt = f"""
   Write a SQL statement for the following question:
   {question}
   """

Generate SQL - Using an API, we’ll send the prompt to the LLM and get back generated SQL.

   sql = llm.api(api_key=api_key, prompt=prompt, parameters=parameters)

Run SQL - We'll run the SQL against the database.

    df = db.conn.execute(sql)

Validate results - Finally, we’ll validate that the results are in line with what we expect. There are some shades of grey when it comes to the results so we did a manual evaluation of the results. You can see those results here

设置测试杠杆

现在我们已经设置好了实验，我们需要弄清楚哪些杠杆会影响准确性，以及我们的测试集是什么。我们尝试了两个杠杆（LLMs和使用的训练数据），并在由20个问题组成的测试集上运行。因此，在这个实验中，我们总共运行了3个LLMs x 3种上下文策略 x 20个问题 = 180个单独的试验。

选择数据集

首先，我们需要 选择一个合适的数据集 来尝试。我们有几个指导原则 -

Representative . Datasets in enterprises are often complex and this complexity isn’t captured in many demo / sample datasets. We want to use a complicated database that has real-word use cases that contains real-world data.
Accessible . We also wanted that dataset to be publicly available.
Understandable . The dataset should be somewhat understandable to a wide audience - anything too niche or technical would be difficult to decipher.
Maintained . We’d prefer a dataset that’s maintained and updated properly, in reflection of a real database.

我们发现符合上述标准的数据集是Cybersyn SEC文件数据集，该数据集可在Snowflake市场上免费获取：

https://docs.cybersyn.com/our-data-products/economic-and-financial/sec-filings

选择问题

接下来，我们需要 选择问题 。以下是一些示例问题（在此文件中查看所有问题）-

How many companies are there in the dataset?
What annual measures are available from the 'ALPHABET INC.' Income Statement?
What are the quarterly 'Automotive sales' and 'Automotive leasing' for Tesla?
How many Chipotle restaurants are there currently?

现在我们有了数据集和问题，我们需要想出一些杠杆。

选择提示

对于提示，在这次运行中，我们将保持提示不变，尽管我们会在后续中进行提示的变化。

选择LLMs（基础模型）

为了测试 LLMs ，我们将尝试以下内容 -

Bison (Google) - Bison is the version of PaLM 2 that’s available via GCP APIs.
GPT 3.5 Turbo (OpenAI) - GPT 3.5 until recently was the flagship OpenAI model despite 4 being available because of latency and cost benefits, and not a huge accuracy difference (well - we’ll put that to the test) especially for basic tasks.
GPT 4 (OpenAI) - The more advanced but less performant OpenAI model. GPT 4 is multi-modal, though we won’t be using that functionality.
Llama 2 (Meta) - We really wanted to include an open source model - and the leading one is Meta’s Llama 2. But our setup through Replicate quickly conked out, and we weren’t able to fix it in time for publishing this. In our early runs, when it did work, we found performance to be mediocre at best.

选择上下文

最后，我们将有三种类型的 上下文 。上下文指的是我们发送给LLM的内容，这些内容帮助LLM理解我们特定的数据集。

Schema only . We put the schema (using DDL) in the context window.
Static examples . We put static example SQL queries in the context windows.
Contextually relevant examples . Finally, we put the most relevant context (SQL / DDL / documentation) into the context window, finding it via a vector search based on embeddings.

使用ChatGPT生成SQL

这一部分对ChatGPT有点不公平，但它是一个有用的起点。我们只需拿一个问题并让ChatGPT生成SQL -

提示

Write Snowflake SQL to answer the following question - 

What are the quarterly revenues by business segment for Alphabet?

Return SQL only

响应

当然，这里是SQL查询：

SELECT
  EXTRACT(QUARTER FROM date_column) AS quarter,
  business_segment,
  SUM(revenue) AS quarterly_revenue
FROM
  revenue_table
WHERE
  company_name = 'Alphabet'
GROUP BY
  quarter,
  business_segment
ORDER BY
  quarter,
  business_segment;

-- Replace date_column, business_segment, revenue, revenue_table, and adjust any other identifiers to match your specific database schema.

不可避免地，当我们尝试运行这个时，我们会遇到一个错误，因为我们没有提供表名，所以它生成了表名 -

当然，我们对LLMs有些不公平——尽管它们很神奇，但它们（不幸的是？幸运的是？）还不可能知道我们数据库中的内容。所以让我们进入测试，提供更多的上下文。

仅使用模式

首先，我们获取数据集的模式并将其放入上下文窗口中。这通常是我们看到人们在ChatGPT或教程中所做的。

一个示例提示可能如下所示（实际上我们使用了信息模式，因为Snowflake的共享工作方式，但这展示了原理） -

The user provides a question and you provide SQL. You will only respond with SQL code and not with any explanations.

Respond with only SQL code. Do not answer with any explanations -- just the code.

You may use the following DDL statements as a reference for what tables might be available.

CREATE TABLE Table1...

CREATE TABLE Table2...

CREATE TABLE Table3...

结果，简而言之，非常糟糕。在60次尝试中（20个问题 x 3个模型），只有两个问题被正确回答（都是由GPT 4回答的）， 准确率仅为3% 。以下是GPT 4成功回答正确的两个问题 -

What are the top 10 measure descriptions by frequency?
What are the distinct statements in the report attributes?

很明显，仅仅使用模式，我们并不能接近达到一个有用的AI SQL助手的标准，尽管它可能在作为分析师副驾驶方面有些用处。

使用SQL示例

如果我们站在一个第一次接触这个数据集的人的角度，除了表定义之外，他们首先会查看示例查询，以了解如何正确地查询数据库。

这些查询可以提供模式中不可用的额外上下文 - 例如，使用哪些列，表如何连接在一起，以及查询该特定数据集的其他复杂性。

Cybersyn，与Snowflake市场上的其他数据提供商一样，在他们的文档中提供了一些（在这个例子中是3个）示例查询。让我们将这些包含在上下文窗口中。

通过仅提供这3个示例查询，我们看到生成的SQL的正确性有了显著提高。然而，这种准确性在很大程度上取决于底层的LLM。似乎GPT-4最能以生成最准确SQL的方式概括示例查询。

使用上下文相关的示例

企业数据仓库通常包含数百（甚至数千）个表，以及覆盖其组织内所有用例的更多查询。鉴于现代LLM的上下文窗口大小有限，我们不能简单地将所有先前的查询和模式定义塞入提示中。

我们最终的上下文处理方法是一种更复杂的机器学习方法——将先前查询的嵌入和表模式加载到向量数据库中，并仅选择与所提问题最相关的查询/表。以下是我们所做工作的示意图——注意红色框中的上下文相关性搜索——

通过向LLM展示那些SQL查询的最相关示例，我们可以显著提高即使是能力较弱的LLM的性能。在这里，我们为LLM提供了问题的最相关的10个SQL查询示例（从存储的30个示例列表中），准确率大幅提升。

我们可以通过维护可执行的SQL语句历史记录来进一步提高性能，这些语句能够正确回答用户提出的实际问题。

分析结果

很明显，最大的差异不在于LLM的类型，而在于为LLM提供适当上下文的策略（例如使用的“训练数据”）。

当通过上下文策略查看SQL准确性时，很明显这是造成差异的原因。我们仅使用模式时的准确性约为3%，而在智能使用上下文示例时，准确性提高到约80%。

LLMs本身仍然存在一些有趣的趋势。虽然Bison在Schema和Static上下文策略中开始时处于堆的底部，但在完整的Contextual策略中它迅速上升到顶部。在三种策略中平均来看， GPT 4 成为SQL生成的最佳LLM 。

进一步提高准确性的后续步骤

我们很快将对此分析进行跟进，以更深入地了解准确的SQL生成。一些下一步的步骤是 -

Use other datasets : We'd love to try this on other, real world, enterprise datasets. What happens when you get to 100 tables? 1000 tables?
Add more training data : While 30 queries is great, what happens when you 10x, 100x that number?
Try more databases : This test was run on a Snowflake database, but we've also gotten this working on BigQuery, Postgres, Redshift, and SQL Server.
Experiment with more foundational models: We are close to being able to use Llama 2, and we'd love to try other LLMs.

我们有一些关于上述内容的轶事证据，但我们将扩展和完善我们的测试，以包括更多这些项目。

使用AI为您的数据集编写SQL

虽然SEC数据是一个好的开始，但您可能想知道这是否与您的数据和您的组织相关。我们正在构建一个 Python包，它可以为您的数据库生成SQL，以及生成Plotly代码用于图表、后续问题和其他各种功能。

以下是它的工作原理概述

import vanna as vn

Train Using Schema

vn.train(ddl="CREATE TABLE ...")

Train Using Documentation

vn.train(documentation="...")

Train Using SQL Examples

vn.train(sql="SELECT ...")

Generating SQL

使用 Vanna 的最简单方法是 vn.ask(question="What are the ...") ，它将返回 SQL、表格和图表，如您在此示例笔记本中所见。 vn.ask 是 vn.generate_sql 、 vn.run_sql 、 vn.generate_plotly_code 、 vn.get_plotly_figure 和 vn.generate_followup_questions 的封装。这将使用优化的上下文为您的提问生成 SQL，Vanna 将为您调用 LLM。

或者，您可以使用 vn.get_related_training_data(question="What are the ...") 如这个 notebook 所示，它将检索最相关的上下文，您可以使用这些上下文来构建自己的提示，发送给任何LLM。

这个 notebook 展示了如何使用“静态”上下文策略在Cybersyn SEC数据集上训练Vanna的示例。

关于命名的说明

Foundational Model : This is the underlying LLM
Context Model (aka Vanna Model) : This is a layer that sits on top of the LLM and provides context to the LLM
Training : Generally when we refer to "training" we're talking about training the context model.

联系我们

如果有任何问题，请在 Slack 、 Discord 上联系我们，或者安排一次1对1通话。

AI生成SQL的准确性如何？