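Text-to-SQL Guide (Query Engine + Retriever)

This is a basic guide to LlamaIndex's Text-to-SQL capabilities.

We first show how to perform text-to-SQL over a toy dataset: this covers both "retrieval" (running a SQL query against the database) and "synthesis".
We then show how to build a TableIndex over the schema so that relevant tables can be retrieved dynamically at query time.
Next, we show how to use query-time row and column retrievers to enrich the Text-to-SQL context.
Finally, we show how to define a text-to-SQL retriever on its own.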

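NOTE: Any Text-to-SQL application should be aware that executing arbitrary SQL queries can be a security risk. Take precautions as needed, such as using restricted roles, read-only databases, or sandboxing.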
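As one illustration of the "read-only database" precaution, here is a minimal sketch using only Python's standard sqlite3 module. It is not part of the LlamaIndex API; the file name city_stats_demo.db and the temporary directory are assumptions made for the example, and other database systems would typically use roles and grants instead.

```python
import os
import pathlib
import sqlite3
import tempfile

# Hypothetical database file, created in a temp directory for this sketch only.
db_path = os.path.join(tempfile.mkdtemp(), "city_stats_demo.db")

# Create and populate the database with a normal read-write connection.
rw = sqlite3.connect(db_path)
rw.execute(
    "CREATE TABLE city_stats (city_name TEXT PRIMARY KEY, population INTEGER)"
)
rw.execute("INSERT INTO city_stats VALUES ('Tokyo', 13960000)")
rw.commit()
rw.close()

# Re-open the same file in read-only mode through a SQLite URI.
ro_uri = pathlib.Path(db_path).as_uri() + "?mode=ro"
ro = sqlite3.connect(ro_uri, uri=True)

# Reads still work on the read-only connection.
rows = ro.execute("SELECT city_name, population FROM city_stats").fetchall()

# A write, such as a generated query might attempt, is rejected by SQLite.
write_blocked = False
try:
    ro.execute("DELETE FROM city_stats")
except sqlite3.OperationalError:
    write_blocked = True
ro.close()

print(rows, write_blocked)
```

Handing the LLM-driven component only a connection of this kind means that a generated DELETE or DROP fails at the database layer rather than relying on prompt wording.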

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [ ]:

%pip install llama-index-core llama-index-llms-openai llama-index-embeddings-openai

In [ ]:

import os
import openai

In [ ]:

os.environ["OPENAI_API_KEY"] = "sk-.."

In [ ]:

# import logging
# import sys

# logging.basicConfig(stream=sys.stdout, level=logging.INFO)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

In [ ]:

from IPython.display import Markdown, display

Create Database Schema

We use sqlalchemy, a popular SQL database toolkit, to create an empty city_stats table.

In [ ]:

from sqlalchemy import (
    create_engine,
    MetaData,
    Table,
    Column,
    String,
    Integer,
    select,
)

In [ ]:

engine = create_engine("sqlite:///:memory:")
metadata_obj = MetaData()

In [ ]:

# create city SQL table
table_name = "city_stats"
city_stats_table = Table(
    table_name,
    metadata_obj,
    Column("city_name", String(16), primary_key=True),
    Column("population", Integer),
    Column("country", String(16), nullable=False),
)
metadata_obj.create_all(engine)

Define SQL Database

We first define our SQLDatabase abstraction (a light wrapper around SQLAlchemy).

In [ ]:

from llama_index.core import SQLDatabase
from llama_index.llms.openai import OpenAI

In [ ]:

llm = OpenAI(temperature=0.1, model="gpt-4.1-mini")

In [ ]:

sql_database = SQLDatabase(engine, include_tables=["city_stats"])

We add some test data to our SQL database.

In [ ]:

from sqlalchemy import insert

sql_database = SQLDatabase(engine, include_tables=["city_stats"])

rows = [
    {"city_name": "Toronto", "population": 2930000, "country": "Canada"},
    {"city_name": "Tokyo", "population": 13960000, "country": "Japan"},
    {
        "city_name": "Chicago",
        "population": 2679000,
        "country": "United States",
    },
    {
        "city_name": "New York",
        "population": 8258000,
        "country": "United States",
    },
    {"city_name": "Seoul", "population": 9776000, "country": "South Korea"},
    {"city_name": "Busan", "population": 3334000, "country": "South Korea"},
]
for row in rows:
    stmt = insert(city_stats_table).values(**row)
    with engine.begin() as connection:
        cursor = connection.execute(stmt)

In [ ]:

# view current table
stmt = select(
    city_stats_table.c.city_name,
    city_stats_table.c.population,
    city_stats_table.c.country,
).select_from(city_stats_table)

with engine.connect() as connection:
    results = connection.execute(stmt).fetchall()
    print(results)

[('Toronto', 2930000, 'Canada'), ('Tokyo', 13960000, 'Japan'), ('Chicago', 2679000, 'United States'), ('New York', 8258000, 'United States'), ('Seoul', 9776000, 'South Korea'), ('Busan', 3334000, 'South Korea')]

Query Index

We first show how to execute a raw SQL query, which runs directly against the table.

In [ ]:

from sqlalchemy import text

with engine.connect() as con:
    rows = con.execute(text("SELECT city_name from city_stats"))
    for row in rows:
        print(row)

('Busan',)
('Chicago',)
('New York',)
('Seoul',)
('Tokyo',)
('Toronto',)

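Part 1: Text-to-SQL Query Engine

Once we have constructed our SQL database, we can use the NLSQLTableQueryEngine to construct natural language queries that are synthesized into SQL queries.

Note that we need to specify the tables we want to use with this query engine. If we don't, the query engine will pull all the schema context, which could overflow the context window of the LLM.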

In [ ]:

from llama_index.core.query_engine import NLSQLTableQueryEngine

query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database, tables=["city_stats"], llm=llm
)
query_str = "Which city has the highest population?"
response = query_engine.query(query_str)

In [ ]:

display(Markdown(f"<b>{response}</b>"))

Tokyo has the highest population of all the cities, with 13,960,000 people.

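This query engine should be used whenever you can specify the tables you want to query beforehand, or when the total size of all the table schemas plus the rest of the prompt fits within your context window.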
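To get a feel for the "does the schema fit" question, the following sketch estimates the size of the schema text for all tables versus a chosen subset. It uses only the standard sqlite3 module rather than LlamaIndex; the country_stats table, the helper names and the four-characters-per-token heuristic are all assumptions for illustration, and a real tokenizer would give different numbers.

```python
import sqlite3

# Throwaway in-memory database; country_stats is a hypothetical second table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city_stats ("
    "city_name VARCHAR(16) PRIMARY KEY, "
    "population INTEGER, "
    "country VARCHAR(16) NOT NULL)"
)
conn.execute(
    "CREATE TABLE country_stats (country VARCHAR(16) PRIMARY KEY, gdp INTEGER)"
)


def schema_text(connection, tables=None):
    """Concatenate CREATE TABLE statements, optionally limited to `tables`."""
    rows = connection.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return "\n".join(
        sql for name, sql in rows if tables is None or name in tables
    )


def rough_token_estimate(text, chars_per_token=4):
    """Crude heuristic (an assumption): roughly four characters per token."""
    return len(text) // chars_per_token + 1


all_schema = schema_text(conn)
one_schema = schema_text(conn, tables=["city_stats"])
print(rough_token_estimate(all_schema), rough_token_estimate(one_schema))
```

With two small tables the difference is negligible, but the same comparison on a database with hundreds of tables shows why restricting the table list, or retrieving schemas at query time, becomes necessary.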

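Part 2: Query-Time Retrieval of Tables for Text-to-SQL

If we don't know ahead of time which table we would like to use, and the total size of the table schemas overflows the context window, we should store the table schemas in an index so that the right schema can be retrieved at query time.

We can do this using the SQLTableNodeMapping object, which takes in a SQLDatabase and produces a Node object for each SQLTableSchema object passed into the ObjectIndex constructor.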

In [ ]:

from llama_index.core.indices.struct_store.sql_query import (
    SQLTableRetrieverQueryEngine,
)
from llama_index.core.objects import (
    SQLTableNodeMapping,
    ObjectIndex,
    SQLTableSchema,
)
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

# set Logging to DEBUG for more detailed outputs
table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [
    (SQLTableSchema(table_name="city_stats"))
]  # add a SQLTableSchema for each table

obj_index = ObjectIndex.from_objects(
    table_schema_objs,
    table_node_mapping,
    VectorStoreIndex,
    embed_model=OpenAIEmbedding(model="text-embedding-3-small"),
)
query_engine = SQLTableRetrieverQueryEngine(
    sql_database, obj_index.as_retriever(similarity_top_k=1)
)

Now we can take our SQLTableRetrieverQueryEngine and query it for our response.

In [ ]:

response = query_engine.query("Which city has the highest population?")
display(Markdown(f"<b>{response}</b>"))

Tokyo has the highest population of all the cities, with a population of 13,960,000.
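The idea behind query-time table retrieval can be sketched without any LlamaIndex machinery. The toy example below ranks table descriptions against a question using a bag-of-words cosine similarity, which is only a crude stand-in for the embedding model used in this guide. The product_sales and employee_records tables, their descriptions and the helper names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical table descriptions; only city_stats exists in this guide.
table_descriptions = {
    "city_stats": "city name, population and country of each city",
    "product_sales": "product identifier, units sold and revenue per month",
    "employee_records": "employee name, department and salary",
}


def bag_of_words(text):
    """Lowercased word counts, a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_tables(query, top_k=1):
    """Return the top_k table names whose description best matches the query."""
    q = bag_of_words(query)
    ranked = sorted(
        table_descriptions,
        key=lambda name: cosine(q, bag_of_words(table_descriptions[name])),
        reverse=True,
    )
    return ranked[:top_k]


print(retrieve_tables("Which city has the highest population?"))
```

Only the schema of the selected table would then be placed in the prompt, which is what keeps the prompt small when the database has many tables.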

In [ ]:

# you can also fetch the raw result from SQLAlchemy!
response.metadata["result"]

Out[ ]:

[('Tokyo', 13960000)]

You can also add additional context information for each table schema you define.

In [ ]:

# manually set context text
city_stats_text = (
    "This table gives information regarding the population and country of a"
    " given city.\nThe user will query with codewords, where 'foo' corresponds"
    " to population and 'bar' corresponds to city."
)

table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [
    (SQLTableSchema(table_name="city_stats", context_str=city_stats_text))
]

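Part 3: Query-Time Retrieval of Rows and Columns for Text-to-SQL

A challenge arises when asking a question like "How many cities are in the US?". In such cases, the generated query might only look for cities where the country is listed as "US", potentially missing entries labeled "United States". To address this, you can apply query-time row retrieval, query-time column retrieval, or a combination of both.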
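The mismatch is easy to reproduce directly in SQL. The sketch below rebuilds the same six rows with the standard sqlite3 module (not the SQLAlchemy engine used in this guide) and compares the two literal filters; the helper name count_cities is an assumption for the example.

```python
import sqlite3

# Rebuild the toy table from this guide in a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city_stats ("
    "city_name TEXT PRIMARY KEY, population INTEGER, country TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO city_stats VALUES (?, ?, ?)",
    [
        ("Toronto", 2930000, "Canada"),
        ("Tokyo", 13960000, "Japan"),
        ("Chicago", 2679000, "United States"),
        ("New York", 8258000, "United States"),
        ("Seoul", 9776000, "South Korea"),
        ("Busan", 3334000, "South Korea"),
    ],
)


def count_cities(country):
    """Count rows whose country column equals the given literal exactly."""
    return conn.execute(
        "SELECT COUNT(*) FROM city_stats WHERE country = ?", (country,)
    ).fetchone()[0]


naive = count_cities("US")  # what a model might write from the question alone
correct = count_cities("United States")  # the value actually stored
print(naive, correct)  # 0 2
```

The SQL is valid in both cases, so nothing fails loudly; the wrong literal simply yields a count of zero, which is why showing the model real stored values helps.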

Query-Time Rows Retrieval

In query-time row retrieval, we embed the rows of each table, resulting in one index per table.

In [ ]:

from llama_index.core.schema import TextNode

# `stmt` is the SELECT over city_stats defined in the "view current table" cell
with engine.connect() as connection:
    results = connection.execute(stmt).fetchall()

city_nodes = [TextNode(text=str(t)) for t in results]

city_rows_index = VectorStoreIndex(
    city_nodes, embed_model=OpenAIEmbedding(model="text-embedding-3-small")
)
city_rows_retriever = city_rows_index.as_retriever(similarity_top_k=1)

city_rows_retriever.retrieve("US")

Out[ ]:

[NodeWithScore(node=TextNode(id_='8ae10176-afd8-40ee-a97b-b24f66235489', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text="('Chicago', 2679000, 'United States')", mimetype='text/plain', start_char_idx=None, end_char_idx=None, metadata_seperator='\n', text_template='{metadata_str}\n\n{content}'), score=0.7843469586763699)]

The row retriever of each table can then be provided to the SQLTableRetrieverQueryEngine.

In [ ]:

rows_retrievers = {
    "city_stats": city_rows_retriever,
}
query_engine = SQLTableRetrieverQueryEngine(
    sql_database,
    obj_index.as_retriever(similarity_top_k=1),
    rows_retrievers=rows_retrievers,
)

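During querying, the row retrievers are used to identify the rows most semantically similar to the input query. These retrieved rows are then incorporated as context to improve text-to-SQL generation.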

In [ ]:

response = query_engine.query("How many cities are in the US?")

In [ ]:

display(Markdown(f"<b>{response}</b>"))

Based on the data in the city_stats table, there are 2 cities in the US.
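To illustrate what "incorporated as context" can mean in practice, here is a small sketch that formats a table's columns together with retrieved example rows into one context string. The function name build_table_context and the wording of the text are hypothetical; this is not the prompt template LlamaIndex uses internally.

```python
def build_table_context(table_name, columns, retrieved_rows):
    """Format column names plus retrieved example rows into one context string."""
    lines = [f"Table '{table_name}' has columns: {', '.join(columns)}."]
    if retrieved_rows:
        lines.append("Relevant example rows (values in column order):")
        lines.extend(str(row) for row in retrieved_rows)
    return "\n".join(lines)


# The row below is the one returned for the query "US" earlier in this guide.
context = build_table_context(
    "city_stats",
    ["city_name", "population", "country"],
    [("Chicago", 2679000, "United States")],
)
print(context)
```

Seeing the literal value 'United States' in the context gives the model a concrete hint about how the country column is populated, so it is more likely to filter on that value instead of "US".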

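Query-Time Columns Retrieval

Although query-time row retrieval improves text-to-SQL generation, it embeds each row individually, even when many rows contain repeated values (for example, values in categorical data). This can lead to inefficient token usage and unnecessary overhead. In addition, in tables with many columns, the retriever may surface only a subset of the relevant values and miss others that are important for accurate query generation.

To address this, query-time column retrieval can be used. This approach indexes each distinct value in the selected columns, creating a separate index for each selected column of a table.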

In [ ]:

city_cols_retrievers = {}

for column_name in ["city_name", "country"]:
    stmt = select(city_stats_table.c[column_name]).distinct()
    with engine.connect() as connection:
        values = connection.execute(stmt).fetchall()
    nodes = [TextNode(text=t[0]) for t in values]

    column_index = VectorStoreIndex(
        nodes, embed_model=OpenAIEmbedding(model="text-embedding-3-small")
    )
    column_retriever = column_index.as_retriever(similarity_top_k=1)

    city_cols_retrievers[column_name] = column_retriever

The column retrievers of each table can then be provided to the SQLTableRetrieverQueryEngine.

In [ ]:

cols_retrievers = {
    "city_stats": city_cols_retrievers,
}
query_engine = SQLTableRetrieverQueryEngine(
    sql_database,
    obj_index.as_retriever(similarity_top_k=1),
    rows_retrievers=rows_retrievers,
    cols_retrievers=cols_retrievers,
    llm=llm,
)

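During querying, the column retrievers are used to identify the column values most semantically similar to the input query. These retrieved values are then incorporated as context to improve text-to-SQL generation.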

In [ ]:

response = query_engine.query("How many cities are in the US?")

In [ ]:

display(Markdown(f"<b>{response}</b>"))

There are 2 cities in the US.
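The saving from indexing distinct column values rather than whole rows can be seen even on the toy data. The sketch below uses the standard sqlite3 module (an assumption for self-containment, not the SQLAlchemy engine from this guide) to compare the number of rows with the number of distinct country values.

```python
import sqlite3

# Same six rows as in this guide, loaded into a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city_stats ("
    "city_name TEXT PRIMARY KEY, population INTEGER, country TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO city_stats VALUES (?, ?, ?)",
    [
        ("Toronto", 2930000, "Canada"),
        ("Tokyo", 13960000, "Japan"),
        ("Chicago", 2679000, "United States"),
        ("New York", 8258000, "United States"),
        ("Seoul", 9776000, "South Korea"),
        ("Busan", 3334000, "South Korea"),
    ],
)

# Row retrieval would embed one text per row.
total_rows = conn.execute("SELECT COUNT(*) FROM city_stats").fetchone()[0]

# Column retrieval embeds one text per distinct value of the chosen column.
distinct_countries = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT country FROM city_stats ORDER BY country"
    )
]
print(total_rows, distinct_countries)
```

Here six rows collapse to four country values; on a large table with a low-cardinality categorical column the gap is far wider, since the number of distinct values stays small while the row count grows.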

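Part 4: Text-to-SQL Retriever

So far our text-to-SQL capability is packaged in a query engine and consists of both retrieval and synthesis.

You can use the SQL retriever on its own. We show some different parameters you can try, and also show how to plug it into our RetrieverQueryEngine to get roughly the same results.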

In [ ]:

from llama_index.core.retrievers import NLSQLRetriever

# default retrieval (return_raw=True)
nl_sql_retriever = NLSQLRetriever(
    sql_database, tables=["city_stats"], llm=llm, return_raw=True
)

In [ ]:

results = nl_sql_retriever.retrieve(
    "Return the top 5 cities (along with their populations) with the highest population."
)

In [ ]:

from llama_index.core.response.notebook_utils import display_source_node

for n in results:
    display_source_node(n)

Node ID: f640a54f-7413-4dc0-9135-cd63c7ca8f45
Similarity: None
Text: [('Tokyo', 13960000), ('Seoul', 9776000), ('New York', 8258000), ('Busan', 3334000), ('Toronto', ...

In [ ]:

# structured retrieval (return_raw=False): one node per result row
nl_sql_retriever = NLSQLRetriever(
    sql_database, tables=["city_stats"], return_raw=False
)

In [ ]:

results = nl_sql_retriever.retrieve(
    "Return the top 5 cities (along with their populations) with the highest population."
)

In [ ]:

# NOTE: all the content is in the metadata
for n in results:
    display_source_node(n, show_source_metadata=True)

Node ID: 05c61a90-598e-4c29-a6b4-b27f2579819e
Similarity: None
Text:
Metadata: {'city_name': 'Tokyo', 'population': 13960000}

Node ID: c7f5fc4c-9754-4946-92c6-54a0d2b40fd9
Similarity: None
Text:
Metadata: {'city_name': 'Seoul', 'population': 9776000}

Node ID: 3a00e201-f3b5-430e-af0e-aa4c34a71131
Similarity: None
Text:
Metadata: {'city_name': 'New York', 'population': 8258000}

Node ID: ee911f7f-8aae-4bad-a52d-c0bdfab63942
Similarity: None
Text:
Metadata: {'city_name': 'Busan', 'population': 3334000}

Node ID: dca6b482-52e4-41e0-992f-a58109e6f3f6
Similarity: None
Text:
Metadata: {'city_name': 'Toronto', 'population': 2930000}
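The two output shapes shown above, one text blob versus one metadata record per row, can be mimicked with plain Python. The sketch below is an illustration of the shapes only, built on the standard sqlite3 module with a reduced three-row table; it does not reproduce how NLSQLRetriever builds its nodes.

```python
import sqlite3

# Small throwaway table; only three of the guide's cities, for brevity.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city_stats (city_name TEXT PRIMARY KEY, population INTEGER)"
)
conn.executemany(
    "INSERT INTO city_stats VALUES (?, ?)",
    [("Tokyo", 13960000), ("Seoul", 9776000), ("New York", 8258000)],
)

cursor = conn.execute(
    "SELECT city_name, population FROM city_stats ORDER BY population DESC"
)
col_keys = [description[0] for description in cursor.description]
rows = cursor.fetchall()

# Shape 1 (similar to the return_raw=True output): one string for the whole result.
raw_text = str(rows)

# Shape 2 (similar to the return_raw=False output): one dict per result row.
row_dicts = [dict(zip(col_keys, row)) for row in rows]

print(raw_text)
print(row_dicts)
```

The raw form is compact and convenient to hand to a synthesizer as text, while the per-row form keeps column names attached to values, which is easier to post-process programmatically.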

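Plugging into our `RetrieverQueryEngine`

We compose our SQL retriever with our standard RetrieverQueryEngine to synthesize a response. The result is roughly similar to our packaged Text-to-SQL query engines.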

In [ ]:

from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(nl_sql_retriever, llm=llm)

In [ ]:

response = query_engine.query(
    "Return the top 5 cities (along with their populations) with the highest population."
)

In [ ]:

print(str(response))

The top 5 cities with the highest populations are:

1. Tokyo - 13,960,000
2. Seoul - 9,776,000
3. New York - 8,258,000
4. Busan - 3,334,000
5. Toronto - 2,930,000
