SQLite 示例#

本示例展示了如何将LLM连接到数据库上回答问题,您可以在LangChain示例中找到原始笔记本。此示例将向您展示如何通过GPTCache设置LLM,从而实现对LLM数据的缓存。您也可以在Google Colab上尝试此示例。

进入GPTCache#

请先安装gptcache,然后我们可以初始化缓存。有两种初始化缓存的方式,第一种是使用map缓存(精确匹配缓存),第二种是使用数据库缓存(相似搜索缓存),更推荐使用第二种方式,但您需要安装相关依赖。

在运行示例之前,请确保通过执行echo $OPENAI_API_KEY来设置OPENAI_API_KEY环境变量。如果尚未设置,可以在Unix/Linux/MacOS系统上使用export OPENAI_API_KEY=YOUR_API_KEY或在Windows系统上使用set OPENAI_API_KEY=YOUR_API_KEY进行设置。此外,缓存设置中有一个get_content_func函数:

# get the content(only question) form the prompt to cache
def get_content_func(data, **_):
    return data.get("prompt").split("Question")[-1]

1. 初始化精确匹配缓存#

# from gptcache import cache
# cache.init(pre_embedding_func=get_content_func)
# cache.set_openai_key()

2. 初始化相似匹配缓存#

初始化gptcache时,需要配置以下四个参数:

  • pre_embedding_func: 在提取特征向量之前进行预处理,它将使用get_content_func方法

  • embedding_func: 提取文本特征向量的方法

  • data_manager: 用于缓存管理的DataManager

  • similarity_evaluation: 缓存命中后的评估方法

data_manager 用于处理音频特征向量和示例中的响应文本,它使用 Milvus(请确保已启动),您也可以配置其他向量存储,参考 VectorBase API

from gptcache import cache
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation


onnx = Onnx()
cache_base = CacheBase('sqlite')
vector_base = VectorBase('milvus', host='127.0.0.1', port='19530', dimension=onnx.dimension)
data_manager = get_data_manager(cache_base, vector_base)
cache.init(
    pre_embedding_func=get_content_func,
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    )
cache.set_openai_key()

初始化缓存后,您可以使用LangChain LLMs与gptcache.adapter.langchain_models。此时gptcache将缓存答案,与原始示例的唯一区别是将llm = OpenAI(temperature=0)改为llm = LangChainLLMs(llm=OpenAI(temperature=0)),这将在代码块中进行注释。您还可以设置session来配置会话设置,通过llm = LangChainLLMs(llm=OpenAI(temperature=0), session=session),更多详情请参考session example

然后你会发现,在搜索相似内容时速度会更快,让我们来试试看。

快速开始#

这里使用的是示例Chinook数据库。 要设置它,请按照https://database.guide/2-sample-databases-sqlite/上的说明操作,将.db文件放在此仓库根目录下的notebooks文件夹中。

from langchain import OpenAI, SQLDatabase, SQLDatabaseChain

from gptcache.adapter.langchain_models import LangChainLLMs
from gptcache.session import Session
session = Session(name="sqlite-example") # set session for LangChainLLMs
db = SQLDatabase.from_uri("sqlite:///./Chinook.db")
# llm = OpenAI(temperature=0) # using the following code to cache with gptcache
llm = LangChainLLMs(llm=OpenAI(temperature=0), session=session)

注意:对于数据敏感的项目,您可以在SQLDatabaseChain初始化时指定return_direct=True,以直接返回SQL查询的输出而不进行任何额外格式化。这样可以防止LLM看到数据库中的任何内容。但请注意,默认情况下LLM仍然可以访问数据库模式(即方言、表和键名)。

db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True)
db_chain.run("How many employees are there?")
> Entering new SQLDatabaseChain chain...
How many employees are there? 
SQLQuery:
/Users/chenshiyu/miniconda3/envs/shiyu/lib/python3.9/site-packages/langchain/sql_database.py:191: SAWarning: Dialect sqlite+pysqlite does *not* support Decimal objects natively, and SQLAlchemy must convert from floating point - rounding errors and other issues may occur. Please consider storing Decimal numbers as strings or integers on this platform for lossless storage.
  sample_rows = connection.execute(command)
 SELECT COUNT(*) FROM Employee;
SQLResult: [(8,)]
Answer: There are 8 employees.
> Finished chain.
' There are 8 employees.'

自定义提示#

您也可以自定义使用的提示语。以下是一个示例,提示它理解foobar与Employee表是相同的

from langchain.prompts.prompt import PromptTemplate

_DEFAULT_TEMPLATE = """Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Use the following format:

Question: "Question here"
SQLQuery: "SQL Query to run"
SQLResult: "Result of the SQLQuery"
Answer: "Final answer here"

Only use the following tables:

{table_info}

If someone asks for the table foobar, they really mean the employee table.

Question: {input}"""
PROMPT = PromptTemplate(
    input_variables=["input", "table_info", "dialect"], template=_DEFAULT_TEMPLATE
)
db_chain = SQLDatabaseChain(llm=llm, database=db, prompt=PROMPT, verbose=True)
db_chain.run("How many employees are there in the foobar table?")
> Entering new SQLDatabaseChain chain...
How many employees are there in the foobar table? 
SQLQuery: SELECT COUNT(*) FROM Employee;
SQLResult: [(8,)]
Answer: There are 8 employees in the foobar table.
> Finished chain.
' There are 8 employees in the foobar table.'

返回中间步骤#

你也可以返回SQLDatabaseChain的中间步骤。这使你可以访问生成的SQL语句,以及该语句在SQL数据库上运行的结果。

db_chain = SQLDatabaseChain(llm=llm, database=db, prompt=PROMPT, verbose=True, return_intermediate_steps=True)
result = db_chain("How many employees are there in the foobar table?")
result["intermediate_steps"]
> Entering new SQLDatabaseChain chain...
How many employees are there in the foobar table? 
SQLQuery: SELECT COUNT(*) FROM Employee;
SQLResult: [(8,)]
Answer: There are 8 employees in the foobar table.
> Finished chain.
[' SELECT COUNT(*) FROM Employee;', '[(8,)]']

选择如何限制返回的行数#

如果您正在查询表中的多行数据,可以通过使用'top_k'参数(默认值为10)来选择您希望获取的最大结果数量。这有助于避免查询结果超出提示最大长度或不必要地消耗令牌。

db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True, top_k=3)
db_chain.run("What are some example tracks by composer Johann Sebastian Bach?")
> Entering new SQLDatabaseChain chain...
What are some example tracks by composer Johann Sebastian Bach? 
SQLQuery: SELECT Name, Composer FROM Track WHERE Composer LIKE '%Johann Sebastian Bach%' LIMIT 3;
SQLResult: [('Concerto for 2 Violins in D Minor, BWV 1043: I. Vivace', 'Johann Sebastian Bach'), ('Aria Mit 30 Veränderungen, BWV 988 "Goldberg Variations": Aria', 'Johann Sebastian Bach'), ('Suite for Solo Cello No. 1 in G Major, BWV 1007: I. Prélude', 'Johann Sebastian Bach')]
Answer: Some example tracks by composer Johann Sebastian Bach are 'Concerto for 2 Violins in D Minor, BWV 1043: I. Vivace', 'Aria Mit 30 Veränderungen, BWV 988 "Goldberg Variations": Aria', and 'Suite for Solo Cello No. 1 in G Major, BWV 1007: I. Prélude'.
> Finished chain.
' Some example tracks by composer Johann Sebastian Bach are \'Concerto for 2 Violins in D Minor, BWV 1043: I. Vivace\', \'Aria Mit 30 Veränderungen, BWV 988 "Goldberg Variations": Aria\', and \'Suite for Solo Cello No. 1 in G Major, BWV 1007: I. Prélude\'.'

从每个表中添加示例行#

有时,数据的格式并不直观,最佳做法是在提示中包含表中部分行的示例,以便让LLM在提供最终查询之前理解数据。这里我们将使用此功能,通过提供Track表中的两行数据,让LLM知道艺术家是以全名保存的。

db = SQLDatabase.from_uri(
    "sqlite:///./Chinook.db",
    include_tables=['Track'], # we include only one table to save tokens in the prompt :)
    sample_rows_in_table_info=2)

样本行会在每个对应表的列信息之后添加到提示中:

print(db.table_info)
CREATE TABLE "Track" (
	"TrackId" INTEGER NOT NULL, 
	"Name" NVARCHAR(200) NOT NULL, 
	"AlbumId" INTEGER, 
	"MediaTypeId" INTEGER NOT NULL, 
	"GenreId" INTEGER, 
	"Composer" NVARCHAR(220), 
	"Milliseconds" INTEGER NOT NULL, 
	"Bytes" INTEGER, 
	"UnitPrice" NUMERIC(10, 2) NOT NULL, 
	PRIMARY KEY ("TrackId"), 
	FOREIGN KEY("MediaTypeId") REFERENCES "MediaType" ("MediaTypeId"), 
	FOREIGN KEY("GenreId") REFERENCES "Genre" ("GenreId"), 
	FOREIGN KEY("AlbumId") REFERENCES "Album" ("AlbumId")
)

/*
2 rows from Track table:
TrackId	Name	AlbumId	MediaTypeId	GenreId	Composer	Milliseconds	Bytes	UnitPrice
1	For Those About To Rock (We Salute You)	1	1	1	Angus Young, Malcolm Young, Brian Johnson	343719	11170334	0.99
2	Balls to the Wall	2	2	1	None	342562	5510424	0.99
*/
db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True)
db_chain.run("What are some example tracks by Bach?")
> Entering new SQLDatabaseChain chain...
What are some example tracks by Bach? 
SQLQuery: SELECT Name, Composer FROM Track WHERE Composer LIKE '%Johann Sebastian Bach%' LIMIT 3;
SQLResult: [('Concerto for 2 Violins in D Minor, BWV 1043: I. Vivace', 'Johann Sebastian Bach'), ('Aria Mit 30 Veränderungen, BWV 988 "Goldberg Variations": Aria', 'Johann Sebastian Bach'), ('Suite for Solo Cello No. 1 in G Major, BWV 1007: I. Prélude', 'Johann Sebastian Bach')]
Answer: Examples of tracks by Johann Sebastian Bach include Concerto for 2 Violins in D Minor, BWV 1043: I. Vivace, Aria Mit 30 Veränderungen, BWV 988 "Goldberg Variations": Aria, and Suite for Solo Cello No. 1 in G Major, BWV 1007: I. Prélude.
> Finished chain.
' Examples of tracks by Johann Sebastian Bach include Concerto for 2 Violins in D Minor, BWV 1043: I. Vivace, Aria Mit 30 Veränderungen, BWV 988 "Goldberg Variations": Aria, and Suite for Solo Cello No. 1 in G Major, BWV 1007: I. Prélude.'

自定义表信息#

在某些情况下,提供自定义表信息而非使用自动生成的表定义和前sample_rows_in_table_info行样本数据会很有帮助。例如,如果您知道表的前几行信息量不足,手动提供更具多样性或信息更丰富的示例行可能对模型更有益。如果存在不必要的列,也可以限制模型可见的列。

这些信息可以以字典形式提供,其中表名作为键,表信息作为值。例如,我们为Track表提供一个仅包含少数列的自定义定义和示例行:

custom_table_info = {
    "Track": """CREATE TABLE Track (
	"TrackId" INTEGER NOT NULL, 
	"Name" NVARCHAR(200) NOT NULL,
	"Composer" NVARCHAR(220),
	PRIMARY KEY ("TrackId")
)
/*
3 rows from Track table:
TrackId	Name	Composer
1	For Those About To Rock (We Salute You)	Angus Young, Malcolm Young, Brian Johnson
2	Balls to the Wall	None
3	My favorite song ever	The coolest composer of all time
*/"""
}
db = SQLDatabase.from_uri(
    "sqlite:///./Chinook.db",
    include_tables=['Track', 'Playlist'],
    sample_rows_in_table_info=2,
    custom_table_info=custom_table_info)

print(db.table_info)
CREATE TABLE "Playlist" (
	"PlaylistId" INTEGER NOT NULL, 
	"Name" NVARCHAR(120), 
	PRIMARY KEY ("PlaylistId")
)

/*
2 rows from Playlist table:
PlaylistId	Name
1	Music
2	Movies
*/

CREATE TABLE Track (
	"TrackId" INTEGER NOT NULL, 
	"Name" NVARCHAR(200) NOT NULL,
	"Composer" NVARCHAR(220),
	PRIMARY KEY ("TrackId")
)
/*
3 rows from Track table:
TrackId	Name	Composer
1	For Those About To Rock (We Salute You)	Angus Young, Malcolm Young, Brian Johnson
2	Balls to the Wall	None
3	My favorite song ever	The coolest composer of all time
*/

请注意我们如何通过自定义的Track表定义和示例行覆盖了sample_rows_in_table_info参数。未被custom_table_info覆盖的表(本例中的Playlist)将像往常一样自动收集其表信息。

db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True)
db_chain.run("What are some example tracks by Bach?")
> Entering new SQLDatabaseChain chain...
What are some example tracks by Bach? 
SQLQuery: SELECT Name, Composer FROM Track WHERE Composer LIKE '%Johann Sebastian Bach%' LIMIT 3;
SQLResult: [('Concerto for 2 Violins in D Minor, BWV 1043: I. Vivace', 'Johann Sebastian Bach'), ('Aria Mit 30 Veränderungen, BWV 988 "Goldberg Variations": Aria', 'Johann Sebastian Bach'), ('Suite for Solo Cello No. 1 in G Major, BWV 1007: I. Prélude', 'Johann Sebastian Bach')]
Answer: Some example tracks by Bach are Concerto for 2 Violins in D Minor, BWV 1043: I. Vivace, Aria Mit 30 Veränderungen, BWV 988 "Goldberg Variations": Aria, and Suite for Solo Cello No. 1 in G Major, BWV 1007: I. Prélude.
> Finished chain.
' Some example tracks by Bach are Concerto for 2 Violins in D Minor, BWV 1043: I. Vivace, Aria Mit 30 Veränderungen, BWV 988 "Goldberg Variations": Aria, and Suite for Solo Cello No. 1 in G Major, BWV 1007: I. Prélude.'