使用Vanna与本地上下文数据库¶
通常来说,使用我们托管的服务开始使用Vanna会容易得多。然而,如果你想在本地运行Vanna,你可以通过使用ChromaDB运行一个本地上下文数据库来实现。本笔记本将引导你完成这些步骤。
In [ ]:
%pip install 'vanna[chromadb,snowflake,openai]'
In [1]:
from vanna.local import LocalContext_OpenAI
配置OpenAI API密钥¶
如果您使用的是Azure OpenAI,除了API密钥外,还有一些额外的参数需要指定
In [2]:
# Use the regular OpenAI API
vn = LocalContext_OpenAI({"api_key": "sk-..."})
# Use the Azure OpenAI API
vn = LocalContext_OpenAI(config={
"api_type": "azure",
"api_base": "https://...",
"api_version": "2023-05-15",
"engine": "YOUR_ENGINE_HERE",
"api_key": "sk-..."
})
使用示例SQL查询进行训练¶
入门的最简单方法是直接粘贴一个SQL查询,并让Vanna对其进行训练。这将创建一个新的上下文数据库并在其上训练模型。然后,您可以使用该模型生成SQL查询。
In [3]:
vn.train(sql="""
SELECT c.c_name as customer_name,
sum(l.l_extendedprice * (1 - l.l_discount)) as total_sales
FROM snowflake_sample_data.tpch_sf1.lineitem l join snowflake_sample_data.tpch_sf1.orders o
ON l.l_orderkey = o.o_orderkey join snowflake_sample_data.tpch_sf1.customer c
ON o.o_custkey = c.c_custkey
GROUP BY customer_name
ORDER BY total_sales desc limit 5;
""")
Using model gpt-3.5-turbo for 147.5 tokens (approx) Question generated with sql: What are the top 5 customers in terms of total sales? Adding SQL...
连接到数据库¶
这是将用于运行生成的SQL查询的数据库。这里我们连接到Snowflake。
In [ ]:
vn.connect_to_snowflake(account='my-account', username='my-username', password='my-password', database='my-database')
开始提问¶
这里有一个自动反馈循环,所以当你提出一个问题并生成可以执行的SQL时,它会自动添加到上下文数据库中,并用于训练模型。
In [5]:
vn.ask("What are the top 10 customers by sales?")
Number of requested results 10 is greater than number of elements in index 1, updating n_results = 1
Using model gpt-3.5-turbo for 167.0 tokens (approx)
SELECT c.c_name as customer_name,
sum(l.l_extendedprice * (1 - l.l_discount)) as total_sales
FROM snowflake_sample_data.tpch_sf1.lineitem l join snowflake_sample_data.tpch_sf1.orders o
ON l.l_orderkey = o.o_orderkey join snowflake_sample_data.tpch_sf1.customer c
ON o.o_custkey = c.c_custkey
GROUP BY customer_name
ORDER BY total_sales desc limit 10;
| 客户名称 | 总销售额 | |
|---|---|---|
| 0 | 客户#000143500 | 6757566.0218 |
| 1 | 客户#000095257 | 6294115.3340 |
| 2 | 客户#000087115 | 6184649.5176 |
| 3 | 客户#000131113 | 6080943.8305 |
| 4 | 客户#000134380 | 6075141.9635 |
| 5 | 客户#000103834 | 6059770.3232 |
| 6 | 客户#000069682 | 6057779.0348 |
| 7 | Customer#000102022 | 6039653.6335 |
| 8 | 客户#000098587 | 6027021.5855 |
| 9 | Customer#000064660 | 5905659.6159 |
Using model gpt-3.5-turbo for 259.25 tokens (approx) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
培训计划¶
训练计划是从数据库中提取元数据以用于训练的好方法。您可以指定要过滤的数据库和模式,它将尝试从查询历史记录中检索过去的查询,并从信息模式中检索表和列名。
In [6]:
training_plan = vn.get_training_plan_snowflake(filter_databases=['SNOWFLAKE_SAMPLE_DATA'], filter_schemas=['TPCH_SF1'])
training_plan
Trying query history Using model gpt-3.5-turbo for 147.25 tokens (approx) Using model gpt-3.5-turbo for 168.25 tokens (approx) Using model gpt-3.5-turbo for 147.0 tokens (approx) Using model gpt-3.5-turbo for 147.25 tokens (approx) Trying INFORMATION_SCHEMA.DATABASES Trying INFORMATION_SCHEMA.COLUMNS for SNOWFLAKE_SAMPLE_DATA
Out[6]:
Train on SQL: What are the top 10 customers based on their total sales? Train on SQL: What are the top 10 countries with the highest total sales? Train on SQL: What are the top 5 customers based on their total sales? Train on SQL: What are the top 15 customers based on their total sales? Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 CUSTOMER Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 SUPPLIER Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 LINEITEM Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 PARTSUPP Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 PART Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 ORDERS Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 REGION Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 NATION
In [7]:
vn.train(plan=training_plan)
提出更多问题!¶
现在你已经有了一个完全训练好的模型,你可以继续提问,它会继续学习和改进。
In [ ]:
vn.ask()