使用Vanna与本地上下文数据库¶

通常来说，使用我们托管的服务开始使用Vanna会容易得多。然而，如果你想在本地运行Vanna，你可以通过使用ChromaDB运行一个本地上下文数据库来实现。本笔记本将引导你完成这些步骤。

In [ ]:

%pip install 'vanna[chromadb,snowflake,openai]'

In [1]:

from vanna.local import LocalContext_OpenAI

配置OpenAI API密钥¶

如果您使用的是Azure OpenAI，除了API密钥外，还有一些额外的参数需要指定

In [2]:

# Use the regular OpenAI API
vn = LocalContext_OpenAI({"api_key": "sk-..."})

# Use the Azure OpenAI API
vn = LocalContext_OpenAI(config={
    "api_type": "azure",
    "api_base": "https://...",
    "api_version": "2023-05-15",
    "engine": "YOUR_ENGINE_HERE",
    "api_key": "sk-..."
})

使用示例SQL查询进行训练¶

入门的最简单方法是直接粘贴一个SQL查询，并让Vanna对其进行训练。这将创建一个新的上下文数据库并在其上训练模型。然后，您可以使用该模型生成SQL查询。

In [3]:

vn.train(sql="""
SELECT c.c_name as customer_name,
       sum(l.l_extendedprice * (1 - l.l_discount)) as total_sales
FROM   snowflake_sample_data.tpch_sf1.lineitem l join snowflake_sample_data.tpch_sf1.orders o
        ON l.l_orderkey = o.o_orderkey join snowflake_sample_data.tpch_sf1.customer c
        ON o.o_custkey = c.c_custkey
GROUP BY customer_name
ORDER BY total_sales desc limit 5;
""")

Using model gpt-3.5-turbo for 147.5 tokens (approx)
Question generated with sql: What are the top 5 customers in terms of total sales? 
Adding SQL...

连接到数据库¶

这是将用于运行生成的SQL查询的数据库。这里我们连接到Snowflake。

In [ ]:

vn.connect_to_snowflake(account='my-account', username='my-username', password='my-password', database='my-database')

开始提问¶

这里有一个自动反馈循环，所以当你提出一个问题并生成可以执行的SQL时，它会自动添加到上下文数据库中，并用于训练模型。

In [5]:

vn.ask("What are the top 10 customers by sales?")

Number of requested results 10 is greater than number of elements in index 1, updating n_results = 1

Using model gpt-3.5-turbo for 167.0 tokens (approx)
SELECT c.c_name as customer_name,
       sum(l.l_extendedprice * (1 - l.l_discount)) as total_sales
FROM   snowflake_sample_data.tpch_sf1.lineitem l join snowflake_sample_data.tpch_sf1.orders o
        ON l.l_orderkey = o.o_orderkey join snowflake_sample_data.tpch_sf1.customer c
        ON o.o_custkey = c.c_custkey
GROUP BY customer_name
ORDER BY total_sales desc limit 10;

	客户名称	总销售额
0	客户#000143500	6757566.0218
1	客户#000095257	6294115.3340
2	客户#000087115	6184649.5176
3	客户#000131113	6080943.8305
4	客户#000134380	6075141.9635
5	客户#000103834	6059770.3232
6	客户#000069682	6057779.0348
7	Customer#000102022	6039653.6335
8	客户#000098587	6027021.5855
9	Customer#000064660	5905659.6159

Using model gpt-3.5-turbo for 259.25 tokens (approx)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

No description has been provided for this image

培训计划¶

训练计划是从数据库中提取元数据以用于训练的好方法。您可以指定要过滤的数据库和模式，它将尝试从查询历史记录中检索过去的查询，并从信息模式中检索表和列名。

In [6]:

training_plan = vn.get_training_plan_snowflake(filter_databases=['SNOWFLAKE_SAMPLE_DATA'], filter_schemas=['TPCH_SF1'])
training_plan

Trying query history
Using model gpt-3.5-turbo for 147.25 tokens (approx)
Using model gpt-3.5-turbo for 168.25 tokens (approx)
Using model gpt-3.5-turbo for 147.0 tokens (approx)
Using model gpt-3.5-turbo for 147.25 tokens (approx)
Trying INFORMATION_SCHEMA.DATABASES
Trying INFORMATION_SCHEMA.COLUMNS for SNOWFLAKE_SAMPLE_DATA

Out[6]:

Train on SQL:  What are the top 10 customers based on their total sales?
Train on SQL:  What are the top 10 countries with the highest total sales?
Train on SQL:  What are the top 5 customers based on their total sales?
Train on SQL:  What are the top 15 customers based on their total sales?
Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 CUSTOMER
Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 SUPPLIER
Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 LINEITEM
Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 PARTSUPP
Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 PART
Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 ORDERS
Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 REGION
Train on Information Schema: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 NATION

In [7]:

vn.train(plan=training_plan)

提出更多问题！¶

现在你已经有了一个完全训练好的模型，你可以继续提问，它会继续学习和改进。

In [ ]:

vn.ask()