基于Pandas数据框的查询管道¶

这是一个简单的示例，构建了一个查询管道，能够对Pandas DataFrame执行结构化操作以满足用户查询，利用LLM来推断操作集。

这可以视为我们PandasQueryEngine的"从零开始"版本。

警告：此工具为LLM提供了访问eval函数的能力。在运行此工具的机器上可能执行任意代码。不建议在生产环境中使用此工具，若必须使用则需要严格沙箱隔离或虚拟机保护。

In [ ]:

Copied!

%pip install llama-index-llms-openai llama-index-experimental
%pip install llama-index-llms-openai llama-index-experimental

In [ ]:

Copied!





from llama_index.core.query_pipeline import (
    QueryPipeline as QP,
    Link,
    InputComponent,
)
from llama_index.experimental.query_engine.pandas import (
    PandasInstructionParser,
)
from llama_index.llms.openai import OpenAI
from llama_index.core import PromptTemplate
from llama_index.core.query_pipeline import (
    QueryPipeline as QP,
    Link,
    InputComponent,
)
from llama_index.experimental.query_engine.pandas import (
    PandasInstructionParser,
)
from llama_index.llms.openai import OpenAI
from llama_index.core import PromptTemplate

下载数据¶

这里我们加载泰坦尼克号CSV数据集。

In [ ]:

Copied!

!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv' -O 'titanic_train.csv'
!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv' -O 'titanic_train.csv'

--2024-01-13 18:39:07--  https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/docs/examples/data/csv/titanic_train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8001::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57726 (56K) [text/plain]
Saving to: ‘titanic_train.csv’

titanic_train.csv   100%[===================>]  56.37K  --.-KB/s    in 0.007s  

2024-01-13 18:39:07 (7.93 MB/s) - ‘titanic_train.csv’ saved [57726/57726]

In [ ]:

Copied!

import pandas as pd

df = pd.read_csv("./titanic_train.csv")
import pandas as pd

df = pd.read_csv("./titanic_train.csv")

定义模块¶

这里我们定义模块集合：

Pandas提示功能，用于从用户查询中推断pandas指令
Pandas输出解析器，用于在数据框上执行pandas指令并返回数据框
响应合成提示，用于根据数据框合成最终响应
LLM

pandas输出解析器专门设计用于安全执行Python代码。它包含了许多安全检查功能，这些功能如果从头开始编写可能会很繁琐。这包括仅允许从一组已批准的模块导入（例如不允许使用可能修改文件系统的模块如os），同时确保不会调用任何私有/双下划线方法。

In [ ]:

Copied!





instruction_str = (
    "1. Convert the query to executable Python code using Pandas.\n"
    "2. The final line of code should be a Python expression that can be called with the `eval()` function.\n"
    "3. The code should represent a solution to the query.\n"
    "4. PRINT ONLY THE EXPRESSION.\n"
    "5. Do not quote the expression.\n"
)

pandas_prompt_str = (
    "You are working with a pandas dataframe in Python.\n"
    "The name of the dataframe is `df`.\n"
    "This is the result of `print(df.head())`:\n"
    "{df_str}\n\n"
    "Follow these instructions:\n"
    "{instruction_str}\n"
    "Query: {query_str}\n\n"
    "Expression:"
)
response_synthesis_prompt_str = (
    "Given an input question, synthesize a response from the query results.\n"
    "Query: {query_str}\n\n"
    "Pandas Instructions (optional):\n{pandas_instructions}\n\n"
    "Pandas Output: {pandas_output}\n\n"
    "Response: "
)

pandas_prompt = PromptTemplate(pandas_prompt_str).partial_format(
    instruction_str=instruction_str, df_str=df.head(5)
)
pandas_output_parser = PandasInstructionParser(df)
response_synthesis_prompt = PromptTemplate(response_synthesis_prompt_str)
llm = OpenAI(model="gpt-3.5-turbo")
instruction_str = (
    "1. 使用Pandas将查询转换为可执行的Python代码\n"
    "2. 代码的最后一行应该是一个可以用`eval()`函数调用的Python表达式\n"
    "3. 代码应该代表查询的解决方案\n"
    "4. 仅打印表达式\n"
    "5. 不要对表达式加引号\n"
)

pandas_prompt_str = (
    "您正在使用Python中的pandas数据框\n"
    "数据框的名称为`df`\n"
    "这是`print(df.head())`的结果:\n"
    "{df_str}\n\n"
    "请遵循以下指令:\n"
    "{instruction_str}\n"
    "查询: {query_str}\n\n"
    "表达式:"
)
response_synthesis_prompt_str = (
    "根据输入问题，从查询结果中综合出响应\n"
    "查询: {query_str}\n\n"
    "Pandas指令(可选):\n{pandas_instructions}\n\n"
    "Pandas输出: {pandas_output}\n\n"
    "响应: "
)

pandas_prompt = PromptTemplate(pandas_prompt_str).partial_format(
    instruction_str=instruction_str, df_str=df.head(5)
)
pandas_output_parser = PandasInstructionParser(df)
response_synthesis_prompt = PromptTemplate(response_synthesis_prompt_str)
llm = OpenAI(model="gpt-3.5-turbo")

构建查询管道¶

看起来是这样的：输入 query_str -> pandas_prompt -> llm1 -> pandas_output_parser -> response_synthesis_prompt -> llm2

与response_synthesis_prompt的额外连接：llm1 -> pandas_instructions，以及pandas_output_parser -> pandas_output。

In [ ]:

Copied!





qp = QP(
    modules={
        "input": InputComponent(),
        "pandas_prompt": pandas_prompt,
        "llm1": llm,
        "pandas_output_parser": pandas_output_parser,
        "response_synthesis_prompt": response_synthesis_prompt,
        "llm2": llm,
    },
    verbose=True,
)
qp.add_chain(["input", "pandas_prompt", "llm1", "pandas_output_parser"])
qp.add_links(
    [
        Link("input", "response_synthesis_prompt", dest_key="query_str"),
        Link(
            "llm1", "response_synthesis_prompt", dest_key="pandas_instructions"
        ),
        Link(
            "pandas_output_parser",
            "response_synthesis_prompt",
            dest_key="pandas_output",
        ),
    ]
)
# add link from response synthesis prompt to llm2
qp.add_link("response_synthesis_prompt", "llm2")
qp = QP(
    modules={
        "input": InputComponent(),
        "pandas_prompt": pandas_prompt,
        "llm1": llm,
        "pandas_output_parser": pandas_output_parser,
        "response_synthesis_prompt": response_synthesis_prompt,
        "llm2": llm,
    },
    verbose=True,
)
qp.add_chain(["input", "pandas_prompt", "llm1", "pandas_output_parser"])
qp.add_links(
    [
        Link("input", "response_synthesis_prompt", dest_key="query_str"),
        Link(
            "llm1", "response_synthesis_prompt", dest_key="pandas_instructions"
        ),
        Link(
            "pandas_output_parser",
            "response_synthesis_prompt",
            dest_key="pandas_output",
        ),
    ]
)
# 添加从响应合成提示到llm2的链接
qp.add_link("response_synthesis_prompt", "llm2")

运行查询¶

In [ ]:

Copied!

response = qp.run(
    query_str="What is the correlation between survival and age?",
)
response = qp.run(
    query_str="生存与年龄之间的相关性是什么？",
)

> Running module input with input: 
query_str: What is the correlation between survival and age?

> Running module pandas_prompt with input: 
query_str: What is the correlation between survival and age?

> Running module llm1 with input: 
messages: You are working with a pandas dataframe in Python.
The name of the dataframe is `df`.
This is the result of `print(df.head())`:
   survived  pclass                                               name  ...

> Running module pandas_output_parser with input: 
input: assistant: df['survived'].corr(df['age'])

> Running module response_synthesis_prompt with input: 
query_str: What is the correlation between survival and age?
pandas_instructions: assistant: df['survived'].corr(df['age'])
pandas_output: -0.07722109457217755

> Running module llm2 with input: 
messages: Given an input question, synthesize a response from the query results.
Query: What is the correlation between survival and age?

Pandas Instructions (optional):
df['survived'].corr(df['age'])

Pandas ...

In [ ]:

Copied!

print(response.message.content)
print(response.message.content)

The correlation between survival and age is -0.0772. This indicates a weak negative correlation, suggesting that as age increases, the likelihood of survival slightly decreases.