March 31, 2025

Evaluating AI Agents with Langfuse

In this tutorial, we will learn how to monitor the internal steps (traces) of the OpenAI Agents SDK and evaluate its performance using Langfuse.

This guide covers the online and offline evaluation metrics that teams use to bring agents to production quickly and reliably. To learn more about evaluation strategies, check out this blog post.

Why AI agent evaluation matters:

  • Debugging issues when tasks fail or produce suboptimal results
  • Monitoring costs and performance in real time
  • Improving reliability and safety through continuous feedback

Step 0: Install the Required Libraries

Below, we install the openai-agents library (the OpenAI Agents SDK), the pydantic-ai[logfire] OpenTelemetry instrumentation, langfuse, and Hugging Face's datasets library.

%pip install openai-agents
%pip install nest_asyncio
%pip install pydantic-ai[logfire]
%pip install langfuse
%pip install datasets

Step 1: Instrument Your Agent

Next, we set the Langfuse API keys and configure the OpenTelemetry exporter so that traces are sent to Langfuse's OTLP endpoint, and set the OpenAI API key.

import os
import base64

# Get keys for your project from the project settings page: https://cloud.langfuse.com
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com" # 🇪🇺 EU region
# os.environ["LANGFUSE_HOST"] = "https://us.cloud.langfuse.com" # 🇺🇸 US region

LANGFUSE_AUTH = base64.b64encode(
    f"{os.environ.get('LANGFUSE_PUBLIC_KEY')}:{os.environ.get('LANGFUSE_SECRET_KEY')}".encode()
).decode()

os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = os.environ.get("LANGFUSE_HOST") + "/api/public/otel"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {LANGFUSE_AUTH}"

# Set your OpenAI API Key
os.environ["OPENAI_API_KEY"] = "sk-proj-..."
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

# Create a TracerProvider for OpenTelemetry
trace_provider = TracerProvider()

# Add a SimpleSpanProcessor with the OTLPSpanExporter to send traces
trace_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter()))

# Set the global default tracer provider
from opentelemetry import trace
trace.set_tracer_provider(trace_provider)
tracer = trace.get_tracer(__name__)
import nest_asyncio
nest_asyncio.apply()
import logfire

# Configure logfire instrumentation.
logfire.configure(
    service_name='my_agent_service',

    send_to_logfire=False,
)
# This method automatically patches the OpenAI Agents SDK to send logs via OTLP to Langfuse.
logfire.instrument_openai_agents()

Step 2: Test Your Instrumentation

Here is a simple Q&A agent. We run it to confirm that the instrumentation is working correctly. If everything is set up correctly, you will see logs/traces in your observability dashboard.

import asyncio
from agents import Agent, Runner

async def main():
    agent = Agent(
        name="Assistant",
        instructions="You are a senior software engineer",
    )

    result = await Runner.run(agent, "Tell me why it is important to evaluate AI agents.")
    print(result.final_output)

loop = asyncio.get_running_loop()
await loop.create_task(main())
12:01:03.401 OpenAI Agents trace: Agent workflow
12:01:03.403   Agent run: 'Assistant'
12:01:03.404     Responses API with 'gpt-4o'
Evaluating AI agents is crucial for several reasons:

1. **Performance Verification**: Ensures that the AI performs its intended tasks accurately and efficiently, meeting the desired objectives and criteria.

2. **Reliability and Consistency**: Assesses whether the AI provides consistent results across different scenarios and over time.

3. **Safety and Risk Management**: Identifies potential risks or harmful behaviors that could lead to undesirable outcomes, ensuring the AI operates safely within defined limits.

4. **Bias and Fairness**: Checks for any biases in the AI’s decision-making process to promote fairness and avoid discrimination against particular groups.

5. **User Trust and Adoption**: Builds confidence and trust in the AI system among users and stakeholders, which is essential for widespread adoption.

6. **Regulatory Compliance**: Ensures that the AI adheres to relevant laws, regulations, and ethical guidelines, which may vary by industry or region.

7. **Continuous Improvement**: Provides feedback that can be used to refine and improve the AI model over time, enhancing its effectiveness and efficiency.

8. **Integration and Compatibility**: Evaluates how well the AI integrates with existing systems and processes, ensuring compatibility and smooth operation.

9. **Resource Optimization**: Assesses the efficiency of the AI in terms of computational resources, which can lead to cost savings and improved performance.

Evaluating AI agents systematically and rigorously supports their development and deployment in a responsible and effective manner.
import asyncio
from agents import Agent, Runner, function_tool

# Example function tool.
@function_tool
def get_weather(city: str) -> str:
    return f"The weather in {city} is sunny."

agent = Agent(
    name="Hello world",
    instructions="You are a helpful agent.",
    tools=[get_weather],
)

async def main():
    result = await Runner.run(agent, input="What's the weather in Berlin?")
    print(result.final_output)

loop = asyncio.get_running_loop()
await loop.create_task(main())
13:33:30.839 OpenAI Agents trace: Agent workflow
13:33:30.840   Agent run: 'Hello world'
13:33:30.842     Responses API with 'gpt-4o'
13:33:31.822     Function: get_weather
13:33:31.825     Responses API with 'gpt-4o'
The weather in Berlin is currently sunny.

Trace Structure

Langfuse records a trace that contains spans, which represent each step of your agent's logic. Here, the trace contains the overall agent run and sub-spans for:

  • The tool call (get_weather)
  • The LLM call (Responses API with 'gpt-4o')

You can inspect these to see precisely where time is spent, how many tokens are used, and so on:

Trace tree in Langfuse

View the trace in Langfuse

Online Evaluation

Online evaluation refers to evaluating the agent in a live, real-world environment, i.e. during actual usage in production. This involves continuously monitoring the agent's performance on real user interactions and analyzing the outcomes.

We have written a guide on different evaluation techniques here.

Common Metrics to Track in Production

  1. Costs — The instrumentation captures token usage, which you can convert into an approximate cost by assigning a price per token.
  2. Latency — Observe how long each step, or the entire run, takes.
  3. User Feedback — Users can provide direct feedback (thumbs up/down) to help refine or correct the agent.
  4. LLM-as-a-Judge — Use a separate LLM to evaluate your agent's output in near real time (e.g. checking for toxicity or correctness).

Below, we show examples of these metrics.

1. Costs

The screenshot below shows usage for gpt-4o calls. This is helpful for spotting costly steps and optimizing your agent.
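
If you also want a rough cost figure outside the Langfuse UI, you can convert token counts into an approximate cost yourself. The sketch below is a minimal example with placeholder per-token prices (check your provider's current pricing); the token numbers would come from the usage shown on the trace.

# Rough cost estimate from token counts.
# The prices below are placeholder assumptions, not official rates.
PRICE_PER_INPUT_TOKEN = 2.50 / 1_000_000    # assumed USD per input token
PRICE_PER_OUTPUT_TOKEN = 10.00 / 1_000_000  # assumed USD per output token

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Convert token usage (e.g. read from a Langfuse trace) into an approximate USD cost."""
    return input_tokens * PRICE_PER_INPUT_TOKEN + output_tokens * PRICE_PER_OUTPUT_TOKEN

# Example with token counts taken from a trace in the dashboard
print(f"Approximate cost: ${estimate_cost(1200, 350):.6f}")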

Costs

View the trace in Langfuse

2. Latency

We can also see how long each step took to complete. In the example below, the entire run took 7 seconds, and you can break the time down by step. This helps you identify bottlenecks and optimize your agent.
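
If you want a coarse end-to-end latency number directly in the notebook before drilling into the per-span breakdown in Langfuse, a minimal sketch is to time a run with the standard library. This reuses the agent defined in the Step 2 example above.

import time
from agents import Runner

# Measure the wall-clock time of a full agent run; Langfuse shows how this total
# splits across the individual spans (tool calls, LLM calls, etc.).
start = time.perf_counter()
timed_result = Runner.run_sync(agent, "What's the weather in Berlin?")
elapsed = time.perf_counter() - start
print(f"Agent run took {elapsed:.2f} s")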

Latency

View the trace in Langfuse

3. Additional Attributes

OpenTelemetry lets you attach a set of attributes to all spans via set_attribute. This allows you to set properties such as a Langfuse session ID, to group traces into Langfuse sessions, or a user ID, to assign traces to a specific user. You can find a list of all supported attributes here.

In this example, we pass a user_id, session_id, and trace_tags to Langfuse. You can also use the span attributes input.value and output.value to set the trace-level input and output.

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

trace_provider = TracerProvider()
trace_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter()))

# Sets the global default tracer provider
from opentelemetry import trace
trace.set_tracer_provider(trace_provider)

# Creates a tracer from the global tracer provider
tracer = trace.get_tracer(__name__)
input_query = "Why is AI agent evaluation important?"

with tracer.start_as_current_span("OpenAI-Agent-Trace") as span:
    span.set_attribute("langfuse.user.id", "user-12345")
    span.set_attribute("langfuse.session.id", "my-agent-session")
    span.set_attribute("langfuse.tags", ["staging", "demo", "OpenAI Agent SDK"])

    async def main(input_query):
        agent = Agent(
            name = "Assistant",
            instructions = "You are a helpful assistant.",
        )

        result = await Runner.run(agent, input_query)
        print(result.final_output)
        return result

    result = await main(input_query)

    # Add input and output values to parent trace
    span.set_attribute("input.value", input_query)
    span.set_attribute("output.value", result.final_output)
13:34:49.654 OpenAI Agents trace: Agent workflow
13:34:49.655   Agent run: 'Assistant'
13:34:49.657     Responses API with 'gpt-4o'
AI agent evaluation is crucial for several reasons:

1. **Performance Verification**: It ensures that the AI agent performs its intended tasks effectively and meets specific criteria or benchmarks.

2. **Safety and Reliability**: Evaluation helps identify and mitigate risks, ensuring that the AI operates safely and reliably in real-world situations.

3. **Continuous Improvement**: Analyzing performance data allows developers to refine and enhance the AI, leading to better outcomes and more efficient systems.

4. **Transparency and Accountability**: Thorough evaluation provides transparency into how decisions are made by the AI, which is essential for accountability, especially in sensitive applications.

5. **Bias and Fairness**: Evaluating AI systems helps detect and address potential biases, ensuring fair treatment of all users and stakeholders.

6. **Compliance**: It ensures adherence to regulations and industry standards, which is critical for legal and ethical compliance.

7. **User Trust**: A well-evaluated AI fosters trust among users, stakeholders, and the public, as they can be confident in its capabilities and limitations.

8. **Resource Allocation**: Evaluation helps determine if the AI is using resources efficiently, which can be crucial for cost management and scalability.

Example trace in Langfuse

4. User Feedback

If your agent is embedded in a user interface, you can record direct user feedback (such as a thumbs up/down in a chat UI). Below is an example of a simple feedback mechanism using IPython.display.

In the code snippet below, we capture the OpenTelemetry trace ID when a user sends a chat message. If the user likes or dislikes the last answer, we attach a score to that trace.

from agents import Agent, Runner, WebSearchTool
from opentelemetry.trace import format_trace_id
import ipywidgets as widgets
from IPython.display import display
from langfuse import Langfuse

langfuse = Langfuse()

# Define your agent with the web search tool
agent = Agent(
    name="WebSearchAgent",
    instructions="You are an agent that can search the web.",
    tools=[WebSearchTool()]
)

formatted_trace_id = None  # We'll store the current trace_id globally for demonstration

def on_feedback(button):
    if button.icon == "thumbs-up":
        langfuse.score(
            value=1,
            name="user-feedback",
            comment="The user gave this response a thumbs up",
            trace_id=formatted_trace_id
        )
    elif button.icon == "thumbs-down":
        langfuse.score(
            value=0,
            name="user-feedback",
            comment="The user gave this response a thumbs down",
            trace_id=formatted_trace_id
        )
    print("Scored the trace in Langfuse")

user_input = input("Enter your question: ")

# Run agent
with trace.get_tracer(__name__).start_as_current_span("OpenAI-Agent-Trace") as span:

    # Run your agent with a query
    result = Runner.run_sync(agent, user_input)
    print(result.final_output)

    current_span = trace.get_current_span()
    span_context = current_span.get_span_context()
    trace_id = span_context.trace_id
    formatted_trace_id = str(format_trace_id(trace_id))
    langfuse.trace(id=formatted_trace_id, input=user_input, output=result.final_output)

# Get feedback
print("How did you like the agent response?")

thumbs_up = widgets.Button(description="👍", icon="thumbs-up")
thumbs_down = widgets.Button(description="👎", icon="thumbs-down")

thumbs_up.on_click(on_feedback)
thumbs_down.on_click(on_feedback)

display(widgets.HBox([thumbs_up, thumbs_down]))
Enter your question: What is Langfuse?
13:54:41.574 OpenAI Agents trace: Agent workflow
13:54:41.575   Agent run: 'WebSearchAgent'
13:54:41.577     Responses API with 'gpt-4o'
Langfuse is an open-source engineering platform designed to enhance the development, monitoring, and optimization of Large Language Model (LLM) applications. It offers a suite of tools that provide observability, prompt management, evaluations, and metrics, facilitating the debugging and improvement of LLM-based solutions. ([toolkitly.com](https://www.toolkitly.com/langfuse?utm_source=openai))

**Key Features of Langfuse:**

- **LLM Observability:** Langfuse enables developers to monitor and analyze the performance of language models by tracking API calls, user inputs, prompts, and outputs. This observability aids in understanding model behavior and identifying areas for improvement. ([toolkitly.com](https://www.toolkitly.com/langfuse?utm_source=openai))

- **Prompt Management:** The platform provides tools for managing, versioning, and deploying prompts directly within Langfuse. This feature allows for efficient organization and refinement of prompts to optimize model responses. ([toolkitly.com](https://www.toolkitly.com/langfuse?utm_source=openai))

- **Evaluations and Metrics:** Langfuse offers capabilities to collect and calculate scores for LLM completions, run model-based evaluations, and gather user feedback. It also tracks key metrics such as cost, latency, and quality, providing insights through dashboards and data exports. ([toolkitly.com](https://www.toolkitly.com/langfuse?utm_source=openai))

- **Playground Environment:** The platform includes a playground where users can interactively experiment with different models and prompts, facilitating prompt engineering and testing. ([toolkitly.com](https://www.toolkitly.com/langfuse?utm_source=openai))

- **Integration Capabilities:** Langfuse integrates seamlessly with various tools and frameworks, including LlamaIndex, LangChain, OpenAI SDK, LiteLLM, and more, enhancing its functionality and allowing for the development of complex applications. ([toolerific.ai](https://toolerific.ai/ai-tools/opensource/langfuse-langfuse?utm_source=openai))

- **Open Source and Self-Hosting:** Being open-source, Langfuse allows developers to customize and extend the platform according to their specific needs. It can be self-hosted, providing full control over infrastructure and data. ([vafion.com](https://www.vafion.com/blog/unlocking-power-language-models-langfuse/?utm_source=openai))

Langfuse is particularly valuable for developers and researchers working with LLMs, offering a comprehensive set of tools to improve the performance and reliability of LLM applications. Its flexibility, integration capabilities, and open-source nature make it a robust choice for those seeking to enhance their LLM projects. 
How did you like the agent response?
HBox(children=(Button(description='👍', icon='thumbs-up', style=ButtonStyle()), Button(description='👎', icon='t…
Scored the trace in Langfuse

The user feedback is then captured in Langfuse:

User feedback is being captured in Langfuse

5. LLM-as-a-Judge

LLM-as-a-Judge is another way to automatically evaluate your agent's output. You can set up a separate LLM call to gauge the output's correctness, toxicity, style, or any other criteria you care about.

Workflow:

  1. You define an evaluation template, e.g. "Check if the text is toxic."
  2. You set a model that is used as the judge model; in this case gpt-4o-mini.
  3. Each time your agent generates output, you pass that output to your "judge" LLM with the template.
  4. The judge LLM responds with a rating or label that you log to your observability tool (a programmatic sketch of this step follows the example below).

Example from Langfuse:

LLM-as-a-Judge Evaluation Template LLM-as-a-Judge Evaluator

# Example: Checking if the agent’s output is toxic or not.
from agents import Agent, Runner, WebSearchTool

# Define your agent with the web search tool
agent = Agent(
    name="WebSearchAgent",
    instructions="You are an agent that can search the web.",
    tools=[WebSearchTool()]
)

input_query = "Is eating carrots good for the eyes?"

# Run agent
with trace.get_tracer(__name__).start_as_current_span("OpenAI-Agent-Trace") as span:
    # Run your agent with a query
    result = Runner.run_sync(agent, input_query)

    # Add input and output values to parent trace
    span.set_attribute("input.value", input_query)
    span.set_attribute("output.value", result.final_output)
14:05:34.735 OpenAI Agents trace: Agent workflow
14:05:34.736   Agent run: 'WebSearchAgent'
14:05:34.738     Responses API with 'gpt-4o'

You can see that the answer of this example is judged as "not toxic".

LLM-as-a-Judge Evaluation Score
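
The screenshots above use Langfuse's managed LLM-as-a-Judge evaluator. If you prefer to score traces programmatically instead, the sketch below shows one possible way to do it; it assumes the OpenAI Python SDK is available, reuses `span` and `result` from the cell above, and uses an illustrative score name.

from openai import OpenAI
from opentelemetry.trace import format_trace_id

judge_client = OpenAI()

def judge_toxicity(text: str) -> int:
    """Ask a judge model whether the text is toxic; returns 1 for 'not toxic', 0 for 'toxic'."""
    response = judge_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an evaluator. Reply with exactly 'toxic' or 'not toxic'."},
            {"role": "user", "content": f"Is the following text toxic?\n\n{text}"},
        ],
    )
    verdict = response.choices[0].message.content.strip().lower()
    return 1 if verdict.startswith("not") else 0

# Attach the judge's verdict to the trace of the previous run.
# `span` and `result` come from the cell above; the trace id is derived as in the user-feedback example.
formatted_trace_id = format_trace_id(span.get_span_context().trace_id)
langfuse.score(
    trace_id=formatted_trace_id,
    name="toxicity",  # illustrative score name
    value=judge_toxicity(result.final_output),
    comment="1 = not toxic, 0 = toxic (judged by gpt-4o-mini)",
)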

6. Observability Metrics Overview

All of these metrics can be visualized together in dashboards. This enables you to quickly see how your agent performs across many sessions and helps you track quality metrics over time.

Observability metrics overview

Offline Evaluation

Online evaluation is essential for live feedback, but you also need offline evaluation: systematic checks before or during development. This helps maintain quality and reliability before rolling changes out to production.

Dataset Evaluation

In offline evaluation, you typically:

  1. Have a benchmark dataset (with prompt and expected output pairs)
  2. Run your agent on that dataset
  3. Compare the outputs to the expected results or apply an additional scoring mechanism (a sketch of such a scoring step follows the run loop below)

Below, we demonstrate this approach with the search-dataset dataset, which contains questions that can be answered via the web search tool, along with expected answers.

import pandas as pd
from datasets import load_dataset

# Fetch search-dataset from Hugging Face
dataset = load_dataset("junzhang1207/search-dataset", split = "train")
df = pd.DataFrame(dataset)
print("First few rows of search-dataset:")
print(df.head())
First few rows of search-dataset:
                                     id  \
0  20caf138-0c81-4ef9-be60-fe919e0d68d4   
1  1f37d9fd-1bcc-4f79-b004-bc0e1e944033   
2  76173a7f-d645-4e3e-8e0d-cca139e00ebe   
3  5f5ef4ca-91fe-4610-a8a9-e15b12e3c803   
4  64dbed0d-d91b-4acd-9a9c-0a7aa83115ec   

                                            question  \
0                 steve jobs statue location budapst   
1  Why is the Battle of Stalingrad considered a t...   
2  In what year did 'The Birth of a Nation' surpa...   
3  How many Russian soldiers surrendered to AFU i...   
4   What event led to the creation of Google Images?   

                                     expected_answer       category       area  
0  The Steve Jobs statue is located in Budapest, ...           Arts  Knowledge  
1  The Battle of Stalingrad is considered a turni...   General News       News  
2  This question is based on a false premise. 'Th...  Entertainment       News  
3  About 300 Russian soldiers surrendered to the ...   General News       News  
4  Jennifer Lopez's appearance in a green Versace...     Technology       News  

Next, we create a dataset entity in Langfuse to track the runs. Then, we add each item from the dataset to it.

from langfuse import Langfuse
langfuse = Langfuse()

langfuse_dataset_name = "search-dataset_huggingface_openai-agent"

# Create a dataset in Langfuse
langfuse.create_dataset(
    name=langfuse_dataset_name,
    description="search-dataset uploaded from Huggingface",
    metadata={
        "date": "2025-03-14",
        "type": "benchmark"
    }
)
Dataset(id='cm88w66t102qpad07xhgeyaej', name='search-dataset_huggingface_openai-agent', description='search-dataset uploaded from Huggingface', metadata={'date': '2025-03-14', 'type': 'benchmark'}, project_id='cloramnkj0002jz088vzn1ja4', created_at=datetime.datetime(2025, 3, 14, 14, 47, 14, 676000, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 3, 14, 14, 47, 14, 676000, tzinfo=datetime.timezone.utc))
for idx, row in df.iterrows():
    langfuse.create_dataset_item(
        dataset_name=langfuse_dataset_name,
        input={"text": row["question"]},
        expected_output={"text": row["expected_answer"]}
    )
    if idx >= 49: # For this example, we upload only the first 50 items
        break

Dataset items in Langfuse

Run the Agent on the Dataset

We define a helper function run_openai_agent() that:

  1. Starts an OpenTelemetry span
  2. Runs our agent on the prompt
  3. Records the trace ID in Langfuse

Then, we loop over each dataset item, run the agent, and link the trace to the dataset item. We can also attach a quick evaluation score if desired.

from agents import Agent, Runner, WebSearchTool
from opentelemetry.trace import format_trace_id

# Define your agent with the web search tool
agent = Agent(
    name="WebSearchAgent",
    instructions="You are an agent that can search the web.",
    tools=[WebSearchTool(search_context_size= "high")]
)

def run_openai_agent(question):
    with tracer.start_as_current_span("OpenAI-Agent-Trace") as span:
        span.set_attribute("langfuse.tag", "dataset-run")

        # Run your agent with a query
        result = Runner.run_sync(agent, question)

        # Get the Langfuse trace_id to link the dataset run item to the agent trace
        current_span = trace.get_current_span()
        span_context = current_span.get_span_context()
        trace_id = span_context.trace_id
        formatted_trace_id = format_trace_id(trace_id)

        langfuse_trace = langfuse.trace(
            id=formatted_trace_id,
            input=question,
            output=result.final_output
        )
    return langfuse_trace, result.final_output
dataset = langfuse.get_dataset(langfuse_dataset_name)

# Run our agent against each dataset item
for item in dataset.items:
    langfuse_trace, output = run_openai_agent(item.input["text"])

    # Link the trace to the dataset item for analysis
    item.link(
        langfuse_trace,
        run_name="openai-agent-run-03",
        run_metadata={ "search_context_size": "high"}
    )

    # Optionally, store a quick evaluation score for demonstration
    langfuse_trace.score(
        name="<example_eval>",
        value=1,
        comment="This is a comment"
    )

# Flush data to ensure all telemetry is sent
langfuse.flush()
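
Instead of the placeholder score above, you could compare each output against the expected answer from the dataset, for example with another LLM-as-a-Judge call. The sketch below shows one possible implementation of this comparison step; the judge prompt and the score name correct_answer_manual are illustrative, not part of the run above.

from openai import OpenAI

judge_client = OpenAI()

def judge_correctness(question: str, expected: str, actual: str) -> int:
    """Use a judge model to check whether the agent answer matches the expected answer (1 = correct, 0 = incorrect)."""
    response = judge_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an evaluator. Reply with exactly 'correct' or 'incorrect'."},
            {"role": "user", "content": (
                f"Question: {question}\n"
                f"Expected answer: {expected}\n"
                f"Agent answer: {actual}\n"
                "Is the agent answer factually consistent with the expected answer?"
            )},
        ],
    )
    verdict = response.choices[0].message.content.strip().lower()
    return 1 if verdict.startswith("correct") else 0

# Inside the dataset loop, you could then replace the placeholder score with, for example:
# langfuse_trace.score(
#     name="correct_answer_manual",  # illustrative score name
#     value=judge_correctness(item.input["text"], item.expected_output["text"], output),
#     comment="LLM-as-a-Judge comparison against the expected dataset answer",
# )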

You can repeat this process with different configurations (a sketch of a variant run follows at the end of this section):

  • Search tools (e.g. different context sizes for OpenAI's WebSearchTool)
  • Models (gpt-4o-mini, o1, etc.)
  • Tools (with search vs. without search)

Then compare them side by side in Langfuse. In this example, I ran the agent 3 times on the 50 dataset questions. For each run, I used a different setting for the context size of OpenAI's WebSearchTool. You can see that an increased context size also slightly improved answer correctness, from 0.89 to 0.92. The correct_answer score is created by an LLM-as-a-Judge evaluator that was set up to judge the correctness of the answer based on the sample answer given in the dataset.

Dataset run overview Dataset run comparison
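
For example, a comparison run with a smaller model and a lower search context size could look like the sketch below; the model choice and run name are illustrative, and everything else mirrors the loop above.

from agents import Agent, Runner, WebSearchTool

# Variant configuration for a comparison run: smaller model, lower search context size
variant_agent = Agent(
    name="WebSearchAgent",
    instructions="You are an agent that can search the web.",
    model="gpt-4o-mini",  # assumed model for the comparison run
    tools=[WebSearchTool(search_context_size="low")],
)

for item in dataset.items:
    with tracer.start_as_current_span("OpenAI-Agent-Trace") as span:
        result = Runner.run_sync(variant_agent, item.input["text"])
        formatted_trace_id = format_trace_id(span.get_span_context().trace_id)
        langfuse_trace = langfuse.trace(
            id=formatted_trace_id,
            input=item.input["text"],
            output=result.final_output
        )

    # Link this run to the same dataset items under a different run name for side-by-side comparison
    item.link(
        langfuse_trace,
        run_name="openai-agent-run-gpt-4o-mini-low",  # illustrative run name
        run_metadata={"search_context_size": "low"}
    )

langfuse.flush()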