In this tutorial, we will learn how to monitor the internal steps (traces) of the OpenAI Agents SDK and evaluate its performance using Langfuse.
This guide covers the online and offline evaluation metrics that teams use to bring agents to production quickly and reliably. To learn more about evaluation strategies, check out this blog post.
Why AI agent evaluation matters:
- Debugging issues when tasks fail or produce suboptimal results
- Monitoring costs and performance in real time
- Improving reliability and safety through continuous feedback
Below, we install the openai-agents library (the OpenAI Agents SDK), the pydantic-ai[logfire] OpenTelemetry instrumentation, langfuse, and Hugging Face's datasets library.
%pip install openai-agents
%pip install nest_asyncio
%pip install pydantic-ai[logfire]
%pip install langfuse
%pip install datasets
In this notebook, we will use Langfuse to trace, debug, and evaluate our agent.
Note: If you are using LlamaIndex or LangGraph, you can find documentation on how to integrate them here and here.
import os
import base64
# Get keys for your project from the project settings page: https://cloud.langfuse.com
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com" # 🇪🇺 EU region
# os.environ["LANGFUSE_HOST"] = "https://us.cloud.langfuse.com" # 🇺🇸 US region
LANGFUSE_AUTH = base64.b64encode(
f"{os.environ.get('LANGFUSE_PUBLIC_KEY')}:{os.environ.get('LANGFUSE_SECRET_KEY')}".encode()
).decode()
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = os.environ.get("LANGFUSE_HOST") + "/api/public/otel"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {LANGFUSE_AUTH}"
# Set your OpenAI API Key
os.environ["OPENAI_API_KEY"] = "sk-proj-..."from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
# Create a TracerProvider for OpenTelemetry
trace_provider = TracerProvider()
# Add a SimpleSpanProcessor with the OTLPSpanExporter to send traces
trace_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter()))
# Set the global default tracer provider
from opentelemetry import trace
trace.set_tracer_provider(trace_provider)
tracer = trace.get_tracer(__name__)
Pydantic Logfire provides instrumentation for the OpenAI Agents SDK. We use it to send traces to the Langfuse OpenTelemetry backend.
import nest_asyncio
nest_asyncio.apply()
import logfire
# Configure logfire instrumentation.
logfire.configure(
service_name='my_agent_service',
send_to_logfire=False,
)
# This method automatically patches the OpenAI Agents SDK to send logs via OTLP to Langfuse.
logfire.instrument_openai_agents()
This is a simple Q&A agent. We run it to confirm that the instrumentation is working. If everything is set up correctly, you will see logs/traces in your observability dashboard.
import asyncio
from agents import Agent, Runner
async def main():
agent = Agent(
name="Assistant",
instructions="You are a senior software engineer",
)
result = await Runner.run(agent, "Tell me why it is important to evaluate AI agents.")
print(result.final_output)
loop = asyncio.get_running_loop()
await loop.create_task(main())
12:01:03.401 OpenAI Agents trace: Agent workflow
12:01:03.403 Agent run: 'Assistant'
12:01:03.404 Responses API with 'gpt-4o'
Evaluating AI agents is crucial for several reasons:
1. **Performance Verification**: Ensures that the AI performs its intended tasks accurately and efficiently, meeting the desired objectives and criteria.
2. **Reliability and Consistency**: Assesses whether the AI provides consistent results across different scenarios and over time.
3. **Safety and Risk Management**: Identifies potential risks or harmful behaviors that could lead to undesirable outcomes, ensuring the AI operates safely within defined limits.
4. **Bias and Fairness**: Checks for any biases in the AI’s decision-making process to promote fairness and avoid discrimination against particular groups.
5. **User Trust and Adoption**: Builds confidence and trust in the AI system among users and stakeholders, which is essential for widespread adoption.
6. **Regulatory Compliance**: Ensures that the AI adheres to relevant laws, regulations, and ethical guidelines, which may vary by industry or region.
7. **Continuous Improvement**: Provides feedback that can be used to refine and improve the AI model over time, enhancing its effectiveness and efficiency.
8. **Integration and Compatibility**: Evaluates how well the AI integrates with existing systems and processes, ensuring compatibility and smooth operation.
9. **Resource Optimization**: Assesses the efficiency of the AI in terms of computational resources, which can lead to cost savings and improved performance.
Evaluating AI agents systematically and rigorously supports their development and deployment in a responsible and effective manner.
Check your Langfuse trace dashboard to confirm that all spans and logs have been recorded.
Example trace in Langfuse:

Now that you have confirmed the instrumentation is working, let's try a more complex query so we can see how advanced metrics (token usage, latency, costs, etc.) are tracked.
import asyncio
from agents import Agent, Runner, function_tool
# Example function tool.
@function_tool
def get_weather(city: str) -> str:
return f"The weather in {city} is sunny."
agent = Agent(
name="Hello world",
instructions="You are a helpful agent.",
tools=[get_weather],
)
async def main():
result = await Runner.run(agent, input="What's the weather in Berlin?")
print(result.final_output)
loop = asyncio.get_running_loop()
await loop.create_task(main())
13:33:30.839 OpenAI Agents trace: Agent workflow
13:33:30.840 Agent run: 'Hello world'
13:33:30.842 Responses API with 'gpt-4o'
13:33:31.822 Function: get_weather
13:33:31.825 Responses API with 'gpt-4o'
The weather in Berlin is currently sunny.
Langfuse records a trace that contains spans, which represent each step of your agent's logic. Here, the trace contains the overall agent run and sub-spans for the agent run, the get_weather tool call, and the 'gpt-4o' Responses API calls.
You can inspect these to see exactly where time is spent, how many tokens are used, and so on:

Online evaluation refers to evaluating the agent in a live, real-world environment, i.e. during actual use in production. This involves continuously monitoring the agent's performance on real user interactions and analyzing the results.
We have written a guide on the different evaluation techniques here.
Below, we show examples of these metrics.
The screenshot below shows the usage of a gpt-4o call. This is helpful for spotting costly steps and optimizing your agent.

We can also see how long each step took. In the example below, the entire run took 7 seconds, and you can break this time down by step. This helps you identify bottlenecks and optimize your agent.

OpenTelemetry lets you attach a set of attributes to all spans by setting set_attribute. This allows you to set properties such as a Langfuse session ID, to group traces into Langfuse sessions, or a user ID, to assign traces to a specific user. You can find a list of all supported attributes here.
In this example, we pass the user_id, session_id, and trace_tags to Langfuse. You can also use the span attributes input.value and output.value to set trace-level inputs and outputs.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
trace_provider = TracerProvider()
trace_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter()))
# Sets the global default tracer provider
from opentelemetry import trace
trace.set_tracer_provider(trace_provider)
# Creates a tracer from the global tracer provider
tracer = trace.get_tracer(__name__)
input_query = "Why is AI agent evaluation important?"
with tracer.start_as_current_span("OpenAI-Agent-Trace") as span:
span.set_attribute("langfuse.user.id", "user-12345")
span.set_attribute("langfuse.session.id", "my-agent-session")
span.set_attribute("langfuse.tags", ["staging", "demo", "OpenAI Agent SDK"])
async def main(input_query):
agent = Agent(
name = "Assistant",
instructions = "You are a helpful assistant.",
)
result = await Runner.run(agent, input_query)
print(result.final_output)
return result
result = await main(input_query)
# Add input and output values to parent trace
span.set_attribute("input.value", input_query)
span.set_attribute("output.value", result.final_output)13:34:49.654 OpenAI Agents trace: Agent workflow 13:34:49.655 Agent run: 'Assistant' 13:34:49.657 Responses API with 'gpt-4o' AI agent evaluation is crucial for several reasons: 1. **Performance Verification**: It ensures that the AI agent performs its intended tasks effectively and meets specific criteria or benchmarks. 2. **Safety and Reliability**: Evaluation helps identify and mitigate risks, ensuring that the AI operates safely and reliably in real-world situations. 3. **Continuous Improvement**: Analyzing performance data allows developers to refine and enhance the AI, leading to better outcomes and more efficient systems. 4. **Transparency and Accountability**: Thorough evaluation provides transparency into how decisions are made by the AI, which is essential for accountability, especially in sensitive applications. 5. **Bias and Fairness**: Evaluating AI systems helps detect and address potential biases, ensuring fair treatment of all users and stakeholders. 6. **Compliance**: It ensures adherence to regulations and industry standards, which is critical for legal and ethical compliance. 7. **User Trust**: A well-evaluated AI fosters trust among users, stakeholders, and the public, as they can be confident in its capabilities and limitations. 8. **Resource Allocation**: Evaluation helps determine if the AI is using resources efficiently, which can be crucial for cost management and scalability.

If your agent is embedded in a user interface, you can record direct user feedback (such as a thumbs-up/thumbs-down in a chat UI). Below is an example of a simple feedback mechanism using IPython.display.
In the code snippet below, when a user sends a chat message we capture the OpenTelemetry trace ID. If the user likes/dislikes the last answer, we attach a score to that trace.
from agents import Agent, Runner, WebSearchTool
from opentelemetry.trace import format_trace_id
import ipywidgets as widgets
from IPython.display import display
from langfuse import Langfuse
langfuse = Langfuse()
# Define your agent with the web search tool
agent = Agent(
name="WebSearchAgent",
instructions="You are an agent that can search the web.",
tools=[WebSearchTool()]
)
formatted_trace_id = None # We'll store the current trace_id globally for demonstration
def on_feedback(button):
if button.icon == "thumbs-up":
langfuse.score(
value=1,
name="user-feedback",
comment="The user gave this response a thumbs up",
trace_id=formatted_trace_id
)
elif button.icon == "thumbs-down":
langfuse.score(
value=0,
name="user-feedback",
comment="The user gave this response a thumbs down",
trace_id=formatted_trace_id
)
print("Scored the trace in Langfuse")
user_input = input("Enter your question: ")
# Run agent
with trace.get_tracer(__name__).start_as_current_span("OpenAI-Agent-Trace") as span:
# Run your agent with a query
result = Runner.run_sync(agent, user_input)
print(result.final_output)
current_span = trace.get_current_span()
span_context = current_span.get_span_context()
trace_id = span_context.trace_id
formatted_trace_id = str(format_trace_id(trace_id))
langfuse.trace(id=formatted_trace_id, input=user_input, output=result.final_output)
# Get feedback
print("How did you like the agent response?")
thumbs_up = widgets.Button(description="👍", icon="thumbs-up")
thumbs_down = widgets.Button(description="👎", icon="thumbs-down")
thumbs_up.on_click(on_feedback)
thumbs_down.on_click(on_feedback)
display(widgets.HBox([thumbs_up, thumbs_down]))
Enter your question: What is Langfuse?
13:54:41.574 OpenAI Agents trace: Agent workflow
13:54:41.575 Agent run: 'WebSearchAgent'
13:54:41.577 Responses API with 'gpt-4o'
Langfuse is an open-source engineering platform designed to enhance the development, monitoring, and optimization of Large Language Model (LLM) applications. It offers a suite of tools that provide observability, prompt management, evaluations, and metrics, facilitating the debugging and improvement of LLM-based solutions. ([toolkitly.com](https://www.toolkitly.com/langfuse?utm_source=openai))
**Key Features of Langfuse:**
- **LLM Observability:** Langfuse enables developers to monitor and analyze the performance of language models by tracking API calls, user inputs, prompts, and outputs. This observability aids in understanding model behavior and identifying areas for improvement. ([toolkitly.com](https://www.toolkitly.com/langfuse?utm_source=openai))
- **Prompt Management:** The platform provides tools for managing, versioning, and deploying prompts directly within Langfuse. This feature allows for efficient organization and refinement of prompts to optimize model responses. ([toolkitly.com](https://www.toolkitly.com/langfuse?utm_source=openai))
- **Evaluations and Metrics:** Langfuse offers capabilities to collect and calculate scores for LLM completions, run model-based evaluations, and gather user feedback. It also tracks key metrics such as cost, latency, and quality, providing insights through dashboards and data exports. ([toolkitly.com](https://www.toolkitly.com/langfuse?utm_source=openai))
- **Playground Environment:** The platform includes a playground where users can interactively experiment with different models and prompts, facilitating prompt engineering and testing. ([toolkitly.com](https://www.toolkitly.com/langfuse?utm_source=openai))
- **Integration Capabilities:** Langfuse integrates seamlessly with various tools and frameworks, including LlamaIndex, LangChain, OpenAI SDK, LiteLLM, and more, enhancing its functionality and allowing for the development of complex applications. ([toolerific.ai](https://toolerific.ai/ai-tools/opensource/langfuse-langfuse?utm_source=openai))
- **Open Source and Self-Hosting:** Being open-source, Langfuse allows developers to customize and extend the platform according to their specific needs. It can be self-hosted, providing full control over infrastructure and data. ([vafion.com](https://www.vafion.com/blog/unlocking-power-language-models-langfuse/?utm_source=openai))
Langfuse is particularly valuable for developers and researchers working with LLMs, offering a comprehensive set of tools to improve the performance and reliability of LLM applications. Its flexibility, integration capabilities, and open-source nature make it a robust choice for those seeking to enhance their LLM projects.
How did you like the agent response?
Scored the trace in Langfuse
The user feedback is then recorded in Langfuse:

LLM-as-a-Judge is another way to automatically evaluate your agent's output. You can set up a separate LLM call to gauge the output's correctness, toxicity, style, or any other criteria you care about.
Workflow:
- Define an evaluation template, e.g. checking whether the agent's output is toxic.
- Set a model to act as the judge; in this case, gpt-4o-mini.
Example from Langfuse:

# Example: Checking if the agent’s output is toxic or not.
from agents import Agent, Runner, WebSearchTool
# Define your agent with the web search tool
agent = Agent(
name="WebSearchAgent",
instructions="You are an agent that can search the web.",
tools=[WebSearchTool()]
)
input_query = "Is eating carrots good for the eyes?"
# Run agent
with trace.get_tracer(__name__).start_as_current_span("OpenAI-Agent-Trace") as span:
# Run your agent with a query
result = Runner.run_sync(agent, input_query)
# Add input and output values to parent trace
span.set_attribute("input.value", input_query)
span.set_attribute("output.value", result.final_output)14:05:34.735 OpenAI Agents trace: Agent workflow 14:05:34.736 Agent run: 'WebSearchAgent' 14:05:34.738 Responses API with 'gpt-4o'
You can see that the answer in this example was judged as "not toxic".

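The evaluator above runs inside Langfuse, so no extra code is needed in the notebook. As an illustration only, a minimal client-side sketch of the same idea could look like the following; the llm_judge_toxicity helper and its prompt are hypothetical assumptions, not part of the Langfuse evaluator, and it assumes the openai Python package and OPENAI_API_KEY are available as earlier in this notebook.
from openai import OpenAI
judge_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
def llm_judge_toxicity(output_text: str) -> int:
    # Hypothetical judge call: ask a small model whether the text is toxic.
    response = judge_client.chat.completions.create(
        model="gpt-4o-mini",  # judge model, mirroring the evaluator described above
        messages=[
            {"role": "system", "content": "You are an evaluator. Reply with exactly 'toxic' or 'not toxic'."},
            {"role": "user", "content": f"Is the following text toxic?\n\n{output_text}"},
        ],
    )
    verdict = response.choices[0].message.content.strip().lower()
    return 1 if "not toxic" in verdict else 0
# The resulting score could then be attached to the trace, for example with
# langfuse.score(trace_id=formatted_trace_id, name="toxicity", value=llm_judge_toxicity(result.final_output)),
# using a trace ID captured as in the user-feedback example above.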
All of these metrics can be visualized together in dashboards. This lets you quickly see how your agent performs across many sessions and track quality metrics over time.

Online evaluation is essential for live feedback, but you also need offline evaluation: systematic checks before or during development. This helps maintain quality and reliability before pushing changes to production.
In offline evaluation, you typically have a benchmark dataset (with prompt and expected-output pairs), run your agent on that dataset, and compare the outputs to the expected results or score them with an additional evaluator.
Below, we demonstrate this approach with the search-dataset dataset, which contains questions that can be answered via the web search tool along with the expected answers.
import pandas as pd
from datasets import load_dataset
# Fetch search-dataset from Hugging Face
dataset = load_dataset("junzhang1207/search-dataset", split = "train")
df = pd.DataFrame(dataset)
print("First few rows of search-dataset:")
print(df.head())
First few rows of search-dataset:
id \
0 20caf138-0c81-4ef9-be60-fe919e0d68d4
1 1f37d9fd-1bcc-4f79-b004-bc0e1e944033
2 76173a7f-d645-4e3e-8e0d-cca139e00ebe
3 5f5ef4ca-91fe-4610-a8a9-e15b12e3c803
4 64dbed0d-d91b-4acd-9a9c-0a7aa83115ec
question \
0 steve jobs statue location budapst
1 Why is the Battle of Stalingrad considered a t...
2 In what year did 'The Birth of a Nation' surpa...
3 How many Russian soldiers surrendered to AFU i...
4 What event led to the creation of Google Images?
expected_answer category area
0 The Steve Jobs statue is located in Budapest, ... Arts Knowledge
1 The Battle of Stalingrad is considered a turni... General News News
2 This question is based on a false premise. 'Th... Entertainment News
3 About 300 Russian soldiers surrendered to the ... General News News
4 Jennifer Lopez's appearance in a green Versace... Technology News
Next, we create a dataset entity in Langfuse to track the runs. Then, we add each item from the dataset to it.
from langfuse import Langfuse
langfuse = Langfuse()
langfuse_dataset_name = "search-dataset_huggingface_openai-agent"
# Create a dataset in Langfuse
langfuse.create_dataset(
name=langfuse_dataset_name,
description="search-dataset uploaded from Huggingface",
metadata={
"date": "2025-03-14",
"type": "benchmark"
}
)
Dataset(id='cm88w66t102qpad07xhgeyaej', name='search-dataset_huggingface_openai-agent', description='search-dataset uploaded from Huggingface', metadata={'date': '2025-03-14', 'type': 'benchmark'}, project_id='cloramnkj0002jz088vzn1ja4', created_at=datetime.datetime(2025, 3, 14, 14, 47, 14, 676000, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 3, 14, 14, 47, 14, 676000, tzinfo=datetime.timezone.utc))
for idx, row in df.iterrows():
langfuse.create_dataset_item(
dataset_name=langfuse_dataset_name,
input={"text": row["question"]},
expected_output={"text": row["expected_answer"]}
)
if idx >= 49: # For this example, we upload only the first 50 items
break
We define a helper function run_openai_agent() that starts an OpenTelemetry span, runs our agent on the question, and records the trace ID in Langfuse.
Then, we loop over each dataset item, run the agent, and link the trace to the dataset item. Optionally, we can also attach a quick evaluation score.
from agents import Agent, Runner, WebSearchTool
from opentelemetry.trace import format_trace_id
# Define your agent with the web search tool
agent = Agent(
name="WebSearchAgent",
instructions="You are an agent that can search the web.",
tools=[WebSearchTool(search_context_size= "high")]
)
def run_openai_agent(question):
with tracer.start_as_current_span("OpenAI-Agent-Trace") as span:
span.set_attribute("langfuse.tag", "dataset-run")
# Run your agent with a query
result = Runner.run_sync(agent, question)
# Get the Langfuse trace_id to link the dataset run item to the agent trace
current_span = trace.get_current_span()
span_context = current_span.get_span_context()
trace_id = span_context.trace_id
formatted_trace_id = format_trace_id(trace_id)
langfuse_trace = langfuse.trace(
id=formatted_trace_id,
input=question,
output=result.final_output
)
return langfuse_trace, result.final_output
dataset = langfuse.get_dataset(langfuse_dataset_name)
# Run our agent against each dataset item
for item in dataset.items:
langfuse_trace, output = run_openai_agent(item.input["text"])
# Link the trace to the dataset item for analysis
item.link(
langfuse_trace,
run_name="openai-agent-run-03",
run_metadata={ "search_context_size": "high"}
)
# Optionally, store a quick evaluation score for demonstration
langfuse_trace.score(
name="<example_eval>",
value=1,
comment="This is a comment"
)
# Flush data to ensure all telemetry is sent
langfuse.flush()
You can repeat this process with different agent configurations (for example, other models, prompts, or tools such as different context sizes for the WebSearchTool) and then compare the runs side by side in Langfuse, as shown in the sketch below. In this example, I ran the agent 3 times on the 50 dataset questions. For each run, I used a different setting for the context size of OpenAI's WebSearchTool. You can see that increasing the context size also slightly improved answer correctness, from 0.89 to 0.92. The correct_answer score is created by an LLM-as-a-Judge evaluator that was set up to judge the correctness of the answers based on the sample answers provided in the dataset.
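A minimal sketch of such a comparison loop, reusing the run_openai_agent() helper and the dataset from above, could look like this; the "low" and "medium" context-size values and the run names are assumptions for demonstration ("high" is the value used earlier).
# Hypothetical sketch: repeat the dataset run with different WebSearchTool context sizes
for context_size in ["low", "medium", "high"]:  # assumed option values
    agent = Agent(
        name="WebSearchAgent",
        instructions="You are an agent that can search the web.",
        tools=[WebSearchTool(search_context_size=context_size)],
    )
    for item in dataset.items:
        # run_openai_agent() picks up the (re)defined module-level `agent`
        langfuse_trace, output = run_openai_agent(item.input["text"])
        item.link(
            langfuse_trace,
            run_name=f"openai-agent-run-context-{context_size}",
            run_metadata={"search_context_size": context_size},
        )
    langfuse.flush()
Each run then appears as a separate dataset run in Langfuse, so the correctness scores of the three configurations can be compared side by side.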
