March 27, 2025

Building a Voice Assistant with the Agents SDK

Imagine you're the AI lead at a consumer tech company. You envision deploying a single-entry-point digital voice assistant that can help users with any query, whether they want to take action on their account, find product information, or get real-time guidance.

Turning that vision into reality, however, can be extremely difficult: you would first need to build and test the functionality for each specific use case in text, integrate access to all the required tools and systems, and somehow orchestrate them into a coherent experience. Then, once you reach a satisfactory level of quality (even assessing that can be difficult), you face the daunting task of refactoring the entire workflow for voice interaction.

Fortunately, three recent OpenAI releases make realizing this vision easier than ever, providing the tools to build and orchestrate modular agentic workflows over voice with minimal configuration:

  • Responses API - an agentic API that makes it easy to interact with our frontier models, with stateful conversation management, response tracking to support evals, and built-in tools such as file search, web search, and computer use
  • Agents SDK - a lightweight, customizable open-source framework for building and orchestrating workflows across multiple distinct agents, letting your assistant route inputs to the appropriate agent and scale to many use cases
  • Voice agents - an extension of the Agents SDK that supports voice pipelines, letting your agents go from text-only interaction to understanding and generating audio in just a few lines of code

This tutorial shows how to use the tools above to build a simple in-app voice assistant for a fictional consumer application. We'll create a triage agent that greets the user, determines their intent, and routes the request to one of three specialized agents:

  • Search Agent - performs web searches via the Responses API's built-in tool to answer user queries with real-time information
  • Knowledge Agent - uses the Responses API's file search tool to retrieve information from an OpenAI-managed vector store
  • Account Agent - uses function calling to trigger custom actions against your API

Finally, we'll use the voice capabilities of the Agents SDK to turn this workflow into a real-time voice assistant that captures input from the microphone, performs speech-to-text, routes the request through our agents, and responds with text-to-speech.

Setup

To follow along with this cookbook, install the packages below for access to the OpenAI API, the Agents SDK, and audio handling libraries. You can then set your OpenAI API key for the agents to use via the set_default_openai_key function.

%pip install openai
%pip install openai-agents 'openai-agents[voice]'
%pip install numpy
%pip install sounddevice
from agents import Agent, function_tool, WebSearchTool, FileSearchTool, set_default_openai_key
from agents.extensions.handoff_prompt import prompt_with_handoff_instructions

set_default_openai_key("YOUR_API_KEY")
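As an aside, rather than hard-coding the key, you may prefer to read it from an environment variable. This is a minimal sketch of that pattern; the fallback placeholder is just for illustration:

```python
import os

# Read the API key from the environment, falling back to a placeholder
# so the notebook still runs locally (replace before making real calls).
api_key = os.environ.get("OPENAI_API_KEY", "YOUR_API_KEY")

# set_default_openai_key(api_key)  # then register it with the Agents SDK as above
```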

Defining Agents and Tools

Today we'll build an assistant for the fictional consumer application ACME Shop, focused on supporting three key use cases:

  • Answering real-time questions via web search to inform purchasing decisions
  • Providing information on the options available in our product portfolio
  • Providing account information so users can understand their budget and spending

To achieve this we'll use an agentic architecture. It lets us split the functionality for each use case into a separate agent, reducing the complexity and scope any single agent has to handle and improving accuracy. Our architecture is relatively simple, targeting just the three use cases above, but the beauty of the Agents SDK is how easy it makes extending the workflow with additional agents as you add new functionality:

Agent Architecture
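To build intuition for why this split keeps each agent's scope small, here is a tiny SDK-free sketch of the same triage idea. In the real workflow the LLM performs intent classification via handoffs, so the keyword matching below is purely illustrative and all names are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class MiniAgent:
    name: str
    handle: Callable[[str], str]

def triage(query: str, registry: Dict[str, MiniAgent]) -> str:
    # Crude keyword routing standing in for the LLM's intent detection.
    q = query.lower()
    if "account" in q or "balance" in q:
        return registry["account"].handle(q)
    if "product" in q or "catalogue" in q:
        return registry["knowledge"].handle(q)
    return registry["search"].handle(q)

registry = {
    "account": MiniAgent("AccountAgent", lambda q: "account details"),
    "knowledge": MiniAgent("KnowledgeAgent", lambda q: "product answer"),
    "search": MiniAgent("SearchAgent", lambda q: "web results"),
}

print(triage("What's my account balance?", registry))  # → account details
```

Adding a new use case is then just a matter of registering one more agent, which mirrors how the SDK lets you append agents to a handoff list.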

Search Agent

Our first agent is a simple web search agent that uses the WebSearchTool provided by the Responses API to find real-time information for the user's query. We'll keep the instruction prompts simple in these examples, and iterate later to show how to optimize the response format for your use case.

# --- Agent: Search Agent ---
search_agent = Agent(
    name="SearchAgent",
    instructions=(
        "You immediately provide an input to the WebSearchTool to find up-to-date information on the user's query."
    ),
    tools=[WebSearchTool()],
)

Knowledge Agent

Our second agent needs to answer questions about our product portfolio. For this we'll use the FileSearchTool to retrieve data from an OpenAI-managed vector store containing our company-specific product information. We have two options here:

  1. Use the OpenAI platform website - go to platform.openai.com/storage, create a vector store, and upload the documents of your choice. Then take the vector store ID and substitute it into the FileSearchTool initialization below.

  2. Use the OpenAI API - create a vector store with the OpenAI Python client's vector_stores.create function, then add files to it with vector_stores.files.create. Once done, the FileSearchTool can search that vector store. See the code below (use the provided example file, or change the path to a local file of your own):

from openai import OpenAI
import os

client = OpenAI(api_key='YOUR_API_KEY')

def upload_file(file_path: str, vector_store_id: str):
    file_name = os.path.basename(file_path)
    try:
        # Use a context manager so the file handle is closed after upload
        with open(file_path, 'rb') as f:
            file_response = client.files.create(file=f, purpose="assistants")
        attach_response = client.vector_stores.files.create(
            vector_store_id=vector_store_id,
            file_id=file_response.id
        )
        return {"file": file_name, "status": "success"}
    except Exception as e:
        print(f"Error with {file_name}: {str(e)}")
        return {"file": file_name, "status": "failed", "error": str(e)}

def create_vector_store(store_name: str) -> dict:
    try:
        vector_store = client.vector_stores.create(name=store_name)
        details = {
            "id": vector_store.id,
            "name": vector_store.name,
            "created_at": vector_store.created_at,
            "file_count": vector_store.file_counts.completed
        }
        print("Vector store created:", details)
        return details
    except Exception as e:
        print(f"Error creating vector store: {e}")
        return {}
    
vector_store_details = create_vector_store("ACME Shop Product Knowledge Base")
upload_file("voice_agents_knowledge/acme_product_catalogue.pdf", vector_store_details["id"])
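Since `create_vector_store` above returns an empty dict when creation fails, a small guard (a hypothetical helper, not part of the SDK) avoids a `KeyError` at upload time:

```python
from typing import Optional

def safe_store_id(details: dict) -> Optional[str]:
    """Return the vector store id, or None if creation failed."""
    return details.get("id") if details else None

# Only upload when the store was actually created.
store_id = safe_store_id({"id": "vs_123"})  # "vs_123" is a made-up example id
assert store_id == "vs_123"
assert safe_store_id({}) is None
```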

With the vector store in place, the knowledge agent can now use the FileSearchTool to search the given store ID.

# --- Agent: Knowledge Agent ---
knowledge_agent = Agent(
    name="KnowledgeAgent",
    instructions=(
        "You answer user questions on our product portfolio with concise, helpful responses using the FileSearchTool."
    ),
    tools=[FileSearchTool(
            max_num_results=3,
            vector_store_ids=["VECTOR_STORE_ID"],
        ),],
)

Account Agent

So far we've used the built-in tools provided by the Agents SDK, but you can define your own tools with the function_tool decorator for your agents to use when integrating with your systems. Here we'll define a simple dummy function for the account agent that returns account information for a given user ID.

# --- Tool 1: Fetch account information (dummy) ---
@function_tool
def get_account_info(user_id: str) -> dict:
    """Return dummy account info for a given user."""
    return {
        "user_id": user_id,
        "name": "Bugs Bunny",
        "account_balance": "£72.50",
        "membership_status": "Gold Executive"
    }

# --- Agent: Account Agent ---
account_agent = Agent(
    name="AccountAgent",
    instructions=(
        "You provide account information based on a user ID using the get_account_info tool."
    ),
    tools=[get_account_info],
)
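One hedged pattern worth noting: keeping the tool's logic in a plain helper makes it unit-testable without the SDK, with the `@function_tool` wrapper simply delegating to it (the helper name here is hypothetical):

```python
def build_account_info(user_id: str) -> dict:
    # Pure logic mirroring the dummy get_account_info tool above,
    # easy to test in isolation before wrapping it as a tool.
    return {
        "user_id": user_id,
        "name": "Bugs Bunny",
        "account_balance": "£72.50",
        "membership_status": "Gold Executive",
    }

info = build_account_info("1234567890")
assert info["membership_status"] == "Gold Executive"
```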

Finally, we'll define the triage agent, which routes queries to the appropriate agent based on user intent. Here we use the prompt_with_handoff_instructions function, which adds extra guidance on how to handle handoffs; it's recommended for any agent with a defined set of handoffs and instructions.

# --- Agent: Triage Agent ---
triage_agent = Agent(
    name="Assistant",
    instructions=prompt_with_handoff_instructions("""
You are the virtual assistant for Acme Shop. Welcome the user and ask how you can help.
Based on the user's intent, route to:
- AccountAgent for account-related queries
- KnowledgeAgent for product FAQs
- SearchAgent for anything requiring real-time web search
"""),
    handoffs=[account_agent, knowledge_agent, search_agent],
)

Running the Workflow

Now that we've defined our agents, we can run the workflow on a few example queries and see how it performs.

# %%
from agents import Runner, trace

async def test_queries():
    examples = [
        "What's my ACME account balance doc? My user ID is 1234567890", # Account Agent test
        "Ooh i've got money to spend! How big is the input and how fast is the output of the dynamite dispenser?", # Knowledge Agent test
        "Hmmm, what about duck hunting gear - what's trending right now?", # Search Agent test
    ]
    with trace("ACME App Assistant"):
        for query in examples:
            result = await Runner.run(triage_agent, query)
            print(f"User: {query}")
            print(result.final_output)
            print("---")
# Run the tests
await test_queries()
User: What's my ACME account balance doc? My user ID is 1234567890
Your ACME account balance is £72.50. You have a Gold Executive membership.
---
User: Ooh i've got money to spend! How big is the input and how fast is the output of the dynamite dispenser?
The Automated Dynamite Dispenser can hold up to 10 sticks of dynamite and dispenses them at a speed of 1 stick every 2 seconds.
---
User: Hmmm, what about duck hunting gear - what's trending right now?
Staying updated with the latest trends in duck hunting gear can significantly enhance your hunting experience. Here are some of the top trending items for the 2025 season:

**Banded Aspire Catalyst Waders**  
These all-season waders feature waterproof-breathable technology, ensuring comfort in various conditions. They boast a minimal-stitch design for enhanced mobility and include PrimaLoft Aerogel insulation for thermal protection. Additional features like an over-the-boot protective pant and an integrated LED light in the chest pocket make them a standout choice. ([blog.gritroutdoors.com](https://blog.gritroutdoors.com/must-have-duck-hunting-gear-for-a-winning-season/?utm_source=openai))

**Sitka Delta Zip Waders**  
Known for their durability, these waders have reinforced shins and knees with rugged foam pads, ideal for challenging terrains. Made with GORE-TEX material, they ensure dryness throughout the season. ([blog.gritroutdoors.com](https://blog.gritroutdoors.com/must-have-duck-hunting-gear-for-a-winning-season/?utm_source=openai))

**MOmarsh InvisiMan Blind**  
This one-person, low-profile blind is praised for its sturdiness and ease of setup. Hunters have reported that even late-season, cautious ducks approach without hesitation, making it a valuable addition to your gear. ([bornhunting.com](https://bornhunting.com/top-duck-hunting-gear/?utm_source=openai))

**Slayer Calls Ranger Duck Call**  
This double reed call produces crisp and loud sounds, effectively attracting distant ducks in harsh weather conditions. Its performance has been noted for turning the heads of ducks even at extreme distances. ([bornhunting.com](https://bornhunting.com/top-duck-hunting-gear/?utm_source=openai))

**Sitka Full Choke Pack**  
A favorite among hunters, this backpack-style blind bag offers comfort and efficiency. It has proven to keep gear dry during heavy downpours and is durable enough to withstand over 60 hunts in a season. ([bornhunting.com](https://bornhunting.com/top-duck-hunting-gear/?utm_source=openai))

Incorporating these trending items into your gear can enhance your comfort, efficiency, and success during the hunting season.
---

Tracing

From the output above, the responses appear to match our expectations, but a key benefit of the Agents SDK is its built-in tracing, which tracks the flow of events across LLM calls, handoffs, and tools during an agent run.

With the Traces dashboard, we can debug, visualize, and monitor our workflows during development and in production. As shown below, each test query was correctly routed to the appropriate agent.

Traces Dashboard

Enabling Voice

With our workflow designed (in practice we'd spend time evaluating the traces and iterating on the flow to make it as performant as possible), let's assume we're happy with it and turn to converting our in-app assistant from text-based to voice-based interaction.

To do this, we can use classes provided by the Agents SDK to convert the text-based workflow directly into a voice one. The VoicePipeline class provides an interface for transcribing audio input, executing a given agent workflow, and generating a text-to-speech response to play back to the user, while the SingleAgentVoiceWorkflow class lets us reuse the same agent workflow from the text version. We'll use the sounddevice library to handle audio input and output.

End to end, the new workflow looks like this:

Agent Architecture 2

The code to implement it is below:

# %%
import numpy as np
import sounddevice as sd
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

async def voice_assistant():
    samplerate = sd.query_devices(kind='input')['default_samplerate']

    while True:
        pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(triage_agent))

        # Check for input to either provide voice or exit
        cmd = input("Press Enter to speak your query (or type 'esc' to exit): ")
        if cmd.lower() == "esc":
            print("Exiting...")
            break      
        print("Listening...")
        recorded_chunks = []

         # Start streaming from microphone until Enter is pressed
        with sd.InputStream(samplerate=samplerate, channels=1, dtype='int16', callback=lambda indata, frames, time, status: recorded_chunks.append(indata.copy())):
            input()

        # Concatenate chunks into single buffer
        recording = np.concatenate(recorded_chunks, axis=0)

        # Input the buffer and await the result
        audio_input = AudioInput(buffer=recording)

        with trace("ACME App Voice Assistant"):
            result = await pipeline.run(audio_input)

         # Transfer the streamed result into chunks of audio
        response_chunks = []
        async for event in result.stream():
            if event.type == "voice_stream_event_audio":
                response_chunks.append(event.data)

        response_audio = np.concatenate(response_chunks, axis=0)

        # Play response
        print("Assistant is responding...")
        sd.play(response_audio, samplerate=samplerate)
        sd.wait()
        print("---")

# Run the voice assistant
await voice_assistant()
Listening...
Assistant is responding...
---
Exiting...

Executing the code above yields the following responses, which correctly provide the same functionality as the text-based workflow.

from IPython.display import display, Audio
display(Audio("voice_agents_audio/account_balance_response_base.mp3"))
display(Audio("voice_agents_audio/product_info_response_base.mp3"))
display(Audio("voice_agents_audio/trending_items_response_base.mp3"))
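As an aside, the chunk-buffering step inside `voice_assistant` can be sanity-checked offline with synthetic int16 audio; the chunk sizes here are arbitrary:

```python
import numpy as np

# Simulate three microphone callback chunks of mono int16 audio
# (512 frames each), as appended by the InputStream callback.
chunks = [np.zeros((512, 1), dtype=np.int16) for _ in range(3)]
recording = np.concatenate(chunks, axis=0)

assert recording.shape == (1536, 1)
assert recording.dtype == np.int16
```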

Tip: when using tracing with voice agents, you can replay the audio in the Traces dashboard

Audio trace

Optimizing the Voice

This is a great start, but we can do better. Because we simply converted text-based agents into voice-based ones, the responses aren't optimized in tone or format for spoken output, which makes them sound robotic and unnatural.

To fix this, we need to make some adjustments to our prompts.

First, we can update the existing agents to include a common system prompt with guidance on optimizing their text responses for later conversion to audio.

# Common system prompt for voice output best practices:
voice_system_prompt = """
[Output Structure]
Your output will be delivered in an audio voice response, please ensure that every response meets these guidelines:
1. Use a friendly, human tone that will sound natural when spoken aloud.
2. Keep responses short and segmented—ideally one to two concise sentences per step.
3. Avoid technical jargon; use plain language so that instructions are easy to understand.
4. Provide only essential details so as not to overwhelm the listener.
"""

# --- Agent: Search Agent ---
search_voice_agent = Agent(
    name="SearchVoiceAgent",
    instructions=voice_system_prompt + (
        "You immediately provide an input to the WebSearchTool to find up-to-date information on the user's query."
    ),
    tools=[WebSearchTool()],
)

# --- Agent: Knowledge Agent ---
knowledge_voice_agent = Agent(
    name="KnowledgeVoiceAgent",
    instructions=voice_system_prompt + (
        "You answer user questions on our product portfolio with concise, helpful responses using the FileSearchTool."
    ),
    tools=[FileSearchTool(
            max_num_results=3,
            vector_store_ids=["VECTOR_STORE_ID"],
        ),],
)

# --- Agent: Account Agent ---
account_voice_agent = Agent(
    name="AccountVoiceAgent",
    instructions=voice_system_prompt + (
        "You provide account information based on a user ID using the get_account_info tool."
    ),
    tools=[get_account_info],
)

# --- Agent: Triage Agent ---
triage_voice_agent = Agent(
    name="VoiceAssistant",
    instructions=prompt_with_handoff_instructions("""
You are the virtual assistant for Acme Shop. Welcome the user and ask how you can help.
Based on the user's intent, route to:
- AccountAgent for account-related queries
- KnowledgeAgent for product FAQs
- SearchAgent for anything requiring real-time web search
"""),
    handoffs=[account_voice_agent, knowledge_voice_agent, search_voice_agent],
)

Next, via the instructions field, we can tell gpt-4o-mini-tts, the default OpenAI TTS model used by the Agents SDK, how to deliver the audio output of the text our agents generate.

Here we have a great deal of control over the output, including the ability to specify the personality, pronunciation, pacing, and emotion of the delivery.

Below are a few examples of how you might prompt the model for different applications.

health_assistant = (
    "Voice Affect: Calm, composed, and reassuring; project quiet authority and confidence. "
    "Tone: Sincere, empathetic, and gently authoritative—express genuine apology while conveying competence. "
    "Pacing: Steady and moderate; unhurried enough to communicate care, yet efficient enough to demonstrate professionalism."
)

coach_assistant = (
    "Voice: High-energy, upbeat, and encouraging, projecting enthusiasm and motivation. "
    "Punctuation: Short, punchy sentences with strategic pauses to maintain excitement and clarity. "
    "Delivery: Fast-paced and dynamic, with rising intonation to build momentum and keep engagement high."
)

themed_character_assistant = (
    "Affect: Deep, commanding, and slightly dramatic, with an archaic and reverent quality that reflects the grandeur of Olde English storytelling. "
    "Tone: Noble, heroic, and formal, capturing the essence of medieval knights and epic quests, while reflecting the antiquated charm of Olde English. "
    "Emotion: Excitement, anticipation, and a sense of mystery, combined with the seriousness of fate and duty. "
    "Pronunciation: Clear, deliberate, and with a slightly formal cadence. "
    "Pause: Pauses after important Olde English phrases such as \"Lo!\" or \"Hark!\" and between clauses like \"Choose thy path\" to add weight to the decision-making process and allow the listener to reflect on the seriousness of the quest."
)
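If you find yourself composing many such instruction strings, a small helper (hypothetical, not part of the SDK) can assemble named delivery traits consistently:

```python
def tts_instructions(**traits: str) -> str:
    # Join named delivery traits into one instructions string, one per line.
    return "\n".join(f"{name.capitalize()}: {text}" for name, text in traits.items())

prompt = tts_instructions(tone="Calm and sincere", pacing="Steady and moderate")
print(prompt)
```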

Our configuration will focus on creating a friendly, warm, and supportive tone that sounds natural when spoken aloud and guides the user through the conversation.

from agents.voice import TTSModelSettings, VoicePipeline, VoicePipelineConfig, SingleAgentVoiceWorkflow, AudioInput
import sounddevice as sd
import numpy as np

# Define custom TTS model settings with the desired instructions
custom_tts_settings = TTSModelSettings(
    instructions=(
        "Personality: upbeat, friendly, persuasive guide. "
        "Tone: Friendly, clear, and reassuring, creating a calm atmosphere and making the listener feel confident and comfortable. "
        "Pronunciation: Clear, articulate, and steady, ensuring each instruction is easily understood while maintaining a natural, conversational flow. "
        "Tempo: Speak relatively fast, include brief pauses before and after questions. "
        "Emotion: Warm and supportive, conveying empathy and care, ensuring the listener feels guided and safe throughout the journey."
    )
)

async def voice_assistant_optimized():
    samplerate = sd.query_devices(kind='input')['default_samplerate']
    voice_pipeline_config = VoicePipelineConfig(tts_settings=custom_tts_settings)

    while True:
        pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(triage_voice_agent), config=voice_pipeline_config)

        # Check for input to either provide voice or exit
        cmd = input("Press Enter to speak your query (or type 'esc' to exit): ")
        if cmd.lower() == "esc":
            print("Exiting...")
            break       
        print("Listening...")
        recorded_chunks = []

         # Start streaming from microphone until Enter is pressed
        with sd.InputStream(samplerate=samplerate, channels=1, dtype='int16', callback=lambda indata, frames, time, status: recorded_chunks.append(indata.copy())):
            input()

        # Concatenate chunks into single buffer
        recording = np.concatenate(recorded_chunks, axis=0)

        # Input the buffer and await the result
        audio_input = AudioInput(buffer=recording)

        with trace("ACME App Optimized Voice Assistant"):
            result = await pipeline.run(audio_input)

         # Transfer the streamed result into chunks of audio
        response_chunks = []
        async for event in result.stream():
            if event.type == "voice_stream_event_audio":
                response_chunks.append(event.data)
        response_audio = np.concatenate(response_chunks, axis=0)

        # Play response
        print("Assistant is responding...")
        sd.play(response_audio, samplerate=samplerate)
        sd.wait()
        print("---")

# Run the voice assistant
await voice_assistant_optimized()
Listening...
Assistant is responding...
---
Listening...
Assistant is responding...
---
Listening...
Assistant is responding...
---
Listening...
Assistant is responding...

Running the code above yields the following responses, which sound far more natural and engaging.

display(Audio("voice_agents_audio/account_balance_response_opti.mp3"))
display(Audio("voice_agents_audio/product_info_response_opti.mp3"))
display(Audio("voice_agents_audio/trending_items_response_opti.mp3"))

...and for something a little less subtle, we can switch to the themed_character_assistant instructions and receive the following responses:

display(Audio("voice_agents_audio/product_info_character.wav"))
display(Audio("voice_agents_audio/product_info_character_2.wav"))

Conclusion

Voilà!

In this guide we've shown how to:

  • Define agents that provide use-case-specific functionality for our in-app voice assistant
  • Use built-in and custom tools with the Responses API to give our agents diverse capabilities, and evaluate their performance with tracing
  • Orchestrate these agents using the Agents SDK
  • Convert the agents from text-based to voice-based interaction using the voice capabilities of the Agents SDK

The Agents SDK enables a modular approach to building voice assistants, letting you develop and evaluate use case by use case, iterating on each individually before converting the workflow from text to voice when you're ready.

We hope this guide has been a useful starting point for building your own in-app voice assistant!