April 29, 2025

Comparing Speech-to-Text Methods with the OpenAI API

Overview

This notebook is a clear, practical guide for beginners getting started with speech-to-text (STT) using the OpenAI API. You will explore several practical approaches, their use cases, and their caveats.

By the end, you will be able to choose and use the right transcription method for your use case.

Note: to keep the workflow simple, this notebook works from pre-recorded audio files. Real-time microphone streaming (e.g., from a web app or a microphone) is not enabled; see the microphone sketch at the end of section 4 for one possible starting point.

📊 Quick Overview

| Mode | Latency to first token | Best for (real examples) | Strengths | Key limits |
|------|------------------------|--------------------------|-----------|------------|
| File upload + stream=False (blocking) | seconds | Voicemail, meeting recordings | Simple to set up | No partial results (users see nothing until the file finishes); max 25 MB per request (you must chunk long audio) |
| File upload + stream=True | subseconds | Voice memos in mobile apps | Simple to set up; provides a “live” feel via token streaming | Still requires a completed file; you implement progress bars / chunked uploads |
| Realtime WebSocket | subseconds | Live captions in webinars | True real time; accepts a continuous audio stream | Audio format must be pcm16, g711_ulaw, or g711_alaw; sessions capped at 30 min (reconnect and splice to continue); you format speaker turns yourself to build the full transcript |
| Agents SDK VoicePipeline | subseconds | Internal help-desk assistant | Real-time streaming with easy agentic workflows | Python-only beta; API surface may change |

Installation (one-time)

To set up your environment, uncomment and run the following cell in a fresh Python environment:

!pip install --upgrade -q openai openai-agents websockets sounddevice pyaudio nest_asyncio resampy httpx websocket-client

This installs the packages needed to run the notebook.

Authentication

Before proceeding, make sure your OpenAI API key is set as an environment variable named OPENAI_API_KEY. You can usually set it in your terminal or notebook environment: export OPENAI_API_KEY="your-api-key-here"

Verify that your API key is set correctly by running the next cell.

# ─── Standard Library ──────────────────────────────────────────────────────────
import asyncio
import struct
import base64          # encode raw PCM bytes → base64 before sending JSON
import json            # compose/parse WebSocket messages
import os
import time
from typing import List
from pathlib import Path

# ─── Third-Party ───────────────────────────────────────────────────────────────
import nest_asyncio
import numpy as np
from openai import OpenAI
import resampy         # high-quality sample-rate conversion
import soundfile as sf # reads many audio formats into float32 arrays
import websockets      # asyncio-based WebSocket client
from agents import Agent
from agents.voice import (
    SingleAgentVoiceWorkflow,
    StreamedAudioInput,
    VoicePipeline,
    VoicePipelineConfig,
)
from IPython.display import Audio, display
# ───────────────────────────────────────────────────────────────────────────────
nest_asyncio.apply()

# ✏️  Put your key in an env-var or just replace the call below.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

client = OpenAI(api_key=OPENAI_API_KEY)
print("✅ OpenAI client ready")
✅ OpenAI client ready

1 · Speech-to-Text from an Audio File

model = gpt-4o-transcribe

When to use

  • You have a finished audio file (up to 25 MB). Supported input file types: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
  • Suits batch jobs such as podcasts, recorded support calls, or voice memos.
  • You don't need real-time feedback or partial results.

How it works

Figure: non-streaming STT transcription flow

Strengths

  • Easy to use: a single HTTP request, ideal for automation or backend scripts.
  • Accuracy: the whole audio is processed at once, improving context and transcription quality.
  • File support: handles WAV, MP3, MP4, M4A, FLAC, Ogg, and more.

Limitations

  • No partial results: you must wait for processing to finish before seeing any transcript.
  • Latency grows with length: longer recordings mean longer waits.
  • File size cap: 25 MB max (roughly 13 minutes of 16 kHz, 16-bit mono WAV); chunk longer audio yourself, as sketched below.
  • Offline only: unsuitable for live scenarios such as real-time captions or conversational AI.
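
When a recording exceeds the 25 MB cap, one workaround is to split it into smaller pieces and transcribe each piece with the same blocking endpoint. Here is a minimal sketch using soundfile (imported earlier); the helper name transcribe_long_file and the 10-minute chunk length are illustrative choices rather than part of the API, and note that cutting at arbitrary points can split a word across two chunks:

def transcribe_long_file(path: Path, chunk_seconds: int = 600) -> str:
    """Hypothetical helper: chunk a long recording and transcribe each piece."""
    data, sr = sf.read(path, dtype="float32")
    if data.ndim > 1:                        # down-mix to mono
        data = data.mean(axis=1)
    step = chunk_seconds * sr                # samples per chunk (~19 MB at 16 kHz mono)
    pieces = []
    for i, off in enumerate(range(0, len(data), step)):
        tmp = Path(f"chunk_{i}.wav")         # temporary chunk on disk
        sf.write(tmp, data[off:off + step], sr)
        with tmp.open("rb") as f:
            pieces.append(
                client.audio.transcriptions.create(
                    file=f, model="gpt-4o-transcribe", response_format="text"
                )
            )
        tmp.unlink()                         # clean up the temporary file
    return " ".join(pieces)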

Let's start by previewing the audio file. I downloaded the audio file from here.

AUDIO_PATH = Path('./data/sample_audio_files/lotsoftimes-78085.mp3')  # change me
MODEL_NAME = "gpt-4o-transcribe"

if AUDIO_PATH.exists():
    display(Audio(str(AUDIO_PATH)))
else:
    print('⚠️ Provide a valid audio file')

Now we can call the STT endpoint to transcribe the audio.

if AUDIO_PATH.exists():
    with AUDIO_PATH.open('rb') as f:
        transcript = client.audio.transcriptions.create(
            file=f,
            model=MODEL_NAME,
            response_format='text',
        )
    print('\n--- TRANSCRIPT ---\n')
    print(transcript)
--- TRANSCRIPT ---

And lots of times you need to give people more than one link at a time. A band could give their fans a couple new videos from a live concert, a behind-the-scenes photo gallery, an album to purchase, like these next few links.

2 · Audio File to Text: Streaming

model = gpt-4o-transcribe

When to use

  • You already have a fully recorded audio file.
  • You want transcription results (partial or final) as they are produced.
  • Scenarios where partial feedback improves the user experience, such as uploading a long voice memo.

How it works

Figure: streaming STT transcription flow

Strengths

  • Real-time feel: users see transcript updates almost immediately.
  • Progress visibility: intermediate text shows how far along the job is.
  • Better UX: instant feedback keeps users engaged.

Limitations

  • Requires the complete audio file up front: not suitable for live audio streams.
  • Implementation overhead: you handle the streaming logic and progress updates yourself.

if AUDIO_PATH.exists():
    with AUDIO_PATH.open('rb') as f:
        stream = client.audio.transcriptions.create(
            file=f,
            model=MODEL_NAME,
            response_format='text',
            stream=True,
        )

    for event in stream:
        # Incremental update: `event.delta` carries the newly transcribed text
        if getattr(event, "delta", None):
            print(event.delta, end="", flush=True)
            time.sleep(0.05)  # simulate real-time pacing

        # Completion event: `event.text` carries the final transcript
        elif getattr(event, "text", None):
            print()
            print("\n" + event.text)
And lots of times you need to give people more than one link at a time. A band could give their fans a couple new videos from a live concert, a behind-the-scenes photo gallery, an album to purchase, like these next few links.

And lots of times you need to give people more than one link at a time. A band could give their fans a couple new videos from a live concert, a behind-the-scenes photo gallery, an album to purchase, like these next few links.

3 · Realtime Transcription API

model = gpt-4o-transcribe

When to use

  • Live captions for real-time scenarios (e.g., meetings, presentations).
  • You need built-in voice activity detection, noise suppression, or token-level log probabilities.
  • You're comfortable handling WebSockets and live event streams.

How it works

Figure: realtime transcription flow

Strengths

  • Ultra-low latency: typically 300–800 ms, for near-instant transcription.
  • Dynamic updates: partial and final transcripts improve the user experience.
  • Advanced features: built-in turn detection, noise reduction, and optional detailed log probabilities (see the session sketch below).
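
To enable those extras, you extend the transcription_session.update payload sent at session start. Below is a hedged variant of the _session helper defined later in this section; the input_audio_noise_reduction and include fields follow the Realtime transcription API docs, but verify the exact field names and values against the current reference before relying on them:

def _session_with_extras(model: str, vad: float = 0.5) -> dict:
    # Variant of `_session` (defined below) with noise reduction and
    # token-level log probabilities switched on.
    return {
        "type": "transcription_session.update",
        "session": {
            "input_audio_format": "pcm16",
            "turn_detection": {"type": "server_vad", "threshold": vad},
            "input_audio_transcription": {"model": model},
            # Assumed values: "near_field" (headset) or "far_field" (room mic).
            "input_audio_noise_reduction": {"type": "near_field"},
            # Ask the server to attach logprobs to completed transcription items.
            "include": ["item.input_audio_transcription.logprobs"],
        },
    }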

Limitations

  • Complex integration: you manage WebSockets, Base64 encoding, and robust error handling.
  • Session limit: each session is capped at 30 minutes (see the reconnect-and-splice sketch after this section's example).
  • Restricted formats: raw PCM only (no MP3 or Opus); for pcm16, input audio must be 16-bit PCM at a 24 kHz sample rate, mono, little-endian.

TARGET_SR     = 24_000
PCM_SCALE     = 32_767
CHUNK_SAMPLES = 3_072                 # ≈128 ms at 24 kHz
RT_URL        = "wss://api.openai.com/v1/realtime?intent=transcription"

EV_DELTA      = "conversation.item.input_audio_transcription.delta"
EV_DONE       = "conversation.item.input_audio_transcription.completed"
# ── helpers ────────────────────────────────────────────────────────────────
def float_to_16bit_pcm(float32_array):
    """Clip to [-1, 1] and pack as little-endian 16-bit PCM bytes."""
    clipped = [max(-1.0, min(1.0, x)) for x in float32_array]
    pcm16 = b''.join(struct.pack('<h', int(x * PCM_SCALE)) for x in clipped)
    return pcm16

def base64_encode_audio(float32_array):
    pcm_bytes = float_to_16bit_pcm(float32_array)
    encoded = base64.b64encode(pcm_bytes).decode('ascii')
    return encoded

def load_and_resample(path: str, sr: int = TARGET_SR) -> np.ndarray:
    """Return mono PCM-16 as a NumPy array."""
    data, file_sr = sf.read(path, dtype="float32")
    if data.ndim > 1:
        data = data.mean(axis=1)
    if file_sr != sr:
        data = resampy.resample(data, file_sr, sr)
    return data

async def _send_audio(ws, pcm: np.ndarray, chunk: int, sr: int) -> None:
    """Producer: stream base-64 chunks at real-time pace, then signal EOF."""
    dur = chunk / sr              # seconds of audio per chunk → real-time pacing
    t_next = time.monotonic()

    for i in range(0, len(pcm), chunk):
        float_chunk = pcm[i:i + chunk]
        payload = {
            "type":  "input_audio_buffer.append",
            "audio": base64_encode_audio(float_chunk),
        }
        await ws.send(json.dumps(payload))
        t_next += dur
        await asyncio.sleep(max(0, t_next - time.monotonic()))

    await ws.send(json.dumps({"type": "input_audio_buffer.end"}))

async def _recv_transcripts(ws, collected: List[str]) -> None:
    """
    Consumer: build `current` from streaming deltas, promote it to `collected`
    whenever a …completed event arrives, and flush the remainder on socket
    close so no words are lost.
    """
    current: List[str] = []

    try:
        async for msg in ws:
            ev = json.loads(msg)

            typ = ev.get("type")
            if typ == EV_DELTA:
                delta = ev.get("delta")
                if delta:
                    current.append(delta)
                    print(delta, end="", flush=True)
            elif typ == EV_DONE:
                # sentence finished → move to permanent list
                collected.append("".join(current))
                current.clear()
    except websockets.ConnectionClosedOK:
        pass

    # socket closed → flush any remaining partial sentence
    if current:
        collected.append("".join(current))

def _session(model: str, vad: float = 0.5) -> dict:
    return {
        "type": "transcription_session.update",
        "session": {
            "input_audio_format": "pcm16",
            "turn_detection": {"type": "server_vad", "threshold": vad},
            "input_audio_transcription": {"model": model},
        },
    }

async def transcribe_audio_async(
    wav_path,
    api_key,
    *,
    model: str = MODEL_NAME,
    chunk: int = CHUNK_SAMPLES,
) -> str:
    pcm = load_and_resample(wav_path)
    headers = {"Authorization": f"Bearer {api_key}", "OpenAI-Beta": "realtime=v1"}

    async with websockets.connect(RT_URL, additional_headers=headers, max_size=None) as ws:
        await ws.send(json.dumps(_session(model)))

        transcripts: List[str] = []
        await asyncio.gather(
            _send_audio(ws, pcm, chunk, TARGET_SR),
            _recv_transcripts(ws, transcripts),
        )  # returns when server closes

    return " ".join(transcripts)
transcript = await transcribe_audio_async(AUDIO_PATH, OPENAI_API_KEY)
transcript
And lots of times you need to give people more than one link at a time.A band could give their fans a couple new videos from a live concert, a behind-the-scenes photo galleryLike these next few linksAn album to purchase.
'And lots of times you need to give people more than one link at a time. A band could give their fans a couple new videos from a live concert, a behind-the-scenes photo gallery Like these next few linksAn album to purchase. '
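
For audio longer than the 30-minute session cap, one reconnect-and-splice approach is to slice the recording, run each slice through transcribe_audio_async in its own WebSocket session, and join the partial transcripts. A minimal sketch, assuming an illustrative 25-minute slice size to stay under the cap; the wrapper name is hypothetical, and a word falling exactly on a slice boundary may be split:

async def transcribe_long_realtime(path, api_key, slice_seconds: int = 25 * 60) -> str:
    """Hypothetical wrapper: span long audio across multiple realtime sessions."""
    pcm = load_and_resample(path)             # mono float32 at 24 kHz
    step = slice_seconds * TARGET_SR          # samples per session
    parts = []
    for i, off in enumerate(range(0, len(pcm), step)):
        tmp = Path(f"rt_slice_{i}.wav")       # temporary slice on disk
        sf.write(tmp, pcm[off:off + step], TARGET_SR)
        parts.append(await transcribe_audio_async(tmp, api_key))  # fresh session
        tmp.unlink()
    return " ".join(parts)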

4 · Agents SDK Realtime Transcription

model = gpt-4o-transcribe (STT), gpt-4o-mini (agent)

When to use

  • You want realtime transcription and synthesis with minimal setup, built on the OpenAI Agents SDK.
  • You want transcription wired directly into agent-driven workflows.
  • You prefer high-level management of audio input/output, WebSockets, and buffering.

How it works

Figure: Agents SDK transcription flow

Strengths

  • Minimal boilerplate: VoicePipeline handles resampling, voice activity detection, buffering, token auth, and reconnection.
  • Seamless agent integration: talk to GPT agents directly over live audio transcription.

Limitations

  • Python-only beta: not yet available in other languages; the API may change.
  • Less control: fine-tuning VAD thresholds or packet pacing means digging into SDK internals.

# ── 1 · agent that replies in French ---------------------------------------
fr_agent = Agent(
    name="Assistant-FR",
    instructions="Translate the user's words into French.",
    model="gpt-4o-mini",
)

# ── 2 · workflow that PRINTS what it yields --------------------------------
class PrintingWorkflow(SingleAgentVoiceWorkflow):
    """Subclass that prints every chunk it yields (the agent's reply)."""

    async def run(self, transcription: str):
        # Optionally: also print the user transcription
        print()
        print("[User]:", transcription)
        print("[Assistant]: ", end="", flush=True)
        async for chunk in super().run(transcription):
            print(chunk, end="", flush=True)   # <-- agent (French) text
            yield chunk                        # still forward to TTS


pipeline = VoicePipeline(
    workflow=PrintingWorkflow(fr_agent),
    stt_model=MODEL_NAME,
    config=VoicePipelineConfig(tracing_disabled=True),
)

# ── 3 · helper to stream ~40 ms chunks at 24 kHz ---------------------------
# (`load_and_resample` is reused from section 3 above)

def audio_chunks(path: str, target_sr: int = 24_000, chunk_ms: int = 40):
    # 1️⃣ reuse the helper
    audio = load_and_resample(path, target_sr)

    # 2️⃣ float-32 → int16 NumPy array
    pcm = (np.clip(audio, -1, 1) * 32_767).astype(np.int16)

    # 3️⃣ yield real-time sized hops
    hop = int(target_sr * chunk_ms / 1_000)
    for off in range(0, len(pcm), hop):
        yield pcm[off : off + hop]

# ── 4 · stream the file ----------------------------------------------------
async def stream_audio(path: str):
    sai = StreamedAudioInput()
    run_task = asyncio.create_task(pipeline.run(sai))

    for chunk in audio_chunks(path):
        await sai.add_audio(chunk)
        await asyncio.sleep(len(chunk) / 24_000)   # real-time pacing

    # just stop pushing; session ends automatically
    await run_task        # wait for pipeline to finish
await stream_audio(AUDIO_PATH)
[User]: And lots of times you need to give people more than one link at a time.
[Assistant]: Et souvent, vous devez donner aux gens plusieurs liens à la fois.
[User]: A band could give their fans a couple new videos from a live concert, a behind-the-scenes photo gallery.
[Assistant]: Un groupe pourrait donner à ses fans quelques nouvelles vidéos d'un concert live, ainsi qu'une galerie de photos des coulisses.
[User]: An album to purchase.
[Assistant]: 
Un album à acheter.
[User]: like these next few links.
[Assistant]: comme ces quelques liens suivants.
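
This notebook deliberately streams from a file, but the same pipeline can in principle be fed live microphone audio via the sounddevice package installed at the top. A minimal sketch, assuming a default input device; stream_microphone is an illustrative helper rather than part of the SDK, and the blocking reads are pushed to a worker thread so the pipeline task keeps running:

import sounddevice as sd

async def stream_microphone(seconds: float = 10.0):
    """Hypothetical variant of stream_audio that captures live microphone audio."""
    sai = StreamedAudioInput()
    run_task = asyncio.create_task(pipeline.run(sai))

    frames = int(24_000 * 0.04)                       # ~40 ms per block at 24 kHz
    with sd.InputStream(samplerate=24_000, channels=1, dtype="int16") as mic:
        for _ in range(int(seconds / 0.04)):
            # mic.read blocks, so run it in a thread to avoid starving the loop
            block, _ = await asyncio.to_thread(mic.read, frames)
            await sai.add_audio(block[:, 0].copy())   # mono int16 samples

    await run_task                                    # wait for the pipeline to finish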

Conclusion

In this notebook you explored several ways to turn speech into text with the OpenAI API and the Agents SDK, from simple file uploads to fully interactive realtime streaming. Each workflow shines in a different scenario, so pick the one that best matches your product's needs.

Key takeaways

  • Match the method to your use case:
    • Offline batch jobs → file-based transcription.
    • Near-real-time updates → HTTP streaming.
    • Conversational, low-latency experiences → WebSocket or the Agents SDK.
  • Weigh the trade-offs: latency, implementation effort, supported formats, and session limits all vary by method.
  • Stay current: models and SDKs keep improving, and new features ship regularly.

Next steps

  1. Experiment with this notebook!
  2. Integrate the workflow of your choice into your application.
  3. Send us feedback! Community input helps drive the next round of model improvements.