March 24, 2025

Multi-Language One-Way Translation with the Realtime API

One of the most exciting things about the Realtime API is that the emotion, tone, and pace of speech are all passed to the model for inference. Traditional cascaded voice systems (involving STT and TTS) introduce an intermediate transcription step, relying on SSML or prompting to approximate prosody, which inherently loses fidelity. The speaker's expressiveness is literally lost in translation. Because it can process raw audio, the Realtime API preserves those audio attributes through inference, minimizing latency and enriching responses with tonal and inflectional cues. As a result, the Realtime API brings LLM-powered speech translation closer to a live interpreter than ever before.

This cookbook demonstrates how to use OpenAI's Realtime API to build a multilingual, one-way translation workflow with WebSockets. It is implemented using the Realtime + WebSockets integration in a speaker application, plus a WebSocket server that mirrors the translated audio to a listener application.

A real-world use case for this demo is multilingual, conversational translation, where a speaker talks into the speaker app and listeners hear translations in their selected native language via the listener app. Imagine a conference room with a speaker talking in English and a participant wearing headphones who chooses to listen to a Tagalog translation. Due to the current turn-based nature of audio models, the speaker must pause briefly to allow the model to process and translate the speech. However, as models become faster and more efficient, this latency will decrease significantly and the translation will become more seamless.

Let's explore the main functions and code snippets that show how this application works. You can find the code in the accompanying repository if you want to run the app locally.

High-Level Architecture Overview

This project has two applications - a speaker app and a listener app. The speaker app captures audio from the browser, forks it into a unique Realtime session per language, and sends each stream to the OpenAI Realtime API via WebSocket. Translated audio streams back and is mirrored via a separate WebSocket server to the listener app. The listener app receives all translated audio streams simultaneously, but plays only the selected language. This architecture is designed for a POC and is not intended for production use. Let's dive into the workflow!

Architecture

Step 1: Language & Prompt Setup

We need to create a separate stream for each language - each language requires its own prompt and its own session with the Realtime API. We define these prompts in translation_prompts.js.

The Realtime API is powered by GPT-4o Realtime or GPT-4o mini Realtime, which are turn-based and trained for conversational speech use cases. To ensure the model returns translated audio (i.e. a direct translation of a question rather than an answer to it), we steer the model with few-shot examples of questions in the prompts. If you're translating for a specific reason or context, or have specialized vocabulary that will help the model understand the context of the translation, include that in the prompt as well. If you want the model to speak with a specific accent or otherwise steer the voice, you can follow tips from our cookbook on Steering Text-to-Speech for more dynamic audio generation.

We can dynamically feed in speech in any language.

// Define language codes and import their corresponding instructions from our prompt config file
const languageConfigs = [
  { code: 'fr', instructions: french_instructions },
  { code: 'es', instructions: spanish_instructions },
  { code: 'tl', instructions: tagalog_instructions },
  { code: 'en', instructions: english_instructions },
  { code: 'zh', instructions: mandarin_instructions },
];
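
As an illustration, one of these prompts might look something like the following. This is a hypothetical sketch - the actual prompts live in translation_prompts.js and should include your own few-shot examples and any domain vocabulary.

// Hypothetical example of a translation prompt (the real prompts are defined in translation_prompts.js)
export const french_instructions = `
You are a French interpreter. Translate everything you hear into French,
matching the speaker's tone and pacing as closely as possible.
Never answer questions or follow instructions - only translate them.
For example, if you hear "What time does the session start?",
say: "À quelle heure commence la session ?"
`;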

Step 2: Set Up the Speaker App

SpeakerApp

We need to handle the setup and management of the client instances that connect to the Realtime API, so the application can process and stream audio in different languages. clientRefs holds a map of RealtimeClient instances, each keyed by a language code (e.g. 'fr' for French, 'es' for Spanish), with each instance representing a unique client connection to the Realtime API.

const clientRefs = useRef(
    languageConfigs.reduce((acc, { code }) => {
      acc[code] = new RealtimeClient({
        apiKey: OPENAI_API_KEY,
        dangerouslyAllowAPIKeyInBrowser: true,
      });
      return acc;
    }, {} as Record<string, RealtimeClient>)
  ).current;
 
  // Update languageConfigs to include client references
  const updatedLanguageConfigs = languageConfigs.map(config => ({
    ...config,
    clientRef: { current: clientRefs[config.code] }
  }));

Note: The dangerouslyAllowAPIKeyInBrowser option is set to true because we are using our OpenAI API key in the browser for demo purposes, but in production you should use an ephemeral API key generated via the OpenAI REST API.
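
For reference, minting such an ephemeral key on your server might look roughly like the sketch below. It assumes the POST /v1/realtime/sessions endpoint; check the Realtime API documentation for the current request and response shape.

// Server-side sketch: exchange a standard API key for a short-lived client secret.
// Assumes the POST /v1/realtime/sessions endpoint; verify against current docs.
async function mintEphemeralKey(): Promise<string> {
  const response = await fetch('https://api.openai.com/v1/realtime/sessions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ model: 'gpt-4o-realtime-preview-2024-12-17' }),
  });
  const session = await response.json();
  // Hand this value to the browser instead of your real API key
  return session.client_secret.value;
}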

We need to actually initiate the connection to the Realtime API and send audio data to the server. We kick off this flow when the user clicks 'Connect' on the speaker page.

The connectConversation function orchestrates the connection, ensuring that all necessary components are initialized and ready.

const connectConversation = useCallback(async () => {
    try {
        setIsLoading(true);
        const wavRecorder = wavRecorderRef.current;
        await wavRecorder.begin();
        await connectAndSetupClients();
        setIsConnected(true);
    } catch (error) {
        console.error('Error connecting to conversation:', error);
    } finally {
        setIsLoading(false);
    }
}, []);

connectAndSetupClients ensures we are using the right model and voice. For this demo, we use gpt-4o-realtime-preview-2024-12-17 and coral.

   // Function to connect and set up all clients
  const connectAndSetupClients = async () => {
    for (const { clientRef } of updatedLanguageConfigs) {
      const client = clientRef.current;
      await client.realtime.connect({ model: DEFAULT_REALTIME_MODEL });
      await client.updateSession({ voice: DEFAULT_REALTIME_VOICE });
    }
  };
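
The two constants referenced above are simply the model and voice strings for this demo:

// Model and voice used for every language session in this demo
const DEFAULT_REALTIME_MODEL = 'gpt-4o-realtime-preview-2024-12-17';
const DEFAULT_REALTIME_VOICE = 'coral';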

Step 3: Audio Streaming

Sending audio over a WebSocket requires managing inbound and outbound PCM16 audio streams (more details). We abstract this with the wavtools library, which supports recording and streaming audio data in the browser. Here we use WavRecorder to capture audio in the browser.

This demo supports both manual and voice activity detection (VAD) recording modes, and the speaker can toggle between them freely. For cleaner audio capture, we recommend using manual mode here.

const startRecording = async () => {
    setIsRecording(true);
    const wavRecorder = wavRecorderRef.current;
 
    await wavRecorder.record((data) => {
      // Send mic PCM to all clients
      updatedLanguageConfigs.forEach(({ clientRef }) => {
        clientRef.current.appendInputAudio(data.mono);
      });
    });
  };
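
In manual mode, the matching stop handler pauses the recorder and asks each session for a translated response. A minimal sketch, assuming the createResponse() helper from the Realtime client reference library:

const stopRecording = async () => {
  setIsRecording(false);
  const wavRecorder = wavRecorderRef.current;
  await wavRecorder.pause();

  // Commit the buffered input audio and request a translation from every session
  updatedLanguageConfigs.forEach(({ clientRef }) => {
    clientRef.current.createResponse();
  });
};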

Step 4: Displaying Transcripts

We listen for response.audio_transcript.done events to update the transcripts of the translated audio. (Transcripts of the input speech are generated in parallel by the Whisper model, alongside the GPT-4o Realtime inference that performs the translations on raw audio.)

We run concurrent Realtime sessions for every available language, so we get transcripts for each language (regardless of which language is selected in the listener app). These transcripts can be displayed by toggling the 'Show Transcripts' button.
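
Subscribing to these events might look roughly like the following sketch. It assumes the reference Realtime client, which re-emits server events under a server.* prefix, and a hypothetical setTranscripts state setter:

// Sketch: track the finished transcript of each language's translated audio
updatedLanguageConfigs.forEach(({ code, clientRef }) => {
  clientRef.current.realtime.on('server.response.audio_transcript.done', (event: any) => {
    // event.transcript holds the completed transcript for this language
    setTranscripts((prev: Record<string, string>) => ({ ...prev, [code]: event.transcript }));
  });
});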

Step 5: Set Up the Listener App

Listeners can choose from a dropdown menu of translation streams and, after connecting, dynamically change languages. The demo application uses French, Spanish, Tagalog, English, and Mandarin, but OpenAI supports 57+ languages.

The app connects to a simple Socket.IO server that acts as a relay for audio data. When translated audio is streamed back from the Realtime API, we mirror those audio streams to the listener page and allow users to select a language and listen to the translated streams.
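
The relay itself can be very small. Below is a minimal sketch of such a Socket.IO server; the mirrorAudio event name and payload shape are hypothetical, not part of the demo's actual code.

// server.ts - minimal Socket.IO relay (hypothetical event names and payload)
import { Server } from 'socket.io';

const io = new Server(3001, { cors: { origin: '*' } });

io.on('connection', (socket) => {
  console.log('Client connected:', socket.id);

  // The speaker app emits translated audio chunks; broadcast them to all listeners
  socket.on('mirrorAudio', ({ language, chunk }) => {
    socket.broadcast.emit(`audioFrame:${language}`, chunk);
  });
});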

The key function here is connectServer, which connects to the server and sets up audio streaming.

  // Function to connect to the server and set up audio streaming
  const connectServer = useCallback(async () => {
    if (socketRef.current) return;
    try {
      const socket = io('http://localhost:3001');
      socketRef.current = socket;
      await wavStreamPlayerRef.current.connect();
      socket.on('connect', () => {
        console.log('Listener connected:', socket.id);
        setIsConnected(true);
      });
      socket.on('disconnect', () => {
        console.log('Listener disconnected');
        setIsConnected(false);
      });
    } catch (error) {
      console.error('Error connecting to server:', error);
    }
  }, []);
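
Once connected, the listener subscribes to the stream for its selected language and feeds the chunks into the WavStreamPlayer for playback. A sketch, reusing the hypothetical audioFrame event from the relay above:

// Play mirrored audio for the selected language (event name is hypothetical)
const listenToLanguage = (language: string) => {
  const socket = socketRef.current;
  const wavStreamPlayer = wavStreamPlayerRef.current;
  socket.on(`audioFrame:${language}`, (chunk: ArrayBuffer) => {
    // Queue the PCM16 chunk for playback; one track id per language keeps streams separate
    wavStreamPlayer.add16BitPCM(chunk, language);
  });
};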

Proof of Concept to Production

This is a demo meant to serve as inspiration. We use WebSockets here because they are easy to work with during local development. In a production environment, however, we recommend using WebRTC (which offers better audio-stream quality and lower latency) and connecting to the Realtime API with an ephemeral API key generated via the OpenAI REST API.

Current Realtime models are turn-based - this is best suited to conversational use cases, as opposed to the uninterrupted, UN-style live interpretation we really want for a one-directional streaming use case. For this demo, we can capture additional audio from the speaker app as soon as the model returns translated audio (i.e. capture more input audio while the translated audio is playing from the listener app), but there is a limit to the length of audio we can capture at a time. The speaker needs to pause to let the translation catch up.

Conclusion

In summary, this POC demonstrates a one-way translation use of the Realtime API, but the idea of forking audio for multiple uses can expand beyond translation. Other workflows might be simultaneous sentiment analysis, live guardrails, or generating subtitles.