Processing and narrating a video with GPT-4.1-mini's visual capabilities and GPT-4o TTS API

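This notebook demonstrates how to apply GPT's visual capabilities to a video. Although GPT-4.1-mini does not accept video input directly, we can use its vision support and 1M-token context window to describe the static frames of an entire video in a single request. We'll walk through two examples:

1. Using GPT-4.1-mini to get a description of a video
2. Generating a voiceover for a video with the GPT-4o TTS API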

from IPython.display import display, Image, Audio
import cv2  # We're using OpenCV to read video; to install, run: !pip install opencv-python
import base64
import time
from openai import OpenAI
import os
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

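1. Using GPT's visual capabilities to get a description of a video

First, we use OpenCV to extract frames from a nature video containing bison and wolves: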

video = cv2.VideoCapture("data/bison.mp4")
base64Frames = []
while video.isOpened():
    success, frame = video.read()
    if not success:
        break
    # Encode each frame as JPEG, then as a base64 string for the API
    _, buffer = cv2.imencode(".jpg", frame)
    base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
video.release()
print(len(base64Frames), "frames read.")

Display the frames to make sure we've read them in correctly:
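A minimal sketch of one way to do this, assuming the notebook runs in Jupyter/IPython. The helper names `decode_frames` and `play_frames` and the 0.025-second delay are illustrative choices, not something fixed by the libraries used here.

```python
import base64
import time


def decode_frames(base64_frames):
    """Decode base64-encoded JPEG strings back into raw JPEG bytes."""
    return [base64.b64decode(f.encode("utf-8")) for f in base64_frames]


def play_frames(base64_frames, delay=0.025):
    """Show the frames one after another in a single notebook output cell."""
    # Imported lazily so decode_frames also works outside a notebook.
    from IPython.display import display, Image

    display_handle = display(None, display_id=True)
    for jpeg_bytes in decode_frames(base64_frames):
        display_handle.update(Image(data=jpeg_bytes))
        time.sleep(delay)  # hypothetical pacing; adjust to taste
```

Calling `play_frames(base64Frames)` in a notebook cell should then flip through the frames in place.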

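Once we have the video frames, we craft our prompt and send a request to GPT (note that we don't need to send every frame for GPT to understand what's going on):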

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": (
                        "These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video."
                    )
                },
                # Send every 25th frame rather than the full sequence
                *[
                    {
                        "type": "input_image",
                        "image_url": f"data:image/jpeg;base64,{frame}"
                    }
                    for frame in base64Frames[0::25]
                ]
            ]
        }
    ],
)
print(response.output_text)

Witness the raw power and strategy of nature in this intense wildlife encounter captured in stunning detail. A determined pack of wolves surrounds a lone bison on a snowy plain, showcasing the relentless dynamics of predator and prey in the wild. As the wolves close in, the bison stands its ground amidst the swirling snow, illustrating a gripping battle for survival. This rare footage offers an up-close look at the resilience and instincts that govern life in the animal kingdom, making it a must-watch for nature enthusiasts and wildlife lovers alike. Experience the drama, tension, and beauty of this extraordinary moment frozen in time.
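The slice `base64Frames[0::25]` in the request above sends every 25th frame, so the number of images grows with the length of the video. As a hedged alternative, the sketch below caps the total number of frames while still covering the whole clip. The function name `sample_evenly` and the `max_frames` parameter are hypothetical and not part of any library.

```python
def sample_evenly(frames, max_frames):
    """Return at most max_frames items spread evenly across frames."""
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    if len(frames) <= max_frames:
        return list(frames)
    # Fractional step so the picks span the full sequence
    step = len(frames) / max_frames
    return [frames[int(i * step)] for i in range(max_frames)]
```

For example, `sample_evenly(base64Frames, 25)` could replace the fixed-stride slice when the clip length is not known in advance.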

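2. Generating a voiceover for a video with GPT-4.1-mini and the GPT-4o TTS API

Let's create a voiceover for this video in the style of David Attenborough. Using the same video frames, we prompt GPT to give us a short script: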

result = client.responses.create(
    model="gpt-4.1-mini",
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": (
                        "These are frames of a video. Create a short voiceover script in the style of David Attenborough. Only include the narration."
                    )
                },
                # Same sampled frames as in the description request
                *[
                    {
                        "type": "input_image",
                        "image_url": f"data:image/jpeg;base64,{frame}"
                    }
                    for frame in base64Frames[0::25]
                ]
            ]
        }
    ]
)
print(result.output_text)

In the frozen expanse of the winter landscape, a coordinated pack of wolves moves with calculated precision. Their target, a lone bison, is powerful but vulnerable when isolated. The wolves encircle their prey, their numbers overwhelming, displaying the brutal reality of survival in the wild. As the bison struggles to break free, reinforcements from the herd arrive just in time, charging into the pack. A dramatic clash unfolds, where strength meets strategy in the perpetual battle for life. Here, in the heart of nature’s harshest conditions, every moment is a testament to endurance and the delicate balance of predator and prey.
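Both requests above share the same payload shape: one `input_text` item followed by a list of `input_image` items. A small helper can keep that structure in one place. This is only a sketch; `build_vision_input` and its `step` parameter are hypothetical names introduced for illustration.

```python
def build_vision_input(prompt, base64_frames, step=25):
    """Build the `input` list for a request with one prompt and sampled frames."""
    content = [{"type": "input_text", "text": prompt}]
    content += [
        {"type": "input_image", "image_url": f"data:image/jpeg;base64,{frame}"}
        for frame in base64_frames[::step]  # keep every `step`-th frame
    ]
    return [{"role": "user", "content": content}]
```

The result could then be passed as `input=build_vision_input(prompt, base64Frames)` in either call.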

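Now we can use the GPT-4o TTS model and give it a set of instructions describing how the voice should sound. You can experiment with different voices and instructions at OpenAI.fm. We then pass in the script generated earlier with GPT-4.1-mini to produce the voiceover audio: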

instructions = """
Voice Affect: Calm, measured, and warmly engaging; convey awe and quiet reverence for the natural world.
Tone: Inquisitive and insightful, with a gentle sense of wonder and deep respect for the subject matter.
Pacing: Even and steady, with slight lifts in rhythm when introducing a new species or unexpected behavior; natural pauses to allow the viewer to absorb visuals.
Emotion: Subtly emotive—imbued with curiosity, empathy, and admiration without becoming sentimental or overly dramatic.
Emphasis: Highlight scientific and descriptive language (“delicate wings shimmer in the sunlight,” “a symphony of unseen life,” “ancient rituals played out beneath the canopy”) to enrich imagery and understanding.
Pronunciation: Clear and articulate, with precise enunciation and slightly rounded vowels to ensure accessibility and authority.
Pauses: Insert thoughtful pauses before introducing key facts or transitions (“And then... with a sudden rustle...”), allowing space for anticipation and reflection.
"""
# Assign to audio_response only, so the earlier `response` is not overwritten
audio_response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="echo",
    instructions=instructions,
    input=result.output_text,
    response_format="wav"
)
audio_bytes = audio_response.content
Audio(data=audio_bytes)
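`Audio(data=audio_bytes)` only plays the narration inside the notebook. To reuse it elsewhere, for example when muxing it onto the video, the bytes can be written to disk. A minimal sketch, assuming `audio_bytes` holds the WAV data returned above; `save_audio` and the default file name are hypothetical.

```python
from pathlib import Path


def save_audio(audio_bytes, path="narration.wav"):
    """Write raw audio bytes to disk and return the resulting path."""
    out = Path(path)
    out.write_bytes(audio_bytes)
    return out
```

For instance, `save_audio(audio_bytes)` would leave a `narration.wav` file in the working directory.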

April 22, 2025
