Apr 24, 2024

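Batch processing with the Batch API

The new Batch API lets you create asynchronous batch jobs at a lower price and with higher rate limits.

Batch jobs are completed within 24 hours, but may be processed sooner depending on global usage.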

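Ideal use cases for the Batch API include:

  • Tagging, captioning, or enriching content on a marketplace or blog
  • Categorizing support tickets and suggesting answers
  • Performing sentiment analysis on large datasets of customer feedback
  • Generating summaries or translations for collections of documents or articles

and much more!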

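This guide walks you through how to use the Batch API with a couple of practical examples.

We will start with an example that categorizes movies using gpt-4o-mini, and then cover how to use this model's vision capabilities to caption images.

Please note that multiple models are available through the Batch API, and that you can use the same parameters in your Batch API calls as with the Chat Completions endpoint.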

# Make sure you have the latest version of the SDK available to use the Batch API
%pip install openai --upgrade
import json
from openai import OpenAI
import pandas as pd
from IPython.display import Image, display
# Initializing OpenAI client - see https://platform.openai.com/docs/quickstart?context=python
client = OpenAI()

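First example: Categorizing movies

In this example, we will use gpt-4o-mini to extract movie categories from a description of the movie. We will also extract a 1-sentence summary from this description.

We will use JSON mode to extract the categories as an array of strings and the 1-sentence summary in a structured format.

For each movie, we want to get a result that looks like this: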

{
    categories: ['category1', 'category2', 'category3'],
    summary: '1-sentence summary'
}
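
As a minimal sketch, the raw string returned by the model could be parsed and checked against this shape before it is used downstream. The helper name `parse_movie_result` is hypothetical and not part of the original notebook. JSON mode is meant to yield valid JSON, but it does not by itself enforce which keys are present, so the check below is an assumption about what a caller might want to verify.

```python
import json

def parse_movie_result(raw: str) -> dict:
    """Parse a JSON-mode response and check that it has the expected shape."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    categories = data.get("categories")
    summary = data.get("summary")
    # 'categories' should be an array of strings
    if not isinstance(categories, list) or not all(isinstance(c, str) for c in categories):
        raise ValueError("'categories' must be a list of strings")
    # 'summary' should be a single string
    if not isinstance(summary, str):
        raise ValueError("'summary' must be a string")
    return {"categories": categories, "summary": summary}

# Illustrative input, shaped like the result described above
example = '{"categories": ["crime", "drama"], "summary": "An aging crime lord hands over his empire to his hesitant son."}'
parsed = parse_movie_result(example)
print(parsed["categories"])
```

A response that fails this check can be logged and retried instead of being silently stored.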

Loading data

We will use the IMDB top 1000 movies dataset for this example.

dataset_path = "data/imdb_top_1000.csv"

df = pd.read_csv(dataset_path)
df.head()
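| | Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28,341,469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134,966,411 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havoc on Gotham... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534,858,444 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57,300,000 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarriage of justice... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4,360,000 |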

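Processing step

Here, we will prepare our requests by first trying them out with the Chat Completions endpoint.

Once we're happy with the results, we can move on to creating the batch file.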

categorize_system_prompt = '''
Your goal is to extract movie categories from movie descriptions, as well as a 1-sentence summary for these movies.
You will be provided with a movie description, and you will output a json object containing the following information:

{
    categories: string[] // Array of categories based on the movie description,
    summary: string // 1-sentence summary of the movie based on the movie description
}

Categories refer to the genre or type of the movie, like "action", "romance", "comedy", etc. Keep category names simple and use only lower case letters.
Movies can have several categories, but try to keep it under 3-4. Only mention the categories that are the most obvious based on the description.
'''

def get_categories(description):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.1,
        # This is to enable JSON mode, making sure responses are valid json objects
        response_format={
            "type": "json_object"
        },
        messages=[
            {
                "role": "system",
                "content": categorize_system_prompt
            },
            {
                "role": "user",
                "content": description
            }
        ],
    )

    return response.choices[0].message.content
# Testing on a few examples
for _, row in df[:5].iterrows():
    description = row['Overview']
    title = row['Series_Title']
    result = get_categories(description)
    print(f"TITLE: {title}\nOVERVIEW: {description}\n\nRESULT: {result}")
    print("\n\n----------------------------\n\n")
TITLE: The Shawshank Redemption
OVERVIEW: Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.

RESULT: {
    "categories": ["drama"],
    "summary": "Two imprisoned men develop a deep bond over the years, ultimately finding redemption through their shared acts of kindness."
}


----------------------------


TITLE: The Godfather
OVERVIEW: An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.

RESULT: {
    "categories": ["crime", "drama"],
    "summary": "An aging crime lord hands over his empire to his hesitant son."
}


----------------------------


TITLE: The Dark Knight
OVERVIEW: When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.

RESULT: {
    "categories": ["action", "thriller", "superhero"],
    "summary": "Batman faces a formidable challenge as the Joker unleashes chaos on Gotham City."
}


----------------------------


TITLE: The Godfather: Part II
OVERVIEW: The early life and career of Vito Corleone in 1920s New York City is portrayed, while his son, Michael, expands and tightens his grip on the family crime syndicate.

RESULT: {
    "categories": ["crime", "drama"],
    "summary": "The film depicts the early life of Vito Corleone and the rise of his son Michael within the family crime syndicate in 1920s New York City."
}


----------------------------


TITLE: 12 Angry Men
OVERVIEW: A jury holdout attempts to prevent a miscarriage of justice by forcing his colleagues to reconsider the evidence.

RESULT: {
    "categories": ["drama", "thriller"],
    "summary": "A jury holdout fights to ensure justice is served by challenging his fellow jurors to reevaluate the evidence."
}


----------------------------


Creating the batch file

The batch file, in the jsonl format, should contain one line (json object) per request. Each request is defined as such:

{
    "custom_id": <REQUEST_ID>,
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": <MODEL>,
        "messages": <MESSAGES>,
        // other parameters
    }
}

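Note: the request ID should be unique per batch. Because results may not come back in the same order as the input file, this ID is what you use to match each result to its original request.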
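
The request shape and the uniqueness requirement above can be sketched with two small helpers. `build_task` and `find_duplicate_ids` are hypothetical names introduced here for illustration; they only assemble and inspect plain dictionaries and do not call the API.

```python
import json

def build_task(custom_id: str, body: dict) -> dict:
    """Wrap a Chat Completions request body in the per-line batch request format."""
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": body,
    }

def find_duplicate_ids(tasks: list) -> list:
    """Return the custom_id values that appear more than once, in first-seen order."""
    seen, duplicates = set(), []
    for task in tasks:
        custom_id = task["custom_id"]
        if custom_id in seen and custom_id not in duplicates:
            duplicates.append(custom_id)
        seen.add(custom_id)
    return duplicates

# Illustrative tasks; the message content is placeholder text
sample_tasks = [
    build_task(
        f"task-{i}",
        {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"description {i}"}],
        },
    )
    for i in range(3)
]

# Each task serializes to a single line, which is what the jsonl file needs
print(json.dumps(sample_tasks[0]))
print(find_duplicate_ids(sample_tasks))
```

Running the duplicate check before writing the file is a cheap way to catch ID collisions, for instance when the dataframe index is not unique.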

# Creating an array of json tasks

tasks = []

for index, row in df.iterrows():
    
    description = row['Overview']
    
    task = {
        "custom_id": f"task-{index}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            # This is what you would have in your Chat Completions API call
            "model": "gpt-4o-mini",
            "temperature": 0.1,
            "response_format": { 
                "type": "json_object"
            },
            "messages": [
                {
                    "role": "system",
                    "content": categorize_system_prompt
                },
                {
                    "role": "user",
                    "content": description
                }
            ],
        }
    }
    
    tasks.append(task)
# Creating the file

file_name = "data/batch_tasks_movies.jsonl"

with open(file_name, 'w') as file:
    for obj in tasks:
        file.write(json.dumps(obj) + '\n')
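# Uploading the file (the context manager closes the handle once the upload returns)
with open(file_name, "rb") as file:
    batch_file = client.files.create(
        file=file,
        purpose="batch"
    )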
print(batch_file)
FileObject(id='file-lx16f1KyIxQ2UHVvkG3HLfNR', bytes=1127310, created_at=1721144107, filename='batch_tasks_movies.jsonl', object='file', purpose='batch', status='processed', status_details=None)
batch_job = client.batches.create(
  input_file_id=batch_file.id,
  endpoint="/v1/chat/completions",
  completion_window="24h"
)

Checking batch status

Note: this can take up to 24 hours, but it will usually be completed sooner.

You can keep checking until the status is 'completed'.

batch_job = client.batches.retrieve(batch_job.id)
print(batch_job)
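
The single status check above can be wrapped in a polling loop. The following is a minimal sketch rather than the notebook's own method: `wait_for_batch` is a hypothetical helper, the polling interval is an arbitrary choice, and the set of terminal statuses is an assumption to verify against the Batch API reference. The retrieve function is passed in as an argument so the loop can be exercised without network access.

```python
import time

# Statuses after which the batch will not change any further (assumed set)
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(retrieve, batch_id, poll_seconds=60.0, max_polls=1440, sleep=time.sleep):
    """Call retrieve(batch_id) until the batch reaches a terminal status.

    retrieve: a callable such as client.batches.retrieve
    Returns the last retrieved batch object, or raises TimeoutError.
    """
    for _ in range(max_polls):
        batch = retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        sleep(poll_seconds)
    raise TimeoutError(f"batch {batch_id} did not finish after {max_polls} polls")

# Usage with the client defined earlier (not executed here):
# batch_job = wait_for_batch(client.batches.retrieve, batch_job.id)
# if batch_job.status != "completed":
#     raise RuntimeError(f"batch ended with status {batch_job.status}")
```

Checking the final status matters because a batch that failed or expired also ends the loop.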
# Retrieving result file (output_file_id is only populated once the batch has finished processing)
result_file_id = batch_job.output_file_id
result = client.files.content(result_file_id).content
result_file_name = "data/batch_job_results_movies.jsonl"

with open(result_file_name, 'wb') as file:
    file.write(result)
# Loading data from saved file
results = []
with open(result_file_name, 'r') as file:
    for line in file:
        # Parsing the JSON string into a dict and appending to the list of results
        json_object = json.loads(line.strip())
        results.append(json_object)

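Reading results

Reminder: the results are not in the same order as in the input file. Make sure to check the custom_id to match the results against the input requests.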
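
One way to make that matching explicit is to index the parsed output lines by `custom_id` first. The sketch below assumes each output line carries `custom_id`, `response` (with `status_code` and `body`), and `error` fields; `split_results` is a hypothetical helper and the sample data is made up for illustration.

```python
def split_results(results):
    """Index batch output lines by custom_id, separating successes from failures."""
    succeeded, failed = {}, {}
    for res in results:
        custom_id = res["custom_id"]
        response = res.get("response")
        ok = (
            res.get("error") is None
            and response is not None
            and response.get("status_code") == 200
        )
        if ok:
            succeeded[custom_id] = response["body"]["choices"][0]["message"]["content"]
        else:
            # Keep whatever diagnostic information is available
            failed[custom_id] = res.get("error") or response
    return succeeded, failed

# Made-up output lines, deliberately out of input order
sample_results = [
    {
        "custom_id": "task-2",
        "response": {"status_code": 200, "body": {"choices": [{"message": {"content": "third"}}]}},
        "error": None,
    },
    {
        "custom_id": "task-0",
        "response": {"status_code": 200, "body": {"choices": [{"message": {"content": "first"}}]}},
        "error": None,
    },
    {
        "custom_id": "task-1",
        "response": None,
        "error": {"code": "example_error", "message": "illustrative failure"},
    },
]

succeeded, failed = split_results(sample_results)

# Walk the requests in input order and look each one up by its ID
for i in range(3):
    custom_id = f"task-{i}"
    print(custom_id, succeeded.get(custom_id, "<failed or missing>"))
```

With the lookup in place, the order of lines in the output file no longer matters, and failed requests can be collected for a retry batch.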

# Reading only the first results
for res in results[:5]:
    task_id = res['custom_id']
    # Getting index from task id
    index = task_id.split('-')[-1]
    result = res['response']['body']['choices'][0]['message']['content']
    movie = df.iloc[int(index)]
    description = movie['Overview']
    title = movie['Series_Title']
    print(f"TITLE: {title}\nOVERVIEW: {description}\n\nRESULT: {result}")
    print("\n\n----------------------------\n\n")
TITLE: American Psycho
OVERVIEW: A wealthy New York City investment banking executive, Patrick Bateman, hides his alternate psychopathic ego from his co-workers and friends as he delves deeper into his violent, hedonistic fantasies.

RESULT: {
    "categories": ["thriller", "psychological", "drama"],
    "summary": "A wealthy investment banker in New York City conceals his psychopathic alter ego while indulging in violent and hedonistic fantasies."
}


----------------------------


TITLE: Lethal Weapon
OVERVIEW: Two newly paired cops who are complete opposites must put aside their differences in order to catch a gang of drug smugglers.

RESULT: {
    "categories": ["action", "comedy", "crime"],
    "summary": "An action-packed comedy about two mismatched cops teaming up to take down a drug smuggling gang."
}


----------------------------


TITLE: A Star Is Born
OVERVIEW: A musician helps a young singer find fame as age and alcoholism send his own career into a downward spiral.

RESULT: {
    "categories": ["drama", "music"],
    "summary": "A musician's career spirals downward as he helps a young singer find fame amidst struggles with age and alcoholism."
}


----------------------------


TITLE: From Here to Eternity
OVERVIEW: In Hawaii in 1941, a private is cruelly punished for not boxing on his unit's team, while his captain's wife and second-in-command are falling in love.

RESULT: {
    "categories": ["drama", "romance", "war"],
    "summary": "A drama set in Hawaii in 1941, where a private faces punishment for not boxing on his unit's team, amidst a forbidden love affair between his captain's wife and second-in-command."
}


----------------------------


TITLE: The Jungle Book
OVERVIEW: Bagheera the Panther and Baloo the Bear have a difficult time trying to convince a boy to leave the jungle for human civilization.

RESULT: {
    "categories": ["adventure", "animation", "family"],
    "summary": "An adventure-filled animated movie about a panther and a bear trying to persuade a boy to leave the jungle for human civilization."
}


----------------------------


Second example: Captioning images

Loading data

We will use the Amazon furniture dataset for this example.

dataset_path = "data/amazon_furniture_dataset.csv"
df = pd.read_csv(dataset_path)
df.head()
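| | asin | url | title | brand | price | availability | categories | primary_image | images | upc | ... | color | material | style | important_information | product_overview | about_item | description | specifications | uniq_id | scraped_at |
| 0 | B0CJHKVG6P | https://www.amazon.com/dp/B0CJHKVG6P | GOYMFK 1pc Free Standing Shoe Rack, Multi-layer... | GOYMFK | $24.99 | Only 13 left in stock - order soon. | ['Home & Kitchen', 'Storage & Organization', '... | https://m.media-amazon.com/images/I/416WaLx10j... | ['https://m.media-amazon.com/images/I/416WaLx1... | NaN | ... | White | Metal | Modern | [] | [{'Brand': ' GOYMFK '}, {'Color': ' White '}, ... | ['Multiple layers: Provides ample storage space... | multiple shoes, coats, hats, and other items... | ['Brand: GOYMFK', 'Color: White', 'Material: Me... | 02593e81-5c09-5069-8516-b0b29f439ded | 2024-02-02 15:15:08 |
| 1 | B0B66QHB23 | https://www.amazon.com/dp/B0B66QHB23 | subrtex Leather Dining Room Chairs Set... | subrtex | NaN | NaN | ['Home & Kitchen', 'Furniture', 'Dining Room Furniture... | https://m.media-amazon.com/images/I/31SejUEWY7... | ['https://m.media-amazon.com/images/I/31SejUEW... | NaN | ... | Black | Sponge | Black Rubber Wood | [] | NaN | ['【Easy Assembly】: Set of 2 dining room chairs... | subrtex Dining Chairs Set of 2 | ['Brand: subrtex', 'Color: Black', 'Product Dimensions... | 5938d217-b8c5-5d3e-b1cf-e28e340f292e | 2024-02-02 15:15:09 |
| 2 | B0BXRTWLYK | https://www.amazon.com/dp/B0BXRTWLYK | Plant Repotting Mat MUYETOL Waterproof Transplanting Mat... | MUYETOL | $5.98 | In Stock | ['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo... | https://m.media-amazon.com/images/I/41RgefVq70... | ['https://m.media-amazon.com/images/I/41RgefVq... | NaN | ... | Green | Polyethylene | Modern | [] | [{'Brand': ' MUYETOL '}, {'Size': ' 26.8*26.8 ... | ['Plant repotting mat size: 26.8" x 26.8", square... | NaN | ['Brand: MUYETOL', 'Size: 26.8*26.8', 'Item Weight... | b2ede786-3f51-5a45-9a5b-bcf856958cd8 | 2024-02-02 15:15:09 |
| 3 | B0C1MRB2M8 | https://www.amazon.com/dp/B0C1MRB2M8 | Pickleball Doormat, Absorbent Welcome Doormat... | VEWETOL | $13.99 | Only 10 left in stock - order soon. | ['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo... | https://m.media-amazon.com/images/I/61vz1Igler... | ['https://m.media-amazon.com/images/I/61vz1Igl... | NaN | ... | A5589 | Rubber | Modern | [] | [{'Brand': ' VEWETOL '}, {'Size': ' 16*24 inch ... | ['Specifications: 16x24 inch ', " High-quality... | The decorative doormat features a subtle textured design... | ['Brand: VEWETOL', 'Size: 16*24 inch', 'Material... | 8fd9377b-cfa6-5f10-835c-6b8eca2816b5 | 2024-02-02 15:15:10 |
| 4 | B0CG1N9QRC | https://www.amazon.com/dp/B0CG1N9QRC | JOIN IRON Foldable TV Tray Table Set (Set of 4)... | JOIN IRON Store | $89.99 | Usually ships within 5 to 6 weeks | ['Home & Kitchen', 'Furniture', 'Game & Recreation... | https://m.media-amazon.com/images/I/41p4d4VJnN... | ['https://m.media-amazon.com/images/I/41p4d4VJ... | NaN | ... | Grey Set of 4 | Iron | X Classic Style | [] | NaN | ['Includes 4 folding TV tray tables and 1 matching... | Set of four folding trays with matching storage... | ['Brand: JOIN IRON', 'Shape: Rectangular', 'Inc... | bdc9aa30-9439-50dc-8e89-213ea211d66a | 2024-02-02 15:15:11 |

5 rows × 25 columns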

Processing step

Again, we will first prepare our requests with the Chat Completions endpoint, and create the batch file afterwards.

caption_system_prompt = '''
Your goal is to generate short, descriptive captions for images of items.
You will be provided with an item image and the name of that item and you will output a caption that captures the most important information about the item.
If there are multiple items depicted, refer to the name provided to understand which item you should describe.
Your generated caption should be short (1 sentence), and include only the most important information about the item.
The most important information could be: the type of item, the style (if mentioned), the material or color if especially relevant and/or any distinctive features.
Keep it short and to the point.
'''

def get_caption(img_url, title):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.2,
        max_tokens=300,
        messages=[
            {
                "role": "system",
                "content": caption_system_prompt
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": title
                    },
                    # The content type should be "image_url" to use the model's vision capabilities
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": img_url
                        }
                    },
                ],
            }
        ]
    )

    return response.choices[0].message.content
# Testing on a few images
for _, row in df[:5].iterrows():
    img_url = row['primary_image']
    caption = get_caption(img_url, row['title'])
    img = Image(url=img_url)
    display(img)
    print(f"CAPTION: {caption}\n\n")
CAPTION: A stylish white free-standing shoe rack featuring multiple layers and eight double hooks, perfect for organizing shoes and accessories in living rooms, bathrooms, or hallways.


CAPTION: Set of 2 black leather dining chairs featuring a sleek design with vertical stitching and sturdy wooden legs.


CAPTION: The MUYETOL Plant Repotting Mat is a waterproof, portable, and foldable gardening work mat measuring 26.8" x 26.8", designed for easy soil changing and indoor transplanting.


CAPTION: Absorbent non-slip doormat featuring the phrase "It's a good day to play PICKLEBALL" with paddle graphics, measuring 16x24 inches.


CAPTION: Set of 4 foldable TV trays in grey, featuring a compact design with a stand for easy storage, perfect for small spaces.


Creating the batch job

As with the first example, we will create an array of json tasks to generate a jsonl file and use it to create the batch job.
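
For larger datasets, the task array may need to be split across several batch files. The sketch below shows one way to do that with plain Python. `chunk_tasks`, `write_jsonl`, the file naming scheme, and the `max_requests` value are illustrative assumptions that are not part of the original notebook; the actual per-batch limits should be checked in the Batch API documentation.

```python
import json
import os
import tempfile

def chunk_tasks(tasks, max_requests):
    """Split a list of tasks into consecutive chunks of at most max_requests items."""
    if max_requests < 1:
        raise ValueError("max_requests must be at least 1")
    return [tasks[i:i + max_requests] for i in range(0, len(tasks), max_requests)]

def write_jsonl(path, tasks):
    """Write one JSON object per line, as the batch input file expects."""
    with open(path, "w") as file:
        for task in tasks:
            file.write(json.dumps(task) + "\n")

# Hypothetical stand-in for the task array built from the dataframe
sample_tasks = [{"custom_id": f"task-{i}"} for i in range(5)]

# A temporary directory keeps the sketch from touching the data/ folder
out_dir = tempfile.mkdtemp()
file_names = []
for n, chunk in enumerate(chunk_tasks(sample_tasks, max_requests=2)):
    path = os.path.join(out_dir, f"batch_tasks_furniture_{n}.jsonl")
    write_jsonl(path, chunk)
    file_names.append(path)

print(len(file_names))
```

Because the `custom_id` values are assigned before chunking, they stay unique across all files, so results from several batch jobs can still be matched back to the same dataframe.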

# Creating an array of json tasks

tasks = []

for index, row in df.iterrows():
    
    title = row['title']
    img_url = row['primary_image']
    
    task = {
        "custom_id": f"task-{index}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            # This is what you would have in your Chat Completions API call
            "model": "gpt-4o-mini",
            "temperature": 0.2,
            "max_tokens": 300,
            "messages": [
                {
                    "role": "system",
                    "content": caption_system_prompt
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": title
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": img_url
                            }
                        },
                    ],
                }
            ]            
        }
    }
    
    tasks.append(task)
# Creating the file

file_name = "data/batch_tasks_furniture.jsonl"

with open(file_name, 'w') as file:
    for obj in tasks:
        file.write(json.dumps(obj) + '\n')
# Uploading the file 

# The context manager closes the handle once the upload returns
with open(file_name, "rb") as file:
    batch_file = client.files.create(
        file=file,
        purpose="batch"
    )
# Creating the job

batch_job = client.batches.create(
  input_file_id=batch_file.id,
  endpoint="/v1/chat/completions",
  completion_window="24h"
)
batch_job = client.batches.retrieve(batch_job.id)
print(batch_job)

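Getting results

As with the first example, we can retrieve results once the batch job is done.

Reminder: the results are not in the same order as in the input file. Make sure to check the custom_id to match the results against the input requests.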

# Retrieving result file

result_file_id = batch_job.output_file_id
result = client.files.content(result_file_id).content
result_file_name = "data/batch_job_results_furniture.jsonl"

with open(result_file_name, 'wb') as file:
    file.write(result)
# Loading data from saved file

results = []
with open(result_file_name, 'r') as file:
    for line in file:
        # Parsing the JSON string into a dict and appending to the list of results
        json_object = json.loads(line.strip())
        results.append(json_object)
# Reading only the first results
for res in results[:5]:
    task_id = res['custom_id']
    # Getting index from task id
    index = task_id.split('-')[-1]
    result = res['response']['body']['choices'][0]['message']['content']
    item = df.iloc[int(index)]
    img_url = item['primary_image']
    img = Image(url=img_url)
    display(img)
    print(f"CAPTION: {result}\n\n")
CAPTION: Brushed brass pedestal towel rack with a sleek, modern design, featuring multiple bars for hanging towels, measuring 25.75 x 14.44 x 32 inches.


CAPTION: Black round end table featuring a tempered glass top and a metal frame, with a lower shelf for additional storage.


CAPTION: Black collapsible and height-adjustable telescoping stool, portable and designed for makeup artists and hairstylists, shown in various stages of folding for easy transport.


CAPTION: Ergonomic pink gaming chair featuring breathable fabric, adjustable height, lumbar support, a footrest, and a swivel recliner function.


CAPTION: A set of two Glitzhome adjustable bar stools featuring a mid-century modern design with swivel seats, PU leather upholstery, and wooden backrests.


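Wrapping up

In this guide, we have seen two examples of how to use the new Batch API. Keep in mind that the Batch API works the same way as the Chat Completions endpoint: it supports the same parameters and most of the recent models (gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo...).

By using this API, you can significantly reduce costs, so we recommend switching every workload that can run asynchronously to a batch job with this new API.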