OpenAI APIs - Completions#
SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available in the OpenAI API Reference.
This tutorial covers the following popular APIs:
chat/completions
completions
batches
Check out other tutorials to learn about vision APIs for vision-language models and embedding APIs for embedding models.
Launch A Server#
Launch the server in your terminal and wait for it to initialize.
[1]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)

server_process = execute_shell_command(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --host 0.0.0.0"
)

wait_for_server("http://localhost:30000")
[2025-01-03 02:37:47] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, return_token_ids=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=641729045, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2025-01-03 02:38:01 TP0] Init torch distributed begin.
[2025-01-03 02:38:01 TP0] Load weight begin. avail mem=78.81 GB
[2025-01-03 02:38:01 TP0] Ignore import error when loading sglang.srt.models.grok. unsupported operand type(s) for |: 'type' and 'NoneType'
[2025-01-03 02:38:02 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.33it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.30it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:01<00:00, 1.90it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.67it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.61it/s]
[2025-01-03 02:38:04 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.72 GB
[2025-01-03 02:38:05 TP0] Memory pool end. avail mem=8.34 GB
[2025-01-03 02:38:05 TP0] Capture cuda graph begin. This can take up to several minutes.
100%|██████████| 23/23 [00:05<00:00, 4.05it/s]
[2025-01-03 02:38:10 TP0] Capture cuda graph end. Time elapsed: 5.68 s
[2025-01-03 02:38:11 TP0] max_total_num_tokens=444500, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-01-03 02:38:11] INFO: Started server process [4044271]
[2025-01-03 02:38:11] INFO: Waiting for application startup.
[2025-01-03 02:38:11] INFO: Application startup complete.
[2025-01-03 02:38:11] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-01-03 02:38:12] INFO: 127.0.0.1:56294 - "GET /v1/models HTTP/1.1" 200 OK
[2025-01-03 02:38:12] INFO: 127.0.0.1:56304 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-03 02:38:12 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:38:13] INFO: 127.0.0.1:56318 - "POST /generate HTTP/1.1" 200 OK
[2025-01-03 02:38:13] The server is fired up and ready to roll!
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
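Once the server reports ready, you can sanity-check the OpenAI-compatible endpoints with a plain HTTP request. The sketch below simply lists the served models via /v1/models (the same endpoint queried in the startup logs above); it assumes the launch command shown earlier, with the server listening on port 30000.
import requests

# List the models served by the local SGLang server (assumes port 30000 as launched above).
models = requests.get("http://localhost:30000/v1/models").json()
print_highlight(models)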
Chat Completions#
Usage#
The server fully implements the OpenAI API. It will automatically apply the chat template specified in the Hugging Face tokenizer, if one is available. You can also specify a custom chat template with --chat-template when launching the server.
[2]:
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print_highlight(f"Response: {response}")
[2025-01-03 02:38:17 TP0] Prefill batch. #new-seq: 1, #new-token: 42, #cached-token: 1, cache hit rate: 2.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:38:17 TP0] Decode batch. #running-req: 1, #token: 76, token usage: 0.00, gen throughput (token/s): 6.15, #queue-req: 0
[2025-01-03 02:38:17] INFO: 127.0.0.1:56322 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Parameters#
The chat completions API accepts the parameters of the OpenAI Chat Completions API. Refer to the OpenAI Chat Completions API documentation for more details.
Here is an example of a detailed chat completion request:
[3]:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a knowledgeable historian who provides concise responses.",
        },
        {"role": "user", "content": "Tell me about ancient Rome"},
        {
            "role": "assistant",
            "content": "Ancient Rome was a civilization centered in Italy.",
        },
        {"role": "user", "content": "What were their major achievements?"},
    ],
    temperature=0.3,  # Lower temperature for more focused responses
    max_tokens=128,  # Reasonable length for a concise response
    top_p=0.95,  # Slightly higher for better fluency
    presence_penalty=0.2,  # Mild penalty to avoid repetition
    frequency_penalty=0.2,  # Mild penalty for more natural language
    n=1,  # Single response is usually more stable
    seed=42,  # Keep for reproducibility
)

print_highlight(response.choices[0].message.content)
[2025-01-03 02:38:17 TP0] Prefill batch. #new-seq: 1, #new-token: 51, #cached-token: 25, cache hit rate: 20.63%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:38:17 TP0] frequency_penalty, presence_penalty, and repetition_penalty are not supported when using the default overlap scheduler. They will be ignored. Please add `--disable-overlap` when launching the server if you need these features. The speed will be slower in that case.
[2025-01-03 02:38:18 TP0] Decode batch. #running-req: 1, #token: 106, token usage: 0.00, gen throughput (token/s): 126.11, #queue-req: 0
[2025-01-03 02:38:18 TP0] Decode batch. #running-req: 1, #token: 146, token usage: 0.00, gen throughput (token/s): 141.58, #queue-req: 0
[2025-01-03 02:38:18 TP0] Decode batch. #running-req: 1, #token: 186, token usage: 0.00, gen throughput (token/s): 142.56, #queue-req: 0
[2025-01-03 02:38:18] INFO: 127.0.0.1:56322 - "POST /v1/chat/completions HTTP/1.1" 200 OK
1. **Law and Governance**: The Twelve Tables (450 BCE) and the Julian and Theodosian Codes established a sophisticated system of laws, governance, and administration.
2. **Engineering and Architecture**: Romans developed concrete (Opus caementicium), aqueducts, roads (e.g., Appian Way), bridges, and monumental buildings like the Colosseum and Pantheon.
3. **Military Conquests**: Rome expanded its territories through a series of wars, creating a vast empire that lasted for centuries.
4. **Language and Literature**: Latin became the language of government,
Streaming mode is also supported.
[4]:
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
[2025-01-03 02:38:18] INFO: 127.0.0.1:56322 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-01-03 02:38:18 TP0] Prefill batch. #new-seq: 1, #new-token: 10, #cached-token: 30, cache hit rate: 33.73%, token usage: 0.00, #running-req: 0, #queue-req: 0
It looks like you're saying that this is a test, but I'm not sure what you're testing[2025-01-03 02:38:18 TP0] Decode batch. #running-req: 1, #token: 62, token usage: 0.00, gen throughput (token/s): 136.02, #queue-req: 0
or what the purpose of this test is. If you'd like to clarify, I'd be happy to try and assist.
Completions#
Usage#
The Completions API is similar to the Chat Completions API, but without the messages parameter or chat templates.
[5]:
response = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="List 3 countries and their capitals.",
    temperature=0,
    max_tokens=64,
    n=1,
    stop=None,
)
print_highlight(f"Response: {response}")
[2025-01-03 02:38:19 TP0] Prefill batch. #new-seq: 1, #new-token: 8, #cached-token: 1, cache hit rate: 32.57%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:38:19 TP0] Decode batch. #running-req: 1, #token: 24, token usage: 0.00, gen throughput (token/s): 139.38, #queue-req: 0
[2025-01-03 02:38:19 TP0] Decode batch. #running-req: 1, #token: 64, token usage: 0.00, gen throughput (token/s): 146.54, #queue-req: 0
[2025-01-03 02:38:19] INFO: 127.0.0.1:56322 - "POST /v1/completions HTTP/1.1" 200 OK
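The full response object is printed above; the generated text itself lives in the text field of each choice (unlike chat completions, which use message.content). For example, reusing the response from the previous cell:
print_highlight(response.choices[0].text)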
Parameters#
The completions API accepts the parameters of the OpenAI Completions API. Refer to the OpenAI Completions API documentation for more details.
Here is an example of a detailed completions request:
[6]:
response = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="Write a short story about a space explorer.",
    temperature=0.7,  # Moderate temperature for creative writing
    max_tokens=150,  # Longer response for a story
    top_p=0.9,  # Balanced diversity in word choice
    stop=["\n\n", "THE END"],  # Multiple stop sequences
    presence_penalty=0.3,  # Encourage novel elements
    frequency_penalty=0.3,  # Reduce repetitive phrases
    n=1,  # Generate one completion
    seed=123,  # For reproducible results
)

print_highlight(f"Response: {response}")
[2025-01-03 02:38:19 TP0] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 1, cache hit rate: 31.35%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:38:19 TP0] frequency_penalty, presence_penalty, and repetition_penalty are not supported when using the default overlap scheduler. They will be ignored. Please add `--disable-overlap` when launching the server if you need these features. The speed will be slower in that case.
[2025-01-03 02:38:19 TP0] Decode batch. #running-req: 1, #token: 41, token usage: 0.00, gen throughput (token/s): 139.23, #queue-req: 0
[2025-01-03 02:38:20 TP0] Decode batch. #running-req: 1, #token: 81, token usage: 0.00, gen throughput (token/s): 145.14, #queue-req: 0
[2025-01-03 02:38:20] INFO: 127.0.0.1:56322 - "POST /v1/completions HTTP/1.1" 200 OK
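Streaming also works with the Completions endpoint in the same way as shown for chat completions above. Here is a minimal sketch, assuming the server accepts stream=True for /v1/completions, with text chunks arriving in choices[0].text:
stream = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="List 3 countries and their capitals.",
    temperature=0,
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    # Skip empty chunks (e.g. the final chunk carrying only the finish reason).
    if chunk.choices[0].text:
        print(chunk.choices[0].text, end="")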
Structured Outputs (JSON, Regex, EBNF)#
You can specify a JSON schema, regular expression, or EBNF to constrain the model output. The model output is guaranteed to follow the given constraints. Only one constraint parameter (json_schema, regex, or ebnf) can be specified per request.
SGLang supports two grammar backends:
Outlines (default): supports JSON schema and regular expression constraints.
XGrammar: supports JSON schema and EBNF constraints.
XGrammar currently uses the GGML BNF format.
Initialize the XGrammar backend with the --grammar-backend xgrammar flag:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0 --grammar-backend [xgrammar|outlines] # xgrammar or outlines (default: outlines)
JSON#
[7]:
import json

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Give me the information of the capital of France in the JSON format.",
        },
    ],
    temperature=0,
    max_tokens=128,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "foo", "schema": json.loads(json_schema)},
    },
)

print_highlight(response.choices[0].message.content)
[2025-01-03 02:38:20 TP0] Prefill batch. #new-seq: 1, #new-token: 19, #cached-token: 30, cache hit rate: 37.61%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:38:20] INFO: 127.0.0.1:56322 - "POST /v1/chat/completions HTTP/1.1" 200 OK
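Because the output is constrained to the schema above, it can be parsed directly. A quick sketch reusing the response from the previous cell:
capital_info = json.loads(response.choices[0].message.content)
# The schema guarantees both "name" and "population" are present.
print_highlight(f"Name: {capital_info['name']}, population: {capital_info['population']}")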
Regular Expression (use the default "outlines" backend)#
[8]:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=128,
    extra_body={"regex": "(Paris|London)"},
)
print_highlight(response.choices[0].message.content)
[2025-01-03 02:38:20 TP0] Prefill batch. #new-seq: 1, #new-token: 12, #cached-token: 30, cache hit rate: 42.75%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:38:20] INFO: 127.0.0.1:56322 - "POST /v1/chat/completions HTTP/1.1" 200 OK
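With the regex constraint above, the generated reply matches (Paris|London), so it can be checked directly. A small sketch:
answer = response.choices[0].message.content
assert answer in ("Paris", "London"), f"Unexpected constrained output: {answer}"
print_highlight(answer)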
EBNF (use the "xgrammar" backend)#
[9]:
# terminate the existing server (which uses the default outlines backend) for this demo
terminate_process(server_process)

# start a new server with the xgrammar backend
server_process = execute_shell_command(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --host 0.0.0.0 --grammar-backend xgrammar"
)
wait_for_server("http://localhost:30000")

# EBNF example
ebnf_grammar = r"""
root ::= "Hello" | "Hi" | "Hey"
"""

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful EBNF test bot."},
        {"role": "user", "content": "Say a greeting."},
    ],
    temperature=0,
    max_tokens=32,
    extra_body={"ebnf": ebnf_grammar},
)

print_highlight(response.choices[0].message.content)
[2025-01-03 02:38:29] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, return_token_ids=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=284622170, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2025-01-03 02:38:42 TP0] Init torch distributed begin.
[2025-01-03 02:38:43 TP0] Load weight begin. avail mem=78.81 GB
[2025-01-03 02:38:43 TP0] Ignore import error when loading sglang.srt.models.grok. unsupported operand type(s) for |: 'type' and 'NoneType'
[2025-01-03 02:38:43 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.12it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.05it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.53it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.31it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.29it/s]
[2025-01-03 02:38:47 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.72 GB
[2025-01-03 02:38:47 TP0] Memory pool end. avail mem=8.34 GB
[2025-01-03 02:38:47 TP0] Capture cuda graph begin. This can take up to several minutes.
100%|██████████| 23/23 [00:04<00:00, 4.73it/s]
[2025-01-03 02:38:52 TP0] Capture cuda graph end. Time elapsed: 4.87 s
[2025-01-03 02:38:52 TP0] max_total_num_tokens=444500, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-01-03 02:38:53] INFO: Started server process [4045220]
[2025-01-03 02:38:53] INFO: Waiting for application startup.
[2025-01-03 02:38:53] INFO: Application startup complete.
[2025-01-03 02:38:53] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-01-03 02:38:54] INFO: 127.0.0.1:45182 - "GET /v1/models HTTP/1.1" 200 OK
[2025-01-03 02:38:54] INFO: 127.0.0.1:45198 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-03 02:38:54 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:38:55] INFO: 127.0.0.1:45202 - "POST /generate HTTP/1.1" 200 OK
[2025-01-03 02:38:55] The server is fired up and ready to roll!
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
[2025-01-03 02:38:59 TP0] Prefill batch. #new-seq: 1, #new-token: 48, #cached-token: 1, cache hit rate: 1.79%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:38:59] INFO: 127.0.0.1:57854 - "POST /v1/chat/completions HTTP/1.1" 200 OK
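An EBNF grammar can also contain several rules. The following is a hedged sketch of a slightly larger grammar in the same GBNF-style syntax as above, where root references a second rule named city; the rule names and alternatives are purely illustrative:
ebnf_grammar = r"""
root ::= "The capital of France is " city "."
city ::= "Paris" | "Lyon" | "Marseille"
"""

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0,
    max_tokens=32,
    extra_body={"ebnf": ebnf_grammar},
)
print_highlight(response.choices[0].message.content)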
Batches#
Batches APIs for chat completions and completions are also supported. You can upload your requests in a jsonl file, create a batch job, and retrieve the results when the batch job is completed (this takes longer but costs less).
The batches APIs are:
batches
batches/{batch_id}/cancel
batches/{batch_id}
Here is an example of a batch job for chat completions; completions work similarly.
[10]:
import json
import time
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="None")

requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "messages": [
                {"role": "user", "content": "Tell me a joke about programming"}
            ],
            "max_tokens": 50,
        },
    },
    {
        "custom_id": "request-2",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "messages": [{"role": "user", "content": "What is Python?"}],
            "max_tokens": 50,
        },
    },
]

input_file_path = "batch_requests.jsonl"

with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    file_response = client.files.create(file=f, purpose="batch")

batch_response = client.batches.create(
    input_file_id=file_response.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Batch job created with ID: {batch_response.id}")
[2025-01-03 02:38:59] INFO: 127.0.0.1:57866 - "POST /v1/files HTTP/1.1" 200 OK
[2025-01-03 02:38:59] INFO: 127.0.0.1:57866 - "POST /v1/batches HTTP/1.1" 200 OK
[2025-01-03 02:38:59 TP0] Prefill batch. #new-seq: 2, #new-token: 30, #cached-token: 50, cache hit rate: 37.50%, token usage: 0.00, #running-req: 0, #queue-req: 0
[11]:
while batch_response.status not in ["completed", "failed", "cancelled"]:
    time.sleep(3)
    print(f"Batch job status: {batch_response.status}...trying again in 3 seconds...")
    batch_response = client.batches.retrieve(batch_response.id)

if batch_response.status == "completed":
    print("Batch job completed successfully!")
    print(f"Request counts: {batch_response.request_counts}")

    result_file_id = batch_response.output_file_id
    file_response = client.files.content(result_file_id)
    result_content = file_response.read().decode("utf-8")

    results = [
        json.loads(line) for line in result_content.split("\n") if line.strip() != ""
    ]

    for result in results:
        print_highlight(f"Request {result['custom_id']}:")
        print_highlight(f"Response: {result['response']}")

    print_highlight("Cleaning up files...")
    # Only delete the result file ID since file_response is just content
    client.files.delete(result_file_id)
else:
    print_highlight(f"Batch job failed with status: {batch_response.status}")
    if hasattr(batch_response, "errors"):
        print_highlight(f"Errors: {batch_response.errors}")
[2025-01-03 02:38:59 TP0] Decode batch. #running-req: 1, #token: 70, token usage: 0.00, gen throughput (token/s): 7.71, #queue-req: 0
Batch job status: validating...trying again in 3 seconds...
[2025-01-03 02:39:02] INFO: 127.0.0.1:57866 - "GET /v1/batches/batch_e635fc54-3d49-4540-bf2e-99ee24e869e3 HTTP/1.1" 200 OK
Batch job completed successfully!
Request counts: BatchRequestCounts(completed=2, failed=0, total=2)
[2025-01-03 02:39:02] INFO: 127.0.0.1:57866 - "GET /v1/files/backend_result_file-356a17aa-806b-41bb-82a1-1f8660d8dff5/content HTTP/1.1" 200 OK
[2025-01-03 02:39:02] INFO: 127.0.0.1:57866 - "DELETE /v1/files/backend_result_file-356a17aa-806b-41bb-82a1-1f8660d8dff5 HTTP/1.1" 200 OK
It takes a while for a batch job to complete. You can use these two APIs to retrieve the batch job status or cancel the batch job:
batches/{batch_id}: retrieve the batch job status.
batches/{batch_id}/cancel: cancel the batch job.
Here is an example of checking the batch job status.
[12]:
import json
import time
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="None")

requests = []
for i in range(100):
    requests.append(
        {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/chat/completions",
            "body": {
                "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": f"{i}: You are a helpful AI assistant",
                    },
                    {
                        "role": "user",
                        "content": "Write a detailed story about topic. Make it very long.",
                    },
                ],
                "max_tokens": 500,
            },
        }
    )

input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    uploaded_file = client.files.create(file=f, purpose="batch")

batch_job = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Created batch job with ID: {batch_job.id}")
print_highlight(f"Initial status: {batch_job.status}")

time.sleep(10)

max_checks = 5
for i in range(max_checks):
    batch_details = client.batches.retrieve(batch_id=batch_job.id)

    print_highlight(
        f"Batch job details (check {i+1} / {max_checks}) // ID: {batch_details.id} // Status: {batch_details.status} // Created at: {batch_details.created_at} // Input file ID: {batch_details.input_file_id} // Output file ID: {batch_details.output_file_id}"
    )
    print_highlight(
        f"<strong>Request counts: Total: {batch_details.request_counts.total} // Completed: {batch_details.request_counts.completed} // Failed: {batch_details.request_counts.failed}</strong>"
    )

    time.sleep(3)
[2025-01-03 02:39:02] INFO: 127.0.0.1:57868 - "POST /v1/files HTTP/1.1" 200 OK
[2025-01-03 02:39:02] INFO: 127.0.0.1:57868 - "POST /v1/batches HTTP/1.1" 200 OK
[2025-01-03 02:39:02 TP0] Prefill batch. #new-seq: 6, #new-token: 180, #cached-token: 150, cache hit rate: 43.13%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:39:02 TP0] Prefill batch. #new-seq: 94, #new-token: 2820, #cached-token: 2350, cache hit rate: 45.26%, token usage: 0.00, #running-req: 6, #queue-req: 0
[2025-01-03 02:39:02 TP0] Decode batch. #running-req: 100, #token: 5125, token usage: 0.01, gen throughput (token/s): 668.34, #queue-req: 0
[2025-01-03 02:39:03 TP0] Decode batch. #running-req: 100, #token: 9125, token usage: 0.02, gen throughput (token/s): 11871.74, #queue-req: 0
[2025-01-03 02:39:03 TP0] Decode batch. #running-req: 100, #token: 13125, token usage: 0.03, gen throughput (token/s): 11616.00, #queue-req: 0
[2025-01-03 02:39:03 TP0] Decode batch. #running-req: 100, #token: 17125, token usage: 0.04, gen throughput (token/s): 11353.26, #queue-req: 0
[2025-01-03 02:39:04 TP0] Decode batch. #running-req: 100, #token: 21125, token usage: 0.05, gen throughput (token/s): 11100.41, #queue-req: 0
[2025-01-03 02:39:04 TP0] Decode batch. #running-req: 100, #token: 25125, token usage: 0.06, gen throughput (token/s): 10830.95, #queue-req: 0
[2025-01-03 02:39:04 TP0] Decode batch. #running-req: 100, #token: 29125, token usage: 0.07, gen throughput (token/s): 10593.56, #queue-req: 0
[2025-01-03 02:39:05 TP0] Decode batch. #running-req: 100, #token: 33125, token usage: 0.07, gen throughput (token/s): 10381.56, #queue-req: 0
[2025-01-03 02:39:05 TP0] Decode batch. #running-req: 100, #token: 37125, token usage: 0.08, gen throughput (token/s): 10150.74, #queue-req: 0
[2025-01-03 02:39:06 TP0] Decode batch. #running-req: 100, #token: 41125, token usage: 0.09, gen throughput (token/s): 9935.48, #queue-req: 0
[2025-01-03 02:39:06 TP0] Decode batch. #running-req: 100, #token: 45125, token usage: 0.10, gen throughput (token/s): 9709.21, #queue-req: 0
[2025-01-03 02:39:07 TP0] Decode batch. #running-req: 100, #token: 49125, token usage: 0.11, gen throughput (token/s): 9479.91, #queue-req: 0
[2025-01-03 02:39:07 TP0] Decode batch. #running-req: 0, #token: 0, token usage: 0.00, gen throughput (token/s): 9248.37, #queue-req: 0
[2025-01-03 02:39:12] INFO: 127.0.0.1:50938 - "GET /v1/batches/batch_30280d02-914c-40aa-bff7-552c235147c2 HTTP/1.1" 200 OK
[2025-01-03 02:39:15] INFO: 127.0.0.1:50938 - "GET /v1/batches/batch_30280d02-914c-40aa-bff7-552c235147c2 HTTP/1.1" 200 OK
[2025-01-03 02:39:18] INFO: 127.0.0.1:50938 - "GET /v1/batches/batch_30280d02-914c-40aa-bff7-552c235147c2 HTTP/1.1" 200 OK
[2025-01-03 02:39:21] INFO: 127.0.0.1:50938 - "GET /v1/batches/batch_30280d02-914c-40aa-bff7-552c235147c2 HTTP/1.1" 200 OK
[2025-01-03 02:39:24] INFO: 127.0.0.1:50938 - "GET /v1/batches/batch_30280d02-914c-40aa-bff7-552c235147c2 HTTP/1.1" 200 OK
Here is an example of cancelling a batch job.
[13]:
import json
import time
from openai import OpenAI
import os

client = OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="None")

requests = []
for i in range(500):
    requests.append(
        {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/chat/completions",
            "body": {
                "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
                "messages": [
                    {
                        "role": "system",
                        "content": f"{i}: You are a helpful AI assistant",
                    },
                    {
                        "role": "user",
                        "content": "Write a detailed story about topic. Make it very long.",
                    },
                ],
                "max_tokens": 500,
            },
        }
    )

input_file_path = "batch_requests.jsonl"
with open(input_file_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open(input_file_path, "rb") as f:
    uploaded_file = client.files.create(file=f, purpose="batch")

batch_job = client.batches.create(
    input_file_id=uploaded_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print_highlight(f"Created batch job with ID: {batch_job.id}")
print_highlight(f"Initial status: {batch_job.status}")

time.sleep(10)

try:
    cancelled_job = client.batches.cancel(batch_id=batch_job.id)
    print_highlight(f"Cancellation initiated. Status: {cancelled_job.status}")
    assert cancelled_job.status == "cancelling"

    # Monitor the cancellation process
    while cancelled_job.status not in ["failed", "cancelled"]:
        time.sleep(3)
        cancelled_job = client.batches.retrieve(batch_job.id)
        print_highlight(f"Current status: {cancelled_job.status}")

    # Verify final status
    assert cancelled_job.status == "cancelled"
    print_highlight("Batch job successfully cancelled")

except Exception as e:
    print_highlight(f"Error during cancellation: {e}")
    raise e

finally:
    try:
        del_response = client.files.delete(uploaded_file.id)
        if del_response.deleted:
            print_highlight("Successfully cleaned up input file")
        if os.path.exists(input_file_path):
            os.remove(input_file_path)
            print_highlight("Successfully deleted local batch_requests.jsonl file")
    except Exception as e:
        print_highlight(f"Error cleaning up: {e}")
        raise e
[2025-01-03 02:39:27] INFO: 127.0.0.1:51646 - "POST /v1/files HTTP/1.1" 200 OK
[2025-01-03 02:39:27] INFO: 127.0.0.1:51646 - "POST /v1/batches HTTP/1.1" 200 OK
[2025-01-03 02:39:27 TP0] Prefill batch. #new-seq: 8, #new-token: 8, #cached-token: 432, cache hit rate: 49.09%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:39:27 TP0] Prefill batch. #new-seq: 273, #new-token: 5522, #cached-token: 9493, cache hit rate: 59.15%, token usage: 0.01, #running-req: 8, #queue-req: 0
[2025-01-03 02:39:27 TP0] Prefill batch. #new-seq: 219, #new-token: 6570, #cached-token: 5475, cache hit rate: 54.17%, token usage: 0.02, #running-req: 281, #queue-req: 0
[2025-01-03 02:39:28 TP0] Decode batch. #running-req: 500, #token: 35525, token usage: 0.08, gen throughput (token/s): 933.30, #queue-req: 0
[2025-01-03 02:39:29 TP0] Decode batch. #running-req: 500, #token: 55525, token usage: 0.12, gen throughput (token/s): 26392.65, #queue-req: 0
[2025-01-03 02:39:30 TP0] Decode batch. #running-req: 500, #token: 75525, token usage: 0.17, gen throughput (token/s): 24744.27, #queue-req: 0
[2025-01-03 02:39:31 TP0] Decode batch. #running-req: 500, #token: 95525, token usage: 0.21, gen throughput (token/s): 23596.21, #queue-req: 0
[2025-01-03 02:39:32 TP0] Decode batch. #running-req: 500, #token: 115525, token usage: 0.26, gen throughput (token/s): 22639.74, #queue-req: 0
[2025-01-03 02:39:33 TP0] Decode batch. #running-req: 500, #token: 135525, token usage: 0.30, gen throughput (token/s): 21620.11, #queue-req: 0
[2025-01-03 02:39:34 TP0] Decode batch. #running-req: 500, #token: 155525, token usage: 0.35, gen throughput (token/s): 20783.25, #queue-req: 0
[2025-01-03 02:39:35 TP0] Decode batch. #running-req: 500, #token: 175525, token usage: 0.39, gen throughput (token/s): 20005.33, #queue-req: 0
[2025-01-03 02:39:36 TP0] Decode batch. #running-req: 500, #token: 195525, token usage: 0.44, gen throughput (token/s): 19274.02, #queue-req: 0
[2025-01-03 02:39:37 TP0] Decode batch. #running-req: 500, #token: 215525, token usage: 0.48, gen throughput (token/s): 18557.02, #queue-req: 0
[2025-01-03 02:39:37] INFO: 127.0.0.1:39504 - "POST /v1/batches/batch_796c2185-e21c-4f37-99e4-420cda617506/cancel HTTP/1.1" 200 OK
[2025-01-03 02:39:40] INFO: 127.0.0.1:39504 - "GET /v1/batches/batch_796c2185-e21c-4f37-99e4-420cda617506 HTTP/1.1" 200 OK
[2025-01-03 02:39:40] INFO: 127.0.0.1:39504 - "DELETE /v1/files/backend_input_file-8f5a5227-2e75-4416-939c-b7d85b1f4add HTTP/1.1" 200 OK
[14]:
terminate_process(server_process)