OpenAI APIs - Vision#
SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. A complete reference for the API is available in the OpenAI API Reference. This tutorial covers the vision APIs for vision language models.
SGLang supports vision language models such as Llama 3.2, LLaVA-OneVision, and Qwen2-VL.
Launch A Server#
Launch the server in your terminal and wait for it to initialize.
Remember to add --chat-template llama_3_vision to specify the vision chat template, otherwise the server only supports text. We need to specify --chat-template for vision language models because the chat template provided in the Hugging Face tokenizer only supports text.
[1]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)

vision_process = execute_shell_command(
    """
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
 --port=30000 --chat-template=llama_3_vision
"""
)

wait_for_server("http://localhost:30000")
[2025-01-03 02:31:54] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-11B-Vision-Instruct', tokenizer_path='meta-llama/Llama-3.2-11B-Vision-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.2-11B-Vision-Instruct', chat_template='llama_3_vision', is_embedding=False, revision=None, skip_tokenizer_init=False, return_token_ids=False, host='127.0.0.1', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=186352855, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2025-01-03 02:32:01] Use chat template for the OpenAI-compatible API server: llama_3_vision
[2025-01-03 02:32:09 TP0] Overlap scheduler is disabled for multimodal models.
[2025-01-03 02:32:09 TP0] Automatically turn off --chunked-prefill-size for mllama.
[2025-01-03 02:32:09 TP0] Init torch distributed begin.
[2025-01-03 02:32:09 TP0] Load weight begin. avail mem=78.81 GB
[2025-01-03 02:32:09 TP0] Ignore import error when loading sglang.srt.models.grok. unsupported operand type(s) for |: 'type' and 'NoneType'
[2025-01-03 02:32:10 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:03, 1.28it/s]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:01<00:02, 1.32it/s]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:02<00:01, 1.30it/s]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:02<00:00, 1.66it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.50it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.46it/s]
[2025-01-03 02:32:13 TP0] Load weight end. type=MllamaForConditionalGeneration, dtype=torch.bfloat16, avail mem=58.65 GB
[2025-01-03 02:32:14 TP0] Memory pool end. avail mem=11.86 GB
[2025-01-03 02:32:14 TP0] Capture cuda graph begin. This can take up to several minutes.
100%|██████████| 23/23 [00:06<00:00, 3.43it/s]
[2025-01-03 02:32:21 TP0] Capture cuda graph end. Time elapsed: 6.71 s
[2025-01-03 02:32:23 TP0] max_total_num_tokens=299646, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-01-03 02:32:23] INFO: Started server process [4035454]
[2025-01-03 02:32:23] INFO: Waiting for application startup.
[2025-01-03 02:32:23] INFO: Application startup complete.
[2025-01-03 02:32:23] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-01-03 02:32:23] INFO: 127.0.0.1:53740 - "GET /v1/models HTTP/1.1" 200 OK
[2025-01-03 02:32:24] INFO: 127.0.0.1:53752 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-03 02:32:24 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:32:24] INFO: 127.0.0.1:53758 - "POST /generate HTTP/1.1" 200 OK
[2025-01-03 02:32:24] The server is fired up and ready to roll!
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
Using cURL#
Once the server is up, you can send test requests using curl or requests.
[2]:
import subprocess

curl_command = """
curl -s http://localhost:30000/v1/chat/completions \
  -d '{
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What’s in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'
"""

response = subprocess.check_output(curl_command, shell=True).decode()
print_highlight(response)
[2025-01-03 02:32:33 TP0] Prefill batch. #new-seq: 1, #new-token: 6463, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:32:33] INFO: 127.0.0.1:55348 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"id":"4df620b6042b4b4b835981f325f0924b","object":"chat.completion","created":1735871553,"model":"meta-llama/Llama-3.2-11B-Vision-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image shows a man in a yellow jacket ironing clothes on an ironing board attached to the back of a yellow taxi cab.","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":128009}],"usage":{"prompt_tokens":6463,"total_tokens":6491,"completion_tokens":28,"prompt_tokens_details":null}}
Using Python Requests#
[3]:
import requests

url = "http://localhost:30000/v1/chat/completions"

data = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(url, json=data)
print_highlight(response.text)
[2025-01-03 02:32:34 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 6462, cache hit rate: 49.97%, token usage: 0.02, #running-req: 0, #queue-req: 0
[2025-01-03 02:32:34 TP0] Decode batch. #running-req: 1, #token: 6469, token usage: 0.02, gen throughput (token/s): 3.61, #queue-req: 0
[2025-01-03 02:32:34] INFO: 127.0.0.1:55352 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"id":"6b7740c0d1e645dc91402da5fb5457ba","object":"chat.completion","created":1735871554,"model":"meta-llama/Llama-3.2-11B-Vision-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The image shows a man ironing clothes on an ironing board that is set up on the back of a yellow taxi cab.","tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":128009}],"usage":{"prompt_tokens":6463,"total_tokens":6490,"completion_tokens":27,"prompt_tokens_details":null}}
Using OpenAI Python Client#
[4]:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print_highlight(response.choices[0].message.content)
[2025-01-03 02:32:35 TP0] Prefill batch. #new-seq: 1, #new-token: 11, #cached-token: 6452, cache hit rate: 66.58%, token usage: 0.02, #running-req: 0, #queue-req: 0
[2025-01-03 02:32:35 TP0] Decode batch. #running-req: 1, #token: 6483, token usage: 0.02, gen throughput (token/s): 31.18, #queue-req: 0
[2025-01-03 02:32:35] INFO: 127.0.0.1:55358 - "POST /v1/chat/completions HTTP/1.1" 200 OK
The image shows a man ironing clothes on an ironing board that is attached to the back of a yellow taxi cab.
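The OpenAI client can also stream the response token by token. The snippet below is a minimal sketch that reuses the client defined above; it assumes the standard OpenAI streaming interface (stream=True) exposed by the OpenAI-compatible server.
# Request a streamed completion and print tokens as they arrive.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true"
                    },
                },
            ],
        }
    ],
    max_tokens=100,
    stream=True,
)

for chunk in stream:
    # Some chunks carry no content (e.g. the final chunk), so guard against None.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)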
Multiple-Image Inputs#
The server also supports multiple images and interleaved text and images if the model supports it.
[5]:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png",
                    },
                },
                {
                    "type": "text",
                    "text": "I have two very different images. They are not related at all. "
                    "Please describe the first image in one sentence, and then describe the second image in another sentence.",
                },
            ],
        }
    ],
    temperature=0,
)

print_highlight(response.choices[0].message.content)
[2025-01-03 02:32:36 TP0] Prefill batch. #new-seq: 1, #new-token: 12895, #cached-token: 0, cache hit rate: 39.99%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:32:36 TP0] Decode batch. #running-req: 1, #token: 12930, token usage: 0.04, gen throughput (token/s): 25.67, #queue-req: 0
[2025-01-03 02:32:37] INFO: 127.0.0.1:55368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
The first image shows a man in a yellow shirt ironing a shirt on the back of a yellow taxi cab, with a small icon of a computer code snippet in the bottom left corner. The second image shows a large orange "S" and "G" and "L" on a white background.
[6]:
terminate_process(vision_process)
Chat Template#
As mentioned before, if you do not specify --chat-template for a vision model, the server uses Hugging Face's default template, which only supports text.
We list popular vision language models with their chat templates below; an example launch command follows the list.
meta-llama/Llama-3.2-Vision uses llama_3_vision.
Qwen/Qwen2-VL-7B-Instruct uses qwen2-vl.
LLaVA-OneVision uses chatml-llava.
LLaVA-NeXT uses chatml-llava.
Llama3-LLaVA-NeXT uses llava_llama_3.
LLaVA-v1.5 / 1.6 uses vicuna_v1.1.
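For example, serving Qwen2-VL with its matching template follows the same pattern as the launch cell at the top of this tutorial. The sketch below assumes the Qwen/Qwen2-VL-7B-Instruct checkpoint and reuses port 30000; adapt both to your setup.
from sglang.utils import execute_shell_command, wait_for_server

# Assumption: launching Qwen2-VL with its matching chat template.
qwen_vl_process = execute_shell_command(
    """
python3 -m sglang.launch_server --model-path Qwen/Qwen2-VL-7B-Instruct \
 --port=30000 --chat-template=qwen2-vl
"""
)

wait_for_server("http://localhost:30000")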