Quick Start: Sending Requests#
This notebook provides a quick-start guide for using SGLang for chat completions after installation.
For vision language models, see OpenAI APIs - Vision.
For embedding models, see OpenAI APIs - Embedding and Encode (embedding model).
For reward models, see Classify (reward model).
Launch a Server#
This code block is equivalent to executing
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0
in your terminal and waiting for the server to be ready. Once the server is running, you can send test requests with curl or requests. The server implements OpenAI-compatible APIs.
[1]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)

server_process = execute_shell_command(
    """
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
 --port 30000 --host 0.0.0.0
"""
)

wait_for_server("http://localhost:30000")
[2025-01-03 02:30:16] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, return_token_ids=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=986080263, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2025-01-03 02:30:30 TP0] Init torch distributed begin.
[2025-01-03 02:30:30 TP0] Load weight begin. avail mem=78.81 GB
[2025-01-03 02:30:30 TP0] Ignore import error when loading sglang.srt.models.grok. unsupported operand type(s) for |: 'type' and 'NoneType'
[2025-01-03 02:30:31 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.11it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.04it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.52it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.27it/s]
[2025-01-03 02:30:34 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.72 GB
[2025-01-03 02:30:34 TP0] Memory pool end. avail mem=8.34 GB
[2025-01-03 02:30:35 TP0] Capture cuda graph begin. This can take up to several minutes.
100%|██████████| 23/23 [00:05<00:00, 4.09it/s]
[2025-01-03 02:30:40 TP0] Capture cuda graph end. Time elapsed: 5.63 s
[2025-01-03 02:30:41 TP0] max_total_num_tokens=444500, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-01-03 02:30:41] INFO: Started server process [4033388]
[2025-01-03 02:30:41] INFO: Waiting for application startup.
[2025-01-03 02:30:41] INFO: Application startup complete.
[2025-01-03 02:30:41] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-01-03 02:30:41] INFO: 127.0.0.1:41380 - "GET /v1/models HTTP/1.1" 200 OK
[2025-01-03 02:30:42] INFO: 127.0.0.1:41386 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-03 02:30:42 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:30:43] INFO: 127.0.0.1:41402 - "POST /generate HTTP/1.1" 200 OK
[2025-01-03 02:30:43] The server is fired up and ready to roll!
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
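Once the logs show the server is ready, you can optionally confirm it is reachable before sending chat requests. The snippet below is a minimal sketch added for illustration; it queries the /v1/models and /get_model_info endpoints that appear in the startup logs above, and the exact response fields may vary across SGLang versions.
import requests

# Minimal readiness check (illustrative; assumes the server above is listening on port 30000).
# /v1/models is the OpenAI-compatible model listing endpoint;
# /get_model_info is the SGLang-native endpoint seen in the startup logs.
print_highlight(requests.get("http://localhost:30000/v1/models").json())
print_highlight(requests.get("http://localhost:30000/get_model_info").json())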
Using cURL#
[2]:
import subprocess, json
curl_command = """
curl -s http://localhost:30000/v1/chat/completions \
-d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
"""
response = json.loads(subprocess.check_output(curl_command, shell=True))
print_highlight(response)
[2025-01-03 02:30:46 TP0] Prefill batch. #new-seq: 1, #new-token: 41, #cached-token: 1, cache hit rate: 2.04%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:30:47] INFO: 127.0.0.1:41408 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{'id': '717f4bb30d854b7293314aecd6a154d4', 'object': 'chat.completion', 'created': 1735871447, 'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'The capital of France is Paris.', 'tool_calls': None}, 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 128009}], 'usage': {'prompt_tokens': 42, 'total_tokens': 50, 'completion_tokens': 8, 'prompt_tokens_details': None}}
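The same endpoint also accepts standard OpenAI sampling fields such as temperature and max_tokens. The variant below is an illustrative sketch rather than part of the original quick start; it adds an explicit Content-Type header and a couple of sampling parameters to the request body.
import subprocess, json

# Illustrative variant: explicit JSON Content-Type header plus standard OpenAI sampling fields.
curl_with_params = """
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "What is the capital of France?"}], "temperature": 0, "max_tokens": 32}'
"""

print_highlight(json.loads(subprocess.check_output(curl_with_params, shell=True)))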
Using Python Requests#
[3]:
import requests
url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}
response = requests.post(url, json=data)
print_highlight(response.json())
[2025-01-03 02:30:47 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 41, cache hit rate: 46.15%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:30:47] INFO: 127.0.0.1:41422 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{'id': 'e1c5e0ca011f4d0abe223bd28f2825bf', 'object': 'chat.completion', 'created': 1735871447, 'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'The capital of France is Paris.', 'tool_calls': None}, 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 128009}], 'usage': {'prompt_tokens': 42, 'total_tokens': 50, 'completion_tokens': 8, 'prompt_tokens_details': None}}
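When calling the server from application code, it can help to fail fast on connection or HTTP errors. The sketch below is illustrative and uses only standard requests features (a timeout and raise_for_status); the payload mirrors the cell above.
import requests

# Illustrative: same request as above, with a timeout and explicit HTTP error handling.
try:
    response = requests.post(
        "http://localhost:30000/v1/chat/completions",
        json={
            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "messages": [{"role": "user", "content": "What is the capital of France?"}],
            "max_tokens": 32,
        },
        timeout=60,
    )
    response.raise_for_status()  # raise on 4xx/5xx responses
    print_highlight(response.json()["choices"][0]["message"]["content"])
except requests.RequestException as e:
    print_highlight(f"Request failed: {e}")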
Using the OpenAI Python Client#
[4]:
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print_highlight(response)
[2025-01-03 02:30:47 TP0] Prefill batch. #new-seq: 1, #new-token: 13, #cached-token: 30, cache hit rate: 53.73%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:30:47 TP0] Decode batch. #running-req: 1, #token: 60, token usage: 0.00, gen throughput (token/s): 6.24, #queue-req: 0
[2025-01-03 02:30:47] INFO: 127.0.0.1:41430 - "POST /v1/chat/completions HTTP/1.1" 200 OK
ChatCompletion(id='9ec11fbf4d4e42ab9e23db2ea8f633db', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\n\n1. Country: Japan\n Capital: Tokyo\n\n2. Country: Australia\n Capital: Canberra\n\n3. Country: Brazil\n Capital: Brasília', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None), matched_stop=128009)], created=1735871447, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=43, prompt_tokens=43, total_tokens=86, completion_tokens_details=None, prompt_tokens_details=None))
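Because the endpoint is OpenAI-compatible, you can also pass a full chat history, including a system message, exactly as you would with the official OpenAI API. The example below is an illustrative sketch that reuses the client created above; the message content is arbitrary.
# Illustrative: multi-turn chat with a system message, using the same client as above.
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "List 3 countries and their capitals."},
        {"role": "assistant", "content": "Japan: Tokyo. France: Paris. Brazil: Brasília."},
        {"role": "user", "content": "Now list their currencies."},
    ],
    temperature=0,
    max_tokens=64,
)
print_highlight(response.choices[0].message.content)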
Streaming#
[5]:
import openai
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
# Use stream=True for streaming responses
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
    stream=True,
)

# Handle the streaming output
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
[2025-01-03 02:30:47] INFO: 127.0.0.1:41438 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-01-03 02:30:47 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 42, cache hit rate: 64.41%, token usage: 0.00, #running-req: 0, #queue-req: 0
Here are 3 countries and their capitals:
1. Country:[2025-01-03 02:30:48 TP0] Decode batch. #running-req: 1, #token: 57, token usage: 0.00, gen throughput (token/s): 110.83, #queue-req: 0
Japan
Capital: Tokyo
2. Country: Australia
Capital: Canberra
3. Country: Brazil
Capital: Brasília
Using the Native Generation API#
You can also send requests to the native /generate endpoint, which offers more flexibility. An API reference is available at Sampling Parameters.
[6]:
import requests
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print_highlight(response.json())
[2025-01-03 02:30:48 TP0] Prefill batch. #new-seq: 1, #new-token: 3, #cached-token: 3, cache hit rate: 63.93%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:30:48 TP0] Decode batch. #running-req: 1, #token: 17, token usage: 0.00, gen throughput (token/s): 139.38, #queue-req: 0
[2025-01-03 02:30:48] INFO: 127.0.0.1:41442 - "POST /generate HTTP/1.1" 200 OK
{'text': ' a city of romance, art, fashion, and history. Paris is a must-visit destination for anyone who loves culture, architecture, and cuisine. From the', 'meta_info': {'id': '7c968cfc52ac493c819fa569c35100f7', 'finish_reason': {'type': 'length', 'length': 32}, 'prompt_tokens': 6, 'completion_tokens': 32, 'cached_tokens': 3}}
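The sampling_params dictionary accepts more than temperature and max_new_tokens; fields such as top_p and stop are documented on the Sampling Parameters page linked above. The sketch below is illustrative; check that page for the full set supported by your SGLang version.
import requests

# Illustrative: additional sampling parameters on the native /generate endpoint.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0.7,
            "top_p": 0.9,  # nucleus sampling
            "max_new_tokens": 32,
            "stop": ["\n\n"],  # stop generation at a blank line
        },
    },
)
print_highlight(response.json())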
Streaming#
[7]:
import requests, json
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"]
        print(output[prev:], end="", flush=True)
        prev = len(output)
[2025-01-03 02:30:48] INFO: 127.0.0.1:41454 - "POST /generate HTTP/1.1" 200 OK
[2025-01-03 02:30:48 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 5, cache hit rate: 64.55%, token usage: 0.00, #running-req: 0, #queue-req: 0
a city of romance, art, fashion, and cuisine. Paris is a must-visit[2025-01-03 02:30:48 TP0] Decode batch. #running-req: 1, #token: 25, token usage: 0.00, gen throughput (token/s): 138.48, #queue-req: 0
destination for anyone who loves history, architecture, and culture. From the
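Depending on your SGLang version, the native endpoint may also serve several prompts in one call by passing a list under "text"; each prompt then gets its own entry in the returned list. Treat the snippet below as an illustrative sketch and verify the batch behavior against the Sampling Parameters documentation for your version.
import requests

# Illustrative: batch of prompts in a single native /generate call (verify against your SGLang version).
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": [
            "The capital of France is",
            "The capital of Japan is",
        ],
        "sampling_params": {"temperature": 0, "max_new_tokens": 8},
    },
)
for item in response.json():
    print_highlight(item["text"])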
[8]:
terminate_process(server_process)