Structured Outputs (JSON, Regex, EBNF)#
You can specify a JSON schema, a regular expression, or an EBNF grammar to constrain the model output. The model output is guaranteed to follow the given constraints. Only one constraint parameter (json_schema, regex, or ebnf) can be specified per request.
SGLang supports two grammar backends:
Outlines (default): Supports JSON schema and regular expression constraints.
XGrammar: Supports JSON schema and EBNF constraints, and currently uses the GGML BNF format.
We suggest using XGrammar whenever possible because of its better performance. For more details, see the XGrammar technical overview.
To use XGrammar, simply add --grammar-backend xgrammar when launching the server. If no backend is specified, Outlines is used by default.
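For orientation before the examples that follow, here is a schematic sketch (hypothetical prompt and constraint values) of how the three mutually exclusive constraints are attached to a native /generate request; the OpenAI-compatible examples below instead pass the JSON schema through response_format and regex/ebnf through extra_body.
# Schematic sketch (hypothetical values): set exactly ONE of json_schema,
# regex, or ebnf per request; combining them is not allowed.
request_with_json_schema = {
    "text": "Describe the capital of France.",
    "sampling_params": {"json_schema": '{"type": "object", "properties": {"name": {"type": "string"}}}'},
}
request_with_regex = {
    "text": "Describe the capital of France.",
    "sampling_params": {"regex": "(Paris|London)"},
}
request_with_ebnf = {
    "text": "Describe the capital of France.",
    "sampling_params": {"ebnf": 'root ::= "Paris" | "London"'},
}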
OpenAI Compatible API#
[1]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)
import openai

server_process = execute_shell_command(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --host 0.0.0.0 --grammar-backend xgrammar"
)

wait_for_server("http://localhost:30000")
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
[2025-01-03 02:32:51] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, return_token_ids=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=940779991, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2025-01-03 02:33:05 TP0] Init torch distributed begin.
[2025-01-03 02:33:05 TP0] Load weight begin. avail mem=78.81 GB
[2025-01-03 02:33:05 TP0] Ignore import error when loading sglang.srt.models.grok. unsupported operand type(s) for |: 'type' and 'NoneType'
[2025-01-03 02:33:06 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.12it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.04it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.51it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.29it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.27it/s]
[2025-01-03 02:33:10 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.72 GB
[2025-01-03 02:33:10 TP0] Memory pool end. avail mem=8.34 GB
[2025-01-03 02:33:10 TP0] Capture cuda graph begin. This can take up to several minutes.
100%|██████████| 23/23 [00:05<00:00, 3.84it/s]
[2025-01-03 02:33:16 TP0] Capture cuda graph end. Time elapsed: 6.00 s
[2025-01-03 02:33:16 TP0] max_total_num_tokens=444500, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-01-03 02:33:17] INFO: Started server process [4036932]
[2025-01-03 02:33:17] INFO: Waiting for application startup.
[2025-01-03 02:33:17] INFO: Application startup complete.
[2025-01-03 02:33:17] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-01-03 02:33:17] INFO: 127.0.0.1:50658 - "GET /v1/models HTTP/1.1" 200 OK
[2025-01-03 02:33:18] INFO: 127.0.0.1:50664 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-03 02:33:18 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:33:18] INFO: 127.0.0.1:50680 - "POST /generate HTTP/1.1" 200 OK
[2025-01-03 02:33:18] The server is fired up and ready to roll!
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
JSON#
You can directly define a JSON schema or use Pydantic to define and validate the response.
Using Pydantic
[2]:
from pydantic import BaseModel, Field


# Define the schema using Pydantic
class CapitalInfo(BaseModel):
    name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
    population: int = Field(..., description="Population of the capital city")


response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Give me the information of the capital of France in the JSON format.",
        },
    ],
    temperature=0,
    max_tokens=128,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "foo",
            # convert the pydantic model to json schema
            "schema": CapitalInfo.model_json_schema(),
        },
    },
)

response_content = response.choices[0].message.content
# validate the JSON response by the pydantic model
capital_info = CapitalInfo.model_validate_json(response_content)
print_highlight(f"Validated response: {capital_info.model_dump_json()}")
[2025-01-03 02:33:23 TP0] Prefill batch. #new-seq: 1, #new-token: 48, #cached-token: 1, cache hit rate: 1.79%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:33:23] INFO: 127.0.0.1:51206 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Validated response: {"name":"Paris","population":2147000}
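If generation is ever cut short by max_tokens, the returned content may fail schema validation; a minimal defensive sketch using only the Pydantic API already imported above:
# A hedged sketch: catch validation failures instead of letting them propagate.
from pydantic import ValidationError

try:
    capital_info = CapitalInfo.model_validate_json(response_content)
    print_highlight(f"Validated response: {capital_info.model_dump_json()}")
except ValidationError as exc:
    print_highlight(f"Response did not match the schema: {exc}")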
JSON Schema Directly
[3]:
import json

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Give me the information of the capital of France in the JSON format.",
        },
    ],
    temperature=0,
    max_tokens=128,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "foo", "schema": json.loads(json_schema)},
    },
)

print_highlight(response.choices[0].message.content)
[2025-01-03 02:33:23 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 48, cache hit rate: 46.67%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:33:23] INFO: 127.0.0.1:51206 - "POST /v1/chat/completions HTTP/1.1" 200 OK
{"name": "Paris", "population": 2147000}
EBNF#
[4]:
ebnf_grammar = """
root ::= city | description
city ::= "London" | "Paris" | "Berlin" | "Rome"
description ::= city " is " status
status ::= "the capital of " country
country ::= "England" | "France" | "Germany" | "Italy"
"""

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful geography bot."},
        {
            "role": "user",
            "content": "Give me the information of the capital of France.",
        },
    ],
    temperature=0,
    max_tokens=32,
    extra_body={"ebnf": ebnf_grammar},
)

print_highlight(response.choices[0].message.content)
[2025-01-03 02:33:23 TP0] Prefill batch. #new-seq: 1, #new-token: 27, #cached-token: 25, cache hit rate: 47.13%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:33:23 TP0] Decode batch. #running-req: 1, #token: 55, token usage: 0.00, gen throughput (token/s): 5.89, #queue-req: 0
[2025-01-03 02:33:23] INFO: 127.0.0.1:51206 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Paris is the capital of France
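The grammar above only fixes the surface form of the answer. As a further illustration, a hedged sketch with a hypothetical grammar that admits nothing but a yes/no answer, sent to the same xgrammar-backed server:
# A hedged sketch (hypothetical grammar): restrict the answer to "yes" or "no".
yes_no_grammar = 'root ::= "yes" | "no"'

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Is Paris the capital of France?"},
    ],
    temperature=0,
    max_tokens=4,
    extra_body={"ebnf": yes_no_grammar},
)
print_highlight(response.choices[0].message.content)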
[5]:
terminate_process(server_process)

server_process = execute_shell_command(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --host 0.0.0.0"
)

wait_for_server("http://localhost:30000")
[2025-01-03 02:33:32] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, return_token_ids=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=731246754, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2025-01-03 02:33:46 TP0] Init torch distributed begin.
[2025-01-03 02:33:46 TP0] Load weight begin. avail mem=78.81 GB
[2025-01-03 02:33:46 TP0] Ignore import error when loading sglang.srt.models.grok. unsupported operand type(s) for |: 'type' and 'NoneType'
[2025-01-03 02:33:47 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.16it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.08it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.57it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.34it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.32it/s]
[2025-01-03 02:33:50 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.72 GB
[2025-01-03 02:33:50 TP0] Memory pool end. avail mem=8.34 GB
[2025-01-03 02:33:50 TP0] Capture cuda graph begin. This can take up to several minutes.
100%|██████████| 23/23 [00:05<00:00, 4.51it/s]
[2025-01-03 02:33:56 TP0] Capture cuda graph end. Time elapsed: 5.10 s
[2025-01-03 02:33:56 TP0] max_total_num_tokens=444500, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-01-03 02:33:56] INFO: Started server process [4037907]
[2025-01-03 02:33:56] INFO: Waiting for application startup.
[2025-01-03 02:33:56] INFO: Application startup complete.
[2025-01-03 02:33:56] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-01-03 02:33:57] INFO: 127.0.0.1:60076 - "GET /v1/models HTTP/1.1" 200 OK
[2025-01-03 02:33:57] INFO: 127.0.0.1:60086 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-03 02:33:57 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:33:58] INFO: 127.0.0.1:60094 - "POST /generate HTTP/1.1" 200 OK
[2025-01-03 02:33:58] The server is fired up and ready to roll!
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
Regular Expression#
[6]:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    max_tokens=128,
    extra_body={"regex": "(Paris|London)"},
)
print_highlight(response.choices[0].message.content)
[2025-01-03 02:34:02 TP0] Prefill batch. #new-seq: 1, #new-token: 41, #cached-token: 1, cache hit rate: 2.04%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:34:02] INFO: 127.0.0.1:35774 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Paris
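A regular expression can also enforce a format rather than a fixed word list. A hedged sketch with an illustrative pattern that forces a YYYY-MM-DD shaped answer (the server relaunched above uses the default Outlines backend, which supports regex):
# A hedged sketch (illustrative pattern): constrain the answer to a date shape.
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "On what date did Apollo 11 land on the Moon?"},
    ],
    temperature=0,
    max_tokens=16,
    extra_body={"regex": r"\d{4}-\d{2}-\d{2}"},
)
print_highlight(response.choices[0].message.content)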
[7]:
terminate_process(server_process)
Native API and SGLang Runtime (SRT)#
[8]:
server_process = execute_shell_command(
    """
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --port=30010 --grammar-backend xgrammar
"""
)

wait_for_server("http://localhost:30010")
[2025-01-03 02:34:10] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.2-1B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, return_token_ids=False, host='127.0.0.1', port=30010, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=217284190, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2025-01-03 02:34:23 TP0] Init torch distributed begin.
[2025-01-03 02:34:23 TP0] Load weight begin. avail mem=78.81 GB
[2025-01-03 02:34:23 TP0] Ignore import error when loading sglang.srt.models.grok. unsupported operand type(s) for |: 'type' and 'NoneType'
[2025-01-03 02:34:23 TP0] Using model weights format ['*.safetensors']
[2025-01-03 02:34:24 TP0] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.79it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.79it/s]
[2025-01-03 02:34:24 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=76.39 GB
[2025-01-03 02:34:24 TP0] Memory pool end. avail mem=7.45 GB
[2025-01-03 02:34:24 TP0] Capture cuda graph begin. This can take up to several minutes.
100%|██████████| 23/23 [00:05<00:00, 4.24it/s]
[2025-01-03 02:34:30 TP0] Capture cuda graph end. Time elapsed: 5.44 s
[2025-01-03 02:34:30 TP0] max_total_num_tokens=2193171, max_prefill_tokens=16384, max_running_requests=4097, context_len=131072
[2025-01-03 02:34:30] INFO: Started server process [4038858]
[2025-01-03 02:34:30] INFO: Waiting for application startup.
[2025-01-03 02:34:30] INFO: Application startup complete.
[2025-01-03 02:34:30] INFO: Uvicorn running on http://127.0.0.1:30010 (Press CTRL+C to quit)
[2025-01-03 02:34:31] INFO: 127.0.0.1:59556 - "GET /v1/models HTTP/1.1" 200 OK
[2025-01-03 02:34:31] INFO: 127.0.0.1:59566 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-03 02:34:31 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:34:32] INFO: 127.0.0.1:59576 - "POST /generate HTTP/1.1" 200 OK
[2025-01-03 02:34:32] The server is fired up and ready to roll!
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
JSON#
Using Pydantic
[9]:
import requests
import json
from pydantic import BaseModel, Field


# Define the schema using Pydantic
class CapitalInfo(BaseModel):
    name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
    population: int = Field(..., description="Population of the capital city")


# Make API request
response = requests.post(
    "http://localhost:30010/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json.dumps(CapitalInfo.model_json_schema()),
        },
    },
)
print_highlight(response.json())

response_data = json.loads(response.json()["text"])
# validate the response by the pydantic model
capital_info = CapitalInfo.model_validate(response_data)
print_highlight(f"Validated response: {capital_info.model_dump_json()}")
[2025-01-03 02:34:36 TP0] Prefill batch. #new-seq: 1, #new-token: 14, #cached-token: 1, cache hit rate: 4.55%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:34:36] INFO: 127.0.0.1:59580 - "POST /generate HTTP/1.1" 200 OK
{'text': '{\n "name": "Paris",\n "population": 2140000\n}', 'meta_info': {'id': '1e25ac54d7734087a2e2495def503eaf', 'finish_reason': {'type': 'stop', 'matched': 128009}, 'prompt_tokens': 15, 'completion_tokens': 19, 'cached_tokens': 1}}
Validated response: {"name":"Paris","population":2140000}
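The native API response also carries a meta_info block (visible in the output above); a small sketch that inspects it before trusting the parsed JSON:
# A minimal sketch: check why generation stopped and how many tokens were used.
meta_info = response.json()["meta_info"]
print_highlight(
    f"finish_reason={meta_info['finish_reason']['type']}, "
    f"completion_tokens={meta_info['completion_tokens']}"
)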
JSON Schema Directly
[10]:
json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

# JSON
response = requests.post(
    "http://localhost:30010/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json_schema,
        },
    },
)

print_highlight(response.json())
[2025-01-03 02:34:36 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 14, cache hit rate: 40.54%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:34:36 TP0] Decode batch. #running-req: 1, #token: 29, token usage: 0.00, gen throughput (token/s): 6.12, #queue-req: 0
[2025-01-03 02:34:36] INFO: 127.0.0.1:59582 - "POST /generate HTTP/1.1" 200 OK
{'text': '{\n "name": "Paris",\n "population": 2140000\n}', 'meta_info': {'id': '5531f7cc3bf74ffaaaacfdc30dff37da', 'finish_reason': {'type': 'stop', 'matched': 128009}, 'prompt_tokens': 15, 'completion_tokens': 19, 'cached_tokens': 14}}
EBNF#
[11]:
import requests

response = requests.post(
    "http://localhost:30010/generate",
    json={
        "text": "Give me the information of the capital of France.",
        "sampling_params": {
            "max_new_tokens": 128,
            "temperature": 0,
            "n": 3,
            "ebnf": (
                "root ::= city | description\n"
                'city ::= "London" | "Paris" | "Berlin" | "Rome"\n'
                'description ::= city " is " status\n'
                'status ::= "the capital of " country\n'
                'country ::= "England" | "France" | "Germany" | "Italy"'
            ),
        },
        "stream": False,
        "return_logprob": False,
    },
)

print_highlight(response.json())
[2025-01-03 02:34:37 TP0] Prefill batch. #new-seq: 1, #new-token: 10, #cached-token: 1, cache hit rate: 33.33%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:34:37 TP0] Prefill batch. #new-seq: 3, #new-token: 3, #cached-token: 30, cache hit rate: 56.79%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:34:37] INFO: 127.0.0.1:59584 - "POST /generate HTTP/1.1" 200 OK
[{'text': 'Paris is the capital of France', 'meta_info': {'id': '6eef5b4e55bc428f9c40d83219027c50', 'finish_reason': {'type': 'stop', 'matched': 128009}, 'prompt_tokens': 11, 'completion_tokens': 7, 'cached_tokens': 10}}, {'text': 'Paris is the capital of France', 'meta_info': {'id': 'fa8b5c1dece5457992f3da83e277bd6d', 'finish_reason': {'type': 'stop', 'matched': 128009}, 'prompt_tokens': 11, 'completion_tokens': 7, 'cached_tokens': 10}}, {'text': 'Paris is the capital of France', 'meta_info': {'id': '19c875de0a5543899cb6548833b0a617', 'finish_reason': {'type': 'stop', 'matched': 128009}, 'prompt_tokens': 11, 'completion_tokens': 7, 'cached_tokens': 10}}]
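Because the request set "n": 3, the response body is a list with one entry per sample; a minimal sketch for iterating over it:
# A minimal sketch: /generate returns a list of results when n > 1.
for i, sample in enumerate(response.json()):
    print_highlight(f"Sample {i}: {sample['text']}")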
[12]:
terminate_process(server_process)
server_process = execute_shell_command(
"""
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --port=30010
"""
)
wait_for_server("http://localhost:30010")
[2025-01-03 02:34:44] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.2-1B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, return_token_ids=False, host='127.0.0.1', port=30010, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=671209329, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2025-01-03 02:34:58 TP0] Init torch distributed begin.
[2025-01-03 02:34:58 TP0] Load weight begin. avail mem=78.81 GB
[2025-01-03 02:34:58 TP0] Ignore import error when loading sglang.srt.models.grok. unsupported operand type(s) for |: 'type' and 'NoneType'
[2025-01-03 02:34:59 TP0] Using model weights format ['*.safetensors']
[2025-01-03 02:34:59 TP0] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.79it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.79it/s]
[2025-01-03 02:34:59 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=76.39 GB
[2025-01-03 02:34:59 TP0] Memory pool end. avail mem=7.45 GB
[2025-01-03 02:34:59 TP0] Capture cuda graph begin. This can take up to several minutes.
100%|██████████| 23/23 [00:06<00:00, 3.74it/s]
[2025-01-03 02:35:06 TP0] Capture cuda graph end. Time elapsed: 6.15 s
[2025-01-03 02:35:06 TP0] max_total_num_tokens=2193171, max_prefill_tokens=16384, max_running_requests=4097, context_len=131072
[2025-01-03 02:35:06] INFO: Started server process [4039834]
[2025-01-03 02:35:06] INFO: Waiting for application startup.
[2025-01-03 02:35:06] INFO: Application startup complete.
[2025-01-03 02:35:06] INFO: Uvicorn running on http://127.0.0.1:30010 (Press CTRL+C to quit)
[2025-01-03 02:35:07] INFO: 127.0.0.1:59482 - "GET /v1/models HTTP/1.1" 200 OK
[2025-01-03 02:35:07] INFO: 127.0.0.1:59492 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-03 02:35:07 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:35:08] INFO: 127.0.0.1:59498 - "POST /generate HTTP/1.1" 200 OK
[2025-01-03 02:35:08] The server is fired up and ready to roll!
NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
Regular Expression#
[13]:
response = requests.post(
    "http://localhost:30010/generate",
    json={
        "text": "Paris is the capital of",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "regex": "(France|England)",
        },
    },
)
print_highlight(response.json())
[2025-01-03 02:35:12 TP0] Prefill batch. #new-seq: 1, #new-token: 5, #cached-token: 1, cache hit rate: 7.69%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:35:12] INFO: 127.0.0.1:42456 - "POST /generate HTTP/1.1" 200 OK
{'text': 'France', 'meta_info': {'id': '5c9c42ca4bbd4593b9333a902784648f', 'finish_reason': {'type': 'stop', 'matched': 128009}, 'prompt_tokens': 6, 'completion_tokens': 2, 'cached_tokens': 1}}
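As with the OpenAI-compatible API, the regex can describe a format instead of a fixed word list; a hedged sketch against the same /generate endpoint, with an illustrative pattern that constrains the continuation to a four-digit year:
# A hedged sketch (illustrative pattern): force a four-digit-year completion.
response = requests.post(
    "http://localhost:30010/generate",
    json={
        "text": "The Eiffel Tower was completed in the year",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 8,
            "regex": "[0-9]{4}",
        },
    },
)
print_highlight(response.json()["text"])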
[14]:
terminate_process(server_process)
Offline Engine API#
[15]:
import sglang as sgl

llm_xgrammar = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", grammar_backend="xgrammar"
)
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.16it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.07it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.56it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.32it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.30it/s]
100%|██████████| 23/23 [00:06<00:00, 3.46it/s]
JSON#
Using Pydantic
[16]:
import json
from pydantic import BaseModel, Field

prompts = [
    "Give me the information of the capital of China in the JSON format.",
    "Give me the information of the capital of France in the JSON format.",
    "Give me the information of the capital of Ireland in the JSON format.",
]


# Define the schema using Pydantic
class CapitalInfo(BaseModel):
    name: str = Field(..., pattern=r"^\w+$", description="Name of the capital city")
    population: int = Field(..., description="Population of the capital city")


sampling_params = {
    "temperature": 0.1,
    "top_p": 0.95,
    "json_schema": json.dumps(CapitalInfo.model_json_schema()),
}

outputs = llm_xgrammar.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}")  # validate the output by the pydantic model
    capital_info = CapitalInfo.model_validate_json(output["text"])
    print_highlight(f"Validated output: {capital_info.model_dump_json()}")
===============================
Prompt: Give me the information of the capital of China in the JSON format.
Validated output: {"name":"Beijing","population":21500000}
===============================
Prompt: Give me the information of the capital of France in the JSON format.
Validated output: {"name":"Paris","population":2141000}
===============================
Prompt: Give me the information of the capital of Ireland in the JSON format.
Validated output: {"name":"Dublin","population":527617}
JSON Schema Directly
[17]:
prompts = [
    "Give me the information of the capital of China in the JSON format.",
    "Give me the information of the capital of France in the JSON format.",
    "Give me the information of the capital of Ireland in the JSON format.",
]

json_schema = json.dumps(
    {
        "type": "object",
        "properties": {
            "name": {"type": "string", "pattern": "^[\\w]+$"},
            "population": {"type": "integer"},
        },
        "required": ["name", "population"],
    }
)

sampling_params = {"temperature": 0.1, "top_p": 0.95, "json_schema": json_schema}

outputs = llm_xgrammar.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
===============================
Prompt: Give me the information of the capital of China in the JSON format.
Generated text: {"name": "Beijing", "population": 21500000}
Generated text: {"name": "Beijing", "population": 21500000}
===============================
Prompt: Give me the information of the capital of France in the JSON format.
Generated text: {"name": "Paris", "population": 2141000}
Generated text: {"name": "Paris", "population": 2141000}
===============================
Prompt: Give me the information of the capital of Ireland in the JSON format.
Generated text: {"name": "Dublin", "population": 527617}
Generated text: {"name": "Dublin", "population": 527617}
EBNF#
[18]:
prompts = [
    "Give me the information of the capital of France.",
    "Give me the information of the capital of Germany.",
    "Give me the information of the capital of Italy.",
]

sampling_params = {
    "temperature": 0.8,
    "top_p": 0.95,
    "ebnf": (
        "root ::= city | description\n"
        'city ::= "London" | "Paris" | "Berlin" | "Rome"\n'
        'description ::= city " is " status\n'
        'status ::= "the capital of " country\n'
        'country ::= "England" | "France" | "Germany" | "Italy"'
    ),
}

outputs = llm_xgrammar.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
===============================
Prompt: Give me the information of the capital of France.
Generated text: Paris is the capital of France
===============================
Prompt: Give me the information of the capital of Germany.
Generated text: Berlin is the capital of Germany
===============================
Prompt: Give me the information of the capital of Italy.
Generated text: London is the capital of Italy
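Note that the grammar constrains only the form of the output, not its factual accuracy, which is why "London is the capital of Italy" is a legal production above. A hedged sketch of a stricter, hypothetical grammar that hard-codes the valid city/country pairs:
# A hedged sketch (hypothetical grammar): only factually consistent sentences
# are derivable, so mismatched pairs can no longer be produced.
strict_ebnf = (
    'root ::= "London is the capital of England" | '
    '"Paris is the capital of France" | '
    '"Berlin is the capital of Germany" | '
    '"Rome is the capital of Italy"'
)
outputs = llm_xgrammar.generate(
    prompts, {"temperature": 0.8, "top_p": 0.95, "ebnf": strict_ebnf}
)
for prompt, output in zip(prompts, outputs):
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")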
[19]:
llm_xgrammar.shutdown()
llm_outlines = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/actions-runner/_work/_tool/Python/3.9.21/x64/lib/python3.9/multiprocessing/resource_tracker.py:96: UserWarning: resource_tracker: process died unexpectedly, relaunching. Some resources might leak.
warnings.warn('resource_tracker: process died unexpectedly, '
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.14it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.04it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.51it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.27it/s]
100%|██████████| 23/23 [00:04<00:00, 4.66it/s]
Regular Expression#
[20]:
prompts = [
    "Please provide information about London as a major global city:",
    "Please provide information about Paris as a major global city:",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "regex": "(France|England)"}

outputs = llm_outlines.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print_highlight("===============================")
    print_highlight(f"Prompt: {prompt}\nGenerated text: {output['text']}")
===============================
Prompt: Please provide information about London as a major global city:
Generated text: England
===============================
Prompt: Please provide information about Paris as a major global city:
Generated text: France
[21]:
llm_outlines.shutdown()