Native APIs#

Apart from the OpenAI-compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce the following APIs:

  • /generate (text generation model)

  • /get_model_info

  • /get_server_info

  • /health

  • /health_generate

  • /flush_cache

  • /update_weights

  • /encode (embedding model)

  • /classify (reward model)

We mainly use requests to test these APIs in the following examples. You can also use curl.

Launch A Server#

[1]:
from sglang.utils import (
    execute_shell_command,
    wait_for_server,
    terminate_process,
    print_highlight,
)

import requests

server_process = execute_shell_command(
    """
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --port=30010
"""
)

wait_for_server("http://localhost:30010")
[2025-01-03 02:39:52] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.2-1B-Instruct', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, return_token_ids=False, host='127.0.0.1', port=30010, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=864522048, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2025-01-03 02:40:05 TP0] Init torch distributed begin.
[2025-01-03 02:40:05 TP0] Load weight begin. avail mem=78.81 GB
[2025-01-03 02:40:06 TP0] Ignore import error when loading sglang.srt.models.grok. unsupported operand type(s) for |: 'type' and 'NoneType'
[2025-01-03 02:40:06 TP0] Using model weights format ['*.safetensors']
[2025-01-03 02:40:06 TP0] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.13it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.13it/s]

[2025-01-03 02:40:07 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=76.39 GB
[2025-01-03 02:40:07 TP0] Memory pool end. avail mem=7.45 GB
[2025-01-03 02:40:07 TP0] Capture cuda graph begin. This can take up to several minutes.
100%|██████████| 23/23 [00:05<00:00,  3.86it/s]
[2025-01-03 02:40:13 TP0] Capture cuda graph end. Time elapsed: 5.96 s
[2025-01-03 02:40:13 TP0] max_total_num_tokens=2193171, max_prefill_tokens=16384, max_running_requests=4097, context_len=131072
[2025-01-03 02:40:13] INFO:     Started server process [4046262]
[2025-01-03 02:40:13] INFO:     Waiting for application startup.
[2025-01-03 02:40:13] INFO:     Application startup complete.
[2025-01-03 02:40:13] INFO:     Uvicorn running on http://127.0.0.1:30010 (Press CTRL+C to quit)
[2025-01-03 02:40:14] INFO:     127.0.0.1:49750 - "GET /v1/models HTTP/1.1" 200 OK
[2025-01-03 02:40:14] INFO:     127.0.0.1:49758 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-03 02:40:14 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:40:15] INFO:     127.0.0.1:49762 - "POST /generate HTTP/1.1" 200 OK
[2025-01-03 02:40:15] The server is fired up and ready to roll!


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.

Generate (text generation model)#

Generate completions. This is similar to /v1/completions in the OpenAI API. Detailed parameters can be found in the sampling parameters documentation.

[2]:
url = "http://localhost:30010/generate"
data = {"text": "What is the capital of France?"}

response = requests.post(url, json=data)
print_highlight(response.json())
[2025-01-03 02:40:19 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 1, cache hit rate: 6.67%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:40:19 TP0] Decode batch. #running-req: 1, #token: 41, token usage: 0.00, gen throughput (token/s): 6.96, #queue-req: 0
[2025-01-03 02:40:19 TP0] Decode batch. #running-req: 1, #token: 81, token usage: 0.00, gen throughput (token/s): 561.64, #queue-req: 0
[2025-01-03 02:40:19] INFO:     127.0.0.1:56122 - "POST /generate HTTP/1.1" 200 OK
{'text': " Paris\nWhat is the capital of Canada? Ottawa\n\nFirstly, I couldn't verify where Paris is located. The main city of France (AE2004) is actually Montpellier, a major city in the Occitanie region. West is Damascus and South is Algiers. \n\nSecondly, you require Ottawa to provide the location, which should be seen as the major capital, in this case, Canada.", 'meta_info': {'id': '7d1adabbd42a4cab8a0920dc0b061059', 'finish_reason': {'type': 'stop', 'matched': 128009}, 'prompt_tokens': 8, 'completion_tokens': 87, 'cached_tokens': 1}}

Get Model Info#

Get the information of the model.

  • model_path: The path/name of the model.

  • is_generation: Whether the model is used as a generation model or an embedding model.

  • tokenizer_path: The path/name of the tokenizer.

[3]:
url = "http://localhost:30010/get_model_info"

response = requests.get(url)
response_json = response.json()
print_highlight(response_json)
assert response_json["model_path"] == "meta-llama/Llama-3.2-1B-Instruct"
assert response_json["is_generation"] is True
assert response_json["tokenizer_path"] == "meta-llama/Llama-3.2-1B-Instruct"
assert response_json.keys() == {"model_path", "is_generation", "tokenizer_path"}
[2025-01-03 02:40:19] INFO:     127.0.0.1:56138 - "GET /get_model_info HTTP/1.1" 200 OK
{'model_path': 'meta-llama/Llama-3.2-1B-Instruct', 'tokenizer_path': 'meta-llama/Llama-3.2-1B-Instruct', 'is_generation': True}

Get Server Info#

Get the information of the server, including CLI arguments, token limits, and memory pool size.

  • Note: get_server_info merges the following deprecated endpoints:

    • get_server_args

    • get_memory_pool_size

    • get_max_total_num_tokens

[4]:
# get_server_info

url = "http://localhost:30010/get_server_info"

response = requests.get(url)
print_highlight(response.text)
[2025-01-03 02:40:19] INFO:     127.0.0.1:56148 - "GET /get_server_info HTTP/1.1" 200 OK
{"model_path":"meta-llama/Llama-3.2-1B-Instruct","tokenizer_path":"meta-llama/Llama-3.2-1B-Instruct","tokenizer_mode":"auto","load_format":"auto","trust_remote_code":false,"dtype":"auto","kv_cache_dtype":"auto","quantization":null,"context_length":null,"device":"cuda","served_model_name":"meta-llama/Llama-3.2-1B-Instruct","chat_template":null,"is_embedding":false,"revision":null,"skip_tokenizer_init":false,"return_token_ids":false,"host":"127.0.0.1","port":30010,"mem_fraction_static":0.88,"max_running_requests":null,"max_total_tokens":null,"chunked_prefill_size":8192,"max_prefill_tokens":16384,"schedule_policy":"lpm","schedule_conservativeness":1.0,"cpu_offload_gb":0,"prefill_only_one_req":false,"tp_size":1,"stream_interval":1,"random_seed":864522048,"constrained_json_whitespace_pattern":null,"watchdog_timeout":300,"download_dir":null,"base_gpu_id":0,"log_level":"info","log_level_http":null,"log_requests":false,"show_time_cost":false,"enable_metrics":false,"decode_log_interval":40,"api_key":null,"file_storage_pth":"SGLang_storage","enable_cache_report":false,"dp_size":1,"load_balance_method":"round_robin","ep_size":1,"dist_init_addr":null,"nnodes":1,"node_rank":0,"json_model_override_args":"{}","lora_paths":null,"max_loras_per_batch":8,"attention_backend":"flashinfer","sampling_backend":"flashinfer","grammar_backend":"outlines","speculative_draft_model_path":null,"speculative_algorithm":null,"speculative_num_steps":5,"speculative_num_draft_tokens":64,"speculative_eagle_topk":8,"enable_double_sparsity":false,"ds_channel_config_path":null,"ds_heavy_channel_num":32,"ds_heavy_token_num":256,"ds_heavy_channel_type":"qk","ds_sparse_decode_threshold":4096,"disable_radix_cache":false,"disable_jump_forward":false,"disable_cuda_graph":false,"disable_cuda_graph_padding":false,"disable_outlines_disk_cache":false,"disable_custom_all_reduce":false,"disable_mla":false,"disable_overlap_schedule":false,"enable_mixed_chunk":false,"enable_dp_attention":false,"enable_ep_moe":false,"enable_torch_compile":false,"torch_compile_max_bs":32,"cuda_graph_max_bs":160,"torchao_config":"","enable_nan_detection":false,"enable_p2p_check":false,"triton_attention_reduce_in_fp32":false,"triton_attention_num_kv_splits":8,"num_continuous_decode_steps":1,"delete_ckpt_after_loading":false,"status":"ready","max_total_num_tokens":2193171,"version":"0.4.1.post3"}

Health Check#

  • /health: Check the health of the server.

  • /health_generate: Check the health of the server by generating one token.

[5]:
url = "http://localhost:30010/health_generate"

response = requests.get(url)
print_highlight(response.text)
[2025-01-03 02:40:19 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, cache hit rate: 6.25%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:40:19] INFO:     127.0.0.1:56158 - "GET /health_generate HTTP/1.1" 200 OK
[6]:
url = "http://localhost:30010/health"

response = requests.get(url)
print_highlight(response.text)
[2025-01-03 02:40:19] INFO:     127.0.0.1:56172 - "GET /health HTTP/1.1" 200 OK
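When scripting against a freshly launched server, /health can also be polled until the server is ready. A minimal readiness-polling sketch; the helper name and timeout are illustrative, not part of SGLang (the bundled wait_for_server utility serves the same purpose):

import time

def wait_until_healthy(base_url, timeout_s=120.0):
    # Poll /health until it returns 200 or the timeout expires.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return
        except requests.RequestException:
            pass
        time.sleep(1)
    raise TimeoutError(f"{base_url} did not become healthy within {timeout_s}s")

wait_until_healthy("http://localhost:30010")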

Flush Cache#

Flush the radix cache. It will be automatically triggered when the model weights are updated by the /update_weights API.

[7]:
# flush cache

url = "http://localhost:30010/flush_cache"

response = requests.post(url)
print_highlight(response.text)
[2025-01-03 02:40:19] INFO:     127.0.0.1:56186 - "POST /flush_cache HTTP/1.1" 200 OK
Cache flushed.
Please check backend logs for more details. (When there are running or waiting requests, the operation will not be performed.)
[2025-01-03 02:40:19 TP0] Cache flushed successfully!

Update Weights From Disk#

Update model weights from disk without restarting the server. Only applicable to models with the same architecture and parameter size.

SGLang supports the update_weights_from_disk API for continuous evaluation during training (save checkpoints to disk and update model weights from disk).

[8]:
# successful update with same architecture and size

url = "http://localhost:30010/update_weights_from_disk"
data = {"model_path": "meta-llama/Llama-3.2-1B"}

response = requests.post(url, json=data)
print_highlight(response.text)
assert response.json()["success"] is True
assert response.json()["message"] == "Succeeded to update model weights."
assert response.json().keys() == {"success", "message"}
[2025-01-03 02:40:19] Start update_weights. Load format=auto
[2025-01-03 02:40:19 TP0] Update engine weights online from disk begin. avail mem=5.17 GB
[2025-01-03 02:40:19 TP0] Using model weights format ['*.safetensors']
[2025-01-03 02:40:19 TP0] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.34it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.33it/s]

[2025-01-03 02:40:20 TP0] Update weights end.
[2025-01-03 02:40:20 TP0] Cache flushed successfully!
[2025-01-03 02:40:20] INFO:     127.0.0.1:56200 - "POST /update_weights_from_disk HTTP/1.1" 200 OK
{"success":true,"message":"Succeeded to update model weights."}
[9]:
# failed update with different parameter size or wrong name

url = "http://localhost:30010/update_weights_from_disk"
data = {"model_path": "meta-llama/Llama-3.2-1B-wrong"}

response = requests.post(url, json=data)
response_json = response.json()
print_highlight(response_json)
assert response_json["success"] is False
assert response_json["message"] == (
    "Failed to get weights iterator: "
    "meta-llama/Llama-3.2-1B-wrong"
    " (repository not found)."
)
[2025-01-03 02:40:20] Start update_weights. Load format=auto
[2025-01-03 02:40:20 TP0] Update engine weights online from disk begin. avail mem=5.17 GB
[2025-01-03 02:40:20 TP0] Failed to get weights iterator: meta-llama/Llama-3.2-1B-wrong (repository not found).
[2025-01-03 02:40:20] INFO:     127.0.0.1:56210 - "POST /update_weights_from_disk HTTP/1.1" 400 Bad Request
{'success': False, 'message': 'Failed to get weights iterator: meta-llama/Llama-3.2-1B-wrong (repository not found).'}
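Putting the two calls above together, the in-training evaluation loop mentioned earlier can be sketched as follows; the checkpoint path is a placeholder, and the short /generate call is only a smoke test of the freshly loaded weights:

def evaluate_checkpoint(checkpoint_dir):
    base = "http://localhost:30010"
    # Point the running server at the new checkpoint on disk.
    resp = requests.post(
        f"{base}/update_weights_from_disk", json={"model_path": checkpoint_dir}
    ).json()
    assert resp["success"], resp["message"]
    # Smoke-test the freshly loaded weights with a short generation.
    out = requests.post(f"{base}/generate", json={"text": "The capital of France is"})
    return out.json()["text"]

# evaluate_checkpoint("/path/to/checkpoint-1000")  # placeholder path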

Encode (embedding model)#

Encode text into embeddings. Note that this API is only available for embedding models and will raise an error for generation models. Therefore, we launch a new server to serve an embedding model.

[10]:
terminate_process(server_process)

embedding_process = execute_shell_command(
    """
python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct \
    --port 30020 --host 0.0.0.0 --is-embedding
"""
)

wait_for_server("http://localhost:30020")
[2025-01-03 02:40:29] server_args=ServerArgs(model_path='Alibaba-NLP/gte-Qwen2-7B-instruct', tokenizer_path='Alibaba-NLP/gte-Qwen2-7B-instruct', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='Alibaba-NLP/gte-Qwen2-7B-instruct', chat_template=None, is_embedding=True, revision=None, skip_tokenizer_init=False, return_token_ids=False, host='0.0.0.0', port=30020, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=125125513, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2025-01-03 02:40:34] Downcasting torch.float32 to torch.float16.
[2025-01-03 02:40:42 TP0] Downcasting torch.float32 to torch.float16.
[2025-01-03 02:40:42 TP0] Overlap scheduler is disabled for embedding models.
[2025-01-03 02:40:42 TP0] Downcasting torch.float32 to torch.float16.
[2025-01-03 02:40:42 TP0] Init torch distributed begin.
[2025-01-03 02:40:42 TP0] Load weight begin. avail mem=78.81 GB
[2025-01-03 02:40:42 TP0] Ignore import error when loading sglang.srt.models.grok. unsupported operand type(s) for |: 'type' and 'NoneType'
[2025-01-03 02:40:43 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/7 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  14% Completed | 1/7 [00:00<00:04,  1.29it/s]
Loading safetensors checkpoint shards:  29% Completed | 2/7 [00:01<00:05,  1.04s/it]
Loading safetensors checkpoint shards:  43% Completed | 3/7 [00:03<00:05,  1.34s/it]
Loading safetensors checkpoint shards:  57% Completed | 4/7 [00:05<00:04,  1.52s/it]
Loading safetensors checkpoint shards:  71% Completed | 5/7 [00:07<00:03,  1.61s/it]
Loading safetensors checkpoint shards:  86% Completed | 6/7 [00:08<00:01,  1.63s/it]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:10<00:00,  1.71s/it]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:10<00:00,  1.54s/it]

[2025-01-03 02:40:54 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=64.40 GB
[2025-01-03 02:40:54 TP0] Memory pool end. avail mem=7.42 GB
[2025-01-03 02:40:54 TP0] max_total_num_tokens=1028801, max_prefill_tokens=16384, max_running_requests=4019, context_len=131072
[2025-01-03 02:40:54] INFO:     Started server process [4047210]
[2025-01-03 02:40:54] INFO:     Waiting for application startup.
[2025-01-03 02:40:54] INFO:     Application startup complete.
[2025-01-03 02:40:54] INFO:     Uvicorn running on http://0.0.0.0:30020 (Press CTRL+C to quit)
[2025-01-03 02:40:55] INFO:     127.0.0.1:38224 - "GET /v1/models HTTP/1.1" 200 OK
[2025-01-03 02:40:55] INFO:     127.0.0.1:38228 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-03 02:40:55 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:40:56] INFO:     127.0.0.1:38232 - "POST /encode HTTP/1.1" 200 OK
[2025-01-03 02:40:56] The server is fired up and ready to roll!


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
[11]:
# successful encode for embedding model

url = "http://localhost:30020/encode"
data = {"model": "Alibaba-NLP/gte-Qwen2-7B-instruct", "text": "Once upon a time"}

response = requests.post(url, json=data)
response_json = response.json()
print_highlight(f"Text embedding (first 10): {response_json['embedding'][:10]}")
[2025-01-03 02:41:00 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:41:00] INFO:     127.0.0.1:50778 - "POST /encode HTTP/1.1" 200 OK
Text embedding (first 10): [0.00830841064453125, 0.0006804466247558594, -0.00807952880859375, -0.000682830810546875, 0.01438140869140625, -0.009002685546875, 0.01239013671875, 0.0020999908447265625, 0.006214141845703125, -0.0030345916748046875]
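Embeddings are typically compared with cosine similarity. A minimal sketch, assuming /encode accepts a list of texts and returns one result per text, mirroring the batched /classify call in the next section; if your build only accepts a single string, send one request per text instead:

import math

url = "http://localhost:30020/encode"
texts = ["Once upon a time", "A long time ago"]
data = {"model": "Alibaba-NLP/gte-Qwen2-7B-instruct", "text": texts}

results = requests.post(url, json=data).json()
vec_a, vec_b = results[0]["embedding"], results[1]["embedding"]

dot = sum(a * b for a, b in zip(vec_a, vec_b))
norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
print_highlight(f"cosine similarity: {dot / norm:.4f}")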

Classify (reward model)#

SGLang Runtime also supports reward models. Here we use a reward model to classify the quality of pairwise generations.

[12]:
terminate_process(embedding_process)

# Note that SGLang now treats embedding models and reward models as the same type of models.
# This will be updated in the future.

reward_process = execute_shell_command(
    """
python -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --port 30030 --host 0.0.0.0 --is-embedding
"""
)

wait_for_server("http://localhost:30030")
[2025-01-03 02:41:07] server_args=ServerArgs(model_path='Skywork/Skywork-Reward-Llama-3.1-8B-v0.2', tokenizer_path='Skywork/Skywork-Reward-Llama-3.1-8B-v0.2', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='Skywork/Skywork-Reward-Llama-3.1-8B-v0.2', chat_template=None, is_embedding=True, revision=None, skip_tokenizer_init=False, return_token_ids=False, host='0.0.0.0', port=30030, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=914900526, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
[2025-01-03 02:41:20 TP0] Overlap scheduler is disabled for embedding models.
[2025-01-03 02:41:20 TP0] Init torch distributed begin.
[2025-01-03 02:41:21 TP0] Load weight begin. avail mem=78.81 GB
[2025-01-03 02:41:21 TP0] Ignore import error when loading sglang.srt.models.grok. unsupported operand type(s) for |: 'type' and 'NoneType'
[2025-01-03 02:41:21 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.13it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.56it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.44it/s]

[2025-01-03 02:41:24 TP0] Load weight end. type=LlamaForSequenceClassification, dtype=torch.bfloat16, avail mem=64.70 GB
[2025-01-03 02:41:24 TP0] Memory pool end. avail mem=8.44 GB
[2025-01-03 02:41:25 TP0] max_total_num_tokens=452516, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072
[2025-01-03 02:41:25] INFO:     Started server process [4048149]
[2025-01-03 02:41:25] INFO:     Waiting for application startup.
[2025-01-03 02:41:25] INFO:     Application startup complete.
[2025-01-03 02:41:25] INFO:     Uvicorn running on http://0.0.0.0:30030 (Press CTRL+C to quit)
[2025-01-03 02:41:25] INFO:     127.0.0.1:44674 - "GET /v1/models HTTP/1.1" 200 OK
[2025-01-03 02:41:26] INFO:     127.0.0.1:44676 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-03 02:41:26 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:41:26] INFO:     127.0.0.1:44690 - "POST /encode HTTP/1.1" 200 OK
[2025-01-03 02:41:26] The server is fired up and ready to roll!


NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
[13]:
from transformers import AutoTokenizer

PROMPT = (
    "What is the range of the numeric output of a sigmoid node in a neural network?"
)

RESPONSE1 = "The output of a sigmoid node is bounded between -1 and 1."
RESPONSE2 = "The output of a sigmoid node is bounded between 0 and 1."

CONVS = [
    [{"role": "user", "content": PROMPT}, {"role": "assistant", "content": RESPONSE1}],
    [{"role": "user", "content": PROMPT}, {"role": "assistant", "content": RESPONSE2}],
]

tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")
prompts = tokenizer.apply_chat_template(CONVS, tokenize=False)

url = "http://localhost:30030/classify"
data = {"model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2", "text": prompts}

responses = requests.post(url, json=data).json()
for response in responses:
    print_highlight(f"reward: {response['embedding'][0]}")
[2025-01-03 02:41:36 TP0] Prefill batch. #new-seq: 1, #new-token: 68, #cached-token: 1, cache hit rate: 1.32%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-03 02:41:36 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 62, cache hit rate: 43.45%, token usage: 0.00, #running-req: 1, #queue-req: 0
[2025-01-03 02:41:36] INFO:     127.0.0.1:52720 - "POST /classify HTTP/1.1" 200 OK
reward: -24.375
reward: 1.09375
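The scores can then be used directly for best-of-n selection; a minimal sketch reusing the responses and CONVS variables from the cell above:

scores = [r["embedding"][0] for r in responses]
best_idx = max(range(len(scores)), key=lambda i: scores[i])
print_highlight(f"preferred response: {CONVS[best_idx][-1]['content']}")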
[14]:
terminate_process(reward_process)