OpenAI-Compatible Server¶

vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more. This lets you serve models and interact with them using an HTTP client.

In your terminal, you can install vLLM, then start the server with the vllm serve command. (You can also use our Docker image.)

vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
  --dtype auto \
  --api-key token-abc123

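To call the server, write a script in your preferred text editor that uses an HTTP client, include any messages you want to send to the model, and run it. Below is an example script using the official OpenAI Python client.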

Code

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message)

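Tip

vLLM supports some parameters that OpenAI does not, such as top_k. You can pass these parameters to vLLM through the OpenAI client's extra_body argument, for example extra_body={"top_k": 50} to set top_k.

Important

By default, the server applies generation_config.json from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.

To disable this behavior, pass --generation-config vllm when launching the server.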

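Supported APIs¶

We currently support the following OpenAI APIs:

Completions API (/v1/completions)
- Only applicable to text generation models.
- Note: the suffix parameter is not supported.
Chat Completions API (/v1/chat/completions)
- Only applicable to text generation models with a chat template.
- Note: the parallel_tool_calls and user parameters are ignored.
Embeddings API (/v1/embeddings)
- Only applicable to embedding models.
Transcriptions API (/v1/audio/transcriptions)
- Only applicable to Automatic Speech Recognition (ASR) models.
Translation API (/v1/audio/translations)
- Only applicable to Automatic Speech Recognition (ASR) models.

In addition, we have the following custom APIs:

Tokenizer API (/tokenize, /detokenize)
- Applicable to any model with a tokenizer.
Pooling API (/pooling)
- Applicable to all pooling models.
Classification API (/classify)
- Only applicable to classification models.
Score API (/score)
- Applicable to embedding models and cross-encoder models.
Re-rank API (/rerank, /v1/rerank, /v2/rerank)
- Implements Jina AI's v1 re-rank API.
- Also compatible with Cohere's v1 and v2 re-rank APIs.
- Jina's and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
- Only applicable to cross-encoder models.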

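Chat Template¶

For a language model to support the chat protocol, vLLM requires the model to include a chat template in its tokenizer configuration. The chat template is a Jinja2 template that specifies how roles, messages, and other chat-specific tokens are encoded in the input.

An example chat template can be found in the tokenizer configuration of NousResearch/Meta-Llama-3-8B-Instruct.

Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models, you can specify a chat template manually with the --chat-template parameter, giving either the path to a template file or the template itself as a string. Without a chat template, the server cannot process chat requests, and every chat request fails with an error.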

vllm serve <model> --chat-template ./path-to-chat-template.jinja

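The vLLM community provides a set of chat templates for popular models. You can find them in the examples directory.

With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format that specifies both a type and a text field. An example is provided below: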

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
    ]
)

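Most chat templates for LLMs expect the content field to be a string, but some newer models, such as meta-llama/Llama-Guard-3-1B, expect the content to be formatted according to the OpenAI schema in the request. vLLM provides best-effort support to detect this automatically: it logs a string like "Detected the chat template content format to be..." and internally converts incoming requests to match the detected format, which can be one of:

"string": A string.
- Example: "Hello world"
"openai": A list of dictionaries, similar to the OpenAI schema.
- Example: [{"type": "text", "text": "Hello world!"}]

If the result is not what you expect, you can set the --chat-template-content-format CLI argument to override which format is used.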
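The two content formats can be pictured with a short sketch. The helper names to_openai_content and to_string_content are hypothetical and are not part of vLLM; the sketch only illustrates the shape of each format, and it assumes that text parts are joined with newlines when a list is flattened to a string.

```python
from typing import Union

Content = Union[str, list[dict]]


def to_openai_content(content: Content) -> list[dict]:
    """Return content as a list of typed parts (the "openai" format)."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content


def to_string_content(content: Content) -> str:
    """Flatten content to a plain string (the "string" format).

    Assumption: only text parts are kept, joined by newlines.
    """
    if isinstance(content, str):
        return content
    return "\n".join(p["text"] for p in content if p.get("type") == "text")


print(to_openai_content("Hello world!"))
print(to_string_content([{"type": "text", "text": "Hello world!"}]))
```

Either shape can be sent in the content field of a message; the server decides which one the chat template receives.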

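Extra Parameters¶

vLLM supports a set of parameters that are not part of the OpenAI API. To use them, pass them as extra parameters in the OpenAI client, or merge them directly into the JSON payload if you are calling the HTTP API directly.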

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_body={
        "guided_choice": ["positive", "negative"]
    }
)
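For the direct HTTP case, the request body might be assembled as in the sketch below. The helper build_chat_payload is a hypothetical name used for illustration; the point is only that extra parameters sit at the top level of the JSON object rather than under an extra_body key.

```python
import json


def build_chat_payload(model: str, prompt: str, **extra_params) -> dict:
    """Build a Chat Completions request body with extra parameters merged in."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    # With plain HTTP there is no extra_body wrapper: extra parameters go
    # at the top level of the JSON object, next to the standard fields.
    payload.update(extra_params)
    return payload


payload = build_chat_payload(
    "NousResearch/Meta-Llama-3-8B-Instruct",
    "Classify this sentiment: vLLM is wonderful!",
    guided_choice=["positive", "negative"],
)
print(json.dumps(payload, indent=2))
```

The resulting JSON would then be POSTed to /v1/chat/completions with any HTTP client, together with the API key configured at server start.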

Extra HTTP Headers¶

Only the X-Request-Id HTTP request header is supported for now. It can be enabled with --enable-request-id-headers.

Code

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    }
)
print(completion._request_id)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    }
)
print(completion._request_id)

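API Reference¶

Completions API¶

Our Completions API is compatible with OpenAI's Completions API; you can use the official OpenAI Python client to interact with it.

Code example: examples/online_serving/openai_completion_client.py

Extra parameters¶

The following sampling parameters are supported.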

Code

    use_beam_search: bool = False
    top_k: Optional[int] = None
    min_p: Optional[float] = None
    repetition_penalty: Optional[float] = None
    length_penalty: float = 1.0
    stop_token_ids: Optional[list[int]] = []
    include_stop_str_in_output: bool = False
    ignore_eos: bool = False
    min_tokens: int = 0
    skip_special_tokens: bool = True
    spaces_between_special_tokens: bool = True
    truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
    allowed_token_ids: Optional[list[int]] = None
    prompt_logprobs: Optional[int] = None

The following extra parameters are supported:

Code

    add_special_tokens: bool = Field(
        default=True,
        description=(
            "If true (the default), special tokens (e.g. BOS) will be added to "
            "the prompt."),
    )
    response_format: Optional[AnyResponseFormat] = Field(
        default=None,
        description=(
            "Similar to chat completion, this parameter specifies the format "
            "of output. Only {'type': 'json_object'}, {'type': 'json_schema'}"
            ", {'type': 'structural_tag'}, or {'type': 'text' } is supported."
        ),
    )
    guided_json: Optional[Union[str, dict, BaseModel]] = Field(
        default=None,
        description="If specified, the output will follow the JSON schema.",
    )
    guided_regex: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the regex pattern."),
    )
    guided_choice: Optional[list[str]] = Field(
        default=None,
        description=(
            "If specified, the output will be exactly one of the choices."),
    )
    guided_grammar: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the context free grammar."),
    )
    guided_decoding_backend: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default guided decoding backend "
            "of the server for this specific request. If set, must be one of "
            "'outlines' / 'lm-format-enforcer'"),
    )
    guided_whitespace_pattern: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default whitespace pattern "
            "for guided json decoding."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )
    request_id: str = Field(
        default_factory=lambda: f"{random_uuid()}",
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."),
    )
    logits_processors: Optional[LogitsProcessors] = Field(
        default=None,
        description=(
            "A list of either qualified names of logits processors, or "
            "constructor objects, to apply when sampling. A constructor is "
            "a JSON object with a required 'qualname' field specifying the "
            "qualified name of the processor class/factory, and optional "
            "'args' and 'kwargs' fields containing positional and keyword "
            "arguments. For example: {'qualname': "
            "'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
            "{'param': 'value'}}."))

    return_tokens_as_token_ids: Optional[bool] = Field(
        default=None,
        description=(
            "If specified with 'logprobs', tokens are represented "
            " as strings of the form 'token_id:{token_id}' so that tokens "
            "that are not JSON-encodable can be identified."))

    cache_salt: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the prefix cache will be salted with the provided "
            "string to prevent an attacker to guess prompts in multi-user "
            "environments. The salt should be random, protected from "
            "access by 3rd parties, and long enough to be "
            "unpredictable (e.g., 43 characters base64-encoded, corresponding "
            "to 256 bit). Not supported by vLLM engine V0."))

    kv_transfer_params: Optional[dict[str, Any]] = Field(
        default=None,
        description="KVTransfer parameters used for disaggregated serving.")

    vllm_xargs: Optional[dict[str, Union[str, int, float]]] = Field(
        default=None,
        description=("Additional request parameters with string or "
                     "numeric values, used by custom extensions."),
    )
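The cache_salt description above asks for a random, unpredictable value of about 43 base64 characters (256 bits). One way to produce such a value is sketched here with Python's standard secrets module; the helper name make_cache_salt is hypothetical.

```python
import secrets


def make_cache_salt(num_bytes: int = 32) -> str:
    """Return a URL-safe base64 salt built from `num_bytes` random bytes."""
    return secrets.token_urlsafe(num_bytes)


salt = make_cache_salt()
print(len(salt))  # 43 characters for 32 random bytes (256 bits)
```

With the OpenAI client, the salt could then be passed as extra_body={"cache_salt": salt}, in the same way as the other extra parameters.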

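Chat API¶

Our Chat API is compatible with OpenAI's Chat Completions API; you can use the official OpenAI Python client to interact with it.

We support both vision- and audio-related parameters; see our Multimodal Inputs guide for more information.

Note: the image_url.detail parameter is not supported.

Code example: examples/online_serving/openai_chat_completion_client.py

Extra parameters¶

The following sampling parameters are supported.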

Code

    best_of: Optional[int] = None
    use_beam_search: bool = False
    top_k: Optional[int] = None
    min_p: Optional[float] = None
    repetition_penalty: Optional[float] = None
    length_penalty: float = 1.0
    stop_token_ids: Optional[list[int]] = []
    include_stop_str_in_output: bool = False
    ignore_eos: bool = False
    min_tokens: int = 0
    skip_special_tokens: bool = True
    spaces_between_special_tokens: bool = True
    truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
    prompt_logprobs: Optional[int] = None
    allowed_token_ids: Optional[list[int]] = None
    bad_words: list[str] = Field(default_factory=list)

The following extra parameters are supported:

Code

    echo: bool = Field(
        default=False,
        description=(
            "If true, the new message will be prepended with the last message "
            "if they belong to the same role."),
    )
    add_generation_prompt: bool = Field(
        default=True,
        description=
        ("If true, the generation prompt will be added to the chat template. "
         "This is a parameter used by chat template in tokenizer config of the "
         "model."),
    )
    continue_final_message: bool = Field(
        default=False,
        description=
        ("If this is set, the chat will be formatted so that the final "
         "message in the chat is open-ended, without any EOS tokens. The "
         "model will continue this message rather than starting a new one. "
         "This allows you to \"prefill\" part of the model's response for it. "
         "Cannot be used at the same time as `add_generation_prompt`."),
    )
    add_special_tokens: bool = Field(
        default=False,
        description=(
            "If true, special tokens (e.g. BOS) will be added to the prompt "
            "on top of what is added by the chat template. "
            "For most models, the chat template takes care of adding the "
            "special tokens so this should be set to false (as is the "
            "default)."),
    )
    documents: Optional[list[dict[str, str]]] = Field(
        default=None,
        description=
        ("A list of dicts representing documents that will be accessible to "
         "the model if it is performing RAG (retrieval-augmented generation)."
         " If the template does not support RAG, this argument will have no "
         "effect. We recommend that each document should be a dict containing "
         "\"title\" and \"text\" keys."),
    )
    chat_template: Optional[str] = Field(
        default=None,
        description=(
            "A Jinja template to use for this conversion. "
            "As of transformers v4.44, default chat template is no longer "
            "allowed, so you must provide a chat template if the tokenizer "
            "does not define one."),
    )
    chat_template_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=(
            "Additional keyword args to pass to the template renderer. "
            "Will be accessible by the chat template."),
    )
    mm_processor_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )
    guided_json: Optional[Union[str, dict, BaseModel]] = Field(
        default=None,
        description=("If specified, the output will follow the JSON schema."),
    )
    guided_regex: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the regex pattern."),
    )
    guided_choice: Optional[list[str]] = Field(
        default=None,
        description=(
            "If specified, the output will be exactly one of the choices."),
    )
    guided_grammar: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the context free grammar."),
    )
    structural_tag: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the structural tag schema."),
    )
    guided_decoding_backend: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default guided decoding backend "
            "of the server for this specific request. If set, must be either "
            "'outlines' / 'lm-format-enforcer'"),
    )
    guided_whitespace_pattern: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default whitespace pattern "
            "for guided json decoding."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )
    request_id: str = Field(
        default_factory=lambda: f"{random_uuid()}",
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."),
    )
    logits_processors: Optional[LogitsProcessors] = Field(
        default=None,
        description=(
            "A list of either qualified names of logits processors, or "
            "constructor objects, to apply when sampling. A constructor is "
            "a JSON object with a required 'qualname' field specifying the "
            "qualified name of the processor class/factory, and optional "
            "'args' and 'kwargs' fields containing positional and keyword "
            "arguments. For example: {'qualname': "
            "'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
            "{'param': 'value'}}."))
    return_tokens_as_token_ids: Optional[bool] = Field(
        default=None,
        description=(
            "If specified with 'logprobs', tokens are represented "
            " as strings of the form 'token_id:{token_id}' so that tokens "
            "that are not JSON-encodable can be identified."))
    cache_salt: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the prefix cache will be salted with the provided "
            "string to prevent an attacker to guess prompts in multi-user "
            "environments. The salt should be random, protected from "
            "access by 3rd parties, and long enough to be "
            "unpredictable (e.g., 43 characters base64-encoded, corresponding "
            "to 256 bit). Not supported by vLLM engine V0."))
    kv_transfer_params: Optional[dict[str, Any]] = Field(
        default=None,
        description="KVTransfer parameters used for disaggregated serving.")

    vllm_xargs: Optional[dict[str, Union[str, int, float]]] = Field(
        default=None,
        description=("Additional request parameters with string or "
                     "numeric values, used by custom extensions."),
    )

Embeddings API¶

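Our Embeddings API is compatible with OpenAI's Embeddings API; you can use the official OpenAI Python client to interact with it.

If the model has a chat template, you can replace inputs with a list of messages (same schema as the Chat API), which will be treated as a single prompt to the model.

Code example: examples/online_serving/openai_embedding_client.py

Multi-modal inputs¶

You can pass multi-modal inputs to embedding models by defining a custom chat template for the server and passing a list of messages in the request. Refer to the examples below for illustration.

VLM2Vec

To serve the model: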

vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
  --trust-remote-code \
  --max-model-len 4096 \
  --chat-template examples/template_vlm2vec.jinja

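Important

Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to pass --runner pooling explicitly to run this model in embedding mode instead of text generation mode.

The custom chat template is completely different from the original one for this model, and can be found here: examples/template_vlm2vec.jinja

Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level requests library: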

Code

import requests

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Represent the given image."},
            ],
        }],
        "encoding_format": "float",
    },
)
response.raise_for_status()
response_json = response.json()
print("Embedding output:", response_json["data"][0]["embedding"])

DSE-Qwen2-MRL

To serve the model:

vllm serve MrLight/dse-qwen2-2b-mrl-v1 --runner pooling \
  --trust-remote-code \
  --max-model-len 8192 \
  --chat-template examples/template_dse_qwen2_vl.jinja

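Important

As with VLM2Vec, we have to pass --runner pooling explicitly.

Additionally, MrLight/dse-qwen2-2b-mrl-v1 requires an EOS token for embeddings, which is handled by a custom chat template: examples/template_dse_qwen2_vl.jinja

Important

MrLight/dse-qwen2-2b-mrl-v1 requires a placeholder image of the minimum image size for text query embeddings. See the full code example below for details.

Full example: examples/online_serving/openai_chat_embedding_client_for_multimodal.py

Extra parameters¶

The following pooling parameters are supported.

The following extra parameters are supported by default: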

Code

    add_special_tokens: bool = Field(
        default=True,
        description=(
            "If true (the default), special tokens (e.g. BOS) will be added to "
            "the prompt."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )
    request_id: str = Field(
        default_factory=lambda: f"{random_uuid()}",
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."),
    )

For chat-like input (i.e. if messages is passed), these extra parameters are supported instead:

Code

    add_special_tokens: bool = Field(
        default=False,
        description=(
            "If true, special tokens (e.g. BOS) will be added to the prompt "
            "on top of what is added by the chat template. "
            "For most models, the chat template takes care of adding the "
            "special tokens so this should be set to false (as is the "
            "default)."),
    )
    chat_template: Optional[str] = Field(
        default=None,
        description=(
            "A Jinja template to use for this conversion. "
            "As of transformers v4.44, default chat template is no longer "
            "allowed, so you must provide a chat template if the tokenizer "
            "does not define one."),
    )
    chat_template_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=(
            "Additional keyword args to pass to the template renderer. "
            "Will be accessible by the chat template."),
    )
    mm_processor_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )
    request_id: str = Field(
        default_factory=lambda: f"{random_uuid()}",
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."),
    )

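Transcriptions API¶

Our Transcriptions API is compatible with OpenAI's Transcriptions API; you can use the official OpenAI Python client to interact with it.

Note

To use the Transcriptions API, install the extra audio dependencies with pip install vllm[audio].

Code example: examples/online_serving/openai_transcription_client.py

API Enforced Limits¶

Set the maximum audio file size (in MB) that vLLM will accept via the VLLM_MAX_AUDIO_CLIP_FILESIZE_MB environment variable. The default is 25 MB.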
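A client can avoid a rejected upload by checking the file size before sending it. The sketch below is a client-side illustration only: the limit itself is enforced by the server, the helper names are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
import os


def max_audio_bytes(default_mb: int = 25) -> int:
    """Read the size limit in MB from the environment and convert to bytes."""
    limit_mb = int(os.environ.get("VLLM_MAX_AUDIO_CLIP_FILESIZE_MB", default_mb))
    # Assumption: 1 MB is treated as 1024 * 1024 bytes.
    return limit_mb * 1024 * 1024


def fits_limit(size_bytes: int, limit_bytes: int) -> bool:
    """Return True if a file of `size_bytes` is within the limit."""
    return size_bytes <= limit_bytes


limit = max_audio_bytes()
print(fits_limit(10 * 1024 * 1024, limit))
```

In practice size_bytes would come from os.path.getsize() on the audio file about to be uploaded.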

Extra parameters¶

The following sampling parameters are supported.

Code

    temperature: float = Field(default=0.0)
    """The sampling temperature, between 0 and 1.

    Higher values like 0.8 will make the output more random, while lower values
    like 0.2 will make it more focused / deterministic. If set to 0, the model
    will use [log probability](https://en.wikipedia.org/wiki/Log_probability)
    to automatically increase the temperature until certain thresholds are hit.
    """

    top_p: Optional[float] = None
    """Enables nucleus (top-p) sampling, where tokens are selected from the
    smallest possible set whose cumulative probability exceeds `p`.
    """

    top_k: Optional[int] = None
    """Limits sampling to the `k` most probable tokens at each step."""

    min_p: Optional[float] = None
    """Filters out tokens with a probability lower than `min_p`, ensuring a
    minimum likelihood threshold during sampling.
    """

    seed: Optional[int] = Field(None, ge=_LONG_INFO.min, le=_LONG_INFO.max)
    """The seed to use for sampling."""

    frequency_penalty: Optional[float] = 0.0
    """The frequency penalty to use for sampling."""

    repetition_penalty: Optional[float] = None
    """The repetition penalty to use for sampling."""

    presence_penalty: Optional[float] = 0.0
    """The presence penalty to use for sampling."""

The following extra parameters are supported:

Code

    # Flattened stream option to simplify form data.
    stream_include_usage: Optional[bool] = False
    stream_continuous_usage_stats: Optional[bool] = False

    vllm_xargs: Optional[dict[str, Union[str, int, float]]] = Field(
        default=None,
        description=("Additional request parameters with string or "
                     "numeric values, used by custom extensions."),
    )

Translations API¶

Our Translations API is compatible with OpenAI's Translations API; you can use the official OpenAI Python client to interact with it. Whisper models can translate audio from any of the 55 supported non-English languages into English. Note that the popular openai/whisper-large-v3-turbo model does not support translation.

Note

To use the Translations API, install the extra audio dependencies with pip install vllm[audio].

Code example: examples/online_serving/openai_translation_client.py

Extra parameters¶

The following sampling parameters are supported.

    temperature: float = Field(default=0.0)
    """The sampling temperature, between 0 and 1.

    Higher values like 0.8 will make the output more random, while lower values
    like 0.2 will make it more focused / deterministic. If set to 0, the model
    will use [log probability](https://en.wikipedia.org/wiki/Log_probability)
    to automatically increase the temperature until certain thresholds are hit.
    """

The following extra parameters are supported:

    language: Optional[str] = None
    """The language of the input audio we translate from.

    Supplying the input language in
    [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) format
    will improve accuracy.
    """

    stream: Optional[bool] = False
    """Custom field not present in the original OpenAI definition. When set,
    it will enable output to be streamed in a similar fashion as the Chat
    Completion endpoint.
    """
    # Flattened stream option to simplify form data.
    stream_include_usage: Optional[bool] = False
    stream_continuous_usage_stats: Optional[bool] = False

Tokenizer API¶

Our Tokenizer API is a simple wrapper over HuggingFace-style tokenizers. It consists of two endpoints:

/tokenize corresponds to calling tokenizer.encode().
/detokenize corresponds to calling tokenizer.decode().
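As a rough sketch, requests to the two endpoints could be prepared as follows. The JSON field names "prompt" (for /tokenize) and "tokens" (for /detokenize) are assumptions not stated on this page, the token IDs are placeholders, and build_request is a hypothetical helper. The requests are built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes the server started earlier


def build_request(path: str, body: dict) -> urllib.request.Request:
    """Create (but do not send) a JSON POST request for the given endpoint."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


model = "NousResearch/Meta-Llama-3-8B-Instruct"
# Assumed field names: "prompt" for /tokenize, "tokens" for /detokenize.
tokenize_req = build_request("/tokenize", {"model": model, "prompt": "Hello!"})
detokenize_req = build_request("/detokenize", {"model": model, "tokens": [1, 2, 3]})

print(tokenize_req.full_url)
print(detokenize_req.full_url)
# To send: urllib.request.urlopen(tokenize_req) with the server running.
```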

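Pooling API¶

Our Pooling API encodes input prompts using a pooling model and returns the corresponding hidden states.

The input format is the same as for the Embeddings API, but the output data can contain an arbitrary nested list, not just a 1-D list of floats.

Code example: examples/online_serving/openai_pooling_client.py

Classification API¶

Our Classification API directly supports Hugging Face sequence-classification models such as ai21labs/Jamba-tiny-reward-dev and jason9693/Qwen2.5-1.5B-apeach.

We automatically wrap any other transformer via as_seq_cls_model(), which pools on the last token, attaches a RowParallelLinear head, and applies a softmax to produce per-class probabilities.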
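The final softmax step can be illustrated in isolation. This is a plain-Python sketch of the standard softmax function, not vLLM's implementation; the input logits are made-up values standing in for the output of the classification head.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtract the maximum first so exp() cannot overflow.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([0.0, 0.0])
print(probs)  # two equal logits give [0.5, 0.5]
```

The probs arrays in the classification responses are values of this kind: one probability per class, summing to 1.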

Code example: examples/online_serving/openai_classification_client.py

Example Requests¶

You can classify multiple texts by passing an array of strings:

curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jason9693/Qwen2.5-1.5B-apeach",
    "input": [
      "Loved the new café—coffee was great.",
      "This update broke everything. Frustrating."
    ]
  }'

Response

{
  "id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
  "object": "list",
  "created": 1745383065,
  "model": "jason9693/Qwen2.5-1.5B-apeach",
  "data": [
    {
      "index": 0,
      "label": "Default",
      "probs": [
        0.565970778465271,
        0.4340292513370514
      ],
      "num_classes": 2
    },
    {
      "index": 1,
      "label": "Spoiled",
      "probs": [
        0.26448777318000793,
        0.7355121970176697
      ],
      "num_classes": 2
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "total_tokens": 20,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}

You can also pass a string directly to the input field:

curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jason9693/Qwen2.5-1.5B-apeach",
    "input": "Loved the new café—coffee was great."
  }'

Response

{
  "id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
  "object": "list",
  "created": 1745383213,
  "model": "jason9693/Qwen2.5-1.5B-apeach",
  "data": [
    {
      "index": 0,
      "label": "Default",
      "probs": [
        0.565970778465271,
        0.4340292513370514
      ],
      "num_classes": 2
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 10,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}
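Once a response like the ones above has been parsed from JSON, the predictions can be summarized with a few lines of Python. The helper name summarize is hypothetical, and the sample dictionary is a trimmed copy of the batch response shown earlier in this section.

```python
def summarize(response: dict) -> list[tuple[int, str, float]]:
    """Return (index, label, highest probability) for each classified input."""
    return [
        (item["index"], item["label"], max(item["probs"]))
        for item in response["data"]
    ]


# Trimmed copy of the batch response shown earlier.
sample = {
    "data": [
        {"index": 0, "label": "Default",
         "probs": [0.565970778465271, 0.4340292513370514], "num_classes": 2},
        {"index": 1, "label": "Spoiled",
         "probs": [0.26448777318000793, 0.7355121970176697], "num_classes": 2},
    ]
}

for index, label, prob in summarize(sample):
    print(f"input {index}: {label} ({prob:.3f})")
```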

Extra parameters¶

The following pooling parameters are supported.

The following extra parameters are supported:

    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )

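Score API¶

Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence or multimodal pairs. When using an embedding model, the score corresponds to the cosine similarity between each embedding pair. Usually, the score for a sentence pair refers to the similarity between the two sentences, on a scale of 0 to 1.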
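For reference, cosine similarity between two embedding vectors can be computed as in this plain-Python sketch. It shows the standard formula only, with toy two-dimensional vectors rather than real model embeddings, and is not taken from vLLM's code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Vectors pointing in the same direction score 1, and orthogonal vectors score 0.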

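You can find the documentation for cross-encoder models at sbert.net.

Code example: examples/online_serving/openai_cross_encoder_score.py

Single inference¶

You can pass a string to both text_1 and text_2, forming a single sentence pair.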

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": "What is the capital of France?",
  "text_2": "The capital of France is Paris."
}'

Response

{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}

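Batch inference¶

You can pass a string to text_1 and a list to text_2, forming multiple sentence pairs in which each pair is built from text_1 and one string in text_2. The total number of pairs is len(text_2).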

Request

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "text_1": "What is the capital of France?",
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'

Response

{
  "id": "score-request-id",
  "object": "list",
  "created": 693570,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.001094818115234375
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}

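You can pass a list to both text_1 and text_2, forming multiple sentence pairs in which each pair is built from a string in text_1 and the corresponding string in text_2 (similar to zip()). The total number of pairs is len(text_2).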

Request

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": [
    "What is the capital of Brazil?",
    "What is the capital of France?"
  ],
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'

Response

{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}
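The pairing rules for single and batch inference can be summarized in a small sketch. The function make_pairs is a hypothetical illustration of how text_1 and text_2 expand into pairs, not vLLM's implementation; rejecting lists of different lengths is an assumption of this sketch.

```python
from typing import Union


def make_pairs(text_1: Union[str, list[str]],
               text_2: Union[str, list[str]]) -> list[tuple[str, str]]:
    """Expand text_1/text_2 into the sentence pairs that would be scored."""
    if isinstance(text_2, str):
        text_2 = [text_2]
    if isinstance(text_1, str):
        # One query against every entry of text_2.
        return [(text_1, t2) for t2 in text_2]
    if len(text_1) != len(text_2):
        # Assumption: list inputs of different lengths are rejected.
        raise ValueError("list inputs must have the same length")
    # Element-wise pairing, like zip().
    return list(zip(text_1, text_2))


pairs = make_pairs(
    "What is the capital of France?",
    ["The capital of Brazil is Brasilia.", "The capital of France is Paris."],
)
print(len(pairs))  # one pair per entry in text_2
```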

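Multi-modal inputs¶

You can pass multi-modal inputs to scoring models by passing content that includes a list of multi-modal inputs (images, etc.) in the request. Refer to the examples below for illustration.

JinaVL-Reranker

To serve the model:

vllm serve jinaai/jina-reranker-m0

Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level requests library: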

Code

import requests

response = requests.post(
    "http://localhost:8000/v1/score",
    json={
        "model": "jinaai/jina-reranker-m0",
        "text_1": "slm markdown",
        "text_2": {
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
                    },
                },
            ]
        },
    },
)
response.raise_for_status()
response_json = response.json()
print("Scoring output:", response_json["data"][0]["score"])
print("Scoring output:", response_json["data"][1]["score"])

Full example: examples/online_serving/openai_cross_encoder_score_for_multimodal.py

Extra parameters¶

The following pooling parameters are supported.

The following extra parameters are supported:

    mm_processor_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )

    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )

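Re-rank API¶

Our Re-rank API can apply an embedding model or a cross-encoder model to predict relevance scores between a single query and each document in a list. Usually, the score for a sentence pair refers to the similarity between two sentences or multi-modal inputs (images, etc.), on a scale of 0 to 1.

You can find the documentation for cross-encoder models at sbert.net.

The rerank endpoints support popular re-rank models such as BAAI/bge-reranker-base, as well as other models that support the score task. Additionally, the /rerank, /v1/rerank, and /v2/rerank endpoints are compatible with both Jina AI's re-rank API interface and Cohere's re-rank API interface, to ensure compatibility with popular open-source tools.

Code example: examples/online_serving/jinaai_rerank_client.py

Example Request¶

Note that the top_n request parameter is optional and defaults to the length of the documents field. Result documents are sorted by relevance, and the index property can be used to determine the original order.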

Request

curl -X 'POST' \
  'http://127.0.0.1:8000/v1/rerank' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-base",
  "query": "What is the capital of France?",
  "documents": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
    "Horses and cows are both animals"
  ]
}'

Response

{
  "id": "rerank-fae51b2b664d4ed38f5969b612edff77",
  "model": "BAAI/bge-reranker-base",
  "usage": {
    "total_tokens": 56
  },
  "results": [
    {
      "index": 1,
      "document": {
        "text": "The capital of France is Paris."
      },
      "relevance_score": 0.99853515625
    },
    {
      "index": 0,
      "document": {
        "text": "The capital of Brazil is Brasilia."
      },
      "relevance_score": 0.0005860328674316406
    }
  ]
}
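Because results arrive sorted by relevance, the index property is what ties each score back to its input document. A minimal sketch of that mapping follows; the helper name scores_in_original_order is hypothetical, and the sample is a trimmed copy of the response above.

```python
def scores_in_original_order(response: dict, num_documents: int) -> list:
    """Map relevance scores back to the positions of the input documents."""
    scores = [None] * num_documents  # None marks documents not returned
    for result in response["results"]:
        scores[result["index"]] = result["relevance_score"]
    return scores


# Trimmed copy of the response shown above.
sample = {
    "results": [
        {"index": 1, "relevance_score": 0.99853515625},
        {"index": 0, "relevance_score": 0.0005860328674316406},
    ]
}

print(scores_in_original_order(sample, 3))
```

Any document that does not appear in the results keeps the None placeholder.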

Extra parameters¶

The following pooling parameters are supported.

The following extra parameters are supported:

    mm_processor_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )

    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )

Ray Serve LLM¶

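Ray Serve LLM enables scalable, production-grade serving of the vLLM engine. It integrates tightly with vLLM and extends it with features such as auto-scaling, load balancing, and back-pressure.

Key capabilities:

- Exposes an OpenAI-compatible HTTP API as well as a Pythonic API.
- Scales from a single GPU to a multi-node cluster without code changes.
- Provides observability and autoscaling policies through Ray dashboards and metrics.

The following example shows how to deploy a large model like DeepSeek R1 with Ray Serve LLM: examples/online_serving/ray_serve_deepseek.py.

Learn more about Ray Serve LLM in the official Ray Serve LLM documentation.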
