Huggingface

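LiteLLM supports the following types of Hugging Face models:

Serverless Inference API (free) - loaded and ready to use: https://huggingface.co/models?inference=warm&pipeline_tag=text-generation
Dedicated Inference Endpoints (paid) - manual deployment: https://ui.endpoints.huggingface.co/
All LLMs served through Hugging Face's inference services use Text-generation-inference.

Usage

You need to tell LiteLLM when you're calling Hugging Face. Do this by adding the "huggingface/" prefix to model, for example completion(model="huggingface/<model_name>",...).

By default, LiteLLM assumes a Hugging Face call follows the Messages API, which is fully compatible with the OpenAI Chat Completion API.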

import os
from litellm import completion

# [Optional] set the environment variable
os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key"

messages = [{ "content": "There's a llama in my garden 😱 What should I do?","role": "user"}]

# e.g. call 'https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct' from the Serverless Inference API
response = completion(
    model="huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=messages,
    stream=True
)

# with stream=True the response is an iterator of chunks, so print each one
for chunk in response:
    print(chunk)

Add the model to your config.yaml

model_list:
  - model_name: llama-3.1-8B-instruct
    litellm_params:
      model: huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct
      api_key: os.environ/HUGGINGFACE_API_KEY

Start the proxy

$ litellm --config /path/to/config.yaml --debug

Test it!

curl --location 'http://0.0.0.0:4000/chat/completions' \
    --header 'Authorization: Bearer sk-1234' \
    --header 'Content-Type: application/json' \
    --data '{
    "model": "llama-3.1-8B-instruct",
    "messages": [
      {
          "role": "user",
          "content": "I like you!"
      }
    ]
}'
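
The request body has to be valid JSON, and a strict parser rejects a trailing comma after the messages array. As a minimal sketch, the same request can be assembled with Python's standard library, which serializes the body for you. The helper name build_chat_request is hypothetical; the URL, key, and model alias are the ones from the curl example above.

```python
import json
import urllib.request

PROXY_URL = "http://0.0.0.0:4000/chat/completions"  # proxy address from the curl example


def build_chat_request(model: str, content: str, api_key: str = "sk-1234") -> urllib.request.Request:
    """Build (but do not send) a chat completion request for the proxy."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        PROXY_URL,
        data=json.dumps(payload).encode("utf-8"),  # json.dumps always emits valid JSON
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request("llama-3.1-8B-instruct", "I like you!")
print(request.data.decode("utf-8"))

# To actually send it while the proxy is running:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp))
```

The request is only built here, not sent, so the snippet runs without a proxy.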

Add text-classification between the huggingface/ prefix and the model name

e.g. huggingface/text-classification/<model-name>

import os
from litellm import completion

# [Optional] set the environment variable
os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key"

messages = [{ "content": "I like you, I love you!","role": "user"}]

# e.g. call 'shahrukhx01/question-vs-statement-classifier' hosted on an HF Inference Endpoint
response = completion(
  model="huggingface/text-classification/shahrukhx01/question-vs-statement-classifier",
  messages=messages,
  api_base="https://my-endpoint.endpoints.huggingface.cloud",
)

print(response)
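
The model string in the example above has three parts: the huggingface/ prefix, an optional task segment, and the Hugging Face model ID. The sketch below shows one way to split such a string. It is illustrative only and not LiteLLM's actual routing code; the helper name parse_hf_model and the KNOWN_TASKS tuple are hypothetical.

```python
from typing import Optional, Tuple

PREFIX = "huggingface/"
KNOWN_TASKS = ("text-classification",)  # hypothetical list, for illustration only


def parse_hf_model(model: str) -> Tuple[Optional[str], str]:
    """Split 'huggingface/[task/]<model-id>' into (task, model_id)."""
    if not model.startswith(PREFIX):
        raise ValueError(f"expected a model starting with {PREFIX!r}, got {model!r}")
    rest = model[len(PREFIX):]
    head, _, tail = rest.partition("/")
    if head in KNOWN_TASKS and tail:
        return head, tail
    return None, rest


print(parse_hf_model("huggingface/text-classification/shahrukhx01/question-vs-statement-classifier"))
print(parse_hf_model("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct"))
```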

Add the model to your config.yaml

model_list:
  - model_name: bert-classifier
    litellm_params:
      model: huggingface/text-classification/shahrukhx01/question-vs-statement-classifier
      api_key: os.environ/HUGGINGFACE_API_KEY
      api_base: "https://my-endpoint.endpoints.huggingface.cloud"

Start the proxy

$ litellm --config /path/to/config.yaml --debug

Test it!

curl --location 'http://0.0.0.0:4000/chat/completions' \
    --header 'Authorization: Bearer sk-1234' \
    --header 'Content-Type: application/json' \
    --data '{
    "model": "bert-classifier",
    "messages": [
      {
          "role": "user",
          "content": "I like you!"
      }
    ]
}'

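Steps to use

Create your own Hugging Face dedicated endpoint here: https://ui.endpoints.huggingface.co/
Set api_base to the base URL of your deployed endpoint
Add the huggingface/ prefix to the model so LiteLLM knows it is a Hugging Face deployed inference endpoint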

import os
import litellm

os.environ["HUGGINGFACE_API_KEY"] = ""

# TGI model: call https://huggingface.co/glaiveai/glaive-coder-7b
# add the 'huggingface/' prefix to the model to set Hugging Face as the provider
# set api_base to the API endpoint of your Hugging Face deployment
response = litellm.completion(
    model="huggingface/glaiveai/glaive-coder-7b",
    messages=[{ "content": "Hello, how are you?","role": "user"}],
    api_base="https://wjiegasee9bmqke2.us-east-1.aws.endpoints.huggingface.cloud"
)
print(response)

Add the model to your config.yaml

model_list:
  - model_name: glaive-coder
    litellm_params:
      model: huggingface/glaiveai/glaive-coder-7b
      api_key: os.environ/HUGGINGFACE_API_KEY
      api_base: "https://wjiegasee9bmqke2.us-east-1.aws.endpoints.huggingface.cloud"

Start the proxy

$ litellm --config /path/to/config.yaml --debug

Test it!

curl --location 'http://0.0.0.0:4000/chat/completions' \
    --header 'Authorization: Bearer sk-1234' \
    --header 'Content-Type: application/json' \
    --data '{
    "model": "glaive-coder",
    "messages": [
      {
          "role": "user",
          "content": "I like you!"
      }
    ]
}'

Streaming

You need to tell LiteLLM when you're calling Hugging Face. Do this by adding the "huggingface/" prefix to model, for example completion(model="huggingface/<model_name>",...).

import os
from litellm import completion

# [Optional] set the environment variable
os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key"

messages = [{ "content": "There's a llama in my garden 😱 What should I do?","role": "user"}]

# e.g. call 'facebook/blenderbot-400M-distill' hosted on an HF Inference Endpoint
response = completion(
  model="huggingface/facebook/blenderbot-400M-distill",
  messages=messages,
  api_base="https://my-endpoint.huggingface.cloud",
  stream=True
)

print(response)
for chunk in response:
  print(chunk)
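
Printing each chunk shows the raw stream. To rebuild the full reply you would typically concatenate the content deltas. The sketch below assumes chunks follow the OpenAI streaming shape (choices[0].delta.content) and uses plain dicts as stand-ins so it runs offline; real LiteLLM chunks are response objects, and the helper name collect_stream_text is hypothetical.

```python
def collect_stream_text(chunks):
    """Join the content deltas of a stream into a single string."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        content = delta.get("content")
        if content:  # the final chunk usually carries no content
            parts.append(content)
    return "".join(parts)


# Stand-in chunks shaped like OpenAI-style streaming deltas (an assumption).
fake_chunks = [
    {"choices": [{"delta": {"role": "assistant", "content": "Hello"}}]},
    {"choices": [{"delta": {"content": " world"}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]

text = collect_stream_text(fake_chunks)
print(text)
```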

Embedding

LiteLLM supports Hugging Face's text-embedding-inference format.

from litellm import embedding
import os
os.environ['HUGGINGFACE_API_KEY'] = ""
response = embedding(
    model='huggingface/microsoft/codebert-base',
    input=["good morning from litellm"]
)
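
Once you have embeddings, a common next step is comparing them. The sketch below assumes the response follows the OpenAI embedding shape, with vectors under data[i]["embedding"]. It uses a hand-written stand-in response with made-up three-dimensional vectors so it runs without an API key; real embeddings are much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Stand-in for an embedding response (assumed OpenAI-style shape, made-up values).
fake_response = {
    "data": [
        {"index": 0, "embedding": [1.0, 0.0, 1.0]},
        {"index": 1, "embedding": [1.0, 1.0, 0.0]},
    ]
}

vectors = [item["embedding"] for item in fake_response["data"]]
similarity = cosine_similarity(vectors[0], vectors[1])
print(similarity)
```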

Advanced

Setting API keys + API base

If required, you can set the API key and API base in your OS environment. See the code for how they are sent.

import os
os.environ["HUGGINGFACE_API_KEY"] = ""
os.environ["HUGGINGFACE_API_BASE"] = ""

Viewing log probs

Using `decoder_input_details` - OpenAI `echo`

The echo parameter is supported by OpenAI Completions. Use litellm.text_completion() for this.

from litellm import text_completion
response = text_completion(
    model="huggingface/bigcode/starcoder",
    prompt="good morning",
    max_tokens=10, logprobs=10,
    echo=True
)

Output

{
  "id": "chatcmpl-3fc71792-c442-4ba1-a611-19dd0ac371ad",
  "object": "text_completion",
  "created": 1698801125.936519,
  "model": "bigcode/starcoder",
  "choices": [
    {
      "text": ", I'm going to make you a sand",
      "index": 0,
      "logprobs": {
        "tokens": [
          "good",
          " morning",
          ",",
          " I",
          "'m",
          " going",
          " to",
          " make",
          " you",
          " a",
          " s",
          "and"
        ],
        "token_logprobs": [
          "None",
          -14.96875,
          -2.2285156,
          -2.734375,
          -2.0957031,
          -2.0917969,
          -0.09429932,
          -3.1132812,
          -1.3203125,
          -1.2304688,
          -1.6201172,
          -0.010292053
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "completion_tokens": 9,
    "prompt_tokens": 2,
    "total_tokens": 11
  }
}
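
As a small worked example, the token log probabilities in the output above can be summed into a sequence log-likelihood and turned into a perplexity. The values are copied from that output; the first entry is the string "None" because the first prompt token has no preceding context to score it against, so the sketch skips it.

```python
import math

tokens = ["good", " morning", ",", " I", "'m", " going", " to",
          " make", " you", " a", " s", "and"]
token_logprobs = ["None", -14.96875, -2.2285156, -2.734375, -2.0957031,
                  -2.0917969, -0.09429932, -3.1132812, -1.3203125,
                  -1.2304688, -1.6201172, -0.010292053]

# Keep only tokens that actually have a numeric log probability.
scored = [(tok, lp) for tok, lp in zip(tokens, token_logprobs)
          if isinstance(lp, float)]

total_logprob = sum(lp for _, lp in scored)
perplexity = math.exp(-total_logprob / len(scored))

print(f"scored tokens: {len(scored)}")
print(f"total log probability: {total_logprob:.4f}")
print(f"perplexity: {perplexity:.2f}")
```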

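Models with prompt formatting

For models with special prompt templates (e.g. Llama2), we format the prompt to fit their template.

Models with natively supported prompt templates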

Model Name	Works for Models	Function Call	Required OS Variables
mistralai/Mistral-7B-Instruct-v0.1	mistralai/Mistral-7B-Instruct-v0.1	`completion(model='huggingface/mistralai/Mistral-7B-Instruct-v0.1', messages=messages, api_base="your_api_endpoint")`	`os.environ['HUGGINGFACE_API_KEY']`
meta-llama/Llama-2-7b-chat	All meta-llama llama2 chat models	`completion(model='huggingface/meta-llama/Llama-2-7b', messages=messages, api_base="your_api_endpoint")`	`os.environ['HUGGINGFACE_API_KEY']`
tiiuae/falcon-7b-instruct	All falcon instruct models	`completion(model='huggingface/tiiuae/falcon-7b-instruct', messages=messages, api_base="your_api_endpoint")`	`os.environ['HUGGINGFACE_API_KEY']`
mosaicml/mpt-7b-chat	All mpt chat models	`completion(model='huggingface/mosaicml/mpt-7b-chat', messages=messages, api_base="your_api_endpoint")`	`os.environ['HUGGINGFACE_API_KEY']`
codellama/CodeLlama-34b-Instruct-hf	All codellama instruct models	`completion(model='huggingface/codellama/CodeLlama-34b-Instruct-hf', messages=messages, api_base="your_api_endpoint")`	`os.environ['HUGGINGFACE_API_KEY']`
WizardLM/WizardCoder-Python-34B-V1.0	All wizardcoder models	`completion(model='huggingface/WizardLM/WizardCoder-Python-34B-V1.0', messages=messages, api_base="your_api_endpoint")`	`os.environ['HUGGINGFACE_API_KEY']`
Phind/Phind-CodeLlama-34B-v2	All phind-codellama models	`completion(model='huggingface/Phind/Phind-CodeLlama-34B-v2', messages=messages, api_base="your_api_endpoint")`	`os.environ['HUGGINGFACE_API_KEY']`

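What if we don't support a model you need? You can also specify your own custom prompt format, in case we don't cover your model yet.

Does this mean you have to specify a prompt for all models? No. By default we concatenate your message content to form the prompt.

Default prompt template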

def default_pt(messages):
    return " ".join(message["content"] for message in messages)
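
To make the default concrete, here is the same function applied to a two-message conversation. The sample messages are made up for illustration. Note that role information is dropped and only the content strings are joined with a space.

```python
def default_pt(messages):
    # Same default template as above: join message contents with a space.
    return " ".join(message["content"] for message in messages)


sample_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
]

prompt = default_pt(sample_messages)
print(prompt)
```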

Code for prompt formatting in LiteLLM

Custom prompt templates

import litellm
from litellm import completion

messages = [{"role": "user", "content": "Hello, how are you?"}]

# Create your own custom prompt template
litellm.register_prompt_template(
    model="togethercomputer/LLaMA-2-7B-32K",
    roles={
        "system": {
            "pre_message": "[INST] <<SYS>>\n",
            "post_message": "\n<</SYS>>\n [/INST]\n"
        },
        "user": {
            "pre_message": "[INST] ",
            "post_message": " [/INST]\n"
        },
        "assistant": {
            "post_message": "\n"
        }
    }
)

def test_huggingface_custom_model():
    model = "huggingface/togethercomputer/LLaMA-2-7B-32K"
    response = completion(model=model, messages=messages, api_base="https://ecd4sb5n09bo4ei2.us-east-1.aws.endpoints.huggingface.cloud")
    print(response['choices'][0]['message']['content'])
    return response

test_huggingface_custom_model()
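
To see what the roles mapping above does to a conversation, the sketch below wraps each message's content in its role's pre_message and post_message and concatenates the results. This is an illustration of the idea under that assumption, not LiteLLM's implementation; the helper name render_prompt and the sample messages are hypothetical.

```python
ROLES = {
    "system": {
        "pre_message": "[INST] <<SYS>>\n",
        "post_message": "\n<</SYS>>\n [/INST]\n",
    },
    "user": {
        "pre_message": "[INST] ",
        "post_message": " [/INST]\n",
    },
    "assistant": {
        "post_message": "\n",
    },
}


def render_prompt(roles, messages):
    """Wrap each message in its role's pre/post strings and join them."""
    parts = []
    for message in messages:
        wrap = roles.get(message["role"], {})
        parts.append(
            wrap.get("pre_message", "")
            + message["content"]
            + wrap.get("post_message", "")
        )
    return "".join(parts)


rendered = render_prompt(ROLES, [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
])
print(rendered)
```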

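Implementation code

Deploying a model on Hugging Face

You can use any chat/text model from Hugging Face with the following steps:

Copy your model ID/URL from Hugging Face Inference Endpoints
- Go to https://ui.endpoints.huggingface.co/
- Copy the URL of the specific model you'd like to use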
[Image: HF Inference Endpoints dashboard]
Set it as your model name
Set HUGGINGFACE_API_KEY as an environment variable

Need help deploying a model on Hugging Face? Check out this guide.

Output

Same as the OpenAI format, but it also includes logprobs. See the code.

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "\ud83d\ude31\n\nComment: @SarahSzabo I'm",
        "role": "assistant",
        "logprobs": -22.697942825499993
      }
    }
  ],
  "created": 1693436637.38206,
  "model": "https://ji16r2iys9a8rjk2.us-east-1.aws.endpoints.huggingface.cloud",
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 11,
    "total_tokens": 25
  }
}
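
A short sketch of reading the fields from a response shaped like the one above. It uses a plain dict with the values copied from that output so it runs offline; a real call returns a LiteLLM response object, and dict-style access on it is assumed here based on the indexing used elsewhere on this page.

```python
# Stand-in for the response above, with the same values.
response = {
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "😱\n\nComment: @SarahSzabo I'm",
                "role": "assistant",
                "logprobs": -22.697942825499993,
            },
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 11, "total_tokens": 25},
}

message = response["choices"][0]["message"]
usage = response["usage"]

print("content:", repr(message["content"]))
print("sequence logprob:", message["logprobs"])  # the extra field noted above

# Sanity check: total tokens should equal prompt plus completion tokens.
tokens_add_up = (
    usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
)
print("usage adds up:", tokens_add_up)
```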

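FAQ

Does this support stop sequences?

Yes, we support stop sequences. You can pass as many as Hugging Face (or any provider) allows.

How do you deal with repetition penalty?

We map the presence penalty parameter in OpenAI to the repetition penalty parameter on Hugging Face. See the code.

We welcome any suggestions for improving our Hugging Face integration. Create an issue or join the Discord!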
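
The two answers above can be pictured as a parameter translation step: presence_penalty is renamed to repetition_penalty, and stop sequences are passed through unchanged. The sketch below only illustrates that idea. The function name to_hf_params is hypothetical, and the real mapping in LiteLLM covers more parameters and may differ in detail.

```python
def to_hf_params(openai_params: dict) -> dict:
    """Illustrative translation of OpenAI-style params to Hugging Face names."""
    hf_params = {}
    for key, value in openai_params.items():
        if key == "presence_penalty":
            # Renamed as described in the FAQ answer above.
            hf_params["repetition_penalty"] = value
        else:
            # Everything else (e.g. stop sequences) is passed through as-is.
            hf_params[key] = value
    return hf_params


mapped = to_hf_params({"presence_penalty": 1.2, "stop": ["\n", "User:"]})
print(mapped)
```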
