Hugging Face LLMs

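There are many ways to interface with LLMs from Hugging Face, either locally or through Hugging Face's Inference Providers. Hugging Face itself provides several Python packages for access, and LlamaIndex wraps them into LLM entities: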

The transformers package: use llama_index.llms.huggingface.HuggingFaceLLM
Hugging Face Inference Providers, wrapped by huggingface_hub[inference]: use llama_index.llms.huggingface_api.HuggingFaceInferenceAPI

There are many possible permutations of these two, so this notebook details only a few. Let's use Hugging Face's Text Generation task as our example.

In the lines below, we install the packages needed for this demo:

transformers[torch] is needed for HuggingFaceLLM
huggingface_hub[inference] is needed for HuggingFaceInferenceAPI
The quotes are needed for Z shell (zsh)

%pip install llama-index-llms-huggingface # for local inference
%pip install llama-index-llms-huggingface-api # for remote inference

!pip install "transformers[torch]" "huggingface_hub[inference]"

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

!pip install llama-index

Now that we're set up, let's play around:

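Setting up a Hugging Face account

First, you need to create a Hugging Face account and get a token. You can sign up on the Hugging Face website and then create an access token in your account settings. Export the token as an environment variable so that the code in this notebook can read it: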

export HUGGING_FACE_TOKEN=hf_your_token_here

import os
from typing import List, Optional

from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI

HF_TOKEN: Optional[str] = os.getenv("HUGGING_FACE_TOKEN")
# NOTE: None default will fall back on Hugging Face's token storage
# when this token gets used within HuggingFaceInferenceAPI
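If the token might be stored under more than one variable name, a small lookup helper can make the setup more forgiving. This is only an illustrative sketch: resolve_hf_token is a hypothetical helper, not part of LlamaIndex, and checking HF_TOKEN as a second name is an assumption based on the variable that huggingface_hub reads by default.

```python
import os
from typing import Mapping, Optional, Sequence


def resolve_hf_token(
    env: Optional[Mapping[str, str]] = None,
    names: Sequence[str] = ("HUGGING_FACE_TOKEN", "HF_TOKEN"),
) -> Optional[str]:
    """Return the first non-empty token found under `names`, else None."""
    source = os.environ if env is None else env
    for name in names:
        value = source.get(name, "").strip()
        if value:
            return value
    # Returning None lets the Hugging Face client fall back on its own token storage.
    return None


token = resolve_hf_token()
print("token found" if token else "no token in environment")
```

The helper accepts an explicit mapping so it can be exercised without touching the real environment.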

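Using a model through Inference Providers

The easiest way to use an open-source model is through Hugging Face Inference Providers. Let's use the DeepSeek R1 model, which is well suited to complex tasks.

With Inference Providers, you can run the model on serverless infrastructure.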

remotely_run = HuggingFaceInferenceAPI(
    model_name="deepseek-ai/DeepSeek-R1-0528",
    token=HF_TOKEN,
    provider="auto",  # this will use the best provider available
)

We can also specify our preferred inference provider. Let's use the together provider.

remotely_run = HuggingFaceInferenceAPI(
    model_name="Qwen/Qwen3-235B-A22B",
    token=HF_TOKEN,
    provider="together",  # route requests to the Together provider
)

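Using a local open-source model

Next, we'll use an open-source model that is optimized for local inference. The model is downloaded (on first invocation) to the local Hugging Face model cache and then runs on your local machine's hardware.

We'll use the Gemma 3N E4B model.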

locally_run = HuggingFaceLLM(model_name="google/gemma-3n-E4B-it")
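Because a local model runs on your own hardware, it can help to estimate how much memory the weights alone will need before downloading anything. The following is a rough back-of-the-envelope sketch, not a LlamaIndex feature: estimate_weight_memory_gib is a hypothetical helper, the 4-billion parameter count is a round illustrative figure rather than a measured value for any specific model, and real usage will be higher once activations and the KV cache are included.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def estimate_weight_memory_gib(num_params: float, dtype: str = "bfloat16") -> float:
    """Approximate memory for the model weights only, in GiB."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unknown dtype: {dtype!r}")
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Illustrative parameter count (an assumption, not a published model size).
for dtype in ("float32", "bfloat16", "int8"):
    gib = estimate_weight_memory_gib(4e9, dtype)
    print(f"{dtype:>8}: ~{gib:.1f} GiB for weights")
```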

Using a dedicated Inference Endpoint

We can also spin up a dedicated Inference Endpoint for a model and use it to run the model.

endpoint_server = HuggingFaceInferenceAPI(
    model="https://<your-endpoint>.eu-west-1.aws.endpoints.huggingface.cloud"
)

Using a local inference engine (vLLM or TGI)

We can also use a local inference engine such as vLLM or TGI to serve the model.

# You can also connect to a model being served by a local or remote
# Text Generation Inference server
tgi_server = HuggingFaceInferenceAPI(model="http://localhost:8080")
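Before pointing the client at a self-hosted server, it can be useful to confirm that the server is actually listening. The sketch below uses only the standard library; is_server_up and health_url are hypothetical helpers, and the /health path is an assumption about how the server is configured, so adjust it to whatever health route your deployment exposes.

```python
import http.client
import urllib.request


def health_url(base_url: str, path: str = "/health") -> str:
    """Join a server base URL and a health-check path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def is_server_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Covers refused connections, timeouts, HTTP errors, and malformed URLs.
        return False


if is_server_up("http://localhost:8080"):
    print("server is reachable")
else:
    print("server is not reachable; start it before creating the client")
```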

Under the hood, completion with HuggingFaceInferenceAPI uses Hugging Face's Text Generation task.

completion_response = remotely_run.complete("To infinity, and")
print(completion_response)

 beyond!
The Infinity Wall Clock is a unique and stylish way to keep track of time. The clock is made of a durable, high-quality plastic and features a bright LED display. The Infinity Wall Clock is powered by batteries and can be mounted on any wall. It is a great addition to any home or office.

Setting a tokenizer

If you change the LLM, you should also change the global tokenizer to match!

from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer

set_global_tokenizer(
    AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha").encode
)
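To see why the tokenizer should match the model, note that set_global_tokenizer is given a callable that maps a string to a list of tokens, and different callables can produce very different counts for the same text. The two tokenizers below are deliberately simplistic toys, not real model tokenizers; they only illustrate how a mismatch could skew anything that budgets by token count.

```python
from typing import Callable, List


def whitespace_tokenizer(text: str) -> List[str]:
    """Toy tokenizer: one token per whitespace-separated word."""
    return text.split()


def char_tokenizer(text: str) -> List[str]:
    """Toy tokenizer: one token per character."""
    return list(text)


def count_tokens(tokenizer: Callable[[str], list], text: str) -> int:
    """Count tokens the same way for any str -> list callable."""
    return len(tokenizer(text))


text = "To infinity, and beyond!"
for name, tok in [("whitespace", whitespace_tokenizer), ("char", char_tokenizer)]:
    print(f"{name}: {count_tokens(tok, text)} tokens")
```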

If you're curious, other Hugging Face Inference API tasks that are wrapped include:

llama_index.llms.huggingface_api.HuggingFaceInferenceAPI.chat: Conversational task
llama_index.embeddings.HuggingFaceInferenceAPIEmbedding: Feature Extraction task

And yes, Hugging Face embedding models are supported with:

transformers[torch]: wrapped by HuggingFaceEmbedding
huggingface_hub[inference]: wrapped by HuggingFaceInferenceAPIEmbedding

Both of the above subclass llama_index.embeddings.base.BaseEmbedding.