Hugging Face LLMs
There are many ways to interact with LLMs from Hugging Face, whether running them locally or through Hugging Face's Inference Providers. Hugging Face itself provides several Python packages to enable access, which LlamaIndex wraps into LLM entities:

- The transformers package: use llama_index.llms.HuggingFaceLLM
- Hugging Face's Inference Providers, wrapped by huggingface_hub[inference]: use llama_index.llms.HuggingFaceInferenceAPI
There are many possible permutations of these two, so this notebook only details a few of them. Let's use Hugging Face's Text Generation task as our example.
In the below lines, we install the packages necessary for this demo:
- transformers[torch] is needed for HuggingFaceLLM
- huggingface_hub[inference] is needed for HuggingFaceInferenceAPI
- The quotes are needed for Z shell (zsh)
In [ ]:
%pip install llama-index-llms-huggingface # for local inference
%pip install llama-index-llms-huggingface-api # for remote inference
In [ ]:
!pip install "transformers[torch]" "huggingface_hub[inference]"
!pip install "transformers[torch]" "huggingface_hub[inference]"
如果您在 Colab 上打开此 Notebook,可能需要安装 LlamaIndex 🦙。
In [ ]:
!pip install llama-index
Now that we're set up, we can play around:
In [ ]:
import os
from typing import List, Optional
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
HF_TOKEN: Optional[str] = os.getenv("HUGGING_FACE_TOKEN")
# NOTE: None default will fall back on Hugging Face's token storage
# when this token gets used within HuggingFaceInferenceAPI
In [ ]:
remotely_run = HuggingFaceInferenceAPI(
    model_name="deepseek-ai/DeepSeek-R1-0528",
    token=HF_TOKEN,
    provider="auto",  # this will use the best provider available
)
We can also specify our preferred inference provider. Here we use the together provider.
In [ ]:
remotely_run = HuggingFaceInferenceAPI(
    model_name="Qwen/Qwen3-235B-A22B",
    token=HF_TOKEN,
    provider="together",  # explicitly route requests through Together AI
)
Use an open-source model locally
First, we'll use an open-source model that is optimized for local inference. The model is downloaded (on first invocation) to your local Hugging Face model cache and then runs on your own machine's hardware.
We'll use the Gemma 3N E4B model, which is optimized for on-device inference.
In [ ]:
locally_run = HuggingFaceLLM(model_name="google/gemma-3n-E4B-it")
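Once instantiated, the local model is queried like any other LlamaIndex LLM. A minimal usage sketch (the prompt below is illustrative, not from the original notebook):

In [ ]:
# Illustrative quick check: run a completion on the locally loaded model
local_completion = locally_run.complete("To infinity, and")
print(local_completion)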
Use a dedicated Inference Endpoint
We can also spin up a dedicated Inference Endpoint for a model and run the model through it.
In [ ]:
endpoint_server = HuggingFaceInferenceAPI(
    model="https://<your-endpoint>.eu-west-1.aws.endpoints.huggingface.cloud"
)
In [ ]:
# You can also connect to a model being served by a local or remote
# Text Generation Inference server
tgi_server = HuggingFaceInferenceAPI(model="http://localhost:8080")
Underlying HuggingFaceInferenceAPI's completion functionality is Hugging Face's Text Generation task.
In [ ]:
completion_response = remotely_run.complete("To infinity, and")
print(completion_response)
beyond! The Infinity Wall Clock is a unique and stylish way to keep track of time. The clock is made of a durable, high-quality plastic and features a bright LED display. The Infinity Wall Clock is powered by batteries and can be mounted on any wall. It is a great addition to any home or office.
Setting the tokenizer
If you are modifying the LLM, you should also change the global tokenizer to match!
In [ ]:
from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer
set_global_tokenizer(
    AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha").encode
)
If you're curious, other Hugging Face Inference API tasks that are wrapped include:
- llama_index.llms.HuggingFaceInferenceAPI.chat: conversational task (see the sketch below)
- llama_index.embeddings.HuggingFaceInferenceAPIEmbedding: feature extraction task
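A minimal chat sketch, reusing the remotely_run client from above; the message content is illustrative:

In [ ]:
from llama_index.core.llms import ChatMessage

# chat wraps Hugging Face's conversational task
chat_response = remotely_run.chat(
    [ChatMessage(role="user", content="Who is Buzz Lightyear?")]
)
print(chat_response)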
And yes, Hugging Face embedding models are supported with:
- transformers[torch]: wrapped by HuggingFaceEmbedding
- huggingface_hub[inference]: wrapped by HuggingFaceInferenceAPIEmbedding
Both of the above classes are subclasses of llama_index.embeddings.base.BaseEmbedding.
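As a final hedged sketch, here is how one of those embedding classes could be used, assuming the separate llama-index-embeddings-huggingface package is installed; the model name is an illustrative choice, not prescribed by the original notebook:

In [ ]:
# Assumes: pip install llama-index-embeddings-huggingface
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# BAAI/bge-small-en-v1.5 is an illustrative embedding model choice
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
vector = embed_model.get_text_embedding("To infinity, and beyond!")
print(len(vector))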