Hugging Face LLMs
There are many ways to interact with LLMs from Hugging Face, whether running them locally or through Hugging Face's Inference Providers. Hugging Face itself provides several Python packages to enable access, which LlamaIndex wraps into LLM entities:

- The transformers package: use llama_index.llms.HuggingFaceLLM
- Hugging Face's Inference Providers, wrapped by huggingface_hub[inference]: use llama_index.llms.HuggingFaceInferenceAPI
There are many possible permutations of these two, so this notebook only details a few of them. Let's use Hugging Face's Text Generation task as our example.
In the below lines, we install the packages necessary for this demo:
- transformers[torch] is needed for HuggingFaceLLM
- huggingface_hub[inference] is needed for HuggingFaceInferenceAPI
- The quotes are needed for Z shell (zsh)
In [ ]:
%pip install llama-index-llms-huggingface # for local inference
%pip install llama-index-llms-huggingface-api # for remote inference
In [ ]:
!pip install "transformers[torch]" "huggingface_hub[inference]"
!pip install "transformers[torch]" "huggingface_hub[inference]"
如果您在 Colab 上打开此 Notebook,可能需要安装 LlamaIndex 🦙。
In [ ]:
!pip install llama-index
Now that we're set up, we can play around:
In [ ]:
import os
from typing import List, Optional
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
HF_TOKEN: Optional[str] = os.getenv("HUGGING_FACE_TOKEN")
# NOTE: None default will fall back on Hugging Face's token storage
# when this token gets used within HuggingFaceInferenceAPI
In [ ]:
remotely_run = HuggingFaceInferenceAPI(
    model_name="deepseek-ai/DeepSeek-R1-0528",
    token=HF_TOKEN,
    provider="auto",  # this will use the best provider available
)
We can also specify our preferred inference provider. Here we use the together provider.
In [ ]:
remotely_run = HuggingFaceInferenceAPI(
    model_name="Qwen/Qwen3-235B-A22B",
    token=HF_TOKEN,
    provider="together",  # explicitly route requests through Together AI
)
Use an open-source model locally
First, we'll use an open-source model that is optimized for local inference. The model is downloaded (on first invocation) to your local Hugging Face model cache and then runs on your own machine's hardware.
We'll use the Gemma 3N E4B model, which is optimized for on-device inference.
In [ ]:
locally_run = HuggingFaceLLM(model_name="google/gemma-3n-E4B-it")
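Once instantiated, the local model is queried like any other LlamaIndex LLM. A minimal usage sketch (the prompt below is illustrative, not from the original notebook):

In [ ]:
# Illustrative quick check: run a completion on the locally loaded model
local_completion = locally_run.complete("To infinity, and")
print(local_completion)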
Use a dedicated Inference Endpoint
We can also spin up a dedicated Inference Endpoint for a model and run the model through it.
In [ ]:
endpoint_server = HuggingFaceInferenceAPI(
    model="https://<your-endpoint>.eu-west-1.aws.endpoints.huggingface.cloud"
)
In [ ]:
# You can also connect to a model being served by a local or remote
# Text Generation Inference server
tgi_server = HuggingFaceInferenceAPI(model="http://localhost:8080")
Underlying HuggingFaceInferenceAPI's completion functionality is Hugging Face's Text Generation task.
In [ ]:
completion_response = remotely_run.complete("To infinity, and")
print(completion_response)
beyond! The Infinity Wall Clock is a unique and stylish way to keep track of time. The clock is made of a durable, high-quality plastic and features a bright LED display. The Infinity Wall Clock is powered by batteries and can be mounted on any wall. It is a great addition to any home or office.
Setting the tokenizer
If you are modifying the LLM, you should also change the global tokenizer to match!
In [ ]:
from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer
set_global_tokenizer(
    AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha").encode
)
If you're curious, other Hugging Face Inference API tasks that are wrapped include:
- llama_index.llms.HuggingFaceInferenceAPI.chat: conversational task (see the sketch below)
- llama_index.embeddings.HuggingFaceInferenceAPIEmbedding: feature extraction task
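A minimal chat sketch, reusing the remotely_run client from above; the message content is illustrative:

In [ ]:
from llama_index.core.llms import ChatMessage

# chat wraps Hugging Face's conversational task
chat_response = remotely_run.chat(
    [ChatMessage(role="user", content="Who is Buzz Lightyear?")]
)
print(chat_response)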
And yes, Hugging Face embedding models are supported with:
- transformers[torch]: wrapped by HuggingFaceEmbedding
- huggingface_hub[inference]: wrapped by HuggingFaceInferenceAPIEmbedding
Both of the above classes are subclasses of llama_index.embeddings.base.BaseEmbedding.
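As a final hedged sketch, here is how one of those embedding classes could be used, assuming the separate llama-index-embeddings-huggingface package is installed; the model name is an illustrative choice, not prescribed by the original notebook:

In [ ]:
# Assumes: pip install llama-index-embeddings-huggingface
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# BAAI/bge-small-en-v1.5 is an illustrative embedding model choice
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
vector = embed_model.get_text_embedding("To infinity, and beyond!")
print(len(vector))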