Xorbits Inference

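In this demo notebook, we show how to use Xorbits Inference (Xinference for short) to deploy a local LLM in three steps.

The example uses a Llama 2 chat model in GGML format, but the code should be easy to adapt to any chat model that Xinference supports. Some of the supported models are listed below: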

Name	Type	Language	Format	Size (billions of parameters)	Quantization
llama-2-chat	RLHF model	English	ggmlv3	7, 13, 70	'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'
chatglm	SFT model	English/Chinese	ggmlv3	6	'q4_0', 'q4_1', 'q5_0', 'q5_1', 'q8_0'
chatglm2	SFT model	English/Chinese	ggmlv3	6	'q4_0', 'q4_1', 'q5_0', 'q5_1', 'q8_0'
wizardlm-v1.0	SFT model	English	ggmlv3	7, 13, 33	'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'
wizardlm-v1.1	SFT model	English	ggmlv3	13	'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'
vicuna-v1.3	SFT model	English	ggmlv3	7, 13	'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'

For the latest complete list of supported models, see the official Xorbits Inference GitHub page.

🤖 Install Xinference

i. Run pip install "xinference[all]" in a terminal window.

ii. After the installation completes, restart this Jupyter notebook.

iii. Run xinference in a new terminal window.

iv. You should see output similar to the following:

INFO:xinference:Xinference successfully started. Endpoint: http://127.0.0.1:9997
INFO:xinference.core.service:Worker 127.0.0.1:21561 has been added successfully
INFO:xinference.deploy.worker:Xinference worker successfully started.

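v. In the endpoint description, locate the port number after the final colon. In the example above, the port number is 9997.

vi. Install the LlamaIndex integration package and set the port number using the cells below: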

In [ ]:

%pip install llama-index-llms-xinference

In [ ]:

port = 9997  # replace with your endpoint port number
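
If you would rather not read the port off the terminal by eye, the lookup described in step v can be sketched in a few lines. This is a minimal sketch that assumes the startup line has the same shape as the sample output above; the helper names parse_endpoint_port and is_listening are hypothetical and are not part of Xinference. The second helper only checks that something accepts TCP connections on that port, not that it is actually an Xinference server.

```python
import re
import socket
from typing import Optional

# Assumption: the startup line looks like the sample output shown above,
# i.e. it contains "Endpoint: http://<host>:<port>".
ENDPOINT_PATTERN = re.compile(r"Endpoint:\s*http://([\w.\-]+):(\d+)")


def parse_endpoint_port(log_line: str) -> Optional[int]:
    """Return the port from an 'Endpoint: http://host:port' line, or None."""
    match = ENDPOINT_PATTERN.search(log_line)
    return int(match.group(2)) if match else None


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


log_line = (
    "INFO:xinference:Xinference successfully started. "
    "Endpoint: http://127.0.0.1:9997"
)
print(parse_endpoint_port(log_line))  # 9997
```

Calling is_listening("127.0.0.1", 9997) before moving on is a quick way to catch a mistyped port or a server that has not finished starting.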

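🚀 Launch a Local Model

In this step, we begin by importing the relevant libraries from llama_index.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.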

In [ ]:

!pip install llama-index

In [ ]:

# If Xinference cannot be imported, you may need to restart the Jupyter notebook
from llama_index.core import SummaryIndex
from llama_index.core import (
    TreeIndex,
    VectorStoreIndex,
    KeywordTableIndex,
    KnowledgeGraphIndex,
    SimpleDirectoryReader,
)
from llama_index.llms.xinference import Xinference
from xinference.client import RESTfulClient
from IPython.display import Markdown, display

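Next, we launch a model and get it running. This lets us connect the model to documents and queries in the later steps.

Feel free to adjust the parameters for better performance. For the best results, a model with 13 billion parameters or more is recommended. For this short demo, however, the 7B model is more than enough.

Here are more parameter options for the Llama 2 chat model in GGML format, listed from least to most resource-intensive and from lowest to highest quality:

Model size (in billions of parameters):

7, 13, 70

Quantization options for the 7B and 13B models:

q2_K, q3_K_L, q3_K_M, q3_K_S, q4_0, q4_1, q4_K_M, q4_K_S, q5_0, q5_1, q5_K_M, q5_K_S, q6_K, q8_0

Quantization options for the 70B model:

q4_0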
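
Because the 70B model accepts only one quantization, it can help to check a size and quantization pair before starting a long download. The sketch below simply encodes the options listed above in a lookup table; check_launch_options and the table name are hypothetical helpers for this notebook, not part of the Xinference API, and the table would need updating if the supported options change.

```python
from typing import Dict, Tuple

# Options for llama-2-chat in ggmlv3 format, copied from the lists above.
SMALL_MODEL_QUANTS: Tuple[str, ...] = (
    "q2_K", "q3_K_L", "q3_K_M", "q3_K_S", "q4_0", "q4_1", "q4_K_M",
    "q4_K_S", "q5_0", "q5_1", "q5_K_M", "q5_K_S", "q6_K", "q8_0",
)
LLAMA2_CHAT_GGML_OPTIONS: Dict[int, Tuple[str, ...]] = {
    7: SMALL_MODEL_QUANTS,
    13: SMALL_MODEL_QUANTS,
    70: ("q4_0",),
}


def check_launch_options(size_in_billions: int, quantization: str) -> None:
    """Raise ValueError if the pair is not among the options listed above."""
    allowed = LLAMA2_CHAT_GGML_OPTIONS.get(size_in_billions)
    if allowed is None:
        raise ValueError(f"Unsupported model size: {size_in_billions}")
    if quantization not in allowed:
        raise ValueError(
            f"{quantization!r} is not listed for the {size_in_billions}B model; "
            f"choose one of {', '.join(allowed)}"
        )


check_launch_options(7, "q2_K")  # the combination used in this demo; passes silently
```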

In [ ]:

# Define a client to send commands to xinference
client = RESTfulClient(f"http://localhost:{port}")

# Download and Launch a model, this may take a while the first time
model_uid = client.launch_model(
    model_name="llama-2-chat",
    model_size_in_billions=7,
    model_format="ggmlv3",
    quantization="q2_K",
)

# Initiate Xinference object to use the LLM
llm = Xinference(
    endpoint=f"http://localhost:{port}",
    model_uid=model_uid,
    temperature=0.0,
    max_tokens=512,
)

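🕺 Index the Data... and Chat!

In this step, we combine the model with the data to create a query engine. The query engine can then be used as a chatbot that answers our queries based on the given data.

We use VectorStoreIndex because it is relatively fast. That said, feel free to swap in a different index for a different experience. The following index types were already imported in the previous step:

SummaryIndex, TreeIndex, VectorStoreIndex, KeywordTableIndex, KnowledgeGraphIndex

To change the index type, simply replace VectorStoreIndex with another index in the code below.

The latest complete list of available indexes can be found in the official LlamaIndex documentation.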

In [ ]:

# create index from the data
documents = SimpleDirectoryReader("../data/paul_graham").load_data()

# change index name in the following line
index = VectorStoreIndex.from_documents(documents=documents)

# create the query engine
query_engine = index.as_query_engine(llm=llm)

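Before asking a question, we can optionally set the temperature and the maximum answer length (in tokens) directly on the Xinference object. This lets us adjust the parameters for different questions without rebuilding the query engine each time.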

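temperature is a number between 0 and 1 that controls the randomness of the responses. Higher values increase creativity but may lead to off-topic replies. Setting it to zero makes generation deterministic, so the same question should produce the same response each time.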
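
To make the effect of temperature concrete, here is a minimal, self-contained sketch of how temperature is commonly applied when sampling the next token: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. This illustrates the general technique only; it is not taken from Xinference's source, and the function name and example logits are made up for illustration.

```python
import math
from typing import List


def token_probabilities(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into sampling probabilities at a given temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy decoding: all probability mass goes to the top-scoring token,
        # which is why the output is the same on every run.
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(score - peak) for score in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


example_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.0, 0.5, 1.0):
    print(t, [round(p, 3) for p in token_probabilities(example_logits, t)])
```

As the temperature rises, the probabilities move closer together, so less likely tokens are picked more often.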

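max_tokens is an integer that sets an upper bound on the response length. Increase it if you find that answers are being cut off, but be aware that an overly long response may exceed the model's context window and cause errors.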
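
The constraint behind that warning is simple arithmetic: the prompt tokens plus max_tokens must fit inside the context window. The sketch below assumes a 4096-token window, which is the published context length of Llama 2; the prompt length is a made-up number, and both helper names are hypothetical.

```python
def fits_context(prompt_tokens: int, max_tokens: int, context_window: int) -> bool:
    """Return True if the prompt plus the longest allowed answer fits the window."""
    return prompt_tokens + max_tokens <= context_window


def largest_safe_max_tokens(prompt_tokens: int, context_window: int) -> int:
    """Return the largest max_tokens that still fits after the prompt."""
    return max(context_window - prompt_tokens, 0)


CONTEXT_WINDOW = 4096  # assumption: Llama 2 context length
PROMPT_TOKENS = 2500   # hypothetical: the question plus retrieved text

print(fits_context(PROMPT_TOKENS, 2048, CONTEXT_WINDOW))
print(largest_safe_max_tokens(PROMPT_TOKENS, CONTEXT_WINDOW))
```

With a retrieval-based query engine, the prompt includes the retrieved passages as well as the question, so the real prompt length is usually much larger than the question alone.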

In [ ]:

# optionally, update the temperature and max answer length (in tokens)
llm.__dict__.update({"temperature": 0.0})
llm.__dict__.update({"max_tokens": 2048})

# ask a question and display the answer
question = "What did the author do after his time at Y Combinator?"

response = query_engine.query(question)
display(Markdown(f"<b>{response}</b>"))