In [ ]:
%pip install llama-index llama-index-vector-stores-qdrant fastembed
In [ ]:
from fastembed import SparseTextEmbedding

model = SparseTextEmbedding(
    model_name="Qdrant/bm42-all-minilm-l6-v2-attentions",
    # if using fastembed-gpu with cuda+onnx installed
    # providers=["CudaExecutionProvider"],
)

embeddings = model.embed(["hello world", "goodbye world"])
indices, values = zip(
    *[
        (embedding.indices.tolist(), embedding.values.tolist())
        for embedding in embeddings
    ]
)

print(indices[0], values[0])
Fetching 6 files: 0%| | 0/6 [00:00<?, ?it/s]
[613153351, 74040069] [0.3703993395381275, 0.3338314745830077]
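At query time, sparse models usually weight terms differently than they do for documents. A minimal sketch, assuming your installed fastembed version exposes `query_embed` on sparse models (check your release if not):

# sketch: embed a query with the same sparse model (assumes `query_embed` is available)
query_embedding = list(model.query_embed("what is the weather in london?"))[0]
print(query_embedding.indices.tolist(), query_embedding.values.tolist())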
Load Data

Here, we use llama-parse to read in the Llama2 paper! Using the JSON result mode, we can get detailed data about each page, including layout and images. For now, we will just use the page number and text content.

You can get a free llama-parse API key at https://cloud.llamaindex.ai.
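Rather than hard-coding the key in every call, you can typically export it as an environment variable (LlamaParse looks for LLAMA_CLOUD_API_KEY); a minimal sketch:

import os

# sketch: keep the key out of the notebook; LlamaParse picks this up automatically
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."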
In [ ]:
!mkdir -p 'data/'
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
In [ ]:
import nest_asyncio
nest_asyncio.apply()
In [ ]:
from llama_parse import LlamaParse
from llama_index.core import Document

parser = LlamaParse(result_type="text", api_key="llx-...")

# get per-page results, along with detailed layout info and metadata
json_data = parser.get_json_result("data/llama2.pdf")

documents = []
for document_json in json_data:
    for page in document_json["pages"]:
        documents.append(
            Document(text=page["text"], metadata={"page_number": page["page"]})
        )
Started parsing the file under job_id cac11eca-4058-4a89-a94a-5603dea3d851
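Before indexing, it can be worth a quick sanity check that every page became a Document (the exact counts depend on the PDF, so the output below is illustrative):

# sketch: inspect what llama-parse produced
print(len(documents))
print(documents[0].metadata)
print(documents[0].text[:200])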
Build the Index with Qdrant

Using our nodes, we can build an index with Qdrant and BM42!

In this example, Qdrant is running in a docker container.

You can pull the latest image:

docker pull qdrant/qdrant

And then launch the container:

docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant
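If you prefer not to run Docker at all, qdrant-client can also run fully in-process; a minimal sketch (note that `:memory:` data is not shared between a sync and an async client, so pass only one in that mode):

import qdrant_client

# sketch: ephemeral, in-process Qdrant for quick experiments (no Docker needed)
client = qdrant_client.QdrantClient(":memory:")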
In [ ]:
import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient("http://localhost:6333")
aclient = qdrant_client.AsyncQdrantClient("http://localhost:6333")

# delete the collection if it already exists
if client.collection_exists("llama2_bm42"):
    client.delete_collection("llama2_bm42")

vector_store = QdrantVectorStore(
    collection_name="llama2_bm42",
    client=client,
    aclient=aclient,
    fastembed_sparse_model="Qdrant/bm42-all-minilm-l6-v2-attentions",
)
Both client and aclient are provided. If using `:memory:` mode, the data between clients is not synced.
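The warning above only applies to `:memory:` mode; here both clients point at the same running server, so it can be ignored. You can confirm the connection (and, after indexing, the collection) from the sync client:

# sketch: confirm the server is reachable and list existing collections
print(client.get_collections())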
In [ ]:
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.embeddings.openai import OpenAIEmbedding

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    # our dense embedding model
    embed_model=OpenAIEmbedding(
        model_name="text-embedding-3-small", api_key="sk-proj-..."
    ),
    storage_context=storage_context,
)
As we can see, both the dense and sparse embeddings are generated extremely quickly! Even though the sparse model runs locally on CPU, it is small and fast.
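The per-page Documents above are chunked with LlamaIndex's default node parser. If you want explicit control over chunking, one option (a sketch with illustrative values, not what was run above) is to pass a splitter via transformations:

from llama_index.core.node_parser import SentenceSplitter

# sketch: explicit chunking before embedding (chunk sizes are illustrative)
index = VectorStoreIndex.from_documents(
    documents,
    transformations=[SentenceSplitter(chunk_size=512, chunk_overlap=64)],
    embed_model=OpenAIEmbedding(model="text-embedding-3-small", api_key="sk-proj-..."),
    storage_context=storage_context,
)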
Test the Index

With the power of sparse embeddings, we can query for some very specific facts and get the correct data back.
In [ ]:
from llama_index.llms.openai import OpenAI

chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    llm=OpenAI(model="gpt-4o", api_key="sk-proj-..."),
)
In [ ]:
response = chat_engine.chat("What training hardware was used for Llama2?")
print(str(response))
The training hardware for Llama 2 included Meta's Research Super Cluster (RSC) and internal production clusters. Both clusters utilized NVIDIA A100 GPUs. There were two key differences between these clusters:

1. **Interconnect Type**:
   - RSC used NVIDIA Quantum InfiniBand.
   - The internal production cluster used a RoCE (RDMA over Converged Ethernet) solution based on commodity Ethernet switches.
2. **Per-GPU Power Consumption Cap**:
   - RSC had a power consumption cap of 400W per GPU.
   - The internal production cluster had a power consumption cap of 350W per GPU.

This setup allowed for a comparison of the suitability of these different types of interconnects for large-scale training.
In [ ]:
response = chat_engine.chat("What is the main idea of Llama2?")
print(str(response))
The main idea of Llama 2 is to provide an updated and improved version of the original Llama model, designed to be more efficient, scalable, and safe for various applications, including research and commercial use. Here are the key aspects of Llama 2:

1. **Enhanced Pretraining**: Llama 2 is trained on a new mix of publicly available data, with a 40% increase in the size of the pretraining corpus compared to Llama 1. This aims to improve the model's performance and knowledge base.
2. **Improved Architecture**: The model incorporates several architectural enhancements, such as increased context length and grouped-query attention (GQA), to improve inference scalability and overall performance.
3. **Safety and Responsiveness**: Llama 2-Chat, a fine-tuned version of Llama 2, is optimized for dialogue use cases. It undergoes supervised fine-tuning and iterative refinement using Reinforcement Learning with Human Feedback (RLHF) to ensure safer and more helpful interactions.
4. **Open Release**: Meta is releasing Llama 2 models with 7B, 13B, and 70B parameters to the general public for research and commercial use, promoting transparency and collaboration in the AI community.
5. **Responsible Use**: The release includes guidelines and code examples to facilitate the safe deployment of Llama 2 and Llama 2-Chat, emphasizing the importance of safety testing and tuning tailored to specific applications.

Overall, Llama 2 aims to be a more robust, scalable, and safer large language model that can be widely used and further developed by the AI community.
In [ ]:
response = chat_engine.chat("What was Llama2 evaluated and compared against?")
print(str(response))
Llama 2 was evaluated and compared against several other models, both open-source and closed-source, across a variety of benchmarks. Here are the key comparisons:

### Open-Source Models:
1. **Llama 1**: Llama 2 models were compared to their predecessors, Llama 1 models. For example, Llama 2 70B showed improvements of approximately 5 points on MMLU and 8 points on BBH compared to Llama 1 65B.
2. **MPT Models**: Llama 2 7B and 30B models outperformed MPT models of corresponding sizes in all categories except code benchmarks.
3. **Falcon Models**: Llama 2 7B and 34B models outperformed Falcon 7B and 40B models across all benchmark categories.

### Closed-Source Models:
1. **GPT-3.5**: Llama 2 70B was compared to GPT-3.5, showing close performance on MMLU and GSM8K but a significant gap on coding benchmarks.
2. **PaLM (540B)**: Llama 2 70B performed on par or better than PaLM (540B) on almost all benchmarks.
3. **GPT-4 and PaLM-2-L**: There remains a large performance gap between Llama 2 70B and these more advanced models.

### Benchmarks:
Llama 2 was evaluated on a variety of benchmarks, including:
1. **MMLU (Massive Multitask Language Understanding)**: Evaluated in a 5-shot setting.
2. **BBH (Big Bench Hard)**: Evaluated in a 3-shot setting.
3. **AGI Eval**: Evaluated in 3-5 shot settings, focusing on English tasks.
4. **GSM8K**: For math problem-solving.
5. **Human-Eval and MBPP**: For code generation.
6. **NaturalQuestions and TriviaQA**: For world knowledge.
7. **SQUAD and QUAC**: For reading comprehension.
8. **BoolQ, PIQA, SIQA, Hella-Swag, ARC-e, ARC-c, NQ, TQA**: Various other benchmarks for different aspects of language understanding and reasoning.

These evaluations demonstrate that Llama 2 models generally outperform their predecessors and other open-source models, while also being competitive with some of the leading closed-source models.
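Beyond the chat engine, you can also query the hybrid index directly and tune how many dense vs. sparse candidates are retrieved. A minimal sketch using standard LlamaIndex query parameters (the top-k values are illustrative):

# sketch: plain hybrid retrieval with separate dense and sparse candidate counts
query_engine = index.as_query_engine(
    vector_store_query_mode="hybrid",
    similarity_top_k=2,
    sparse_top_k=10,
    llm=OpenAI(model="gpt-4o", api_key="sk-proj-..."),
)
response = query_engine.query("What training hardware was used for Llama2?")
print(str(response))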
Loading from Existing Storage

Once the vector index has been created, we can easily re-connect to it!
In [ ]:
import qdrant_client
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient("http://localhost:6333")
aclient = qdrant_client.AsyncQdrantClient("http://localhost:6333")

# re-attach to the existing collection (do not delete it -- we want the data we already indexed)
vector_store = QdrantVectorStore(
    collection_name="llama2_bm42",
    client=client,
    aclient=aclient,
    fastembed_sparse_model="Qdrant/bm42-all-minilm-l6-v2-attentions",
)

loaded_index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=OpenAIEmbedding(
        model="text-embedding-3-small", api_key="sk-proj-..."
    ),
)
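To verify the reconnection worked, you could run a quick query against the loaded index, reusing the same LLM setup as above (a sketch, not output from the original run):

from llama_index.llms.openai import OpenAI

# sketch: confirm the reloaded index answers queries as before
query_engine = loaded_index.as_query_engine(
    llm=OpenAI(model="gpt-4o", api_key="sk-proj-..."),
)
print(query_engine.query("What training hardware was used for Llama2?"))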