RAG with Weaviate¶
Step | Tech | Execution |
---|---|---|
Embedding | OpenAI | 🌐 Remote |
Vector store | Weaviate | 💻 Local |
Gen AI | OpenAI | 🌐 Remote |
A recipe 🧑‍🍳 🐥 💚¶
This is a code recipe that uses Weaviate to perform RAG over PDF documents parsed by Docling.
In this notebook, we accomplish the following:
- Parse influential machine learning papers from arXiv using Docling
- Perform hierarchical chunking of the documents using Docling
- Generate text embeddings with OpenAI
- Perform RAG using Weaviate
To run this notebook, you'll need:
- An OpenAI API key
- Access to a GPU
Note: For best results, please use GPU acceleration to run this notebook. Here are two options for running it:
- Locally on a MacBook with an Apple Silicon chip. Converting all documents in this notebook takes roughly 2 minutes on a MacBook M2, since Docling uses the MPS accelerator.
- On Google Colab. Converting all documents in this notebook takes roughly 8 minutes on a Google Colab T4 GPU.
Install Docling and Weaviate client¶
Note: If Colab prompts you to restart the session after running the cell below, click "Restart" and proceed with running the rest of the notebook.
%%capture
%pip install docling~="2.7.0"
%pip install -U weaviate-client~="4.9.4"
%pip install rich
%pip install torch
import logging
import warnings
warnings.filterwarnings("ignore")
# Suppress Weaviate client logs
logging.getLogger("weaviate").setLevel(logging.ERROR)
🐥 Part 1: Docling¶
Part of what makes Docling so remarkable is that it can run on commodity hardware. This means this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with an Apple Silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration for macOS, integrates seamlessly with PyTorch and TensorFlow, offers energy-efficient performance on Apple Silicon, and is broadly compatible with all Metal-supported GPUs.
The code below checks whether a GPU is available, either via CUDA or MPS.
import torch
# Check if GPU or MPS is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("MPS GPU is enabled.")
else:
    raise OSError(
        "No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured."
    )
MPS GPU is enabled.
Here, we've collected 10 influential machine learning papers published as PDFs on arXiv. Because Docling does not yet have title extraction for PDFs, we manually add the titles in a corresponding list.
Note: Converting all 10 papers takes roughly 8 minutes with a T4 GPU.
# Influential machine learning papers
source_urls = [
    "https://arxiv.org/pdf/1706.03762",
    "https://arxiv.org/pdf/1810.04805",
    "https://arxiv.org/pdf/1406.2661",
    "https://arxiv.org/pdf/1409.0473",
    "https://arxiv.org/pdf/1412.6980",
    "https://arxiv.org/pdf/1312.6114",
    "https://arxiv.org/pdf/1312.5602",
    "https://arxiv.org/pdf/1512.03385",
    "https://arxiv.org/pdf/1409.3215",
    "https://arxiv.org/pdf/1301.3781",
]
# And their corresponding titles (because Docling doesn't have title extraction yet!)
source_titles = [
    "Attention Is All You Need",
    "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",
    "Generative Adversarial Nets",
    "Neural Machine Translation by Jointly Learning to Align and Translate",
    "Adam: A Method for Stochastic Optimization",
    "Auto-Encoding Variational Bayes",
    "Playing Atari with Deep Reinforcement Learning",
    "Deep Residual Learning for Image Recognition",
    "Sequence to Sequence Learning with Neural Networks",
    "A Neural Probabilistic Language Model",
]
Convert PDFs to Docling documents¶
Here we use Docling's `.convert_all()` to parse a batch of PDFs. The result is a list of Docling documents that we can use for text extraction.
Note: Please ignore the `ERR#` message.
from docling.document_converter import DocumentConverter
# Instantiate the doc converter
doc_converter = DocumentConverter()
# Directly pass list of files or streams to `convert_all`
conv_results_iter = doc_converter.convert_all(source_urls) # previously `convert`
# Iterate over the generator to get a list of Docling documents
docs = [result.document for result in conv_results_iter]
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 84072.91it/s]
ERR#: COULD NOT CONVERT TO RS THIS TABLE TO COMPUTE SPANS
Perform hierarchical chunking on documents¶
We use Docling's `HierarchicalChunker()` to perform hierarchical chunking on our list of documents.
from docling_core.transforms.chunker import HierarchicalChunker
# Initialize lists for texts and titles
texts, titles = [], []
chunker = HierarchicalChunker()
# Process each document in the list
for doc, title in zip(docs, source_titles):  # Pair each document with its title
    chunks = list(
        chunker.chunk(doc)
    )  # Perform hierarchical chunking and get text from chunks
    for chunk in chunks:
        texts.append(chunk.text)
        titles.append(title)
Since we have split the documents into chunks, we concatenate the article title to the start of each chunk to provide additional context.
# Concatenate title and text
for i in range(len(texts)):
    texts[i] = f"{titles[i]} {texts[i]}"
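As a quick sanity check, you can preview one of the prefixed chunks before embedding them. A minimal sketch using the `texts` list built above (nothing here beyond standard Python):
# Inspect the chunk count and the first title-prefixed chunk
print(f"Total chunks: {len(texts)}")
print(texts[0][:200])  # First 200 characters of the first chunk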
🐥 Part 2: Weaviate¶
Create and configure an embedded Weaviate collection¶
We'll be using the OpenAI API both for generating the text embeddings and as the generative model in our RAG pipeline. The code below dynamically fetches your API key based on whether you're running this notebook in Google Colab or as a regular Jupyter notebook. All you need to do is replace `openai_api_key_var` with the name of your environment variable or the name of the API key in your Colab secrets.
If you're running this notebook in Google Colab, make sure you add your API key as a secret.
# OpenAI API key variable name
openai_api_key_var = "OPENAI_API_KEY"  # Replace with the name of your secret/env var
# Fetch OpenAI API key
try:
    # If running in Colab, fetch API key from Secrets
    import google.colab
    from google.colab import userdata
    openai_api_key = userdata.get(openai_api_key_var)
    if not openai_api_key:
        raise ValueError(f"Secret '{openai_api_key_var}' not found in Colab secrets.")
except ImportError:
    # If not running in Colab, fetch API key from environment variable
    import os
    openai_api_key = os.getenv(openai_api_key_var)
    if not openai_api_key:
        raise OSError(
            f"Environment variable '{openai_api_key_var}' is not set. "
            "Please define it before running this script."
        )
Embedded Weaviate allows you to spin up a Weaviate instance directly from your application code, without needing a Docker container. If you're interested in other deployment methods, like using Docker-Compose or Kubernetes, check out this page in the Weaviate docs.
import weaviate
# Connect to Weaviate embedded
client = weaviate.connect_to_embedded(headers={"X-OpenAI-Api-Key": openai_api_key})
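If you later outgrow embedded mode, only the connection call changes. Below is a minimal sketch, assuming a Weaviate instance is already running locally via Docker on the default ports (8080 for HTTP, 50051 for gRPC); the rest of the notebook would work unchanged:
# Alternative (assumes a local Docker-based Weaviate): connect instead of embedding
client = weaviate.connect_to_local(
    host="localhost",  # Default Docker host
    port=8080,  # Default HTTP port
    grpc_port=50051,  # Default gRPC port
    headers={"X-OpenAI-Api-Key": openai_api_key},  # Same OpenAI header as above
)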
import weaviate.classes.config as wc
# Define the collection name
collection_name = "docling"
# Delete the collection if it already exists
if client.collections.exists(collection_name):
    client.collections.delete(collection_name)
# Create the collection
collection = client.collections.create(
    name=collection_name,
    vectorizer_config=wc.Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-large",  # Specify your embedding model here
    ),
    # Enable generative model from OpenAI
    generative_config=wc.Configure.Generative.openai(
        model="gpt-4o"  # Specify your generative model for RAG here
    ),
    # Define properties of metadata
    properties=[
        wc.Property(name="text", data_type=wc.DataType.TEXT),
        wc.Property(name="title", data_type=wc.DataType.TEXT, skip_vectorization=True),
    ],
)
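As a quick check that the collection was registered, you can reuse the same existence call from above (expected output: True):
# Confirm the collection now exists
print(client.collections.exists(collection_name))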
Wrangle data into an acceptable format for Weaviate¶
Transform our data from lists into a list of dictionaries so it can be inserted into our Weaviate collection.
# Initialize the data object
data = []
# Create a dictionary for each row by iterating through the corresponding lists
for text, title in zip(texts, titles):
    data_point = {
        "text": text,
        "title": title,
    }
    data.append(data_point)
Insert data into Weaviate and generate embeddings¶
Embeddings will be generated automatically upon insertion into our Weaviate collection.
# Insert text chunks and metadata into vector DB collection
response = collection.data.insert_many(data)
if response.has_errors:
    print(response.errors)
else:
    print("Insert complete.")
Query the data¶
Here, we perform a simple similarity search to return the embedded chunks most similar to our search query.
from weaviate.classes.query import MetadataQuery
response = collection.query.near_text(
    query="bert",
    limit=2,
    return_metadata=MetadataQuery(distance=True),
    return_properties=["text", "title"],
)
for o in response.objects:
    print(o.properties)
    print(o.metadata.distance)
{'text': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding A distinctive feature of BERT is its unified architecture across different tasks. There is mini-', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'}
0.6578550338745117
{'text': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding We introduce a new language representation model called BERT , which stands for B idirectional E ncoder R epresentations from T ransformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'}
0.6696287989616394
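Since `title` was stored as a non-vectorized metadata property, the same similarity search can also be narrowed to a single paper. A minimal sketch using the v4 `Filter` helper (the filter value is simply one of the titles loaded earlier):
from weaviate.classes.query import Filter
# Restrict the similarity search to chunks from one specific paper
response = collection.query.near_text(
    query="bert",
    limit=2,
    filters=Filter.by_property("title").equal(
        "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
    ),
    return_properties=["text", "title"],
)
for o in response.objects:
    print(o.properties)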
Perform RAG on parsed articles¶
Weaviate's `generate` module lets you perform RAG over your embedded data without needing a separate framework.
We specify a prompt that includes the field we want to search through in the database (in this case it's `text`), a query that includes our search term, and the number of retrieved results to use in the generation.
from rich.console import Console
from rich.panel import Panel
# Create a prompt where context from the Weaviate collection will be injected
prompt = "Explain how {text} works, using only the retrieved context."
query = "bert"
response = collection.generate.near_text(
    query=query, limit=3, grouped_task=prompt, return_properties=["text", "title"]
)
# Prettify the output using Rich
console = Console()
console.print(
    Panel(f"{prompt}".replace("{text}", query), title="Prompt", border_style="bold red")
)
console.print(
    Panel(response.generated, title="Generated Content", border_style="bold green")
)
╭──────────────────────────────────────────────────── Prompt ─────────────────────────────────────────────────────╮ │ Explain how bert works, using only the retrieved context. │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────────────────────────────── Generated Content ───────────────────────────────────────────────╮ │ BERT, which stands for Bidirectional Encoder Representations from Transformers, is a language representation │ │ model designed to pretrain deep bidirectional representations from unlabeled text. It conditions on both left │ │ and right context in all layers, unlike traditional left-to-right or right-to-left language models. This │ │ pre-training involves two unsupervised tasks. The pre-trained BERT model can then be fine-tuned with just one │ │ additional output layer to create state-of-the-art models for various tasks, such as question answering and │ │ language inference, without needing substantial task-specific architecture modifications. A distinctive feature │ │ of BERT is its unified architecture across different tasks. │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# Create a prompt where context from the Weaviate collection will be injected
prompt = "Explain how {text} works, using only the retrieved context."
query = "a generative adversarial net"
response = collection.generate.near_text(
    query=query, limit=3, grouped_task=prompt, return_properties=["text", "title"]
)
# Prettify the output using Rich
console = Console()
console.print(
    Panel(f"{prompt}".replace("{text}", query), title="Prompt", border_style="bold red")
)
console.print(
    Panel(response.generated, title="Generated Content", border_style="bold green")
)
╭──────────────────────────────────────────────────── Prompt ─────────────────────────────────────────────────────╮ │ Explain how a generative adversarial net works, using only the retrieved context. │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────────────────────────────── Generated Content ───────────────────────────────────────────────╮ │ Generative Adversarial Nets (GANs) operate within an adversarial framework where two models are trained │ │ simultaneously: a generative model (G) and a discriminative model (D). The generative model aims to capture the │ │ data distribution and generate samples that mimic real data, while the discriminative model's task is to │ │ distinguish between samples from the real data and those generated by G. This setup is akin to a game where the │ │ generative model acts like counterfeiters trying to produce indistinguishable fake currency, and the │ │ discriminative model acts like the police trying to detect these counterfeits. │ │ │ │ The training process involves a minimax two-player game where G tries to maximize the probability of D making a │ │ mistake, while D tries to minimize it. When both models are defined by multilayer perceptrons, they can be │ │ trained using backpropagation without the need for Markov chains or approximate inference networks. The │ │ ultimate goal is for G to perfectly replicate the training data distribution, making D's output equal to 1/2 │ │ everywhere, indicating it cannot distinguish between real and generated data. This framework allows for │ │ specific training algorithms and optimization techniques, such as backpropagation and dropout, to be │ │ effectively utilized. │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
We can see that our RAG pipeline performs relatively well for simple queries, especially given the small size of the dataset. Scaling this approach to convert a larger sample of PDFs would require more compute (GPUs) and a more advanced Weaviate deployment (such as Docker, Kubernetes, or Weaviate Cloud). For more information on available Weaviate configurations, check out the documentation.
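Finally, because embedded Weaviate runs inside this notebook's process, it's good practice to shut it down explicitly when you're done (a one-liner from the standard v4 client API):
# Close the client and stop the embedded Weaviate instance
client.close()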