RAG with Weaviate¶
Step | Tech | Execution |
---|---|---|
Embedding | OpenAI | 🌐 Remote |
Vector store | Weaviate | 💻 Local |
Gen AI | OpenAI | 🌐 Remote |
A recipe 🧑‍🍳 🐥 💚¶
This is a code recipe that uses Weaviate to perform RAG over PDF documents parsed by Docling.
In this notebook, we accomplish the following:
- Parse influential machine learning papers from arXiv using Docling
- Perform hierarchical chunking of the documents using Docling
- Generate text embeddings with OpenAI
- Perform RAG using Weaviate
To run this notebook, you'll need:
- An OpenAI API key
- Access to a GPU
Note: For best results, please use GPU acceleration to run this notebook. Here are two options for running it:
- Locally on a MacBook with an Apple Silicon chip. Converting all documents in this notebook takes roughly 2 minutes on a MacBook M2, since Docling uses the MPS accelerator.
- On Google Colab. Converting all documents in this notebook takes roughly 8 minutes on a Google Colab T4 GPU.
Install Docling and Weaviate client¶
Note: If Colab prompts you to restart the session after running the cell below, click "Restart" and proceed with running the rest of the notebook.
%%capture
%pip install docling~="2.7.0"
%pip install -U weaviate-client~="4.9.4"
%pip install rich
%pip install torch
import logging
import warnings
warnings.filterwarnings("ignore")
# Suppress Weaviate client logs
logging.getLogger("weaviate").setLevel(logging.ERROR)
🐥 Part 1: Docling¶
Part of what makes Docling so remarkable is that it can run on commodity hardware. This means this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with an Apple Silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration for macOS, integrates seamlessly with PyTorch and TensorFlow, offers energy-efficient performance on Apple Silicon, and is broadly compatible with all Metal-supported GPUs.
The code below checks whether a GPU is available, either via CUDA or MPS.
import torch
# Check if GPU or MPS is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("MPS GPU is enabled.")
else:
    raise OSError(
        "No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured."
    )
MPS GPU is enabled.
Here, we've collected 10 influential machine learning papers published as PDFs on arXiv. Because Docling does not yet have title extraction for PDFs, we manually add the titles in a corresponding list.
Note: Converting all 10 papers takes roughly 8 minutes with a T4 GPU.
# Influential machine learning papers
source_urls = [
    "https://arxiv.org/pdf/1706.03762",
    "https://arxiv.org/pdf/1810.04805",
    "https://arxiv.org/pdf/1406.2661",
    "https://arxiv.org/pdf/1409.0473",
    "https://arxiv.org/pdf/1412.6980",
    "https://arxiv.org/pdf/1312.6114",
    "https://arxiv.org/pdf/1312.5602",
    "https://arxiv.org/pdf/1512.03385",
    "https://arxiv.org/pdf/1409.3215",
    "https://arxiv.org/pdf/1301.3781",
]
# And their corresponding titles (because Docling doesn't have title extraction yet!)
source_titles = [
    "Attention Is All You Need",
    "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",
    "Generative Adversarial Nets",
    "Neural Machine Translation by Jointly Learning to Align and Translate",
    "Adam: A Method for Stochastic Optimization",
    "Auto-Encoding Variational Bayes",
    "Playing Atari with Deep Reinforcement Learning",
    "Deep Residual Learning for Image Recognition",
    "Sequence to Sequence Learning with Neural Networks",
    "A Neural Probabilistic Language Model",
]
Convert PDFs to Docling documents¶
Here we use Docling's `.convert_all()` to parse a batch of PDFs. The result is a list of Docling documents that we can use for text extraction.
Note: Please ignore the `ERR#` message.
from docling.document_converter import DocumentConverter
# Instantiate the doc converter
doc_converter = DocumentConverter()
# Directly pass list of files or streams to `convert_all`
conv_results_iter = doc_converter.convert_all(source_urls) # previously `convert`
# Iterate over the generator to get a list of Docling documents
docs = [result.document for result in conv_results_iter]
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 84072.91it/s]
ERR#: COULD NOT CONVERT TO RS THIS TABLE TO COMPUTE SPANS
Perform hierarchical chunking on documents¶
We use Docling's `HierarchicalChunker()` to perform hierarchical chunking on our list of documents.
from docling_core.transforms.chunker import HierarchicalChunker
# Initialize lists for texts and titles
texts, titles = [], []
chunker = HierarchicalChunker()
# Process each document in the list
for doc, title in zip(docs, source_titles):  # Pair each document with its title
    chunks = list(
        chunker.chunk(doc)
    )  # Perform hierarchical chunking and get text from chunks
    for chunk in chunks:
        texts.append(chunk.text)
        titles.append(title)
Since we have split the documents into chunks, we concatenate the article title to the start of each chunk to provide additional context.
# Concatenate title and text
for i in range(len(texts)):
    texts[i] = f"{titles[i]} {texts[i]}"
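As a quick sanity check, you can preview one of the prefixed chunks before embedding them. A minimal sketch using the `texts` list built above (nothing here beyond standard Python):
# Inspect the chunk count and the first title-prefixed chunk
print(f"Total chunks: {len(texts)}")
print(texts[0][:200])  # First 200 characters of the first chunk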
🐥 Part 2: Weaviate¶
Create and configure an embedded Weaviate collection¶
We'll be using the OpenAI API both for generating the text embeddings and as the generative model in our RAG pipeline. The code below dynamically fetches your API key based on whether you're running this notebook in Google Colab or as a regular Jupyter notebook. All you need to do is replace `openai_api_key_var` with the name of your environment variable or the name of the API key in your Colab secrets.
If you're running this notebook in Google Colab, make sure you add your API key as a secret.
# OpenAI API key variable name
openai_api_key_var = "OPENAI_API_KEY"  # Replace with the name of your secret/env var
# Fetch OpenAI API key
try:
    # If running in Colab, fetch API key from Secrets
    import google.colab
    from google.colab import userdata
    openai_api_key = userdata.get(openai_api_key_var)
    if not openai_api_key:
        raise ValueError(f"Secret '{openai_api_key_var}' not found in Colab secrets.")
except ImportError:
    # If not running in Colab, fetch API key from environment variable
    import os
    openai_api_key = os.getenv(openai_api_key_var)
    if not openai_api_key:
        raise OSError(
            f"Environment variable '{openai_api_key_var}' is not set. "
            "Please define it before running this script."
        )
Embedded Weaviate allows you to spin up a Weaviate instance directly from your application code, without needing a Docker container. If you're interested in other deployment methods, like using Docker-Compose or Kubernetes, check out this page in the Weaviate docs.
import weaviate
# Connect to Weaviate embedded
client = weaviate.connect_to_embedded(headers={"X-OpenAI-Api-Key": openai_api_key})
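If you later outgrow embedded mode, only the connection call changes. Below is a minimal sketch, assuming a Weaviate instance is already running locally via Docker on the default ports (8080 for HTTP, 50051 for gRPC); the rest of the notebook would work unchanged:
# Alternative (assumes a local Docker-based Weaviate): connect instead of embedding
client = weaviate.connect_to_local(
    host="localhost",  # Default Docker host
    port=8080,  # Default HTTP port
    grpc_port=50051,  # Default gRPC port
    headers={"X-OpenAI-Api-Key": openai_api_key},  # Same OpenAI header as above
)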
import weaviate.classes.config as wc
# Define the collection name
collection_name = "docling"
# Delete the collection if it already exists
if client.collections.exists(collection_name):
    client.collections.delete(collection_name)
# Create the collection
collection = client.collections.create(
    name=collection_name,
    vectorizer_config=wc.Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-large",  # Specify your embedding model here
    ),
    # Enable generative model from OpenAI
    generative_config=wc.Configure.Generative.openai(
        model="gpt-4o"  # Specify your generative model for RAG here
    ),
    # Define properties of metadata
    properties=[
        wc.Property(name="text", data_type=wc.DataType.TEXT),
        wc.Property(name="title", data_type=wc.DataType.TEXT, skip_vectorization=True),
    ],
)
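As a quick check that the collection was registered, you can reuse the same existence call from above (expected output: True):
# Confirm the collection now exists
print(client.collections.exists(collection_name))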
Wrangle data into an acceptable format for Weaviate¶
Transform our data from lists into a list of dictionaries so it can be inserted into our Weaviate collection.
# Initialize the data object
data = []
# Create a dictionary for each row by iterating through the corresponding lists
for text, title in zip(texts, titles):
    data_point = {
        "text": text,
        "title": title,
    }
    data.append(data_point)
Insert data into Weaviate and generate embeddings¶
Embeddings will be generated automatically upon insertion into our Weaviate collection.
# Insert text chunks and metadata into vector DB collection
response = collection.data.insert_many(data)
if response.has_errors:
    print(response.errors)
else:
    print("Insert complete.")
Query the data¶
Here, we perform a simple similarity search to return the embedded chunks most similar to our search query.
from weaviate.classes.query import MetadataQuery
response = collection.query.near_text(
    query="bert",
    limit=2,
    return_metadata=MetadataQuery(distance=True),
    return_properties=["text", "title"],
)
for o in response.objects:
    print(o.properties)
    print(o.metadata.distance)
{'text': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding A distinctive feature of BERT is its unified architecture across different tasks. There is mini-', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'}
0.6578550338745117
{'text': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding We introduce a new language representation model called BERT , which stands for B idirectional E ncoder R epresentations from T ransformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'}
0.6696287989616394
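Since `title` was stored as a non-vectorized metadata property, the same similarity search can also be narrowed to a single paper. A minimal sketch using the v4 `Filter` helper (the filter value is simply one of the titles loaded earlier):
from weaviate.classes.query import Filter
# Restrict the similarity search to chunks from one specific paper
response = collection.query.near_text(
    query="bert",
    limit=2,
    filters=Filter.by_property("title").equal(
        "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
    ),
    return_properties=["text", "title"],
)
for o in response.objects:
    print(o.properties)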
Perform RAG on parsed articles¶
Weaviate's `generate` module lets you perform RAG over your embedded data without needing a separate framework.
We specify a prompt that includes the field we want to search through in the database (in this case it's `text`), a query that includes our search term, and the number of retrieved results to use in the generation.
from rich.console import Console
from rich.panel import Panel
# Create a prompt where context from the Weaviate collection will be injected
prompt = "Explain how {text} works, using only the retrieved context."
query = "bert"
response = collection.generate.near_text(
    query=query, limit=3, grouped_task=prompt, return_properties=["text", "title"]
)
# Prettify the output using Rich
console = Console()
console.print(
    Panel(f"{prompt}".replace("{text}", query), title="Prompt", border_style="bold red")
)
console.print(
    Panel(response.generated, title="Generated Content", border_style="bold green")
)
╭──────────────────────────────────────────────────── Prompt ─────────────────────────────────────────────────────╮ │ Explain how bert works, using only the retrieved context. │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────────────────────────────── Generated Content ───────────────────────────────────────────────╮ │ BERT, which stands for Bidirectional Encoder Representations from Transformers, is a language representation │ │ model designed to pretrain deep bidirectional representations from unlabeled text. It conditions on both left │ │ and right context in all layers, unlike traditional left-to-right or right-to-left language models. This │ │ pre-training involves two unsupervised tasks. The pre-trained BERT model can then be fine-tuned with just one │ │ additional output layer to create state-of-the-art models for various tasks, such as question answering and │ │ language inference, without needing substantial task-specific architecture modifications. A distinctive feature │ │ of BERT is its unified architecture across different tasks. │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# Create a prompt where context from the Weaviate collection will be injected
prompt = "Explain how {text} works, using only the retrieved context."
query = "a generative adversarial net"
response = collection.generate.near_text(
    query=query, limit=3, grouped_task=prompt, return_properties=["text", "title"]
)
# Prettify the output using Rich
console = Console()
console.print(
    Panel(f"{prompt}".replace("{text}", query), title="Prompt", border_style="bold red")
)
console.print(
    Panel(response.generated, title="Generated Content", border_style="bold green")
)
╭──────────────────────────────────────────────────── Prompt ─────────────────────────────────────────────────────╮ │ Explain how a generative adversarial net works, using only the retrieved context. │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────────────────────────────── Generated Content ───────────────────────────────────────────────╮ │ Generative Adversarial Nets (GANs) operate within an adversarial framework where two models are trained │ │ simultaneously: a generative model (G) and a discriminative model (D). The generative model aims to capture the │ │ data distribution and generate samples that mimic real data, while the discriminative model's task is to │ │ distinguish between samples from the real data and those generated by G. This setup is akin to a game where the │ │ generative model acts like counterfeiters trying to produce indistinguishable fake currency, and the │ │ discriminative model acts like the police trying to detect these counterfeits. │ │ │ │ The training process involves a minimax two-player game where G tries to maximize the probability of D making a │ │ mistake, while D tries to minimize it. When both models are defined by multilayer perceptrons, they can be │ │ trained using backpropagation without the need for Markov chains or approximate inference networks. The │ │ ultimate goal is for G to perfectly replicate the training data distribution, making D's output equal to 1/2 │ │ everywhere, indicating it cannot distinguish between real and generated data. This framework allows for │ │ specific training algorithms and optimization techniques, such as backpropagation and dropout, to be │ │ effectively utilized. │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
We can see that our RAG pipeline performs relatively well for simple queries, especially given the small size of the dataset. Scaling this approach to convert a larger sample of PDFs would require more compute (GPUs) and a more advanced Weaviate deployment (such as Docker, Kubernetes, or Weaviate Cloud). For more information on available Weaviate configurations, check out the documentation.
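Finally, because embedded Weaviate runs inside this notebook's process, it's good practice to shut it down explicitly when you're done (a one-liner from the standard v4 client API):
# Close the client and stop the embedded Weaviate instance
client.close()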