RAG with LlamaIndex
Step | Tech | Execution |
---|---|---|
Embedding | Hugging Face / Sentence Transformers | 💻 Local |
Vector store | Milvus | 💻 Local |
Gen AI | Hugging Face Inference API | 🌐 Remote |
Overview
This example leverages the official LlamaIndex Docling extension.

The presented DoclingReader and DoclingNodeParser enable you to:

- use various document types in your LLM applications with ease and speed, and
- leverage Docling's rich format for advanced, document-native grounding.
Setup

- 👉 For best conversion speed, use GPU acceleration whenever available; e.g. if running on Colab, use a GPU-enabled runtime.
- The notebook uses HuggingFace's Inference API; for increased LLM quota, a token can be provided via the HF_TOKEN environment variable (one way of providing it is sketched right after this list).
- Requirements can be installed as shown below (--no-warn-conflicts is meant for Colab's pre-populated Python environment; feel free to remove it for stricter usage):
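For instance, a minimal sketch (not part of the original cells; the token value below is a placeholder, not a real credential) of providing the token either in-process or via a local .env file that the load_dotenv() call further down will pick up:

```python
import os

# Option 1: set the variable in-process before running the cells below.
# "hf_your_token_here" is a placeholder, not a real token.
os.environ["HF_TOKEN"] = "hf_your_token_here"

# Option 2: put the equivalent line into a local .env file, e.g.
#   HF_TOKEN=hf_your_token_here
# load_dotenv() in the setup cell below will then load it automatically.
```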
In [1]
%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv
Note: you may need to restart the kernel to use updated packages.
In [2]
import os
from pathlib import Path
from tempfile import mkdtemp
from warnings import filterwarnings
from dotenv import load_dotenv
def _get_env_from_colab_or_os(key):
    try:
        from google.colab import userdata

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass
    except ImportError:
        pass
    return os.getenv(key)

load_dotenv()
filterwarnings(action="ignore", category=UserWarning, module="pydantic")
filterwarnings(action="ignore", category=FutureWarning, module="easyocr")
# https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
We can now define the main parameters:
In [3]
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
EMBED_MODEL = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
MILVUS_URI = str(Path(mkdtemp()) / "docling.db")
GEN_MODEL = HuggingFaceInferenceAPI(
    token=_get_env_from_colab_or_os("HF_TOKEN"),
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1",
)
SOURCE = "https://arxiv.org/pdf/2408.09869" # Docling Technical Report
QUERY = "Which are the main AI models in Docling?"
embed_dim = len(EMBED_MODEL.get_text_embedding("hi"))
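As an optional sanity check (not part of the original cells), the derived embedding dimension can be printed; BAAI/bge-small-en-v1.5 produces 384-dimensional embeddings:

```python
# Optional sanity check: bge-small-en-v1.5 yields 384-dimensional vectors.
print(embed_dim)  # expected: 384
```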
Using Markdown export
To create a simple RAG pipeline, we can:

- define a DoclingReader, which by default exports to Markdown, and
- use a standard node parser for these Markdown-based docs, e.g. a MarkdownNodeParser
In [4]
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.readers.docling import DoclingReader
from llama_index.vector_stores.milvus import MilvusVectorStore
reader = DoclingReader()
node_parser = MarkdownNodeParser()
vector_store = MilvusVectorStore(
    uri=str(Path(mkdtemp()) / "docling.db"),  # or set as needed
    dim=embed_dim,
    overwrite=True,
)
index = VectorStoreIndex.from_documents(
    documents=reader.load_data(SOURCE),
    transformations=[node_parser],
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    embed_model=EMBED_MODEL,
)
result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)
print(f"Q: {QUERY}\nA: {result.response.strip()}\n\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])
Q: Which are the main AI models in Docling? A: The main AI models in Docling are a layout analysis model, which is an accurate object-detector for page elements, and TableFormer, a state-of-the-art table structure recognition model. Sources:
[('3.2 AI models\n\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.', {'Header_2': '3.2 AI models'}), ("5 Applications\n\nThanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented generation (RAG), we provide quackling , an open-source package which capitalizes on Docling's feature-rich document output to enable document-native optimized vector embedding and chunking. It plugs in seamlessly with LLM frameworks such as LlamaIndex [8]. Since Docling is fast, stable and cheap to run, it also makes for an excellent choice to build document-derived datasets. With its powerful table structure recognition, it provides significant benefit to automated knowledge-base construction [11, 10]. Docling is also integrated within the open IBM data prep kit [6], which implements scalable data transforms to build large-scale multi-modal training datasets.", {'Header_2': '5 Applications'})]
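As a small follow-up sketch (assuming the result object produced by the cell above), the section headings attached by MarkdownNodeParser can be listed from the retrieved nodes' Header_* metadata:

```python
# List the Markdown section headings the retrieved chunks came from
# (e.g. "3.2 AI models" and "5 Applications" in the run shown above).
for node in result.source_nodes:
    headings = [v for k, v in node.metadata.items() if k.startswith("Header")]
    print(headings)
```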
Using Docling format
To leverage Docling's rich native format, we:

- create a DoclingReader with JSON export type, and
- employ a DoclingNodeParser in order to appropriately parse that Docling format.

Notice how the sources now also contain document-level grounding information (e.g. page number or bounding box information):
In [5]
from llama_index.node_parser.docling import DoclingNodeParser
reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
node_parser = DoclingNodeParser()
vector_store = MilvusVectorStore(
    uri=str(Path(mkdtemp()) / "docling.db"),  # or set as needed
    dim=embed_dim,
    overwrite=True,
)
index = VectorStoreIndex.from_documents(
    documents=reader.load_data(SOURCE),
    transformations=[node_parser],
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    embed_model=EMBED_MODEL,
)
result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)
print(f"Q: {QUERY}\nA: {result.response.strip()}\n\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])
Q: Which are the main AI models in Docling? A: The main AI models in Docling are a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, and TableFormer is a state-of-the-art table structure recognition model. Sources:
[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.', {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/34', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 107.07593536376953, 't': 406.1695251464844, 'r': 504.1148681640625, 'b': 330.2677307128906, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}), ('With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.', {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/9', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 107.0031967163086, 't': 136.7283935546875, 'r': 504.04998779296875, 'b': 83.30133056640625, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 488]}]}], 'headings': ['1 Introduction'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}})]
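As a minimal sketch (again assuming the result object from the cell above), the document-native grounding mentioned earlier can be read directly from the node metadata, e.g. the page number and bounding box of each provenance entry:

```python
# Print page number and bounding box for each provenance entry of the retrieved
# chunks, following the doc_items/prov structure visible in the metadata above.
for node in result.source_nodes:
    for item in node.metadata.get("doc_items", []):
        for prov in item.get("prov", []):
            print(f"page {prov['page_no']}, bbox: {prov['bbox']}")
```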
Using Simple Directory Reader
To demonstrate this usage pattern, we first set up a test document directory.
In [6]
from pathlib import Path
from tempfile import mkdtemp
import requests
tmp_dir_path = Path(mkdtemp())
r = requests.get(SOURCE)
with open(tmp_dir_path / f"{Path(SOURCE).name}.pdf", "wb") as out_file:
    out_file.write(r.content)
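Optionally (a minimal check, not part of the original cells), we can confirm that the test directory now contains the downloaded PDF:

```python
# The directory should contain a single file, e.g. "2408.09869.pdf".
print(list(tmp_dir_path.iterdir()))
```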
Using the reader and node_parser definitions from either of the above variants, usage with SimpleDirectoryReader then looks as follows:
In [7]
from llama_index.core import SimpleDirectoryReader
dir_reader = SimpleDirectoryReader(
    input_dir=tmp_dir_path,
    file_extractor={".pdf": reader},
)
vector_store = MilvusVectorStore(
    uri=str(Path(mkdtemp()) / "docling.db"),  # or set as needed
    dim=embed_dim,
    overwrite=True,
)
index = VectorStoreIndex.from_documents(
    documents=dir_reader.load_data(SOURCE),
    transformations=[node_parser],
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    embed_model=EMBED_MODEL,
)
result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)
print(f"Q: {QUERY}\nA: {result.response.strip()}\n\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])
Loading files: 100%|██████████| 1/1 [00:11<00:00, 11.27s/file]
Q: Which are the main AI models in Docling? A: 1. A layout analysis model, an accurate object-detector for page elements. 2. TableFormer, a state-of-the-art table structure recognition model. Sources:
[('As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.', {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmp2ooyusg5/2408.09869.pdf', 'file_name': '2408.09869.pdf', 'file_type': 'application/pdf', 'file_size': 5566574, 'creation_date': '2024-10-28', 'last_modified_date': '2024-10-28', 'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/34', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 107.07593536376953, 't': 406.1695251464844, 'r': 504.1148681640625, 'b': 330.2677307128906, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869.pdf'}}), ('With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.', {'file_path': '/var/folders/76/4wwfs06x6835kcwj4186c0nc0000gn/T/tmp2ooyusg5/2408.09869.pdf', 'file_name': '2408.09869.pdf', 'file_type': 'application/pdf', 'file_size': 5566574, 'creation_date': '2024-10-28', 'last_modified_date': '2024-10-28', 'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/9', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 107.0031967163086, 't': 136.7283935546875, 'r': 504.04998779296875, 'b': 83.30133056640625, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 488]}]}], 'headings': ['1 Introduction'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869.pdf'}})]