Conversion of custom XML¶
Step | Tech | Execution |
---|---|---|
Embedding | Hugging Face / Sentence Transformers | 💻 Local |
Vector store | Milvus | 💻 Local |
Gen AI | Hugging Face Inference API | 🌐 Remote |
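Overview¶
This is an example of using Docling to convert structured data (XML) into the unified document representation format, DoclingDocument, and of leveraging its rich structured content in RAG applications.
The data used in this example consist of patents from the United States Patent and Trademark Office (USPTO) and medical articles from PubMed Central® (PMC).
In this notebook, we accomplish the following:
- Simple conversion of supported XML files, in a nutshell
- An end-to-end application using public collections of XML files supported by Docling:
  - Set up the API access for generative AI
  - Fetch the data from the USPTO and PubMed Central® sites, using Docling custom backends
  - Parse, chunk, and index the documents in a vector database
  - Perform RAG using the LlamaIndex Docling extension
For more details on document chunking with Docling, refer to the Chunking documentation. For RAG with Docling and LlamaIndex, also check the example RAG with LlamaIndex.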
Simple conversion¶
from docling.document_converter import DocumentConverter
# a sample PMC article:
source = "../../tests/data/jats/elife-56337.nxml"
converter = DocumentConverter()
result = converter.convert(source)
print(result.status)
ConversionStatus.SUCCESS
Once the document is converted, it can be exported to any format supported by Docling. For instance, to Markdown (only the first lines are shown here):
md_doc = result.document.export_to_markdown()
delim = "\n"
print(delim.join(md_doc.split(delim)[:8]))
# KRAB-zinc finger protein gene expansion in response to active retrotransposons in the murine lineage

Gernot Wolf, Alberto de Iaco, Ming-An Sun, Melania Bruno, Matthew Tinkham, Don Hoang, Apratim Mitra, Sherry Ralls, Didier Trono, Todd S Macfarlan

The Eunice Kennedy Shriver National Institute of Child Health and Human Development, The National Institutes of Health, Bethesda, United States; School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland

## Abstract
If the XML file is not supported, a ConversionError is raised.
from io import BytesIO
from docling.datamodel.base_models import DocumentStream
from docling.exceptions import ConversionError
xml_content = (
    b'<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE docling_test SYSTEM '
    b'"test.dtd"><docling>Random content</docling>'
)
stream = DocumentStream(name="docling_test.xml", stream=BytesIO(xml_content))
try:
    result = converter.convert(stream)
except ConversionError as ce:
    print(ce)
Input document docling_test.xml does not match any allowed format.
File format not allowed: docling_test.xml
You can always refer to the Usage documentation page for the list of supported formats.
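To make the failure above more concrete, here is a minimal sketch of how an XML flavor could be guessed from the document type declaration. It is not Docling's actual detection logic: the function `guess_xml_flavor` and the mapping `KNOWN_MARKERS` are hypothetical names chosen for illustration, and a real detector would likely look at more than the DOCTYPE.

```python
import re
from typing import Optional

# Hypothetical mapping from DOCTYPE root names to format labels (illustration only).
KNOWN_MARKERS = {
    "us-patent-grant": "uspto",
    "article": "jats",
}

# Captures the root element name that follows "<!DOCTYPE".
DOCTYPE_RE = re.compile(rb"<!DOCTYPE\s+([\w.-]+)")


def guess_xml_flavor(data: bytes) -> Optional[str]:
    """Return a format label based on the DOCTYPE root name, or None if unknown."""
    match = DOCTYPE_RE.search(data[:2048])  # only inspect the head of the file
    if match is None:
        return None
    root_name = match.group(1).decode("ascii", errors="ignore")
    return KNOWN_MARKERS.get(root_name)


unsupported = (
    b'<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE docling_test SYSTEM '
    b'"test.dtd"><docling>Random content</docling>'
)
patent_like = (
    b'<?xml version="1.0"?><!DOCTYPE us-patent-grant SYSTEM "x.dtd">'
    b"<us-patent-grant/>"
)

print(guess_xml_flavor(unsupported))
print(guess_xml_flavor(patent_like))
```

With this toy rule, the `docling_test` document from the previous cell is not recognized, which mirrors the "does not match any allowed format" message.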
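End-to-end application¶
This section walks through a step-by-step application that processes XML files from supported public collections and uses them for question answering.
Setup¶
Dependencies can be installed as shown below. The --no-warn-conflicts argument is meant for Colab's pre-populated Python environment; feel free to remove it for stricter usage.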
%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv
Note: you may need to restart the kernel to use updated packages.
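This notebook uses the Hugging Face Inference API. To increase the LLM quota, a token can be provided via the environment variable HF_TOKEN.
If you are running this notebook in Google Colab, make sure to add your API key as a secret.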
import os
from warnings import filterwarnings
from dotenv import load_dotenv
def _get_env_from_colab_or_os(key):
    try:
        from google.colab import userdata

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass
    except ImportError:
        pass
    return os.getenv(key)
load_dotenv()
filterwarnings(action="ignore", category=UserWarning, module="pydantic")
We can now define the main parameters:
from pathlib import Path
from tempfile import mkdtemp
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
EMBED_MODEL_ID = "BAAI/bge-small-en-v1.5"
EMBED_MODEL = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID)
TEMP_DIR = Path(mkdtemp())
MILVUS_URI = str(TEMP_DIR / "docling.db")
GEN_MODEL = HuggingFaceInferenceAPI(
    token=_get_env_from_colab_or_os("HF_TOKEN"),
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1",
)
embed_dim = len(EMBED_MODEL.get_text_embedding("hi"))
# https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
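Fetch the data¶
In this notebook we use XML data from collections supported by Docling:
- Medical articles from PubMed Central® (PMC). They are available on an FTP server as .tar.gz files. Each file contains the full article data in XML format, along with supplementary files such as images or spreadsheets.
- Patents from the United States Patent and Trademark Office. They are available in the Bulk Data Storage System (BDSS) as zip files. Each zip file may contain several patents in XML format.
The raw files are downloaded from the source and saved in a temporary directory.
PMC articles¶
The OA file is a manifest of all the PMC articles, including the URL path for downloading the source files. In this notebook we use as an example the article Pathogens spread by high-altitude windborne mosquitoes, which is available in the archive file PMC11703268.tar.gz.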
import tarfile
from io import BytesIO
import requests
# PMC article PMC11703268
url: str = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz"
print(f"Downloading {url}...")
buf = BytesIO(requests.get(url).content)
print("Extracting and storing the XML file containing the article text...")
with tarfile.open(fileobj=buf, mode="r:gz") as tar_file:
    for tarinfo in tar_file:
        if tarinfo.isreg():
            file_path = Path(tarinfo.name)
            if file_path.suffix == ".nxml":
                with open(TEMP_DIR / file_path.name, "wb") as file_obj:
                    file_obj.write(tar_file.extractfile(tarinfo).read())
                print(f"Stored XML file {file_path.name}")
Downloading https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz...
Extracting and storing the XML file containing the article text...
Stored XML file nihpp-2024.12.26.630351v1.nxml
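The extraction step can also be tried offline. The sketch below builds a tiny synthetic .tar.gz archive in memory and pulls out its .nxml members with the same `tarfile` calls as above. The helper names `extract_nxml_members` and `build_sample_archive`, as well as the file names and contents inside the archive, are made up for illustration.

```python
import io
import tarfile
from pathlib import Path
from typing import Dict


def extract_nxml_members(archive_bytes: bytes) -> Dict[str, bytes]:
    """Return {file name: content} for every regular .nxml member of a .tar.gz archive."""
    found: Dict[str, bytes] = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar_file:
        for member in tar_file:
            if member.isreg() and Path(member.name).suffix == ".nxml":
                extracted = tar_file.extractfile(member)
                if extracted is not None:
                    found[Path(member.name).name] = extracted.read()
    return found


def build_sample_archive() -> bytes:
    """Build a synthetic archive that loosely mimics the layout of a PMC package."""
    files = {
        "PMC0000000/article.nxml": b"<article>toy content</article>",
        "PMC0000000/figure1.jpg": b"not a real image",
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar_file:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar_file.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


articles = extract_nxml_members(build_sample_archive())
print(sorted(articles))
```

Returning the contents in a dictionary instead of writing files keeps the sketch free of side effects; the notebook cell writes each member to TEMP_DIR instead.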
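USPTO patents¶
Since each USPTO file is a concatenation of several patents, we need to split its content into valid XML pieces. The following code downloads a sample zip file, splits its content into sections, and dumps each section as an XML file. For simplicity, this pipeline is shown here sequentially, but it could be parallelized.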
import zipfile
# Patent grants from December 17-23, 2024
url: str = (
    "https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip"
)
XML_SPLITTER: str = '<?xml version="1.0"'
doc_num: int = 0
print(f"Downloading {url}...")
buf = BytesIO(requests.get(url).content)
print("Parsing zip file, splitting into XML sections, and exporting to files...")
with zipfile.ZipFile(buf) as zf:
    res = zf.testzip()
    if res:
        print("Error validating zip file")
    else:
        with zf.open(zf.namelist()[0]) as xf:
            is_patent = False
            patent_buffer = BytesIO()
            for xf_line in xf:
                decoded_line = xf_line.decode(errors="ignore").rstrip()
                xml_index = decoded_line.find(XML_SPLITTER)
                if xml_index != -1:
                    if (
                        xml_index > 0
                    ):  # cases like </sequence-cwu><?xml version="1.0"...
                        patent_buffer.write(xf_line[:xml_index])
                        patent_buffer.write(b"\r\n")
                        xf_line = xf_line[xml_index:]
                    if patent_buffer.getbuffer().nbytes > 0 and is_patent:
                        doc_num += 1
                        patent_id = f"ipg241217-{doc_num}"
                        with open(TEMP_DIR / f"{patent_id}.xml", "wb") as file_obj:
                            file_obj.write(patent_buffer.getbuffer())
                    is_patent = False
                    patent_buffer = BytesIO()
                elif decoded_line.startswith("<!DOCTYPE"):
                    is_patent = True
                patent_buffer.write(xf_line)
Downloading https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip...
Parsing zip file, splitting into XML sections, and exporting to files...
print(f"Fetched and exported {doc_num} documents.")
Fetched and exported 4014 documents.
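The splitting idea can be illustrated on a small synthetic input. The sketch below is a simplified in-memory variant, not the streaming code above: it cuts a byte string at every XML declaration, including one that follows a closing tag on the same line. The function names, the sample data, and the rule in `is_patent_section` (a `us-patent-grant` DOCTYPE marks a patent) are assumptions made for this example.

```python
from typing import List

XML_DECLARATION = b'<?xml version="1.0"'


def split_concatenated_xml(data: bytes) -> List[bytes]:
    """Split a byte string holding several XML documents at each XML declaration."""
    starts = []
    position = data.find(XML_DECLARATION)
    while position != -1:
        starts.append(position)
        position = data.find(XML_DECLARATION, position + 1)
    # Each document runs from one declaration to the next (or to the end of the data).
    ends = starts[1:] + [len(data)]
    return [data[start:end].strip() for start, end in zip(starts, ends)]


def is_patent_section(section: bytes) -> bool:
    """Assumption for this sketch: a patent section declares a us-patent-grant DOCTYPE."""
    return b"<!DOCTYPE us-patent-grant" in section


sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE us-patent-grant SYSTEM "grant.dtd">\n'
    b"<us-patent-grant>first</us-patent-grant>\n"
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE sequence-cwu SYSTEM "seq.dtd">\n'
    b'<sequence-cwu>listing</sequence-cwu><?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE us-patent-grant SYSTEM "grant.dtd">\n'
    b"<us-patent-grant>second</us-patent-grant>\n"
)

sections = split_concatenated_xml(sample)
patents = [section for section in sections if is_patent_section(section)]
print(len(sections), len(patents))
```

Loading the whole file into memory is acceptable for a toy input; for the real bulk file, a line-by-line pass like the one in the notebook cell avoids holding every section at once.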
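Using the backend converter (optional)¶
- The custom backend converters JatsDocumentBackend and PatentUsptoDocumentBackend handle the parsing of PMC articles and USPTO patents, respectively.
- As with any other backend, you can check whether an input document is supported by a backend by calling its is_valid() method.
- Note that some XML sections in the original USPTO zip file may not represent patents, such as sequence listings, so the backend reports them as invalid.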
from tqdm.notebook import tqdm
from docling.backend.xml.jats_backend import JatsDocumentBackend
from docling.backend.xml.uspto_backend import PatentUsptoDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument
# check PMC
in_doc = InputDocument(
    path_or_stream=TEMP_DIR / "nihpp-2024.12.26.630351v1.nxml",
    format=InputFormat.XML_JATS,
    backend=JatsDocumentBackend,
)
backend = JatsDocumentBackend(
    in_doc=in_doc, path_or_stream=TEMP_DIR / "nihpp-2024.12.26.630351v1.nxml"
)
print(f"Document {in_doc.file.name} is a valid PMC article? {backend.is_valid()}")
# check USPTO
in_doc = InputDocument(
    path_or_stream=TEMP_DIR / "ipg241217-1.xml",
    format=InputFormat.XML_USPTO,
    backend=PatentUsptoDocumentBackend,
)
backend = PatentUsptoDocumentBackend(
    in_doc=in_doc, path_or_stream=TEMP_DIR / "ipg241217-1.xml"
)
print(f"Document {in_doc.file.name} is a valid patent? {backend.is_valid()}")
patent_valid = 0
pbar = tqdm(TEMP_DIR.glob("*.xml"), total=doc_num)
for in_path in pbar:
    in_doc = InputDocument(
        path_or_stream=in_path,
        format=InputFormat.XML_USPTO,
        backend=PatentUsptoDocumentBackend,
    )
    backend = PatentUsptoDocumentBackend(in_doc=in_doc, path_or_stream=in_path)
    patent_valid += int(backend.is_valid())
print(f"Found {patent_valid} patents out of {doc_num} XML files.")
Document nihpp-2024.12.26.630351v1.nxml is a valid PMC article? True
Document ipg241217-1.xml is a valid patent? True
Found 3928 patents out of 4014 XML files.
Calling the convert() method converts the input document into a DoclingDocument:
doc = backend.convert()
claims_sec = next(item for item in doc.texts if item.text == "CLAIMS")
print(f'Patent "{doc.texts[0].text}" has {len(claims_sec.children)} claims')
Patent "Semiconductor package" has 19 claims
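✏️ Tip: in general, there is no need to use the backend converters to parse USPTO or JATS (PubMed) XML files. The generic DocumentConverter object tries to guess the input document format and applies the corresponding backend parser. The conversion shown in Simple conversion is the recommended usage for the supported XML files.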
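Parse, chunk, and index¶
The DoclingDocument format of the converted patents has a rich hierarchical structure, inherited from the original XML document and preserved by the Docling custom backend. In this notebook, we leverage:
- The SimpleDirectoryReader pattern, to iterate over the exported XML files created in the Fetch the data section.
- The LlamaIndex extensions DoclingReader and DoclingNodeParser, to ingest the patent chunks into a Milvus vector store.
- The HierarchicalChunker implementation, which applies document-based hierarchical chunking to take advantage of the patent structure, such as sections and the paragraphs within them.
Refer to the Chunking documentation and the RAG with LlamaIndex notebook for other possible implementations and usage patterns.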
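As an intuition for what hierarchical chunking means, here is a toy sketch that walks a small section tree and emits one chunk per paragraph, each tagged with the headings that enclose it. This is not Docling's HierarchicalChunker; the `Section` class, the `hierarchical_chunks` function, and the paragraph texts are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Section:
    """A toy document node: a heading, its paragraphs, and nested subsections."""

    heading: str
    paragraphs: List[str] = field(default_factory=list)
    children: List["Section"] = field(default_factory=list)


def hierarchical_chunks(
    section: Section, path: Tuple[str, ...] = ()
) -> Iterator[dict]:
    """Yield one chunk per paragraph, tagged with the headings that enclose it."""
    current_path = path + (section.heading,)
    for paragraph in section.paragraphs:
        yield {"headings": list(current_path), "text": paragraph}
    for child in section.children:
        yield from hierarchical_chunks(child, current_path)


# Invented content, shaped like a patent with an abstract and a claims section.
patent = Section(
    heading="Semiconductor package",
    children=[
        Section(heading="ABSTRACT", paragraphs=["A package includes a substrate."]),
        Section(
            heading="CLAIMS",
            paragraphs=["1. A semiconductor package comprising a die."],
        ),
    ],
)

chunks = list(hierarchical_chunks(patent))
for chunk in chunks:
    print(" > ".join(chunk["headings"]), "|", chunk["text"])
```

Keeping the heading path next to each chunk is what lets a retriever later tell a claim apart from a sentence in the abstract, even when the two texts are similar.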
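Set the Docling reader and the directory reader¶
Note that DoclingReader uses Docling's DocumentConverter by default, so it recognizes the format of the XML files automatically and leverages the PatentUsptoDocumentBackend.
For demonstration purposes, we limit the scope of the analysis to the first 100 patents.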
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.docling import DoclingReader
reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
dir_reader = SimpleDirectoryReader(
    input_dir=TEMP_DIR,
    exclude=["docling.db", "*.nxml"],
    file_extractor={".xml": reader},
    filename_as_id=True,
    num_files_limit=100,
)
Set the node parser¶
Note that the HierarchicalChunker is the default chunking implementation of the DoclingNodeParser.
from llama_index.node_parser.docling import DoclingNodeParser
node_parser = DoclingNodeParser()
Set a local Milvus database and run the ingestion¶
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore
vector_store = MilvusVectorStore(
    uri=MILVUS_URI,
    dim=embed_dim,
    overwrite=True,
)
index = VectorStoreIndex.from_documents(
    documents=dir_reader.load_data(show_progress=True),
    transformations=[node_parser],
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    embed_model=EMBED_MODEL,
    show_progress=True,
)
Finally, add the PMC article to the vector store directly from the reader.
index.from_documents(
    documents=reader.load_data(TEMP_DIR / "nihpp-2024.12.26.630351v1.nxml"),
    transformations=[node_parser],
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    embed_model=EMBED_MODEL,
)
<llama_index.core.indices.vector_store.base.VectorStoreIndex at 0x373a7f7d0>
Question-answering with RAG¶
The retriever can be used to identify highly relevant documents:
retriever = index.as_retriever(similarity_top_k=3)
results = retriever.retrieve("What patents are related to fitness devices?")
for item in results:
    print(item)
Node ID: 5afd36c0-a739-4a88-a51c-6d0f75358db5
Text: The portable fitness monitoring device 102 may be a device such as, for example, a mobile phone, a personal digital assistant, a music file player (e.g. and MP3 player), an intelligent article for wearing (e.g. a fitness monitoring garment, wrist band, or watch), a dongle (e.g. a small hardware device that protects software) that includes a fitn...
Score: 0.772

Node ID: f294b5fd-9089-43cb-8c4e-d1095a634ff1
Text: US Patent Application US 20120071306 entitled “Portable Multipurpose Whole Body Exercise Device” discloses a portable multipurpose whole body exercise device which can be used for general fitness, Pilates-type, core strengthening, therapeutic, and rehabilitative exercises as well as stretching and physical therapy and which includes storable acc...
Score: 0.749

Node ID: 8251c7ef-1165-42e1-8c91-c99c8a711bf7
Text: Program products, methods, and systems for providing fitness monitoring services of the present invention can include any software application executed by one or more computing devices. A computing device can be any type of computing device having one or more processors. For example, a computing device can be a workstation, mobile device (e.g., ...
Score: 0.744
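The scores above come from comparing the query embedding with the embeddings of the stored chunks. As a rough illustration, the sketch below ranks a few hand-written vectors by cosine similarity and keeps the top k. The vectors, the node names, and the choice of cosine similarity are assumptions made for this example; the metric actually used depends on how the vector store is configured.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    node_vecs: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k node ids most similar to the query, best first."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hand-written 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_vectors = {
    "fitness-monitor": [0.9, 0.1, 0.0],
    "exercise-device": [0.7, 0.3, 0.1],
    "semiconductor": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

ranking = top_k(toy_query, toy_vectors, k=2)
for node_id, score in ranking:
    print(f"{node_id}: {score:.3f}")
```

A vector database performs the same ranking with an approximate index rather than a full scan, which is what makes it practical for large collections.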
With the query engine, we can run question answering with the RAG pattern on the set of indexed documents.
First, we can prompt the LLM directly:
from llama_index.core.base.llms.types import ChatMessage, MessageRole
from rich.console import Console
from rich.panel import Panel
console = Console()
query = "Do mosquitoes in high altitude expand viruses over large distances?"
usr_msg = ChatMessage(role=MessageRole.USER, content=query)
response = GEN_MODEL.chat(messages=[usr_msg])
console.print(Panel(query, title="Prompt", border_style="bold red"))
console.print(
    Panel(
        response.message.content.strip(),
        title="Generated Content",
        border_style="bold green",
    )
)
Prompt:
Do mosquitoes in high altitude expand viruses over large distances?

Generated Content:
Mosquitoes can be found at high altitudes, but their ability to transmit viruses over long distances is not primarily dependent on altitude. Mosquitoes are vectors for various diseases, such as malaria, dengue fever, and Zika virus, and their transmission range is more closely related to their movement, the presence of a host, and environmental conditions that support their survival and reproduction.

At high altitudes, the environment can be less suitable for mosquitoes due to factors such as colder temperatures, lower humidity, and stronger winds, which can limit their population size and distribution. However, some species of mosquitoes have adapted to high-altitude environments and can still transmit diseases in these areas.

It is possible for mosquitoes to be transported by wind or human activities to higher altitudes, but this is not a significant factor in their ability to transmit viruses over long distances. Instead, long-distance transmission of viruses is more often associated with human travel and transportation, which can rapidly spread infected mosquitoes or humans to new areas, leading to the spread of disease.
Now we can compare the response when the model is prompted with the indexed PMC article as supporting context:
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
filters = MetadataFilters(
    filters=[
        ExactMatchFilter(key="filename", value="nihpp-2024.12.26.630351v1.nxml"),
    ]
)
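# The retriever reads the keyword `filters`; check that `filename` is a metadata key in your index.
query_engine = index.as_query_engine(llm=GEN_MODEL, filters=filters, similarity_top_k=3)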
result = query_engine.query(query)
console.print(
    Panel(
        result.response.strip(),
        title="Generated Content with RAG",
        border_style="bold green",
    )
)
Generated Content with RAG:
Yes, mosquitoes in high altitude can expand viruses over large distances. A study intercepted 1,017 female mosquitoes at altitudes of 120-290 m above ground over Mali and Ghana and screened them for infection with arboviruses, plasmodia, and filariae. The study found that 3.5% of the mosquitoes were infected with flaviviruses, and 1.1% were infectious. Additionally, the study identified 19 mosquito-borne pathogens, including three arboviruses that affect humans (dengue, West Nile, and M’Poko viruses). The study provides compelling evidence that mosquito-borne pathogens are often spread by windborne mosquitoes at altitude.
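The difference between the two answers comes from the context passed to the model. The sketch below mimics, in plain Python, the two steps that precede generation: keeping only the nodes whose metadata matches a given file name, and assembling a prompt from their text. The helper names and the prompt template are hypothetical and are not the ones LlamaIndex uses; the node texts are shortened from the outputs shown earlier.

```python
from typing import List


def exact_match(nodes: List[dict], key: str, value: str) -> List[dict]:
    """Keep only the nodes whose metadata holds exactly `value` under `key`."""
    return [node for node in nodes if node["metadata"].get(key) == value]


def build_prompt(question: str, context_nodes: List[dict]) -> str:
    """Assemble a prompt from retrieved text; the template is hypothetical."""
    context = "\n".join(f"- {node['text']}" for node in context_nodes)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


toy_nodes = [
    {
        "text": (
            "A study intercepted 1,017 female mosquitoes at altitudes of "
            "120-290 m above ground over Mali and Ghana."
        ),
        "metadata": {"filename": "nihpp-2024.12.26.630351v1.nxml"},
    },
    {
        "text": (
            "The portable fitness monitoring device 102 may be a device "
            "such as, for example, a mobile phone."
        ),
        "metadata": {"filename": "ipg241217-1.xml"},
    },
]

selected = exact_match(toy_nodes, "filename", "nihpp-2024.12.26.630351v1.nxml")
prompt = build_prompt(
    "Do mosquitoes in high altitude expand viruses over large distances?",
    selected,
)
print(prompt)
```

Filtering before ranking guarantees that unrelated patents cannot crowd the article out of the top results, at the cost of ignoring any other document that might also be relevant.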