Conversion of custom XML¶

Step	Tech	Execution
Embedding	Hugging Face / Sentence Transformers	💻 Local
Vector store	Milvus	💻 Local
Gen AI	Hugging Face Inference API	🌐 Remote

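Overview¶

This is an example of using Docling to convert structured data (XML) into a unified document representation format, DoclingDocument, and of leveraging its rich structured content for RAG applications.

The data used in this example consists of patents from the United States Patent and Trademark Office (USPTO) and medical articles from PubMed Central® (PMC).

In this notebook, we accomplish the following:

Simple conversion of supported XML files, in a nutshell
An end-to-end application using public collections of XML files supported by Docling
- Set up the API access for generative AI
- Fetch the data from the USPTO and PubMed Central® sites, using Docling custom backends
- Parse, chunk, and index the documents in a vector database
- Perform RAG using the LlamaIndex Docling extension

For more details on document chunking with Docling, refer to the Chunking documentation. For RAG with Docling and LlamaIndex, also check the example RAG with LlamaIndex.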

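Simple conversion¶

The XML file format defines and stores data in a way that is both human-readable and machine-readable. Because of this flexibility, Docling requires custom backend processors to interpret XML definitions and convert them into DoclingDocument objects.

Some public data collections in XML format are already supported by Docling (USPTO patents and PMC articles). In these cases, the document conversion is straightforward and works the same as with any other supported format, such as PDF or HTML. The example below follows the usage Docling recommends for a single file: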

In [1]:

from docling.document_converter import DocumentConverter

# a sample PMC article:
source = "../../tests/data/jats/elife-56337.nxml"
converter = DocumentConverter()
result = converter.convert(source)
print(result.status)

ConversionStatus.SUCCESS

Once the document is converted, it can be exported to any format supported by Docling. For instance, to Markdown (only the first lines are shown here):

In [2]:

md_doc = result.document.export_to_markdown()

delim = "\n"
print(delim.join(md_doc.split(delim)[:8]))

# KRAB-zinc finger protein gene expansion in response to active retrotransposons in the murine lineage

Gernot Wolf, Alberto de Iaco, Ming-An Sun, Melania Bruno, Matthew Tinkham, Don Hoang, Apratim Mitra, Sherry Ralls, Didier Trono, Todd S Macfarlan

The Eunice Kennedy Shriver National Institute of Child Health and Human Development, The National Institutes of Health, Bethesda, United States; School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland

## Abstract

If the XML file is not supported, a ConversionError is raised.

In [3]:

from io import BytesIO

from docling.datamodel.base_models import DocumentStream
from docling.exceptions import ConversionError

xml_content = (
    b'<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE docling_test SYSTEM '
    b'"test.dtd"><docling>Random content</docling>'
)
stream = DocumentStream(name="docling_test.xml", stream=BytesIO(xml_content))
try:
    result = converter.convert(stream)
except ConversionError as ce:
    print(ce)

Input document docling_test.xml does not match any allowed format.

File format not allowed: docling_test.xml

You can always refer to the Usage documentation page for the list of supported formats.
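
To build intuition for why docling_test.xml is rejected while PMC and USPTO files are accepted, here is a minimal sketch of format sniffing based on the DOCTYPE declaration. It is not Docling's actual detection logic: the function name guess_xml_flavor, the returned labels, and the DTD names in the two positive samples are assumptions made purely for illustration.

```python
import re
from typing import Optional

# Capture the DOCTYPE root name and whatever follows it (public/system identifiers).
DOCTYPE_RE = re.compile(rb"<!DOCTYPE\s+([^\s>\[]+)([^>\[]*)", re.IGNORECASE)


def guess_xml_flavor(content: bytes) -> Optional[str]:
    """Return a rough label ("uspto" or "jats") or None when nothing matches."""
    match = DOCTYPE_RE.search(content[:4096])  # the declaration sits near the top
    if match is None:
        return None
    root_name = match.group(1).decode(errors="ignore")
    identifiers = match.group(2).decode(errors="ignore")
    if root_name.startswith("us-patent"):
        return "uspto"
    if root_name == "article" and "JATS" in identifiers:
        return "jats"
    return None


# The unsupported document from the cell above.
unsupported = (
    b'<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE docling_test SYSTEM '
    b'"test.dtd"><docling>Random content</docling>'
)
# Illustrative declarations only; the DTD names are assumptions.
patent_like = b'<?xml version="1.0"?>\n<!DOCTYPE us-patent-grant SYSTEM "grant.dtd" [ ]>'
article_like = (
    b'<?xml version="1.0"?>\n<!DOCTYPE article PUBLIC '
    b'"-//NLM//DTD JATS (Z39.96) Journal Archiving DTD//EN" "JATS-archive.dtd">'
)

flavors = [guess_xml_flavor(doc) for doc in (unsupported, patent_like, article_like)]
print(flavors)
```

A heuristic like this only inspects the declaration; a real backend also has to understand the element vocabulary defined by the DTD, which is why unsupported dialects need a dedicated backend rather than a looser check.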

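End-to-end application¶

This section walks through a step-by-step application that processes XML files from the supported public collections and uses them for question answering.

Setup¶

The requirements can be installed as shown below. The --no-warn-conflicts argument is meant for Colab's pre-populated Python environment; feel free to remove it for stricter usage.

In [4]: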

%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv

Note: you may need to restart the kernel to use updated packages.

This notebook uses HuggingFace's Inference API. For an increased LLM quota, a token can be provided via the environment variable HF_TOKEN.

If you are running this notebook in Google Colab, make sure you add your API key as a secret.

In [5]:

import os
from warnings import filterwarnings

from dotenv import load_dotenv

def _get_env_from_colab_or_os(key):
    try:
        from google.colab import userdata

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass
    except ImportError:
        pass
    return os.getenv(key)

load_dotenv()

filterwarnings(action="ignore", category=UserWarning, module="pydantic")

We can now define the main parameters:

In [6]:

from pathlib import Path
from tempfile import mkdtemp

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI

EMBED_MODEL_ID = "BAAI/bge-small-en-v1.5"
EMBED_MODEL = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID)
TEMP_DIR = Path(mkdtemp())
MILVUS_URI = str(TEMP_DIR / "docling.db")
GEN_MODEL = HuggingFaceInferenceAPI(
    token=_get_env_from_colab_or_os("HF_TOKEN"),
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1",
)
embed_dim = len(EMBED_MODEL.get_text_embedding("hi"))
# https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

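Fetch the data¶

In this notebook we use XML data from the collections supported by Docling:

Medical articles from PubMed Central® (PMC). They are stored as .tar.gz files on an FTP server. Each file contains the full article data in XML format, along with supplementary files such as images or spreadsheets.
Patents from the United States Patent and Trademark Office (USPTO). They are stored as zip files in the Bulk Data Storage System (BDSS). Each zip file may contain several patents in XML format.

The raw files are downloaded from the source and saved in a temporary directory.

PMC articles¶

The OA file is a manifest file of all the PMC articles, including the URL path for downloading the source files. In this notebook we use as an example the article Pathogens spread by high-altitude windborne mosquitoes, which is available in the archive file PMC11703268.tar.gz.

In [7]: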

import tarfile
from io import BytesIO

import requests

# PMC article PMC11703268
url: str = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz"

print(f"Downloading {url}...")
buf = BytesIO(requests.get(url).content)
print("Extracting and storing the XML file containing the article text...")
with tarfile.open(fileobj=buf, mode="r:gz") as tar_file:
    for tarinfo in tar_file:
        if tarinfo.isreg():
            file_path = Path(tarinfo.name)
            if file_path.suffix == ".nxml":
                with open(TEMP_DIR / file_path.name, "wb") as file_obj:
                    file_obj.write(tar_file.extractfile(tarinfo).read())
                print(f"Stored XML file {file_path.name}")

Downloading https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz...
Extracting and storing the XML file containing the article text...
Stored XML file nihpp-2024.12.26.630351v1.nxml
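
The download URL used above can be derived from the OA manifest. The following is a minimal sketch of that lookup, assuming the manifest is a CSV whose rows pair a relative archive path with an accession ID. The column names, the one-row sample, and the helper find_package_url are assumptions for illustration; check the real manifest for its exact layout.

```python
import csv
from io import StringIO
from typing import Optional

PMC_BASE_URL = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/"

# Illustrative excerpt only: the column names are assumed, not verified.
SAMPLE_MANIFEST = """File,Accession ID
oa_package/e3/6b/PMC11703268.tar.gz,PMC11703268
"""


def find_package_url(manifest_text: str, accession_id: str) -> Optional[str]:
    """Return the archive URL for an accession ID, or None if it is not listed."""
    for row in csv.DictReader(StringIO(manifest_text)):
        if row["Accession ID"] == accession_id:
            return PMC_BASE_URL + row["File"]
    return None


package_url = find_package_url(SAMPLE_MANIFEST, "PMC11703268")
print(package_url)
```

With the real manifest, the same function could be fed the downloaded file contents instead of the in-memory sample.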

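USPTO patents¶

Since each USPTO file is a concatenation of several patents, we need to split its content into valid XML pieces. The following code downloads a sample zip file, splits its content into sections, and dumps each section as an XML file. For simplicity, the pipeline is shown here sequentially, but it could be parallelized.

In [8]: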

import zipfile

# Patent grants from December 17-23, 2024
url: str = (
    "https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip"
)
XML_SPLITTER: str = '<?xml version="1.0"'
doc_num: int = 0

print(f"Downloading {url}...")
buf = BytesIO(requests.get(url).content)
print("Parsing zip file, splitting into XML sections, and exporting to files...")
with zipfile.ZipFile(buf) as zf:
    res = zf.testzip()
    if res:
        print("Error validating zip file")
    else:
        with zf.open(zf.namelist()[0]) as xf:
            is_patent = False
            patent_buffer = BytesIO()
            for xf_line in xf:
                decoded_line = xf_line.decode(errors="ignore").rstrip()
                xml_index = decoded_line.find(XML_SPLITTER)
                if xml_index != -1:
                    if (
                        xml_index > 0
                    ):  # cases like </sequence-cwu><?xml version="1.0"...
                        patent_buffer.write(xf_line[:xml_index])
                        patent_buffer.write(b"\r\n")
                        xf_line = xf_line[xml_index:]
                    if patent_buffer.getbuffer().nbytes > 0 and is_patent:
                        doc_num += 1
                        patent_id = f"ipg241217-{doc_num}"
                        with open(TEMP_DIR / f"{patent_id}.xml", "wb") as file_obj:
                            file_obj.write(patent_buffer.getbuffer())
                    is_patent = False
                    patent_buffer = BytesIO()
                elif decoded_line.startswith("<!DOCTYPE"):
                    is_patent = True
                patent_buffer.write(xf_line)

Downloading https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip...
Parsing zip file, splitting into XML sections, and exporting to files...

In [9]:

print(f"Fetched and exported {doc_num} documents.")

Fetched and exported 4014 documents.
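
The splitting idea can be illustrated on a small in-memory sample. This is a simplified sketch, not the streaming logic of the cell above: it loads everything into memory and cuts at every occurrence of the XML declaration, including one that follows a closing tag on the same line. The helper name split_concatenated_xml and the sample documents are hypothetical.

```python
import re
from typing import List

XML_DECL = b'<?xml version="1.0"'


def split_concatenated_xml(data: bytes) -> List[bytes]:
    """Cut a byte string into pieces that each start with an XML declaration."""
    starts = [m.start() for m in re.finditer(re.escape(XML_DECL), data)]
    ends = starts[1:] + [len(data)]
    return [data[start:end].strip() for start, end in zip(starts, ends)]


# Three concatenated documents; the second declaration follows a closing tag
# on the same line, mirroring the "</sequence-cwu><?xml ..." case handled above.
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n<first>A</first>\n'
    b'<?xml version="1.0" encoding="UTF-8"?>\n<sequence-cwu>B</sequence-cwu>'
    b'<?xml version="1.0" encoding="UTF-8"?>\n<third>C</third>\n'
)

parts = split_concatenated_xml(sample)
print(len(parts))
```

For multi-gigabyte bulk files, the line-by-line approach of the notebook cell is preferable because it never holds the whole archive member in memory.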

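Using the backend converter (optional)¶

The custom backend converters JatsDocumentBackend and PatentUsptoDocumentBackend handle the parsing of PMC articles and USPTO patents, respectively.
As with any other backend, you can use the is_valid() function to check whether the input document is supported by the backend.
Note that some XML sections in the original USPTO zip file may not represent patents, such as sequence listings, so the backend reports them as invalid.

In [11]: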

from tqdm.notebook import tqdm

from docling.backend.xml.jats_backend import JatsDocumentBackend
from docling.backend.xml.uspto_backend import PatentUsptoDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument

# check PMC
in_doc = InputDocument(
    path_or_stream=TEMP_DIR / "nihpp-2024.12.26.630351v1.nxml",
    format=InputFormat.XML_JATS,
    backend=JatsDocumentBackend,
)
backend = JatsDocumentBackend(
    in_doc=in_doc, path_or_stream=TEMP_DIR / "nihpp-2024.12.26.630351v1.nxml"
)
print(f"Document {in_doc.file.name} is a valid PMC article? {backend.is_valid()}")

# check USPTO
in_doc = InputDocument(
    path_or_stream=TEMP_DIR / "ipg241217-1.xml",
    format=InputFormat.XML_USPTO,
    backend=PatentUsptoDocumentBackend,
)
backend = PatentUsptoDocumentBackend(
    in_doc=in_doc, path_or_stream=TEMP_DIR / "ipg241217-1.xml"
)
print(f"Document {in_doc.file.name} is a valid patent? {backend.is_valid()}")

patent_valid = 0
pbar = tqdm(TEMP_DIR.glob("*.xml"), total=doc_num)
for in_path in pbar:
    in_doc = InputDocument(
        path_or_stream=in_path,
        format=InputFormat.XML_USPTO,
        backend=PatentUsptoDocumentBackend,
    )
    backend = PatentUsptoDocumentBackend(in_doc=in_doc, path_or_stream=in_path)
    patent_valid += int(backend.is_valid())

print(f"Found {patent_valid} patents out of {doc_num} XML files.")

Document nihpp-2024.12.26.630351v1.nxml is a valid PMC article? True
Document ipg241217-1.xml is a valid patent? True

  0%|          | 0/4014 [00:00<?, ?it/s]

Found 3928 patents out of 4014 XML files.
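
One way to see which sections are not patents is to tally the root element of each exported file. The sketch below uses only the standard library and does not reproduce the checks performed by is_valid(); the helper root_tag, the root tag us-patent-grant, and the sample sections are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from io import BytesIO


def root_tag(xml_bytes: bytes) -> str:
    """Return the tag of the root element, or "<invalid>" if parsing fails."""
    try:
        # The first "start" event always belongs to the root element.
        for _event, element in ET.iterparse(BytesIO(xml_bytes), events=("start",)):
            return element.tag
    except ET.ParseError:
        pass
    return "<invalid>"


# Hypothetical sections resembling what the splitting step writes to disk.
sections = [
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE us-patent-grant SYSTEM "grant.dtd" [ ]>\n'
    b"<us-patent-grant><claims/></us-patent-grant>",
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE sequence-cwu SYSTEM "seq.dtd" [ ]>\n'
    b"<sequence-cwu/>",
    b"<broken",
]

tally = Counter(root_tag(section) for section in sections)
print(tally)
```

Applied to the files in TEMP_DIR, a tally like this would show how the non-patent sections are distributed, complementing the single count reported above.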

Calling the convert() function converts the input document into a DoclingDocument:

In [12]:

doc = backend.convert()

claims_sec = next(item for item in doc.texts if item.text == "CLAIMS")
print(f'Patent "{doc.texts[0].text}" has {len(claims_sec.children)} claims')

Patent "Semiconductor package" has 19 claims

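✏️ Tip: in general, there is no need to use the backend converters to parse USPTO or JATS (PubMed) XML files. The generic DocumentConverter object tries to guess the input document format and applies the corresponding backend parser. The conversion shown in Simple conversion is the recommended usage for the supported XML files.

Parse, chunk, and index¶

The DoclingDocument format of the converted patents has a rich hierarchical structure, inherited from the original XML document and preserved by the Docling custom backend. In this notebook, we leverage:

The SimpleDirectoryReader pattern, to iterate over the exported XML files created in the section Fetch the data.
The LlamaIndex extensions DoclingReader and DoclingNodeParser, to ingest the patent chunks into a Milvus vector store.
The HierarchicalChunker implementation, which applies document-based hierarchical chunking to leverage the patent structure, such as sections and the paragraphs within them.

Refer to the Chunking documentation and the RAG with LlamaIndex notebook for other possible implementations and usage patterns.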
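
To illustrate what hierarchy-aware chunking means, here is a minimal standard-library sketch: each paragraph becomes one chunk that carries the path of headings above it. It is a conceptual illustration, not the HierarchicalChunker implementation; the Section and Chunk classes, the function name, and the paragraph texts are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Section:
    heading: str
    paragraphs: List[str] = field(default_factory=list)
    children: List["Section"] = field(default_factory=list)


@dataclass
class Chunk:
    text: str
    headings: Tuple[str, ...]  # path from the document title down to the section


def chunk_hierarchically(
    section: Section, path: Tuple[str, ...] = ()
) -> Iterator[Chunk]:
    """Yield one chunk per paragraph, tagged with its heading path."""
    current = path + (section.heading,)
    for paragraph in section.paragraphs:
        yield Chunk(text=paragraph, headings=current)
    for child in section.children:
        yield from chunk_hierarchically(child, current)


# Placeholder content; only the title and the CLAIMS heading echo the output above.
patent = Section(
    heading="Semiconductor package",
    children=[
        Section("ABSTRACT", paragraphs=["Placeholder abstract text."]),
        Section("CLAIMS", paragraphs=["Placeholder claim 1.", "Placeholder claim 2."]),
    ],
)

chunks = list(chunk_hierarchically(patent))
for chunk in chunks:
    print(" > ".join(chunk.headings), "|", chunk.text)
```

Keeping the heading path with each chunk lets a retriever distinguish, for example, a sentence in the claims from the same wording in the abstract.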

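Set the Docling reader and the directory reader¶

Note that DoclingReader uses Docling's DocumentConverter by default, so it recognizes the format of the XML files and leverages the PatentUsptoDocumentBackend automatically.

For demonstration purposes, we limit the scope of the analysis to the first 100 patents.

In [13]: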

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.docling import DoclingReader

reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
dir_reader = SimpleDirectoryReader(
    input_dir=TEMP_DIR,
    exclude=["docling.db", "*.nxml"],
    file_extractor={".xml": reader},
    filename_as_id=True,
    num_files_limit=100,
)
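
To make the effect of the exclude and num_files_limit arguments concrete, the following sketch mimics the selection with the standard library only. It assumes files are sorted by name before the limit is applied, which may differ from SimpleDirectoryReader's internal behavior; the helper select_files and the file names are hypothetical.

```python
import fnmatch
import tempfile
from pathlib import Path
from typing import List


def select_files(input_dir: Path, exclude: List[str], limit: int) -> List[Path]:
    """Drop files matching any exclude pattern, then keep the first `limit` by name."""
    files = sorted(p for p in input_dir.iterdir() if p.is_file())
    kept = [
        p for p in files
        if not any(fnmatch.fnmatch(p.name, pattern) for pattern in exclude)
    ]
    return kept[:limit]


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for name in [
        "docling.db",
        "article.nxml",
        "ipg241217-1.xml",
        "ipg241217-2.xml",
        "ipg241217-3.xml",
    ]:
        (tmp_dir / name).touch()
    selected_names = [
        p.name for p in select_files(tmp_dir, ["docling.db", "*.nxml"], limit=2)
    ]

print(selected_names)
```

Under this assumption the ordering is lexicographic, so with thousands of files "ipg241217-10.xml" would sort before "ipg241217-2.xml"; the first 100 files are therefore not necessarily the first 100 patents in the archive.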

Set the node parser¶

Note that the HierarchicalChunker is the default chunking implementation of the DoclingNodeParser.

In [14]:

from llama_index.node_parser.docling import DoclingNodeParser

node_parser = DoclingNodeParser()

Set a local Milvus database and run the ingestion¶

In [ ]:

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore

vector_store = MilvusVectorStore(
    uri=MILVUS_URI,
    dim=embed_dim,
    overwrite=True,
)

index = VectorStoreIndex.from_documents(
    documents=dir_reader.load_data(show_progress=True),
    transformations=[node_parser],
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    embed_model=EMBED_MODEL,
    show_progress=True,
)

Finally, add the PMC article to the vector store directly from the reader.

In [14]:

index.from_documents(
    documents=reader.load_data(TEMP_DIR / "nihpp-2024.12.26.630351v1.nxml"),
    transformations=[node_parser],
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    embed_model=EMBED_MODEL,
)

Out[14]:

<llama_index.core.indices.vector_store.base.VectorStoreIndex at 0x373a7f7d0>

Question-answering with RAG¶

The retriever can be used to identify highly relevant documents:

In [15]:

retriever = index.as_retriever(similarity_top_k=3)
results = retriever.retrieve("What patents are related to fitness devices?")

for item in results:
    print(item)

Node ID: 5afd36c0-a739-4a88-a51c-6d0f75358db5
Text: The portable fitness monitoring device 102 may be a device such
as, for example, a mobile phone, a personal digital assistant, a music
file player (e.g. and MP3 player), an intelligent article for wearing
(e.g. a fitness monitoring garment, wrist band, or watch), a dongle
(e.g. a small hardware device that protects software) that includes a
fitn...
Score:  0.772

Node ID: f294b5fd-9089-43cb-8c4e-d1095a634ff1
Text: US Patent Application US 20120071306 entitled “Portable
Multipurpose Whole Body Exercise Device” discloses a portable
multipurpose whole body exercise device which can be used for general
fitness, Pilates-type, core strengthening, therapeutic, and
rehabilitative exercises as well as stretching and physical therapy
and which includes storable acc...
Score:  0.749

Node ID: 8251c7ef-1165-42e1-8c91-c99c8a711bf7
Text: Program products, methods, and systems for providing fitness
monitoring services of the present invention can include any software
application executed by one or more computing devices. A computing
device can be any type of computing device having one or more
processors. For example, a computing device can be a workstation,
mobile device (e.g., ...
Score:  0.744
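
The scores above come from comparing the query embedding with the stored chunk embeddings. The following toy sketch shows top-k retrieval by cosine similarity on hand-made two-dimensional vectors. It assumes cosine similarity as the metric; the actual metric and index structure depend on how the vector store is configured, and the function names and vectors are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: List[float], chunks: Dict[str, List[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Score every chunk against the query and return the k best matches."""
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; real ones have hundreds of dimensions.
toy_chunks = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [1.0, 1.0],
    "chunk-c": [0.0, 1.0],
    "chunk-d": [-1.0, 0.0],
}

hits = top_k([1.0, 0.0], toy_chunks, k=3)
for chunk_id, score in hits:
    print(f"{chunk_id}: {score:.3f}")
```

A vector database avoids this exhaustive scan by using an index, but the ranking it approximates is the same kind of similarity ordering.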

With the query engine, we can run question answering with the RAG pattern on the set of indexed documents.

First, we can prompt the LLM directly:

In [16]:

from llama_index.core.base.llms.types import ChatMessage, MessageRole
from rich.console import Console
from rich.panel import Panel

console = Console()
query = "Do mosquitoes in high altitude expand viruses over large distances?"

usr_msg = ChatMessage(role=MessageRole.USER, content=query)
response = GEN_MODEL.chat(messages=[usr_msg])

console.print(Panel(query, title="Prompt", border_style="bold red"))
console.print(
    Panel(
        response.message.content.strip(),
        title="Generated Content",
        border_style="bold green",
    )
)

╭──────────────────────────────────────────────────── Prompt ─────────────────────────────────────────────────────╮
│ Do mosquitoes in high altitude expand viruses over large distances?                                             │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

╭─────────────────────────────────────────────── Generated Content ───────────────────────────────────────────────╮
│ Mosquitoes can be found at high altitudes, but their ability to transmit viruses over long distances is not     │
│ primarily dependent on altitude. Mosquitoes are vectors for various diseases, such as malaria, dengue fever,    │
│ and Zika virus, and their transmission range is more closely related to their movement, the presence of a host, │
│ and environmental conditions that support their survival and reproduction.                                      │
│                                                                                                                 │
│ At high altitudes, the environment can be less suitable for mosquitoes due to factors such as colder            │
│ temperatures, lower humidity, and stronger winds, which can limit their population size and distribution.       │
│ However, some species of mosquitoes have adapted to high-altitude environments and can still transmit diseases  │
│ in these areas.                                                                                                 │
│                                                                                                                 │
│ It is possible for mosquitoes to be transported by wind or human activities to higher altitudes, but this is    │
│ not a significant factor in their ability to transmit viruses over long distances. Instead, long-distance       │
│ transmission of viruses is more often associated with human travel and transportation, which can rapidly spread │
│ infected mosquitoes or humans to new areas, leading to the spread of disease.                                   │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Now, we can compare the response when the model is prompted with the indexed PMC article as supporting context:

In [17]:

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(
    filters=[
        ExactMatchFilter(key="filename", value="nihpp-2024.12.26.630351v1.nxml"),
    ]
)

query_engine = index.as_query_engine(llm=GEN_MODEL, filter=filters, similarity_top_k=3)
result = query_engine.query(query)

console.print(
    Panel(
        result.response.strip(),
        title="Generated Content with RAG",
        border_style="bold green",
    )
)

╭────────────────────────────────────────── Generated Content with RAG ───────────────────────────────────────────╮
│ Yes, mosquitoes in high altitude can expand viruses over large distances. A study intercepted 1,017 female      │
│ mosquitoes at altitudes of 120-290 m above ground over Mali and Ghana and screened them for infection with      │
│ arboviruses, plasmodia, and filariae. The study found that 3.5% of the mosquitoes were infected with            │
│ flaviviruses, and 1.1% were infectious. Additionally, the study identified 19 mosquito-borne pathogens,         │
│ including three arboviruses that affect humans (dengue, West Nile, and M’Poko viruses). The study provides      │
│ compelling evidence that mosquito-borne pathogens are often spread by windborne mosquitoes at altitude.         │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
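
The difference between the two answers comes from the retrieved passages being placed in front of the question before the model is called. Here is a minimal sketch of that prompt assembly. The template wording and the function build_rag_prompt are assumptions for illustration and do not reproduce the exact prompt used by the LlamaIndex query engine; the passages are placeholders.

```python
from typing import List


def build_rag_prompt(question: str, context_chunks: List[str]) -> str:
    """Concatenate retrieved passages and the question into a single prompt."""
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, answer the query.\n"
        f"Query: {question}\n"
        "Answer: "
    )


question = "Do mosquitoes in high altitude expand viruses over large distances?"
# Placeholder passages standing in for the top-3 retrieved chunks.
retrieved = ["placeholder passage 1", "placeholder passage 2", "placeholder passage 3"]

prompt = build_rag_prompt(question, retrieved)
print(prompt)
```

Because the model is asked to rely on the supplied context, the quality of the answer depends directly on which chunks the retriever returns.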