Serialization
Overview
In this notebook we show how to use the Docling serializers.
Setup
In [1]
%pip install -qU pip docling docling-core~=2.29 rich
Note: you may need to restart the kernel to use updated packages.
In [2]
DOC_SOURCE = "https://arxiv.org/pdf/2311.18481"
# we set some start-stop cues for defining an excerpt to print
start_cue = "Copyright © 2024"
stop_cue = "Application of NLP to ESG"
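The cues above are later passed to `str.find`, which returns -1 when a cue is absent, so a plain slice can silently yield a wrong or empty excerpt. Below is a minimal sketch of a more defensive excerpt helper; the name `extract_excerpt` and its fallback behavior are assumptions made for illustration and are not part of Docling.

```python
def extract_excerpt(text: str, start_cue: str, stop_cue: str) -> str:
    """Return the text from start_cue up to (not including) stop_cue."""
    start = text.find(start_cue)
    if start == -1:
        # start cue missing: fall back to the beginning of the text
        start = 0
    # search for the stop cue only after the start position
    stop = text.find(stop_cue, start)
    if stop == -1:
        # stop cue missing: fall back to the end of the text
        stop = len(text)
    return text[start:stop]


sample = "intro Copyright © 2024 body text Application of NLP to ESG tail"
print(extract_excerpt(sample, "Copyright © 2024", "Application of NLP to ESG"))
```

With such a helper, a changed source document degrades to a longer excerpt instead of an empty or misleading one.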
In [3]
from rich.console import Console
from rich.panel import Panel
console = Console(width=210) # for preventing Markdown table wrapped rendering
def print_in_console(text):
    console.print(Panel(text))
Basic usage
We first convert the document:
In [4]
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
doc = converter.convert(source=DOC_SOURCE).document
/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py:683: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, then device pinned memory won't be used. warnings.warn(warn_msg)
We can now apply any BaseDocSerializer to the resulting document.
👉 Note that, to keep the displayed output brief, we only print an excerpt.
For example, below we apply an HTMLDocSerializer:
In [5]
from docling_core.transforms.serializer.html import HTMLDocSerializer
serializer = HTMLDocSerializer(doc=doc)
ser_result = serializer.serialize()
ser_text = ser_result.text
# we here only print an excerpt to keep the output brief:
print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.</p>
<table><tbody><tr><th>Report</th><th>Question</th><th>Answer</th></tr><tr><td>IBM 2022</td><td>How many hours were spent on employee learning in 2021?</td><td>22.5 million hours</td></tr><tr><td>IBM
2022</td><td>What was the rate of fatalities in 2021?</td><td>The rate of fatalities in 2021 was 0.0016.</td></tr><tr><td>IBM 2022</td><td>How many full audits were con- ducted in 2022 in
India?</td><td>2</td></tr><tr><td>Starbucks 2022</td><td>What is the percentage of women in the Board of Directors?</td><td>25%</td></tr><tr><td>Starbucks 2022</td><td>What was the total energy con-
sumption in 2021?</td><td>According to the table, the total energy consumption in 2021 was 2,491,543 MWh.</td></tr><tr><td>Starbucks 2022</td><td>How much packaging material was made from renewable mate-
rials?</td><td>According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22.</td></tr></tbody></table>
<p>Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.</p>
<p>ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the
response.</p>
<h2>Related Work</h2>
<p>The DocQA integrates multiple AI technologies, namely:</p>
<p>Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric
layout analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et
al. 2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based .
Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-</p>
<figure><figcaption>Figure 1: System architecture: Simplified sketch of document question-answering pipeline.</figcaption></figure>
<p>based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).</p>
<p>
In the following example, we use a MarkdownDocSerializer:
In [6]
from docling_core.transforms.serializer.markdown import MarkdownDocSerializer
serializer = MarkdownDocSerializer(doc=doc)
ser_result = serializer.serialize()
ser_text = ser_result.text
print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

| Report | Question | Answer |
|----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| IBM 2022 | How many hours were spent on employee learning in 2021? | 22.5 million hours |
| IBM 2022 | What was the rate of fatalities in 2021? | The rate of fatalities in 2021 was 0.0016. |
| IBM 2022 | How many full audits were con- ducted in 2022 in India? | 2 |
| Starbucks 2022 | What is the percentage of women in the Board of Directors? | 25% |
| Starbucks 2022 | What was the total energy con- sumption in 2021? | According to the table, the total energy consumption in 2021 was 2,491,543 MWh. |
| Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. |

Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.

ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the
response.

## Related Work

The DocQA integrates multiple AI technologies, namely:

Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout
analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al.
2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based .
Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-

Figure 1: System architecture: Simplified sketch of document question-answering pipeline.

<!-- image -->

based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).
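Configuring a serializer
Let's now assume we want to reconfigure the Markdown serialization such that:
- it uses a different component serializer, e.g. if we prefer tables to be printed in a triplet format (which could potentially improve the vector representation compared to Markdown tables)
- it uses specific user-defined parameters, e.g. if we prefer a different image placeholder text than the default one
Check out the following configuration and notice the serialization differences in the output further below: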
In [7]
from docling_core.transforms.chunker.hierarchical_chunker import TripletTableSerializer
from docling_core.transforms.serializer.markdown import MarkdownParams
serializer = MarkdownDocSerializer(
    doc=doc,
    table_serializer=TripletTableSerializer(),
    params=MarkdownParams(
        image_placeholder="<!-- demo picture placeholder -->",
        # ...
    ),
)
ser_result = serializer.serialize()
ser_text = ser_result.text
print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

IBM 2022, Question = How many hours were spent on employee learning in 2021?. IBM 2022, Answer = 22.5 million hours. IBM 2022, Question = What was the rate of fatalities in 2021?. IBM 2022, Answer = The
rate of fatalities in 2021 was 0.0016.. IBM 2022, Question = How many full audits were con- ducted in 2022 in India?. IBM 2022, Answer = 2. Starbucks 2022, Question = What is the percentage of women in the
Board of Directors?. Starbucks 2022, Answer = 25%. Starbucks 2022, Question = What was the total energy con- sumption in 2021?. Starbucks 2022, Answer = According to the table, the total energy consumption
in 2021 was 2,491,543 MWh.. Starbucks 2022, Question = How much packaging material was made from renewable mate- rials?. Starbucks 2022, Answer = According to the given data, 31% of packaging materials were
made from recycled or renewable materials in FY22.

Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.

ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the
response.

## Related Work

The DocQA integrates multiple AI technologies, namely:

Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout
analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al.
2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based .
Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-

Figure 1: System architecture: Simplified sketch of document question-answering pipeline.

<!-- demo picture placeholder -->

based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).
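The triplet rendering in the output above pairs each row label with a column header and a cell value. The following is a rough, self-contained sketch of that idea in plain Python. It is not the implementation of `TripletTableSerializer`; the function name `table_to_triplets` is hypothetical, and it assumes the first row is the header and the first column holds the row labels.

```python
def table_to_triplets(header: list[str], rows: list[list[str]]) -> str:
    """Flatten a table into 'row label, column name = value' statements."""
    col_names = header[1:]  # assumption: the first column is the row label
    parts: list[str] = []
    for row in rows:
        row_label = row[0]
        for col_name, value in zip(col_names, row[1:]):
            parts.append(f"{row_label}, {col_name} = {value}")
    # statements are joined with ". ", mirroring the excerpt shown above
    return ". ".join(parts)


header = ["Report", "Question", "Answer"]
rows = [
    [
        "IBM 2022",
        "How many hours were spent on employee learning in 2021?",
        "22.5 million hours",
    ],
]
print(table_to_triplets(header, rows))
```

Each statement carries its own row and column context, which is one plausible reason such a format may embed better than a Markdown grid when text is chunked.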
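Creating a custom serializer
In the examples above, we were able to reuse existing implementations for our desired serialization strategy, but let's now assume we want to define custom serialization logic, e.g. we would like picture serialization to include any available picture description (captioning) annotations.
To that end, we first need to revisit our conversion and include all pipeline options needed for picture description enrichment.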
In [8]
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    PictureDescriptionVlmOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_picture_description=True,
    picture_description_options=PictureDescriptionVlmOptions(
        repo_id="HuggingFaceTB/SmolVLM-256M-Instruct",
        prompt="Describe this picture in three to five sentences. Be precise and concise.",
    ),
    generate_picture_images=True,
    images_scale=2,
)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
doc = converter.convert(source=DOC_SOURCE).document
/Users/pva/work/github.com/DS4SD/docling/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py:683: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, then device pinned memory won't be used. warnings.warn(warn_msg)
We can then define our custom picture serializer:
In [9]
from typing import Any, Optional
from docling_core.transforms.serializer.base import (
    BaseDocSerializer,
    SerializationResult,
)
from docling_core.transforms.serializer.common import create_ser_result
from docling_core.transforms.serializer.markdown import (
    MarkdownParams,
    MarkdownPictureSerializer,
)
from docling_core.types.doc.document import (
    DoclingDocument,
    ImageRefMode,
    PictureDescriptionData,
    PictureItem,
)
from typing_extensions import override


class AnnotationPictureSerializer(MarkdownPictureSerializer):
    @override
    def serialize(
        self,
        *,
        item: PictureItem,
        doc_serializer: BaseDocSerializer,
        doc: DoclingDocument,
        separator: Optional[str] = None,
        **kwargs: Any,
    ) -> SerializationResult:
        text_parts: list[str] = []

        # reusing the existing result:
        parent_res = super().serialize(
            item=item,
            doc_serializer=doc_serializer,
            doc=doc,
            **kwargs,
        )
        text_parts.append(parent_res.text)

        # appending annotations:
        for annotation in item.annotations:
            if isinstance(annotation, PictureDescriptionData):
                text_parts.append(f"<!-- Picture description: {annotation.text} -->")

        text_res = (separator or "\n").join(text_parts)
        return create_ser_result(text=text_res, span_source=item)
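The serializer above follows a simple pattern: call the parent serializer, append one HTML comment per description annotation, then join the parts with the separator (newline by default). Below is a toy, dependency-free sketch of that pattern for experimenting without running a conversion. All `Toy*` classes are hypothetical stand-ins and do not mirror the Docling class hierarchy.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyDescription:
    """Stand-in for a picture description annotation (hypothetical)."""

    text: str


@dataclass
class ToyPicture:
    """Stand-in for a picture item with a caption and annotations (hypothetical)."""

    caption: str
    annotations: list = field(default_factory=list)


class ToyPictureSerializer:
    """Stand-in base serializer: emits only the caption."""

    def serialize(self, *, item: ToyPicture, separator: Optional[str] = None) -> str:
        return item.caption


class ToyAnnotationSerializer(ToyPictureSerializer):
    """Reuses the base result, then appends one comment per description."""

    def serialize(self, *, item: ToyPicture, separator: Optional[str] = None) -> str:
        parts = [super().serialize(item=item)]
        for annotation in item.annotations:
            # annotations of other types are skipped
            if isinstance(annotation, ToyDescription):
                parts.append(f"<!-- Picture description: {annotation.text} -->")
        # fall back to a newline when no separator is given
        return (separator or "\n").join(parts)


picture = ToyPicture(
    caption="Figure 1: System architecture.",
    annotations=[ToyDescription(text="A pipeline diagram."), "not a description"],
)
print(ToyAnnotationSerializer().serialize(item=picture))
```

Wrapping the description in an HTML comment keeps it out of the rendered Markdown while leaving it available to downstream text consumers.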
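Last but not least, we define a new doc serializer that leverages our custom picture serializer.
Notice the picture description annotation in the output below: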
In [10]
serializer = MarkdownDocSerializer(
    doc=doc,
    picture_serializer=AnnotationPictureSerializer(),
    params=MarkdownParams(
        image_mode=ImageRefMode.PLACEHOLDER,
        image_placeholder="",
    ),
)
ser_result = serializer.serialize()
ser_text = ser_result.text
print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

| Report | Question | Answer |
|----------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| IBM 2022 | How many hours were spent on employee learning in 2021? | 22.5 million hours |
| IBM 2022 | What was the rate of fatalities in 2021? | The rate of fatalities in 2021 was 0.0016. |
| IBM 2022 | How many full audits were con- ducted in 2022 in India? | 2 |
| Starbucks 2022 | What is the percentage of women in the Board of Directors? | 25% |
| Starbucks 2022 | What was the total energy con- sumption in 2021? | According to the table, the total energy consumption in 2021 was 2,491,543 MWh. |
| Starbucks 2022 | How much packaging material was made from renewable mate- rials? | According to the given data, 31% of packaging materials were made from recycled or renewable materials in FY22. |

Table 1: Example question answers from the ESG reports of IBM and Starbucks using Deep Search DocQA system.

ESG report in our library via our QA conversational assistant. Our assistant generates answers and also presents the information (paragraph or table), in the ESG report, from which it has generated the
response.

## Related Work

The DocQA integrates multiple AI technologies, namely:

Document Conversion: Converting unstructured documents, such as PDF files, into a machine-readable format is a challenging task in AI. Early strategies for document conversion were based on geometric layout
analysis (Cattoni et al. 2000; Breuel 2002). Thanks to the availability of large annotated datasets (PubLayNet (Zhong et al. 2019), DocBank (Li et al. 2020), DocLayNet (Pfitzmann et al. 2022; Auer et al.
2023), deep learning-based methods are routinely used. Modern approaches for recovering the structure of a document can be broadly divided into two categories: image-based or PDF representation-based .
Imagebased methods usually employ Transformer or CNN architectures on the images of pages (Zhang et al. 2023; Li et al. 2022; Huang et al. 2022). On the other hand, deep learning-

Figure 1: System architecture: Simplified sketch of document question-answering pipeline.
<!-- Picture description: The image depicts a document conversion process. It is a sequence of steps that includes document conversion, information retrieval, and response generation. The document
conversion step involves converting the document from a text format to a markdown format. The information retrieval step involves retrieving the document from a database or other source. The response
generation step involves generating a response from the information retrieval step. -->

based language processing methods are applied on the native PDF content (generated by a single PDF printing command) (Auer et al. 2022; Livathinos et al. 2021; Staar et al. 2018).