混合分块¶
概述¶
混合分块在基于文档的层次化分块之上应用了对分词感知的细化。
更多详情,请参见此处。
设置¶
输入 [1]
已复制!
%pip install -qU pip docling transformers
%pip install -qU pip docling transformers
Note: you may need to restart the kernel to use updated packages.
输入 [2]
已复制!
DOC_SOURCE = "../../tests/data/md/wiki.md"
DOC_SOURCE = "../../tests/data/md/wiki.md"
基本用法¶
我们首先转换文档
输入 [3]
已复制!
from docling.document_converter import DocumentConverter
doc = DocumentConverter().convert(source=DOC_SOURCE).document
from docling.document_converter import DocumentConverter doc = DocumentConverter().convert(source=DOC_SOURCE).document
对于基本的分块场景,我们可以直接实例化一个 HybridChunker
,它将使用默认参数。
输入 [4]
已复制!
from docling.chunking import HybridChunker
chunker = HybridChunker()
chunk_iter = chunker.chunk(dl_doc=doc)
from docling.chunking import HybridChunker chunker = HybridChunker() chunk_iter = chunker.chunk(dl_doc=doc)
Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors
👉 注意:如上所示,使用
HybridChunker
有时可能会导致 transformers 库发出警告,但这只是一个“虚惊”——详情请查看此处。
请注意,您通常想要嵌入的文本是经 contextualize()
方法返回的、已进行上下文富化的文本。
输入 [5]
已复制!
for i, chunk in enumerate(chunk_iter):
print(f"=== {i} ===")
print(f"chunk.text:\n{f'{chunk.text[:300]}…'!r}")
enriched_text = chunker.contextualize(chunk=chunk)
print(f"chunker.contextualize(chunk):\n{f'{enriched_text[:300]}…'!r}")
print()
for i, chunk in enumerate(chunk_iter): print(f"=== {i} ===") print(f"chunk.text:\n{f'{chunk.text[:300]}…'!r}") enriched_text = chunker.contextualize(chunk=chunk) print(f"chunker.contextualize(chunk):\n{f'{enriched_text[:300]}…'!r}") print()
=== 0 === chunk.text: 'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Aver…' chunker.contextualize(chunk): 'IBM\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial …' === 1 === chunk.text: 'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19] and Willa…' chunker.contextualize(chunk): 'IBM\n1910s–1950s\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889…' === 2 === chunk.text: 'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson,…' chunker.contextualize(chunk): 'IBM\n1910s–1950s\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John …' === 3 === chunk.text: 'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.…' chunker.contextualize(chunk): 'IBM\n1960s–1980s\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.…'
配置分词器¶
为了更好地控制分块,我们可以如下所示地参数化分词器。
在 RAG / 检索上下文中,确保分块器和嵌入模型使用相同的分词器非常重要。
👉 HuggingFace transformers 的分词器可以如下例所示使用
输入 [6]
已复制!
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer
from docling.chunking import HybridChunker
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
MAX_TOKENS = 64 # set to a small number for illustrative purposes
tokenizer = HuggingFaceTokenizer(
tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer` for HF case
)
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer from transformers import AutoTokenizer from docling.chunking import HybridChunker EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2" MAX_TOKENS = 64 # set to a small number for illustrative purposes tokenizer = HuggingFaceTokenizer( tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID), max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer` for HF case )
👉 或者,可以使用OpenAI 的分词器,如下例所示(取消注释以使用 — 需要安装 docling-core[chunking-openai]
)
输入 [7]
已复制!
# import tiktoken
# from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer
# tokenizer = OpenAITokenizer(
# tokenizer=tiktoken.encoding_for_model("gpt-4o"),
# max_tokens=128 * 1024, # context window length required for OpenAI tokenizers
# )
# import tiktoken # from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer # tokenizer = OpenAITokenizer( # tokenizer=tiktoken.encoding_for_model("gpt-4o"), # max_tokens=128 * 1024, # context window length required for OpenAI tokenizers # )
现在我们可以实例化分块器了
输入 [8]
已复制!
chunker = HybridChunker(
tokenizer=tokenizer,
merge_peers=True, # optional, defaults to True
)
chunk_iter = chunker.chunk(dl_doc=doc)
chunks = list(chunk_iter)
chunker = HybridChunker( tokenizer=tokenizer, merge_peers=True, # optional, defaults to True ) chunk_iter = chunker.chunk(dl_doc=doc) chunks = list(chunk_iter)
查看下面的输出分块需要注意的地方
- 在可能的情况下,我们为包含元数据的序列化形式适配了 64 个 Token 的限制(参见分块 2)
- 在需要时,我们在达到限制前停止,例如,参见 63 个 Token 的情况,因为再继续就会遇到逗号(参见分块 6)
- 在可能的情况下,我们合并了尺寸不足的同级分块(参见分块 0)
- 合并后紧随其后的“尾部”分块可能仍然尺寸不足(参见分块 8)
输入 [9]
已复制!
for i, chunk in enumerate(chunks):
print(f"=== {i} ===")
txt_tokens = tokenizer.count_tokens(chunk.text)
print(f"chunk.text ({txt_tokens} tokens):\n{chunk.text!r}")
ser_txt = chunker.contextualize(chunk=chunk)
ser_tokens = tokenizer.count_tokens(ser_txt)
print(f"chunker.contextualize(chunk) ({ser_tokens} tokens):\n{ser_txt!r}")
print()
for i, chunk in enumerate(chunks): print(f"=== {i} ===") txt_tokens = tokenizer.count_tokens(chunk.text) print(f"chunk.text ({txt_tokens} tokens):\n{chunk.text!r}") ser_txt = chunker.contextualize(chunk=chunk) ser_tokens = tokenizer.count_tokens(ser_txt) print(f"chunker.contextualize(chunk) ({ser_tokens} tokens):\n{ser_txt!r}") print()
=== 0 === chunk.text (55 tokens): 'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.' chunker.contextualize(chunk) (56 tokens): 'IBM\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.' === 1 === chunk.text (45 tokens): 'IBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.' chunker.contextualize(chunk) (46 tokens): 'IBM\nIBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.' === 2 === chunk.text (63 tokens): 'IBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed "International Business Machines" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the' chunker.contextualize(chunk) (64 tokens): 'IBM\nIBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed "International Business Machines" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the' === 3 === chunk.text (44 tokens): "IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]" chunker.contextualize(chunk) (45 tokens): "IBM\nIBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]" === 4 === chunk.text (63 tokens): 'IBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 1990s,' chunker.contextualize(chunk) (64 tokens): 'IBM\nIBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 1990s,' === 5 === chunk.text (61 tokens): 'IBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.' chunker.contextualize(chunk) (62 tokens): 'IBM\nIBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.' === 6 === chunk.text (62 tokens): "As one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming" chunker.contextualize(chunk) (63 tokens): "IBM\nAs one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming" === 7 === chunk.text (63 tokens): 'language, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing' chunker.contextualize(chunk) (64 tokens): 'IBM\nlanguage, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing' === 8 === chunk.text (5 tokens): 'Awards.[16]' chunker.contextualize(chunk) (6 tokens): 'IBM\nAwards.[16]' === 9 === chunk.text (56 tokens): 'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine' chunker.contextualize(chunk) (60 tokens): 'IBM\n1910s–1950s\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine' === 10 === chunk.text (60 tokens): "(1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the" chunker.contextualize(chunk) (64 tokens): "IBM\n1910s–1950s\n(1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the" === 11 === chunk.text (59 tokens): 'Computing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington,' chunker.contextualize(chunk) (63 tokens): 'IBM\n1910s–1950s\nComputing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington,' === 12 === chunk.text (13 tokens): 'D.C.; and Toronto, Canada.[22]' chunker.contextualize(chunk) (17 tokens): 'IBM\n1910s–1950s\nD.C.; and Toronto, Canada.[22]' === 13 === chunk.text (60 tokens): 'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called' chunker.contextualize(chunk) (64 tokens): 'IBM\n1910s–1950s\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called' === 14 === chunk.text (59 tokens): "on Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business" chunker.contextualize(chunk) (63 tokens): "IBM\n1910s–1950s\non Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business" === 15 === chunk.text (23 tokens): "practices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\n105" chunker.contextualize(chunk) (27 tokens): "IBM\n1910s–1950s\npractices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\n105" === 16 === chunk.text (59 tokens): 'He implemented sales conventions, "generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker".[25][26] His favorite slogan,' chunker.contextualize(chunk) (63 tokens): 'IBM\n1910s–1950s\nHe implemented sales conventions, "generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker".[25][26] His favorite slogan,' === 17 === chunk.text (60 tokens): '"THINK", became a mantra for each company\'s employees.[25] During Watson\'s first four years, revenues reached $9 million ($158 million today) and the company\'s operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the' chunker.contextualize(chunk) (64 tokens): 'IBM\n1910s–1950s\n"THINK", became a mantra for each company\'s employees.[25] During Watson\'s first four years, revenues reached $9 million ($158 million today) and the company\'s operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the' === 18 === chunk.text (57 tokens): 'clumsy hyphenated name "Computing-Tabulating-Recording Company" and chose to replace it with the more expansive title "International Business Machines" which had previously been used as the name of CTR\'s Canadian Division;[27] the name was changed on February 14,' chunker.contextualize(chunk) (61 tokens): 'IBM\n1910s–1950s\nclumsy hyphenated name "Computing-Tabulating-Recording Company" and chose to replace it with the more expansive title "International Business Machines" which had previously been used as the name of CTR\'s Canadian Division;[27] the name was changed on February 14,' === 19 === chunk.text (21 tokens): '1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.' chunker.contextualize(chunk) (25 tokens): 'IBM\n1910s–1950s\n1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.' === 20 === chunk.text (22 tokens): 'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.' chunker.contextualize(chunk) (26 tokens): 'IBM\n1960s–1980s\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.'