RAG with Haystack¶
Step | Tech | Execution
---|---|---
Embedding | Hugging Face / Sentence Transformers | 💻 Local
Vector store | Milvus | 💻 Local
Gen AI | Hugging Face Inference API | 🌐 Remote
Overview¶
This example leverages the Haystack Docling extension, along with Milvus-based document store and retriever instances, as well as sentence-transformers embeddings.
The presented `DoclingConverter` component enables you to:

- use various document types in your LLM applications with ease and speed, and
- leverage Docling's rich format for advanced, document-native grounding.
`DoclingConverter` supports two different export modes:

- `ExportType.MARKDOWN`: if you want to capture each input document as a separate Haystack document, or
- `ExportType.DOC_CHUNKS` (default): if you want to have each input document chunked and to then capture each individual chunk as a separate Haystack document downstream.
The example allows exploring both modes via the parameter `EXPORT_TYPE`; depending on the value set, the ingestion and RAG pipelines are set up accordingly.
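As a minimal standalone sketch (outside any pipeline), the converter can also be invoked directly; the call below assumes the same `paths` input and `documents` output sockets that the pipelines further down use:

```python
from docling_haystack.converter import DoclingConverter, ExportType

# Standalone sketch: convert one source into chunk-level Haystack documents.
converter = DoclingConverter(export_type=ExportType.DOC_CHUNKS)
result = converter.run(paths=["https://arxiv.org/pdf/2408.09869"])
print(len(result["documents"]))  # number of chunks produced
```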
Setup¶
- 👉 For best conversion speed, use GPU acceleration whenever available; e.g. if running on Colab, use a GPU-enabled runtime.
- The notebook uses HuggingFace's Inference API; for increased LLM quota, a token can be provided via the environment variable `HF_TOKEN`.
- Requirements can be installed as shown below (`--no-warn-conflicts` is meant for Colab's pre-populated Python environment; feel free to remove for stricter usage):
In [1]
%pip install -q --progress-bar off --no-warn-conflicts docling-haystack haystack-ai docling pymilvus milvus-haystack sentence-transformers python-dotenv
Note: you may need to restart the kernel to use updated packages.
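Besides a `.env` file (picked up by the `load_dotenv()` call below), the token can also be set directly in the environment; a minimal sketch with a placeholder value:

```python
import os

# Alternative to a .env file: set the token before running the notebook.
# The value below is a placeholder, not a real token.
os.environ.setdefault("HF_TOKEN", "hf_xxx")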
In [2]
import os
from pathlib import Path
from tempfile import mkdtemp
from docling_haystack.converter import ExportType
from dotenv import load_dotenv
def _get_env_from_colab_or_os(key):
try:
from google.colab import userdata
try:
return userdata.get(key)
except userdata.SecretNotFoundError:
pass
except ImportError:
pass
return os.getenv(key)
load_dotenv()
HF_TOKEN = _get_env_from_colab_or_os("HF_TOKEN")
PATHS = ["https://arxiv.org/pdf/2408.09869"] # Docling Technical Report
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
GENERATION_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
EXPORT_TYPE = ExportType.DOC_CHUNKS
QUESTION = "Which are the main AI models in Docling?"
TOP_K = 3
MILVUS_URI = str(Path(mkdtemp()) / "docling.db")
Indexing pipeline¶
In [3]
from docling_haystack.converter import DoclingConverter
from haystack import Pipeline
from haystack.components.embedders import (
SentenceTransformersDocumentEmbedder,
SentenceTransformersTextEmbedder,
)
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from milvus_haystack import MilvusDocumentStore, MilvusEmbeddingRetriever
from docling.chunking import HybridChunker
document_store = MilvusDocumentStore(
connection_args={"uri": MILVUS_URI},
drop_old=True,
text_field="txt", # set for preventing conflict with same-name metadata field
)
idx_pipe = Pipeline()
idx_pipe.add_component(
"converter",
DoclingConverter(
export_type=EXPORT_TYPE,
chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
),
)
idx_pipe.add_component(
"embedder",
SentenceTransformersDocumentEmbedder(model=EMBED_MODEL_ID),
)
idx_pipe.add_component("writer", DocumentWriter(document_store=document_store))
if EXPORT_TYPE == ExportType.DOC_CHUNKS:
idx_pipe.connect("converter", "embedder")
elif EXPORT_TYPE == ExportType.MARKDOWN:
idx_pipe.add_component(
"splitter",
DocumentSplitter(split_by="sentence", split_length=1),
)
idx_pipe.connect("converter.documents", "splitter.documents")
idx_pipe.connect("splitter.documents", "embedder.documents")
else:
raise ValueError(f"Unexpected export type: {EXPORT_TYPE}")
idx_pipe.connect("embedder", "writer")
idx_pipe.run({"converter": {"paths": PATHS}})
Token indices sequence length is longer than the specified maximum sequence length for this model (1041 > 512). Running this sequence through the model will result in indexing errors
Batches: 0%| | 0/2 [00:00<?, ?it/s]
Out[3]
{'writer': {'documents_written': 54}}
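As a quick sanity check, the document store can be queried for its size; `count_documents()` is part of the standard Haystack document store interface, so the figure below should match the `documents_written` count above:

```python
# Sanity check: number of indexed chunks in the Milvus document store.
print(document_store.count_documents())
```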
RAG pipeline¶
In [4]
from haystack.components.builders import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret
prompt_template = """
Given these documents, answer the question.
Documents:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
Question: {{query}}
Answer:
"""
rag_pipe = Pipeline()
rag_pipe.add_component(
"embedder",
SentenceTransformersTextEmbedder(model=EMBED_MODEL_ID),
)
rag_pipe.add_component(
"retriever",
MilvusEmbeddingRetriever(document_store=document_store, top_k=TOP_K),
)
rag_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag_pipe.add_component(
"llm",
HuggingFaceAPIGenerator(
api_type="serverless_inference_api",
api_params={"model": GENERATION_MODEL_ID},
token=Secret.from_token(HF_TOKEN) if HF_TOKEN else None,
),
)
rag_pipe.add_component("answer_builder", AnswerBuilder())
rag_pipe.connect("embedder.embedding", "retriever")
rag_pipe.connect("retriever", "prompt_builder.documents")
rag_pipe.connect("prompt_builder", "llm")
rag_pipe.connect("llm.replies", "answer_builder.replies")
rag_pipe.connect("llm.meta", "answer_builder.meta")
rag_pipe.connect("retriever", "answer_builder.documents")
rag_res = rag_pipe.run(
{
"embedder": {"text": QUESTION},
"prompt_builder": {"query": QUESTION},
"answer_builder": {"query": QUESTION},
}
)
Batches: 0%| | 0/1 [00:00<?, ?it/s]
/Users/pva/work/github.com/docling-project/docling/.venv/lib/python3.12/site-packages/huggingface_hub/inference/_client.py:2232: FutureWarning: `stop_sequences` is a deprecated argument for `text_generation` task and will be removed in version '0.28.0'. Use `stop` instead. warnings.warn(
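To see what the LLM actually receives, the `PromptBuilder` can also be run standalone; a small sketch with a made-up toy document (the sample content and question are for illustration only):

```python
from haystack import Document

# Render the Jinja template outside the pipeline to inspect the prompt.
demo_builder = PromptBuilder(template=prompt_template)
rendered = demo_builder.run(
    documents=[Document(content="Docling ships two AI models.")],  # toy doc
    query="What does Docling ship?",
)
print(rendered["prompt"])
```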
Below we print out the RAG results. If you have used `ExportType.DOC_CHUNKS`, notice how the sources contain document-level grounding (e.g. page number or bounding box information):
In [5]
from docling.chunking import DocChunk
print(f"Question:\n{QUESTION}\n")
print(f"Answer:\n{rag_res['answer_builder']['answers'][0].data.strip()}\n")
print("Sources:")
sources = rag_res["answer_builder"]["answers"][0].documents
for source in sources:
if EXPORT_TYPE == ExportType.DOC_CHUNKS:
doc_chunk = DocChunk.model_validate(source.meta["dl_meta"])
print(f"- text: {doc_chunk.text!r}")
if doc_chunk.meta.origin:
print(f" file: {doc_chunk.meta.origin.filename}")
if doc_chunk.meta.headings:
print(f" section: {' / '.join(doc_chunk.meta.headings)}")
bbox = doc_chunk.meta.doc_items[0].prov[0].bbox
print(
f" page: {doc_chunk.meta.doc_items[0].prov[0].page_no}, "
f"bounding box: [{int(bbox.l)}, {int(bbox.t)}, {int(bbox.r)}, {int(bbox.b)}]"
)
elif EXPORT_TYPE == ExportType.MARKDOWN:
print(repr(source.content))
else:
raise ValueError(f"Unexpected export type: {EXPORT_TYPE}")
Question:
Which are the main AI models in Docling?

Answer:
The main AI models in Docling are a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, while TableFormer is a state-of-the-art table structure recognition model. These models are provided with pre-trained weights and a separate package for the inference code as docling-ibm-models. They are also used in the open-access deepsearch-experience, a cloud-native service for knowledge exploration tasks. Additionally, Docling plans to extend its model library with a figure-classifier model, an equation-recognition model, a code-recognition model, and more in the future.

Sources:
- text: 'As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure recognition model. We provide the pre-trained weights (hosted on huggingface) and a separate package for the inference code as docling-ibm-models . Both models are also powering the open-access deepsearch-experience, our cloud-native service for knowledge exploration tasks.'
  file: 2408.09869v5.pdf
  section: 3.2 AI models
  page: 3, bounding box: [107, 406, 504, 330]
- text: 'Docling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support downstream operations. Then, the standard model pipeline applies a sequence of AI models independently on every page in the document to extract features and content, such as layout and table structures. Finally, the results from all pages are aggregated and passed through a post-processing stage, which augments metadata, detects the document language, infers reading-order and eventually assembles a typed document object which can be serialized to JSON or Markdown.'
  file: 2408.09869v5.pdf
  section: 3 Processing pipeline
  page: 2, bounding box: [107, 273, 504, 176]
- text: 'Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of content, as well as augment extracted document metadata with additional information. Further investment into testing and optimizing GPU acceleration as well as improving the Docling-native PDF backend are on our roadmap, too.\nWe encourage everyone to propose or implement additional features and models, and will gladly take your inputs and contributions under review . The codebase of Docling is open for use and contribution, under the MIT license agreement and in alignment with our contributing guidelines included in the Docling repository. If you use Docling in your projects, please consider citing this technical report.'
  section: 6 Future work and contributions
  page: 5, bounding box: [106, 323, 504, 258]
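To explore the other export mode, it should suffice to flip the parameter near the top of the notebook and rerun all cells; a one-line sketch:

```python
# Rerun the notebook with this setting to get one Haystack document per
# input file, split into sentences by the DocumentSplitter defined above.
EXPORT_TYPE = ExportType.MARKDOWN
```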