DedocPDFLoader

DedocPDFLoader(
  self,
  file_path: str,
  *,
  split: str = 'document',

Bases

DedocBaseLoader

Inherited from DedocBaseLoader

Attributes

* ``parsing_parameters``: dict
* ``valid_split_values``: set
* ``split``: split
* ``with_tables``


Setup:

Install the ``dedoc`` package.

.. code-block:: bash

    pip install -U dedoc

Instantiate:

.. code-block:: python

    from langchain_community.document_loaders import DedocPDFLoader

    loader = DedocPDFLoader(
        file_path="example.pdf",
        # split=...,
        # with_tables=...,
        # pdf_with_text_layer=...,
        # pages=...,
        # ...
    )

Load:

.. code-block:: python

    docs = loader.load()
    print(docs[0].page_content[:100])
    print(docs[0].metadata)

.. code-block:: python

    Some text
    {
        'file_name': 'example.pdf',
        'file_type': 'application/pdf',
        # ...
    }

Lazy load:

.. code-block:: python

    docs = []
    docs_lazy = loader.lazy_load()

    for doc in docs_lazy:
        docs.append(doc)
    print(docs[0].page_content[:100])
    print(docs[0].metadata)

.. code-block:: python

    Some text
    {
        'file_name': 'example.pdf',
        'file_type': 'application/pdf',
        # ...
    }
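
Because ``lazy_load()`` returns an iterator, it is possible to stop after the first few documents instead of collecting everything into a list. The sketch below illustrates that pattern. ``take`` and ``fake_lazy_load`` are hypothetical names introduced only for this example, and the generator stands in for ``loader.lazy_load()`` so that the snippet runs without ``dedoc`` installed.

.. code-block:: python

    from itertools import islice
    from typing import Iterable, Iterator, List, TypeVar

    T = TypeVar("T")


    def take(items: Iterable[T], n: int) -> List[T]:
        """Return at most the first ``n`` items without consuming the rest."""
        return list(islice(items, n))


    def fake_lazy_load() -> Iterator[str]:
        """Hypothetical stand-in for ``loader.lazy_load()``; yields plain strings."""
        for index in range(1000):
            yield f"document {index}"


    # With a real loader this would be: take(loader.lazy_load(), 3)
    first_three = take(fake_lazy_load(), 3)
    print(first_three)

Only three items are pulled from the generator, so the remaining work is never done. With a real loader the list would hold ``Document`` objects rather than strings.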

Parameters used for document parsing via ``dedoc`` (https://dedoc.readthedocs.io/en/latest/parameters/pdf_handling.html):

* ``with_attachments``: enable extraction of attached files.
* ``recursion_deep_attachments``: recursion level for attached files extraction; works only when ``with_attachments==True``.
* ``pdf_with_text_layer``: type of handler for parsing. Available options: ``"true"``, ``"false"``, ``"tabby"``, ``"auto"``, ``"auto_tabby"`` (default).
* ``language``: language of the document for PDF without a textual layer. Available options: ``"eng"``, ``"rus"``, ``"rus+eng"`` (default). The list of languages can be extended; see https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html
* ``pages``: page slice that defines the reading range for parsing.
* ``is_one_column_document``: detect the number of columns for PDF without a textual layer. Available options: ``"true"``, ``"false"``, ``"auto"`` (default).
* ``document_orientation``: fix document orientation (90, 180, 270 degrees) for PDF without a textual layer. Available options: ``"auto"`` (default), ``"no_change"``.
* ``need_header_footer_analysis``: remove headers and footers from the output result.
* ``need_binarization``: clean page backgrounds (binarize) for PDF without a textual layer.
* ``need_pdf_table_analysis``: parse tables for PDF without a textual layer.
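
Several of the parameters above accept only a fixed set of string values, so it can help to check them before constructing the loader. The following is a minimal sketch and not part of the library: ``ALLOWED_VALUES`` and ``build_pdf_kwargs`` are hypothetical names, the allowed values are copied from the list above, and the ``"1:5"`` page slice is an assumed format that should be checked against the dedoc documentation. The loader call is left as a comment so the snippet runs without ``dedoc`` installed.

.. code-block:: python

    from typing import Any, Dict

    # Allowed string values, copied from the parameter list above.
    ALLOWED_VALUES = {
        "pdf_with_text_layer": {"true", "false", "tabby", "auto", "auto_tabby"},
        "is_one_column_document": {"true", "false", "auto"},
        "document_orientation": {"auto", "no_change"},
    }


    def build_pdf_kwargs(**options: Any) -> Dict[str, Any]:
        """Hypothetical helper: reject values outside the documented options."""
        for name, value in options.items():
            allowed = ALLOWED_VALUES.get(name)
            if allowed is not None and value not in allowed:
                raise ValueError(
                    f"{name}={value!r} is not one of {sorted(allowed)}"
                )
        return dict(options)


    kwargs = build_pdf_kwargs(
        pdf_with_text_layer="auto_tabby",
        is_one_column_document="auto",
        document_orientation="no_change",
        pages="1:5",  # assumed slice format; verify in the dedoc docs
    )

    # With dedoc installed, the checked options could then be passed on:
    # loader = DedocPDFLoader(file_path="example.pdf", **kwargs)

Parameters without a fixed option set, such as ``language`` (whose list can be extended) or ``pages``, pass through unchecked.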
