# DedocPDFLoader

> **Class** in `langchain_community`

📖 [View in docs](https://reference.langchain.com/python/langchain-community/document_loaders/pdf/DedocPDFLoader)

`DedocPDFLoader` is a document loader integration that loads PDF files using `dedoc`.
The loader can automatically detect whether the textual layer of a PDF document is correct.
Note that its `__init__` method accepts parameters that differ from those of
`DedocBaseLoader`.

## Signature

```python
DedocPDFLoader(
    file_path: str,
    *,
    split: str = 'document',
    with_tables: bool = True,
    with_attachments: Union[str, bool] = False,
    recursion_deep_attachments: int = 10,
    pdf_with_text_layer: str = 'auto_tabby',
    language: str = 'rus+eng',
    pages: str = ':',
    is_one_column_document: str = 'auto',
    document_orientation: str = 'auto',
    need_header_footer_analysis: Union[str, bool] = False,
    need_binarization: Union[str, bool] = False,
    need_pdf_table_analysis: Union[str, bool] = True,
    delimiter: Optional[str] = None,
    encoding: Optional[str] = None,
)
```

## Description

**Setup:**

Install the ``dedoc`` package.

.. code-block:: bash

    pip install -U dedoc

**Instantiate:**

.. code-block:: python

    from langchain_community.document_loaders import DedocPDFLoader

    loader = DedocPDFLoader(
        file_path="example.pdf",
        # split=...,
        # with_tables=...,
        # pdf_with_text_layer=...,
        # pages=...,
        # ...
    )

**Load:**

.. code-block:: python

    docs = loader.load()
    print(docs[0].page_content[:100])
    print(docs[0].metadata)

.. code-block:: python

    Some text
    {
        'file_name': 'example.pdf',
        'file_type': 'application/pdf',
        # ...
    }

**Lazy load:**

.. code-block:: python

    docs = []
    docs_lazy = loader.lazy_load()

    for doc in docs_lazy:
        docs.append(doc)
    print(docs[0].page_content[:100])
    print(docs[0].metadata)

.. code-block:: python

    Some text
    {
        'file_name': 'example.pdf',
        'file_type': 'application/pdf',
        # ...
    }

Parameters used for document parsing via `dedoc`
(https://dedoc.readthedocs.io/en/latest/parameters/pdf_handling.html):

- `with_attachments`: enable extraction of attached files.
- `recursion_deep_attachments`: recursion level for extraction of attached files;
  works only when `with_attachments==True`.
- `pdf_with_text_layer`: type of handler for parsing; available options:
  `"true"`, `"false"`, `"tabby"`, `"auto"`, `"auto_tabby"` (default).
- `language`: language of the document for PDFs without a textual layer;
  available options: `"eng"`, `"rus"`, `"rus+eng"` (default). The list of
  languages can be extended, see
  https://dedoc.readthedocs.io/en/latest/tutorials/add_new_language.html
- `pages`: page slice that defines the reading range for parsing.
- `is_one_column_document`: detect the number of columns for PDFs without a textual
  layer; available options: `"true"`, `"false"`, `"auto"` (default).
- `document_orientation`: fix document orientation (90, 180, 270 degrees) for PDFs
  without a textual layer; available options: `"auto"` (default), `"no_change"`.
- `need_header_footer_analysis`: remove headers and footers from the output.
- `need_binarization`: clean the page background (binarize) for PDFs without a textual
  layer.
- `need_pdf_table_analysis`: parse tables for PDFs without a textual layer.
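
As a rough illustration of how these options combine, the sketch below collects keyword arguments for a scanned PDF and checks the string options against the values listed above before the loader is created. `ALLOWED`, `build_scan_kwargs`, and `make_loader` are hypothetical names, not part of `langchain_community` or `dedoc`. The chosen defaults are assumptions for the example, in particular that `pdf_with_text_layer="false"` is the appropriate handler for a PDF without a textual layer. `language` is not validated because the list of languages can be extended.

```python
# Hypothetical helpers; the allowed values are copied from the parameter list above.
ALLOWED = {
    "pdf_with_text_layer": {"true", "false", "tabby", "auto", "auto_tabby"},
    "is_one_column_document": {"true", "false", "auto"},
    "document_orientation": {"auto", "no_change"},
}


def build_scan_kwargs(**overrides):
    """Return keyword arguments for a PDF assumed to have no textual layer."""
    kwargs = {
        "pdf_with_text_layer": "false",  # assumption: treat the file as a scan
        "language": "eng",
        "is_one_column_document": "auto",
        "document_orientation": "auto",
        "need_binarization": True,
    }
    kwargs.update(overrides)
    # Reject values that are not in the documented option lists.
    for name, allowed in ALLOWED.items():
        if kwargs[name] not in allowed:
            raise ValueError(
                f"{name}={kwargs[name]!r} is not one of {sorted(allowed)}"
            )
    return kwargs


def make_loader(file_path, **overrides):
    """Create a DedocPDFLoader with validated options (requires dedoc)."""
    # Imported lazily so build_scan_kwargs works without dedoc installed.
    from langchain_community.document_loaders import DedocPDFLoader

    return DedocPDFLoader(file_path, **build_scan_kwargs(**overrides))
```

For example, `make_loader("example.pdf", language="rus+eng")` would pass the validated options to the loader, while a misspelled option value raises `ValueError` before any parsing starts.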

## Extends

- `DedocBaseLoader`

---

[View source on GitHub](https://github.com/langchain-ai/langchain-community/blob/4b280287bd55b99b44db2dd849f02d66c89534d5/libs/community/langchain_community/document_loaders/pdf.py#L1199)