# DedocBaseLoader

> **Class** in `langchain_community`

📖 [View in docs](https://reference.langchain.com/python/langchain-community/document_loaders/dedoc/DedocBaseLoader)

Base Loader that uses `dedoc` (https://dedoc.readthedocs.io).

The loader extracts text, tables, and attached files from the given file:

* `Text` can be split by pages, `dedoc` tree nodes, or textual lines
  (according to the `split` parameter).
* `Attached files` (when `with_attachments=True`) are split according to the
  `split` parameter. For attachments, the langchain `Document` object has an
  additional metadata field `type="attachment"`.
* `Tables` (when `with_tables=True`) are not split: each table corresponds to
  one langchain `Document` object. For tables, the `Document` object has the
  additional metadata fields `type="table"` and `text_as_html`, which holds
  the table's HTML representation.
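
Because tables and attachments are marked through the `type` metadata field, the loader's output can be sorted into groups after loading. The following is a minimal sketch of that idea, not part of the library: the `Document` dataclass is a hypothetical stand-in that mirrors only the `page_content` and `metadata` attributes of the langchain class, `group_by_type` is an illustrative helper, and the sample data is hand-written rather than real loader output. It also assumes that plain text documents carry no `type` value of `"table"` or `"attachment"`.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Document:
    """Hypothetical stand-in for langchain's Document (page_content + metadata)."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def group_by_type(docs: Iterable[Document]) -> Dict[str, List[Document]]:
    """Bucket documents into 'text', 'table' and 'attachment' groups."""
    groups: Dict[str, List[Document]] = {"text": [], "table": [], "attachment": []}
    for doc in docs:
        kind = doc.metadata.get("type")
        # Assumption: anything not marked as a table or attachment is plain text.
        key = kind if kind in ("table", "attachment") else "text"
        groups[key].append(doc)
    return groups


# Hand-written sample data, not real loader output.
sample = [
    Document("Intro paragraph"),
    Document(
        "a | b",
        {"type": "table", "text_as_html": "<table><tr><td>a</td><td>b</td></tr></table>"},
    ),
    Document("Attached note", {"type": "attachment"}),
]
groups = group_by_type(sample)
```

With real loader output, the same helper could be applied to the documents yielded by `lazy_load()`, for example to render `text_as_html` only for the table group.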

## Signature

```python
DedocBaseLoader(
    self,
    file_path: str,
    *,
    split: str = 'document',
    with_tables: bool = True,
    with_attachments: Union[str, bool] = False,
    recursion_deep_attachments: int = 10,
    pdf_with_text_layer: str = 'auto_tabby',
    language: str = 'rus+eng',
    pages: str = ':',
    is_one_column_document: str = 'auto',
    document_orientation: str = 'auto',
    need_header_footer_analysis: Union[str, bool] = False,
    need_binarization: Union[str, bool] = False,
    need_pdf_table_analysis: Union[str, bool] = True,
    delimiter: Optional[str] = None,
    encoding: Optional[str] = None,
)
```

## Parameters

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `file_path` | `str` | Yes | path to the file for processing |
| `split` | `str` | No | How the document is split into parts (each part is returned separately). `"document"`: the document text is returned as a single langchain Document object (no splitting); `"page"`: the text is split into pages (works for PDF, DJVU, PPTX, PPT, ODP); `"node"`: the text is split into tree nodes (title nodes, list item nodes, raw text nodes); `"line"`: the text is split into lines (default: `'document'`) |
| `with_tables` | `bool` | No | add tables to the result - each table is returned as a single langchain Document object (default: `True`) |
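
As a rough illustration of how these parameters might be passed, the sketch below checks `split` against the four documented values before building keyword arguments. `build_loader_kwargs` and `make_loader` are hypothetical helpers, and `"example.pdf"` is a placeholder path. Since this class extends `ABC`, the sketch assumes a concrete subclass is used instead; it picks `DedocFileLoader` from `langchain_community.document_loaders`, which requires both `langchain-community` and `dedoc` to be installed. The loader call itself is left commented out for that reason.

```python
from typing import Any, Dict

# The four values listed for `split` in the parameter table.
VALID_SPLITS = ("document", "page", "node", "line")


def build_loader_kwargs(split: str = "document", with_tables: bool = True) -> Dict[str, Any]:
    """Hypothetical helper: check `split` early and return keyword arguments."""
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {VALID_SPLITS}, got {split!r}")
    return {"split": split, "with_tables": with_tables}


def make_loader(file_path: str, **kwargs: Any):
    """Create a concrete loader; needs `langchain-community` and `dedoc` installed."""
    # Imported lazily so the helper above works without the optional dependencies.
    from langchain_community.document_loaders import DedocFileLoader

    return DedocFileLoader(file_path, **kwargs)


kwargs = build_loader_kwargs(split="page", with_tables=False)

# Uncomment when the dependencies and a real file are available:
# loader = make_loader("example.pdf", **kwargs)  # placeholder path
# for doc in loader.lazy_load():
#     print(doc.metadata.get("type"), len(doc.page_content))
```

Validating `split` up front is a design choice of this sketch; the class also exposes a `valid_split_values` property, so the constructor may perform a similar check itself.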

## Extends

- `BaseLoader`
- `ABC`

## Constructors

```python
__init__(
    self,
    file_path: str,
    *,
    split: str = 'document',
    with_tables: bool = True,
    with_attachments: Union[str, bool] = False,
    recursion_deep_attachments: int = 10,
    pdf_with_text_layer: str = 'auto_tabby',
    language: str = 'rus+eng',
    pages: str = ':',
    is_one_column_document: str = 'auto',
    document_orientation: str = 'auto',
    need_header_footer_analysis: Union[str, bool] = False,
    need_binarization: Union[str, bool] = False,
    need_pdf_table_analysis: Union[str, bool] = True,
    delimiter: Optional[str] = None,
    encoding: Optional[str] = None,
) -> None
```

| Name | Type |
|------|------|
| `file_path` | `str` |
| `split` | `str` |
| `with_tables` | `bool` |
| `with_attachments` | `Union[str, bool]` |
| `recursion_deep_attachments` | `int` |
| `pdf_with_text_layer` | `str` |
| `language` | `str` |
| `pages` | `str` |
| `is_one_column_document` | `str` |
| `document_orientation` | `str` |
| `need_header_footer_analysis` | `Union[str, bool]` |
| `need_binarization` | `Union[str, bool]` |
| `need_pdf_table_analysis` | `Union[str, bool]` |
| `delimiter` | `Optional[str]` |
| `encoding` | `Optional[str]` |


## Properties

- `parsing_parameters`
- `valid_split_values`
- `split`
- `with_tables`
- `file_path`

## Methods

- [`lazy_load()`](https://reference.langchain.com/python/langchain-community/document_loaders/dedoc/DedocBaseLoader/lazy_load)

---

[View source on GitHub](https://github.com/langchain-ai/langchain-community/blob/4b280287bd55b99b44db2dd849f02d66c89534d5/libs/community/langchain_community/document_loaders/dedoc.py#L18)