| Name | Type | Description |
|---|---|---|
| file_path* | str | Path to the file for processing. |
| split | str | Default: 'document'. Type of document splitting into parts (each part is returned separately). "document": the document text is returned as a single langchain Document object (no splitting); "page": split the document text into pages (works for PDF, DJVU, PPTX, PPT, ODP); "node": split the document text into tree nodes (title nodes, list item nodes, raw text nodes); "line": split the document text into lines. |
| with_tables | bool | Default: True. Add tables to the result; each table is returned as a single langchain Document object. |
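
The accepted values of split listed above can be checked before a loader is constructed. The following is a minimal sketch, not part of dedoc or langchain: `ALLOWED_SPLITS` and `build_loader_kwargs` are hypothetical names introduced here for illustration, and only the three parameters described in the table are covered.

```python
from typing import Any, Dict

# Split modes taken from the parameter table above.
ALLOWED_SPLITS = ("document", "page", "node", "line")


def build_loader_kwargs(
    file_path: str, split: str = "document", with_tables: bool = True
) -> Dict[str, Any]:
    """Hypothetical helper: validate arguments for a dedoc-based loader."""
    if split not in ALLOWED_SPLITS:
        raise ValueError(f"split must be one of {ALLOWED_SPLITS}, got {split!r}")
    # Defaults mirror the table: split="document", with_tables=True.
    return {"file_path": file_path, "split": split, "with_tables": with_tables}


kwargs = build_loader_kwargs("report.pdf", split="page")
```

The resulting dictionary could then be unpacked into a concrete loader subclass, for example `DedocFileLoader(**kwargs)`, assuming that class is available in the installed langchain-community package.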
| Name | Type |
|---|---|
| file_path | str |
| split | str |
| with_tables | bool |
| with_attachments | Union[str, bool] |
| recursion_deep_attachments | int |
| pdf_with_text_layer | str |
| language | str |
| pages | str |
| is_one_column_document | str |
| document_orientation | str |
| need_header_footer_analysis | Union[str, bool] |
| need_binarization | Union[str, bool] |
| need_pdf_table_analysis | Union[str, bool] |
| delimiter | Optional[str] |
| encoding | Optional[str] |
Base Loader that uses dedoc (https://dedoc.readthedocs.io).
The loader extracts text, tables, and attached files from the given file:
* Text can be split by pages, dedoc tree nodes, or textual lines (according to the split parameter).
* Attached files (when with_attachments=True) are split according to the split parameter. For attachments, the langchain Document object has an additional metadata field type="attachment".
* Tables (when with_tables=True) are not split: each table corresponds to one langchain Document object. For tables, the Document object has additional metadata fields type="table" and text_as_html with the table's HTML representation.
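
Because tables and attachments are marked through the type metadata field, the loader output can be separated after loading. The sketch below is illustrative only: `Doc` is a stand-in dataclass for a langchain Document, `group_by_type` is a hypothetical helper, and treating every document without type="table" or type="attachment" as plain text is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Stand-in for a langchain Document (assumption for this sketch)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def group_by_type(docs: List[Doc]) -> Dict[str, List[Doc]]:
    """Hypothetical helper: bucket documents by their metadata 'type' field."""
    groups: Dict[str, List[Doc]] = {"text": [], "table": [], "attachment": []}
    for doc in docs:
        kind = doc.metadata.get("type")
        # Assumption: anything not marked as table or attachment is main text.
        key = kind if kind in ("table", "attachment") else "text"
        groups[key].append(doc)
    return groups


# Made-up sample data shaped like the loader output described above.
sample = [
    Doc("Introduction"),
    Doc("a b", {"type": "table", "text_as_html": "<table></table>"}),
    Doc("Appendix text", {"type": "attachment"}),
]
groups = group_by_type(sample)
```

Tables keep their HTML form in `metadata["text_as_html"]`, so a caller can render them separately from the plain text.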