Class · Since v0.3

DedocBaseLoader

DedocBaseLoader(
  self,
  file_path: str,
  *,
  split: str = 'document',

Bases

BaseLoader, ABC

Inherited from BaseLoader (langchain_core)

Methods: load, aload, load_and_split, alazy_load

Parameters

Name	Type	Description
`file_path`*	`str`	Path to the file to process.
`split`	`str`	Default: `'document'`. How the document is split into parts (each part is returned separately). "document": the text is returned as a single langchain Document object (no splitting); "page": the text is split into pages (works for PDF, DJVU, PPTX, PPT, ODP); "node": the text is split into tree nodes (title nodes, list item nodes, raw text nodes); "line": the text is split into lines.
`with_tables`	`bool`	Default: `True`. Add tables to the result; each table is returned as a single langchain Document object.
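
The parameters above can be illustrated with a minimal sketch. It makes two assumptions that are not stated on this page: that `DedocFileLoader` from `langchain_community.document_loaders` is a concrete subclass that accepts these arguments, and that the `dedoc` package is installed. `build_loader_kwargs` and `load_with_dedoc` are hypothetical helper names used only for illustration.

```python
from typing import Any, Dict

# The four split modes listed in the parameter table.
VALID_SPLITS = ("document", "page", "node", "line")


def build_loader_kwargs(split: str = "document", with_tables: bool = True) -> Dict[str, Any]:
    """Hypothetical helper: check `split` before handing it to a loader."""
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {VALID_SPLITS}, got {split!r}")
    return {"split": split, "with_tables": with_tables}


def load_with_dedoc(file_path: str, **kwargs: Any):
    """Load a file through a dedoc-based loader.

    Assumption: DedocFileLoader is the concrete subclass to use here;
    it needs the optional `dedoc` dependency at runtime.
    """
    # Imported lazily so build_loader_kwargs works without the dependency.
    from langchain_community.document_loaders import DedocFileLoader

    loader = DedocFileLoader(file_path, **build_loader_kwargs(**kwargs))
    return loader.load()
```

Validating `split` up front is a design choice of this sketch, not documented behavior of the loader itself.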

constructor

__init__

Name	Type
file_path	str
split	str
with_tables	bool
with_attachments	Union[str, bool]
recursion_deep_attachments	int
pdf_with_text_layer	str
language	str
pages	str
is_one_column_document	str
document_orientation	str
need_header_footer_analysis	Union[str, bool]
need_binarization	Union[str, bool]
need_pdf_table_analysis	Union[str, bool]
delimiter	Optional[str]
encoding	Optional[str]

Base Loader that uses dedoc (https://dedoc.readthedocs.io).

Loader enables extracting text, tables and attached files from the given file:

* Text can be split by pages, dedoc tree nodes, textual lines (according to the split parameter).
* Attached files (when with_attachments=True) are split according to the split parameter. For attachments, langchain Document object has an additional metadata field type="attachment".
* Tables (when with_tables=True) are not split - each table corresponds to one langchain Document object. For tables, Document object has additional metadata fields type="table" and text_as_html with table HTML representation.
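
Because tables and attachments are marked through the `type` metadata field, the returned documents can be separated after loading. The following is a rough sketch: `Doc` is a stand-in for the langchain Document object so that the example runs without dependencies, and `partition_by_type` is a hypothetical helper. Treating any document without a recognized `type` value as plain text is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Stand-in for a langchain Document (page_content plus metadata)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def partition_by_type(docs: List[Doc]) -> Dict[str, List[Doc]]:
    """Group documents into text, table and attachment buckets."""
    groups: Dict[str, List[Doc]] = {"text": [], "table": [], "attachment": []}
    for doc in docs:
        kind = doc.metadata.get("type")
        # Assumption: anything not tagged "table" or "attachment" is main text.
        key = kind if kind in ("table", "attachment") else "text"
        groups[key].append(doc)
    return groups


# Illustrative data only; a real loader would produce these documents.
sample = [
    Doc("Introduction"),
    Doc("a b c", {"type": "table", "text_as_html": "<table></table>"}),
    Doc("Appendix text", {"type": "attachment"}),
]
groups = partition_by_type(sample)
table_html = [d.metadata["text_as_html"] for d in groups["table"]]
```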
