DedocAPIFileLoader

DedocAPIFileLoader(
  self,
  file_path: str,
  *,
  url: str = 'http://0.0.0.0:1231',
  ...
)

Bases

DedocBaseLoader

Inherited from DedocBaseLoader

Attributes

- `parsing_parameters: dict`
- `valid_split_values: set`
- `split`
- `with_tables`

Parameters

.. list-table::
   :header-rows: 1

   * - Name
     - Type
     - Description
   * - `file_path` (required)
     - `str`
     - Path to the file to process.
   * - `url`
     - `str`
     - Default: `'http://0.0.0.0:1231'`. URL of the `dedoc` API.
   * - `split`
     - `str`
     - Default: `'document'`. How the document is split into parts; each part is returned separately.

       * `"document"`: the document is returned as a single langchain `Document` object (no splitting)
       * `"page"`: split the document into pages (works for PDF, DJVU, PPTX, PPT, ODP)
       * `"node"`: split the document into tree nodes (title nodes, list item nodes, raw text nodes)
       * `"line"`: split the document into lines
   * - `with_tables`
     - `bool`
     - Default: `True`.
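
The four documented `split` values can be checked before a loader is built. The sketch below is only an illustration: `VALID_SPLIT_VALUES` and `check_split` are hypothetical names, not part of the loader, and this page does not say whether or how the loader validates the value itself.

```python
# Values documented above for the `split` parameter.
VALID_SPLIT_VALUES = {"document", "page", "node", "line"}


def check_split(split: str = "document") -> str:
    """Return `split` unchanged if it is a documented value, else raise.

    Hypothetical helper for illustration; not part of DedocAPIFileLoader.
    """
    if split not in VALID_SPLIT_VALUES:
        raise ValueError(
            f"split must be one of {sorted(VALID_SPLIT_VALUES)}, got {split!r}"
        )
    return split


print(check_split("page"))
```

Failing early with a clear message may be easier to debug than an error that surfaces only after the file has been sent to the API.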

constructor

__init__

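.. list-table::
   :header-rows: 1

   * - Name
     - Type
   * - `file_path`
     - `str`
   * - `url`
     - `str`
   * - `split`
     - `str`
   * - `with_tables`
     - `bool`
   * - `with_attachments`
     - `Union[str, bool]`
   * - `recursion_deep_attachments`
     - `int`
   * - `pdf_with_text_layer`
     - `str`
   * - `language`
     - `str`
   * - `pages`
     - `str`
   * - `is_one_column_document`
     - `str`
   * - `document_orientation`
     - `str`
   * - `need_header_footer_analysis`
     - `Union[str, bool]`
   * - `need_binarization`
     - `Union[str, bool]`
   * - `need_pdf_table_analysis`
     - `Union[str, bool]`
   * - `delimiter`
     - `Optional[str]`
   * - `encoding`
     - `Optional[str]`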

Setup:

You don't need to install the `dedoc` library to use this loader. Instead, the `dedoc` API must be running. You can use a Docker container for this purpose. See the `dedoc` documentation for details: https://dedoc.readthedocs.io/en/latest/getting_started/installation.html#install-and-run-dedoc-using-docker

.. code-block:: bash

    # Download the dedoc image
    docker pull dedocproject/dedoc
    # Run it, publishing the API on port 1231
    docker run -p 1231:1231 dedocproject/dedoc
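
Because the loader depends on a running API, it can help to confirm that something is listening at the configured `url` before loading files. The following is a minimal sketch using only the standard library. `dedoc_api_reachable` is a hypothetical helper, not part of LangChain or dedoc, and it checks only that a TCP connection can be opened, not that the service behind the port is actually dedoc.

```python
import socket
from urllib.parse import urlsplit


def dedoc_api_reachable(url: str = "http://0.0.0.0:1231", timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the host and port in `url` succeeds.

    Hypothetical helper for illustration; it does not call any dedoc endpoint.
    """
    parts = urlsplit(url)
    host = parts.hostname or "localhost"
    port = parts.port or (443 if parts.scheme == "https" else 80)
    # 0.0.0.0 is a bind address; connect through the loopback address instead.
    if host == "0.0.0.0":
        host = "127.0.0.1"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


print(dedoc_api_reachable())
```

A port-level check avoids assuming anything about the API's routes; a stricter check would need an endpoint taken from the dedoc documentation.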

Instantiate:

.. code-block:: python

    from langchain_community.document_loaders import DedocAPIFileLoader

    loader = DedocAPIFileLoader(
        file_path="example.pdf",
        # url=...,
        # split=...,
        # with_tables=...,
        # pdf_with_text_layer=...,
        # pages=...,
        # ...
    )

Load:

.. code-block:: python

    docs = loader.load()
    print(docs[0].page_content[:100])
    print(docs[0].metadata)

.. code-block:: python

    Some text
    {
        'file_name': 'example.pdf',
        'file_type': 'application/pdf',
        # ...
    }

Lazy load:

.. code-block:: python

    docs = []
    docs_lazy = loader.lazy_load()

    for doc in docs_lazy:
        docs.append(doc)
    print(docs[0].page_content[:100])
    print(docs[0].metadata)

.. code-block:: python

    Some text
    {
        'file_name': 'example.pdf',
        'file_type': 'application/pdf',
        # ...
    }
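
Since `lazy_load()` returns an iterator, it does not have to be drained into a list. The sketch below takes only the first few items with `itertools.islice`. To stay runnable without a dedoc server, it uses `fake_lazy_load`, a hypothetical stand-in that yields plain dicts rather than real `Document` objects. How much work this saves with the real loader depends on how it produces its documents, which this page does not state.

```python
from itertools import islice
from typing import Iterator


def fake_lazy_load() -> Iterator[dict]:
    """Stand-in for loader.lazy_load(); yields plain dicts, not real Documents."""
    for i in range(1000):
        yield {
            "page_content": f"part {i}",
            "metadata": {"file_name": "example.pdf"},
        }


# Take only the first three parts; the generator is not advanced any further.
first_parts = list(islice(fake_lazy_load(), 3))
print([part["page_content"] for part in first_parts])
```

With a real loader, the same pattern would read `islice(loader.lazy_load(), 3)`.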
