| Name | Type | Description |
|---|---|---|
| file_path* | str | Path to the file for processing |
| url | str | Default: 'http://0.0.0.0:1231'. URL to call |
| split | str | Default: 'document'. Type of document splitting into parts (each part is returned separately). "document": the document is returned as a single langchain Document object (no splitting); "page": split the document into pages (works for PDF, DJVU, PPTX, PPT, ODP); "node": split the document into tree nodes (title nodes, list item nodes, raw text nodes); "line": split the document into lines |
| with_tables | bool | Default: True |

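The `split` options listed above can be checked before a loader is constructed. The following is a minimal sketch, not part of the loader itself: `make_loader_kwargs` and `SPLIT_MODES` are hypothetical names introduced here for illustration, and the defaults simply mirror the table.

```python
from typing import Any, Dict

# Split modes as listed in the parameter table; the descriptions paraphrase it.
SPLIT_MODES = {
    "document": "return the whole document as a single Document (no splitting)",
    "page": "split into pages (PDF, DJVU, PPTX, PPT, ODP)",
    "node": "split into tree nodes (title, list item, raw text)",
    "line": "split into lines",
}


def make_loader_kwargs(
    file_path: str,
    split: str = "document",
    with_tables: bool = True,
    url: str = "http://0.0.0.0:1231",
) -> Dict[str, Any]:
    """Hypothetical helper: validate `split` and collect keyword arguments."""
    if split not in SPLIT_MODES:
        allowed = ", ".join(sorted(SPLIT_MODES))
        raise ValueError(f"unknown split mode {split!r}; expected one of: {allowed}")
    return {
        "file_path": file_path,
        "url": url,
        "split": split,
        "with_tables": with_tables,
    }


kwargs = make_loader_kwargs("example.pdf", split="page")
print(kwargs)
```

The resulting dictionary could then be unpacked into the constructor, for example `DedocAPIFileLoader(**kwargs)`.
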
| Name | Type |
|---|---|
| file_path | str |
| url | str |
| split | str |
| with_tables | bool |
| with_attachments | Union[str, bool] |
| recursion_deep_attachments | int |
| pdf_with_text_layer | str |
| language | str |
| pages | str |
| is_one_column_document | str |
| document_orientation | str |
| need_header_footer_analysis | Union[str, bool] |
| need_binarization | Union[str, bool] |
| need_pdf_table_analysis | Union[str, bool] |
| delimiter | Optional[str] |
| encoding | Optional[str] |

Load files using the dedoc API.
The loader automatically detects the file type, even if the file has the wrong extension.
By default, the loader calls a locally hosted dedoc API.
More information about the dedoc API can be found in the dedoc documentation:
https://dedoc.readthedocs.io/en/latest/dedoc_api_usage/api.html
See the documentation of DedocBaseLoader for more details.
Setup:
You don't need to install the dedoc library to use this loader.
Instead, the dedoc API must be running.
You can use a Docker container for this purpose.
See the dedoc documentation for more details:
https://dedoc.readthedocs.io/en/latest/getting_started/installation.html#install-and-run-dedoc-using-docker

.. code-block:: bash

    docker pull dedocproject/dedoc
    docker run -p 1231:1231 dedocproject/dedoc

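Because the loader depends on a running API, it can be useful to confirm that something answers at the configured URL before loading files. The sketch below uses only the Python standard library. `is_api_reachable` is a hypothetical helper, not part of dedoc or langchain, and it only checks that an HTTP server responds, not that the server is actually dedoc.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def is_api_reachable(url: str = "http://0.0.0.0:1231", timeout: float = 3.0) -> bool:
    """Hypothetical helper: return True if an HTTP server answers at `url`."""
    try:
        with urlopen(url, timeout=timeout):
            return True
    except HTTPError:
        # The server answered with an error status, so it is reachable.
        return True
    except (OSError, ValueError):
        # Connection refused, timeout, DNS failure, or malformed URL.
        return False


if __name__ == "__main__":
    print(is_api_reachable(timeout=1.0))
```
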
Instantiate:

.. code-block:: python

    from langchain_community.document_loaders import DedocAPIFileLoader

    loader = DedocAPIFileLoader(
        file_path="example.pdf",
        # url=...,
        # split=...,
        # with_tables=...,
        # pdf_with_text_layer=...,
        # pages=...,
        # ...
    )

Load:

.. code-block:: python

    docs = loader.load()
    print(docs[0].page_content[:100])
    print(docs[0].metadata)

.. code-block:: python

    Some text
    {
        'file_name': 'example.pdf',
        'file_type': 'application/pdf',
        # ...
    }

Lazy load:

.. code-block:: python

    docs = []
    docs_lazy = loader.lazy_load()
    for doc in docs_lazy:
        docs.append(doc)
    print(docs[0].page_content[:100])
    print(docs[0].metadata)

.. code-block:: python

    Some text
    {
        'file_name': 'example.pdf',
        'file_type': 'application/pdf',
        # ...
    }

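The difference between eager and lazy loading can be illustrated without a running API. In this sketch `FakeLoader` is a hypothetical stand-in that yields plain strings instead of Document objects. It only demonstrates the general iterator pattern; how much work a real loader defers depends on its implementation.

```python
from itertools import islice
from typing import Iterator, List


class FakeLoader:
    """Hypothetical stand-in for a loader; not part of langchain or dedoc."""

    def __init__(self, n_parts: int) -> None:
        self.n_parts = n_parts
        self.produced = 0  # counts how many parts have been generated so far

    def lazy_load(self) -> Iterator[str]:
        # Parts are produced one at a time, only when the caller asks for them.
        for i in range(self.n_parts):
            self.produced += 1
            yield f"part {i}"

    def load(self) -> List[str]:
        # Eager loading materializes every part in a list.
        return list(self.lazy_load())


fake_loader = FakeLoader(n_parts=1000)
first_three = list(islice(fake_loader.lazy_load(), 3))
print(first_three)           # ['part 0', 'part 1', 'part 2']
print(fake_loader.produced)  # 3
```

Taking only the first few items from the iterator avoids building the full list, which matters most when a document is split into many parts (for example with `split="line"`).
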
with_tables: add tables to the result; each table is returned as a single langchain Document object.