Base Loader that uses dedoc (https://dedoc.readthedocs.io).
The loader extracts text, tables, and attached files from the given file:
* Text can be split by pages, dedoc tree nodes, textual lines
(according to the split parameter).
* Attached files (when with_attachments=True)
are split according to the split parameter.
For attachments, the langchain Document object has an additional metadata field
type="attachment".
* Tables (when with_tables=True) are not split - each table corresponds to one
langchain Document object.
For tables, the Document object has the additional metadata fields type="table"
and text_as_html, which holds the HTML representation of the table.
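
As a rough illustration of the metadata conventions described above, the sketch below groups documents by their "type" field. The Doc dataclass is a hypothetical stand-in for a langchain Document so that the snippet runs without dedoc installed; the helper name group_by_type, the "main" label for documents without a type field, and the sample data are assumptions made for this example only.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a langchain Document (content plus metadata)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_type(docs):
    """Group documents by metadata["type"].

    Assumption: documents without a "type" field are treated as main text
    and collected under the illustrative label "main".
    """
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.metadata.get("type", "main")].append(doc)
    return dict(groups)


# Illustrative sample data, not real loader output.
sample = [
    Doc("Introduction paragraph"),
    Doc(
        "Name Age",
        {
            "type": "table",
            "text_as_html": "<table><tr><td>Name</td><td>Age</td></tr></table>",
        },
    ),
    Doc("Text of an attached file", {"type": "attachment"}),
]

groups = group_by_type(sample)
for kind, items in groups.items():
    print(kind, len(items))
```

With real loader output, the same grouping would make it possible to render tables from text_as_html while handling main text and attachments separately.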
DedocFileLoader document loader integration to load files using dedoc.
The file loader detects the file type automatically, provided that the file has the correct extension.
The list of supported file types is given at https://dedoc.readthedocs.io/en/latest/index.html#id1.
Please see the documentation of DedocBaseLoader to get more details.
Load files using dedoc API.
The file loader detects the file type automatically, even when the file has the wrong extension.
By default, the loader makes a call to the locally hosted dedoc API.
More information about the dedoc API can be found in the dedoc documentation:
https://dedoc.readthedocs.io/en/latest/dedoc_api_usage/api.html
Please see the documentation of DedocBaseLoader to get more details.
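
A minimal usage sketch for the API-based loader follows. It assumes that DedocAPIFileLoader is importable from langchain_community.document_loaders and accepts the keyword arguments url, split, with_tables and with_attachments; the default URL "http://0.0.0.0:1231" and the split value "page" are likewise assumptions to be checked against the installed version. The wrapper load_via_dedoc_api and its loader_cls parameter are hypothetical names introduced here so that the call can be exercised without a running dedoc service.

```python
def load_via_dedoc_api(
    file_path,
    url="http://0.0.0.0:1231",  # assumed address of a locally hosted dedoc API
    split="page",               # assumed to be a valid split value
    with_attachments=False,
    loader_cls=None,
):
    """Load a file through a dedoc API loader and return its documents.

    loader_cls is a hypothetical hook for substituting another loader class;
    by default the langchain integration is imported lazily.
    """
    if loader_cls is None:
        # Imported here so that defining the function does not require langchain.
        from langchain_community.document_loaders import DedocAPIFileLoader

        loader_cls = DedocAPIFileLoader

    loader = loader_cls(
        file_path,
        url=url,
        split=split,
        with_tables=True,
        with_attachments=with_attachments,
    )
    return loader.load()
```

With a dedoc service running at the given address, a call such as load_via_dedoc_api("example.pdf") would be expected to return a list of Document objects; the file name is illustrative.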