Document Loaders are classes to load Documents.
Document Loaders are typically used to load many Documents in a single run.
Class hierarchy:
.. code-block::
BaseLoader --> <name>Loader # Examples: TextLoader, UnstructuredFileLoader
Main helpers:
.. code-block::
Document, <name>TextSplitter
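For orientation, a minimal usage sketch with TextLoader (the file path is illustrative, not part of the original docstrings):
.. code-block:: python
from langchain_community.document_loaders import TextLoader

# Load a plain-text file into a list of Document objects.
loader = TextLoader("example.txt", encoding="utf-8")
docs = loader.load()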
Load acreom vault from a directory.
Load with an Airbyte source connector implemented using the CDK.
Load from Gong using an Airbyte source connector.
Load from Hubspot using an Airbyte source connector.
Load from Salesforce using an Airbyte source connector.
Load from Shopify using an Airbyte source connector.
Load from Stripe using an Airbyte source connector.
Load from Typeform using an Airbyte source connector.
Load from Zendesk Support using an Airbyte source connector.
Load local Airbyte JSON files.
Load Airtable tables.
Load records from an ArcGIS FeatureLayer.
Load a query result from Arxiv.
The loader converts the original PDF format into plain text.
Load AssemblyAI audio transcripts.
It uses the AssemblyAI API to get an existing transcription and loads the transcribed text into one or more Documents, depending on the specified format.
Load AssemblyAI audio transcripts.
It uses the AssemblyAI API to transcribe audio files and loads the transcribed text into one or more Documents, depending on the specified format.
To use, you should have the assemblyai python package installed, and the
environment variable ASSEMBLYAI_API_KEY set with your API key.
Alternatively, the API key can also be passed as an argument.
Audio files can be specified via a URL or a local file path.
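A minimal sketch, assuming the assemblyai package is installed and ASSEMBLYAI_API_KEY is set; the audio URL is illustrative:
.. code-block:: python
from langchain_community.document_loaders import AssemblyAIAudioTranscriptLoader

# Transcribes the audio and returns the transcript as Document(s).
loader = AssemblyAIAudioTranscriptLoader("https://example.org/audio.mp3")
docs = loader.load()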
Load HTML asynchronously.
Load documents from AWS Athena.
Each document represents one row of the result.
By default, all columns are written into the page_content of the document
and none into the metadata. If metadata_columns are provided, those columns
are written into the metadata of the document while the rest of the columns
are written into the page_content of the document.
To authenticate, the AWS client uses the following method to automatically load credentials: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
If a specific credential profile should be used, you must pass the name of the profile from the ~/.aws/credentials file that is to be used.
Make sure the credentials / roles used have the required policies to access the Amazon Athena service.
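For illustration, a minimal AthenaLoader sketch; the query, database, S3 output URI, and profile name are placeholders:
.. code-block:: python
from langchain_community.document_loaders.athena import AthenaLoader

loader = AthenaLoader(
    query="SELECT * FROM my_table LIMIT 10;",  # illustrative query
    database="my_database",
    s3_output_uri="s3://my-bucket/athena-results/",
    profile_name="default",
)
docs = loader.load()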
Load AZLyrics webpages.
Load from Azure AI Data.
Load a bibtex file.
Each document represents one entry from the bibtex file.
If a PDF file is present in the file bibtex field, the original PDF
is loaded into the document text. If no such file entry is present,
the abstract field is used instead.
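A minimal sketch (the .bib path is illustrative):
.. code-block:: python
from langchain_community.document_loaders import BibtexLoader

# Each entry in the .bib file becomes one Document.
loader = BibtexLoader("./example_data/references.bib")
docs = loader.load()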
Load transcripts from BiliBili videos.
Load a Blackboard course.
This loader is not compatible with all Blackboard courses. It is only compatible with courses that use the new Blackboard interface. To use this loader, you must have the BbRouter cookie. You can get this cookie by logging into the course and then copying the value of the BbRouter cookie from the browser's developer tools.
Load blobs from a cloud URL or a file:// path.
Example:
.. code-block:: python
from langchain_community.document_loaders.blob_loaders import CloudBlobLoader

loader = CloudBlobLoader("s3://mybucket/id")
for blob in loader.yield_blobs():
    print(blob)
Load blobs in the local file system.
Example:
.. code-block:: python
from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader
loader = FileSystemBlobLoader("/path/to/directory")
for blob in loader.yield_blobs():
    print(blob)
Load YouTube URLs as audio file(s).
Load elements from a blockchain smart contract.
See supported blockchains here: https://python.langchain.com/v0.2/api_reference/community/document_loaders/langchain_community.document_loaders.blockchain.BlockchainType.html
If no BlockchainType is specified, the default is Ethereum mainnet.
The loader uses the Alchemy API to interact with the blockchain. The ALCHEMY_API_KEY environment variable must be set to use this loader.
The API returns 100 NFTs per request and can be paginated using the startToken parameter.
If get_all_tokens is set to True, the loader will fetch all tokens on the contract. Note that for contracts with a large number of tokens, this may take a long time (e.g., 10k tokens requires 100 requests). For this reason, the default value is False.
The max_execution_time (sec) can be set to limit the execution time of the loader.
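A minimal sketch, assuming ALCHEMY_API_KEY is set; the contract address is illustrative:
.. code-block:: python
from langchain_community.document_loaders.blockchain import (
    BlockchainDocumentLoader,
    BlockchainType,
)

# Fetch NFT metadata for a contract on Ethereum mainnet (the default chain).
loader = BlockchainDocumentLoader(
    contract_address="0x1a92f7381b9f03921564a437210bb9396471050c",
    blockchainType=BlockchainType.ETH_MAINNET,
)
docs = loader.load()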
Load with Brave Search engine.
Load pre-rendered web pages using a headless browser hosted on Browserbase.
Depends on browserbase and playwright packages.
Get your API key from https://browserbase.com
Load webpages with Browserless /content endpoint.
Load conversations from exported ChatGPT data.
Load CHM files using Unstructured.
CHM means Microsoft Compiled HTML Help.
from langchain_community.document_loaders import UnstructuredCHMLoader
loader = UnstructuredCHMLoader("example.chm")
docs = loader.load()
https://github.com/dottedmag/pychm http://www.jedrea.com/chmlib/
Scrape HTML pages from URLs using a headless instance of Chromium.
Load College Confidential webpages.
Load and parse Documents concurrently.
Load Confluence pages.
Port of https://llamahub.ai/l/confluence. This currently supports username/api_key, OAuth2 login, personal access token, and cookie-based authentication.
Specify a list of page_ids and/or a space_key to load the corresponding pages into Document objects; if both are specified, the union of both sets will be returned.
You can also specify a boolean include_attachments to include attachments; this
is set to False by default. If set to True, all attachments will be downloaded and
ConfluenceLoader will extract the text from the attachments and add it to the
Document object. Currently supported attachment types are: PDF, PNG, JPEG/JPG,
SVG, Word, and Excel.
The Confluence API supports different formats of page content. The storage format is the
raw XML representation for storage. The view format is the HTML representation for
viewing, with macros rendered as they appear to users. You can pass
an enum content_format argument to specify the content format; this is
set to ContentFormat.STORAGE by default. The supported values are:
ContentFormat.EDITOR, ContentFormat.EXPORT_VIEW,
ContentFormat.ANONYMOUS_EXPORT_VIEW, ContentFormat.STORAGE,
and ContentFormat.VIEW.
Hint: space_key and page_id can both be found in the URL of a page in Confluence
Load CoNLL-U files.
Load documents from Couchbase.
Each document represents one row of the result. The page_content_fields are
written into the page_content of the document. The metadata_fields are written
into the metadata of the document. By default, all columns are written into
the page_content and none into the metadata.
Load a CSV file into a list of Document objects.
Each document represents one row of the CSV file. Every row is converted into a key/value pair and outputted to a new line in the document's page_content.
The source for each document loaded from csv is set to the value of the
file_path argument for all documents by default.
You can override this by setting the source_column argument to the
name of a column in the CSV file.
The source of each document will then be set to the value of the column
with the name specified in source_column.
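A minimal sketch; the file and column names are placeholders:
.. code-block:: python
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(
    file_path="./example_data/articles.csv",
    source_column="url",  # optional: use this column as each Document's source
)
docs = loader.load()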
Load CSV files using Unstructured.
Like other Unstructured loaders, UnstructuredCSVLoader can be used in both "single" and "elements" mode. If you use the loader in "elements" mode, the CSV file will be a single Unstructured Table element, and an HTML representation of the table will be available in the "text_as_html" key in the document metadata.
from langchain_community.document_loaders.csv_loader import UnstructuredCSVLoader
loader = UnstructuredCSVLoader("stanley-cups.csv", mode="elements")
docs = loader.load()
Load Cube semantic layer metadata.
Load Datadog logs.
Logs are written into the page_content and into the metadata.
Load Pandas DataFrame.
Load files using dedoc API.
The file loader automatically detects the file type (even with the wrong extension).
By default, the loader makes a call to the locally hosted dedoc API.
More information about dedoc API can be found in dedoc documentation:
https://dedoc.readthedocs.io/en/latest/dedoc_api_usage/api.html
Please see the documentation of DedocBaseLoader to get more details.
DedocFileLoader document loader integration to load files using dedoc.
The file loader automatically detects the file type (with the correct extension). The list of supported file types is given at https://dedoc.readthedocs.io/en/latest/index.html#id1. Please see the documentation of DedocBaseLoader for more details.
Load Diffbot json file.
Load from a directory.
Load Discord chat logs.
Load a PDF with Azure Document Intelligence.
Load from Docusaurus Documentation.
It leverages the SitemapLoader to loop through the generated pages of a
Docusaurus Documentation website and extracts the content by looking for specific
HTML tags. By default, the parser searches for the main content of the Docusaurus
page, which is normally the <article> tag.
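A minimal sketch; the site URL is illustrative:
.. code-block:: python
from langchain_community.document_loaders import DocusaurusLoader

# Walks the site's sitemap and extracts the main content of each page.
loader = DocusaurusLoader("https://python.langchain.com")
docs = loader.load()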
Load files from Dropbox.
In addition to common files such as text and PDF files, it also supports Dropbox Paper files.
Load from DuckDB.
Each document represents one row of the result. The page_content_columns
are written into the page_content of the document. The metadata_columns
are written into the metadata of the document. By default, all columns
are written into the page_content and none into the metadata.
Loads Outlook Message files using extract_msg.
Load email files using Unstructured.
Works with both .eml and .msg files. You can process attachments in addition to the e-mail message itself by passing process_attachments=True into the constructor for the loader. By default, attachments will be processed with the unstructured partition function. If you already know the document types of the attachments, you can specify another partitioning function with the attachment_partitioner kwarg.
from langchain_community.document_loaders import UnstructuredEmailLoader
loader = UnstructuredEmailLoader("example_data/fake-email.eml", mode="elements") loader.load()
from langchain_community.document_loaders import UnstructuredEmailLoader
loader = UnstructuredEmailLoader( "example_data/fake-email-attachment.eml", mode="elements", process_attachments=True, ) loader.load()
Load EPub files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredEPubLoader
loader = UnstructuredEPubLoader(
    "example.epub", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-epub
Load transactions from Ethereum mainnet.
The loader uses the Etherscan API to interact with Ethereum mainnet.
The ETHERSCAN_API_KEY environment variable must be set to use this loader.
Document loader for EverNote ENEX export files.
Loads EverNote notebook export files (.enex format) into LangChain Documents.
Extracts plain text content from HTML and preserves note metadata including
titles, timestamps, and attachments. Uses secure XML parsing to prevent
vulnerabilities.
The loader supports two modes: loading all notes into a single document (the default) or loading each note as a separate document.
`Instructions for creating ENEX files <https://help.evernote.com/hc/en-us/articles/209005557-Export-notes-and-notebooks-as-ENEX-or-HTML>`__
Example:
.. code-block:: python
from langchain_community.document_loaders import EverNoteLoader
# Load all notes as a single document
loader = EverNoteLoader("my_notebook.enex")
documents = loader.load()
# Load each note as a separate document:
# documents = [ document1, document2, ... ]
loader = EverNoteLoader("my_notebook.enex", load_single_document=False)
documents = loader.load()
# Lazy loading for large files
for doc in loader.lazy_load():
    print(f"Title: {doc.metadata.get('title', 'Untitled')}")
    print(f"Content: {doc.page_content[:100]}...")
Load Microsoft Excel files using Unstructured.
Like other Unstructured loaders, UnstructuredExcelLoader can be used in both "single" and "elements" mode. If you use the loader in "elements" mode, each sheet in the Excel file will be an Unstructured Table element, and an HTML representation of the table will be available in the "text_as_html" key in the document metadata.
from langchain_community.document_loaders.excel import UnstructuredExcelLoader
loader = UnstructuredExcelLoader("stanley-cups.xlsx", mode="elements")
docs = loader.load()
Load Facebook Chat messages directory dump.
Load from FaunaDB.
Load Figma file.
FireCrawlLoader document loader integration
Load geopandas Dataframe.
Load Git repository files.
The repository can be local, on disk at repo_path, or remote, at clone_url,
in which case it will be cloned to repo_path.
Currently, only text files are supported.
Each document represents one file in the repository. The path points to
the local Git repository, and the branch specifies the branch to load
files from. By default, it loads from the main branch.
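A minimal sketch; the clone URL, local path, and branch are illustrative:
.. code-block:: python
from langchain_community.document_loaders import GitLoader

loader = GitLoader(
    clone_url="https://github.com/langchain-ai/langchain",  # illustrative repo
    repo_path="./example_data/langchain_repo/",
    branch="master",
)
docs = loader.load()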
Load GitBook data.
When load_all_paths=True, the loader parses XML sitemaps and requires the
lxml package to be installed (pip install lxml).
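A minimal sketch; the site URL is illustrative:
.. code-block:: python
from langchain_community.document_loaders import GitbookLoader

# load_all_paths=True walks the sitemap, which requires lxml.
loader = GitbookLoader("https://docs.gitbook.com", load_all_paths=True)
docs = loader.load()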
Load a GitHub file.
Load issues of a GitHub repository.
Load table schemas from AWS Glue.
This loader fetches the schema of each table within a specified AWS Glue database. The schema details include column names and their data types, similar to pandas dtype representation.
AWS credentials are automatically loaded using boto3, following the standard AWS method: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
If a specific AWS profile is required, it can be specified and will be used to establish the session.
Load from Gutenberg.org.
Load Hacker News data.
It loads data from either main page results or the comments page.
Load HTML files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader(
    "example.html", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-html
Load from Hugging Face Hub datasets.
Load model information from Hugging Face Hub, including README content.
This loader interfaces with the Hugging Face Models API to fetch and load model metadata and README files. The API allows you to search and filter models based on specific criteria such as model tags, authors, and more.
API URL: https://huggingface.co/api/models DOC URL: https://huggingface.co/docs/hub/en/api
Examples:
.. code-block:: python
from langchain_community.document_loaders import HuggingFaceModelLoader
# Initialize the loader with search criteria
loader = HuggingFaceModelLoader(search="bert", limit=10)
# Load models
documents = loader.load()
# Iterate through the fetched documents
for doc in documents:
print(doc.page_content) # README content of the model
print(doc.metadata) # Metadata of the model
Load iFixit repair guides, device wikis and answers.
iFixit is the largest open repair community on the web. The site contains nearly 100k repair manuals, 200k Questions & Answers on 42k devices, and all the data is licensed under CC-BY.
This loader allows you to download the text of repair guides, Q&As, and device wikis from iFixit using their open APIs and web scraping.
Load PNG and JPG files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredImageLoader
loader = UnstructuredImageLoader(
    "example.png", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-image
Load image captions.
By default, the loader utilizes the pre-trained Salesforce BLIP image captioning model. https://huggingface.co/Salesforce/blip-image-captioning-base
Load IMSDb webpages.
Load from IUGU.
Load notes from Joplin.
In order to use this loader, you need to have Joplin running with the Web Clipper enabled (look for "Web Clipper" in the app settings).
To get the access token, go to the Web Clipper options; under "Advanced Options" you will find the access token.
You can find more information about the Web Clipper service here: https://joplinapp.org/clipper/
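A minimal sketch, assuming Joplin is running locally with the Web Clipper service enabled; the token is a placeholder:
.. code-block:: python
from langchain_community.document_loaders import JoplinLoader

loader = JoplinLoader(access_token="<joplin-access-token>")
docs = loader.load()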
Load a JSON file using a jq schema.
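A minimal sketch, assuming the jq package is installed; the file path and schema are illustrative:
.. code-block:: python
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path="./example_data/chat.json",
    jq_schema=".messages[].content",  # jq expression selecting the text
)
docs = loader.load()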
Load from Kinetica API.
Each document represents one row of the result. The page_content_columns
are written into the page_content of the document. The metadata_columns
are written into the metadata of the document. By default, all columns
are written into the page_content and none into the metadata.
Load from lakeFS.
Load from LarkSuite (FeiShu).
Load Documents using LLMSherpa.
LLMSherpaFileLoader uses LayoutPDFReader, which is part of the LLMSherpa library. This tool is designed to parse PDFs while preserving their layout information, which is often lost when using most PDF-to-text parsers.
from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader
loader = LLMSherpaFileLoader(
    "example.pdf",
    strategy="chunks",
    llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()
Load Markdown files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
Load Mastodon 'toots'.
Load from Alibaba Cloud MaxCompute table.
Load MediaWiki dump from an XML file.
Merge documents from a list of loaders.
Parse MHTML files with BeautifulSoup.
Load from Modern Treasury.
Load MongoDB documents.
NeedleLoader is a document loader for managing documents stored in a collection.
Load news articles from URLs using Unstructured.
Load Jupyter notebook (.ipynb) files.
Load Notion directory dump.
Load from Notion DB.
Reads content from pages within a Notion Database.
Args:
    integration_token (str): Notion integration token.
    database_id (str): Notion database id.
    request_timeout_sec (int): Timeout for Notion requests in seconds. Defaults to 10.
    filter_object (Dict[str, Any]): Filter object used to limit returned entries based on specified criteria. E.g.: {"timestamp": "last_edited_time", "last_edited_time": {"on_or_after": "2024-02-07"}} will only return entries that were last edited on or after 2024-02-07. Notion docs: https://developers.notion.com/reference/post-database-query-filter. Defaults to None, which will return ALL entries.
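A minimal sketch; the token and database id are placeholders:
.. code-block:: python
from langchain_community.document_loaders import NotionDBLoader

loader = NotionDBLoader(
    integration_token="<notion-integration-token>",
    database_id="<notion-database-id>",
    request_timeout_sec=30,
)
docs = loader.load()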
Load from Huawei OBS directory.
Load from the Huawei OBS file.
Load Obsidian files from directory.
Load OpenOffice ODT files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredODTLoader
loader = UnstructuredODTLoader( "example.odt", mode="elements", strategy="fast", ) docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-odt
Load documents from Microsoft OneDrive.
Uses SharePointLoader under the hood.
Load a file from Microsoft OneDrive.
Load from Open City.
Load Org-Mode files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredOrgModeLoader
loader = UnstructuredOrgModeLoader( "example.org", mode="elements", strategy="fast", ) docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-org
Load PDF files from a local file system, HTTP or S3.
To authenticate, the AWS client uses the following methods to automatically load credentials: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
If a specific credential profile should be used, you must pass the name of the profile from the ~/.aws/credentials file that is to be used.
Make sure the credentials / roles used have the required policies to access the Amazon Textract service.
DedocPDFLoader document loader integration to load PDF files using dedoc.
The file loader can automatically detect the correctness of a textual layer in the
PDF document.
Note that the __init__ method supports parameters that differ from those of
DedocBaseLoader.
Load PDF files using Mathpix service.
Load online PDF.
Load and parse a PDF file using 'pdfminer.six' library.
This class provides methods to load and parse PDF documents, supporting various
configurations such as handling password-protected files, extracting images, and
defining extraction mode. It integrates the pdfminer.six library for PDF
processing and offers both synchronous and asynchronous document loading.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pdfminer.six
Instantiate the loader:
.. code-block:: python
from langchain_community.document_loaders import PDFMinerLoader
loader = PDFMinerLoader(
    file_path="./example_data/layout-parser-paper.pdf",
    # headers=None,
    # password=None,
    mode="single",
    pages_delimiter="\n\f",
    # extract_images=True,
    # images_to_text=convert_images_to_text_with_tesseract(),
)
Lazy load documents:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load documents asynchronously:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load PDF files as HTML content using PDFMiner.
Load PDF files using pdfplumber.
Load and parse a PDF file using 'PyMuPDF' library.
This class provides methods to load and parse PDF documents, supporting various
configurations such as handling password-protected files, extracting tables,
extracting images, and defining extraction mode. It integrates the PyMuPDF
library for PDF processing and offers both synchronous and asynchronous document
loading.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pymupdf
Instantiate the loader:
.. code-block:: python
from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader(
    file_path="./example_data/layout-parser-paper.pdf",
    # headers=None,
    # password=None,
    mode="single",
    pages_delimiter="\n\f",
    # extract_images=True,
    # images_parser=TesseractBlobParser(),
    # extract_tables="markdown",
    # extract_tables_settings=None,
)
Lazy load documents:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load documents asynchronously:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load and parse a directory of PDF files using 'pypdf' library.
This class provides methods to load and parse multiple PDF documents in a directory,
supporting options for recursive search, handling password-protected files,
extracting images, and defining extraction modes. It integrates the pypdf library
for PDF processing and offers synchronous document loading.
Load and parse a PDF file using the pypdfium2 library.
This class provides methods to load and parse PDF documents, supporting various
configurations such as handling password-protected files, extracting images, and
defining extraction mode.
It integrates the pypdfium2 library for PDF processing and offers both
synchronous and asynchronous document loading.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pypdfium2
Instantiate the loader:
.. code-block:: python
from langchain_community.document_loaders import PyPDFium2Loader
loader = PyPDFium2Loader(
    file_path="./example_data/layout-parser-paper.pdf",
    # headers=None,
    # password=None,
    mode="single",
    pages_delimiter="\n\f",
    # extract_images=True,
    # images_to_text=convert_images_to_text_with_tesseract(),
)
Lazy load documents:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load documents asynchronously:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load and parse a PDF file using 'pypdf' library.
This class provides methods to load and parse PDF documents, supporting various
configurations such as handling password-protected files, extracting images, and
defining extraction mode. It integrates the pypdf library for PDF processing and
offers both synchronous and asynchronous document loading.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pypdf
Instantiate the loader:
.. code-block:: python
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(
    file_path="./example_data/layout-parser-paper.pdf",
    # headers=None,
    # password=None,
    mode="single",
    pages_delimiter="\n\f",
    # extract_images=True,
    # images_parser=RapidOCRBlobParser(),
)
Lazy load documents:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load documents asynchronously:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load PDF files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredPDFLoader
loader = UnstructuredPDFLoader(
    "example.pdf", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-pdf
Pebblo Safe Loader class is a wrapper around document loaders enabling the data to be scrutinized.
Loader for text data.
Since PebbloSafeLoader is a wrapper around document loaders, this loader is used to load text data directly into Documents.
Load Polars DataFrame.
Load Microsoft PowerPoint files using Unstructured.
Works with both .ppt and .pptx files. You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredPowerPointLoader
loader = UnstructuredPowerPointLoader(
    "example.pptx", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-pptx
Load from Psychic.dev.
Load from the PubMed biomedical library.
Load PySpark DataFrames.
Load Python files, respecting any non-default encoding if specified.
Load ReadTheDocs documentation directory.
Recursively load all child links from a root URL.
Security Note: This loader is a crawler that will start crawling at a given URL and then expand to crawl child links recursively.
Web crawlers should generally NOT be deployed with network access
to any internal servers.
Control access to who can submit crawling requests and what network access
the crawler has.
While crawling, the crawler may encounter malicious URLs that would lead to a
server-side request forgery (SSRF) attack.
To mitigate risks, the crawler by default will only load URLs from the same
domain as the start URL (controlled via prevent_outside named argument).
This will mitigate the risk of SSRF attacks, but will not eliminate it.
For example, if crawling a host which hosts several sites:
https://some_host/alice_site/
https://some_host/bob_site/
A malicious URL on Alice's site could cause the crawler to make a malicious
GET request to an endpoint on Bob's site. Both sites are hosted on the
same host, so such a request would not be prevented by default.
See https://python.langchain.com/docs/security/
Setup:
This class has no required additional dependencies. You can optionally install
``beautifulsoup4`` for richer default metadata extraction:
.. code-block:: bash
pip install -U beautifulsoup4
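A minimal sketch; the start URL and depth are illustrative:
.. code-block:: python
from langchain_community.document_loaders import RecursiveUrlLoader

# prevent_outside defaults to True, keeping the crawl on the start domain.
loader = RecursiveUrlLoader("https://docs.python.org/3.13/", max_depth=2)
docs = loader.load()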
Load Reddit posts.
Read posts from a subreddit. First, you need to go to https://www.reddit.com/prefs/apps/ and create your application.
Load Roam files from a directory.
Load from a Rockset database.
To use, you should have the rockset python package installed.
Load news articles from RSS feeds using Unstructured.
Load RST files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredRSTLoader
loader = UnstructuredRSTLoader(
    "example.rst", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-rst
Load RTF files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredRTFLoader
loader = UnstructuredRTFLoader(
    "example.rtf", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-rtf
Load from Amazon AWS S3 directory.
Load from Amazon AWS S3 file.
Turn a URL into LLM-accessible markdown with Scrapfly.io.
For further details, visit: https://scrapfly.io/docs/sdk/python
Turn a URL into LLM-accessible markdown with ScrapingAnt.
For further details, visit: https://docs.scrapingant.com/python-client
Load from SharePoint.
Load a sitemap and its URLs.
Security Note: This loader can be used to load all URLs specified in a sitemap. If a malicious actor gets access to the sitemap, they could force the server to load URLs from other domains by modifying the sitemap. This could lead to server-side request forgery (SSRF) attacks; e.g., with the attacker forcing the server to load URLs from internal service endpoints that are not publicly accessible. While the attacker may not immediately gain access to this data, this data could leak into downstream systems (e.g., data loader is used to load data for indexing).
This loader is a crawler and web crawlers should generally NOT be deployed
with network access to any internal servers.
Control access to who can submit crawling requests and what network access
the crawler has.
By default, the loader will only load URLs from the same domain as the sitemap
if the sitemap is not a local file. This can be disabled by setting
restrict_to_same_domain to False (not recommended).
If the sitemap is a local file, no such risk mitigation is applied by default.
Use the filter_urls argument to limit which URLs can be loaded.
See https://python.langchain.com/docs/security
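A minimal sketch; the sitemap URL and filter are illustrative:
.. code-block:: python
from langchain_community.document_loaders.sitemap import SitemapLoader

loader = SitemapLoader(
    web_path="https://python.langchain.com/sitemap.xml",
    filter_urls=["https://python.langchain.com/docs/"],  # limit what gets loaded
)
docs = loader.load()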
Load from a Slack directory dump.
Load from Snowflake API.
Each document represents one row of the result. The page_content_columns
are written into the page_content of the document. The metadata_columns
are written into the metadata of the document. By default, all columns
are written into the page_content and none into the metadata.
Load web pages as Documents using Spider AI.
Must have the Python package spider-client installed and a Spider API key.
See https://spider.cloud for more.
Load from Spreedly API.
Load documents by querying database tables supported by SQLAlchemy.
For talking to the database, the document loader uses the SQLDatabase
utility from the LangChain integration toolkit.
Each document represents one row of the result.
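A minimal sketch; the connection URI and query are placeholders:
.. code-block:: python
from langchain_community.document_loaders.sql_database import SQLDatabaseLoader
from langchain_community.utilities.sql_database import SQLDatabase

db = SQLDatabase.from_uri("sqlite:///example.db")  # illustrative database
loader = SQLDatabaseLoader(query="SELECT * FROM documents LIMIT 10;", db=db)
docs = loader.load()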
Load .srt (subtitle) files.
Load from Stripe API.
Load SurrealDB documents.
Load Telegram chat json directory dump.
Load from Telegram chat dump.
Load from Tencent Cloud COS directory.
Load from Tencent Cloud COS file.
Load from TensorFlow Dataset.
Load text file.
Load documents from TiDB.
Load HTML using 2markdown API.
Load TOML files.
It can load a single source file or several files in a single directory.
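A minimal sketch; the file path is illustrative:
.. code-block:: python
from langchain_community.document_loaders import TomlLoader

loader = TomlLoader("./example_data/config.toml")
docs = loader.load()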
Load cards from a Trello board.
Load TSV files using Unstructured.
Like other Unstructured loaders, UnstructuredTSVLoader can be used in both "single" and "elements" mode. If you use the loader in "elements" mode, the TSV file will be a single Unstructured Table element, and an HTML representation of the table will be available in the "text_as_html" key in the document metadata.
from langchain_community.document_loaders.tsv import UnstructuredTSVLoader
loader = UnstructuredTSVLoader("stanley-cups.tsv", mode="elements")
docs = loader.load()
Load Twitter tweets.
Read tweets of the user's Twitter handle.
First, you need to go to
https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api
to get your token, and create a v2 version of the app.
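A minimal sketch; the bearer token and handle are placeholders:
.. code-block:: python
from langchain_community.document_loaders import TwitterTweetLoader

loader = TwitterTweetLoader.from_bearer_token(
    oauth2_bearer_token="<bearer-token>",
    twitter_users=["example_user"],
    number_tweets=50,
)
docs = loader.load()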
Load files from remote URLs using Unstructured.
Use the unstructured partition function to detect the MIME type and route the file to the appropriate partitioner.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredURLLoader
loader = UnstructuredURLLoader(
    urls=["<url-1>", "<url-2>"], mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition
Load HTML pages with Playwright and parse with Unstructured.
This is useful for loading pages that require javascript to render.
Load HTML pages with Selenium and parse with Unstructured.
This is useful for loading pages that require javascript to render.
Load weather data with Open Weather Map API.
Reads the forecast & current weather of any location using OpenWeatherMap's free API. Check out https://openweathermap.org/appid for more on how to generate a free OpenWeatherMap API key.
WebBaseLoader document loader integration
Load WhatsApp messages text file.
Load from Wikipedia.
The hard limit on the length of the query is 300 characters for now.
Each wiki page represents one Document.
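A minimal sketch; the query is illustrative:
.. code-block:: python
from langchain_community.document_loaders import WikipediaLoader

loader = WikipediaLoader(query="LangChain", load_max_docs=2)
docs = loader.load()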
Load a DOCX file using docx2txt and chunk at the character level.
Defaults to checking for a local file, but if the file is a web path, it will download it to a temporary file, use that, and then clean up the temporary file after completion.
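A minimal sketch; the file path is illustrative:
.. code-block:: python
from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("./example_data/fake.docx")
docs = loader.load()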
Load Microsoft Word file using Unstructured.
Works with both .docx and .doc files. You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredWordDocumentLoader
loader = UnstructuredWordDocumentLoader(
    "example.docx", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-docx
Load XML file using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredXMLLoader
loader = UnstructuredXMLLoader(
    "example.xml", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-xml
Load Xorbits DataFrame.
Generic Google API Client.
To use, you should have the google_auth_oauthlib, youtube_transcript_api, and google
Python packages installed.
As the Google API expects credentials, you need to set up a Google account and
register your service: https://developers.google.com/docs/api/quickstart/python
Security Note: Note that parsing of the transcripts relies on the standard xml library but the input is viewed as trusted in this case.
Load all Videos from a YouTube Channel.
To use, you should have the googleapiclient and youtube_transcript_api
Python packages installed.
As the service needs a google_api_client, you first have to initialize
the GoogleApiClient.
Additionally, you have to either provide a channel name or a list of video ids: https://developers.google.com/docs/api/quickstart/python
Load YouTube video transcripts.
Load documents from Yuque.
Load datasets from the Apify web scraping, crawling, and data extraction platform.
For details, see https://docs.apify.com/platform/integrations/langchain
Load from Azure Blob Storage container.
Load from Azure Blob Storage files.
Load from the Google Cloud Platform BigQuery.
Each document represents one row of the result. The page_content_columns
are written into the page_content of the document. The metadata_columns
are written into the metadata of the document. By default, all columns
are written into the page_content and none into the metadata.
Load from Docugami.
To use, you should have the dgml-utils python package installed.
Load from GCS directory.
Load from GCS file.
Loader for Google Cloud Speech-to-Text audio transcripts.
It uses the Google Cloud Speech-to-Text API to transcribe audio files and loads the transcribed text into one or more Documents, depending on the specified format.
To use, you should have the google-cloud-speech python package installed.
Audio files can be specified via a Google Cloud Storage uri or a local file path.
For a detailed explanation of Google Cloud Speech-to-Text, refer to the product documentation. https://cloud.google.com/speech-to-text
Load Google Docs from Google Drive.
Load from Oracle Autonomous Database (ADB).
Autonomous Database connection can be made by either connection_string
or tns name. wallet_location and wallet_password are required
for TLS connection.
Each document will represent one row of the query result.
Columns are written into the page_content, and the 'metadata' given in the
constructor is written into the document's 'metadata';
by default, 'metadata' is None.
Read documents using OracleDocLoader.
Args:
    conn: Oracle connection.
    params: Loader parameters.
Splitting text using Oracle chunker.
Send file-like objects with the unstructured-client SDK to the Unstructured API.
By default, the loader makes a call to the hosted Unstructured API. If you are running the unstructured API locally, you can change the API URL by passing in the url parameter when you initialize the loader. The hosted Unstructured API requires an API key. See the links below to learn more about our API offerings and get an API key.
You can run the loader in different modes: "single", "elements", and "paged". The default "single" mode will return a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText and return those as individual langchain Document objects. In addition to these post-processing modes (which are specific to the LangChain Loaders), Unstructured has its own "chunking" parameters for post-processing elements into more useful chunks for use cases such as Retrieval Augmented Generation (RAG). You can pass in additional unstructured kwargs to configure different unstructured settings.
from langchain_community.document_loaders import UnstructuredAPIFileIOLoader

with open("example.pdf", "rb") as f:
    loader = UnstructuredAPIFileIOLoader(
        f, mode="elements", strategy="fast", api_key="MY_API_KEY",
    )
    docs = loader.load()
https://docs.unstructured.io/api-reference/api-services/sdk https://docs.unstructured.io/api-reference/api-services/overview https://docs.unstructured.io/open-source/core-functionality/partitioning https://docs.unstructured.io/open-source/core-functionality/chunking
Load file-like objects opened in read mode using Unstructured.
The file loader uses the unstructured partition function and will automatically detect the file type. You can run the loader in different modes: "single", "elements", and "paged". The default "single" mode will return a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText and return those as individual langchain Document objects. In addition to these post-processing modes (which are specific to the LangChain Loaders), Unstructured has its own "chunking" parameters for post-processing elements into more useful chunks for use cases such as Retrieval Augmented Generation (RAG). You can pass in additional unstructured kwargs to configure different unstructured settings.
from langchain_community.document_loaders import UnstructuredFileIOLoader
with open("example.pdf", "rb") as f: loader = UnstructuredFileIOLoader( f, mode="elements", strategy="fast", ) docs = loader.load()
https://docs.unstructured.io/open-source/core-functionality/partitioning https://docs.unstructured.io/open-source/core-functionality/chunking
Load files using Unstructured.
The file loader uses the unstructured partition function and will automatically detect the file type. You can run the loader in different modes: "single", "elements", and "paged". The default "single" mode will return a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText and return those as individual langchain Document objects. In addition to these post-processing modes (which are specific to the LangChain Loaders), Unstructured has its own "chunking" parameters for post-processing elements into more useful chunks for use cases such as Retrieval Augmented Generation (RAG). You can pass in additional unstructured kwargs to configure different unstructured settings.
from langchain_community.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader(
    "example.pdf", mode="elements", strategy="fast",
)
docs = loader.load()
https://docs.unstructured.io/open-source/core-functionality/partitioning https://docs.unstructured.io/open-source/core-functionality/chunking
Loads .ipynb notebook files.
Loads data from OneNote Notebooks.
Pebblo's safe dataloader is a wrapper for document loaders.
Loads Word documents.
Loads rich text files.
Loader that uses unstructured to load HTML files.
Web base loader class.
Document Loader for ArcGIS FeatureLayers.
Loads YouTube transcript.
Base class for all loaders that uses O365 Package
Document loader for EverNote ENEX export files.
This module provides functionality to securely load and parse EverNote notebook
export files (.enex format) into LangChain Document objects.
Simple reader that reads weather data from the OpenWeatherMap API.
Document loader helpers.
Loader that loads data from a SharePoint Document Library.
Scrapfly Web Reader.
Loads Microsoft Excel files.
Loader that uses Playwright to load a page, then uses unstructured to parse html.
Load Documents from Docusaurus Documentation.
Loads RST files.
Loader that uses unstructured to load HTML files.
Loader that uses Selenium to load a page, then uses unstructured to load the html.
Loader that uses unstructured to load files.
ScrapingAnt Web Extractor.
Loads Microsoft Excel files.
Load files using Unstructured API.
By default, the loader makes a call to the hosted Unstructured API. If you are running the unstructured API locally, you can change the API URL by passing in the url parameter when you initialize the loader. The hosted Unstructured API requires an API key. See the links below to learn more about our API offerings and get an API key.
You can run the loader in different modes: "single", "elements", and "paged". The default "single" mode will return a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText and return those as individual langchain Document objects. In addition to these post-processing modes (which are specific to the LangChain Loaders), Unstructured has its own "chunking" parameters for post-processing elements into more useful chunks for use cases such as Retrieval Augmented Generation (RAG). You can pass in additional unstructured kwargs to configure different unstructured settings.
Examples
--------
from langchain_community.document_loaders import UnstructuredAPIFileLoader
loader = UnstructuredAPIFileLoader(
    "example.pdf", mode="elements", strategy="fast", api_key="MY_API_KEY",
)
docs = loader.load()
References
----------
https://docs.unstructured.io/api-reference/api-services/sdk
https://docs.unstructured.io/api-reference/api-services/overview
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking