Module (since v0.3)

registry

This module includes a registry of default parser configurations.

Functions

function

get_parser

Get a parser by parser name.
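
A name-based lookup like this is commonly built as a dictionary of factories. The following is a minimal standalone sketch of that pattern, not the library's implementation: ``EchoParser``, ``_SKETCH_REGISTRY``, and ``get_parser_sketch`` are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict


class EchoParser:
    """Hypothetical stand-in for a parser; it only decodes bytes as UTF-8."""

    def parse(self, data: bytes) -> str:
        return data.decode("utf-8")


# Map parser names to zero-argument factories so that a parser is only
# constructed when it is actually requested.
_SKETCH_REGISTRY: Dict[str, Callable[[], EchoParser]] = {
    "default": EchoParser,
}


def get_parser_sketch(parser_name: str) -> EchoParser:
    """Return a new parser for the given name, or raise if it is unknown."""
    if parser_name not in _SKETCH_REGISTRY:
        raise ValueError(f"Unknown parser name: {parser_name!r}")
    return _SKETCH_REGISTRY[parser_name]()


parser = get_parser_sketch("default")
text = parser.parse(b"hello")
```

Storing factories rather than instances keeps the lookup cheap and avoids importing or constructing heavy parsers until they are needed.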

Classes

class

MimeTypeBasedParser

Parser that uses mime-types to parse a blob.

This parser is useful for simple pipelines where the mime-type is sufficient to determine how to parse a blob.

To use, configure handlers based on mime-types and pass them to the initializer.

Example:

.. code-block:: python

    from langchain_community.document_loaders.parsers.generic import MimeTypeBasedParser

    parser = MimeTypeBasedParser(
        handlers={
            "application/pdf": ...,
        },
        fallback_parser=...,
    )
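
The dispatch idea behind a mime-type based parser can be sketched without the library. The snippet below is an illustration under stated assumptions, not the class's actual code: ``parse_by_mimetype`` and the ``Handler`` alias are hypothetical, and handlers are modelled as plain functions from bytes to text.

```python
from typing import Callable, Dict, Optional

# Hypothetical handler type: takes raw bytes and returns extracted text.
Handler = Callable[[bytes], str]


def parse_by_mimetype(
    data: bytes,
    mimetype: str,
    handlers: Dict[str, Handler],
    fallback: Optional[Handler] = None,
) -> str:
    """Pick the handler registered for the mime-type, else use the fallback."""
    handler = handlers.get(mimetype, fallback)
    if handler is None:
        # No matching handler and no fallback configured.
        raise ValueError(f"Unsupported mime type: {mimetype}")
    return handler(data)


handlers: Dict[str, Handler] = {
    "text/plain": lambda data: data.decode("utf-8"),
}

plain = parse_by_mimetype(b"hello", "text/plain", handlers)
other = parse_by_mimetype(
    b"raw", "application/octet-stream", handlers, fallback=lambda data: "<binary>"
)
```

In this sketch an unknown mime-type with no fallback raises ``ValueError``, so misconfigured pipelines fail loudly instead of silently skipping blobs.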

class

MsWordParser

Parse Microsoft Word documents from a blob.

class

PyMuPDFParser

Parse a blob from a PDF using the PyMuPDF library.

This class provides methods to parse a blob from a PDF document and supports several configuration options, such as handling password-protected PDFs, extracting images, and selecting the extraction mode. It uses the PyMuPDF library for PDF processing and offers synchronous blob parsing.

Examples:

   Setup:

   .. code-block:: bash

       pip install -U langchain-community pymupdf

   Load a blob from a PDF file:

   .. code-block:: python

       from langchain_core.documents.base import Blob

       blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

   Instantiate the parser:

   .. code-block:: python

       from langchain_community.document_loaders.parsers import PyMuPDFParser

       parser = PyMuPDFParser(
           # password = None,
           mode = "single",
           pages_delimiter = "\n\f",
           # images_parser = TesseractBlobParser(),
           # extract_tables="markdown",
           # extract_tables_settings=None,
           # text_kwargs=None,
       )

   Lazily parse the blob:

   .. code-block:: python

       docs = []
       docs_lazy = parser.lazy_parse(blob)

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)
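
To make the interaction between ``mode``, ``pages_delimiter``, and lazy iteration more concrete, here is a small standalone sketch. It is not the parser's implementation: ``lazy_parse_sketch`` is a hypothetical function that works on already-extracted page strings, and it assumes that "single" mode joins all pages into one result while a "page" mode yields one result per page.

```python
from typing import Iterator, List


def lazy_parse_sketch(
    pages: List[str],
    mode: str = "single",
    pages_delimiter: str = "\n\f",
) -> Iterator[str]:
    """Yield parsed text lazily; nothing runs until the caller iterates."""
    if mode == "single":
        # One result for the whole document, pages joined by the delimiter.
        yield pages_delimiter.join(pages)
    elif mode == "page":
        # One result per page, produced on demand.
        yield from pages
    else:
        raise ValueError(f"Unknown mode: {mode!r}")


pages = ["first page", "second page"]

# Materialize the lazy iterator into a list, as the loop above does.
single_docs = list(lazy_parse_sketch(pages, mode="single"))
page_docs = list(lazy_parse_sketch(pages, mode="page"))
```

Because the function is a generator, wrapping the call in ``list(...)`` is equivalent to appending each item in a ``for`` loop.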

class

TextParser

Parser for text blobs.
