pdf

Base Loader that uses dedoc (https://dedoc.readthedocs.io).

Loader enables extracting text, tables and attached files from the given file:

* Text can be split by pages, dedoc tree nodes, textual lines (according to the split parameter).
* Attached files (when with_attachments=True) are split according to the split parameter. For attachments, langchain Document object has an additional metadata field type="attachment".
* Tables (when with_tables=True) are not split - each table corresponds to one langchain Document object. For tables, Document object has additional metadata fields type="table" and text_as_html with table HTML representation.
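Because tables and attachments are distinguished only by the type metadata field, downstream code will often partition the loader output on that field. The following is a minimal sketch of that idea, not dedoc's or langchain's own code: ``Doc`` is a hypothetical stand-in for the langchain Document object (only ``page_content`` and ``metadata`` are modelled), ``partition_by_type`` is a hypothetical helper, and the sample data is made up.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a langchain Document (not the real class)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def partition_by_type(docs):
    """Group documents into text, tables and attachments via metadata["type"]."""
    groups = {"text": [], "table": [], "attachment": []}
    for doc in docs:
        kind = doc.metadata.get("type")
        # Assumption: anything not marked as table/attachment is plain text.
        key = kind if kind in ("table", "attachment") else "text"
        groups[key].append(doc)
    return groups


# Made-up sample output, shaped like the description above.
sample = [
    Doc("Introduction ...", {"source": "report.pdf"}),
    Doc("a | b", {"type": "table", "text_as_html": "<table>...</table>"}),
    Doc("Appendix text", {"type": "attachment"}),
]
groups = partition_by_type(sample)
print({name: len(items) for name, items in groups.items()})
```

Keeping tables in their own group makes it easy to use text_as_html for them while treating the remaining documents as ordinary text.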

Parse a blob from a PDF using pdfminer.six library.

This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs, extracting images, and defining extraction mode. It integrates the 'pdfminer.six' library for PDF processing and offers synchronous blob parsing.

Examples:

   Setup:

   .. code-block:: bash

       pip install -U langchain-community pdfminer.six pillow

   Load a blob from a PDF file:

   .. code-block:: python

       from langchain_core.documents.base import Blob

       blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

   Instantiate the parser:

   .. code-block:: python

       from langchain_community.document_loaders.parsers import PDFMinerParser

       parser = PDFMinerParser(
           # password = None,
           mode = "single",
           pages_delimiter = "\n\f",
           # extract_images = True,
           # images_to_text = convert_images_to_text_with_tesseract(),
       )

   Lazily parse the blob:

   .. code-block:: python

       docs = []
       docs_lazy = parser.lazy_parse(blob)

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)
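The ``mode`` and ``pages_delimiter`` options are related: as the names suggest, "single" mode yields one document for the whole file with the page texts joined by the delimiter, while "page" mode yields one document per page. The sketch below illustrates that idea on plain strings. ``combine_pages`` is a hypothetical helper, not part of langchain, and the real parsers may differ in detail.

```python
def combine_pages(page_texts, mode="single", pages_delimiter="\n\f"):
    """Return a list of document texts for the given extraction mode."""
    if mode == "page":
        # One output text per page; the delimiter is not used.
        return list(page_texts)
    if mode == "single":
        # One output text for the whole file, pages joined by the delimiter.
        return [pages_delimiter.join(page_texts)]
    raise ValueError(f"unsupported mode: {mode!r}")


pages = ["first page", "second page"]
single = combine_pages(pages, mode="single")
per_page = combine_pages(pages, mode="page")

# A single-mode text can be split back into pages on the same delimiter.
print(single[0].split("\n\f") == per_page)
```

Choosing a delimiter that does not occur in ordinary page text keeps the page boundaries recoverable after the pages have been merged.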

Parse a blob from a PDF using PyMuPDF library.

This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs, extracting images, and defining extraction mode. It integrates the 'PyMuPDF' library for PDF processing and offers synchronous blob parsing.

Examples:

   Setup:

   .. code-block:: bash

       pip install -U langchain-community pymupdf

   Load a blob from a PDF file:

   .. code-block:: python

       from langchain_core.documents.base import Blob

       blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

   Instantiate the parser:

   .. code-block:: python

       from langchain_community.document_loaders.parsers import PyMuPDFParser

       parser = PyMuPDFParser(
           # password = None,
           mode = "single",
           pages_delimiter = "\n\f",
           # images_parser = TesseractBlobParser(),
           # extract_tables="markdown",
           # extract_tables_settings=None,
           # text_kwargs=None,
       )

   Lazily parse the blob:

   .. code-block:: python

       docs = []
       docs_lazy = parser.lazy_parse(blob)

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)

Parse a blob from a PDF using PyPDFium2 library.

This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs, extracting images, and defining extraction mode. It integrates the 'PyPDFium2' library for PDF processing and offers synchronous blob parsing.

Examples:

   Setup:

   .. code-block:: bash

       pip install -U langchain-community pypdfium2

   Load a blob from a PDF file:

   .. code-block:: python

       from langchain_core.documents.base import Blob

       blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

   Instantiate the parser:

   .. code-block:: python

       from langchain_community.document_loaders.parsers import PyPDFium2Parser

       parser = PyPDFium2Parser(
           # password=None,
           mode="page",
           pages_delimiter="\n\f",
           # extract_images = True,
           # images_to_text = convert_images_to_text_with_tesseract(),
       )

   Lazily parse the blob:

   .. code-block:: python

       docs = []
       docs_lazy = parser.lazy_parse(blob)

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)

Parse a blob from a PDF using pypdf library.

This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs and extracting images. It integrates the 'pypdf' library for PDF processing and offers synchronous blob parsing.

Examples:

   Setup:

   .. code-block:: bash

       pip install -U langchain-community pypdf

   Load a blob from a PDF file:

   .. code-block:: python

       from langchain_core.documents.base import Blob

       blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

   Instantiate the parser:

   .. code-block:: python

       from langchain_community.document_loaders.parsers import PyPDFParser

       parser = PyPDFParser(
           # password = None,
           mode = "single",
           pages_delimiter = "\n\f",
           # images_parser = TesseractBlobParser(),
       )

   Lazily parse the blob:

   .. code-block:: python

       docs = []
       docs_lazy = parser.lazy_parse(blob)

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)

Load PDF files using Unstructured.

You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.

Examples

.. code-block:: python

    from langchain_community.document_loaders import UnstructuredPDFLoader

    loader = UnstructuredPDFLoader("example.pdf", mode="elements", strategy="fast")
    docs = loader.load()

References

https://unstructured-io.github.io/unstructured/bricks.html#partition-pdf

Base Loader class for PDF files.

If the file is a web path, it will download it to a temporary file, use it, then clean up the temporary file after completion.
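The download, use, then clean up pattern described here can be sketched with the standard library alone. This is an illustration under stated assumptions, not the loader's actual implementation: ``downloaded_copy`` is a hypothetical helper, and the demo uses a local ``file://`` URL so that it runs without network access.

```python
import shutil
import tempfile
import urllib.request
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def downloaded_copy(url, suffix=".pdf"):
    """Download `url` to a temporary file, yield its path, then delete it."""
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, tmp)
        tmp.close()  # flush to disk before handing the path out
        yield Path(tmp.name)
    finally:
        tmp.close()  # harmless if already closed
        Path(tmp.name).unlink(missing_ok=True)


# Demo with a local file:// URL standing in for a web path.
source = Path(tempfile.gettempdir()) / "downloaded_copy_demo.pdf"
source.write_bytes(b"%PDF-1.4 demo bytes")
with downloaded_copy(source.as_uri()) as local_path:
    copied = local_path.read_bytes()
    existed_inside = local_path.exists()
exists_after = local_path.exists()
source.unlink()
print(existed_inside, exists_after)
```

Putting the removal in a ``finally`` clause means the temporary file is deleted even when parsing raises an exception.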

Load and parse a PDF file using 'pypdf' library.

This class provides methods to load and parse PDF documents, supporting various configurations such as handling password-protected files, extracting images, and defining extraction mode. It integrates the pypdf library for PDF processing and offers both synchronous and asynchronous document loading.

Examples:

   Setup:

   .. code-block:: bash

       pip install -U langchain-community pypdf

   Instantiate the loader:

   .. code-block:: python

       from langchain_community.document_loaders import PyPDFLoader

       loader = PyPDFLoader(
           file_path = "./example_data/layout-parser-paper.pdf",
           # headers = None,
           # password = None,
           mode = "single",
           pages_delimiter = "\n\f",
           # extract_images = True,
           # images_parser = RapidOCRBlobParser(),
       )

   Lazy load documents:

   .. code-block:: python

       docs = []
       docs_lazy = loader.lazy_load()

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)

   Load documents asynchronously:

   .. code-block:: python

       docs = await loader.aload()
       print(docs[0].page_content[:100])
       print(docs[0].metadata)
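The bare ``await`` above only works where an event loop is already running, for example in a notebook cell or inside a coroutine. In a plain script the call has to be wrapped in a coroutine and driven with ``asyncio.run``. The sketch below shows that wrapper. ``FakeLoader`` is a hypothetical stand-in for the real loader so that the example runs without a PDF file; only the ``aload()`` call shape is taken from the example above.

```python
import asyncio


class FakeLoader:
    """Hypothetical stand-in exposing the same aload() call shape."""

    async def aload(self):
        # Pretend to do I/O, then return document-like dicts.
        await asyncio.sleep(0)
        return [{"page_content": "hello", "metadata": {"page": 0}}]


async def main():
    loader = FakeLoader()
    # Same call as in the documentation example, now inside a coroutine.
    return await loader.aload()


docs = asyncio.run(main())
print(docs[0]["page_content"][:100])
print(docs[0]["metadata"])
```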

Load and parse a PDF file using the pypdfium2 library.

This class provides methods to load and parse PDF documents, supporting various configurations such as handling password-protected files, extracting images, and defining extraction mode. It integrates the pypdfium2 library for PDF processing and offers both synchronous and asynchronous document loading.

Examples:

   Setup:

   .. code-block:: bash

       pip install -U langchain-community pypdfium2

   Instantiate the loader:

   .. code-block:: python

       from langchain_community.document_loaders import PyPDFium2Loader

       loader = PyPDFium2Loader(
           file_path = "./example_data/layout-parser-paper.pdf",
           # headers = None,
           # password = None,
           mode = "single",
           pages_delimiter = "\n\f",
           # extract_images = True,
           # images_to_text = convert_images_to_text_with_tesseract(),
       )

   Lazy load documents:

   .. code-block:: python

       docs = []
       docs_lazy = loader.lazy_load()

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)

   Load documents asynchronously:

   .. code-block:: python

       docs = await loader.aload()
       print(docs[0].page_content[:100])
       print(docs[0].metadata)

Load and parse a directory of PDF files using 'pypdf' library.

This class provides methods to load and parse multiple PDF documents in a directory, supporting options for recursive search, handling password-protected files, extracting images, and defining extraction modes. It integrates the pypdf library for PDF processing and offers synchronous document loading.

Load and parse a PDF file using 'pdfminer.six' library.

This class provides methods to load and parse PDF documents, supporting various configurations such as handling password-protected files, extracting images, and defining extraction mode. It integrates the pdfminer.six library for PDF processing and offers both synchronous and asynchronous document loading.

Examples:

   Setup:

   .. code-block:: bash

       pip install -U langchain-community pdfminer.six

   Instantiate the loader:

   .. code-block:: python

       from langchain_community.document_loaders import PDFMinerLoader

       loader = PDFMinerLoader(
           file_path = "./example_data/layout-parser-paper.pdf",
           # headers = None,
           # password = None,
           mode = "single",
           pages_delimiter = "\n\f",
           # extract_images = True,
           # images_to_text = convert_images_to_text_with_tesseract(),
       )

   Lazy load documents:

   .. code-block:: python

       docs = []
       docs_lazy = loader.lazy_load()

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)

   Load documents asynchronously:

   .. code-block:: python

       docs = await loader.aload()
       print(docs[0].page_content[:100])
       print(docs[0].metadata)

Load and parse a PDF file using 'PyMuPDF' library.

This class provides methods to load and parse PDF documents, supporting various configurations such as handling password-protected files, extracting tables, extracting images, and defining extraction mode. It integrates the PyMuPDF library for PDF processing and offers both synchronous and asynchronous document loading.

Examples:

   Setup:

   .. code-block:: bash

       pip install -U langchain-community pymupdf

   Instantiate the loader:

   .. code-block:: python

       from langchain_community.document_loaders import PyMuPDFLoader

       loader = PyMuPDFLoader(
           file_path = "./example_data/layout-parser-paper.pdf",
           # headers = None,
           # password = None,
           mode = "single",
           pages_delimiter = "\n\f",
           # extract_images = True,
           # images_parser = TesseractBlobParser(),
           # extract_tables = "markdown",
           # extract_tables_settings = None,
       )

   Lazy load documents:

   .. code-block:: python

       docs = []
       docs_lazy = loader.lazy_load()

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)

   Load documents asynchronously:

   .. code-block:: python

       docs = await loader.aload()
       print(docs[0].page_content[:100])
       print(docs[0].metadata)

Load PDF files from a local file system, HTTP or S3.

To authenticate, the AWS client uses the following methods to automatically load credentials: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html

If a specific credential profile should be used, you must pass the name of the profile from the ~/.aws/credentials file that is to be used.

Make sure the credentials / roles used have the required policies to access the Amazon Textract service.

DedocPDFLoader document loader integration to load PDF files using dedoc. The file loader can automatically detect the correctness of a textual layer in the PDF document. Note that the __init__ method supports parameters that differ from those of DedocBaseLoader.

Load files using Unstructured.

The file loader uses the unstructured partition function and will automatically detect the file type. You can run the loader in different modes: "single", "elements", and "paged". The default "single" mode will return a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText and return those as individual langchain Document objects. In addition to these post-processing modes (which are specific to the LangChain Loaders), Unstructured has its own "chunking" parameters for post-processing elements into more useful chunks for use cases such as Retrieval Augmented Generation (RAG). You can pass in additional unstructured kwargs to configure different unstructured settings.

Examples

.. code-block:: python

    from langchain_community.document_loaders import UnstructuredFileLoader

    loader = UnstructuredFileLoader("example.pdf", mode="elements", strategy="fast")
    docs = loader.load()
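To make the difference between the three post-processing modes concrete, the sketch below applies them to a few made-up elements. It is an illustration only: ``post_process`` is a hypothetical helper, the element records are plain dicts rather than unstructured objects, and the behaviour shown for "paged" (one text per page number) is an assumption, since the description above does not spell that mode out.

```python
from collections import defaultdict

# Made-up elements, shaped like (category, text, page_number) records.
elements = [
    {"category": "Title", "text": "Layout Parsing", "page_number": 1},
    {"category": "NarrativeText", "text": "We present a toolkit.", "page_number": 1},
    {"category": "NarrativeText", "text": "Results follow.", "page_number": 2},
]


def post_process(elements, mode="single"):
    """Return document texts for a given mode (illustrative only)."""
    if mode == "elements":
        # One text per element.
        return [el["text"] for el in elements]
    if mode == "paged":
        # Assumed behaviour: one text per page, elements kept in order.
        by_page = defaultdict(list)
        for el in elements:
            by_page[el["page_number"]].append(el["text"])
        return ["\n\n".join(by_page[page]) for page in sorted(by_page)]
    if mode == "single":
        # One text for the whole file.
        return ["\n\n".join(el["text"] for el in elements)]
    raise ValueError(f"unsupported mode: {mode!r}")


counts = {mode: len(post_process(elements, mode)) for mode in ("single", "elements", "paged")}
print(counts)
```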

References

https://docs.unstructured.io/open-source/core-functionality/partitioning

https://docs.unstructured.io/open-source/core-functionality/chunking

Send PDF files to Amazon Textract and parse them.

Multi-page PDFs must reside on S3 to be parsed.

The AmazonTextractPDFLoader calls the Amazon Textract Service to convert PDFs into a Document structure. Single and multi-page documents are supported, with up to 3000 pages and 512 MB in size.

For the call to be successful, an AWS account is required, similar to the AWS CLI requirements.

Apart from the AWS configuration, it works much like the other PDF loaders, and it additionally supports JPEG, PNG, TIFF and non-native PDF formats.

.. code-block:: python

    from langchain_community.document_loaders import AmazonTextractPDFLoader

    loader = AmazonTextractPDFLoader("example_data/alejandro_rosalez_sample-small.jpeg")
    documents = loader.load()

One feature is the linearization of the output. When using the features LAYOUT, FORMS or TABLES together with Textract

.. code-block:: python

    from langchain_community.document_loaders import AmazonTextractPDFLoader

    # you can mix and match each of the features
    loader = AmazonTextractPDFLoader(
        "example_data/alejandro_rosalez_sample-small.jpeg",
        textract_features=["TABLES", "LAYOUT"],
    )
    documents = loader.load()

it will generate output that formats the text in reading order and try to output the information in a tabular structure or output the key/value pairs with a colon (key: value). This helps most LLMs to achieve better accuracy when processing these texts.

Document objects are returned with metadata that includes the source and a 1-based index of the page number in page. Note that page represents the index of the result returned from Textract, not necessarily the as-written page number in the document.
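Since page is 1-based while Python list positions are 0-based, it can be safer to look documents up by the metadata value than by list index. A small sketch of that lookup follows. The documents are made-up dicts rather than real loader output, and the bucket and file names are hypothetical.

```python
# Made-up results; the bucket and file names are hypothetical.
documents = [
    {
        "page_content": "Cover sheet",
        "metadata": {"source": "s3://example-bucket/scan.pdf", "page": 1},
    },
    {
        "page_content": "Invoice table",
        "metadata": {"source": "s3://example-bucket/scan.pdf", "page": 2},
    },
]

# Index by the 1-based page value rather than by list position.
by_page = {doc["metadata"]["page"]: doc for doc in documents}

first = by_page[1]["page_content"]
# List positions are 0-based, so here the same document sits at index page - 1.
same = documents[1 - 1]["page_content"] == first
print(first, same)
```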

Document loader utilizing Zerox library: https://github.com/getomni-ai/zerox

Zerox converts a PDF document to a series of images (page-wise) and uses a vision-capable LLM to generate a Markdown representation.

Zerox utilizes async operations. Therefore, when using this loader inside a Jupyter Notebook (or any environment that already runs an event loop), you will need to:

    import nest_asyncio
    nest_asyncio.apply()
