This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs and extracting images. It integrates the 'pypdf' library for PDF processing and offers synchronous blob parsing.

Examples:
   Setup:

   .. code-block:: bash

       pip install -U langchain-community pypdf

   Load a blob from a PDF file:

   .. code-block:: python

       from langchain_core.documents.base import Blob

       blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

   Instantiate the parser:

   .. code-block:: python

       from langchain_community.document_loaders.parsers import PyPDFParser

       parser = PyPDFParser(
           # password = None,
           mode = "single",
           pages_delimiter = "\n\f",
           # images_parser = TesseractBlobParser(),
       )

   Lazily parse the blob:

   .. code-block:: python

       docs = []
       docs_lazy = parser.lazy_parse(blob)

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)
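The ``mode`` and ``pages_delimiter`` arguments interact: in ``"single"`` mode the whole PDF becomes one document whose pages are joined by the delimiter, while ``"page"`` mode produces one document per page. The following is a minimal sketch of that behavior using only the standard library. The function ``lazy_parse_pages`` and the plain-dict documents are hypothetical stand-ins for illustration, not the library's implementation, and the newline plus form feed delimiter is an assumed value.

```python
from typing import Dict, Iterator, List

# Assumed delimiter: newline followed by a form feed character.
PAGES_DELIMITER = "\n\f"


def lazy_parse_pages(
    pages: List[str],
    mode: str = "single",
    pages_delimiter: str = PAGES_DELIMITER,
) -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a parser's lazy_parse over extracted page texts."""
    if mode == "single":
        # One document for the whole PDF; pages are joined by the delimiter.
        yield {"page_content": pages_delimiter.join(pages)}
    elif mode == "page":
        # One document per page; the delimiter is not used.
        for text in pages:
            yield {"page_content": text}
    else:
        raise ValueError(f"unsupported mode: {mode!r}")


pages = ["first page", "second page", "third page"]
single_docs = list(lazy_parse_pages(pages, mode="single"))
page_docs = list(lazy_parse_pages(pages, mode="page"))

# A "single" document can be split back into pages on the same delimiter.
recovered = single_docs[0]["page_content"].split(PAGES_DELIMITER)
```

Choosing a delimiter that rarely occurs in body text, such as one containing a form feed, keeps the split in the last line unambiguous.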

class

PDFMinerParser

Parse a blob from a PDF using the pdfminer.six library.

This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs, extracting images, and defining extraction mode. It integrates the 'pdfminer.six' library for PDF processing and offers synchronous blob parsing.

Examples:
   Setup:

   .. code-block:: bash

       pip install -U langchain-community pdfminer.six pillow

   Load a blob from a PDF file:

   .. code-block:: python

       from langchain_core.documents.base import Blob

       blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

   Instantiate the parser:

   .. code-block:: python

       from langchain_community.document_loaders.parsers import PDFMinerParser

       parser = PDFMinerParser(
           # password = None,
           mode = "single",
           pages_delimiter = "\n\f",
           # extract_images = True,
           # images_to_text = convert_images_to_text_with_tesseract(),
       )

   Lazily parse the blob:

   .. code-block:: python

       docs = []
       docs_lazy = parser.lazy_parse(blob)

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)
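Because ``lazy_parse`` returns an iterator, the caller decides how many documents to pull from it. The sketch below illustrates the general Python generator behavior this relies on. ``fake_lazy_parse`` is a hypothetical stand-in that does not read a PDF, and whether a real parser does its work page by page depends on its implementation and on the chosen ``mode`` (a ``"single"`` document presumably requires reading every page first).

```python
from itertools import islice
from typing import Iterator, List

# Records which pages were actually processed by the generator.
parsed_pages: List[int] = []


def fake_lazy_parse(n_pages: int) -> Iterator[str]:
    """Hypothetical stand-in for lazy_parse: yields one text per page."""
    for page_number in range(n_pages):
        parsed_pages.append(page_number)  # work happens only when requested
        yield f"text of page {page_number}"


# Take only the first two results; the remaining pages are never processed.
first_two = list(islice(fake_lazy_parse(100), 2))
```

Collecting everything with ``list(parser.lazy_parse(blob))`` is equivalent to the explicit loop in the example, at the cost of holding all documents in memory.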

class

PyMuPDFParser

Parse a blob from a PDF using the PyMuPDF library.

This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs, extracting images, and defining extraction mode. It integrates the 'PyMuPDF' library for PDF processing and offers synchronous blob parsing.

Examples:
   Setup:

   .. code-block:: bash

       pip install -U langchain-community pymupdf

   Load a blob from a PDF file:

   .. code-block:: python

       from langchain_core.documents.base import Blob

       blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

   Instantiate the parser:

   .. code-block:: python

       from langchain_community.document_loaders.parsers import PyMuPDFParser

       parser = PyMuPDFParser(
           # password = None,
           mode = "single",
           pages_delimiter = "\n\f",
           # images_parser = TesseractBlobParser(),
           # extract_tables="markdown",
           # extract_tables_settings=None,
           # text_kwargs=None,
       )

   Lazily parse the blob:

   .. code-block:: python

       docs = []
       docs_lazy = parser.lazy_parse(blob)

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)

class

PyPDFium2Parser

Parse a blob from a PDF using the PyPDFium2 library.

This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs, extracting images, and defining extraction mode. It integrates the 'PyPDFium2' library for PDF processing and offers synchronous blob parsing.

Examples:
   Setup:

   .. code-block:: bash

       pip install -U langchain-community pypdfium2

   Load a blob from a PDF file:

   .. code-block:: python

       from langchain_core.documents.base import Blob

       blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

   Instantiate the parser:

   .. code-block:: python

       from langchain_community.document_loaders.parsers import PyPDFium2Parser

       parser = PyPDFium2Parser(
           # password=None,
           mode="page",
           pages_delimiter="\n\f",
           # extract_images = True,
           # images_to_text = convert_images_to_text_with_tesseract(),
       )

   Lazily parse the blob:

   .. code-block:: python

       docs = []
       docs_lazy = parser.lazy_parse(blob)

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)

class

PDFPlumberParser

Parse PDF with PDFPlumber.

class

AmazonTextractPDFParser

Send PDF files to Amazon Textract and parse them.

Multi-page PDFs must reside on S3 to be parsed.

The AmazonTextractPDFLoader calls the Amazon Textract Service to convert PDFs into a Document structure. Single and multi-page documents of up to 3000 pages and 512 MB in size are supported.

For the call to succeed, an AWS account is required, similar to the AWS CLI requirements.

Apart from the AWS configuration, it is very similar to the other PDF loaders, while also supporting JPEG, PNG, TIFF, and non-native PDF formats.

.. code-block:: python

    from langchain_community.document_loaders import AmazonTextractPDFLoader

    loader = AmazonTextractPDFLoader("example_data/alejandro_rosalez_sample-small.jpeg")
    documents = loader.load()

One feature is the linearization of the output. When the LAYOUT, FORMS, or TABLES features are used with Textract, the loader generates output that formats the text in reading order and tries to render the information in a tabular structure or as key/value pairs separated by a colon (key: value). This helps most LLMs achieve better accuracy when processing these texts.

.. code-block:: python

    from langchain_community.document_loaders import AmazonTextractPDFLoader

    # you can mix and match each of the features
    loader = AmazonTextractPDFLoader(
        "example_data/alejandro_rosalez_sample-small.jpeg",
        textract_features=["TABLES", "LAYOUT"],
    )
    documents = loader.load()

Document objects are returned with metadata that includes the source and, in ``page``, a 1-based page index. Note that ``page`` is the index of the result returned from Textract, not necessarily the as-written page number in the document.
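To make the linearized output and the 1-based ``page`` metadata more concrete, here is a minimal sketch that does not call Textract. The helper names (``linearize_form``, ``linearize_table``, ``to_documents``), the sample values, and the exact text layout are assumptions for illustration only. The real formatting is produced by the Textract tooling and may differ.

```python
from typing import Dict, List, Tuple


def linearize_form(pairs: List[Tuple[str, str]]) -> str:
    """Render key/value pairs as 'key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in pairs)


def linearize_table(rows: List[List[str]]) -> str:
    """Render rows as plain text, padding each column to its widest cell."""
    widths = [max(len(row[col]) for row in rows) for col in range(len(rows[0]))]
    return "\n".join(
        " | ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    )


def to_documents(source: str, page_texts: List[str]) -> List[Dict]:
    """Attach the source and a 1-based result index to each page text."""
    return [
        {"page_content": text, "metadata": {"source": source, "page": index}}
        for index, text in enumerate(page_texts, start=1)
    ]


# Placeholder values, not taken from any real document.
form_text = linearize_form([("Name", "Jane Doe"), ("City", "Anytown")])
table_text = linearize_table([["Item", "Qty"], ["Pen", "2"]])
docs = to_documents("example.pdf", [form_text, table_text])
```

Keeping keys and values on one line and table cells in aligned columns preserves relationships that would be lost if the text were emitted in raw detection order.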

class

DocumentIntelligenceParser

Loads a PDF with Azure Document Intelligence (formerly Form Recognizer) and chunks at character level.
