PyPDFParser

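PyPDFParser(
  self,
  password: Optional[Union[str, bytes]] = None,
  extract_images: bool = False,
  mode: Literal['single', 'page'] = 'page',
  pages_delimiter: str = _DEFAULT_PAGES_DELIMITER,
  images_parser: Optional[BaseImageBlobParser] = None,
  images_inner_format: Literal['text', 'markdown-img', 'html-img'] = 'text',
  extraction_mode: Literal['plain', 'layout'] = 'plain',
  extraction_kwargs: Optional[dict[str, Any]] = None,
)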

Bases

BaseBlobParser

Inherited from BaseBlobParser (langchain_core)

Methods

parse


Parameters

Name	Type	Description
`password`	`Optional[Union[str, bytes]]`	Default: `None`. Optional password for opening encrypted PDFs.
`extract_images`	`bool`	Default: `False`. Whether to extract images from the PDF.
`mode`	`Literal['single', 'page']`	Default: `'page'`.
`pages_delimiter`	`str`	Default: `_DEFAULT_PAGES_DELIMITER`.
`images_parser`	`Optional[BaseImageBlobParser]`	Default: `None`.
`images_inner_format`	`Literal['text', 'markdown-img', 'html-img']`	Default: `'text'`.
`extraction_mode`	`Literal['plain', 'layout']`	Default: `'plain'`.
`extraction_kwargs`	`Optional[dict[str, Any]]`	Default: `None`.

Parse a blob from a PDF using the `pypdf` library.

This class parses a blob from a PDF document and supports configurations such as opening password-protected PDFs and extracting images. It uses the `pypdf` library for PDF processing and offers synchronous blob parsing.

Examples:

   Setup:

   .. code-block:: bash

       pip install -U langchain-community pypdf

   Load a blob from a PDF file:

   .. code-block:: python

       from langchain_core.documents.base import Blob

       blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

   Instantiate the parser:

   .. code-block:: python

       from langchain_community.document_loaders.parsers import PyPDFParser

       parser = PyPDFParser(
           # password=None,
           mode="single",
           pages_delimiter="\n\f",
           # images_parser=TesseractBlobParser(),
       )

   Lazily parse the blob:

   .. code-block:: python

       docs = []
       docs_lazy = parser.lazy_parse(blob)

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)
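
As a rough illustration of how `mode` and `pages_delimiter` interact, the sketch below works on plain Python strings instead of a real PDF. The `page_texts` list and the `split_by_mode` helper are hypothetical and not part of `PyPDFParser`, and the delimiter value is chosen only for illustration. It assumes that single mode amounts to joining the per-page text with the delimiter, which follows the parameter descriptions rather than the library source.

```python
# Hypothetical stand-in for the text extracted from each page of a PDF.
page_texts = ["First page text.", "Second page text.", "Third page text."]

# Illustrative delimiter: a newline followed by a form feed.
pages_delimiter = "\n\f"


def split_by_mode(pages: list[str], mode: str, delimiter: str) -> list[str]:
    """Return one string per document a parser would yield (sketch only)."""
    if mode == "page":
        # Page-wise extraction: each page becomes its own document.
        return list(pages)
    if mode == "single":
        # Whole-document extraction: pages are joined with the delimiter.
        return [delimiter.join(pages)]
    raise ValueError(f"Unsupported mode: {mode!r}")


per_page = split_by_mode(page_texts, "page", pages_delimiter)
single = split_by_mode(page_texts, "single", pages_delimiter)

print(len(per_page))  # one entry per page
print(len(single))    # a single entry for the whole document
```

Choosing a delimiter that does not occur in ordinary text makes it possible to recover page boundaries later by splitting the single document on it.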

`mode`: The extraction mode: `"single"` returns the entire document as one unit, `"page"` extracts page by page.

`pages_delimiter`: The string used to separate pages when `mode` is `"single"`.

`images_parser`: Optional image blob parser.

`images_inner_format`: The format for the parsed output.

- `"text"`: return the content as is.
- `"markdown-img"`: wrap the content in a Markdown image link pointing to `#` (`![body](#)`).
- `"html-img"`: wrap the content as the alt text of an `<img>` tag linking to `#` (`<img alt="{body}" src="#"/>`).

`extraction_mode`: `"plain"` for legacy functionality; `"layout"` extracts text in a fixed-width format that closely adheres to the rendered layout in the source PDF.

`extraction_kwargs`: Optional additional parameters for the extraction process.
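
The three `images_inner_format` values can be sketched as a small formatting function. `format_image_text` is a hypothetical helper written for this illustration, not a function exposed by the library; it only reproduces the output shapes described for each value and omits any escaping of special characters that a real implementation might perform.

```python
def format_image_text(body: str, images_inner_format: str = "text") -> str:
    """Wrap text parsed from an image in the requested format (sketch only)."""
    if images_inner_format == "text":
        # Return the parsed content unchanged.
        return body
    if images_inner_format == "markdown-img":
        # Markdown image syntax with the content as alt text and "#" as target.
        return f"![{body}](#)"
    if images_inner_format == "html-img":
        # HTML image tag with the content as alt text and "#" as source.
        return f'<img alt="{body}" src="#"/>'
    raise ValueError(f"Unsupported format: {images_inner_format!r}")


for fmt in ("text", "markdown-img", "html-img"):
    print(fmt, "->", format_image_text("a bar chart", fmt))
```

The wrapped forms keep image-derived text distinguishable from the surrounding page text, which can help downstream consumers that render or filter Markdown or HTML.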
