| Name | Type | Description |
|---|---|---|
| password | Optional[Union[str, bytes]] | Default: None. Optional password for opening encrypted PDFs. |
| extract_images | bool | Default: False. Whether to extract images from the PDF. |
| mode | Literal['single', 'page'] | Default: 'page' |
| pages_delimiter | str | Default: _DEFAULT_PAGES_DELIMITER |
| images_parser | Optional[BaseImageBlobParser] | Default: None |
| images_inner_format | Literal['text', 'markdown-img', 'html-img'] | Default: 'text' |
| extraction_mode | Literal['plain', 'layout'] | Default: 'plain' |
| extraction_kwargs | Optional[dict[str, Any]] | Default: None |
Parse a blob from a PDF using the pypdf library.
This class provides methods to parse a blob from a PDF document. It supports configurations such as handling password-protected PDFs and extracting images. It uses the pypdf library for PDF processing and offers synchronous blob parsing.
Examples:

Setup:

.. code-block:: bash

    pip install -U langchain-community pypdf

Load a blob from a PDF file:

.. code-block:: python

    from langchain_core.documents.base import Blob

    blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

Instantiate the parser:

.. code-block:: python

    from langchain_community.document_loaders.parsers import PyPDFParser

    parser = PyPDFParser(
        # password=None,
        mode="single",
        pages_delimiter="\n",
        # images_parser=TesseractBlobParser(),
    )

Lazily parse the blob:

.. code-block:: python

    # lazy_parse yields documents one at a time; collect them in a list.
    docs = []
    docs_lazy = parser.lazy_parse(blob)
    for doc in docs_lazy:
        docs.append(doc)
    print(docs[0].page_content[:100])
    print(docs[0].metadata)

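
The difference between `mode="single"` and `mode="page"` can be illustrated without a PDF at hand. The following is a minimal sketch in plain Python, not the parser's actual implementation: `group_pages` is a hypothetical helper, the dictionaries stand in for document objects, and the metadata keys are assumptions chosen for illustration.

```python
from typing import Any, Dict, List


def group_pages(
    page_texts: List[str], mode: str, pages_delimiter: str
) -> List[Dict[str, Any]]:
    """Hypothetical helper: group per-page text the way the two modes describe."""
    if mode == "page":
        # Page-wise extraction: one record per page.
        return [
            {"page_content": text, "metadata": {"page": index}}
            for index, text in enumerate(page_texts)
        ]
    if mode == "single":
        # Whole-document extraction: pages joined by the delimiter.
        return [
            {
                "page_content": pages_delimiter.join(page_texts),
                "metadata": {"total_pages": len(page_texts)},
            }
        ]
    raise ValueError(f"Unsupported mode: {mode!r}")


pages = ["first page", "second page"]
per_page = group_pages(pages, mode="page", pages_delimiter="\n")
single = group_pages(pages, mode="single", pages_delimiter="\n")

print(len(per_page))                    # 2
print(repr(single[0]["page_content"]))  # 'first page\nsecond page'
```

In this sketch the delimiter only matters in single mode, which matches the description of `pages_delimiter` as a separator for single-mode extraction.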
mode: The extraction mode, either "single" for the entire document or "page" for page-wise extraction.
pages_delimiter: A string delimiter to separate pages in single-mode extraction.
images_parser: Optional image blob parser.
images_inner_format: The format for the parsed output.

- "text": return the content as is.
- "markdown-img": wrap the content in a Markdown image link (``![body](#)``).
- "html-img": wrap the content as the alt text of an image tag (``<img alt="{body}" src="#"/>``).

extraction_mode: "plain" for legacy functionality; "layout" extracts text in a fixed-width format that closely adheres to the rendered layout in the source PDF.
extraction_kwargs: Optional additional parameters for the extraction process.
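
The three `images_inner_format` values ('text', 'markdown-img', 'html-img') can be pictured as three ways of wrapping the text recovered from an image. The sketch below is an illustration only and is not taken from the library: `wrap_image_text` is a hypothetical helper, and escaping the alt text with `html.escape` is an added assumption.

```python
import html


def wrap_image_text(body: str, images_inner_format: str = "text") -> str:
    """Hypothetical helper: wrap text parsed from an image in the chosen format."""
    if images_inner_format == "text":
        # Return the parsed content unchanged.
        return body
    if images_inner_format == "markdown-img":
        # Markdown image syntax with a placeholder link target.
        return f"![{body}](#)"
    if images_inner_format == "html-img":
        # Assumption: escape the body so it is safe inside an attribute value.
        return f'<img alt="{html.escape(body, quote=True)}" src="#"/>'
    raise ValueError(f"Unsupported images_inner_format: {images_inner_format!r}")


print(wrap_image_text("A bar chart", "text"))          # A bar chart
print(wrap_image_text("A bar chart", "markdown-img"))  # ![A bar chart](#)
print(wrap_image_text("A bar chart", "html-img"))      # <img alt="A bar chart" src="#"/>
```

The Markdown and HTML variants keep the parsed text recognisable as image-derived content, at the cost of extra markup in the output.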