# PyPDFium2Parser

> **Class** in `langchain_community`

📖 [View in docs](https://reference.langchain.com/python/langchain-community/document_loaders/parsers/pdf/PyPDFium2Parser)

Parse a blob from a PDF using the `PyPDFium2` library.

   This class parses a blob from a PDF document. It supports password-protected
   PDFs, optional image extraction, and a choice of extraction mode
   (`single` or `page`).
   It uses the `PyPDFium2` library for PDF processing and offers synchronous
   blob parsing.

   Examples:
       Setup:

       .. code-block:: bash

           pip install -U langchain-community pypdfium2

       Load a blob from a PDF file:

       .. code-block:: python

           from langchain_core.documents.base import Blob

           blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

       Instantiate the parser:

       .. code-block:: python

           from langchain_community.document_loaders.parsers import PyPDFium2Parser

           parser = PyPDFium2Parser(
               # password=None,
               mode="page",
               pages_delimiter="\n",
               # extract_images = True,
               # images_parser = TesseractBlobParser(),
           )

       Lazily parse the blob:

       .. code-block:: python

           docs = []
           docs_lazy = parser.lazy_parse(blob)

           for doc in docs_lazy:
               docs.append(doc)
           print(docs[0].page_content[:100])
           print(docs[0].metadata)

## Signature

```python
PyPDFium2Parser(
    self,
    extract_images: bool = False,
    *,
    password: Optional[str] = None,
    mode: Literal['single', 'page'] = 'page',
    pages_delimiter: str = _DEFAULT_PAGES_DELIMITER,
    images_parser: Optional[BaseImageBlobParser] = None,
    images_inner_format: Literal['text', 'markdown-img', 'html-img'] = 'text',
)
```

## Parameters

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `password` | `Optional[str]` | No | Optional password for opening encrypted PDFs. (default: `None`) |
| `mode` | `Literal['single', 'page']` | No | The extraction mode, either "single" for the entire document or "page" for page-wise extraction. (default: `'page'`) |
| `pages_delimiter` | `str` | No | A string delimiter to separate pages in single-mode extraction. (default: `_DEFAULT_PAGES_DELIMITER`) |
| `extract_images` | `bool` | No | Whether to extract images from the PDF. (default: `False`) |
| `images_parser` | `Optional[BaseImageBlobParser]` | No | Optional image blob parser. (default: `None`) |
| `images_inner_format` | `Literal['text', 'markdown-img', 'html-img']` | No | The format for the parsed image output. `"text"` returns the content as is; `"markdown-img"` wraps the content in a Markdown image link pointing to `#` (`![body](#)`); `"html-img"` wraps the content as the `alt` text of an `<img>` tag whose `src` is `#` (`<img alt="{body}" src="#"/>`). (default: `'text'`) |
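| `extraction_mode` | `unknown` | No | Listed in the docstring but absent from the constructor signature. "plain" for legacy functionality, "layout" for experimental layout mode functionality. |
| `extraction_kwargs` | `unknown` | No | Listed in the docstring but absent from the constructor signature. Optional additional parameters for the extraction process. |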
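
The relationship between `mode` and `pages_delimiter` can be illustrated without a PDF at hand. The following is a minimal sketch, not the library's implementation: `combine_pages` is a hypothetical helper, and the `"\n\f"` default is an assumed example value rather than a confirmed value of `_DEFAULT_PAGES_DELIMITER`.

```python
from typing import List, Literal


def combine_pages(
    page_texts: List[str],
    mode: Literal["single", "page"] = "page",
    pages_delimiter: str = "\n\f",  # assumed example value
) -> List[str]:
    """Hypothetical helper: map per-page texts to document contents."""
    if mode == "page":
        # One output per page; the delimiter is not used.
        return list(page_texts)
    if mode == "single":
        # One output for the whole document, pages joined by the delimiter.
        return [pages_delimiter.join(page_texts)]
    raise ValueError(f"Unsupported mode: {mode!r}")


pages = ["First page.", "Second page."]
per_page = combine_pages(pages, mode="page")
whole_doc = combine_pages(pages, mode="single", pages_delimiter="\n---\n")

print(len(per_page))
print(whole_doc[0])
```

In `page` mode the delimiter has no effect, which is why the description of `pages_delimiter` restricts it to single-mode extraction.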

## Extends

- `BaseBlobParser`

## Constructors

```python
__init__(
    self,
    extract_images: bool = False,
    *,
    password: Optional[str] = None,
    mode: Literal['single', 'page'] = 'page',
    pages_delimiter: str = _DEFAULT_PAGES_DELIMITER,
    images_parser: Optional[BaseImageBlobParser] = None,
    images_inner_format: Literal['text', 'markdown-img', 'html-img'] = 'text',
) -> None
```

| Name | Type |
|------|------|
| `extract_images` | `bool` |
| `password` | `Optional[str]` |
| `mode` | `Literal['single', 'page']` |
| `pages_delimiter` | `str` |
| `images_parser` | `Optional[BaseImageBlobParser]` |
| `images_inner_format` | `Literal['text', 'markdown-img', 'html-img']` |


## Properties

- `extract_images`
- `images_parser`
- `images_inner_format`
- `password`
- `mode`
- `pages_delimiter`

## Methods

- [`lazy_parse()`](https://reference.langchain.com/python/langchain-community/document_loaders/parsers/pdf/PyPDFium2Parser/lazy_parse)
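
Because `lazy_parse()` yields documents one at a time, it can be wrapped in a generator so that callers never hold the whole PDF in memory. The sketch below assumes `langchain-community` and `pypdfium2` are installed; `iter_page_texts` is a hypothetical helper name, not part of the library.

```python
from typing import Iterator, Optional


def iter_page_texts(pdf_path: str, password: Optional[str] = None) -> Iterator[str]:
    """Hypothetical helper: yield the text of each page, one at a time."""
    # Imported lazily so the module still loads where the packages are missing.
    from langchain_community.document_loaders.parsers import PyPDFium2Parser
    from langchain_core.documents.base import Blob

    parser = PyPDFium2Parser(mode="page", password=password)
    blob = Blob.from_path(pdf_path)
    # lazy_parse() returns an iterator of documents; forward only the text.
    for doc in parser.lazy_parse(blob):
        yield doc.page_content
```

A caller would iterate over it, for example `for text in iter_page_texts("./example_data/layout-parser-paper.pdf"): print(text[:100])`, and can stop early without the remaining pages being consumed from the iterator.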

---

[View source on GitHub](https://github.com/langchain-ai/langchain-community/blob/4b280287bd55b99b44db2dd849f02d66c89534d5/libs/community/langchain_community/document_loaders/parsers/pdf.py#L1177)