# PyPDFium2Loader

> **Class** in `langchain_community`

📖 [View in docs](https://reference.langchain.com/python/langchain-community/document_loaders/pdf/PyPDFium2Loader)

Load and parse a PDF file using the `pypdfium2` library.

   This class loads and parses PDF documents. It supports password-protected
   files, image extraction, and a choice of extraction mode (one document per
   page, or a single document for the whole file).
   It uses the `pypdfium2` library for PDF processing and offers both
   synchronous and asynchronous document loading.

   Examples:
       Setup:

       .. code-block:: bash

           pip install -U langchain-community pypdfium2

       Instantiate the loader:

       .. code-block:: python

           from langchain_community.document_loaders import PyPDFium2Loader

           loader = PyPDFium2Loader(
               file_path="./example_data/layout-parser-paper.pdf",
               # headers=None,
               # password=None,
               mode="single",
               pages_delimiter="\n",
               # extract_images=True,
               # images_parser=...,  # a BaseImageBlobParser instance
           )

       Lazy load documents:

       .. code-block:: python

           docs = []
           docs_lazy = loader.lazy_load()

           for doc in docs_lazy:
               docs.append(doc)
           print(docs[0].page_content[:100])
           print(docs[0].metadata)

       Load documents asynchronously:

       .. code-block:: python

           docs = await loader.aload()
           print(docs[0].page_content[:100])
           print(docs[0].metadata)
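
The `await loader.aload()` snippet above runs as written only where an event loop is already active, such as a notebook. Below is a minimal sketch for a plain script. `load_pdf` is a hypothetical helper name, and the import is placed inside the function on the assumption that `langchain-community` and `pypdfium2` are optional installs:

```python
import asyncio


async def load_pdf(path: str):
    """Load a PDF page by page and return the resulting documents."""
    # Imported lazily so this module can still be imported when the
    # optional dependencies are not installed.
    from langchain_community.document_loaders import PyPDFium2Loader

    loader = PyPDFium2Loader(path, mode="page")
    return await loader.aload()
```

From a script, it could be driven with `docs = asyncio.run(load_pdf("./example_data/layout-parser-paper.pdf"))`.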

## Signature

```python
PyPDFium2Loader(
    file_path: Union[str, PurePath],
    *,
    mode: Literal['single', 'page'] = 'page',
    pages_delimiter: str = _DEFAULT_PAGES_DELIMITER,
    password: Optional[str] = None,
    extract_images: bool = False,
    images_parser: Optional[BaseImageBlobParser] = None,
    images_inner_format: Literal['text', 'markdown-img', 'html-img'] = 'text',
    headers: Optional[dict] = None,
)
```
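
To illustrate how `mode` and `pages_delimiter` interact, here is a rough pure-Python sketch that does not call the library. `combine_pages` is a hypothetical stand-in, and the `"\n---\n"` delimiter is chosen only for the example; the actual value of `_DEFAULT_PAGES_DELIMITER` is not shown on this page.

```python
from typing import List, Literal


def combine_pages(
    page_texts: List[str],
    mode: Literal["single", "page"] = "page",
    pages_delimiter: str = "\n---\n",  # illustrative value, not the library default
) -> List[str]:
    """Return one string per output document for the given mode."""
    if mode == "page":
        # Page-wise extraction: one document per page.
        return list(page_texts)
    # Single mode: the whole file becomes one document,
    # with the pages joined by the delimiter.
    return [pages_delimiter.join(page_texts)]


pages = ["First page.", "Second page."]
per_page = combine_pages(pages, mode="page")
single = combine_pages(pages, mode="single")

print(len(per_page), len(single))
# Splitting on the same delimiter recovers the individual pages.
print(single[0].split("\n---\n"))
```

Picking a delimiter that cannot occur in the page text keeps a single-mode document splittable back into pages later.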

## Parameters

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `file_path` | `Union[str, PurePath]` | Yes | The path to the PDF file to be loaded. |
| `headers` | `Optional[dict]` | No | Optional headers to use for the GET request when downloading a file from a web path. (default: `None`) |
| `password` | `Optional[str]` | No | Optional password for opening encrypted PDFs. (default: `None`) |
| `mode` | `Literal['single', 'page']` | No | The extraction mode, either "single" for the entire document or "page" for page-wise extraction. (default: `'page'`) |
| `pages_delimiter` | `str` | No | A string delimiter to separate pages in single-mode extraction. (default: `_DEFAULT_PAGES_DELIMITER`) |
| `extract_images` | `bool` | No | Whether to extract images from the PDF. (default: `False`) |
| `images_parser` | `Optional[BaseImageBlobParser]` | No | Optional image blob parser. (default: `None`) |
| `images_inner_format` | `Literal['text', 'markdown-img', 'html-img']` | No | The format for the parsed image output: `"text"` returns the content as is; `"markdown-img"` wraps the content in a Markdown image link whose target is `#` (`![body](#)`); `"html-img"` uses the content as the `alt` text of an `<img>` tag whose source is `#` (`<img alt="{body}" src="#"/>`). (default: `'text'`) |
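
The three `images_inner_format` values can be illustrated with a small stand-in function. This is a sketch of the wrapping described in the table, not the library's implementation: `wrap_image_text` is a hypothetical name, and escaping the `alt` text with `html.escape` is a choice made for this example.

```python
import html


def wrap_image_text(body: str, images_inner_format: str = "text") -> str:
    """Wrap text extracted from an image according to the chosen format."""
    if images_inner_format == "text":
        # Return the content as is.
        return body
    if images_inner_format == "markdown-img":
        # Use the content as the label of a Markdown image link pointing to "#".
        return f"![{body}](#)"
    if images_inner_format == "html-img":
        # Use the content as the alt text of an <img> tag.
        # Escaping is this sketch's own precaution.
        return f'<img alt="{html.escape(body, quote=True)}" src="#"/>'
    raise ValueError(f"Unknown images_inner_format: {images_inner_format!r}")


for fmt in ("text", "markdown-img", "html-img"):
    print(fmt, "->", wrap_image_text("A bar chart", fmt))
```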

## Extends

- `BasePDFLoader`

## Constructors

```python
__init__(
    self,
    file_path: Union[str, PurePath],
    *,
    mode: Literal['single', 'page'] = 'page',
    pages_delimiter: str = _DEFAULT_PAGES_DELIMITER,
    password: Optional[str] = None,
    extract_images: bool = False,
    images_parser: Optional[BaseImageBlobParser] = None,
    images_inner_format: Literal['text', 'markdown-img', 'html-img'] = 'text',
    headers: Optional[dict] = None,
)
```

| Name | Type |
|------|------|
| `file_path` | `Union[str, PurePath]` |
| `mode` | `Literal['single', 'page']` |
| `pages_delimiter` | `str` |
| `password` | `Optional[str]` |
| `extract_images` | `bool` |
| `images_parser` | `Optional[BaseImageBlobParser]` |
| `images_inner_format` | `Literal['text', 'markdown-img', 'html-img']` |
| `headers` | `Optional[dict]` |


## Properties

- `parser`

## Methods

- [`lazy_load()`](https://reference.langchain.com/python/langchain-community/document_loaders/pdf/PyPDFium2Loader/lazy_load)
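
Because `lazy_load()` returns an iterator, a caller can stop early without processing the rest of the file, which matters mostly with `mode="page"`, where each page is its own document. The sketch below demonstrates the pattern with a hypothetical stand-in generator, `fake_lazy_load`, rather than a real loader, so it runs without a PDF:

```python
from itertools import islice
from typing import Iterator, List

# Records which pages the stand-in generator actually produced.
consumed: List[int] = []


def fake_lazy_load(num_pages: int) -> Iterator[str]:
    """Hypothetical stand-in for loader.lazy_load(): yields one item per page."""
    for page in range(num_pages):
        consumed.append(page)
        yield f"content of page {page}"


# Take only the first two items; the remaining pages are never generated.
first_two = list(islice(fake_lazy_load(1000), 2))

print(first_two)
print(len(consumed))
```

With a real loader, the same `islice(loader.lazy_load(), n)` pattern would apply unchanged.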

---

[View source on GitHub](https://github.com/langchain-ai/langchain-community/blob/a6a6079511ac8a5c1293337f88096b8641562e77/libs/community/langchain_community/document_loaders/pdf.py#L308)