# PDFMinerParser

> **Class** in `langchain_community`

📖 [View in docs](https://reference.langchain.com/python/langchain-community/document_loaders/parsers/pdf/PDFMinerParser)

Parse a blob from a PDF using the `pdfminer.six` library.

   This class parses a blob from a PDF document. It supports password-protected
   PDFs, image extraction, and a configurable extraction mode.
   It uses the `pdfminer.six` library for PDF processing and offers synchronous
   blob parsing.

   Examples:
       Setup:

       .. code-block:: bash

           pip install -U langchain-community pdfminer.six pillow

       Load a blob from a PDF file:

       .. code-block:: python

           from langchain_core.documents.base import Blob

           blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

       Instantiate the parser:

       .. code-block:: python

           from langchain_community.document_loaders.parsers import PDFMinerParser

           parser = PDFMinerParser(
               # password = None,
               mode = "single",
               pages_delimiter = "\n",
               # extract_images = True,
               # images_parser = TesseractBlobParser(),
           )

       Lazily parse the blob:

       .. code-block:: python

           docs = []
           docs_lazy = parser.lazy_parse(blob)

           for doc in docs_lazy:
               docs.append(doc)
           print(docs[0].page_content[:100])
           print(docs[0].metadata)
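
The way `mode` and `pages_delimiter` interact can be illustrated with a small, library-free sketch. The helper `assemble_documents` and the dictionaries it yields are hypothetical and exist only for illustration. They are not part of `langchain_community`, and the real parser yields `Document` objects with richer metadata.

```python
from typing import Dict, Iterator, List


def assemble_documents(
    page_texts: List[str],
    mode: str,
    pages_delimiter: str,
) -> Iterator[Dict[str, object]]:
    """Hypothetical stand-in for how page texts become output documents.

    Assumption: "single" joins every page with the delimiter, while "page"
    yields one item per page. The metadata keys are illustrative only.
    """
    if mode == "single":
        # One document for the whole file; pages are joined by the delimiter.
        yield {
            "page_content": pages_delimiter.join(page_texts),
            "metadata": {"total_pages": len(page_texts)},
        }
    elif mode == "page":
        # One document per page; the delimiter is not used.
        for page_number, text in enumerate(page_texts):
            yield {
                "page_content": text,
                "metadata": {"page": page_number, "total_pages": len(page_texts)},
            }
    else:
        raise ValueError(f"mode must be 'single' or 'page', got {mode!r}")


pages = ["First page text", "Second page text"]
single_docs = list(assemble_documents(pages, mode="single", pages_delimiter="\n"))
page_docs = list(assemble_documents(pages, mode="page", pages_delimiter="\n"))
print(len(single_docs), len(page_docs))
```

In `"single"` mode the delimiter is the only trace of page boundaries, so a distinctive delimiter makes it possible to split the text back into pages later.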

## Signature

```python
PDFMinerParser(
    self,
    extract_images: bool = False,
    *,
    password: Optional[str] = None,
    mode: Literal['single', 'page'] = 'single',
    pages_delimiter: str = _DEFAULT_PAGES_DELIMITER,
    images_parser: Optional[BaseImageBlobParser] = None,
    images_inner_format: Literal['text', 'markdown-img', 'html-img'] = 'text',
    concatenate_pages: Optional[bool] = None,
)
```

## Parameters

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `password` | `Optional[str]` | No | Optional password for opening encrypted PDFs. (default: `None`) |
| `mode` | `Literal['single', 'page']` | No | Extraction mode to use: `"single"` to return the whole PDF as one document, or `"page"` for page-wise extraction. (default: `'single'`) |
| `pages_delimiter` | `str` | No | A string delimiter to separate pages in single-mode extraction. (default: `_DEFAULT_PAGES_DELIMITER`) |
| `extract_images` | `bool` | No | Whether to extract images from PDF. (default: `False`) |
| `images_parser` | `Optional[BaseImageBlobParser]` | No | Optional image blob parser applied to extracted images. (default: `None`) |
| `images_inner_format` | `Literal['text', 'markdown-img', 'html-img']` | No | The format for the parsed output. `"text"` returns the content as is; `"markdown-img"` wraps the content in a Markdown image link pointing to `#` (`![body](#)`); `"html-img"` wraps the content as the `alt` text of an `<img>` tag whose `src` is `#` (`<img alt="{body}" src="#"/>`). (default: `'text'`) |
| `concatenate_pages` | `Optional[bool]` | No | Deprecated. If True, concatenate all PDF pages into a single document. Otherwise, return one document per page. (default: `None`) |
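
The three `images_inner_format` values can be pictured with a minimal sketch. The function `format_image_text` is a hypothetical helper written for this page, not the library's implementation. Escaping the `alt` text with `html.escape` is an assumption made here so that the HTML output stays well-formed.

```python
import html


def format_image_text(body: str, images_inner_format: str = "text") -> str:
    """Hypothetical illustration of the three output formats for image text."""
    if images_inner_format == "text":
        # Return the parsed image content unchanged.
        return body
    if images_inner_format == "markdown-img":
        # Wrap the content as the label of a Markdown image pointing to "#".
        return f"![{body}](#)"
    if images_inner_format == "html-img":
        # Use the content as the alt text of an <img> tag with src="#".
        # Escaping is an assumption of this sketch, not a documented behavior.
        return f'<img alt="{html.escape(body, quote=True)}" src="#"/>'
    raise ValueError(f"Unsupported images_inner_format: {images_inner_format!r}")


for fmt in ("text", "markdown-img", "html-img"):
    print(fmt, "->", format_image_text("a bar chart", fmt))
```

The `"markdown-img"` and `"html-img"` forms keep image-derived text distinguishable from the surrounding page text, which can help when the output is later rendered or filtered.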

## Extends

- `BaseBlobParser`

## Constructors

```python
__init__(
    self,
    extract_images: bool = False,
    *,
    password: Optional[str] = None,
    mode: Literal['single', 'page'] = 'single',
    pages_delimiter: str = _DEFAULT_PAGES_DELIMITER,
    images_parser: Optional[BaseImageBlobParser] = None,
    images_inner_format: Literal['text', 'markdown-img', 'html-img'] = 'text',
    concatenate_pages: Optional[bool] = None,
)
```

| Name | Type |
|------|------|
| `extract_images` | `bool` |
| `password` | `Optional[str]` |
| `mode` | `Literal['single', 'page']` |
| `pages_delimiter` | `str` |
| `images_parser` | `Optional[BaseImageBlobParser]` |
| `images_inner_format` | `Literal['text', 'markdown-img', 'html-img']` |
| `concatenate_pages` | `Optional[bool]` |


## Properties

- `extract_images`
- `images_parser`
- `images_inner_format`
- `password`
- `mode`
- `pages_delimiter`

## Methods

- [`decode_text()`](https://reference.langchain.com/python/langchain-community/document_loaders/parsers/pdf/PDFMinerParser/decode_text)
- [`resolve_and_decode()`](https://reference.langchain.com/python/langchain-community/document_loaders/parsers/pdf/PDFMinerParser/resolve_and_decode)
- [`lazy_parse()`](https://reference.langchain.com/python/langchain-community/document_loaders/parsers/pdf/PDFMinerParser/lazy_parse)

---

[View source on GitHub](https://github.com/langchain-ai/langchain-community/blob/a6a6079511ac8a5c1293337f88096b8641562e77/libs/community/langchain_community/document_loaders/parsers/pdf.py#L474)