# PyMuPDFParser

> **Class** in `langchain_community`

📖 [View in docs](https://reference.langchain.com/python/langchain-community/document_loaders/parsers/pdf/PyMuPDFParser)

Parse a blob from a PDF using the `PyMuPDF` library.

   This class provides methods to parse a blob from a PDF document, supporting various
   configurations such as handling password-protected PDFs, extracting images, and
   selecting the extraction mode.
   It integrates the `PyMuPDF` library for PDF processing and offers synchronous blob
   parsing.

   Examples:
       Setup:

       .. code-block:: bash

           pip install -U langchain-community pymupdf

       Load a blob from a PDF file:

       .. code-block:: python

           from langchain_core.documents.base import Blob

           blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

       Instantiate the parser:

       .. code-block:: python

           from langchain_community.document_loaders.parsers import PyMuPDFParser

           parser = PyMuPDFParser(
               # password = None,
               mode = "single",
               pages_delimiter = "\n",
               # images_parser = TesseractBlobParser(),
               # extract_tables="markdown",
               # extract_tables_settings=None,
               # text_kwargs=None,
           )

       Lazily parse the blob:

       .. code-block:: python

           docs = []
           docs_lazy = parser.lazy_parse(blob)

           for doc in docs_lazy:
               docs.append(doc)
           print(docs[0].page_content[:100])
           print(docs[0].metadata)
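The parser instantiated above uses `mode = "single"` together with a `pages_delimiter`. As a rough illustration of what the two modes mean for the shape of the output, the sketch below works on plain strings. `shape_output` is a hypothetical helper written for this page, not part of `langchain_community`, and its `"\n"` default is an assumption taken from the example above rather than the library's `_DEFAULT_PAGES_DELIMITER`.

```python
from typing import List, Literal


def shape_output(
    page_texts: List[str],
    mode: Literal["single", "page"] = "page",
    pages_delimiter: str = "\n",  # assumption: mirrors the example, not the library default
) -> List[str]:
    """Hypothetical illustration of the two extraction modes."""
    if mode == "page":
        # Page-wise extraction: one output entry per page.
        return list(page_texts)
    if mode == "single":
        # Whole-document extraction: pages joined by the delimiter into one entry.
        return [pages_delimiter.join(page_texts)]
    raise ValueError(f"mode must be 'single' or 'page', got {mode!r}")


pages = ["Page one text", "Page two text"]
print(shape_output(pages, mode="page"))
print(shape_output(pages, mode="single", pages_delimiter="\n---\n"))
```

In `"page"` mode the list has one entry per page, while in `"single"` mode it has exactly one entry, which is why `pages_delimiter` only matters for single-mode extraction.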

## Signature

```python
PyMuPDFParser(
    self,
    text_kwargs: Optional[dict[str, Any]] = None,
    extract_images: bool = False,
    *,
    password: Optional[str] = None,
    mode: Literal['single', 'page'] = 'page',
    pages_delimiter: str = _DEFAULT_PAGES_DELIMITER,
    images_parser: Optional[BaseImageBlobParser] = None,
    images_inner_format: Literal['text', 'markdown-img', 'html-img'] = 'text',
    extract_tables: Union[Literal['csv', 'markdown', 'html'], None] = None,
    extract_tables_settings: Optional[dict[str, Any]] = None,
)
```

## Parameters

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `password` | `Optional[str]` | No | Optional password for opening encrypted PDFs. (default: `None`) |
| `mode` | `Literal['single', 'page']` | No | The extraction mode, either "single" for the entire document or "page" for page-wise extraction. (default: `'page'`) |
| `pages_delimiter` | `str` | No | A string delimiter to separate pages in single-mode extraction. (default: `_DEFAULT_PAGES_DELIMITER`) |
| `extract_images` | `bool` | No | Whether to extract images from the PDF. (default: `False`) |
| `images_parser` | `Optional[BaseImageBlobParser]` | No | Optional image blob parser. (default: `None`) |
| `images_inner_format` | `Literal['text', 'markdown-img', 'html-img']` | No | The format for the parsed output. "text" returns the content as is; "markdown-img" wraps the content in a Markdown image link whose target is `#` (`![body](#)`); "html-img" uses the content as the `alt` text of an `<img>` tag whose `src` is `#` (`<img alt="{body}" src="#"/>`). (default: `'text'`) |
| `extract_tables` | `Union[Literal['csv', 'markdown', 'html'], None]` | No | Whether to extract tables in a specific format, such as "csv", "markdown", or "html". (default: `None`) |
| `extract_tables_settings` | `Optional[dict[str, Any]]` | No | Optional dictionary of settings for customizing table extraction. (default: `None`) |
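
The three `images_inner_format` values described in the table can be pictured with a small sketch. `wrap_image_text` is a hypothetical helper that only mirrors the documented output shapes; it is not the library's implementation, and it deliberately skips any escaping of the body that a real implementation might apply.

```python
from typing import Literal

ImagesInnerFormat = Literal["text", "markdown-img", "html-img"]


def wrap_image_text(body: str, images_inner_format: ImagesInnerFormat = "text") -> str:
    """Hypothetical helper: wrap parsed image text in the documented formats."""
    if images_inner_format == "text":
        # Return the content as is.
        return body
    if images_inner_format == "markdown-img":
        # Markdown image link whose target is "#".
        return f"![{body}](#)"
    if images_inner_format == "html-img":
        # Content becomes the alt text of an <img> tag whose src is "#".
        return f'<img alt="{body}" src="#"/>'
    raise ValueError(f"unsupported images_inner_format: {images_inner_format!r}")


for fmt in ("text", "markdown-img", "html-img"):
    print(fmt, "->", wrap_image_text("A bar chart", fmt))
```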

## Extends

- `BaseBlobParser`

## Constructors

```python
__init__(
    self,
    text_kwargs: Optional[dict[str, Any]] = None,
    extract_images: bool = False,
    *,
    password: Optional[str] = None,
    mode: Literal['single', 'page'] = 'page',
    pages_delimiter: str = _DEFAULT_PAGES_DELIMITER,
    images_parser: Optional[BaseImageBlobParser] = None,
    images_inner_format: Literal['text', 'markdown-img', 'html-img'] = 'text',
    extract_tables: Union[Literal['csv', 'markdown', 'html'], None] = None,
    extract_tables_settings: Optional[dict[str, Any]] = None,
) -> None
```

| Name | Type |
|------|------|
| `text_kwargs` | `Optional[dict[str, Any]]` |
| `extract_images` | `bool` |
| `password` | `Optional[str]` |
| `mode` | `Literal['single', 'page']` |
| `pages_delimiter` | `str` |
| `images_parser` | `Optional[BaseImageBlobParser]` |
| `images_inner_format` | `Literal['text', 'markdown-img', 'html-img']` |
| `extract_tables` | `Union[Literal['csv', 'markdown', 'html'], None]` |
| `extract_tables_settings` | `Optional[dict[str, Any]]` |


## Properties

- `mode`
- `pages_delimiter`
- `password`
- `text_kwargs`
- `extract_images`
- `images_inner_format`
- `images_parser`
- `extract_tables`
- `extract_tables_settings`

## Methods

- [`lazy_parse()`](https://reference.langchain.com/python/langchain-community/document_loaders/parsers/pdf/PyMuPDFParser/lazy_parse)
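
Because `lazy_parse()` yields documents one at a time, they can be consumed incrementally instead of being collected into a single list first. The sketch below groups them into fixed-size batches. `iter_batches` and its `batch_size` parameter are hypothetical names introduced for illustration; the only parser call it relies on is `lazy_parse(blob)`.

```python
from itertools import islice
from typing import Any, Iterator, List


def iter_batches(parser: Any, blob: Any, batch_size: int = 8) -> Iterator[List[Any]]:
    """Yield documents from parser.lazy_parse(blob) in lists of up to batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    docs = iter(parser.lazy_parse(blob))
    while True:
        # Pull at most batch_size documents without exhausting the iterator.
        batch = list(islice(docs, batch_size))
        if not batch:
            return
        yield batch


# Usage with the `parser` and `blob` from the examples above
# (requires langchain-community and pymupdf to be installed):
#
# for batch in iter_batches(parser, blob, batch_size=8):
#     print(len(batch), batch[0].metadata)
```

Batching is mostly useful with `mode="page"`, where each page becomes its own document; with `mode="single"` the iterator yields only one document.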

---

[View source on GitHub](https://github.com/langchain-ai/langchain-community/blob/d5ea8358933260ad48dd31f7f8076555c7b4885a/libs/community/langchain_community/document_loaders/parsers/pdf.py#L807)