| Name | Type | Description |
|---|---|---|
| file_path* | Union[str, PurePath] | The path to the PDF file to be loaded. |
| headers | Optional[dict] | Default: None. Optional headers to use for the GET request that downloads a file from a web path. |
| password | Optional[str] | Default: None |
| mode | Literal['single', 'page'] | Default: 'page' |
| pages_delimiter | str | Default: _DEFAULT_PAGES_DELIMITER |
| extract_images | bool | Default: False |
| images_parser | Optional[BaseImageBlobParser] | Default: None |
| images_inner_format | Literal['text', 'markdown-img', 'html-img'] | Default: 'text' |
| extract_tables | Union[Literal['csv', 'markdown', 'html'], None] | Default: None |
| extract_tables_settings | Optional[dict[str, Any]] | Default: None |
| **kwargs | Any | Default: {} |

Load and parse a PDF file using the PyMuPDF library.

This class provides methods to load and parse PDF documents, supporting various
configurations such as handling password-protected files, extracting tables,
extracting images, and defining extraction mode. It integrates the PyMuPDF
library for PDF processing and offers both synchronous and asynchronous document
loading.
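
Of the options listed above, the extraction mode is the one that changes the shape of the result, so it may help to see it in isolation. The following is a minimal pure-Python sketch of the idea: in "page" mode each page becomes its own item, while in "single" mode the page texts are joined with `pages_delimiter`. The helper `combine_pages`, its `"\n"` default, and the sample strings are hypothetical; this illustrates the documented behavior and is not the loader's actual implementation.

```python
from typing import List, Literal


def combine_pages(
    page_texts: List[str],
    *,
    mode: Literal["single", "page"] = "page",
    pages_delimiter: str = "\n",  # illustrative default, not the library's
) -> List[str]:
    """Hypothetical helper mimicking the documented `mode` semantics."""
    if mode == "page":
        # One output string per page; the delimiter is not used.
        return list(page_texts)
    if mode == "single":
        # One output string for the whole document.
        return [pages_delimiter.join(page_texts)]
    raise ValueError(f"unsupported mode: {mode!r}")


pages = ["First page text", "Second page text"]
per_page = combine_pages(pages, mode="page")
single = combine_pages(pages, mode="single", pages_delimiter="\n\f")

print(len(per_page))
print(len(single))
print(single[0].split("\n\f"))
```

Choosing a delimiter that does not occur in ordinary text, such as one containing a form feed, makes it possible to split the single document back into pages later.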
Examples:

Setup:

.. code-block:: bash

    pip install -U langchain-community pymupdf

Instantiate the loader:

.. code-block:: python

    from langchain_community.document_loaders import PyMuPDFLoader

    loader = PyMuPDFLoader(
        file_path="./example_data/layout-parser-paper.pdf",
        # headers=None,
        # password=None,
        mode="single",
        pages_delimiter="\n\f",
        # extract_images=True,
        # images_parser=TesseractBlobParser(),
        # extract_tables="markdown",
        # extract_tables_settings=None,
    )

Lazy load documents:

.. code-block:: python

    docs = []
    docs_lazy = loader.lazy_load()

    for doc in docs_lazy:
        docs.append(doc)
    print(docs[0].page_content[:100])
    print(docs[0].metadata)

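
The practical benefit of iterating lazily is that a consumer can stop early without paying for items it never requests. The sketch below shows that pattern with a hypothetical stand-in generator, `fake_lazy_load`, used in place of `loader.lazy_load()` so that it runs without a PDF. How much work the real loader defers is not stated here and may depend on `mode`, so treat this as an illustration of the iterator pattern only.

```python
from itertools import islice
from typing import Dict, Iterator, List

parsed_pages: List[int] = []  # records which pages were actually "parsed"


def fake_lazy_load(num_pages: int) -> Iterator[Dict[str, object]]:
    """Hypothetical stand-in for loader.lazy_load(): yields one record per page."""
    for page in range(num_pages):
        parsed_pages.append(page)  # simulate the parsing work for this page
        yield {"page_content": f"text of page {page}", "metadata": {"page": page}}


# Take only the first two items; the generator is never advanced past them.
first_two = list(islice(fake_lazy_load(100), 2))

print(len(first_two))
print(parsed_pages)
```

With a real loader, the same `islice(loader.lazy_load(), 2)` call would apply, since `islice` accepts any iterator.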
Load documents asynchronously:

.. code-block:: python

    docs = await loader.aload()
    print(docs[0].page_content[:100])
    print(docs[0].metadata)

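
A bare `await` at top level only works where an event loop is already running, such as in a notebook. In a plain script the call has to live inside a coroutine that is driven with `asyncio.run`. The sketch below shows that wrapping with a hypothetical `StubLoader` standing in for the real loader so that it runs without a PDF; only the `aload()` method name is taken from the example above.

```python
import asyncio
from typing import Dict, List


class StubLoader:
    """Hypothetical stand-in exposing an async aload() like the loader above."""

    async def aload(self) -> List[Dict[str, object]]:
        await asyncio.sleep(0)  # yield control once, as real I/O would
        return [{"page_content": "example text", "metadata": {"page": 0}}]


async def main() -> List[Dict[str, object]]:
    loader = StubLoader()  # replace with a real PyMuPDFLoader instance
    return await loader.aload()


# asyncio.run creates the event loop, runs main() to completion, and closes it.
docs = asyncio.run(main())
print(docs[0]["page_content"][:100])
print(docs[0]["metadata"])
```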
- password: Optional password for opening encrypted PDFs.
- mode: The extraction mode, either "single" for the entire document or "page" for page-wise extraction.
- pages_delimiter: A string delimiter to separate pages in single-mode extraction.
- extract_images: Whether to extract images from the PDF.
- images_parser: Optional image blob parser.
- images_inner_format: The format for the parsed output.

  - "text": return the content as is.
  - "markdown-img": wrap the content in a Markdown image link (`![body](#)`).
  - "html-img": wrap the content as the `alt` text of an `<img>` tag (`<img alt="{body}" src="#"/>`).

- extract_tables: Whether to extract tables in a specific format, such as "csv", "markdown", or "html".
- extract_tables_settings: Optional dictionary of settings for customizing table extraction.
- **kwargs: Additional keyword arguments for customizing text extraction behavior.

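
The three `images_inner_format` values can be illustrated with a small pure-Python sketch. The function `wrap_image_text` is hypothetical and only mirrors the Markdown and HTML templates quoted in the description above. The "text" branch returning the content unchanged is an assumption based on the option name, and none of this is the library's implementation.

```python
from typing import Literal


def wrap_image_text(
    body: str,
    images_inner_format: Literal["text", "markdown-img", "html-img"] = "text",
) -> str:
    """Hypothetical helper: wrap text parsed from an image in the chosen format."""
    if images_inner_format == "text":
        return body  # assumed: content is returned as is
    if images_inner_format == "markdown-img":
        return f"![{body}](#)"  # Markdown image link with the text as alt text
    if images_inner_format == "html-img":
        return f'<img alt="{body}" src="#"/>'  # HTML tag with the text as alt
    raise ValueError(f"unsupported format: {images_inner_format!r}")


for fmt in ("text", "markdown-img", "html-img"):
    print(fmt, "->", wrap_image_text("A bar chart", fmt))
```

This sketch does not escape quotes or brackets in `body`; production code that emits HTML or Markdown would likely need to.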