PDFMinerLoader

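PDFMinerLoader(
  self,
  file_path: Union[str, PurePath],
  *,
  password: Optional[str] = None,
  mode: Literal['single', 'page'] = 'single',
  pages_delimiter: str = _DEFAULT_PAGES_DELIMITER,
  extract_images: bool = False,
  images_parser: Optional[BaseImageBlobParser] = None,
  images_inner_format: Literal['text', 'markdown-img', 'html-img'] = 'text',
  headers: Optional[dict] = None,
  concatenate_pages: Optional[bool] = None,
)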

Bases

BasePDFLoader

Inherited from BasePDFLoader

Attributes

`file_path`, `web_path`, `headers`, `temp_dir`


Parameters

Name	Type	Description
`file_path`*	`Union[str, PurePath]`	The path to the PDF file to be loaded.
`headers`	`Optional[dict]`	Default: `None`. Optional headers to use for the GET request that downloads a file from a web path.
`password`	`Optional[str]`	Default: `None`
`mode`	`Literal['single', 'page']`	Default: `'single'`
`pages_delimiter`	`str`	Default: `_DEFAULT_PAGES_DELIMITER`
`extract_images`	`bool`	Default: `False`
`images_parser`	`Optional[BaseImageBlobParser]`	Default: `None`
`images_inner_format`	`Literal['text', 'markdown-img', 'html-img']`	Default: `'text'`
`concatenate_pages`	`Optional[bool]`	Default: `None`

constructor

__init__

Name	Type
file_path	Union[str, PurePath]
password	Optional[str]
mode	Literal['single', 'page']
pages_delimiter	str
extract_images	bool
images_parser	Optional[BaseImageBlobParser]
images_inner_format	Literal['text', 'markdown-img', 'html-img']
headers	Optional[dict]
concatenate_pages	Optional[bool]

Load and parse a PDF file using the `pdfminer.six` library.

This class provides methods to load and parse PDF documents, supporting various configurations such as handling password-protected files, extracting images, and defining extraction mode. It integrates the pdfminer.six library for PDF processing and offers both synchronous and asynchronous document loading.
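
To make the relationship between `mode` and `pages_delimiter` concrete, the following plain-Python sketch mimics the assumed behaviour: `'single'` yields one document with the pages joined by the delimiter, while `'page'` yields one document per page. It is an illustration only, not the loader's implementation. `assemble_documents` and the sample page strings are hypothetical, and the `"\n\f"` delimiter is an assumption made for the example.

```python
from typing import List, Literal


def assemble_documents(
    page_texts: List[str],
    mode: Literal["single", "page"] = "single",
    pages_delimiter: str = "\n\f",  # assumed value, for illustration only
) -> List[str]:
    """Hypothetical helper: combine per-page text as the two modes are assumed to."""
    if mode == "single":
        # One document for the whole file, pages separated by the delimiter.
        return [pages_delimiter.join(page_texts)]
    if mode == "page":
        # One document per page; the delimiter is not used.
        return list(page_texts)
    raise ValueError(f"mode must be 'single' or 'page', got {mode!r}")


pages = ["First page text", "Second page text"]
single_docs = assemble_documents(pages, mode="single")
page_docs = assemble_documents(pages, mode="page")
print(len(single_docs), len(page_docs))
```

Under these assumptions, a document produced in `'single'` mode can be split back into pages with `str.split(pages_delimiter)`, which is one reason to pick a delimiter that does not occur in the page text.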

Examples:

   Setup:

   .. code-block:: bash

       pip install -U langchain-community pdfminer.six

   Instantiate the loader:

   .. code-block:: python

       from langchain_community.document_loaders import PDFMinerLoader

       loader = PDFMinerLoader(
           file_path = "./example_data/layout-parser-paper.pdf",
           # headers = None
           # password = None,
           mode = "single",
           pages_delimiter = "\n\f",
           # extract_images = True,
           # images_to_text = convert_images_to_text_with_tesseract(),
       )

   Lazy load documents:

   .. code-block:: python

       docs = []
       docs_lazy = loader.lazy_load()

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)
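
Because `lazy_load()` is consumed with a `for` loop, the documents do not have to be collected into a list first. The sketch below takes only the first few documents with `itertools.islice`. `first_documents` is a hypothetical helper, and `_StubLoader` is a stand-in that yields plain strings so the snippet runs without a PDF on disk; with a real `PDFMinerLoader`, pass the loader instance instead.

```python
from itertools import islice
from typing import Any, Iterator, List


def first_documents(loader: Any, limit: int) -> List[Any]:
    """Hypothetical helper: pull at most `limit` documents from loader.lazy_load()."""
    return list(islice(loader.lazy_load(), limit))


class _StubLoader:
    """Stand-in for a loader; yields plain strings instead of Document objects."""

    def __init__(self) -> None:
        self.yielded = 0  # counts how many items were actually produced

    def lazy_load(self) -> Iterator[str]:
        for number in range(1, 6):
            self.yielded += 1
            yield f"page {number}"


stub = _StubLoader()
preview = first_documents(stub, limit=2)
print(preview)
```

This kind of preview is mainly useful with `mode = "page"`, where a file produces more than one document.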

   Load documents asynchronously:

   .. code-block:: python

       docs = await loader.aload()
       print(docs[0].page_content[:100])
       print(docs[0].metadata)
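
The bare `await` above assumes an environment that already has a running event loop, such as a notebook. In a regular script, one option is to wrap the call in a coroutine and start it with `asyncio.run`. In this sketch, `load_documents` is a hypothetical helper and `_StubLoader` stands in for a configured `PDFMinerLoader` so that the snippet is self-contained.

```python
import asyncio
from typing import Any, List


async def load_documents(loader: Any) -> List[Any]:
    """Hypothetical helper: await the loader's asynchronous aload() method."""
    return await loader.aload()


class _StubLoader:
    """Stand-in for a loader; returns plain strings instead of Document objects."""

    async def aload(self) -> List[str]:
        await asyncio.sleep(0)  # yield control once, as real I/O would
        return ["page 1", "page 2"]


# asyncio.run creates an event loop, runs the coroutine, and closes the loop.
docs = asyncio.run(load_documents(_StubLoader()))
print(len(docs))
```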
