Load and parse a PDF file using 'pypdf' library.
This class provides methods to load and parse PDF documents, supporting various
configurations such as handling password-protected files, extracting images, and
defining extraction mode. It integrates the pypdf library for PDF processing and
offers both synchronous and asynchronous document loading.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pypdf
Instantiate the loader:
.. code-block:: python
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(
file_path = "./example_data/layout-parser-paper.pdf",
# headers = None
# password = None,
mode = "single",
pages_delimiter = "
", # extract_images = True, # images_parser = RapidOCRBlobParser(), )
Lazy load documents:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
for doc in docs_lazy:
docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load documents asynchronously:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load content from RSpace notebooks, folders, documents or PDF Gallery files.
Map RSpace document <-> Langchain Document in 1-1. PDFs are imported using PyPDF.
Requirements are rspace_client (pip install rspace_client) and PyPDF if importing
PDF docs (pip install pypdf).