Base Loader that uses dedoc (https://dedoc.readthedocs.io).
The loader extracts text, tables and attached files from the given file:
* Text can be split by pages, dedoc tree nodes, textual lines
(according to the split parameter).
* Attached files (when with_attachments=True)
are split according to the split parameter.
For attachments, langchain Document object has an additional metadata field
type="attachment".
* Tables (when with_tables=True) are not split - each table corresponds to one
langchain Document object.
For tables, Document object has additional metadata fields type="table"
and text_as_html with table HTML representation.
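The following is a minimal sketch of how downstream code might separate the three kinds of documents described above by their type metadata field. The Doc dataclass is a hypothetical stand-in for the langchain Document object, the sample data is invented for illustration, and the assumption that plain text documents carry no type key is ours, not something the loader guarantees.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a langchain Document (content plus metadata)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_type(docs):
    """Group documents by metadata["type"]; a missing key is treated as text (assumption)."""
    groups = {"text": [], "table": [], "attachment": []}
    for doc in docs:
        kind = doc.metadata.get("type", "text")
        groups.setdefault(kind, []).append(doc)
    return groups


# Invented sample data mirroring the metadata fields named in the description.
docs = [
    Doc("Intro paragraph"),
    Doc("a | b", {"type": "table", "text_as_html": "<table><tr><td>a</td><td>b</td></tr></table>"}),
    Doc("Attached note", {"type": "attachment"}),
]
groups = group_by_type(docs)
print({kind: len(items) for kind, items in groups.items()})
```

Grouping this way lets table documents be rendered from text_as_html while text and attachment documents go through the usual splitting pipeline.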
Abstract base class for parsing image blobs into text.
Send PDF files to Amazon Textract and parse them.
For parsing multi-page PDFs, they have to reside on S3.
The AmazonTextractPDFLoader calls the Amazon Textract Service to convert PDFs into a Document structure. Single and multi-page documents are supported, up to 3000 pages and 512 MB in size.
For the call to be successful an AWS account is required, similar to the AWS CLI requirements.
Besides the AWS configuration, it is very similar to the other PDF loaders, while also supporting JPEG, PNG and TIFF and non-native PDF formats.
from langchain_community.document_loaders import AmazonTextractPDFLoader
loader = AmazonTextractPDFLoader("example_data/alejandro_rosalez_sample-small.jpeg")
documents = loader.load()
One feature is the linearization of the output, which applies when the LAYOUT, FORMS or TABLES features are used together with Textract:
from langchain_community.document_loaders import AmazonTextractPDFLoader
# you can mix and match each of the features
loader = AmazonTextractPDFLoader(
    "example_data/alejandro_rosalez_sample-small.jpeg",
    textract_features=["TABLES", "LAYOUT"],
)
documents = loader.load()
With these features, the loader generates output that formats the text in reading order and tries to present the information in a tabular structure or as key/value pairs separated by a colon (key: value). This helps most LLMs achieve better accuracy when processing these texts.
Document objects are returned with metadata that includes the source and
a 1-based index of the page number in page. Note that page represents
the index of the result returned from Textract, not necessarily the as-written
page number in the document.
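To make the distinction between the page index and the printed page number concrete, here is a small sketch. It does not call Textract; attach_page_index is a hypothetical helper, and the source path and page texts are invented. It only illustrates that the index follows the order of the returned results, starting at 1.

```python
def attach_page_index(source, page_texts):
    """Pair each returned page text with a 1-based index, in result order."""
    return [
        {"page_content": text, "metadata": {"source": source, "page": index}}
        for index, text in enumerate(page_texts, start=1)
    ]


# An excerpt whose printed page numbers are 41 and 42 (invented example).
records = attach_page_index(
    "s3://hypothetical-bucket/excerpt.pdf",
    ["... printed page 41 ...", "... printed page 42 ..."],
)
pages = [record["metadata"]["page"] for record in records]
print(pages)
```

The metadata reports pages 1 and 2 even though the document itself prints 41 and 42.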
Loads a PDF with Azure Document Intelligence (formerly Form Recognizer) and chunks at character level.
Parse a blob from a PDF using pdfminer.six library.
This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs, extracting images, and defining extraction mode. It integrates the 'pdfminer.six' library for PDF processing and offers synchronous blob parsing.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pdfminer.six pillow
Load a blob from a PDF file:
.. code-block:: python
from langchain_core.documents.base import Blob
blob = Blob.from_path("./example_data/layout-parser-paper.pdf")
Instantiate the parser:
.. code-block:: python
from langchain_community.document_loaders.parsers import PDFMinerParser
parser = PDFMinerParser(
    # password=None,
    mode="single",
    pages_delimiter="\n",
    # extract_images=True,
    # images_to_text=convert_images_to_text_with_tesseract(),
)
Lazily parse the blob:
.. code-block:: python
docs = []
docs_lazy = parser.lazy_parse(blob)
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
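The mode and pages_delimiter options shown above can be read as follows. This is a pure-Python sketch of the assumed semantics, not the parser's implementation: "page" yields one text per page, while "single" joins all page texts with the delimiter. The assemble function and the sample page texts are hypothetical.

```python
def assemble(page_texts, mode="single", pages_delimiter="\n"):
    """Return the list of document texts for the given extraction mode."""
    if mode == "page":
        # One document per page; the delimiter is not used.
        return list(page_texts)
    if mode == "single":
        # One document for the whole file, pages joined by the delimiter.
        return [pages_delimiter.join(page_texts)]
    raise ValueError(f"unsupported mode: {mode!r}")


pages = ["First page text", "Second page text"]
per_page = assemble(pages, mode="page")
merged = assemble(pages, mode="single", pages_delimiter="\n---\n")
print(per_page)
print(merged)
```

A visible delimiter such as "\n---\n" makes page boundaries easy to recover later when the whole file is kept as one document.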
Parse PDF with PDFPlumber.
Parse a blob from a PDF using PyMuPDF library.
This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs, extracting images, and defining extraction mode. It integrates the 'PyMuPDF' library for PDF processing and offers synchronous blob parsing.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pymupdf
Load a blob from a PDF file:
.. code-block:: python
from langchain_core.documents.base import Blob
blob = Blob.from_path("./example_data/layout-parser-paper.pdf")
Instantiate the parser:
.. code-block:: python
from langchain_community.document_loaders.parsers import PyMuPDFParser
parser = PyMuPDFParser(
    # password=None,
    mode="single",
    pages_delimiter="\n",
    # images_parser=TesseractBlobParser(),
    # extract_tables="markdown",
    # extract_tables_settings=None,
    # text_kwargs=None,
)
Lazily parse the blob:
.. code-block:: python
docs = []
docs_lazy = parser.lazy_parse(blob)
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Parse a blob from a PDF using PyPDFium2 library.
This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs, extracting images, and defining extraction mode. It integrates the 'PyPDFium2' library for PDF processing and offers synchronous blob parsing.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pypdfium2
Load a blob from a PDF file:
.. code-block:: python
from langchain_core.documents.base import Blob
blob = Blob.from_path("./example_data/layout-parser-paper.pdf")
Instantiate the parser:
.. code-block:: python
from langchain_community.document_loaders.parsers import PyPDFium2Parser
parser = PyPDFium2Parser(
    # password=None,
    mode="page",
    pages_delimiter="\n",
    # extract_images=True,
    # images_to_text=convert_images_to_text_with_tesseract(),
)
Lazily parse the blob:
.. code-block:: python
docs = []
docs_lazy = parser.lazy_parse(blob)
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Parse a blob from a PDF using pypdf library.
This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs and extracting images. It integrates the 'pypdf' library for PDF processing and offers synchronous blob parsing.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pypdf
Load a blob from a PDF file:
.. code-block:: python
from langchain_core.documents.base import Blob
blob = Blob.from_path("./example_data/layout-parser-paper.pdf")
Instantiate the parser:
.. code-block:: python
from langchain_community.document_loaders.parsers import PyPDFParser
parser = PyPDFParser(
    # password=None,
    mode="single",
    pages_delimiter="\n",
    # images_parser=TesseractBlobParser(),
)
Lazily parse the blob:
.. code-block:: python
docs = []
docs_lazy = parser.lazy_parse(blob)
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load PDF files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredPDFLoader
loader = UnstructuredPDFLoader(
    "example.pdf",
    mode="elements",
    strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-pdf
Base Loader class for PDF files.
If the file is a web path, it will download it to a temporary file, use it, then clean up the temporary file after completion.
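The download-use-cleanup pattern described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the base loader's code: is_web_path and with_local_copy are hypothetical helpers, and the download step is replaced by in-memory bytes so the example runs offline.

```python
import os
import tempfile
from urllib.parse import urlparse


def is_web_path(path):
    """Treat http(s) URLs with a host as web paths; everything else is local."""
    parsed = urlparse(path)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def with_local_copy(data, use):
    """Write bytes to a temporary file, call use(path), then always delete the file."""
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        return use(tmp_path), tmp_path
    finally:
        # Clean up even if use() raises.
        os.remove(tmp_path)


print(is_web_path("https://example.com/paper.pdf"))
print(is_web_path("./example_data/layout-parser-paper.pdf"))

# Stand-in for downloaded content; the "use" step here just measures the file.
size, used_path = with_local_copy(b"%PDF-1.4", os.path.getsize)
print(size, os.path.exists(used_path))
```

Putting the removal in a finally block ensures the temporary file does not outlive the load, even when parsing fails.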
Load online PDF.
Load and parse a PDF file using 'pypdf' library.
This class provides methods to load and parse PDF documents, supporting various
configurations such as handling password-protected files, extracting images, and
defining extraction mode. It integrates the pypdf library for PDF processing and
offers both synchronous and asynchronous document loading.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pypdf
Instantiate the loader:
.. code-block:: python
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(
    file_path="./example_data/layout-parser-paper.pdf",
    # headers=None,
    # password=None,
    mode="single",
    pages_delimiter="\n",
    # extract_images=True,
    # images_parser=RapidOCRBlobParser(),
)
Lazy load documents:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load documents asynchronously:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
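The asynchronous snippet above uses a bare await, which only works where top-level await is available, such as a notebook. In a plain script the call has to run inside a coroutine driven by asyncio.run. The sketch below shows that pattern with a hypothetical FakeLoader standing in for the real loader so that it runs without any PDF library installed.

```python
import asyncio


class FakeLoader:
    """Hypothetical stand-in that exposes an async aload() like the loaders above."""

    async def aload(self):
        await asyncio.sleep(0)  # yield control once, as real I/O would
        return ["doc-1", "doc-2"]


async def main():
    loader = FakeLoader()
    # Inside a coroutine, awaiting aload() is valid in any Python script.
    return await loader.aload()


docs = asyncio.run(main())
print(len(docs))
```

With a real loader, only the construction of the loader inside main() would change.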
Load and parse a PDF file using the pypdfium2 library.
This class provides methods to load and parse PDF documents, supporting various
configurations such as handling password-protected files, extracting images, and
defining extraction mode.
It integrates the pypdfium2 library for PDF processing and offers both
synchronous and asynchronous document loading.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pypdfium2
Instantiate the loader:
.. code-block:: python
from langchain_community.document_loaders import PyPDFium2Loader
loader = PyPDFium2Loader(
    file_path="./example_data/layout-parser-paper.pdf",
    # headers=None,
    # password=None,
    mode="single",
    pages_delimiter="\n",
    # extract_images=True,
    # images_to_text=convert_images_to_text_with_tesseract(),
)
Lazy load documents:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load documents asynchronously:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load and parse a directory of PDF files using 'pypdf' library.
This class provides methods to load and parse multiple PDF documents in a directory,
supporting options for recursive search, handling password-protected files,
extracting images, and defining extraction modes. It integrates the pypdf library
for PDF processing and offers synchronous document loading.
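The recursive search option can be illustrated with pathlib. This is a sketch of the idea only: find_pdfs and its glob and recursive parameters are hypothetical names chosen for the example and are not taken from the directory loader's signature.

```python
import tempfile
from pathlib import Path


def find_pdfs(directory, glob="*.pdf", recursive=False):
    """List matching files, descending into subdirectories only when recursive."""
    root = Path(directory)
    matches = root.rglob(glob) if recursive else root.glob(glob)
    return sorted(path for path in matches if path.is_file())


# Build a throwaway directory tree with one nested PDF and one non-PDF file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.pdf").write_bytes(b"%PDF-1.4")
    (root / "nested" / "b.pdf").write_bytes(b"%PDF-1.4")
    (root / "notes.txt").write_text("not a pdf")
    top_level = [path.name for path in find_pdfs(root)]
    everything = [path.name for path in find_pdfs(root, recursive=True)]

print(top_level)
print(everything)
```

Without recursion only a.pdf is found; with recursion the nested b.pdf is included as well.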
Load and parse a PDF file using 'pdfminer.six' library.
This class provides methods to load and parse PDF documents, supporting various
configurations such as handling password-protected files, extracting images, and
defining extraction mode. It integrates the pdfminer.six library for PDF
processing and offers both synchronous and asynchronous document loading.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pdfminer.six
Instantiate the loader:
.. code-block:: python
from langchain_community.document_loaders import PDFMinerLoader
loader = PDFMinerLoader(
    file_path="./example_data/layout-parser-paper.pdf",
    # headers=None,
    # password=None,
    mode="single",
    pages_delimiter="\n",
    # extract_images=True,
    # images_to_text=convert_images_to_text_with_tesseract(),
)
Lazy load documents:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load documents asynchronously:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load PDF files as HTML content using PDFMiner.
Load and parse a PDF file using 'PyMuPDF' library.
This class provides methods to load and parse PDF documents, supporting various
configurations such as handling password-protected files, extracting tables,
extracting images, and defining extraction mode. It integrates the PyMuPDF
library for PDF processing and offers both synchronous and asynchronous document
loading.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pymupdf
Instantiate the loader:
.. code-block:: python
from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader(
    file_path="./example_data/layout-parser-paper.pdf",
    # headers=None,
    # password=None,
    mode="single",
    pages_delimiter="\n",
    # extract_images=True,
    # images_parser=TesseractBlobParser(),
    # extract_tables="markdown",
    # extract_tables_settings=None,
)
Lazy load documents:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load documents asynchronously:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load PDF files using Mathpix service.
Load PDF files using pdfplumber.
Load PDF files from a local file system, HTTP or S3.
To authenticate, the AWS client uses the following methods to automatically load credentials: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
To use a specific credential profile, pass the name of that profile as defined in the ~/.aws/credentials file.
Make sure the credentials / roles used have the required policies to access the Amazon Textract service.
DedocPDFLoader document loader integration to load PDF files using dedoc.
The file loader can automatically detect the correctness of a textual layer in the
PDF document.
Note that the __init__ method supports parameters that differ from those of
DedocBaseLoader.
Load a PDF with Azure Document Intelligence
Document loader utilizing Zerox library: https://github.com/getomni-ai/zerox
Zerox converts a PDF document into a series of images (one per page) and uses a vision-capable LLM to generate a Markdown representation.
Zerox uses async operations. Therefore, when using this loader inside a Jupyter Notebook (or any environment that already runs an event loop), you will need to:
import nest_asyncio
nest_asyncio.apply()
Load files using Unstructured.
The file loader uses the unstructured partition function and automatically detects the file type. You can run the loader in different modes: "single", "elements", and "paged". The default "single" mode returns a single langchain Document object. In "elements" mode, the unstructured library splits the document into elements such as Title and NarrativeText and returns them as individual langchain Document objects. In addition to these post-processing modes (which are specific to the LangChain loaders), Unstructured has its own "chunking" parameters for post-processing elements into more useful chunks for use cases such as Retrieval Augmented Generation (RAG). You can pass in additional unstructured kwargs to configure different unstructured settings.
from langchain_community.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader(
    "example.pdf",
    mode="elements",
    strategy="fast",
)
docs = loader.load()
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking
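As a rough illustration of the three modes named above, the sketch below post-processes a list of already partitioned elements. It is an assumption-laden model rather than the loader's code: the combine function, the element dictionaries, the page_number key and the blank-line separator are all chosen for the example, and "paged" is assumed to mean one document per page.

```python
from collections import defaultdict


def combine(elements, mode="single"):
    """Turn partitioned elements into document dicts under the assumed mode semantics."""
    if mode == "elements":
        # One document per element, metadata passed through unchanged.
        return [{"page_content": e["text"], "metadata": e["metadata"]} for e in elements]
    if mode == "paged":
        # One document per page, elements on the same page joined by a blank line.
        by_page = defaultdict(list)
        for e in elements:
            by_page[e["metadata"].get("page_number", 1)].append(e["text"])
        return [
            {"page_content": "\n\n".join(texts), "metadata": {"page_number": page}}
            for page, texts in sorted(by_page.items())
        ]
    if mode == "single":
        # One document for the whole file.
        text = "\n\n".join(e["text"] for e in elements)
        return [{"page_content": text, "metadata": {}}]
    raise ValueError(f"unsupported mode: {mode!r}")


# Invented elements: a title and a paragraph on page 1, one paragraph on page 2.
elements = [
    {"text": "A Title", "metadata": {"category": "Title", "page_number": 1}},
    {"text": "Opening paragraph.", "metadata": {"category": "NarrativeText", "page_number": 1}},
    {"text": "Closing paragraph.", "metadata": {"category": "NarrativeText", "page_number": 2}},
]
counts = {mode: len(combine(elements, mode)) for mode in ("elements", "paged", "single")}
print(counts)
```

The same three elements produce three, two and one documents respectively, which is the practical difference between the modes when choosing a chunking strategy.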