Transcribe and parse audio files.
Audio transcription uses the OpenAI Whisper model.
Load a PDF with Azure Document Intelligence (formerly Form Recognizer).
Load article PDF files using Grobid.
Parse HTML files using Beautiful Soup.
Abstract base class for parsing image blobs into text.
Parser for analyzing images using a language model (LLM).
Parser for extracting text from images using the RapidOCR library.
Parser for extracting text from images using the Tesseract OCR library.
Parse using the respective programming language syntax.
Each top-level function and class in the code is loaded into a separate document. In addition, one extra document is generated that contains the remaining top-level code, excluding the already segmented functions and classes.
This approach can potentially improve the accuracy of QA models over source code.
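As an illustration of the segmentation described above, here is a minimal sketch built on Python's standard ast module rather than on LanguageParser itself. The helper name split_top_level is hypothetical and not part of LangChain; it only mimics the idea of one segment per top-level function or class plus one remainder.

```python
import ast
from typing import List, Tuple


def split_top_level(source: str) -> Tuple[List[str], str]:
    """Return (segments, remainder) for a Python source string.

    segments holds one string per top-level function or class;
    remainder holds every other line, in original order.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    segments: List[str] = []
    taken = set()  # zero-based indices of lines that belong to a segment

    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it stays with its definition.
            first_line = min([node.lineno] + [d.lineno for d in node.decorator_list])
            start, end = first_line - 1, node.end_lineno
            segments.append("\n".join(lines[start:end]))
            taken.update(range(start, end))

    remainder = "\n".join(
        line for index, line in enumerate(lines) if index not in taken
    )
    return segments, remainder


example_source = "import os\n\ndef f():\n    return 1\n\nclass A:\n    pass\n\nx = f()\n"
example_segments, example_remainder = split_top_level(example_source)
```

Here example_segments contains the function and the class as two separate strings, and example_remainder keeps the import and the final assignment.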
The supported languages for code parsing are:
esprima)
Items marked with (*) require the packages tree_sitter and
tree_sitter_languages. It is straightforward to add support for additional
languages using tree_sitter, although this currently requires modifying LangChain.
The language used for parsing can be configured (language), along with the minimum number of lines required to activate syntax-based splitting (parser_threshold).
If a language is not explicitly specified, LanguageParser will infer one from
filename extensions, if present.
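The two settings above can be pictured with a small sketch. Everything in it is an assumption made for illustration: the mapping EXTENSION_TO_LANGUAGE, its values, the helper names infer_language and should_segment, and the exact boundary behaviour of the threshold are hypothetical and do not reproduce LangChain's internals.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension table, for illustration only.
EXTENSION_TO_LANGUAGE = {"py": "python", "js": "js"}


def infer_language(source_path: str) -> Optional[str]:
    """Guess a language from the filename extension, or return None."""
    suffix = Path(source_path).suffix.lstrip(".").lower()
    return EXTENSION_TO_LANGUAGE.get(suffix)


def should_segment(code: str, language: Optional[str], parser_threshold: int = 0) -> bool:
    """Decide whether syntax-based splitting would be attempted.

    Assumption: splitting needs a known language and at least
    parser_threshold lines of code.
    """
    if language is None:
        return False
    return len(code.splitlines()) >= parser_threshold


language = infer_language("./code/app.py")
short_file = "x = 1\n"
long_file = "\n".join("x = %d" % i for i in range(300))
split_short = should_segment(short_file, language, parser_threshold=200)
split_long = should_segment(long_file, language, parser_threshold=200)
```

With a threshold of 200, the one-line file is kept whole while the 300-line file qualifies for splitting.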
Examples:

.. code-block:: python

    from langchain_community.document_loaders.generic import GenericLoader
    from langchain_community.document_loaders.parsers import LanguageParser

    # Load every .py and .js file under ./code; the language is inferred per file.
    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py", ".js"],
        parser=LanguageParser(),
    )
    docs = loader.load()
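Example instantiation that manually selects the language:

.. code-block:: python

    from langchain_community.document_loaders.generic import GenericLoader
    from langchain_community.document_loaders.parsers import LanguageParser

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(language="python"),
    )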
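Example instantiation that sets the line-count threshold:

.. code-block:: python

    from langchain_community.document_loaders.generic import GenericLoader
    from langchain_community.document_loaders.parsers import LanguageParser

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(parser_threshold=200),
    )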
Parse a blob from a PDF using pdfminer.six library.
This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs, extracting images, and defining extraction mode. It integrates the 'pdfminer.six' library for PDF processing and offers synchronous blob parsing.
Examples:

Setup:

.. code-block:: bash

    pip install -U langchain-community pdfminer.six pillow
Load a blob from a PDF file:

.. code-block:: python

    from langchain_core.documents.base import Blob

    blob = Blob.from_path("./example_data/layout-parser-paper.pdf")
Instantiate the parser:

.. code-block:: python

    from langchain_community.document_loaders.parsers import PDFMinerParser

    parser = PDFMinerParser(
        # password=None,
        mode="single",
        pages_delimiter="\n",
        # extract_images=True,
        # images_to_text=convert_images_to_text_with_tesseract(),
    )
Lazily parse the blob:

.. code-block:: python

    docs = []
    docs_lazy = parser.lazy_parse(blob)  # returns an iterator of documents

    for doc in docs_lazy:
        docs.append(doc)
    print(docs[0].page_content[:100])
    print(docs[0].metadata)
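The mode and pages_delimiter options used above can be pictured with a toy sketch. It assumes that "single" joins all pages into one document using the delimiter and that "page" yields one document per page. The function toy_parse, its dictionary output, and the metadata keys are hypothetical stand-ins, not the parser's real implementation.

```python
from typing import Dict, Iterator, List


def toy_parse(pages: List[str], mode: str, pages_delimiter: str) -> Iterator[Dict]:
    """Yield toy documents from already extracted page texts."""
    if mode == "single":
        # One document for the whole file; pages are joined by the delimiter.
        yield {
            "page_content": pages_delimiter.join(pages),
            "metadata": {"total_pages": len(pages)},
        }
    elif mode == "page":
        # One document per page; the delimiter is not needed here.
        for number, text in enumerate(pages):
            yield {
                "page_content": text,
                "metadata": {"page": number, "total_pages": len(pages)},
            }
    else:
        raise ValueError("mode must be 'single' or 'page'")


page_texts = ["Title page", "Body text"]
single_docs = list(toy_parse(page_texts, mode="single", pages_delimiter="\n---\n"))
page_docs = list(toy_parse(page_texts, mode="page", pages_delimiter="\n---\n"))
```

In this sketch single_docs holds one entry whose content contains the delimiter once, while page_docs holds two entries, one per page.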
Parse PDF with PDFPlumber.
Parse a blob from a PDF using PyMuPDF library.
This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs, extracting images, and defining extraction mode. It integrates the 'PyMuPDF' library for PDF processing and offers synchronous blob parsing.
Examples:

Setup:

.. code-block:: bash

    pip install -U langchain-community pymupdf
Load a blob from a PDF file:

.. code-block:: python

    from langchain_core.documents.base import Blob

    blob = Blob.from_path("./example_data/layout-parser-paper.pdf")
Instantiate the parser:

.. code-block:: python

    from langchain_community.document_loaders.parsers import PyMuPDFParser

    parser = PyMuPDFParser(
        # password=None,
        mode="single",
        pages_delimiter="\n",
        # images_parser=TesseractBlobParser(),
        # extract_tables="markdown",
        # extract_tables_settings=None,
        # text_kwargs=None,
    )
Lazily parse the blob:

.. code-block:: python

    docs = []
    docs_lazy = parser.lazy_parse(blob)  # returns an iterator of documents

    for doc in docs_lazy:
        docs.append(doc)
    print(docs[0].page_content[:100])
    print(docs[0].metadata)
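To show what iterating lazily means in the loop above, here is a toy generator that records which pages it has processed. The names toy_lazy_parse and parsed_log are hypothetical, and the sketch assumes one document per page. It illustrates generator behaviour in general and makes no claim about how any specific parser schedules its work.

```python
from typing import Dict, Iterator, List

parsed_log: List[int] = []  # page numbers that have actually been processed


def toy_lazy_parse(pages: List[str]) -> Iterator[Dict]:
    """Yield one toy document per page, doing the work only when asked."""
    for number, text in enumerate(pages):
        parsed_log.append(number)
        yield {"page_content": text, "metadata": {"page": number}}


pages = ["first page", "second page", "third page"]
docs_lazy = toy_lazy_parse(pages)

first = next(docs_lazy)                # only the first page is processed here
processed_after_first = list(parsed_log)

docs = [first, *docs_lazy]             # draining the iterator processes the rest
```

Nothing is processed until the iterator is consumed, which is why the loop in the example collects documents into a list before indexing into it.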
Parse a blob from a PDF using PyPDFium2 library.
This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs, extracting images, and defining extraction mode. It integrates the 'PyPDFium2' library for PDF processing and offers synchronous blob parsing.
Examples:

Setup:

.. code-block:: bash

    pip install -U langchain-community pypdfium2
Load a blob from a PDF file:

.. code-block:: python

    from langchain_core.documents.base import Blob

    blob = Blob.from_path("./example_data/layout-parser-paper.pdf")
Instantiate the parser:

.. code-block:: python

    from langchain_community.document_loaders.parsers import PyPDFium2Parser

    parser = PyPDFium2Parser(
        # password=None,
        mode="page",
        pages_delimiter="\n",
        # extract_images=True,
        # images_to_text=convert_images_to_text_with_tesseract(),
    )
Lazily parse the blob:

.. code-block:: python

    docs = []
    docs_lazy = parser.lazy_parse(blob)  # returns an iterator of documents

    for doc in docs_lazy:
        docs.append(doc)
    print(docs[0].page_content[:100])
    print(docs[0].metadata)
Parse a blob from a PDF using pypdf library.
This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs and extracting images. It integrates the 'pypdf' library for PDF processing and offers synchronous blob parsing.
Examples:

Setup:

.. code-block:: bash

    pip install -U langchain-community pypdf
Load a blob from a PDF file:

.. code-block:: python

    from langchain_core.documents.base import Blob

    blob = Blob.from_path("./example_data/layout-parser-paper.pdf")
Instantiate the parser:

.. code-block:: python

    from langchain_community.document_loaders.parsers import PyPDFParser

    parser = PyPDFParser(
        # password=None,
        mode="single",
        pages_delimiter="\n",
        # images_parser=TesseractBlobParser(),
    )
Lazily parse the blob:

.. code-block:: python

    docs = []
    docs_lazy = parser.lazy_parse(blob)  # returns an iterator of documents

    for doc in docs_lazy:
        docs.append(doc)
    print(docs[0].page_content[:100])
    print(docs[0].metadata)
Parser for vsdx files.
Google Cloud Document AI parser.
For a detailed explanation of Document AI, refer to the product documentation: https://cloud.google.com/document-ai/docs/overview
Module contains common parsers for PDFs.
Code for generic / auxiliary parsers.
This module contains some logic to help assemble more sophisticated parsers.
Module contains a PDF parser based on Document AI from Google Cloud.
You need to install two libraries to use this parser:
pip install google-cloud-documentai
pip install google-cloud-documentai-toolbox
Module for parsing text files.
Module includes a registry of default parser configurations.