Each top-level function and class in the code is loaded into its own document. One additional document holds the remaining top-level code, with the already segmented functions and classes excluded.

This approach can potentially improve the accuracy of QA models over source code.
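
The segmentation described above can be sketched for Python sources with the standard-library ``ast`` module. This is only an illustration of the idea, not LangChain's implementation; ``split_top_level`` and ``SAMPLE`` are hypothetical names introduced here.

.. code-block:: python

    import ast

    def split_top_level(source: str) -> list[str]:
        """Return one string per top-level function/class, then one for the rest."""
        tree = ast.parse(source)
        lines = source.splitlines()
        remaining = list(lines)
        segments = []
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                # Decorators sit above the def/class line, so include them.
                first = min([node.lineno] + [d.lineno for d in node.decorator_list])
                start, end = first - 1, node.end_lineno
                segments.append("\n".join(lines[start:end]))
                # Blank out the segmented lines in the leftover copy.
                for i in range(start, end):
                    remaining[i] = None
        # The extra segment: top-level code minus the extracted definitions.
        segments.append("\n".join(l for l in remaining if l is not None))
        return segments

    SAMPLE = '''import os

    def greet(name):
        return "hi " + name

    class Box:
        size = 1

    print(greet("x"))
    '''

    segments = split_top_level(SAMPLE)

Here ``segments`` holds three strings: the ``greet`` function, the ``Box`` class, and the leftover import and ``print`` call.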

The supported languages for code parsing are:

- C: "c" (*)
- C++: "cpp" (*)
- C#: "csharp" (*)
- COBOL: "cobol"
- Elixir: "elixir"
- Go: "go" (*)
- Java: "java" (*)
- JavaScript: "js" (requires package esprima)
- Kotlin: "kotlin" (*)
- Lua: "lua" (*)
- Perl: "perl" (*)
- Python: "python"
- Ruby: "ruby" (*)
- Rust: "rust" (*)
- Scala: "scala" (*)
- SQL: "sql" (*)
- TypeScript: "ts" (*)

Items marked with (*) require the packages tree_sitter and tree_sitter_languages. It is straightforward to add support for additional languages using tree_sitter, although this currently requires modifying LangChain.

The parsing language (``language``) and the minimum number of lines required to activate syntax-based splitting (``parser_threshold``) are both configurable.

If a language is not explicitly specified, LanguageParser will infer one from filename extensions, if present.
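
Extension-based inference can be pictured as a simple lookup. The sketch below is a hypothetical stand-in: the ``EXTENSION_TO_LANGUAGE`` table and ``infer_language`` helper are assumptions made for illustration, and the mapping actually used inside LanguageParser may differ.

.. code-block:: python

    from pathlib import Path
    from typing import Optional

    # Hypothetical mapping for illustration only; values reuse the
    # language identifiers from the list of supported languages.
    EXTENSION_TO_LANGUAGE = {
        ".py": "python",
        ".js": "js",
        ".go": "go",
        ".rs": "rust",
    }

    def infer_language(filename: str) -> Optional[str]:
        """Return a language identifier for the extension, or None if unknown."""
        return EXTENSION_TO_LANGUAGE.get(Path(filename).suffix.lower())

A file without an extension, such as ``Makefile``, has an empty suffix and therefore maps to ``None`` in this sketch.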

Examples:

.. code-block:: python

    from langchain_community.document_loaders.generic import GenericLoader
    from langchain_community.document_loaders.parsers import LanguageParser

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py", ".js"],
        parser=LanguageParser()
    )
    docs = loader.load()

Example instantiation that selects the language manually:

.. code-block:: python

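    from langchain_community.document_loaders.generic import GenericLoader
    from langchain_community.document_loaders.parsers import LanguageParser

    # Force Python parsing instead of inferring the language from the extension.
    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(language="python"),
    )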

Example instantiation that sets the line-count threshold:

.. code-block:: python

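    from langchain_community.document_loaders.generic import GenericLoader
    from langchain_community.document_loaders.parsers import LanguageParser

    # Skip syntax-based splitting for files shorter than the 200-line threshold.
    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(parser_threshold=200),
    )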
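
The effect of the threshold can be sketched as a line-count check. ``should_split`` is a hypothetical helper, and the sketch assumes a file is split only when it has more lines than the threshold; the exact boundary comparison used by LanguageParser is not stated here.

.. code-block:: python

    def should_split(source: str, parser_threshold: int = 0) -> bool:
        """Return True when the file is long enough for syntax-based splitting."""
        # Assumption: strictly more lines than the threshold activates splitting.
        return len(source.splitlines()) > parser_threshold

Under this assumption, a short file stays in one document while a longer one is segmented.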

class

PDFMinerParser

Parse a blob from a PDF using the pdfminer.six library.

This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs, extracting images, and defining extraction mode. It integrates the 'pdfminer.six' library for PDF processing and offers synchronous blob parsing.

Examples:

   Setup:

   .. code-block:: bash

       pip install -U langchain-community pdfminer.six pillow

   Load a blob from a PDF file:

   .. code-block:: python

       from langchain_core.documents.base import Blob

       blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

   Instantiate the parser:

   .. code-block:: python

       from langchain_community.document_loaders.parsers import PDFMinerParser

       parser = PDFMinerParser(
           # password = None,
           mode = "single",
           pages_delimiter = "\n\f",
           # extract_images = True,
           # images_to_text = convert_images_to_text_with_tesseract(),
       )

   Lazily parse the blob:

   .. code-block:: python

       docs = []
       docs_lazy = parser.lazy_parse(blob)

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)
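
As a rough illustration of what ``mode`` and ``pages_delimiter`` control, the sketch below assumes that ``"page"`` yields one document per page and that ``"single"`` joins all pages into one document separated by ``pages_delimiter``. ``assemble`` is a hypothetical helper, not part of LangChain or pdfminer.six, and it works on plain strings rather than real PDF pages.

.. code-block:: python

    def assemble(pages: list[str], mode: str = "single",
                 pages_delimiter: str = "\n\f") -> list[str]:
        """Group per-page text according to the extraction mode."""
        if mode == "page":
            # One output entry per page.
            return list(pages)
        if mode == "single":
            # One output entry for the whole file, pages joined by the delimiter.
            return [pages_delimiter.join(pages)]
        raise ValueError(f"unsupported mode: {mode!r}")

With two pages, ``"single"`` produces one string containing the delimiter once, and ``"page"`` produces two strings.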

class

PDFPlumberParser

Parse a PDF with PDFPlumber.

class

PyMuPDFParser

Parse a blob from a PDF using the PyMuPDF library.

This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs, extracting images, and defining extraction mode. It integrates the 'PyMuPDF' library for PDF processing and offers synchronous blob parsing.

Examples:

   Setup:

   .. code-block:: bash

       pip install -U langchain-community pymupdf

   Load a blob from a PDF file:

   .. code-block:: python

       from langchain_core.documents.base import Blob

       blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

   Instantiate the parser:

   .. code-block:: python

       from langchain_community.document_loaders.parsers import PyMuPDFParser

       parser = PyMuPDFParser(
           # password = None,
           mode = "single",
           pages_delimiter = "\n\f",
           # images_parser = TesseractBlobParser(),
           # extract_tables="markdown",
           # extract_tables_settings=None,
           # text_kwargs=None,
       )

   Lazily parse the blob:

   .. code-block:: python

       docs = []
       docs_lazy = parser.lazy_parse(blob)

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)

class

PyPDFium2Parser

Parse a blob from a PDF using the PyPDFium2 library.

This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs, extracting images, and defining extraction mode. It integrates the 'PyPDFium2' library for PDF processing and offers synchronous blob parsing.

Examples:

   Setup:

   .. code-block:: bash

       pip install -U langchain-community pypdfium2

   Load a blob from a PDF file:

   .. code-block:: python

       from langchain_core.documents.base import Blob

       blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

   Instantiate the parser:

   .. code-block:: python

       from langchain_community.document_loaders.parsers import PyPDFium2Parser

       parser = PyPDFium2Parser(
           # password=None,
           mode="page",
           pages_delimiter="\n\f",
           # extract_images = True,
           # images_to_text = convert_images_to_text_with_tesseract(),
       )

   Lazily parse the blob:

   .. code-block:: python

       docs = []
       docs_lazy = parser.lazy_parse(blob)

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)
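
Because lazy parsing returns an iterator, a caller can stop early without processing the rest of the input. The sketch below shows this with a plain generator; ``fake_lazy_parse`` is a hypothetical stand-in that yields strings, not the parser's real ``lazy_parse`` method, and it assumes one item is produced per page.

.. code-block:: python

    from itertools import islice
    from typing import Iterator

    parsed_pages = []  # records which pages were actually processed

    def fake_lazy_parse(page_count: int) -> Iterator[str]:
        """Hypothetical stand-in for a lazy parser: yields one item per page."""
        for number in range(page_count):
            parsed_pages.append(number)
            yield f"page {number}"

    # Take only the first two items; later pages are never touched.
    first_two = list(islice(fake_lazy_parse(100), 2))

After this runs, ``parsed_pages`` contains only the two pages that were requested.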

class

PyPDFParser

Parse a blob from a PDF using the pypdf library.

This class provides methods to parse a blob from a PDF document, supporting various configurations such as handling password-protected PDFs and extracting images. It integrates the 'pypdf' library for PDF processing and offers synchronous blob parsing.

Examples:

   Setup:

   .. code-block:: bash

       pip install -U langchain-community pypdf

   Load a blob from a PDF file:

   .. code-block:: python

       from langchain_core.documents.base import Blob

       blob = Blob.from_path("./example_data/layout-parser-paper.pdf")

   Instantiate the parser:

   .. code-block:: python

       from langchain_community.document_loaders.parsers import PyPDFParser

       parser = PyPDFParser(
           # password = None,
           mode = "single",
           pages_delimiter = "\n\f",
           # images_parser = TesseractBlobParser(),
       )

   Lazily parse the blob:

   .. code-block:: python

       docs = []
       docs_lazy = parser.lazy_parse(blob)

       for doc in docs_lazy:
           docs.append(doc)
       print(docs[0].page_content[:100])
       print(docs[0].metadata)

class

VsdxParser

Parser for vsdx files.

deprecated class

DocAIParser

Google Cloud Document AI parser.