Module●Since v0.3

language

Classes

Parse source code using the syntax of the respective programming language.

Each top-level function and class in the code is loaded into its own document. In addition, one extra document is generated that contains the remaining top-level code, with the already segmented functions and classes excluded.

This approach can potentially improve the accuracy of QA models over source code.
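The segmentation described above can be illustrated with a minimal sketch for Python source, built on the standard-library ``ast`` module. It is not LanguageParser's implementation: the helper name ``segment_top_level`` and the sample ``SOURCE`` are hypothetical, and the remainder is produced here simply by dropping the segmented lines.

```python
import ast

# Hypothetical sample input, used only for this illustration.
SOURCE = '''\
import os

def greet(name):
    return f"Hello, {name}"

class Counter:
    def __init__(self):
        self.n = 0

print(greet("world"))
'''


def segment_top_level(source: str) -> tuple[list[str], str]:
    """Return (segments, remainder) for a piece of Python source.

    segments: the source text of each top-level function or class.
    remainder: the source with those definitions removed.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    segments = []
    taken = set()  # 0-based indices of lines that belong to a segment
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it stays with its definition.
            first = min([node.lineno] + [d.lineno for d in node.decorator_list])
            start, end = first - 1, node.end_lineno
            segments.append("\n".join(lines[start:end]))
            taken.update(range(start, end))
    remainder = "\n".join(
        line for i, line in enumerate(lines) if i not in taken
    )
    return segments, remainder


segments, remainder = segment_top_level(SOURCE)
```

Here ``segments`` holds one string for ``greet`` and one for ``Counter``, while ``remainder`` keeps the import and the final ``print`` call. Each of these strings would correspond to one document under the scheme described above.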

The supported languages for code parsing are:

- C: "c" (*)
- C++: "cpp" (*)
- C#: "csharp" (*)
- COBOL: "cobol"
- Elixir: "elixir"
- Go: "go" (*)
- Java: "java" (*)
- JavaScript: "js" (requires package esprima)
- Kotlin: "kotlin" (*)
- Lua: "lua" (*)
- Perl: "perl" (*)
- Python: "python"
- Ruby: "ruby" (*)
- Rust: "rust" (*)
- Scala: "scala" (*)
- SQL: "sql" (*)
- TypeScript: "ts" (*)

Items marked with (*) require the packages tree_sitter and tree_sitter_languages. It is straightforward to add support for additional languages using tree_sitter, although this currently requires modifying LangChain.

The language used for parsing can be configured, along with the minimum number of lines required to activate the splitting based on syntax.

If a language is not explicitly specified, LanguageParser will infer one from filename extensions, if present.
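The two configuration rules above (explicit language versus inference from the extension, and the line-count threshold) can be sketched as follows. Everything named here is an assumption made for illustration: ``EXTENSION_TO_LANGUAGE``, ``resolve_language`` and ``should_split`` are hypothetical and are not part of LangChain, the extension table is not LangChain's own mapping, and whether the threshold boundary is inclusive is also assumed.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension table for illustration only; not LangChain's mapping.
EXTENSION_TO_LANGUAGE = {".py": "python", ".js": "js", ".go": "go", ".rs": "rust"}


def resolve_language(filename: str, language: Optional[str] = None) -> Optional[str]:
    """Prefer an explicitly given language; otherwise infer it from the extension."""
    if language is not None:
        return language
    # Returns None when there is no extension or the extension is unknown.
    return EXTENSION_TO_LANGUAGE.get(Path(filename).suffix.lower())


def should_split(code: str, parser_threshold: int = 0) -> bool:
    """Assumed rule: split by syntax only if the file has at least
    parser_threshold lines; shorter files are kept whole."""
    return len(code.splitlines()) >= parser_threshold
```

With this sketch, ``resolve_language("app/main.py")`` infers ``"python"``, an explicit ``language="js"`` overrides whatever the extension suggests, and a two-line file with ``parser_threshold=200`` would be kept as a single unsplit document.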

Examples:

.. code-block:: python

    from langchain_community.document_loaders.generic import GenericLoader
    from langchain_community.document_loaders.parsers import LanguageParser

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py", ".js"],
        parser=LanguageParser()
    )
    docs = loader.load()

Example instantiation that manually selects the language:

.. code-block:: python

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(language="python")
    )

Example instantiation that sets the line-count threshold:

.. code-block:: python

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(parser_threshold=200)
    )
