Parse using the respective programming language syntax.
Each top-level function and class in the code is loaded into its own document.
An additional document holds the remaining top-level code, with the
already segmented functions and classes removed.
This approach can potentially improve the accuracy of QA models over source code.
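The segmentation described above can be sketched for Python sources with the standard ``ast`` module. This is a minimal illustration, not the parser's actual implementation: ``split_top_level`` is a hypothetical helper, and the real segmenters may treat decorators, comments, and the remainder differently.

```python
from __future__ import annotations

import ast


def split_top_level(source: str) -> tuple[list[str], str]:
    """Return (segments, remainder) for a Python source string.

    Hypothetical helper for illustration only.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    segments: list[str] = []
    used: set[int] = set()  # 0-based indices of lines that belong to a segment

    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it stays with its definition.
            first = min([node.lineno] + [d.lineno for d in node.decorator_list])
            start, end = first - 1, node.end_lineno
            segments.append("\n".join(lines[start:end]))
            used.update(range(start, end))

    # Everything that is not part of a function or class forms the extra document.
    remainder = "\n".join(line for i, line in enumerate(lines) if i not in used)
    return segments, remainder


source = '''import os

def greet(name):
    return "hello " + name

class Greeter:
    def run(self):
        return greet("world")

print(Greeter().run())
'''

segments, remainder = split_top_level(source)
```

Here ``segments`` holds one string per top-level function or class, and ``remainder`` keeps the import and the final ``print`` call.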
The supported languages for code parsing are:
- C: "c" (*)
- C++: "cpp" (*)
- C#: "csharp" (*)
- COBOL: "cobol"
- Elixir: "elixir"
- Go: "go" (*)
- Java: "java" (*)
- JavaScript: "js" (requires package esprima)
- Kotlin: "kotlin" (*)
- Lua: "lua" (*)
- Perl: "perl" (*)
- Python: "python"
- Ruby: "ruby" (*)
- Rust: "rust" (*)
- Scala: "scala" (*)
- SQL: "sql" (*)
- TypeScript: "ts" (*)
Items marked with (*) require the packages tree_sitter and
tree_sitter_languages. It is straightforward to add support for additional
languages using tree_sitter, although this currently requires modifying LangChain.
The parsing language can be set with ``language``, and ``parser_threshold``
sets the minimum number of lines a file must have before syntax-based
splitting is applied.
If no language is specified, LanguageParser infers one from the
filename extension, if present.
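As a rough sketch of extension-based inference, the lookup below prefers an explicitly given language and otherwise falls back to the file suffix. The table ``EXTENSION_TO_LANGUAGE`` and the function ``infer_language`` are assumptions made for illustration; the mapping LangChain actually uses may contain different entries.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension table; the values are language strings from the
# list above, but the real mapping may differ.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "js",
    ".ts": "ts",
    ".go": "go",
    ".rs": "rust",
    ".rb": "ruby",
}


def infer_language(path: str, explicit: Optional[str] = None) -> Optional[str]:
    """Prefer an explicit language; otherwise fall back to the file extension."""
    if explicit is not None:
        return explicit
    suffix = Path(path).suffix.lower()
    # Returns None when the file has no extension or the extension is unknown.
    return EXTENSION_TO_LANGUAGE.get(suffix)
```

In this sketch a file without a recognised extension yields ``None``, which a caller could treat as "do not segment".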
Examples:

.. code-block:: python

    from langchain_community.document_loaders.generic import GenericLoader
    from langchain_community.document_loaders.parsers import LanguageParser

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py", ".js"],
        parser=LanguageParser(),
    )
    docs = loader.load()

Example instantiation that selects the language manually:

.. code-block:: python

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(language="python"),
    )

Example instantiation that sets the line-count threshold:

.. code-block:: python

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(parser_threshold=200),
    )
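
The effect of a line threshold can be illustrated with a small standalone check. This is a sketch under assumptions: ``should_segment`` is a hypothetical function, and both the default of ``0`` and the exact boundary behaviour (strictly more lines than the threshold) are guesses rather than confirmed details of LanguageParser.

```python
def should_segment(source: str, parser_threshold: int = 0) -> bool:
    """Return True when the source is long enough to be split by syntax.

    Assumption: splitting happens only when the file has more lines than
    the threshold; shorter files are kept as a single document.
    """
    return len(source.splitlines()) > parser_threshold


short_file = "x = 1\ny = 2\n"
long_file = "\n".join(f"v{i} = {i}" for i in range(300))
```

With a threshold of 200, ``short_file`` would stay whole while ``long_file`` would be eligible for splitting.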