Text Splitters are classes for splitting text.
MarkdownHeaderTextSplitter and HTMLHeaderTextSplitter do not derive from
TextSplitter.
Split incoming text and return chunks using a tokenizer.
Enum of the programming languages.
Interface for splitting text into chunks.
Tokenizer data class.
Splitting text into tokens using a model tokenizer.
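Token-based splitting of the kind described in the entries above can be pictured as a sliding window over a token sequence. The following is a minimal sketch, not the library's implementation: the function name split_on_tokens and its parameters are hypothetical, and a whitespace "tokenizer" (str.split and " ".join) stands in for a real model tokenizer, which would normally produce integer IDs.

```python
from typing import Callable, List


def split_on_tokens(
    text: str,
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
    tokens_per_chunk: int,
    chunk_overlap: int,
) -> List[str]:
    """Cut text into windows of tokens_per_chunk tokens that overlap by chunk_overlap.

    Hypothetical helper for illustration; encode/decode are supplied by the caller.
    """
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be in [0, tokens_per_chunk)")
    tokens = encode(text)
    step = tokens_per_chunk - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + tokens_per_chunk]))
        if start + tokens_per_chunk >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


# Whitespace tokenizer used purely as a stand-in for a model tokenizer.
example_chunks = split_on_tokens("a b c d e f g", str.split, " ".join, 3, 1)
```

With a window of three tokens and an overlap of one, consecutive chunks share their boundary token, which is the usual reason for configuring an overlap at all.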
Splitting text that looks at characters.
Splitting text by recursively looking at characters.
Recursively tries to split by different characters to find one that works.
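The idea of recursively trying different separators can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the RecursiveCharacterTextSplitter code: the name recursive_split is hypothetical, separators are dropped from the output, and small pieces are not merged back together up to the chunk size as a full splitter would typically do.

```python
from typing import List


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split text into pieces of at most chunk_size characters.

    Tries the first separator; any piece that is still too long is split
    again with the remaining separators. Hypothetical helper for illustration.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-size slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur, so try the next, finer one.
        return recursive_split(text, chunk_size, rest)

    chunks: List[str] = []
    for piece in text.split(sep):
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks


pieces = recursive_split("alpha beta\n\ngamma delta epsilon", 12, ["\n\n", " "])
```

The first paragraph fits within 12 characters and is kept whole, while the second is too long and is broken down further on spaces.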
Element type as typed dict.
Split HTML content into structured Documents based on specified headers.
Splits HTML content by detecting specified header tags and creating hierarchical
Document objects that reflect the semantic structure of the original content. For
each identified section, the splitter associates the extracted text with metadata
corresponding to the encountered headers.
If no specified headers are found, the entire content is returned as a single
Document. This allows for flexible handling of HTML input, ensuring that
information is organized according to its semantic headers.
The splitter provides the option to return each HTML element as a separate
Document or aggregate them into semantically meaningful chunks. It also
gracefully handles multiple levels of nested headers, creating a rich,
hierarchical representation of the content.
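The header-based behavior described above (text associated with the headers encountered so far, and a single section when no headers match) can be approximated with the standard library's html.parser. This is a rough sketch only, not the HTMLHeaderTextSplitter implementation: HeaderSectionParser and split_html_by_headers are hypothetical names, and sections are returned as (text, metadata) tuples rather than Document objects.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


class HeaderSectionParser(HTMLParser):
    """Collect text between header tags, tracking the active header hierarchy."""

    def __init__(self, headers: List[Tuple[str, str]]):
        super().__init__()
        self.names = dict(headers)                # tag -> metadata key
        self.order = [tag for tag, _ in headers]  # shallowest header first
        self.active: Dict[str, str] = {}          # metadata for the current section
        self.in_header: Optional[str] = None      # header tag currently open
        self.buffer: List[str] = []
        self.sections: List[Tuple[str, Dict[str, str]]] = []

    def flush(self) -> None:
        # Normalize whitespace and emit the buffered text as one section.
        text = " ".join(" ".join(self.buffer).split())
        if text:
            self.sections.append((text, dict(self.active)))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.names:
            self.flush()
            self.in_header = tag
            # A new header closes headers at the same or a deeper level.
            for deeper in self.order[self.order.index(tag):]:
                self.active.pop(self.names[deeper], None)

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        if self.in_header:
            key = self.names[self.in_header]
            self.active[key] = self.active.get(key, "") + data.strip()
        else:
            self.buffer.append(data)


def split_html_by_headers(html: str, headers: List[Tuple[str, str]]):
    parser = HeaderSectionParser(headers)
    parser.feed(html)
    parser.close()
    parser.flush()  # emit whatever text follows the last header
    return parser.sections


sample = "<h1>Intro</h1><p>Hello</p><h2>Details</h2><p>More text</p><h1>End</h1><p>Bye</p>"
sections = split_html_by_headers(sample, [("h1", "Header 1"), ("h2", "Header 2")])
```

Each section carries the headers that were in effect when its text appeared, so the nested "Details" section keeps both its own header and its parent's.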
Splitting HTML files based on specified tag and font sizes.
Requires lxml package.
Split HTML content preserving semantic structure.
Splits HTML content by headers into generalized chunks, preserving semantic
structure. If chunks exceed the maximum chunk size, it uses
RecursiveCharacterTextSplitter for further splitting.
The splitter preserves full HTML elements and converts links to Markdown-like links. It can also preserve images, videos, and audio elements by converting them into Markdown format. Note that some chunks may exceed the maximum size to maintain semantic integrity.
Splits JSON data into smaller, structured chunks while preserving hierarchy.
This class provides methods to split JSON data into smaller dictionaries or JSON-formatted strings based on configurable maximum and minimum chunk sizes. It supports nested JSON structures, optionally converts lists into dictionaries for better chunking, and allows the creation of document objects for further use.
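Splitting nested JSON while keeping each value under its original key path might look roughly like the sketch below. It is an illustration under simplifying assumptions rather than the library's algorithm: split_json and set_path are hypothetical names, only a maximum size is enforced (no minimum), lists are treated as indivisible leaves, and a single oversized leaf still becomes its own chunk.

```python
import copy
import json
from typing import Any, Dict, List


def set_path(target: Dict[str, Any], path: List[str], value: Any) -> None:
    """Write value into target at the nested key path, creating dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json(data: Dict[str, Any], max_chunk_size: int) -> List[Dict[str, Any]]:
    """Greedily pack leaves into chunks whose JSON text stays within max_chunk_size."""
    if not data:
        return []
    chunks: List[Dict[str, Any]] = [{}]

    def walk(node: Any, path: List[str]) -> None:
        if isinstance(node, dict) and node:
            for key, value in node.items():
                walk(value, path + [key])
            return
        # Leaf value: try adding it to the current chunk under its full path.
        candidate = copy.deepcopy(chunks[-1])
        set_path(candidate, path, node)
        if chunks[-1] and len(json.dumps(candidate)) > max_chunk_size:
            fresh: Dict[str, Any] = {}
            set_path(fresh, path, node)
            chunks.append(fresh)  # start a new chunk that repeats the parent keys
        else:
            chunks[-1] = candidate

    walk(data, [])
    return chunks


parts = split_json({"a": {"x": 1, "y": 2}, "b": {"z": 3}}, max_chunk_size=30)
```

Because every chunk repeats the parent keys of the values it holds, each piece remains a valid, self-describing fragment of the original structure.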
Text splitter that handles React (JSX), Vue, and Svelte code.
This splitter extends RecursiveCharacterTextSplitter to handle React (JSX), Vue,
and Svelte code. The separators it combines include custom component tags
(e.g. <Component, <div). This allows chunks to break at natural boundaries in
React, Vue, and Svelte component code.
Splitting text using Konlpy package.
It is good for splitting Korean text.
Attempts to split the text along Latex-formatted layout elements.
An experimental text splitter for handling Markdown syntax.
This splitter aims to retain the exact whitespace of the original text while
extracting structured metadata, such as headers. It is a re-implementation of the
MarkdownHeaderTextSplitter with notable changes to the approach and additional
features.
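Key features include splitting text on horizontal rules (---) as well as on
headers, and splitting behavior that can be adjusted through the
headers_to_split_on parameter.
Header type as TypedDict.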
Line type as TypedDict.
Splitting markdown files based on specified headers.
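Splitting on specified Markdown headers can be illustrated with a short standard-library sketch. It is not the MarkdownHeaderTextSplitter implementation: split_markdown_by_headers is a hypothetical name, results are (text, metadata) tuples instead of Document objects, and fenced code blocks containing lines that start with # are not treated specially.

```python
from typing import Dict, List, Tuple


def split_markdown_by_headers(
    text: str, headers_to_split_on: List[Tuple[str, str]]
) -> List[Tuple[str, Dict[str, str]]]:
    """Group lines under the headers in effect when they appear."""
    # Check longer prefixes first so "##" is never mistaken for "#".
    headers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    active: Dict[str, Tuple[int, str]] = {}  # metadata key -> (level, title)
    lines: List[str] = []
    sections: List[Tuple[str, Dict[str, str]]] = []

    def flush() -> None:
        body = "\n".join(lines).strip()
        if body:
            sections.append((body, {name: title for name, (_, title) in active.items()}))
        lines.clear()

    for line in text.splitlines():
        stripped = line.strip()
        for prefix, name in headers:
            if stripped.startswith(prefix + " "):
                flush()
                level = len(prefix)
                # A new header closes headers at the same or a deeper level.
                for key in [k for k, (lvl, _) in active.items() if lvl >= level]:
                    del active[key]
                active[name] = (level, stripped[len(prefix):].strip())
                break
        else:
            lines.append(line)  # ordinary content line
    flush()
    return sections


md = "# Title\nintro\n## Sub\nbody\n# Other\nend"
md_sections = split_markdown_by_headers(md, [("#", "Header 1"), ("##", "Header 2")])
```

Header lines themselves are moved into the metadata rather than kept in the chunk text, which is one possible design choice among several.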
Attempts to split the text along Markdown-formatted headings.
Splitting text using NLTK package.
Attempts to split the text along Python syntax.
Splitting text into tokens using a sentence model tokenizer.
Splitting text using Spacy package.
By default, Spacy's en_core_web_sm model is used. Its default max_length is
1000000, the maximum number of characters the model accepts, which can be
increased for large files. For faster but potentially less accurate splitting,
you can use pipeline='sentencizer'.
Character text splitters.
Sentence transformers text splitter.
Python code text splitter.
Markdown text splitters.
Konlpy text splitter.
JSON text splitter.
Latex text splitter.
Spacy text splitter.
HTML text splitters.
NLTK text splitter.
JavaScript framework text splitter.
Text splitter base interface.