langchain-text-splitters¶
Reference documentation for the langchain-text-splitters package.
langchain_text_splitters
¶
Text Splitters are classes for splitting text.
Note
MarkdownHeaderTextSplitter and HTMLHeaderTextSplitter do not derive from
TextSplitter.
| FUNCTION | DESCRIPTION |
|---|---|
| split_text_on_tokens | Split incoming text and return chunks using tokenizer. |
TextSplitter
¶
Bases: BaseDocumentTransformer, ABC
Interface for splitting text into chunks.
| METHOD | DESCRIPTION |
|---|---|
| atransform_documents | Asynchronously transform a list of documents. |
| __init__ | Create a new TextSplitter. |
| split_text | Split text into multiple components. |
| create_documents | Create a list of Document objects from a list of texts. |
| split_documents | Split documents. |
| from_huggingface_tokenizer | Text splitter that uses Hugging Face tokenizer to count length. |
| from_tiktoken_encoder | Text splitter that uses tiktoken encoder to count length. |
| transform_documents | Transform sequence of documents by splitting them. |
atransform_documents
async
¶
__init__
¶
__init__(
chunk_size: int = 4000,
chunk_overlap: int = 200,
length_function: Callable[[str], int] = len,
keep_separator: bool | Literal["start", "end"] = False,
add_start_index: bool = False,
strip_whitespace: bool = True,
) -> None
Create a new TextSplitter.
| PARAMETER | DESCRIPTION |
|---|---|
| chunk_size | Maximum size of chunks to return. TYPE: int DEFAULT: 4000 |
| chunk_overlap | Overlap in characters between chunks. TYPE: int DEFAULT: 200 |
| length_function | Function that measures the length of given chunks. TYPE: Callable[[str], int] DEFAULT: len |
| keep_separator | Whether to keep the separator and where to place it in each corresponding chunk. TYPE: bool \| Literal["start", "end"] DEFAULT: False |
| add_start_index | If True, includes each chunk's start index in its metadata. TYPE: bool DEFAULT: False |
| strip_whitespace | If True, strips whitespace from the start and end of every document. TYPE: bool DEFAULT: True |
create_documents
¶
create_documents(
texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]
Create a list of Document objects from a list of texts.
from_huggingface_tokenizer
classmethod
¶
from_huggingface_tokenizer(
tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter
Text splitter that uses Hugging Face tokenizer to count length.
from_tiktoken_encoder
classmethod
¶
from_tiktoken_encoder(
encoding_name: str = "gpt2",
model_name: str | None = None,
allowed_special: Literal["all"] | Set[str] = set(),
disallowed_special: Literal["all"] | Collection[str] = "all",
**kwargs: Any,
) -> Self
Text splitter that uses tiktoken encoder to count length.
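As a hedged illustration of from_tiktoken_encoder, the sketch below builds a concrete splitter whose length function counts tiktoken tokens instead of characters; the encoding name and sizes are illustrative choices, and the tiktoken package is assumed to be installed.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=256,    # interpreted in tokens because of the tiktoken length function
    chunk_overlap=32,
)
chunks = splitter.split_text("Some long document text ...")
```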
Tokenizer
dataclass
¶
Tokenizer data class.
decode
instance-attribute
¶
Function to decode a list of token ids to a string
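A minimal sketch of how the Tokenizer data class and split_text_on_tokens fit together, assuming both names are importable from the package root (they are defined in the base module); a toy whitespace tokenizer stands in for a real one, and the field names shown are assumptions drawn from that module.

```python
from langchain_text_splitters import Tokenizer, split_text_on_tokens

words: list[str] = []

def encode(text: str) -> list[int]:
    # Toy tokenizer: one "token id" per whitespace-separated word.
    words[:] = text.split()
    return list(range(len(words)))

def decode(ids: list[int]) -> str:
    return " ".join(words[i] for i in ids)

tokenizer = Tokenizer(
    chunk_overlap=2,      # token ids shared between consecutive chunks
    tokens_per_chunk=10,  # maximum token ids per chunk
    decode=decode,
    encode=encode,
)
chunks = split_text_on_tokens(text="one two three four five ...", tokenizer=tokenizer)
```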
TokenTextSplitter
¶
Bases: TextSplitter
Splitting text to tokens using model tokenizer.
| METHOD | DESCRIPTION |
|---|---|
| transform_documents | Transform sequence of documents by splitting them. |
| atransform_documents | Asynchronously transform a list of documents. |
| create_documents | Create a list of Document objects from a list of texts. |
| split_documents | Split documents. |
| from_huggingface_tokenizer | Text splitter that uses Hugging Face tokenizer to count length. |
| from_tiktoken_encoder | Text splitter that uses tiktoken encoder to count length. |
| __init__ | Create a new TextSplitter. |
| split_text | Splits the input text into smaller chunks based on tokenization. |
transform_documents
¶
Transform sequence of documents by splitting them.
atransform_documents
async
¶
create_documents
¶
create_documents(
texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]
Create a list of Document objects from a list of texts.
from_huggingface_tokenizer
classmethod
¶
from_huggingface_tokenizer(
tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter
Text splitter that uses Hugging Face tokenizer to count length.
from_tiktoken_encoder
classmethod
¶
from_tiktoken_encoder(
encoding_name: str = "gpt2",
model_name: str | None = None,
allowed_special: Literal["all"] | Set[str] = set(),
disallowed_special: Literal["all"] | Collection[str] = "all",
**kwargs: Any,
) -> Self
Text splitter that uses tiktoken encoder to count length.
__init__
¶
__init__(
encoding_name: str = "gpt2",
model_name: str | None = None,
allowed_special: Literal["all"] | Set[str] = set(),
disallowed_special: Literal["all"] | Collection[str] = "all",
**kwargs: Any,
) -> None
Create a new TextSplitter.
split_text
¶
Splits the input text into smaller chunks based on tokenization.
This method uses a custom tokenizer configuration to encode the input text
into tokens, processes the tokens in chunks of a specified size with overlap,
and decodes them back into text chunks. The splitting is performed using the
split_text_on_tokens function.
| PARAMETER | DESCRIPTION |
|---|---|
| text | The input text to be split into smaller chunks. TYPE: str |
| RETURNS | DESCRIPTION |
|---|---|
| list[str] | A list of text chunks, where each chunk is derived from a portion of the input text based on the tokenization and chunking rules. |
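A hedged usage sketch for TokenTextSplitter.split_text; the encoding name and chunk sizes are illustrative and the tiktoken package is assumed to be installed.

```python
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    encoding_name="cl100k_base",
    chunk_size=100,    # tokens per chunk (inherited TextSplitter parameter)
    chunk_overlap=10,  # tokens shared between adjacent chunks
)
chunks = splitter.split_text("A long passage of text to split into token-sized pieces ...")
```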
CharacterTextSplitter
¶
Bases: TextSplitter
Splitting text by looking at characters.
| METHOD | DESCRIPTION |
|---|---|
| transform_documents | Transform sequence of documents by splitting them. |
| atransform_documents | Asynchronously transform a list of documents. |
| create_documents | Create a list of Document objects from a list of texts. |
| split_documents | Split documents. |
| from_huggingface_tokenizer | Text splitter that uses Hugging Face tokenizer to count length. |
| from_tiktoken_encoder | Text splitter that uses tiktoken encoder to count length. |
| __init__ | Create a new TextSplitter. |
| split_text | Split into chunks without re-inserting lookaround separators. |
transform_documents
¶
Transform sequence of documents by splitting them.
atransform_documents
async
¶
create_documents
¶
create_documents(
texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]
Create a list of Document objects from a list of texts.
from_huggingface_tokenizer
classmethod
¶
from_huggingface_tokenizer(
tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter
Text splitter that uses Hugging Face tokenizer to count length.
from_tiktoken_encoder
classmethod
¶
from_tiktoken_encoder(
encoding_name: str = "gpt2",
model_name: str | None = None,
allowed_special: Literal["all"] | Set[str] = set(),
disallowed_special: Literal["all"] | Collection[str] = "all",
**kwargs: Any,
) -> Self
Text splitter that uses tiktoken encoder to count length.
__init__
¶
Create a new TextSplitter.
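A short, hedged sketch of constructing and using CharacterTextSplitter; the separator and is_separator_regex arguments reflect the class's usual constructor and are assumptions, since the signature is not reproduced above.

```python
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n\n",          # assumed constructor argument: split on blank lines
    chunk_size=1000,
    chunk_overlap=100,
    is_separator_regex=False,  # assumed constructor argument
)
docs = splitter.create_documents(
    ["First paragraph.\n\nSecond paragraph.\n\nThird paragraph."],
    metadatas=[{"source": "example.txt"}],
)
```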
RecursiveCharacterTextSplitter
¶
Bases: TextSplitter
Splitting text by recursively looking at characters.
Recursively tries to split by different characters to find one that works.
| METHOD | DESCRIPTION |
|---|---|
| transform_documents | Transform sequence of documents by splitting them. |
| atransform_documents | Asynchronously transform a list of documents. |
| create_documents | Create a list of Document objects from a list of texts. |
| split_documents | Split documents. |
| from_huggingface_tokenizer | Text splitter that uses Hugging Face tokenizer to count length. |
| from_tiktoken_encoder | Text splitter that uses tiktoken encoder to count length. |
| __init__ | Create a new TextSplitter. |
| split_text | Split the input text into smaller chunks based on predefined separators. |
| from_language | Return an instance of this class based on a specific language. |
| get_separators_for_language | Retrieve a list of separators specific to the given language. |
transform_documents
¶
Transform sequence of documents by splitting them.
atransform_documents
async
¶
create_documents
¶
create_documents(
texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]
Create a list of Document objects from a list of texts.
from_huggingface_tokenizer
classmethod
¶
from_huggingface_tokenizer(
tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter
Text splitter that uses Hugging Face tokenizer to count length.
from_tiktoken_encoder
classmethod
¶
from_tiktoken_encoder(
encoding_name: str = "gpt2",
model_name: str | None = None,
allowed_special: Literal["all"] | Set[str] = set(),
disallowed_special: Literal["all"] | Collection[str] = "all",
**kwargs: Any,
) -> Self
Text splitter that uses tiktoken encoder to count length.
__init__
¶
__init__(
separators: list[str] | None = None,
keep_separator: bool | Literal["start", "end"] = True,
is_separator_regex: bool = False,
**kwargs: Any,
) -> None
Create a new TextSplitter.
split_text
¶
from_language
classmethod
¶
from_language(language: Language, **kwargs: Any) -> RecursiveCharacterTextSplitter
Return an instance of this class based on a specific language.
This method initializes the text splitter with language-specific separators.
| PARAMETER | DESCRIPTION |
|---|---|
| language | The language to configure the text splitter for. TYPE: Language |
| **kwargs | Additional keyword arguments to customize the splitter. TYPE: Any |
| RETURNS | DESCRIPTION |
|---|---|
| RecursiveCharacterTextSplitter | An instance of the text splitter configured for the specified language. |
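For example, a hedged sketch of building a Python-aware splitter via from_language; the chunk sizes and sample code are illustrative.

```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=200,
    chunk_overlap=0,
)
sample = '''
def hello():
    print("hello")

class Greeter:
    def greet(self) -> None:
        hello()
'''
chunks = python_splitter.split_text(sample)
```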
get_separators_for_language
staticmethod
¶
HTMLHeaderTextSplitter
¶
Split HTML content into structured Documents based on specified headers.
Splits HTML content by detecting specified header tags and creating hierarchical
Document objects that reflect the semantic structure of the original content. For
each identified section, the splitter associates the extracted text with metadata
corresponding to the encountered headers.
If no specified headers are found, the entire content is returned as a single
Document. This allows for flexible handling of HTML input, ensuring that
information is organized according to its semantic headers.
The splitter provides the option to return each HTML element as a separate
Document or aggregate them into semantically meaningful chunks. It also
gracefully handles multiple levels of nested headers, creating a rich,
hierarchical representation of the content.
Example
```python
from langchain_text_splitters.html_header_text_splitter import (
    HTMLHeaderTextSplitter,
)

# Define headers for splitting on h1 and h2 tags.
headers_to_split_on = [("h1", "Main Topic"), ("h2", "Sub Topic")]

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    return_each_element=False,
)

html_content = """
<html>
  <body>
    <h1>Introduction</h1>
    <p>Welcome to the introduction section.</p>
    <h2>Background</h2>
    <p>Some background details here.</p>
    <h1>Conclusion</h1>
    <p>Final thoughts.</p>
  </body>
</html>
"""

documents = splitter.split_text(html_content)

# 'documents' now contains Document objects reflecting the hierarchy:
# - Document with metadata={"Main Topic": "Introduction"} and
#   content="Introduction"
# - Document with metadata={"Main Topic": "Introduction"} and
#   content="Welcome to the introduction section."
# - Document with metadata={"Main Topic": "Introduction",
#   "Sub Topic": "Background"} and content="Background"
# - Document with metadata={"Main Topic": "Introduction",
#   "Sub Topic": "Background"} and content="Some background details here."
# - Document with metadata={"Main Topic": "Conclusion"} and
#   content="Conclusion"
# - Document with metadata={"Main Topic": "Conclusion"} and
#   content="Final thoughts."
```
| METHOD | DESCRIPTION |
|---|---|
| __init__ | Initialize with headers to split on. |
| split_text | Split the given text into a list of Document objects. |
| split_text_from_url | Fetch text content from a URL and split it into documents. |
| split_text_from_file | Split HTML content from a file into a list of Document objects. |
__init__
¶
Initialize with headers to split on.
| PARAMETER | DESCRIPTION |
|---|---|
| headers_to_split_on | A list of (header_tag, header_name) tuples defining the headers to split on, e.g. [("h1", "Main Topic"), ("h2", "Sub Topic")]. TYPE: list[tuple[str, str]] |
| return_each_element | If True, return each HTML element as a separate Document instead of aggregating elements into chunks. TYPE: bool DEFAULT: False |
split_text
¶
Split the given text into a list of Document objects.
| PARAMETER | DESCRIPTION |
|---|---|
| text | The HTML text to split. TYPE: str |
| RETURNS | DESCRIPTION |
|---|---|
| list[Document] | A list of split Document objects, each carrying metadata for the headers under which its content appears. |
split_text_from_url
¶
Fetch text content from a URL and split it into documents.
| PARAMETER | DESCRIPTION |
|---|---|
| url | The URL to fetch content from. TYPE: str |
| timeout | Timeout for the request. TYPE: int |
| **kwargs | Additional keyword arguments for the request. TYPE: Any |
| RETURNS | DESCRIPTION |
|---|---|
| list[Document] | A list of split Document objects, each carrying metadata for the headers under which its content appears. |
| RAISES | DESCRIPTION |
|---|---|
| RequestException | If the HTTP request fails. |
split_text_from_file
¶
Split HTML content from a file into a list of Document objects.
| PARAMETER | DESCRIPTION |
|---|---|
| file | A file path or a file-like object containing HTML content. |
| RETURNS | DESCRIPTION |
|---|---|
| list[Document] | A list of split Document objects, each carrying metadata for the headers under which its content appears. |
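A hedged sketch of the URL and file entry points; the URL and file path are placeholders, and split_text_from_url is assumed to rely on the requests package.

```python
from langchain_text_splitters import HTMLHeaderTextSplitter

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Main Topic"), ("h2", "Sub Topic")]
)

# From a URL (raises RequestException if the HTTP request fails).
docs_from_url = splitter.split_text_from_url("https://example.com/page.html")

# From a local file path or an open file-like object.
docs_from_file = splitter.split_text_from_file("page.html")
```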
HTMLSectionSplitter
¶
Splitting HTML files based on specified tag and font sizes.
Requires lxml package.
| METHOD | DESCRIPTION |
|---|---|
| __init__ | Create a new HTMLSectionSplitter. |
| split_documents | Split documents. |
| split_text | Split HTML text string. |
| create_documents | Create a list of Document objects from a list of texts. |
| split_html_by_headers | Split an HTML document into sections based on specified header tags. |
| convert_possible_tags_to_header | Convert specific HTML tags to headers using an XSLT transformation. |
| split_text_from_file | Split HTML content from a file into a list of Document objects. |
__init__
¶
Create a new HTMLSectionSplitter.
| PARAMETER | DESCRIPTION |
|---|---|
| headers_to_split_on | List of tuples of headers we want to track, mapped to (arbitrary) keys for metadata. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")]. TYPE: list[tuple[str, str]] |
| **kwargs | Additional optional arguments for customizations. TYPE: Any |
create_documents
¶
create_documents(
texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]
Create a list of Document objects from a list of texts.
split_html_by_headers
¶
Split an HTML document into sections based on specified header tags.
This method uses BeautifulSoup to parse the HTML content and divides it into
sections based on headers defined in headers_to_split_on. Each section
contains the header text, content under the header, and the tag name.
| PARAMETER | DESCRIPTION |
|---|---|
| html_doc | The HTML document to be split into sections. TYPE: str |
| RETURNS | DESCRIPTION |
|---|---|
| list[dict[str, str \| None]] | A list of dictionaries representing sections. Each dictionary contains the header text, the content under that header, and the header's tag name. |
convert_possible_tags_to_header
¶
Convert specific HTML tags to headers using an XSLT transformation.
This method uses an XSLT file to transform the HTML content, converting certain tags into headers for easier parsing. If no XSLT path is provided, the HTML content is returned unchanged.
| PARAMETER | DESCRIPTION |
|---|---|
| html_content | The HTML content to be transformed. TYPE: str |
| RETURNS | DESCRIPTION |
|---|---|
| str | The transformed HTML content as a string. |
split_text_from_file
¶
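A minimal sketch of HTMLSectionSplitter, assuming the lxml dependency is installed; the HTML snippet is illustrative only.

```python
from langchain_text_splitters import HTMLSectionSplitter

splitter = HTMLSectionSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")]
)
html = """
<html><body>
  <h1>Overview</h1><p>Intro text.</p>
  <h2>Details</h2><p>More text.</p>
</body></html>
"""
section_docs = splitter.split_text(html)
```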
HTMLSemanticPreservingSplitter
¶
Bases: BaseDocumentTransformer
Split HTML content preserving semantic structure.
Splits HTML content by headers into generalized chunks, preserving semantic structure. If chunks exceed the maximum chunk size, it uses RecursiveCharacterTextSplitter for further splitting.
The splitter preserves full HTML elements and converts links to Markdown-like links. It can also preserve images, videos, and audio elements by converting them into Markdown format. Note that some chunks may exceed the maximum size to maintain semantic integrity.
Added in version 0.3.5
Example
```python
from langchain_text_splitters.html import HTMLSemanticPreservingSplitter

def custom_iframe_extractor(iframe_tag):
    """Custom handler to extract the 'src' attribute from an <iframe> tag.

    Converts the iframe to a Markdown-like link: [iframe:<src>](src).

    Args:
        iframe_tag (bs4.element.Tag): The <iframe> tag to be processed.

    Returns:
        str: A formatted string representing the iframe in Markdown-like format.
    """
    iframe_src = iframe_tag.get('src', '')
    return f"[iframe:{iframe_src}]({iframe_src})"

text_splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
    max_chunk_size=500,
    preserve_links=True,
    preserve_images=True,
    custom_handlers={"iframe": custom_iframe_extractor},
)
```
| METHOD | DESCRIPTION |
|---|---|
| atransform_documents | Asynchronously transform a list of documents. |
| __init__ | Initialize splitter. |
| split_text | Splits the provided HTML text into smaller chunks based on the configuration. |
| transform_documents | Transform sequence of documents by splitting them. |
atransform_documents
async
¶
__init__
¶
__init__(
headers_to_split_on: list[tuple[str, str]],
*,
max_chunk_size: int = 1000,
chunk_overlap: int = 0,
separators: list[str] | None = None,
elements_to_preserve: list[str] | None = None,
preserve_links: bool = False,
preserve_images: bool = False,
preserve_videos: bool = False,
preserve_audio: bool = False,
custom_handlers: dict[str, Callable[[Tag], str]] | None = None,
stopword_removal: bool = False,
stopword_lang: str = "english",
normalize_text: bool = False,
external_metadata: dict[str, str] | None = None,
allowlist_tags: list[str] | None = None,
denylist_tags: list[str] | None = None,
preserve_parent_metadata: bool = False,
keep_separator: bool | Literal["start", "end"] = True,
) -> None
Initialize splitter.
| PARAMETER | DESCRIPTION |
|---|---|
| headers_to_split_on | HTML headers (e.g., "h1", "h2") that define content sections. TYPE: list[tuple[str, str]] |
| max_chunk_size | Maximum size for each chunk, with allowance for exceeding this limit to preserve semantics. TYPE: int DEFAULT: 1000 |
| chunk_overlap | Number of characters to overlap between chunks to ensure contextual continuity. TYPE: int DEFAULT: 0 |
| separators | Delimiters used by RecursiveCharacterTextSplitter for further splitting. TYPE: list[str] \| None DEFAULT: None |
| elements_to_preserve | HTML tags (e.g., tables, lists) to remain intact during splitting. TYPE: list[str] \| None DEFAULT: None |
| preserve_links | Converts <a> tags to Markdown-style links. TYPE: bool DEFAULT: False |
| preserve_images | Converts <img> tags to Markdown-style image links. TYPE: bool DEFAULT: False |
| preserve_videos | Converts <video> tags to Markdown-style video links. TYPE: bool DEFAULT: False |
| preserve_audio | Converts <audio> tags to Markdown-style audio links. TYPE: bool DEFAULT: False |
| custom_handlers | Optional custom handlers for specific HTML tags, allowing tailored extraction or processing. TYPE: dict[str, Callable[[Tag], str]] \| None DEFAULT: None |
| stopword_removal | Optionally remove stopwords from the text. TYPE: bool DEFAULT: False |
| stopword_lang | The language of stopwords to remove. TYPE: str DEFAULT: "english" |
| normalize_text | Optionally normalize text (e.g., lowercasing, removing punctuation). TYPE: bool DEFAULT: False |
| external_metadata | Additional metadata to attach to the Document objects. TYPE: dict[str, str] \| None DEFAULT: None |
| allowlist_tags | Only these tags will be retained in the HTML. TYPE: list[str] \| None DEFAULT: None |
| denylist_tags | These tags will be removed from the HTML. TYPE: list[str] \| None DEFAULT: None |
| preserve_parent_metadata | Whether to pass through parent document metadata to split documents when calling transform_documents / atransform_documents. TYPE: bool DEFAULT: False |
| keep_separator | Whether separators should be at the beginning of a chunk, at the end, or not at all. TYPE: bool \| Literal["start", "end"] DEFAULT: True |
split_text
¶
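A hedged sketch continuing the class-level example above: splitting a small HTML string with the configured text_splitter; the HTML content is illustrative.

```python
html = """
<h1>Guide</h1>
<p>Read the <a href="https://example.com">docs</a> for details.</p>
<h2>Media</h2>
<img src="diagram.png" alt="Diagram" />
"""
documents = text_splitter.split_text(html)
for doc in documents:
    print(doc.metadata, doc.page_content[:60])
```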
RecursiveJsonSplitter
¶
Splits JSON data into smaller, structured chunks while preserving hierarchy.
This class provides methods to split JSON data into smaller dictionaries or JSON-formatted strings based on configurable maximum and minimum chunk sizes. It supports nested JSON structures, optionally converts lists into dictionaries for better chunking, and allows the creation of document objects for further use.
| METHOD | DESCRIPTION |
|---|---|
| __init__ | Initialize the chunk size configuration for text processing. |
| split_json | Splits JSON into a list of JSON chunks. |
| split_text | Splits JSON into a list of JSON formatted strings. |
| create_documents | Create a list of Document objects. |
max_chunk_size
class-attribute
instance-attribute
¶
max_chunk_size: int = max_chunk_size
The maximum size for each chunk.
min_chunk_size
class-attribute
instance-attribute
¶
min_chunk_size: int = (
min_chunk_size if min_chunk_size is not None else max(max_chunk_size - 200, 50)
)
The minimum size for each chunk, derived from max_chunk_size if not
explicitly provided.
__init__
¶
Initialize the chunk size configuration for text processing.
This constructor sets up the maximum and minimum chunk sizes, ensuring that
the min_chunk_size defaults to a value slightly smaller than the
max_chunk_size if not explicitly provided.
| PARAMETER | DESCRIPTION |
|---|---|
| max_chunk_size | The maximum size for a chunk. TYPE: int DEFAULT: 2000 |
| min_chunk_size | The minimum size for a chunk. If None, defaults to max(max_chunk_size - 200, 50). TYPE: int \| None DEFAULT: None |
split_json
¶
Splits JSON into a list of JSON chunks.
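A hedged sketch of split_json alongside the related split_text and create_documents helpers; the data and chunk size are illustrative.

```python
from langchain_text_splitters import RecursiveJsonSplitter

json_data = {
    "users": {
        "alice": {"role": "admin", "teams": ["core", "infra"]},
        "bob": {"role": "viewer", "teams": ["docs"]},
    }
}

splitter = RecursiveJsonSplitter(max_chunk_size=120)
json_chunks = splitter.split_json(json_data=json_data)  # list of dicts
text_chunks = splitter.split_text(json_data=json_data)  # list of JSON strings
docs = splitter.create_documents(texts=[json_data])     # list of Document objects
```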
JSFrameworkTextSplitter
¶
Bases: RecursiveCharacterTextSplitter
Text splitter that handles React (JSX), Vue, and Svelte code.
This splitter extends RecursiveCharacterTextSplitter to handle React (JSX), Vue, and Svelte code by:
- Detecting and extracting custom component tags from the text
- Using those tags as additional separators along with standard JS syntax
The splitter combines:
- Custom component tags as separators (e.g. <Component, <div)
- JavaScript syntax elements (function, const, if, etc)
- Standard text splitting on newlines
This allows chunks to break at natural boundaries in React, Vue, and Svelte component code.
| METHOD | DESCRIPTION |
|---|---|
| transform_documents | Transform sequence of documents by splitting them. |
| atransform_documents | Asynchronously transform a list of documents. |
| create_documents | Create a list of Document objects from a list of texts. |
| split_documents | Split documents. |
| from_huggingface_tokenizer | Text splitter that uses Hugging Face tokenizer to count length. |
| from_tiktoken_encoder | Text splitter that uses tiktoken encoder to count length. |
| from_language | Return an instance of this class based on a specific language. |
| get_separators_for_language | Retrieve a list of separators specific to the given language. |
| __init__ | Initialize the JS Framework text splitter. |
| split_text | Split text into chunks. |
transform_documents
¶
Transform sequence of documents by splitting them.
atransform_documents
async
¶
create_documents
¶
create_documents(
texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]
Create a list of Document objects from a list of texts.
from_huggingface_tokenizer
classmethod
¶
from_huggingface_tokenizer(
tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter
Text splitter that uses Hugging Face tokenizer to count length.
from_tiktoken_encoder
classmethod
¶
from_tiktoken_encoder(
encoding_name: str = "gpt2",
model_name: str | None = None,
allowed_special: Literal["all"] | Set[str] = set(),
disallowed_special: Literal["all"] | Collection[str] = "all",
**kwargs: Any,
) -> Self
Text splitter that uses tiktoken encoder to count length.
from_language
classmethod
¶
from_language(language: Language, **kwargs: Any) -> RecursiveCharacterTextSplitter
Return an instance of this class based on a specific language.
This method initializes the text splitter with language-specific separators.
| PARAMETER | DESCRIPTION |
|---|---|
| language | The language to configure the text splitter for. TYPE: Language |
| **kwargs | Additional keyword arguments to customize the splitter. TYPE: Any |
| RETURNS | DESCRIPTION |
|---|---|
| RecursiveCharacterTextSplitter | An instance of the text splitter configured for the specified language. |
get_separators_for_language
staticmethod
¶
__init__
¶
__init__(
separators: list[str] | None = None,
chunk_size: int = 2000,
chunk_overlap: int = 0,
**kwargs: Any,
) -> None
Initialize the JS Framework text splitter.
| PARAMETER | DESCRIPTION |
|---|---|
| separators | Optional list of custom separator strings to use. TYPE: list[str] \| None DEFAULT: None |
| chunk_size | Maximum size of chunks to return. TYPE: int DEFAULT: 2000 |
| chunk_overlap | Overlap in characters between chunks. TYPE: int DEFAULT: 0 |
| **kwargs | Additional arguments to pass to the parent class. TYPE: Any |
split_text
¶
Split text into chunks.
This method splits the text into chunks by:
- Extracting unique opening component tags using regex
- Creating separators list with extracted tags and JS separators
- Splitting the text using the separators by calling the parent class method
| PARAMETER | DESCRIPTION |
|---|---|
| text | String containing code to split. TYPE: str |
| RETURNS | DESCRIPTION |
|---|---|
| list[str] | List of text chunks split on component and JS boundaries. |
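A hedged sketch of splitting a small React component; the JSX snippet and chunk size are illustrative, and JSFrameworkTextSplitter is assumed to be importable from the package root.

```python
from langchain_text_splitters import JSFrameworkTextSplitter

jsx = """
import React from 'react';

function App() {
  return (
    <Layout>
      <Header title="Demo" />
      <Content>Hello</Content>
    </Layout>
  );
}

export default App;
"""
splitter = JSFrameworkTextSplitter(chunk_size=200, chunk_overlap=0)
chunks = splitter.split_text(jsx)
```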
KonlpyTextSplitter
¶
Bases: TextSplitter
Splitting text using the Konlpy package.
It is good for splitting Korean text.
| METHOD | DESCRIPTION |
|---|---|
| transform_documents | Transform sequence of documents by splitting them. |
| atransform_documents | Asynchronously transform a list of documents. |
| create_documents | Create a list of Document objects from a list of texts. |
| split_documents | Split documents. |
| from_huggingface_tokenizer | Text splitter that uses Hugging Face tokenizer to count length. |
| from_tiktoken_encoder | Text splitter that uses tiktoken encoder to count length. |
| __init__ | Initialize the Konlpy text splitter. |
| split_text | Split incoming text and return chunks. |
transform_documents
¶
Transform sequence of documents by splitting them.
atransform_documents
async
¶
create_documents
¶
create_documents(
texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]
Create a list of Document objects from a list of texts.
from_huggingface_tokenizer
classmethod
¶
from_huggingface_tokenizer(
tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter
Text splitter that uses Hugging Face tokenizer to count length.
from_tiktoken_encoder
classmethod
¶
from_tiktoken_encoder(
encoding_name: str = "gpt2",
model_name: str | None = None,
allowed_special: Literal["all"] | Set[str] = set(),
disallowed_special: Literal["all"] | Collection[str] = "all",
**kwargs: Any,
) -> Self
Text splitter that uses tiktoken encoder to count length.
__init__
¶
Initialize the Konlpy text splitter.
LatexTextSplitter
¶
Bases: RecursiveCharacterTextSplitter
Attempts to split the text along Latex-formatted layout elements.
| METHOD | DESCRIPTION |
|---|---|
| transform_documents | Transform sequence of documents by splitting them. |
| atransform_documents | Asynchronously transform a list of documents. |
| split_text | Split the input text into smaller chunks based on predefined separators. |
| create_documents | Create a list of Document objects from a list of texts. |
| split_documents | Split documents. |
| from_huggingface_tokenizer | Text splitter that uses Hugging Face tokenizer to count length. |
| from_tiktoken_encoder | Text splitter that uses tiktoken encoder to count length. |
| from_language | Return an instance of this class based on a specific language. |
| get_separators_for_language | Retrieve a list of separators specific to the given language. |
| __init__ | Initialize a LatexTextSplitter. |
transform_documents
¶
Transform sequence of documents by splitting them.
atransform_documents
async
¶
split_text
¶
create_documents
¶
create_documents(
texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]
Create a list of Document objects from a list of texts.
from_huggingface_tokenizer
classmethod
¶
from_huggingface_tokenizer(
tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter
Text splitter that uses Hugging Face tokenizer to count length.
from_tiktoken_encoder
classmethod
¶
from_tiktoken_encoder(
encoding_name: str = "gpt2",
model_name: str | None = None,
allowed_special: Literal["all"] | Set[str] = set(),
disallowed_special: Literal["all"] | Collection[str] = "all",
**kwargs: Any,
) -> Self
Text splitter that uses tiktoken encoder to count length.
from_language
classmethod
¶
from_language(language: Language, **kwargs: Any) -> RecursiveCharacterTextSplitter
Return an instance of this class based on a specific language.
This method initializes the text splitter with language-specific separators.
| PARAMETER | DESCRIPTION |
|---|---|
| language | The language to configure the text splitter for. TYPE: Language |
| **kwargs | Additional keyword arguments to customize the splitter. TYPE: Any |
| RETURNS | DESCRIPTION |
|---|---|
| RecursiveCharacterTextSplitter | An instance of the text splitter configured for the specified language. |
get_separators_for_language
staticmethod
¶
ExperimentalMarkdownSyntaxTextSplitter
¶
An experimental text splitter for handling Markdown syntax.
This splitter aims to retain the exact whitespace of the original text while extracting structured metadata, such as headers. It is a re-implementation of the MarkdownHeaderTextSplitter with notable changes to the approach and additional features.
Key Features:
- Retains the original whitespace and formatting of the Markdown text.
- Extracts headers, code blocks, and horizontal rules as metadata.
- Splits out code blocks and includes the language in the "Code" metadata key.
- Splits text on horizontal rules (---) as well.
- Defaults to sensible splitting behavior, which can be overridden using the headers_to_split_on parameter.
Example:
```python
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
splitter = ExperimentalMarkdownSyntaxTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = splitter.split_text(text)
for chunk in chunks:
    print(chunk)
```
This class is currently experimental and subject to change based on feedback and further development.
| METHOD | DESCRIPTION |
|---|---|
| __init__ | Initialize the text splitter with header splitting and formatting options. |
| split_text | Split the input text into structured chunks. |
__init__
¶
__init__(
headers_to_split_on: list[tuple[str, str]] | None = None,
return_each_line: bool = False,
strip_headers: bool = True,
) -> None
Initialize the text splitter with header splitting and formatting options.
This constructor sets up the required configuration for splitting text into chunks based on specified headers and formatting preferences.
| PARAMETER | DESCRIPTION |
|---|---|
| headers_to_split_on | A list of tuples, where each tuple contains a header tag (e.g., "h1") and its corresponding metadata key. If None, default headers are used. TYPE: list[tuple[str, str]] \| None DEFAULT: None |
| return_each_line | Whether to return each line as an individual chunk. Defaults to False, which aggregates lines into larger chunks. TYPE: bool DEFAULT: False |
| strip_headers | Whether to exclude headers from the resulting chunks. TYPE: bool DEFAULT: True |
split_text
¶
Split the input text into structured chunks.
This method processes the input text line by line, identifying patterns such as headers, code blocks, and horizontal rules, and splits the text into structured chunks at those boundaries.
| PARAMETER | DESCRIPTION |
|---|---|
| text | The input text to be split into chunks. TYPE: str |
| RETURNS | DESCRIPTION |
|---|---|
| list[Document] | A list of Document objects, each representing a distinct chunk of the input text. If return_each_line is enabled, each line is returned as a separate Document. |
MarkdownHeaderTextSplitter
¶
Splitting markdown files based on specified headers.
| METHOD | DESCRIPTION |
|---|---|
| __init__ | Create a new MarkdownHeaderTextSplitter. |
| aggregate_lines_to_chunks | Combine lines with common metadata into chunks. |
| split_text | Split markdown file. |
__init__
¶
__init__(
headers_to_split_on: list[tuple[str, str]],
return_each_line: bool = False,
strip_headers: bool = True,
custom_header_patterns: dict[str, int] | None = None,
) -> None
Create a new MarkdownHeaderTextSplitter.
| PARAMETER | DESCRIPTION |
|---|---|
| headers_to_split_on | Headers we want to track. TYPE: list[tuple[str, str]] |
| return_each_line | Return each line with its associated headers. TYPE: bool DEFAULT: False |
| strip_headers | Strip split headers from the content of the chunk. TYPE: bool DEFAULT: True |
| custom_header_patterns | Optional dict mapping header patterns to their levels, for example {"**": 1, "***": 2} to treat **Header** as level 1 and ***Header*** as level 2 headers. TYPE: dict[str, int] \| None DEFAULT: None |
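A brief, hedged sketch of MarkdownHeaderTextSplitter on a small document; the markdown content is illustrative.

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown = """# Title

Intro paragraph.

## Section A

Details for A.

## Section B

Details for B.
"""
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")],
    strip_headers=True,
)
docs = splitter.split_text(markdown)
# Each Document carries header metadata, e.g. {"Header 1": "Title", "Header 2": "Section A"}.
```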
aggregate_lines_to_chunks
¶
MarkdownTextSplitter
¶
Bases: RecursiveCharacterTextSplitter
Attempts to split the text along Markdown-formatted headings.
| METHOD | DESCRIPTION |
|---|---|
| transform_documents | Transform sequence of documents by splitting them. |
| atransform_documents | Asynchronously transform a list of documents. |
| split_text | Split the input text into smaller chunks based on predefined separators. |
| create_documents | Create a list of Document objects from a list of texts. |
| split_documents | Split documents. |
| from_huggingface_tokenizer | Text splitter that uses Hugging Face tokenizer to count length. |
| from_tiktoken_encoder | Text splitter that uses tiktoken encoder to count length. |
| from_language | Return an instance of this class based on a specific language. |
| get_separators_for_language | Retrieve a list of separators specific to the given language. |
| __init__ | Initialize a MarkdownTextSplitter. |
transform_documents
¶
Transform sequence of documents by splitting them.
atransform_documents
async
¶
split_text
¶
create_documents
¶
create_documents(
texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]
Create a list of Document objects from a list of texts.
from_huggingface_tokenizer
classmethod
¶
from_huggingface_tokenizer(
tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter
Text splitter that uses Hugging Face tokenizer to count length.
from_tiktoken_encoder
classmethod
¶
from_tiktoken_encoder(
encoding_name: str = "gpt2",
model_name: str | None = None,
allowed_special: Literal["all"] | Set[str] = set(),
disallowed_special: Literal["all"] | Collection[str] = "all",
**kwargs: Any,
) -> Self
Text splitter that uses tiktoken encoder to count length.
from_language
classmethod
¶
from_language(language: Language, **kwargs: Any) -> RecursiveCharacterTextSplitter
Return an instance of this class based on a specific language.
This method initializes the text splitter with language-specific separators.
| PARAMETER | DESCRIPTION |
|---|---|
| language | The language to configure the text splitter for. TYPE: Language |
| **kwargs | Additional keyword arguments to customize the splitter. TYPE: Any |
| RETURNS | DESCRIPTION |
|---|---|
| RecursiveCharacterTextSplitter | An instance of the text splitter configured for the specified language. |
get_separators_for_language
staticmethod
¶
NLTKTextSplitter
¶
Bases: TextSplitter
Splitting text using the NLTK package.
| METHOD | DESCRIPTION |
|---|---|
| transform_documents | Transform sequence of documents by splitting them. |
| atransform_documents | Asynchronously transform a list of documents. |
| create_documents | Create a list of Document objects from a list of texts. |
| split_documents | Split documents. |
| from_huggingface_tokenizer | Text splitter that uses Hugging Face tokenizer to count length. |
| from_tiktoken_encoder | Text splitter that uses tiktoken encoder to count length. |
| __init__ | Initialize the NLTK splitter. |
| split_text | Split incoming text and return chunks. |
transform_documents
¶
Transform sequence of documents by splitting them.
atransform_documents
async
¶
create_documents
¶
create_documents(
texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]
Create a list of Document objects from a list of texts.
from_huggingface_tokenizer
classmethod
¶
from_huggingface_tokenizer(
tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter
Text splitter that uses Hugging Face tokenizer to count length.
from_tiktoken_encoder
classmethod
¶
from_tiktoken_encoder(
encoding_name: str = "gpt2",
model_name: str | None = None,
allowed_special: Literal["all"] | Set[str] = set(),
disallowed_special: Literal["all"] | Collection[str] = "all",
**kwargs: Any,
) -> Self
Text splitter that uses tiktoken encoder to count length.
PythonCodeTextSplitter
¶
Bases: RecursiveCharacterTextSplitter
Attempts to split the text along Python syntax.
| METHOD | DESCRIPTION |
|---|---|
| transform_documents | Transform sequence of documents by splitting them. |
| atransform_documents | Asynchronously transform a list of documents. |
| split_text | Split the input text into smaller chunks based on predefined separators. |
| create_documents | Create a list of Document objects from a list of texts. |
| split_documents | Split documents. |
| from_huggingface_tokenizer | Text splitter that uses Hugging Face tokenizer to count length. |
| from_tiktoken_encoder | Text splitter that uses tiktoken encoder to count length. |
| from_language | Return an instance of this class based on a specific language. |
| get_separators_for_language | Retrieve a list of separators specific to the given language. |
| __init__ | Initialize a PythonCodeTextSplitter. |
transform_documents
¶
Transform sequence of documents by splitting them.
atransform_documents
async
¶
split_text
¶
create_documents
¶
create_documents(
texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]
Create a list of Document objects from a list of texts.
from_huggingface_tokenizer
classmethod
¶
from_huggingface_tokenizer(
tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter
Text splitter that uses Hugging Face tokenizer to count length.
from_tiktoken_encoder
classmethod
¶
from_tiktoken_encoder(
encoding_name: str = "gpt2",
model_name: str | None = None,
allowed_special: Literal["all"] | Set[str] = set(),
disallowed_special: Literal["all"] | Collection[str] = "all",
**kwargs: Any,
) -> Self
Text splitter that uses tiktoken encoder to count length.
from_language
classmethod
¶
from_language(language: Language, **kwargs: Any) -> RecursiveCharacterTextSplitter
Return an instance of this class based on a specific language.
This method initializes the text splitter with language-specific separators.
| PARAMETER | DESCRIPTION |
|---|---|
| language | The language to configure the text splitter for. TYPE: Language |
| **kwargs | Additional keyword arguments to customize the splitter. TYPE: Any |
| RETURNS | DESCRIPTION |
|---|---|
| RecursiveCharacterTextSplitter | An instance of the text splitter configured for the specified language. |
get_separators_for_language
staticmethod
¶
SentenceTransformersTokenTextSplitter
¶
Bases: TextSplitter
Splitting text to tokens using sentence model tokenizer.
| METHOD | DESCRIPTION |
|---|---|
| transform_documents | Transform sequence of documents by splitting them. |
| atransform_documents | Asynchronously transform a list of documents. |
| create_documents | Create a list of Document objects from a list of texts. |
| split_documents | Split documents. |
| from_huggingface_tokenizer | Text splitter that uses Hugging Face tokenizer to count length. |
| from_tiktoken_encoder | Text splitter that uses tiktoken encoder to count length. |
| __init__ | Create a new TextSplitter. |
| split_text | Splits the input text into smaller components by splitting text on tokens. |
| count_tokens | Counts the number of tokens in the given text. |
transform_documents
¶
Transform sequence of documents by splitting them.
atransform_documents
async
¶
create_documents
¶
create_documents(
texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]
Create a list of Document objects from a list of texts.
from_huggingface_tokenizer
classmethod
¶
from_huggingface_tokenizer(
tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter
Text splitter that uses Hugging Face tokenizer to count length.
from_tiktoken_encoder
classmethod
¶
from_tiktoken_encoder(
encoding_name: str = "gpt2",
model_name: str | None = None,
allowed_special: Literal["all"] | Set[str] = set(),
disallowed_special: Literal["all"] | Collection[str] = "all",
**kwargs: Any,
) -> Self
Text splitter that uses tiktoken encoder to count length.
__init__
¶
__init__(
chunk_overlap: int = 50,
model_name: str = "sentence-transformers/all-mpnet-base-v2",
tokens_per_chunk: int | None = None,
**kwargs: Any,
) -> None
Create a new TextSplitter.
split_text
¶
Splits the input text into smaller components by splitting text on tokens.
This method encodes the input text using a private _encode method, then
strips the start and stop token IDs from the encoded result. It returns the
processed segments as a list of strings.
| PARAMETER | DESCRIPTION |
|---|---|
| text | The input text to be split. TYPE: str |
| RETURNS | DESCRIPTION |
|---|---|
| list[str] | A list of string components derived from the input text after encoding and processing. |
count_tokens
¶
Counts the number of tokens in the given text.
This method encodes the input text using a private _encode method and
calculates the total number of tokens in the encoded result.
| PARAMETER | DESCRIPTION |
|---|---|
| text | The input text for which the token count is calculated. TYPE: str |
| RETURNS | DESCRIPTION |
|---|---|
| int | The number of tokens in the encoded text. |
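A hedged sketch of split_text and count_tokens; it assumes the sentence-transformers package is installed (the named model is downloaded on first use), and the sizes are illustrative.

```python
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",
    tokens_per_chunk=128,
    chunk_overlap=16,
)
text = "A long passage to be split by the sentence-transformer tokenizer ..."
print(splitter.count_tokens(text=text))
chunks = splitter.split_text(text)
```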
SpacyTextSplitter
¶
Bases: TextSplitter
Splitting text using the spaCy package.
By default, spaCy's en_core_web_sm model is used, with a default max_length of 1,000,000 characters (the maximum input length the model accepts, which can be increased for large files). For faster but potentially less accurate splitting, you can use pipeline='sentencizer'.
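A hedged sketch using the faster sentencizer pipeline mentioned above; it assumes the spacy package is installed (the default pipeline would additionally require the en_core_web_sm model), and the sizes are illustrative.

```python
from langchain_text_splitters import SpacyTextSplitter

splitter = SpacyTextSplitter(
    pipeline="sentencizer",  # rule-based sentence segmentation, no model download
    chunk_size=1000,
    chunk_overlap=0,
)
chunks = splitter.split_text("First sentence. Second sentence. Third sentence.")
```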
| METHOD | DESCRIPTION |
|---|---|
| transform_documents | Transform sequence of documents by splitting them. |
| atransform_documents | Asynchronously transform a list of documents. |
| create_documents | Create a list of Document objects from a list of texts. |
| split_documents | Split documents. |
| from_huggingface_tokenizer | Text splitter that uses Hugging Face tokenizer to count length. |
| from_tiktoken_encoder | Text splitter that uses tiktoken encoder to count length. |
| __init__ | Initialize the spacy text splitter. |
| split_text | Split incoming text and return chunks. |
transform_documents
¶
Transform sequence of documents by splitting them.
atransform_documents
async
¶
create_documents
¶
create_documents(
texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]
Create a list of Document objects from a list of texts.
from_huggingface_tokenizer
classmethod
¶
from_huggingface_tokenizer(
tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter
Text splitter that uses Hugging Face tokenizer to count length.
from_tiktoken_encoder
classmethod
¶
from_tiktoken_encoder(
encoding_name: str = "gpt2",
model_name: str | None = None,
allowed_special: Literal["all"] | Set[str] = set(),
disallowed_special: Literal["all"] | Collection[str] = "all",
**kwargs: Any,
) -> Self
Text splitter that uses tiktoken encoder to count length.