langchain-text-splitters

Reference documentation for the langchain-text-splitters package.

langchain_text_splitters

Text Splitters are classes for splitting text.

Note

MarkdownHeaderTextSplitter and HTMLHeaderTextSplitter do not derive from TextSplitter.

FUNCTION DESCRIPTION
split_text_on_tokens

Split incoming text and return chunks using tokenizer.

Language

Bases: str, Enum

Enum of supported programming languages.

TextSplitter

Bases: BaseDocumentTransformer, ABC

Interface for splitting text into chunks.

METHOD DESCRIPTION
atransform_documents

Asynchronously transform a list of documents.

__init__

Create a new TextSplitter.

split_text

Split text into multiple components.

create_documents

Create a list of Document objects from a list of texts.

split_documents

Split documents.

from_huggingface_tokenizer

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder

Text splitter that uses tiktoken encoder to count length.

transform_documents

Transform sequence of documents by splitting them.

atransform_documents async

atransform_documents(
    documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]

Asynchronously transform a list of documents.

PARAMETER DESCRIPTION
documents

A sequence of Document objects to be transformed.

TYPE: Sequence[Document]

RETURNS DESCRIPTION
Sequence[Document]

A sequence of transformed Document objects.

__init__

__init__(
    chunk_size: int = 4000,
    chunk_overlap: int = 200,
    length_function: Callable[[str], int] = len,
    keep_separator: bool | Literal["start", "end"] = False,
    add_start_index: bool = False,
    strip_whitespace: bool = True,
) -> None

Create a new TextSplitter.

PARAMETER DESCRIPTION
chunk_size

Maximum size of chunks to return

TYPE: int DEFAULT: 4000

chunk_overlap

Overlap in characters between chunks

TYPE: int DEFAULT: 200

length_function

Function that measures the length of given chunks

TYPE: Callable[[str], int] DEFAULT: len

keep_separator

Whether to keep the separator and where to place it in each corresponding chunk (True='start')

TYPE: bool | Literal['start', 'end'] DEFAULT: False

add_start_index

If True, includes chunk's start index in metadata

TYPE: bool DEFAULT: False

strip_whitespace

If True, strips whitespace from the start and end of every document

TYPE: bool DEFAULT: True
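
Since TextSplitter is abstract, a minimal sketch of a custom subclass may help; the SentenceSplitter class and its naive period-based rule below are illustrative, not part of the package (the inherited private _merge_splits helper is used the same way by the package's own subclasses):

from langchain_text_splitters import TextSplitter

class SentenceSplitter(TextSplitter):
    """Hypothetical splitter that breaks text on '. ' boundaries."""

    def split_text(self, text: str) -> list[str]:
        # Split naively on sentence boundaries, then let the inherited
        # _merge_splits helper pack the pieces into chunks that respect
        # chunk_size and chunk_overlap.
        return self._merge_splits(text.split(". "), ". ")

splitter = SentenceSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text("First sentence. Second sentence. Third one.")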

split_text abstractmethod

split_text(text: str) -> list[str]

Split text into multiple components.

create_documents

create_documents(
    texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]

Create a list of Document objects from a list of texts.

split_documents

split_documents(documents: Iterable[Document]) -> list[Document]

Split documents.

from_huggingface_tokenizer classmethod

from_huggingface_tokenizer(
    tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter

Text splitter that uses Hugging Face tokenizer to count length.
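
A brief usage sketch (assumes the transformers package is installed; chunk lengths are then counted with the Hugging Face tokenizer):

from transformers import GPT2TokenizerFast
from langchain_text_splitters import CharacterTextSplitter

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)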

from_tiktoken_encoder classmethod

from_tiktoken_encoder(
    encoding_name: str = "gpt2",
    model_name: str | None = None,
    allowed_special: Literal["all"] | Set[str] = set(),
    disallowed_special: Literal["all"] | Collection[str] = "all",
    **kwargs: Any,
) -> Self

Text splitter that uses tiktoken encoder to count length.
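
A brief usage sketch (assumes the tiktoken package is installed; chunk_size is then measured in tokens rather than characters):

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # or pass model_name="gpt-4"
    chunk_size=100,
    chunk_overlap=0,
)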

transform_documents

transform_documents(documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]

Transform sequence of documents by splitting them.

Tokenizer dataclass

Tokenizer data class.

chunk_overlap instance-attribute

chunk_overlap: int

Overlap in tokens between chunks

tokens_per_chunk instance-attribute

tokens_per_chunk: int

Maximum number of tokens per chunk

decode instance-attribute

decode: Callable[[list[int]], str]

Function to decode a list of token ids to a string

encode instance-attribute

encode: Callable[[str], list[int]]

Function to encode a string to a list of token ids
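
A sketch of wiring the dataclass to a tiktoken encoding and handing it to split_text_on_tokens (tiktoken and sample_text are assumptions, not package requirements):

import tiktoken

from langchain_text_splitters import Tokenizer, split_text_on_tokens

enc = tiktoken.get_encoding("cl100k_base")
tokenizer = Tokenizer(
    chunk_overlap=10,
    tokens_per_chunk=100,
    decode=enc.decode,               # list[int] -> str
    encode=lambda t: enc.encode(t),  # str -> list[int]
)
chunks = split_text_on_tokens(text=sample_text, tokenizer=tokenizer)  # sample_text is a placeholder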

TokenTextSplitter

Bases: TextSplitter

Splitting text into tokens using a model tokenizer.

METHOD DESCRIPTION
transform_documents

Transform sequence of documents by splitting them.

atransform_documents

Asynchronously transform a list of documents.

create_documents

Create a list of Document objects from a list of texts.

split_documents

Split documents.

from_huggingface_tokenizer

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder

Text splitter that uses tiktoken encoder to count length.

__init__

Create a new TextSplitter.

split_text

Splits the input text into smaller chunks based on tokenization.

transform_documents

transform_documents(documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]

Transform sequence of documents by splitting them.

atransform_documents async

atransform_documents(
    documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]

Asynchronously transform a list of documents.

PARAMETER DESCRIPTION
documents

A sequence of Document objects to be transformed.

TYPE: Sequence[Document]

RETURNS DESCRIPTION
Sequence[Document]

A sequence of transformed Document objects.

create_documents

create_documents(
    texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]

Create a list of Document objects from a list of texts.

split_documents

split_documents(documents: Iterable[Document]) -> list[Document]

Split documents.

from_huggingface_tokenizer classmethod

from_huggingface_tokenizer(
    tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder classmethod

from_tiktoken_encoder(
    encoding_name: str = "gpt2",
    model_name: str | None = None,
    allowed_special: Literal["all"] | Set[str] = set(),
    disallowed_special: Literal["all"] | Collection[str] = "all",
    **kwargs: Any,
) -> Self

Text splitter that uses tiktoken encoder to count length.

__init__

__init__(
    encoding_name: str = "gpt2",
    model_name: str | None = None,
    allowed_special: Literal["all"] | Set[str] = set(),
    disallowed_special: Literal["all"] | Collection[str] = "all",
    **kwargs: Any,
) -> None

Create a new TextSplitter.

split_text

split_text(text: str) -> list[str]

Splits the input text into smaller chunks based on tokenization.

This method uses a custom tokenizer configuration to encode the input text into tokens, processes the tokens in chunks of a specified size with overlap, and decodes them back into text chunks. The splitting is performed using the split_text_on_tokens function.

PARAMETER DESCRIPTION
text

The input text to be split into smaller chunks.

TYPE: str

RETURNS DESCRIPTION
list[str]

A list of text chunks, where each chunk is derived from a portion of the input text based on the tokenization and chunking rules.
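
A brief usage sketch (chunk_size and chunk_overlap pass through to the TextSplitter base class and are counted in tokens here; long_text is a placeholder):

from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=10
)
chunks = splitter.split_text(long_text)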

CharacterTextSplitter

Bases: TextSplitter

Splitting text by looking at characters.

METHOD DESCRIPTION
transform_documents

Transform sequence of documents by splitting them.

atransform_documents

Asynchronously transform a list of documents.

create_documents

Create a list of Document objects from a list of texts.

split_documents

Split documents.

from_huggingface_tokenizer

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder

Text splitter that uses tiktoken encoder to count length.

__init__

Create a new TextSplitter.

split_text

Split into chunks without re-inserting lookaround separators.

transform_documents

transform_documents(documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]

Transform sequence of documents by splitting them.

atransform_documents async

atransform_documents(
    documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]

Asynchronously transform a list of documents.

PARAMETER DESCRIPTION
documents

A sequence of Document objects to be transformed.

TYPE: Sequence[Document]

RETURNS DESCRIPTION
Sequence[Document]

A sequence of transformed Document objects.

create_documents

create_documents(
    texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]

Create a list of Document objects from a list of texts.

split_documents

split_documents(documents: Iterable[Document]) -> list[Document]

Split documents.

from_huggingface_tokenizer classmethod

from_huggingface_tokenizer(
    tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder classmethod

from_tiktoken_encoder(
    encoding_name: str = "gpt2",
    model_name: str | None = None,
    allowed_special: Literal["all"] | Set[str] = set(),
    disallowed_special: Literal["all"] | Collection[str] = "all",
    **kwargs: Any,
) -> Self

Text splitter that uses tiktoken encoder to count length.

__init__

__init__(
    separator: str = "\n\n", is_separator_regex: bool = False, **kwargs: Any
) -> None

Create a new TextSplitter.

split_text

split_text(text: str) -> list[str]

Split into chunks without re-inserting lookaround separators.
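
A brief usage sketch (document_text is a placeholder for your own input):

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
)
docs = splitter.create_documents([document_text])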

RecursiveCharacterTextSplitter

Bases: TextSplitter

Splitting text by recursively looking at characters.

Recursively tries to split by different characters to find one that works.

METHOD DESCRIPTION
transform_documents

Transform sequence of documents by splitting them.

atransform_documents

Asynchronously transform a list of documents.

create_documents

Create a list of Document objects from a list of texts.

split_documents

Split documents.

from_huggingface_tokenizer

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder

Text splitter that uses tiktoken encoder to count length.

__init__

Create a new TextSplitter.

split_text

Split the input text into smaller chunks based on predefined separators.

from_language

Return an instance of this class based on a specific language.

get_separators_for_language

Retrieve a list of separators specific to the given language.

transform_documents

transform_documents(documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]

Transform sequence of documents by splitting them.

atransform_documents async

atransform_documents(
    documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]

Asynchronously transform a list of documents.

PARAMETER DESCRIPTION
documents

A sequence of Document objects to be transformed.

TYPE: Sequence[Document]

RETURNS DESCRIPTION
Sequence[Document]

A sequence of transformed Document objects.

create_documents

create_documents(
    texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]

Create a list of Document objects from a list of texts.

split_documents

split_documents(documents: Iterable[Document]) -> list[Document]

Split documents.

from_huggingface_tokenizer classmethod

from_huggingface_tokenizer(
    tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder classmethod

from_tiktoken_encoder(
    encoding_name: str = "gpt2",
    model_name: str | None = None,
    allowed_special: Literal["all"] | Set[str] = set(),
    disallowed_special: Literal["all"] | Collection[str] = "all",
    **kwargs: Any,
) -> Self

Text splitter that uses tiktoken encoder to count length.

__init__

__init__(
    separators: list[str] | None = None,
    keep_separator: bool | Literal["start", "end"] = True,
    is_separator_regex: bool = False,
    **kwargs: Any,
) -> None

Create a new TextSplitter.

split_text

split_text(text: str) -> list[str]

Split the input text into smaller chunks based on predefined separators.

PARAMETER DESCRIPTION
text

The input text to be split.

TYPE: str

RETURNS DESCRIPTION
list[str]

A list of text chunks obtained after splitting.
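
A brief usage sketch; with no separators argument the splitter tries "\n\n", "\n", " ", and "" in order (long_text is a placeholder):

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text(long_text)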

from_language classmethod

from_language(language: Language, **kwargs: Any) -> RecursiveCharacterTextSplitter

Return an instance of this class based on a specific language.

This method initializes the text splitter with language-specific separators.

PARAMETER DESCRIPTION
language

The language to configure the text splitter for.

TYPE: Language

**kwargs

Additional keyword arguments to customize the splitter.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
RecursiveCharacterTextSplitter

An instance of the text splitter configured for the specified language.
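
A brief usage sketch for language-aware splitting of source code (python_source is a placeholder):

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=60, chunk_overlap=0
)
docs = python_splitter.create_documents([python_source])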

get_separators_for_language staticmethod

get_separators_for_language(language: Language) -> list[str]

Retrieve a list of separators specific to the given language.

PARAMETER DESCRIPTION
language

The language for which to get the separators.

TYPE: Language

RETURNS DESCRIPTION
list[str]

A list of separators appropriate for the specified language.

ElementType

Bases: TypedDict

Element type as typed dict.

HTMLHeaderTextSplitter

Split HTML content into structured Documents based on specified headers.

Splits HTML content by detecting specified header tags and creating hierarchical Document objects that reflect the semantic structure of the original content. For each identified section, the splitter associates the extracted text with metadata corresponding to the encountered headers.

If no specified headers are found, the entire content is returned as a single Document. This allows for flexible handling of HTML input, ensuring that information is organized according to its semantic headers.

The splitter provides the option to return each HTML element as a separate Document or aggregate them into semantically meaningful chunks. It also gracefully handles multiple levels of nested headers, creating a rich, hierarchical representation of the content.

Example
from langchain_text_splitters.html_header_text_splitter import (
    HTMLHeaderTextSplitter,
)

# Define headers for splitting on h1 and h2 tags.
headers_to_split_on = [("h1", "Main Topic"), ("h2", "Sub Topic")]

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    return_each_element=False
)

html_content = """
<html>
    <body>
        <h1>Introduction</h1>
        <p>Welcome to the introduction section.</p>
        <h2>Background</h2>
        <p>Some background details here.</p>
        <h1>Conclusion</h1>
        <p>Final thoughts.</p>
    </body>
</html>
"""

documents = splitter.split_text(html_content)

# 'documents' now contains Document objects reflecting the hierarchy:
# - Document with metadata={"Main Topic": "Introduction"} and
#   content="Introduction"
# - Document with metadata={"Main Topic": "Introduction"} and
#   content="Welcome to the introduction section."
# - Document with metadata={"Main Topic": "Introduction",
#   "Sub Topic": "Background"} and content="Background"
# - Document with metadata={"Main Topic": "Introduction",
#   "Sub Topic": "Background"} and content="Some background details here."
# - Document with metadata={"Main Topic": "Conclusion"} and
#   content="Conclusion"
# - Document with metadata={"Main Topic": "Conclusion"} and
#   content="Final thoughts."
METHOD DESCRIPTION
__init__

Initialize with headers to split on.

split_text

Split the given text into a list of Document objects.

split_text_from_url

Fetch text content from a URL and split it into documents.

split_text_from_file

Split HTML content from a file into a list of Document objects.

__init__

__init__(
    headers_to_split_on: list[tuple[str, str]], return_each_element: bool = False
) -> None

Initialize with headers to split on.

PARAMETER DESCRIPTION
headers_to_split_on

A list of (header_tag, header_name) pairs representing the headers that define splitting boundaries. For example, [("h1", "Header 1"), ("h2", "Header 2")] will split content by h1 and h2 tags, assigning their textual content to the Document metadata.

TYPE: list[tuple[str, str]]

return_each_element

If True, every HTML element encountered (including headers, paragraphs, etc.) is returned as a separate Document. If False, content under the same header hierarchy is aggregated into fewer Document objects.

TYPE: bool DEFAULT: False

split_text

split_text(text: str) -> list[Document]

Split the given text into a list of Document objects.

PARAMETER DESCRIPTION
text

The HTML text to split.

TYPE: str

RETURNS DESCRIPTION
list[Document]

A list of split Document objects. Each Document contains page_content holding the extracted text and metadata that maps the header hierarchy to their corresponding titles.

split_text_from_url

split_text_from_url(url: str, timeout: int = 10, **kwargs: Any) -> list[Document]

Fetch text content from a URL and split it into documents.

PARAMETER DESCRIPTION
url

The URL to fetch content from.

TYPE: str

timeout

Timeout for the request.

TYPE: int DEFAULT: 10

**kwargs

Additional keyword arguments for the request.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
list[Document]

A list of split Document objects. Each Document contains page_content holding the extracted text and metadata that maps the header hierarchy to their corresponding titles.

RAISES DESCRIPTION
RequestException

If the HTTP request fails.
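
A one-line sketch, reusing the splitter from the class example above (the URL is a placeholder):

docs = splitter.split_text_from_url("https://example.com/page.html", timeout=10)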

split_text_from_file

split_text_from_file(file: str | IO[str]) -> list[Document]

Split HTML content from a file into a list of Document objects.

PARAMETER DESCRIPTION
file

A file path or a file-like object containing HTML content.

TYPE: str | IO[str]

RETURNS DESCRIPTION
list[Document]

A list of split Document objects. Each Document contains page_content holding the extracted text and metadata that maps the header hierarchy to their corresponding titles.

HTMLSectionSplitter

Splitting HTML files based on specified tag and font sizes.

Requires lxml package.

METHOD DESCRIPTION
__init__

Create a new HTMLSectionSplitter.

split_documents

Split documents.

split_text

Split HTML text string.

create_documents

Create a list of Document objects from a list of texts.

split_html_by_headers

Split an HTML document into sections based on specified header tags.

convert_possible_tags_to_header

Convert specific HTML tags to headers using an XSLT transformation.

split_text_from_file

Split HTML content from a file into a list of Document objects.

__init__

__init__(headers_to_split_on: list[tuple[str, str]], **kwargs: Any) -> None

Create a new HTMLSectionSplitter.

PARAMETER DESCRIPTION
headers_to_split_on

List of tuples of headers we want to track, mapped to (arbitrary) keys for metadata. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].

TYPE: list[tuple[str, str]]

**kwargs

Additional optional arguments for customizations.

TYPE: Any DEFAULT: {}
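
A brief usage sketch (html_string is a placeholder for your own HTML):

from langchain_text_splitters import HTMLSectionSplitter

splitter = HTMLSectionSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")]
)
docs = splitter.split_text(html_string)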

split_documents

split_documents(documents: Iterable[Document]) -> list[Document]

Split documents.

split_text

split_text(text: str) -> list[Document]

Split HTML text string.

PARAMETER DESCRIPTION
text

HTML text

TYPE: str

create_documents

create_documents(
    texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]

Create a list of Document objects from a list of texts.

split_html_by_headers

split_html_by_headers(html_doc: str) -> list[dict[str, str | None]]

Split an HTML document into sections based on specified header tags.

This method uses BeautifulSoup to parse the HTML content and divides it into sections based on headers defined in headers_to_split_on. Each section contains the header text, content under the header, and the tag name.

PARAMETER DESCRIPTION
html_doc

The HTML document to be split into sections.

TYPE: str

RETURNS DESCRIPTION
list[dict[str, str | None]]

A list of dictionaries representing sections. Each dictionary contains:

  • 'header': The header text or a default title for the first section.
  • 'content': The content under the header.
  • 'tag_name': The name of the header tag (e.g., h1, h2).

convert_possible_tags_to_header

convert_possible_tags_to_header(html_content: str) -> str

Convert specific HTML tags to headers using an XSLT transformation.

This method uses an XSLT file to transform the HTML content, converting certain tags into headers for easier parsing. If no XSLT path is provided, the HTML content is returned unchanged.

PARAMETER DESCRIPTION
html_content

The HTML content to be transformed.

TYPE: str

RETURNS DESCRIPTION
str

The transformed HTML content as a string.

split_text_from_file

split_text_from_file(file: StringIO) -> list[Document]

Split HTML content from a file into a list of Document objects.

PARAMETER DESCRIPTION
file

A file path or a file-like object containing HTML content.

TYPE: StringIO

RETURNS DESCRIPTION
list[Document]

A list of split Document objects.

HTMLSemanticPreservingSplitter

Bases: BaseDocumentTransformer

Split HTML content preserving semantic structure.

Splits HTML content by headers into generalized chunks, preserving semantic structure. If chunks exceed the maximum chunk size, it uses RecursiveCharacterTextSplitter for further splitting.

The splitter preserves full HTML elements and converts links to Markdown-like links. It can also preserve images, videos, and audio elements by converting them into Markdown format. Note that some chunks may exceed the maximum size to maintain semantic integrity.

Added in version 0.3.5

Example
from langchain_text_splitters.html import HTMLSemanticPreservingSplitter

def custom_iframe_extractor(iframe_tag):
    """
    Custom handler function to extract the 'src' attribute from an <iframe> tag.
    Converts the iframe to a Markdown-like link: [iframe:<src>](src).

    Args:
        iframe_tag (bs4.element.Tag): The <iframe> tag to be processed.

    Returns:
        str: A formatted string representing the iframe in Markdown-like format.
    """
    iframe_src = iframe_tag.get('src', '')
    return f"[iframe:{iframe_src}]({iframe_src})"

text_splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
    max_chunk_size=500,
    preserve_links=True,
    preserve_images=True,
    custom_handlers={"iframe": custom_iframe_extractor}
)
METHOD DESCRIPTION
atransform_documents

Asynchronously transform a list of documents.

__init__

Initialize splitter.

split_text

Splits the provided HTML text into smaller chunks based on the configuration.

transform_documents

Transform sequence of documents by splitting them.

atransform_documents async

atransform_documents(
    documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]

Asynchronously transform a list of documents.

PARAMETER DESCRIPTION
documents

A sequence of Document objects to be transformed.

TYPE: Sequence[Document]

RETURNS DESCRIPTION
Sequence[Document]

A sequence of transformed Document objects.

__init__

__init__(
    headers_to_split_on: list[tuple[str, str]],
    *,
    max_chunk_size: int = 1000,
    chunk_overlap: int = 0,
    separators: list[str] | None = None,
    elements_to_preserve: list[str] | None = None,
    preserve_links: bool = False,
    preserve_images: bool = False,
    preserve_videos: bool = False,
    preserve_audio: bool = False,
    custom_handlers: dict[str, Callable[[Tag], str]] | None = None,
    stopword_removal: bool = False,
    stopword_lang: str = "english",
    normalize_text: bool = False,
    external_metadata: dict[str, str] | None = None,
    allowlist_tags: list[str] | None = None,
    denylist_tags: list[str] | None = None,
    preserve_parent_metadata: bool = False,
    keep_separator: bool | Literal["start", "end"] = True,
) -> None

Initialize splitter.

PARAMETER DESCRIPTION
headers_to_split_on

HTML headers (e.g., h1, h2) that define content sections.

TYPE: list[tuple[str, str]]

max_chunk_size

Maximum size for each chunk, with allowance for exceeding this limit to preserve semantics.

TYPE: int DEFAULT: 1000

chunk_overlap

Number of characters to overlap between chunks to ensure contextual continuity.

TYPE: int DEFAULT: 0

separators

Delimiters used by RecursiveCharacterTextSplitter for further splitting.

TYPE: list[str] | None DEFAULT: None

elements_to_preserve

HTML tags (e.g., table, ul) to remain intact during splitting.

TYPE: list[str] | None DEFAULT: None

preserve_links

Converts a tags to Markdown links ([text](url)).

TYPE: bool DEFAULT: False

preserve_images

Converts img tags to Markdown images (![alt](src)).

TYPE: bool DEFAULT: False

preserve_videos

Converts video tags to Markdown video links (![video](src)).

TYPE: bool DEFAULT: False

preserve_audio

Converts audio tags to Markdown audio links (![audio](src)).

TYPE: bool DEFAULT: False

custom_handlers

Optional custom handlers for specific HTML tags, allowing tailored extraction or processing.

TYPE: dict[str, Callable[[Tag], str]] | None DEFAULT: None

stopword_removal

Optionally remove stopwords from the text.

TYPE: bool DEFAULT: False

stopword_lang

The language of stopwords to remove.

TYPE: str DEFAULT: 'english'

normalize_text

Optionally normalize text (e.g., lowercasing, removing punctuation).

TYPE: bool DEFAULT: False

external_metadata

Additional metadata to attach to the Document objects.

TYPE: dict[str, str] | None DEFAULT: None

allowlist_tags

Only these tags will be retained in the HTML.

TYPE: list[str] | None DEFAULT: None

denylist_tags

These tags will be removed from the HTML.

TYPE: list[str] | None DEFAULT: None

preserve_parent_metadata

Whether to pass through parent document metadata to split documents when calling transform_documents/atransform_documents().

TYPE: bool DEFAULT: False

keep_separator

Whether separators should be at the beginning of a chunk, at the end, or not at all.

TYPE: bool | Literal['start', 'end'] DEFAULT: True

split_text

split_text(text: str) -> list[Document]

Splits the provided HTML text into smaller chunks based on the configuration.

PARAMETER DESCRIPTION
text

The HTML content to be split.

TYPE: str

RETURNS DESCRIPTION
list[Document]

A list of Document objects containing the split content.
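
Continuing the class example above, a one-line usage sketch (html_content is a placeholder):

documents = text_splitter.split_text(html_content)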

transform_documents

transform_documents(documents: Sequence[Document], **kwargs: Any) -> list[Document]

Transform sequence of documents by splitting them.

RecursiveJsonSplitter

Splits JSON data into smaller, structured chunks while preserving hierarchy.

This class provides methods to split JSON data into smaller dictionaries or JSON-formatted strings based on configurable maximum and minimum chunk sizes. It supports nested JSON structures, optionally converts lists into dictionaries for better chunking, and allows the creation of document objects for further use.

METHOD DESCRIPTION
__init__

Initialize the chunk size configuration for text processing.

split_json

Splits JSON into a list of JSON chunks.

split_text

Splits JSON into a list of JSON formatted strings.

create_documents

Create a list of Document objects from a list of json objects (dict).

max_chunk_size class-attribute instance-attribute

max_chunk_size: int = max_chunk_size

The maximum size for each chunk.

min_chunk_size class-attribute instance-attribute

min_chunk_size: int = (
    min_chunk_size if min_chunk_size is not None else max(max_chunk_size - 200, 50)
)

The minimum size for each chunk, derived from max_chunk_size if not explicitly provided.

__init__

__init__(max_chunk_size: int = 2000, min_chunk_size: int | None = None) -> None

Initialize the chunk size configuration for text processing.

This constructor sets up the maximum and minimum chunk sizes, ensuring that the min_chunk_size defaults to a value slightly smaller than the max_chunk_size if not explicitly provided.

PARAMETER DESCRIPTION
max_chunk_size

The maximum size for a chunk.

TYPE: int DEFAULT: 2000

min_chunk_size

The minimum size for a chunk. If None, defaults to the maximum chunk size minus 200, with a lower bound of 50.

TYPE: int | None DEFAULT: None

split_json

split_json(
    json_data: dict[str, Any], convert_lists: bool = False
) -> list[dict[str, Any]]

Splits JSON into a list of JSON chunks.

split_text

split_text(
    json_data: dict[str, Any], convert_lists: bool = False, ensure_ascii: bool = True
) -> list[str]

Splits JSON into a list of JSON formatted strings.

create_documents

create_documents(
    texts: list[dict[str, Any]],
    convert_lists: bool = False,
    ensure_ascii: bool = True,
    metadatas: list[dict[Any, Any]] | None = None,
) -> list[Document]

Create a list of Document objects from a list of json objects (dict).
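
A brief usage sketch showing both dictionary chunks and Document creation:

from langchain_text_splitters import RecursiveJsonSplitter

splitter = RecursiveJsonSplitter(max_chunk_size=300)
json_chunks = splitter.split_json(json_data={"a": {"b": 1, "c": [1, 2, 3]}})
docs = splitter.create_documents(texts=[{"a": {"b": 1, "c": [1, 2, 3]}}])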

JSFrameworkTextSplitter

Bases: RecursiveCharacterTextSplitter

Text splitter that handles React (JSX), Vue, and Svelte code.

This splitter extends RecursiveCharacterTextSplitter to handle React (JSX), Vue, and Svelte code by:

  1. Detecting and extracting custom component tags from the text
  2. Using those tags as additional separators along with standard JS syntax

The splitter combines:

  • Custom component tags as separators (e.g. <Component, <div)
  • JavaScript syntax elements (function, const, if, etc)
  • Standard text splitting on newlines

This allows chunks to break at natural boundaries in React, Vue, and Svelte component code.
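
A brief usage sketch (assuming the top-level export; jsx_source is a placeholder for component code):

from langchain_text_splitters import JSFrameworkTextSplitter

splitter = JSFrameworkTextSplitter(chunk_size=2000, chunk_overlap=0)
chunks = splitter.split_text(jsx_source)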

METHOD DESCRIPTION
transform_documents

Transform sequence of documents by splitting them.

atransform_documents

Asynchronously transform a list of documents.

create_documents

Create a list of Document objects from a list of texts.

split_documents

Split documents.

from_huggingface_tokenizer

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder

Text splitter that uses tiktoken encoder to count length.

from_language

Return an instance of this class based on a specific language.

get_separators_for_language

Retrieve a list of separators specific to the given language.

__init__

Initialize the JS Framework text splitter.

split_text

Split text into chunks.

transform_documents

transform_documents(documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]

Transform sequence of documents by splitting them.

atransform_documents async

atransform_documents(
    documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]

Asynchronously transform a list of documents.

PARAMETER DESCRIPTION
documents

A sequence of Document objects to be transformed.

TYPE: Sequence[Document]

RETURNS DESCRIPTION
Sequence[Document]

A sequence of transformed Document objects.

create_documents

create_documents(
    texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]

Create a list of Document objects from a list of texts.

split_documents

split_documents(documents: Iterable[Document]) -> list[Document]

Split documents.

from_huggingface_tokenizer classmethod

from_huggingface_tokenizer(
    tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder classmethod

from_tiktoken_encoder(
    encoding_name: str = "gpt2",
    model_name: str | None = None,
    allowed_special: Literal["all"] | Set[str] = set(),
    disallowed_special: Literal["all"] | Collection[str] = "all",
    **kwargs: Any,
) -> Self

Text splitter that uses tiktoken encoder to count length.

from_language classmethod

from_language(language: Language, **kwargs: Any) -> RecursiveCharacterTextSplitter

Return an instance of this class based on a specific language.

This method initializes the text splitter with language-specific separators.

PARAMETER DESCRIPTION
language

The language to configure the text splitter for.

TYPE: Language

**kwargs

Additional keyword arguments to customize the splitter.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
RecursiveCharacterTextSplitter

An instance of the text splitter configured for the specified language.

get_separators_for_language staticmethod

get_separators_for_language(language: Language) -> list[str]

Retrieve a list of separators specific to the given language.

PARAMETER DESCRIPTION
language

The language for which to get the separators.

TYPE: Language

RETURNS DESCRIPTION
list[str]

A list of separators appropriate for the specified language.

__init__

__init__(
    separators: list[str] | None = None,
    chunk_size: int = 2000,
    chunk_overlap: int = 0,
    **kwargs: Any,
) -> None

Initialize the JS Framework text splitter.

PARAMETER DESCRIPTION
separators

Optional list of custom separator strings to use

TYPE: list[str] | None DEFAULT: None

chunk_size

Maximum size of chunks to return

TYPE: int DEFAULT: 2000

chunk_overlap

Overlap in characters between chunks

TYPE: int DEFAULT: 0

**kwargs

Additional arguments to pass to parent class

TYPE: Any DEFAULT: {}

split_text

split_text(text: str) -> list[str]

Split text into chunks.

This method splits the text into chunks by:

  • Extracting unique opening component tags using regex
  • Creating separators list with extracted tags and JS separators
  • Splitting the text using the separators by calling the parent class method
PARAMETER DESCRIPTION
text

String containing code to split

TYPE: str

RETURNS DESCRIPTION
list[str]

List of text chunks split on component and JS boundaries

KonlpyTextSplitter

Bases: TextSplitter

Splitting text using the Konlpy package.

It is well suited to splitting Korean text.

METHOD DESCRIPTION
transform_documents

Transform sequence of documents by splitting them.

atransform_documents

Asynchronously transform a list of documents.

create_documents

Create a list of Document objects from a list of texts.

split_documents

Split documents.

from_huggingface_tokenizer

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder

Text splitter that uses tiktoken encoder to count length.

__init__

Initialize the Konlpy text splitter.

split_text

Split incoming text and return chunks.

transform_documents

transform_documents(documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]

Transform sequence of documents by splitting them.

atransform_documents async

atransform_documents(
    documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]

Asynchronously transform a list of documents.

PARAMETER DESCRIPTION
documents

A sequence of Document objects to be transformed.

TYPE: Sequence[Document]

RETURNS DESCRIPTION
Sequence[Document]

A sequence of transformed Document objects.

create_documents

create_documents(
    texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]

Create a list of Document objects from a list of texts.

split_documents

split_documents(documents: Iterable[Document]) -> list[Document]

Split documents.

from_huggingface_tokenizer classmethod

from_huggingface_tokenizer(
    tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder classmethod

from_tiktoken_encoder(
    encoding_name: str = "gpt2",
    model_name: str | None = None,
    allowed_special: Literal["all"] | Set[str] = set(),
    disallowed_special: Literal["all"] | Collection[str] = "all",
    **kwargs: Any,
) -> Self

Text splitter that uses tiktoken encoder to count length.

__init__

__init__(separator: str = '\n\n', **kwargs: Any) -> None

Initialize the Konlpy text splitter.

split_text

split_text(text: str) -> list[str]

Split incoming text and return chunks.
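
A brief usage sketch (requires the konlpy package; korean_text is a placeholder):

from langchain_text_splitters import KonlpyTextSplitter

splitter = KonlpyTextSplitter(chunk_size=200, chunk_overlap=0)
chunks = splitter.split_text(korean_text)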

LatexTextSplitter

Bases: RecursiveCharacterTextSplitter

Attempts to split the text along LaTeX-formatted layout elements.

METHOD DESCRIPTION
transform_documents

Transform sequence of documents by splitting them.

atransform_documents

Asynchronously transform a list of documents.

split_text

Split the input text into smaller chunks based on predefined separators.

create_documents

Create a list of Document objects from a list of texts.

split_documents

Split documents.

from_huggingface_tokenizer

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder

Text splitter that uses tiktoken encoder to count length.

from_language

Return an instance of this class based on a specific language.

get_separators_for_language

Retrieve a list of separators specific to the given language.

__init__

Initialize a LatexTextSplitter.

transform_documents

transform_documents(documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]

Transform sequence of documents by splitting them.

atransform_documents async

atransform_documents(
    documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]

Asynchronously transform a list of documents.

PARAMETER DESCRIPTION
documents

A sequence of Document objects to be transformed.

TYPE: Sequence[Document]

RETURNS DESCRIPTION
Sequence[Document]

A sequence of transformed Document objects.

split_text

split_text(text: str) -> list[str]

Split the input text into smaller chunks based on predefined separators.

PARAMETER DESCRIPTION
text

The input text to be split.

TYPE: str

RETURNS DESCRIPTION
list[str]

A list of text chunks obtained after splitting.

create_documents

create_documents(
    texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]

Create a list of Document objects from a list of texts.

split_documents

split_documents(documents: Iterable[Document]) -> list[Document]

Split documents.

from_huggingface_tokenizer classmethod

from_huggingface_tokenizer(
    tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder classmethod

from_tiktoken_encoder(
    encoding_name: str = "gpt2",
    model_name: str | None = None,
    allowed_special: Literal["all"] | Set[str] = set(),
    disallowed_special: Literal["all"] | Collection[str] = "all",
    **kwargs: Any,
) -> Self

Text splitter that uses tiktoken encoder to count length.

from_language classmethod

from_language(language: Language, **kwargs: Any) -> RecursiveCharacterTextSplitter

Return an instance of this class based on a specific language.

This method initializes the text splitter with language-specific separators.

PARAMETER DESCRIPTION
language

The language to configure the text splitter for.

TYPE: Language

**kwargs

Additional keyword arguments to customize the splitter.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
RecursiveCharacterTextSplitter

An instance of the text splitter configured for the specified language.

get_separators_for_language staticmethod

get_separators_for_language(language: Language) -> list[str]

Retrieve a list of separators specific to the given language.

PARAMETER DESCRIPTION
language

The language for which to get the separators.

TYPE: Language

RETURNS DESCRIPTION
list[str]

A list of separators appropriate for the specified language.

__init__

__init__(**kwargs: Any) -> None

Initialize a LatexTextSplitter.
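
A brief usage sketch (latex_source is a placeholder):

from langchain_text_splitters import LatexTextSplitter

splitter = LatexTextSplitter(chunk_size=400, chunk_overlap=0)
chunks = splitter.split_text(latex_source)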

ExperimentalMarkdownSyntaxTextSplitter

An experimental text splitter for handling Markdown syntax.

This splitter aims to retain the exact whitespace of the original text while extracting structured metadata, such as headers. It is a re-implementation of the MarkdownHeaderTextSplitter with notable changes to the approach and additional features.

Key Features:

  • Retains the original whitespace and formatting of the Markdown text.
  • Extracts headers, code blocks, and horizontal rules as metadata.
  • Splits out code blocks and includes the language in the "Code" metadata key.
  • Splits text on horizontal rules (---) as well.
  • Defaults to sensible splitting behavior, which can be overridden using the headers_to_split_on parameter.

Example:

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
splitter = ExperimentalMarkdownSyntaxTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = splitter.split_text(text)
for chunk in chunks:
    print(chunk)

This class is currently experimental and subject to change based on feedback and further development.

METHOD DESCRIPTION
__init__

Initialize the text splitter with header splitting and formatting options.

split_text

Split the input text into structured chunks.

__init__

__init__(
    headers_to_split_on: list[tuple[str, str]] | None = None,
    return_each_line: bool = False,
    strip_headers: bool = True,
) -> None

Initialize the text splitter with header splitting and formatting options.

This constructor sets up the required configuration for splitting text into chunks based on specified headers and formatting preferences.

PARAMETER DESCRIPTION
headers_to_split_on

A list of tuples, where each tuple contains a header tag (e.g., "h1") and its corresponding metadata key. If None, default headers are used.

TYPE: list[tuple[str, str]] | None DEFAULT: None

return_each_line

Whether to return each line as an individual chunk. Defaults to False, which aggregates lines into larger chunks.

TYPE: bool DEFAULT: False

strip_headers

Whether to exclude headers from the resulting chunks.

TYPE: bool DEFAULT: True

split_text

split_text(text: str) -> list[Document]

Split the input text into structured chunks.

This method processes the input text line by line, identifying and handling specific patterns such as headers, code blocks, and horizontal rules to split it into structured chunks based on headers, code blocks, and horizontal rules.

PARAMETER DESCRIPTION
text

The input text to be split into chunks.

TYPE: str

RETURNS DESCRIPTION
list[Document]

A list of Document objects representing the structured chunks of the input text. If return_each_line is enabled, each line is returned as a separate Document.

HeaderType

Bases: TypedDict

Header type as typed dict.

LineType

Bases: TypedDict

Line type as typed dict.

MarkdownHeaderTextSplitter

Splitting Markdown files based on specified headers.

METHOD DESCRIPTION
__init__

Create a new MarkdownHeaderTextSplitter.

aggregate_lines_to_chunks

Combine lines with common metadata into chunks.

split_text

Split markdown file.

__init__

__init__(
    headers_to_split_on: list[tuple[str, str]],
    return_each_line: bool = False,
    strip_headers: bool = True,
    custom_header_patterns: dict[str, int] | None = None,
) -> None

Create a new MarkdownHeaderTextSplitter.

PARAMETER DESCRIPTION
headers_to_split_on

Headers we want to track

TYPE: list[tuple[str, str]]

return_each_line

Return each line w/ associated headers

TYPE: bool DEFAULT: False

strip_headers

Strip split headers from the content of the chunk

TYPE: bool DEFAULT: True

custom_header_patterns

Optional dict mapping header patterns to their levels. For example: {"**": 1, "***": 2} to treat **Header** as level 1 and ***Header*** as level 2 headers.

TYPE: dict[str, int] | None DEFAULT: None

aggregate_lines_to_chunks

aggregate_lines_to_chunks(lines: list[LineType]) -> list[Document]

Combine lines with common metadata into chunks.

PARAMETER DESCRIPTION
lines

Line of text / associated header metadata

TYPE: list[LineType]

split_text

split_text(text: str) -> list[Document]

Split markdown file.

PARAMETER DESCRIPTION
text

Markdown file

TYPE: str
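
A brief usage sketch; each returned Document carries its governing headers in metadata:

from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
docs = splitter.split_text("# Intro\n\nHello.\n\n## Details\n\nMore text.")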

MarkdownTextSplitter

Bases: RecursiveCharacterTextSplitter

Attempts to split the text along Markdown-formatted headings.

METHOD DESCRIPTION
transform_documents

Transform sequence of documents by splitting them.

atransform_documents

Asynchronously transform a list of documents.

split_text

Split the input text into smaller chunks based on predefined separators.

create_documents

Create a list of Document objects from a list of texts.

split_documents

Split documents.

from_huggingface_tokenizer

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder

Text splitter that uses tiktoken encoder to count length.

from_language

Return an instance of this class based on a specific language.

get_separators_for_language

Retrieve a list of separators specific to the given language.

__init__

Initialize a MarkdownTextSplitter.

transform_documents

transform_documents(documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]

Transform sequence of documents by splitting them.

atransform_documents async

atransform_documents(
    documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]

Asynchronously transform a list of documents.

PARAMETER DESCRIPTION
documents

A sequence of Document objects to be transformed.

TYPE: Sequence[Document]

RETURNS DESCRIPTION
Sequence[Document]

A sequence of transformed Document objects.

split_text

split_text(text: str) -> list[str]

Split the input text into smaller chunks based on predefined separators.

PARAMETER DESCRIPTION
text

The input text to be split.

TYPE: str

RETURNS DESCRIPTION
list[str]

A list of text chunks obtained after splitting.

create_documents

create_documents(
    texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]

Create a list of Document objects from a list of texts.

split_documents

split_documents(documents: Iterable[Document]) -> list[Document]

Split documents.

from_huggingface_tokenizer classmethod

from_huggingface_tokenizer(
    tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder classmethod

from_tiktoken_encoder(
    encoding_name: str = "gpt2",
    model_name: str | None = None,
    allowed_special: Literal["all"] | Set[str] = set(),
    disallowed_special: Literal["all"] | Collection[str] = "all",
    **kwargs: Any,
) -> Self

Text splitter that uses tiktoken encoder to count length.

from_language classmethod

from_language(language: Language, **kwargs: Any) -> RecursiveCharacterTextSplitter

Return an instance of this class based on a specific language.

This method initializes the text splitter with language-specific separators.

PARAMETER DESCRIPTION
language

The language to configure the text splitter for.

TYPE: Language

**kwargs

Additional keyword arguments to customize the splitter.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
RecursiveCharacterTextSplitter

An instance of the text splitter configured for the specified language.

get_separators_for_language staticmethod

get_separators_for_language(language: Language) -> list[str]

Retrieve a list of separators specific to the given language.

PARAMETER DESCRIPTION
language

The language for which to get the separators.

TYPE: Language

RETURNS DESCRIPTION
list[str]

A list of separators appropriate for the specified language.

__init__

__init__(**kwargs: Any) -> None

Initialize a MarkdownTextSplitter.
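
A brief usage sketch (markdown_text is a placeholder):

from langchain_text_splitters import MarkdownTextSplitter

splitter = MarkdownTextSplitter(chunk_size=500, chunk_overlap=0)
chunks = splitter.split_text(markdown_text)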

NLTKTextSplitter

Bases: TextSplitter

Splitting text using the NLTK package.

METHOD DESCRIPTION
transform_documents

Transform sequence of documents by splitting them.

atransform_documents

Asynchronously transform a list of documents.

create_documents

Create a list of Document objects from a list of texts.

split_documents

Split documents.

from_huggingface_tokenizer

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder

Text splitter that uses tiktoken encoder to count length.

__init__

Initialize the NLTK splitter.

split_text

Split incoming text and return chunks.

transform_documents

transform_documents(documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]

Transform sequence of documents by splitting them.

atransform_documents async

atransform_documents(
    documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]

Asynchronously transform a list of documents.

PARAMETER DESCRIPTION
documents

A sequence of Document objects to be transformed.

TYPE: Sequence[Document]

RETURNS DESCRIPTION
Sequence[Document]

A sequence of transformed Document objects.

create_documents

create_documents(
    texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]

Create a list of Document objects from a list of texts.

split_documents

split_documents(documents: Iterable[Document]) -> list[Document]

Split documents.

from_huggingface_tokenizer classmethod

from_huggingface_tokenizer(
    tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder classmethod

from_tiktoken_encoder(
    encoding_name: str = "gpt2",
    model_name: str | None = None,
    allowed_special: Literal["all"] | Set[str] = set(),
    disallowed_special: Literal["all"] | Collection[str] = "all",
    **kwargs: Any,
) -> Self

Text splitter that uses tiktoken encoder to count length.

__init__

__init__(
    separator: str = "\n\n",
    language: str = "english",
    *,
    use_span_tokenize: bool = False,
    **kwargs: Any,
) -> None

Initialize the NLTK splitter.

split_text

split_text(text: str) -> list[str]

Split incoming text and return chunks.
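
A brief usage sketch (requires the nltk package and its punkt sentence data; long_text is a placeholder):

from langchain_text_splitters import NLTKTextSplitter

splitter = NLTKTextSplitter(chunk_size=1000)
chunks = splitter.split_text(long_text)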

PythonCodeTextSplitter

Bases: RecursiveCharacterTextSplitter

Attempts to split the text along Python syntax.

METHOD DESCRIPTION
transform_documents

Transform sequence of documents by splitting them.

atransform_documents

Asynchronously transform a list of documents.

split_text

Split the input text into smaller chunks based on predefined separators.

create_documents

Create a list of Document objects from a list of texts.

split_documents

Split documents.

from_huggingface_tokenizer

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder

Text splitter that uses tiktoken encoder to count length.

from_language

Return an instance of this class based on a specific language.

get_separators_for_language

Retrieve a list of separators specific to the given language.

__init__

Initialize a PythonCodeTextSplitter.

transform_documents

transform_documents(documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]

Transform sequence of documents by splitting them.

atransform_documents async

atransform_documents(
    documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]

Asynchronously transform a list of documents.

PARAMETER DESCRIPTION
documents

A sequence of Document objects to be transformed.

TYPE: Sequence[Document]

RETURNS DESCRIPTION
Sequence[Document]

A sequence of transformed Document objects.

split_text

split_text(text: str) -> list[str]

Split the input text into smaller chunks based on predefined separators.

PARAMETER DESCRIPTION
text

The input text to be split.

TYPE: str

RETURNS DESCRIPTION
list[str]

A list of text chunks obtained after splitting.

create_documents

create_documents(
    texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]

Create a list of Document objects from a list of texts.

split_documents

split_documents(documents: Iterable[Document]) -> list[Document]

Split documents.

from_huggingface_tokenizer classmethod

from_huggingface_tokenizer(
    tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder classmethod

from_tiktoken_encoder(
    encoding_name: str = "gpt2",
    model_name: str | None = None,
    allowed_special: Literal["all"] | Set[str] = set(),
    disallowed_special: Literal["all"] | Collection[str] = "all",
    **kwargs: Any,
) -> Self

Text splitter that uses tiktoken encoder to count length.

from_language classmethod

from_language(language: Language, **kwargs: Any) -> RecursiveCharacterTextSplitter

Return an instance of this class based on a specific language.

This method initializes the text splitter with language-specific separators.

PARAMETER DESCRIPTION
language

The language to configure the text splitter for.

TYPE: Language

**kwargs

Additional keyword arguments to customize the splitter.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
RecursiveCharacterTextSplitter

An instance of the text splitter configured for the specified language.

get_separators_for_language staticmethod

get_separators_for_language(language: Language) -> list[str]

Retrieve a list of separators specific to the given language.

PARAMETER DESCRIPTION
language

The language for which to get the separators.

TYPE: Language

RETURNS DESCRIPTION
list[str]

A list of separators appropriate for the specified language.

__init__

__init__(**kwargs: Any) -> None

Initialize a PythonCodeTextSplitter.
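
A brief usage sketch (python_source is a placeholder):

from langchain_text_splitters import PythonCodeTextSplitter

splitter = PythonCodeTextSplitter(chunk_size=120, chunk_overlap=0)
chunks = splitter.split_text(python_source)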

SentenceTransformersTokenTextSplitter

Bases: TextSplitter

Splitting text into tokens using a sentence-transformers model tokenizer.

METHOD DESCRIPTION
transform_documents

Transform sequence of documents by splitting them.

atransform_documents

Asynchronously transform a list of documents.

create_documents

Create a list of Document objects from a list of texts.

split_documents

Split documents.

from_huggingface_tokenizer

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder

Text splitter that uses tiktoken encoder to count length.

__init__

Create a new TextSplitter.

split_text

Splits the input text into smaller components by splitting text on tokens.

count_tokens

Counts the number of tokens in the given text.

transform_documents

transform_documents(documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]

Transform sequence of documents by splitting them.

atransform_documents async

atransform_documents(
    documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]

Asynchronously transform a list of documents.

PARAMETER DESCRIPTION
documents

A sequence of Document objects to be transformed.

TYPE: Sequence[Document]

RETURNS DESCRIPTION
Sequence[Document]

A sequence of transformed Document objects.

create_documents

create_documents(
    texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]

Create a list of Document objects from a list of texts.

split_documents

split_documents(documents: Iterable[Document]) -> list[Document]

Split documents.

from_huggingface_tokenizer classmethod

from_huggingface_tokenizer(
    tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder classmethod

from_tiktoken_encoder(
    encoding_name: str = "gpt2",
    model_name: str | None = None,
    allowed_special: Literal["all"] | Set[str] = set(),
    disallowed_special: Literal["all"] | Collection[str] = "all",
    **kwargs: Any,
) -> Self

Text splitter that uses tiktoken encoder to count length.

__init__

__init__(
    chunk_overlap: int = 50,
    model_name: str = "sentence-transformers/all-mpnet-base-v2",
    tokens_per_chunk: int | None = None,
    **kwargs: Any,
) -> None

Create a new TextSplitter.

split_text

split_text(text: str) -> list[str]

Splits the input text into smaller components by splitting text on tokens.

This method encodes the input text using a private _encode method, then strips the start and stop token IDs from the encoded result. It returns the processed segments as a list of strings.

PARAMETER DESCRIPTION
text

The input text to be split.

TYPE: str

RETURNS DESCRIPTION
list[str]

A list of string components derived from the input text after encoding and processing.

count_tokens

count_tokens(*, text: str) -> int

Counts the number of tokens in the given text.

This method encodes the input text using a private _encode method and calculates the total number of tokens in the encoded result.

PARAMETER DESCRIPTION
text

The input text for which the token count is calculated.

TYPE: str

RETURNS DESCRIPTION
int

The number of tokens in the encoded text.

TYPE: int
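
A brief usage sketch (requires the sentence-transformers package):

from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",
    tokens_per_chunk=256,
    chunk_overlap=50,
)
token_count = splitter.count_tokens(text="Lorem ipsum dolor sit amet.")
chunks = splitter.split_text("Lorem ipsum dolor sit amet.")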

SpacyTextSplitter

Bases: TextSplitter

Splitting text using the spaCy package.

By default, spaCy's en_core_web_sm model is used; its default max_length is 1,000,000 (the maximum number of characters the model will process, which can be increased for large files). For faster but potentially less accurate splitting, you can use pipeline='sentencizer'.

METHOD DESCRIPTION
transform_documents

Transform sequence of documents by splitting them.

atransform_documents

Asynchronously transform a list of documents.

create_documents

Create a list of Document objects from a list of texts.

split_documents

Split documents.

from_huggingface_tokenizer

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder

Text splitter that uses tiktoken encoder to count length.

__init__

Initialize the spacy text splitter.

split_text

Split incoming text and return chunks.

transform_documents

transform_documents(documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]

Transform sequence of documents by splitting them.

atransform_documents async

atransform_documents(
    documents: Sequence[Document], **kwargs: Any
) -> Sequence[Document]

Asynchronously transform a list of documents.

PARAMETER DESCRIPTION
documents

A sequence of Document objects to be transformed.

TYPE: Sequence[Document]

RETURNS DESCRIPTION
Sequence[Document]

A sequence of transformed Document objects.

create_documents

create_documents(
    texts: list[str], metadatas: list[dict[Any, Any]] | None = None
) -> list[Document]

Create a list of Document objects from a list of texts.

split_documents

split_documents(documents: Iterable[Document]) -> list[Document]

Split documents.

from_huggingface_tokenizer classmethod

from_huggingface_tokenizer(
    tokenizer: PreTrainedTokenizerBase, **kwargs: Any
) -> TextSplitter

Text splitter that uses Hugging Face tokenizer to count length.

from_tiktoken_encoder classmethod

from_tiktoken_encoder(
    encoding_name: str = "gpt2",
    model_name: str | None = None,
    allowed_special: Literal["all"] | Set[str] = set(),
    disallowed_special: Literal["all"] | Collection[str] = "all",
    **kwargs: Any,
) -> Self

Text splitter that uses tiktoken encoder to count length.

__init__

__init__(
    separator: str = "\n\n",
    pipeline: str = "en_core_web_sm",
    max_length: int = 1000000,
    *,
    strip_whitespace: bool = True,
    **kwargs: Any,
) -> None

Initialize the spacy text splitter.

split_text

split_text(text: str) -> list[str]

Split incoming text and return chunks.
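
A brief usage sketch; the sentencizer pipeline avoids loading a full model (long_text is a placeholder):

from langchain_text_splitters import SpacyTextSplitter

splitter = SpacyTextSplitter(pipeline="sentencizer", chunk_size=1000)
chunks = splitter.split_text(long_text)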

split_text_on_tokens

split_text_on_tokens(*, text: str, tokenizer: Tokenizer) -> list[str]

Split incoming text and return chunks using tokenizer.