Class●Since v0.2

ExperimentalMarkdownSyntaxTextSplitter

An experimental text splitter for handling Markdown syntax.

This splitter aims to retain the exact whitespace of the original text while extracting structured metadata, such as headers. It is a re-implementation of the MarkdownHeaderTextSplitter with notable changes to the approach and additional features.

Key Features:

Retains the original whitespace and formatting of the Markdown text.
Extracts headers, code blocks, and horizontal rules as metadata.
Splits out code blocks and includes the language in the "Code" metadata key.
Splits text on horizontal rules (---) as well.
Defaults to sensible splitting behavior, which can be overridden using the headers_to_split_on parameter.

ExperimentalMarkdownSyntaxTextSplitter(
  self,
  headers_to_split_on: list[tuple[str, str]] | None = None,
  return_each_line: bool = False,
  strip_headers: bool = True
)

Example:

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
splitter = ExperimentalMarkdownSyntaxTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = splitter.split(text)
for chunk in chunks:
    print(chunk)

This class is currently experimental and subject to change based on feedback and further development.

Parameters

Name	Type	Description
`headers_to_split_on`	`list[tuple[str, str]] \| None`	Default:`None` A list of tuples, where each tuple contains a header tag (e.g., "h1") and its corresponding metadata key. If `None`, default headers are used.
`return_each_line`	`bool`	Default:`False` Whether to return each line as an individual chunk. Defaults to `False`, which aggregates lines into larger chunks.
`strip_headers`	`bool`	Default:`True` Whether to exclude headers from the resulting chunks.

Constructors

constructor

__init__

Name	Type
headers_to_split_on	list[tuple[str, str]] \| None
return_each_line	bool
strip_headers	bool

Attributes

attribute

chunks: list[Document]

attribute

current_chunk

attribute

current_header_stack: list[tuple[int, str]]

attribute

strip_headers: strip_headers

attribute

splittable_headers

attribute

return_each_line: return_each_line

Methods

method

split_text

Split the input text into structured chunks.

This method processes the input text line by line, identifying and handling specific patterns such as headers, code blocks, and horizontal rules to split it into structured chunks based on headers, code blocks, and horizontal rules.

View source on GitHub

ExperimentalMarkdownSyntaxTextSplitter

An experimental text splitter for handling Markdown syntax.

Key Features:

Retains the original whitespace and formatting of the Markdown text.
Extracts headers, code blocks, and horizontal rules as metadata.
Splits out code blocks and includes the language in the "Code" metadata key.
Splits text on horizontal rules (---) as well.
Defaults to sensible splitting behavior, which can be overridden using the headers_to_split_on parameter.

headers_to_split_on = [ ("#", "Header 1"), ("##", "Header 2"), ] splitter = ExperimentalMarkdownSyntaxTextSplitter( headers_to_split_on=headers_to_split_on ) chunks = splitter.split(text) for chunk in chunks: print(chunk)

Parameters

Name	Type	Description
`headers_to_split_on`	`list[tuple[str, str]] \| None`	Default:`None` A list of tuples, where each tuple contains a header tag (e.g., "h1") and its corresponding metadata key. If `None`, default headers are used.
`return_each_line`	`bool`	Default:`False` Whether to return each line as an individual chunk. Defaults to `False`, which aggregates lines into larger chunks.
`strip_headers`	`bool`	Default:`True` Whether to exclude headers from the resulting chunks.