| Name | Type | Description |
|---|---|---|
| headers_to_split_on* | list[tuple[str, str]] | HTML headers (e.g., h1, h2) to split on. |
| max_chunk_size | int | Default: 1000. Maximum size for each chunk, with allowance for exceeding this limit to preserve semantics. |
| chunk_overlap | int | Default: 0 |
| separators | list[str] \| None | Default: None |
| elements_to_preserve | list[str] \| None | Default: None |
| preserve_links | bool | Default: False |
| preserve_images | bool | Default: False |
| preserve_videos | bool | Default: False |
| preserve_audio | bool | Default: False |
| custom_handlers | dict[str, Callable[[Tag], str]] \| None | Default: None |
| stopword_removal | bool | Default: False |
| stopword_lang | str | Default: 'english' |
| normalize_text | bool | Default: False |
| external_metadata | dict[str, str] \| None | Default: None |
| allowlist_tags | list[str] \| None | Default: None |
| denylist_tags | list[str] \| None | Default: None |
| preserve_parent_metadata | bool | Default: False |
| keep_separator | bool \| Literal['start', 'end'] | Default: True |


Split HTML content preserving semantic structure.

Splits HTML content by headers into generalized chunks, preserving semantic structure. If chunks exceed the maximum chunk size, it uses RecursiveCharacterTextSplitter for further splitting.

The splitter preserves full HTML elements and converts links to Markdown-like links. It can also preserve images, videos, and audio elements by converting them into Markdown format. Note that some chunks may exceed the maximum size to maintain semantic integrity.

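
To make the two-stage behaviour concrete, here is a minimal, standard-library-only sketch of the idea: group text under the nearest header, then split any section that is too long. It is an illustration under simplifying assumptions, not the library's implementation. The names `HeaderSectionParser`, `split_oversized`, and `split_html` are hypothetical, and the word-packing fallback stands in for RecursiveCharacterTextSplitter.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Collect text into sections, starting a new section at each listed header tag."""

    def __init__(self, header_tags):
        super().__init__()
        self.header_tags = set(header_tags)
        self.sections = []  # each item: {"header": str, "text": str}
        self._in_header = False
        self._current = {"header": "", "text": ""}

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            # A header closes the running section and opens a new one.
            self._flush()
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in self.header_tags:
            self._in_header = False

    def handle_data(self, data):
        key = "header" if self._in_header else "text"
        self._current[key] += data

    def _flush(self):
        if self._current["header"] or self._current["text"].strip():
            # Collapse runs of whitespace before storing the section.
            self._current["text"] = " ".join(self._current["text"].split())
            self.sections.append(self._current)
        self._current = {"header": "", "text": ""}

    def close(self):
        super().close()
        self._flush()


def split_oversized(text, max_chunk_size):
    """Greedily pack whole words into chunks of at most max_chunk_size characters.

    A single word longer than the limit is kept whole, so a chunk may exceed it.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def split_html(html, header_tags=("h1", "h2"), max_chunk_size=40):
    """Split by headers first; fall back to word packing for long sections."""
    parser = HeaderSectionParser(header_tags)
    parser.feed(html)
    parser.close()
    chunks = []
    for section in parser.sections:
        text = section["text"]
        pieces = [text] if len(text) <= max_chunk_size else split_oversized(text, max_chunk_size)
        chunks.extend({"header": section["header"], "text": piece} for piece in pieces)
    return chunks


sample = "<h1>Intro</h1><p>Short text.</p><h2>Details</h2><p>" + "word " * 20 + "</p>"
chunks = split_html(sample)
for chunk in chunks:
    print(chunk)
```

In this sketch the short "Intro" section stays in one chunk, while the long "Details" section is broken into several pieces that all carry the same header.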
Example:

    from langchain_text_splitters.html import HTMLSemanticPreservingSplitter

    def custom_iframe_extractor(iframe_tag):
        """Custom handler function to extract the 'src' attribute from an <iframe> tag.

        Args:
            iframe_tag (bs4.element.Tag): The <iframe> tag to process.

        Returns:
            str: A formatted string representing the iframe in Markdown-like format.
        """
        iframe_src = iframe_tag.get('src', '')
        return f"[iframe:{iframe_src}]({iframe_src})"

    _splitter = HTMLSemanticPreservingSplitter(
        headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
        max_chunk_size=500,
        preserve_links=True,
        preserve_images=True,
        custom_handlers={"iframe": custom_iframe_extractor},
    )

| Name | Description |
|---|---|
| chunk_overlap | Number of characters to overlap between chunks to ensure contextual continuity. |
| separators | Delimiters used by RecursiveCharacterTextSplitter for further splitting. |
| elements_to_preserve | HTML tags (e.g., table, ul) to remain intact during splitting. |
| preserve_links | Converts `<a>` tags to Markdown links (`[text](url)`). |
| preserve_images | Converts `<img>` tags to Markdown images (`![alt](src)`). |
| preserve_videos | Converts `<video>` tags to Markdown video links (`![video](src)`). |
| preserve_audio | Converts `<audio>` tags to Markdown audio links (`![audio](src)`). |
| custom_handlers | Optional custom handlers for specific HTML tags, allowing tailored extraction or processing. |
| stopword_removal | Optionally remove stopwords from the text. |
| stopword_lang | The language of stopwords to remove. |
| normalize_text | Optionally normalize text (e.g., lowercasing, removing punctuation). |
| external_metadata | Additional metadata to attach to the Document objects. |
| allowlist_tags | Only these tags will be retained in the HTML. |
| denylist_tags | These tags will be removed from the HTML. |
| preserve_parent_metadata | Whether to pass through parent document metadata to split documents when calling transform_documents/atransform_documents(). |
| keep_separator | Whether separators should be at the beginning of a chunk, at the end, or not at all. |
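
The link and image options describe a tag-to-Markdown rewrite. The sketch below shows that kind of conversion using only the standard library. It is an assumption-laden illustration: `MarkdownishConverter` and `to_markdownish` are hypothetical names, the `[text](url)` and `![alt](src)` shapes are standard Markdown syntax, and the library's exact output format may differ.

```python
from html.parser import HTMLParser


class MarkdownishConverter(HTMLParser):
    """Flatten HTML to text, rewriting <a> and <img> in Markdown-like form."""

    def __init__(self, preserve_links=True, preserve_images=True):
        super().__init__()
        self.preserve_links = preserve_links
        self.preserve_images = preserve_images
        self.parts = []
        self._href = None  # href of the <a> currently open, if any
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and self.preserve_links:
            # Start buffering the link text until the closing </a>.
            self._href = attrs.get("href", "")
            self._link_text = []
        elif tag == "img" and self.preserve_images:
            self.parts.append(f"![{attrs.get('alt', '')}]({attrs.get('src', '')})")

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.parts.append(f"[{''.join(self._link_text)}]({self._href})")
            self._href = None

    def handle_data(self, data):
        if self._href is not None:
            self._link_text.append(data)
        else:
            self.parts.append(data)


def to_markdownish(html, **options):
    """Return the text of `html` with links and images optionally preserved."""
    converter = MarkdownishConverter(**options)
    converter.feed(html)
    converter.close()
    return "".join(converter.parts)


html = '<p>See <a href="https://example.com">the docs</a> and <img src="logo.png" alt="Logo"></p>'
print(to_markdownish(html))
print(to_markdownish(html, preserve_links=False))
```

With both options on, the link target and image source survive in the plain-text chunk. With `preserve_links=False`, only the link text remains.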