Class · Since v0.3

HTMLSemanticPreservingSplitter

HTMLSemanticPreservingSplitter(
  self,
  headers_to_split_on: list[tuple[str, str]

Bases

BaseDocumentTransformer

Inherited from BaseDocumentTransformer (langchain_core)

Methods

atransform_documents

Parameters

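Name	Type	Description
`headers_to_split_on`*	`list[tuple[str, str]]`	HTML headers (e.g., `h1`, `h2`) that define content sections.
`max_chunk_size`	`int`	Default: `1000`. Maximum size for each chunk, with allowance for exceeding this limit to preserve semantics.
`chunk_overlap`	`int`	Default: `0`. Number of characters to overlap between chunks to ensure contextual continuity.
`separators`	`list[str] | None`	Default: `None`. Delimiters used by `RecursiveCharacterTextSplitter` for further splitting.
`elements_to_preserve`	`list[str] | None`	Default: `None`. HTML tags (e.g., `table`, `ul`) to remain intact during splitting.
`preserve_links`	`bool`	Default: `False`. Converts `a` tags to Markdown links (`[text](url)`).
`preserve_images`	`bool`	Default: `False`. Converts `img` tags to Markdown images (`![alt](src)`).
`preserve_videos`	`bool`	Default: `False`. Converts `video` tags to Markdown video links (`![video](src)`).
`preserve_audio`	`bool`	Default: `False`. Converts `audio` tags to Markdown audio links (`![audio](src)`).
`custom_handlers`	`dict[str, Callable[[Tag], str]] | None`	Default: `None`. Optional custom handlers for specific HTML tags, allowing tailored extraction or processing.
`stopword_removal`	`bool`	Default: `False`. Optionally remove stopwords from the text.
`stopword_lang`	`str`	Default: `'english'`. The language of stopwords to remove.
`normalize_text`	`bool`	Default: `False`. Optionally normalize text (e.g., lowercasing, removing punctuation).
`external_metadata`	`dict[str, str] | None`	Default: `None`. Additional metadata to attach to the Document objects.
`allowlist_tags`	`list[str] | None`	Default: `None`. Only these tags will be retained in the HTML.
`denylist_tags`	`list[str] | None`	Default: `None`. These tags will be removed from the HTML.
`preserve_parent_metadata`	`bool`	Default: `False`. Whether to pass through parent document metadata to split documents when calling `transform_documents()` or `atransform_documents()`.
`keep_separator`	`bool | Literal['start', 'end']`	Default: `True`. Whether separators should be at the beginning of a chunk, at the end, or not at all.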

constructor

__init__

Name	Type
headers_to_split_on	list[tuple[str, str]]
max_chunk_size	int
chunk_overlap	int
separators	list[str] | None
elements_to_preserve	list[str] | None
preserve_links	bool
preserve_images	bool
preserve_videos	bool
preserve_audio	bool
custom_handlers	dict[str, Callable[[Tag], str]] | None
stopword_removal	bool
stopword_lang	str
normalize_text	bool
external_metadata	dict[str, str] | None
allowlist_tags	list[str] | None
denylist_tags	list[str] | None
preserve_parent_metadata	bool
keep_separator	bool | Literal['start', 'end']

Split HTML content preserving semantic structure.

Splits HTML content by headers into generalized chunks, preserving semantic structure. If chunks exceed the maximum chunk size, it uses RecursiveCharacterTextSplitter for further splitting.
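The header-first, size-second strategy described above can be illustrated with a minimal standard-library sketch. This is not the library's implementation: the regex-based parsing is deliberately naive, the greedy word-level fallback only stands in for `RecursiveCharacterTextSplitter`, and the names `split_by_headers`, `split_oversized`, and `semantic_split` are hypothetical.

```python
import re


def split_by_headers(html, header_tags):
    """Return (header_text, body_text) pairs, one per matching header tag."""
    pattern = re.compile(
        r"<(?P<tag>%s)\b[^>]*>(?P<title>.*?)</(?P=tag)>" % "|".join(header_tags),
        flags=re.IGNORECASE | re.DOTALL,
    )
    matches = list(pattern.finditer(html))
    sections = []
    for i, match in enumerate(matches):
        # A section runs from the end of this header to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(html)
        body = re.sub(r"<[^>]+>", " ", html[match.end():end])  # strip remaining tags
        sections.append((match.group("title").strip(), " ".join(body.split())))
    return sections


def split_oversized(text, max_chunk_size):
    """Greedy word-boundary fallback; a single long word may still exceed the limit."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def semantic_split(html, header_tags, max_chunk_size):
    """Split by headers first, then break up any section that is too long."""
    result = []
    for header, body in split_by_headers(html, header_tags):
        for piece in split_oversized(body, max_chunk_size):
            result.append({"header": header, "content": piece})
    return result


sample = "<h1>Intro</h1><p>alpha beta gamma delta</p><h2>Details</h2><p>one two</p>"
chunks = semantic_split(sample, ["h1", "h2"], max_chunk_size=12)
for chunk in chunks:
    print(chunk)
```

Each chunk keeps the header it came from, which loosely mirrors how header labels end up in chunk metadata; the real splitter works on a parsed document rather than on regular expressions.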

The splitter preserves full HTML elements and converts links to Markdown-like links. It can also preserve images, videos, and audio elements by converting them into Markdown format. Note that some chunks may exceed the maximum size to maintain semantic integrity.
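To make the Markdown conversion concrete, here is a small sketch of rewriting `a` and `img` tags as Markdown while leaving other text untouched. It uses only `html.parser` from the standard library and is an illustration of the idea, not the splitter's own code; `MarkdownLinkConverter` and `to_markdown` are hypothetical names.

```python
from html.parser import HTMLParser


class MarkdownLinkConverter(HTMLParser):
    """Hypothetical helper: rewrites <a> and <img> as Markdown, keeps other text."""

    def __init__(self):
        super().__init__()
        self.parts = []          # output fragments, joined at the end
        self._href = None        # href of the <a> currently open, if any
        self._link_text = []     # text collected inside the open <a>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            self._href = attributes.get("href") or ""
            self._link_text = []
        elif tag == "img":
            alt = attributes.get("alt") or ""
            src = attributes.get("src") or ""
            self.parts.append(f"![{alt}]({src})")

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._link_text).strip()
            self.parts.append(f"[{text}]({self._href})")
            self._href = None

    def handle_data(self, data):
        # Text inside an open <a> becomes the link label; everything else passes through.
        target = self._link_text if self._href is not None else self.parts
        target.append(data)


def to_markdown(html):
    converter = MarkdownLinkConverter()
    converter.feed(html)
    converter.close()
    return "".join(converter.parts)


converted = to_markdown(
    '<p>See <a href="https://example.com">the docs</a> and '
    '<img src="logo.png" alt="Logo"></p>'
)
print(converted)
```

Video and audio elements could be handled the same way by emitting `![video](src)` or `![audio](src)` for the corresponding tags.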

Example:

from langchain_text_splitters.html import HTMLSemanticPreservingSplitter

def custom_iframe_extractor(iframe_tag):
    """
    Custom handler function to extract the 'src' attribute from an <iframe> tag.
    Converts the iframe to a Markdown-like link: [iframe:<src>](src).

    Args:
        iframe_tag (bs4.element.Tag): The <iframe> tag to be processed.

    Returns:
        str: A formatted string representing the iframe in Markdown-like format.
    """
    iframe_src = iframe_tag.get('src', '')
    return f"[iframe:{iframe_src}]({iframe_src})"

_splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
    max_chunk_size=500,
    preserve_links=True,
    preserve_images=True,
    custom_handlers={"iframe": custom_iframe_extractor}
)
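The iframe handler from the example only calls `.get('src', '')` on its argument, so it can be checked in isolation before it is registered under `custom_handlers`. The sketch below passes a plain `dict` as a stand-in for `bs4.element.Tag`; that substitution is an assumption made purely for illustration, and it works only because both objects expose `.get(key, default)`.

```python
def custom_iframe_extractor(iframe_tag):
    """Turn an <iframe> tag into a Markdown-like link built from its 'src'."""
    iframe_src = iframe_tag.get("src", "")
    return f"[iframe:{iframe_src}]({iframe_src})"


# A plain dict stands in for bs4.element.Tag here (illustration only).
fake_iframe = {"src": "https://example.com/embed"}
rendered = custom_iframe_extractor(fake_iframe)

# With no 'src' attribute the handler falls back to an empty string.
missing_src = custom_iframe_extractor({})

print(rendered)
print(repr(missing_src))
```

Checking the fallback case is worthwhile because a handler that raises on a missing attribute would presumably interrupt the whole split.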
