Split HTML content into structured Documents based on specified headers.

The splitter detects the specified header tags and creates Document objects
that reflect the semantic structure of the original content. For each
identified section, it associates the extracted text with metadata
corresponding to the headers encountered.

If none of the specified headers is found, the entire content is returned as a
single Document, so HTML without those headers still yields usable output.

The splitter can return each HTML element as a separate Document or aggregate
elements into semantically meaningful chunks. It also handles multiple levels
of nested headers: a Document under a lower-level header keeps the metadata of
the higher-level headers above it.
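
The header-to-metadata bookkeeping described above can be illustrated with a small standalone sketch. This is not the library's implementation: `HeaderSectionParser` and `split_by_headers` are hypothetical names, the code uses only the standard-library `html.parser`, it assumes the header tags are of the form `h1` to `h6`, and it returns plain `(text, metadata)` tuples rather than Document objects. Text under the same header is aggregated, roughly in the spirit of the aggregating mode.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Toy illustration (hypothetical, not library code) of header tracking."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag -> metadata key, e.g. {"h1": "Main Topic"}.
        self.names = dict(headers_to_split_on)
        self.active = {}          # header level -> (metadata key, header text)
        self.current_tag = None   # header tag currently being read, if any
        self.buffer = []          # text pieces collected since the last flush
        self.sections = []        # (text, metadata) pairs in document order

    def _take_text(self):
        # Join buffered pieces, normalize whitespace, and reset the buffer.
        text = " ".join(" ".join(self.buffer).split())
        self.buffer = []
        return text

    def _metadata(self):
        # Active headers ordered from the highest level (h1) downwards.
        return {key: value for _, (key, value) in sorted(self.active.items())}

    def _flush(self):
        text = self._take_text()
        if text:
            self.sections.append((text, self._metadata()))

    def handle_starttag(self, tag, attrs):
        if tag in self.names:
            self._flush()  # emit body text gathered under the previous header
            self.current_tag = tag

    def handle_data(self, data):
        self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            text = self._take_text()
            level = int(tag[1])  # assumption: tags look like "h1".."h6"
            # A new header replaces headers at the same or a deeper level.
            self.active = {l: v for l, v in self.active.items() if l < level}
            self.active[level] = (self.names[tag], text)
            self.current_tag = None
            self.sections.append((text, self._metadata()))

    def close(self):
        super().close()
        self._flush()  # emit any trailing body text


def split_by_headers(html, headers_to_split_on):
    parser = HeaderSectionParser(headers_to_split_on)
    parser.feed(html)
    parser.close()
    return parser.sections


demo_html = (
    "<h1>Intro</h1><p>Welcome.</p>"
    "<h2>Background</h2><p>Details.</p>"
    "<h1>End</h1><p>Bye.</p>"
)
sections = split_by_headers(demo_html, [("h1", "Main Topic"), ("h2", "Sub Topic")])
for text, metadata in sections:
    print(metadata, "->", text)
```

In this sketch, a new `h1` clears the previously active `h2`, which is why text after the second `h1` carries only the top-level key. Input with none of the listed headers produces a single section with empty metadata.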

HTMLHeaderTextSplitter(
    self,
    headers_to_split_on: list[tuple[str, str]],
    return_each_element: bool = False
)

Example:
from langchain_text_splitters import HTMLHeaderTextSplitter

# Define headers for splitting on h1 and h2 tags.
headers_to_split_on = [("h1", "Main Topic"), ("h2", "Sub Topic")]
splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    return_each_element=False,
)
html_content = """
<html>
    <body>
        <h1>Introduction</h1>
        <p>Welcome to the introduction section.</p>
        <h2>Background</h2>
        <p>Some background details here.</p>
        <h1>Conclusion</h1>
        <p>Final thoughts.</p>
    </body>
</html>
"""
documents = splitter.split_text(html_content)
# 'documents' now contains Document objects reflecting the hierarchy:
# - Document with metadata={"Main Topic": "Introduction"} and
#   content="Introduction"
# - Document with metadata={"Main Topic": "Introduction"} and
#   content="Welcome to the introduction section."
# - Document with metadata={"Main Topic": "Introduction",
#   "Sub Topic": "Background"} and content="Background"
# - Document with metadata={"Main Topic": "Introduction",
#   "Sub Topic": "Background"} and content="Some background details here."
# - Document with metadata={"Main Topic": "Conclusion"} and
#   content="Conclusion"
# - Document with metadata={"Main Topic": "Conclusion"} and
#   content="Final thoughts."

| Name | Type | Description |
|---|---|---|
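| headers_to_split_on* | list[tuple[str, str]] | A list of (header_tag, header_name) pairs that define the splitting boundaries. For example, [("h1", "Main Topic"), ("h2", "Sub Topic")] splits on h1 and h2 tags and stores each header's text in the Document metadata under the given name. |
| return_each_element | bool | Default: False. If True, each HTML element is returned as a separate Document. If False, elements are aggregated into semantically meaningful chunks. |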