Split HTML content into structured Documents based on specified headers.
Splits HTML content by detecting specified header tags and creating hierarchical
Document objects that reflect the semantic structure of the original content. For
each identified section, the splitter associates the extracted text with metadata
corresponding to the encountered headers.
If no specified headers are found, the entire content is returned as a single
Document. This allows for flexible handling of HTML input, ensuring that
information is organized according to its semantic headers.
The splitter provides the option to return each HTML element as a separate
Document or aggregate them into semantically meaningful chunks. It also
gracefully handles multiple levels of nested headers, creating a rich,
hierarchical representation of the content.
Example:
from langchain_text_splitters.html_header_text_splitter import (
HTMLHeaderTextSplitter,
)
# Define headers for splitting on h1 and h2 tags.
headers_to_split_on = [("h1", "Main Topic"), ("h2", "Sub Topic")]
splitter = HTMLHeaderTextSplitter(
headers_to_split_on=headers_to_split_on,
return_each_element=False
)
html_content = """
<html>
<body>
<h1>Introduction</h1>
<p>Welcome to the introduction section.</p>
<h2>Background</h2>
<p>Some background details here.</p>
<h1>Conclusion</h1>
<p>Final thoughts.</p>
</body>
</html>
"""
documents = splitter.split_text(html_content)
# 'documents' now contains Document objects reflecting the hierarchy:
# - Document with metadata={"Main Topic": "Introduction"} and
# content="Introduction"
# - Document with metadata={"Main Topic": "Introduction"} and
# content="Welcome to the introduction section."
# - Document with metadata={"Main Topic": "Introduction",
# "Sub Topic": "Background"} and content="Background"
# - Document with metadata={"Main Topic": "Introduction",
# "Sub Topic": "Background"} and content="Some background details here."
# - Document with metadata={"Main Topic": "Conclusion"} and
# content="Conclusion"
# - Document with metadata={"Main Topic": "Conclusion"} and
# content="Final thoughts."A list of (header_tag, header_name) pairs representing the headers that define splitting
boundaries.
For example, [("h1", "Header 1"), ("h2", "Header 2")] will split
content by h1 and h2 tags, assigning their textual content to the
Document metadata.
If True, every HTML element encountered
(including headers, paragraphs, etc.) is returned as a separate
Document.
If False, content under the same header hierarchy is aggregated into
fewer Document objects.
Split the given text into a list of Document objects.
Fetch text content from a URL and split it into documents.
Split HTML content from a file into a list of Document objects.