Class · Since v0.0

HTMLHeaderTextSplitter

Split HTML content into structured Documents based on specified headers.

Splits HTML content by detecting specified header tags and creating hierarchical Document objects that reflect the semantic structure of the original content. For each identified section, the splitter associates the extracted text with metadata corresponding to the encountered headers.

If none of the specified headers are found, the entire content is returned as a single Document, so HTML without the expected headers is still handled rather than rejected.

The splitter can either return each HTML element as a separate Document or aggregate elements into semantically meaningful chunks. It also handles multiple levels of nested headers, so each Document's metadata reflects the header hierarchy above it.
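To make the header tracking described above concrete, here is a minimal sketch built only on the standard library's `html.parser`. It is not the library's implementation: `HeaderSketch` and `split_by_headers` are hypothetical names, sections are plain `(text, metadata)` tuples rather than Document objects, header text is not emitted as its own entry, and the tags are assumed to be `h1` through `h6`.

```python
from html.parser import HTMLParser


class HeaderSketch(HTMLParser):
    """Toy header-aware splitter; not the library's implementation."""

    def __init__(self, headers_to_split_on, return_each_element=False):
        super().__init__()
        self.names = dict(headers_to_split_on)  # tag -> metadata key
        self.return_each_element = return_each_element
        self.active = {}       # header tag -> header text currently in scope
        self.in_header = None  # header tag being read, if any
        self.sections = []     # (text, metadata) pairs, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.names:
            self.in_header = tag
            # A new header ends every header of the same or a deeper level.
            # Assumes "h1".."h6", which order correctly as plain strings.
            self.active = {t: v for t, v in self.active.items() if t < tag}
            self.active[tag] = ""

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.active[tag] = self.active[tag].strip()
            self.in_header = None

    def handle_data(self, data):
        if self.in_header is not None:
            # Text inside a tracked header becomes metadata, not content.
            self.active[self.in_header] += data
            return
        text = data.strip()
        if not text:
            return
        metadata = {self.names[t]: v for t, v in sorted(self.active.items())}
        last = self.sections[-1] if self.sections else None
        if not self.return_each_element and last and last[1] == metadata:
            # Aggregate text that shares the same header path.
            self.sections[-1] = (last[0] + "\n" + text, metadata)
        else:
            self.sections.append((text, metadata))


def split_by_headers(html, headers_to_split_on, return_each_element=False):
    parser = HeaderSketch(headers_to_split_on, return_each_element)
    parser.feed(html)
    parser.close()
    return parser.sections
```

In this sketch a second `h1` clears any `h2` that was in scope, which is how nested headers stay consistent, and input with no tracked headers collapses into a single section with empty metadata.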

HTMLHeaderTextSplitter(
  self,
  headers_to_split_on: list[tuple[str, str]],
  return_each_element: bool = False
)

Example:

from langchain_text_splitters import HTMLHeaderTextSplitter

# Define headers for splitting on h1 and h2 tags.
headers_to_split_on = [("h1", "Main Topic"), ("h2", "Sub Topic")]

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    return_each_element=False
)

html_content = """
<html>
    <body>
        <h1>Introduction</h1>
        <p>Welcome to the introduction section.</p>
        <h2>Background</h2>
        <p>Some background details here.</p>
        <h1>Conclusion</h1>
        <p>Final thoughts.</p>
    </body>
</html>
"""

documents = splitter.split_text(html_content)

# 'documents' now contains Document objects reflecting the hierarchy:
# - Document with metadata={"Main Topic": "Introduction"} and
#   content="Introduction"
# - Document with metadata={"Main Topic": "Introduction"} and
#   content="Welcome to the introduction section."
# - Document with metadata={"Main Topic": "Introduction",
#   "Sub Topic": "Background"} and content="Background"
# - Document with metadata={"Main Topic": "Introduction",
#   "Sub Topic": "Background"} and content="Some background details here."
# - Document with metadata={"Main Topic": "Conclusion"} and
#   content="Conclusion"
# - Document with metadata={"Main Topic": "Conclusion"} and
#   content="Final thoughts."

Parameters

Name	Type	Description
`headers_to_split_on`*	`list[tuple[str, str]]`	A list of `(header_tag, header_name)` pairs naming the header tags that define splitting boundaries. For example, `[("h1", "Header 1"), ("h2", "Header 2")]` splits content at `h1` and `h2` tags and stores each header's text in the `Document` metadata under the paired name.
`return_each_element`	`bool`	Default: `False`. If `True`, every HTML element encountered (including headers, paragraphs, etc.) is returned as a separate `Document`. If `False`, content under the same header hierarchy is aggregated into fewer `Document` objects.
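As a small illustration of how the pairs relate to metadata, the sketch below uses plain dictionaries only. `tag_to_key` and `metadata_for` are hypothetical names introduced for this example and are not part of the library.

```python
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]

# The first item of each pair is the tag to split on; the second is the
# metadata key under which that header's text is stored.
tag_to_key = dict(headers_to_split_on)


def metadata_for(header_path):
    """Build metadata from (tag, header text) pairs; hypothetical helper."""
    return {
        tag_to_key[tag]: text
        for tag, text in header_path
        if tag in tag_to_key  # tags that were not requested are skipped
    }


example = metadata_for(
    [("h1", "Introduction"), ("h2", "Background"), ("h3", "Not tracked")]
)
```

Here `example` holds entries for `"Header 1"` and `"Header 2"` only, because `h3` was not listed in `headers_to_split_on`.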

Constructors

constructor

__init__

Name	Type
headers_to_split_on	list[tuple[str, str]]
return_each_element	bool

Attributes

return_each_element: bool

Methods

method

split_text

Split the given text into a list of Document objects.

method

split_text_from_url

Fetch text content from a URL and split it into documents.

method

split_text_from_file

Split HTML content from a file into a list of Document objects.
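The URL and file methods above can be thought of as "load the HTML, then split it". The sketch below shows that pattern with the standard library only. It is an assumption about the general shape, not the library's code: the helper names are hypothetical, the file argument is assumed to be either a path or an open text file, and the library may fetch URLs differently.

```python
import io
import urllib.request
from pathlib import Path


def read_html_from_url(url, timeout=10):
    """Fetch a URL and decode the body as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the response declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def read_html_from_file(file):
    """Read HTML from a path or an open text file (hypothetical helper)."""
    if isinstance(file, (str, Path)):
        return Path(file).read_text(encoding="utf-8")
    return file.read()


def split_from_url(url, split_text, timeout=10):
    """Fetch the page, then pass the HTML to a split_text-like callable."""
    return split_text(read_html_from_url(url, timeout=timeout))


def split_from_file(file, split_text):
    """Read the file, then pass the HTML to a split_text-like callable."""
    return split_text(read_html_from_file(file))


# Usage with an in-memory file and a stand-in splitter that returns the
# whole input as a single chunk.
docs = split_from_file(io.StringIO("<h1>Hi</h1><p>Body</p>"), lambda html: [html])
```

With a real splitter, the stand-in callable would be replaced by the instance's `split_text` method.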
