# HTMLHeaderTextSplitter

> **Class** in `langchain_text_splitters`

📖 [View in docs](https://reference.langchain.com/python/langchain-text-splitters/html/HTMLHeaderTextSplitter)

Split HTML content into structured Documents based on specified headers.

Splits HTML content by detecting specified header tags and creating hierarchical
`Document` objects that reflect the semantic structure of the original content. For
each identified section, the splitter associates the extracted text with metadata
corresponding to the encountered headers.

If none of the specified headers are found, the entire content is returned as a
single `Document`, so HTML that lacks those header tags still produces usable
output.

The splitter can return each HTML element as a separate `Document` or aggregate
elements under the same headers into larger chunks. It also handles multiple
levels of nested headers, so the metadata of each `Document` records the chain
of headers above its text.
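
The header tracking described above can be illustrated with a minimal sketch built on the standard library's `html.parser`. This is not the library's implementation: `HeaderTracker`, its `active` stack, and `records` are hypothetical names, and the sketch assumes header tags of the form `h1` to `h6` that contain plain text only.

```python
from html.parser import HTMLParser


class HeaderTracker(HTMLParser):
    """Toy model: pair each piece of text with the headers active above it."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag -> metadata key, e.g. {"h1": "Main Topic"}.
        self.header_mapping = dict(headers_to_split_on)
        self.active = []  # stack of (level, metadata_key, header_text)
        self.current_tag = None
        self.records = []  # list of (text, metadata) pairs

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace between tags
        if self.current_tag in self.header_mapping:
            level = int(self.current_tag[1:])
            # A new header closes every header at the same or a deeper level.
            while self.active and self.active[-1][0] >= level:
                self.active.pop()
            key = self.header_mapping[self.current_tag]
            self.active.append((level, key, text))
        metadata = {key: value for _, key, value in self.active}
        self.records.append((text, metadata))


html_content = """
<html>
    <body>
        <h1>Introduction</h1>
        <p>Welcome to the introduction section.</p>
        <h2>Background</h2>
        <p>Some background details here.</p>
        <h1>Conclusion</h1>
        <p>Final thoughts.</p>
    </body>
</html>
"""

tracker = HeaderTracker([("h1", "Main Topic"), ("h2", "Sub Topic")])
tracker.feed(html_content)
tracker.close()

for text, metadata in tracker.records:
    print(metadata, "->", text)
```

Each new header pops any header at the same or a deeper level before it is pushed, which is why the `Conclusion` section no longer carries the earlier `Background` entry in its metadata.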

## Signature

```python
HTMLHeaderTextSplitter(
    headers_to_split_on: list[tuple[str, str]],
    return_each_element: bool = False,
)
```

## Description

**Example:**

```python
from langchain_text_splitters import HTMLHeaderTextSplitter

# Define headers for splitting on h1 and h2 tags.
headers_to_split_on = [("h1", "Main Topic"), ("h2", "Sub Topic")]

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    return_each_element=False
)

html_content = """
<html>
    <body>
        <h1>Introduction</h1>
        <p>Welcome to the introduction section.</p>
        <h2>Background</h2>
        <p>Some background details here.</p>
        <h1>Conclusion</h1>
        <p>Final thoughts.</p>
    </body>
</html>
"""

documents = splitter.split_text(html_content)

# 'documents' now contains Document objects reflecting the hierarchy:
# - Document with metadata={"Main Topic": "Introduction"} and
#   page_content="Introduction"
# - Document with metadata={"Main Topic": "Introduction"} and
#   page_content="Welcome to the introduction section."
# - Document with metadata={"Main Topic": "Introduction",
#   "Sub Topic": "Background"} and page_content="Background"
# - Document with metadata={"Main Topic": "Introduction",
#   "Sub Topic": "Background"} and page_content="Some background details here."
# - Document with metadata={"Main Topic": "Conclusion"} and
#   page_content="Conclusion"
# - Document with metadata={"Main Topic": "Conclusion"} and
#   page_content="Final thoughts."
```

## Parameters

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `headers_to_split_on` | `list[tuple[str, str]]` | Yes | A list of `(header_tag, header_name)` pairs representing the headers that define splitting boundaries.  For example, `[("h1", "Header 1"), ("h2", "Header 2")]` will split content by `h1` and `h2` tags, assigning their textual content to the `Document` metadata. |
| `return_each_element` | `bool` | No | If `True`, every HTML element encountered (including headers, paragraphs, etc.) is returned as a separate `Document`.  If `False`, content under the same header hierarchy is aggregated into fewer `Document` objects. (default: `False`) |
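
The effect of `return_each_element` can be sketched without the library. The following is a simplified model, not the actual implementation: `group_elements` and the `(kind, text, metadata)` tuples are hypothetical, the newline joiner is an arbitrary choice, and it assumes header text always stays in its own chunk, as in the example output above.

```python
def group_elements(elements, return_each_element=False):
    """Return (text, metadata) chunks from (kind, text, metadata) elements.

    With return_each_element=True every element becomes its own chunk.
    Otherwise consecutive "body" elements with identical metadata are merged.
    """
    chunks = []
    last_kind = None
    for kind, text, metadata in elements:
        can_merge = (
            not return_each_element
            and kind == "body"
            and last_kind == "body"
            and chunks[-1][1] == metadata
        )
        if can_merge:
            # Append this text to the previous chunk under the same headers.
            merged_text = chunks[-1][0] + "\n" + text
            chunks[-1] = (merged_text, metadata)
        else:
            chunks.append((text, metadata))
        last_kind = kind
    return chunks


elements = [
    ("header", "Introduction", {"Main Topic": "Introduction"}),
    ("body", "First paragraph.", {"Main Topic": "Introduction"}),
    ("body", "Second paragraph.", {"Main Topic": "Introduction"}),
    ("header", "Conclusion", {"Main Topic": "Conclusion"}),
    ("body", "Final thoughts.", {"Main Topic": "Conclusion"}),
]

per_element = group_elements(elements, return_each_element=True)
aggregated = group_elements(elements, return_each_element=False)
print(len(per_element), len(aggregated))
```

In this model the two paragraphs under `Introduction` collapse into one chunk when `return_each_element` is `False`, while every element stays separate when it is `True`.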

## Constructors

```python
__init__(
    self,
    headers_to_split_on: list[tuple[str, str]],
    return_each_element: bool = False,
) -> None
```

| Name | Type |
|------|------|
| `headers_to_split_on` | `list[tuple[str, str]]` |
| `return_each_element` | `bool` |


## Properties

- `headers_to_split_on`
- `header_mapping`
- `header_tags`
- `return_each_element`

## Methods

- [`split_text()`](https://reference.langchain.com/python/langchain-text-splitters/html/HTMLHeaderTextSplitter/split_text)
- [`split_text_from_url()`](https://reference.langchain.com/python/langchain-text-splitters/html/HTMLHeaderTextSplitter/split_text_from_url)
- [`split_text_from_file()`](https://reference.langchain.com/python/langchain-text-splitters/html/HTMLHeaderTextSplitter/split_text_from_file)

---

[View source on GitHub](https://github.com/langchain-ai/langchain/blob/02991cb4cf2063d51a07268edafb05fe53de1826/libs/text-splitters/langchain_text_splitters/html.py#L82)