# HTMLSemanticPreservingSplitter

> **Class** in `langchain_text_splitters`

📖 [View in docs](https://reference.langchain.com/python/langchain-text-splitters/html/HTMLSemanticPreservingSplitter)

Split HTML content preserving semantic structure.

Splits HTML content by headers into chunks that preserve semantic structure.
If a chunk exceeds the maximum chunk size, it is split further with
`RecursiveCharacterTextSplitter`.

The splitter preserves full HTML elements and can convert links to Markdown-like
links (`preserve_links`). It can also preserve images, videos, and audio elements
by converting them into Markdown format. Some chunks may still exceed the maximum
size so that semantic integrity is maintained.
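
The two-stage idea (group content by header, then split only the oversized sections) can be illustrated with a small standard-library sketch. This is not the library's implementation: `HeaderSectioner` and `split_by_headers` are hypothetical names, and naive fixed-width slicing stands in for `RecursiveCharacterTextSplitter`. Unlike the real splitter, the sketch never lets a chunk exceed the limit.

```python
from html.parser import HTMLParser


class HeaderSectioner(HTMLParser):
    """Group text under the most recent header tag (hypothetical helper)."""

    def __init__(self, header_tags):
        super().__init__()
        self.header_tags = set(header_tags)
        self.sections = []  # list of (header_text, body_text) tuples
        self._in_header = False
        self._header = ""
        self._body = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            # A new header closes the section collected so far.
            self._flush()
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in self.header_tags:
            self._in_header = False

    def handle_data(self, data):
        if self._in_header:
            self._header += data
        elif data.strip():
            self._body.append(data.strip())

    def _flush(self):
        if self._header or self._body:
            self.sections.append((self._header.strip(), " ".join(self._body)))
        self._header = ""
        self._body = []

    def close(self):
        super().close()
        self._flush()  # emit the final section


def split_by_headers(html, header_tags=("h1", "h2"), max_chunk_size=1000):
    """Split by headers first, then slice any section that is too long."""
    parser = HeaderSectioner(header_tags)
    parser.feed(html)
    parser.close()

    chunks = []
    for header, body in parser.sections:
        if len(body) <= max_chunk_size:
            chunks.append({"header": header, "text": body})
            continue
        # Assumption: fixed-width slices replace the recursive splitter here.
        for start in range(0, len(body), max_chunk_size):
            piece = body[start:start + max_chunk_size]
            chunks.append({"header": header, "text": piece})
    return chunks


html = "<h1>Intro</h1><p>Short text.</p><h2>Details</h2><p>" + "x" * 25 + "</p>"
chunks = split_by_headers(html, max_chunk_size=12)
for chunk in chunks:
    print(chunk)
```

The short "Intro" section stays whole, while the 25-character "Details" section is sliced into pieces that each carry the same header, mirroring how header metadata follows every chunk of a section.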

!!! version-added "Added in `langchain-text-splitters` 0.3.5"

## Signature

```python
HTMLSemanticPreservingSplitter(
    headers_to_split_on: list[tuple[str, str]],
    *,
    max_chunk_size: int = 1000,
    chunk_overlap: int = 0,
    separators: list[str] | None = None,
    elements_to_preserve: list[str] | None = None,
    preserve_links: bool = False,
    preserve_images: bool = False,
    preserve_videos: bool = False,
    preserve_audio: bool = False,
    custom_handlers: dict[str, Callable[[Tag], str]] | None = None,
    stopword_removal: bool = False,
    stopword_lang: str = 'english',
    normalize_text: bool = False,
    external_metadata: dict[str, str] | None = None,
    allowlist_tags: list[str] | None = None,
    denylist_tags: list[str] | None = None,
    preserve_parent_metadata: bool = False,
    keep_separator: bool | Literal['start', 'end'] = True,
)
```

## Description

**Example:**

```python
from langchain_text_splitters.html import HTMLSemanticPreservingSplitter

def custom_iframe_extractor(iframe_tag):
    """
    Custom handler function to extract the 'src' attribute from an <iframe> tag.
    Converts the iframe to a Markdown-like link: [iframe:<src>](src).

    Args:
        iframe_tag (bs4.element.Tag): The <iframe> tag to be processed.

    Returns:
        str: A formatted string representing the iframe in Markdown-like format.
    """
    iframe_src = iframe_tag.get('src', '')
    return f"[iframe:{iframe_src}]({iframe_src})"

text_splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
    max_chunk_size=500,
    preserve_links=True,
    preserve_images=True,
    custom_handlers={"iframe": custom_iframe_extractor}
)
```
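
A handler only has to accept a tag-like object and return a string, so it can be checked in isolation before it is wired into the splitter. The sketch below repeats the handler from the example and exercises it with a plain `dict`. The dict stand-in is an assumption made for illustration: it works only because the handler calls nothing but `.get()`, whereas the splitter itself passes `bs4.element.Tag` objects.

```python
def custom_iframe_extractor(iframe_tag):
    """Render an <iframe> as a Markdown-like link: [iframe:<src>](<src>)."""
    iframe_src = iframe_tag.get("src", "")
    return f"[iframe:{iframe_src}]({iframe_src})"


# A dict stands in for bs4.element.Tag (illustration only):
# both expose .get(key, default).
fake_iframe = {"src": "https://example.com/embed"}
rendered = custom_iframe_extractor(fake_iframe)
print(rendered)  # [iframe:https://example.com/embed](https://example.com/embed)

# A tag without a src attribute falls back to the empty-string default.
missing_src = custom_iframe_extractor({})
print(missing_src)  # [iframe:]()
```

Handlers that rely on other `Tag` features, such as `.find()` or `.get_text()`, would need a real parsed tag rather than a dict.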

## Parameters

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `headers_to_split_on` | `list[tuple[str, str]]` | Yes | HTML headers (e.g., `h1`, `h2`) that define content sections. |
| `max_chunk_size` | `int` | No | Maximum size of each chunk. Chunks may exceed this limit when that is needed to preserve semantic structure. (default: `1000`) |
| `chunk_overlap` | `int` | No | Number of characters to overlap between chunks to ensure contextual continuity. (default: `0`) |
| `separators` | `list[str] \| None` | No | Delimiters used by `RecursiveCharacterTextSplitter` for further splitting. (default: `None`) |
| `elements_to_preserve` | `list[str] \| None` | No | HTML tags (e.g., `table`, `ul`) to remain intact during splitting. (default: `None`) |
| `preserve_links` | `bool` | No | Converts `a` tags to Markdown links (`[text](url)`). (default: `False`) |
| `preserve_images` | `bool` | No | Converts `img` tags to Markdown images (`![alt](src)`). (default: `False`) |
| `preserve_videos` | `bool` | No | Converts `video` tags to Markdown video links (`![video](src)`). (default: `False`) |
| `preserve_audio` | `bool` | No | Converts `audio` tags to Markdown audio links (`![audio](src)`). (default: `False`) |
| `custom_handlers` | `dict[str, Callable[[Tag], str]] \| None` | No | Optional custom handlers for specific HTML tags, allowing tailored extraction or processing. (default: `None`) |
| `stopword_removal` | `bool` | No | Optionally remove stopwords from the text. (default: `False`) |
| `stopword_lang` | `str` | No | The language of stopwords to remove. (default: `'english'`) |
| `normalize_text` | `bool` | No | Optionally normalize text (e.g., lowercasing, removing punctuation). (default: `False`) |
| `external_metadata` | `dict[str, str] \| None` | No | Additional metadata to attach to the Document objects. (default: `None`) |
| `allowlist_tags` | `list[str] \| None` | No | Only these tags will be retained in the HTML. (default: `None`) |
| `denylist_tags` | `list[str] \| None` | No | These tags will be removed from the HTML. (default: `None`) |
| `preserve_parent_metadata` | `bool` | No | Whether to pass parent document metadata through to the split documents when calling `transform_documents()` or `atransform_documents()`. (default: `False`) |
| `keep_separator` | `bool \| Literal['start', 'end']` | No | Whether separators are kept in the chunks and where they are placed: at the beginning of a chunk (`'start'`), at the end (`'end'`), or not at all (`False`). (default: `True`) |
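
The Markdown forms listed for `preserve_links`, `preserve_images`, `preserve_videos`, and `preserve_audio` can be previewed with a standard-library sketch. It only illustrates the target formats from the table above and is not how the splitter is implemented; `MarkdownMediaConverter` and `to_markdown` are hypothetical names, and the sketch assumes that media tags carry their URL in a `src` attribute.

```python
from html.parser import HTMLParser


class MarkdownMediaConverter(HTMLParser):
    """Rewrite a, img, video and audio tags as Markdown (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._href = None  # set while inside an <a> element
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._href = attrs.get("href") or ""
            self._link_text = []
        elif tag == "img":
            alt = attrs.get("alt") or ""
            src = attrs.get("src") or ""
            self.parts.append(f"![{alt}]({src})")
        elif tag in ("video", "audio"):
            src = attrs.get("src") or ""
            self.parts.append(f"![{tag}]({src})")

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._link_text)
            self.parts.append(f"[{text}]({self._href})")
            self._href = None

    def handle_data(self, data):
        if self._href is not None:
            self._link_text.append(data)  # text inside the link
        else:
            self.parts.append(data)


def to_markdown(html):
    converter = MarkdownMediaConverter()
    converter.feed(html)
    converter.close()  # flush any trailing text
    return "".join(converter.parts)


html = (
    'See <a href="https://example.com">the docs</a> and '
    '<img src="logo.png" alt="Logo"> plus <video src="clip.mp4"></video>.'
)
markdown = to_markdown(html)
print(markdown)
```

Converting these elements to text before chunking keeps URLs and alt text in the output that would otherwise be lost when the markup is stripped.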

## Extends

- `BaseDocumentTransformer`

## Constructors

```python
__init__(
    self,
    headers_to_split_on: list[tuple[str, str]],
    *,
    max_chunk_size: int = 1000,
    chunk_overlap: int = 0,
    separators: list[str] | None = None,
    elements_to_preserve: list[str] | None = None,
    preserve_links: bool = False,
    preserve_images: bool = False,
    preserve_videos: bool = False,
    preserve_audio: bool = False,
    custom_handlers: dict[str, Callable[[Tag], str]] | None = None,
    stopword_removal: bool = False,
    stopword_lang: str = 'english',
    normalize_text: bool = False,
    external_metadata: dict[str, str] | None = None,
    allowlist_tags: list[str] | None = None,
    denylist_tags: list[str] | None = None,
    preserve_parent_metadata: bool = False,
    keep_separator: bool | Literal['start', 'end'] = True,
) -> None
```

| Name | Type |
|------|------|
| `headers_to_split_on` | `list[tuple[str, str]]` |
| `max_chunk_size` | `int` |
| `chunk_overlap` | `int` |
| `separators` | `list[str] \| None` |
| `elements_to_preserve` | `list[str] \| None` |
| `preserve_links` | `bool` |
| `preserve_images` | `bool` |
| `preserve_videos` | `bool` |
| `preserve_audio` | `bool` |
| `custom_handlers` | `dict[str, Callable[[Tag], str]] \| None` |
| `stopword_removal` | `bool` |
| `stopword_lang` | `str` |
| `normalize_text` | `bool` |
| `external_metadata` | `dict[str, str] \| None` |
| `allowlist_tags` | `list[str] \| None` |
| `denylist_tags` | `list[str] \| None` |
| `preserve_parent_metadata` | `bool` |
| `keep_separator` | `bool \| Literal['start', 'end']` |


## Methods

- [`split_text()`](https://reference.langchain.com/python/langchain-text-splitters/html/HTMLSemanticPreservingSplitter/split_text)
- [`transform_documents()`](https://reference.langchain.com/python/langchain-text-splitters/html/HTMLSemanticPreservingSplitter/transform_documents)

---

[View source on GitHub](https://github.com/langchain-ai/langchain/blob/9f232caa7a8fe1ca042a401942d5d90d54ceb1a6/libs/text-splitters/langchain_text_splitters/html.py#L561)