Class●Since v0.0

HTMLSectionSplitter

Splits HTML files based on specified tags and font sizes.

Requires the lxml package.

HTMLSectionSplitter(
  self,
  headers_to_split_on: list[tuple[str, str]],
  **kwargs: Any = {}
)

Parameters

Name	Type	Description
`headers_to_split_on`*	`list[tuple[str, str]]`	List of tuples pairing each header tag to track with an (arbitrary) key used in the metadata. Allowed header values: `h1`, `h2`, `h3`, `h4`, `h5`, `h6`, e.g.: `[("h1", "Header 1"), ("h2", "Header 2")]`.
`**kwargs`	`Any`	Default: `{}`. Additional optional arguments for customization.
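A minimal usage sketch of the constructor and `split_text`, assuming the class is imported from the `langchain_text_splitters` package and that `lxml` and `beautifulsoup4` are installed. The sample HTML and variable names are illustrative only, and the exact sections returned may differ between library versions, so no output is shown. The import is guarded so the snippet still runs when the dependencies are missing.

```python
# Illustrative HTML input (not from the library documentation).
html = """
<html><body>
  <h1>Guide</h1>
  <p>Overview paragraph.</p>
  <h2>Install</h2>
  <p>Installation notes.</p>
</body></html>
"""

# Each tuple maps a header tag to the metadata key used for that header.
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]

try:
    # Assumption: the splitter is exposed by langchain_text_splitters.
    from langchain_text_splitters import HTMLSectionSplitter

    splitter = HTMLSectionSplitter(headers_to_split_on=headers_to_split_on)
    docs = splitter.split_text(html)
except ImportError:
    docs = None  # the package or one of its parsers is not installed

if docs is not None:
    for doc in docs:
        # metadata holds the key from headers_to_split_on for the section's header
        print(doc.metadata, "->", doc.page_content)
```

Including `h1` in `headers_to_split_on` is a cautious choice here, since leading content may be attributed to a top-level header in some versions.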

Constructors

constructor

__init__

Name	Type
headers_to_split_on	list[tuple[str, str]]

Attributes

Methods

method

split_documents

Split documents.

method

split_text

Split HTML text string.

method

create_documents

Create a list of Document objects from a list of texts.

method

split_html_by_headers

Split an HTML document into sections based on specified header tags.

This method uses BeautifulSoup to parse the HTML content and divides it into sections based on headers defined in headers_to_split_on. Each section contains the header text, content under the header, and the tag name.
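The sectioning idea described above can be sketched with the standard library alone. This is not the library's implementation (which uses BeautifulSoup); it is a simplified illustration built on `html.parser`, and the names `HeaderSectionParser` and `split_by_headers` are hypothetical. Text that appears before the first tracked header is ignored in this sketch.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Collect text into sections that start at selected header tags."""

    def __init__(self, header_tags):
        super().__init__()
        self.header_tags = set(header_tags)
        self.sections = []       # finished sections
        self._current = None     # section currently being filled
        self._in_header = False  # True while inside a tracked header tag

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            self._flush()  # a new header closes the previous section
            self._current = {"header": [], "content": [], "tag_name": tag}
            self._in_header = True

    def handle_endtag(self, tag):
        if self._in_header and tag == self._current["tag_name"]:
            self._in_header = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current is None:
            return  # skip whitespace and text before the first header
        key = "header" if self._in_header else "content"
        self._current[key].append(text)

    def _flush(self):
        if self._current is not None:
            self.sections.append({
                "header": " ".join(self._current["header"]),
                "content": " ".join(self._current["content"]),
                "tag_name": self._current["tag_name"],
            })
            self._current = None

    def close(self):
        super().close()
        self._flush()  # emit the last open section


def split_by_headers(html, header_tags):
    parser = HeaderSectionParser(header_tags)
    parser.feed(html)
    parser.close()
    return parser.sections


sample = "<h1>Intro</h1><p>Welcome.</p><h2>Setup</h2><p>Install it.</p><p>Run it.</p>"
sections = split_by_headers(sample, ["h1", "h2"])
```

Each resulting dictionary carries the three pieces named above: the header text, the content under it, and the tag name.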

method

convert_possible_tags_to_header

Convert specific HTML tags to headers using an XSLT transformation.

This method uses an XSLT file to transform the HTML content, converting certain tags into headers for easier parsing. If no XSLT path is provided, the HTML content is returned unchanged.
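As a rough illustration of this kind of transformation, the sketch below uses `lxml.etree.XSLT` to promote a paragraph with a `title` class to an `h1`. The stylesheet, the function name `promote_to_headers`, and the choice to pass the stylesheet as a string rather than a file path are all assumptions made for the example; they do not reproduce the stylesheet or signature used by the library. When no stylesheet is given (or `lxml` is unavailable), the input is returned unchanged.

```python
from io import StringIO

try:
    from lxml import etree
except ImportError:
    etree = None  # lxml is optional for this sketch

# Hypothetical stylesheet: copy everything, but turn <p class="title"> into <h1>.
XSLT_TEXT = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="p[@class='title']">
    <h1><xsl:apply-templates select="node()"/></h1>
  </xsl:template>
</xsl:stylesheet>"""


def promote_to_headers(html, xslt_text=None):
    """Apply an XSLT stylesheet to HTML; return the input as-is without one."""
    if xslt_text is None or etree is None:
        return html
    tree = etree.parse(StringIO(html), etree.HTMLParser())
    transform = etree.XSLT(etree.fromstring(xslt_text))
    return str(transform(tree))


page = '<html><body><p class="title">Title</p><p>Body text.</p></body></html>'
unchanged = promote_to_headers(page)            # no stylesheet: same string back
converted = promote_to_headers(page, XSLT_TEXT)  # title paragraph becomes a header
```

Running the header promotion before splitting means documents that style their headings with ordinary tags can still be sectioned by header.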

method

split_text_from_file

Split HTML content from a file into a list of Document objects.
