Split HTML files based on specified tags and font sizes.
Requires the lxml package.
| Name | Type | Description |
|---|---|---|
| `headers_to_split_on`* | `list[tuple[str, str]]` | List of tuples of headers we want to track mapped to (arbitrary) keys for metadata. Allowed header values: |
| `**kwargs` | `Any` | Default: `{}`. Additional optional arguments for customization. |
Split documents.
Split HTML text string.
Create a list of Document objects from a list of texts.
Split an HTML document into sections based on specified header tags.
This method uses BeautifulSoup to parse the HTML content and divides it into
sections based on headers defined in headers_to_split_on. Each section
contains the header text, content under the header, and the tag name.
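The header-based sectioning described above can be sketched as follows. This is an illustration only, not the library's implementation: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages, and the names `HeaderSectionParser` and `split_by_headers` are hypothetical. It also assumes that text appearing before the first tracked header is dropped.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Collects one section per tracked header tag (hypothetical helper)."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag name -> arbitrary metadata key, e.g. {"h1": "Header 1"}.
        self.header_names = dict(headers_to_split_on)
        self.sections = []
        self._current = None      # section currently being filled
        self._open_header = None  # tag name while inside a tracked header

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            # A tracked header starts a new section.
            self._current = {"header": "", "content": "", "tag_name": tag}
            self.sections.append(self._current)
            self._open_header = tag

    def handle_endtag(self, tag):
        if tag == self._open_header:
            self._open_header = None

    def handle_data(self, data):
        if self._current is None:
            return  # assumption: ignore text before the first tracked header
        key = "header" if self._open_header else "content"
        self._current[key] += data


def split_by_headers(html, headers_to_split_on):
    """Return a list of dicts with header text, content, and tag name."""
    parser = HeaderSectionParser(headers_to_split_on)
    parser.feed(html)
    parser.close()
    return [
        {
            "header": section["header"].strip(),
            # Collapse runs of whitespace left over from the markup.
            "content": " ".join(section["content"].split()),
            "tag_name": section["tag_name"],
        }
        for section in parser.sections
    ]
```

Calling `split_by_headers(html, [("h1", "Header 1"), ("h2", "Header 2")])` yields one dictionary per `h1` or `h2` element, in document order. Tags not listed in `headers_to_split_on` contribute only their text to the enclosing section.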
Convert specific HTML tags to headers using an XSLT transformation.
This method uses an XSLT file to transform the HTML content, converting certain tags into headers for easier parsing. If no XSLT path is provided, the HTML content is returned unchanged.
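A minimal sketch of this kind of transformation with lxml is shown below. The stylesheet, the `p[@class='title']` rule, and the function name `promote_tags_to_headers` are all hypothetical; the sketch also passes the stylesheet inline as bytes rather than as a file path, which is an assumption made to keep the example self-contained.

```python
from typing import Optional

from lxml import etree

# Hypothetical stylesheet: copy every node unchanged, except that
# <p class="title"> elements are rewritten as <h1>.
XSLT_SOURCE = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="p[@class='title']">
    <h1><xsl:apply-templates select="node()"/></h1>
  </xsl:template>
</xsl:stylesheet>
"""


def promote_tags_to_headers(html: str, xslt: Optional[bytes] = XSLT_SOURCE) -> str:
    """Apply the stylesheet to the HTML, or return it unchanged if none is given."""
    if xslt is None:
        return html
    # etree.HTML parses lenient HTML and wraps it in <html><body> if needed.
    root = etree.HTML(html)
    transform = etree.XSLT(etree.XML(xslt))
    return str(transform(root))
```

For example, `promote_tags_to_headers('<p class="title">Intro</p><p>Body text</p>')` returns markup in which the first paragraph has become an `h1` element, while passing `xslt=None` returns the input string untouched. Normalizing ad hoc title markup into real header tags first means the later header-based split only needs to look at tag names.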
Split HTML content from a file into a list of Document objects.