Transform a list of Document objects by cleaning their HTML content.

| Name | Type | Description |
|---|---|---|
| documents* | Sequence[Document] | A sequence of Document objects to transform. |
| unwanted_tags | Union[List[str], Tuple[str, ...]] | A list of tags to be removed from the HTML. Default: ('script', 'style') |
| tags_to_extract | Union[List[str], Tuple[str, ...]] | A list of tags whose content will be extracted. Default: ('p', 'li', 'div', 'a') |
| remove_lines | bool | If set to True, unnecessary lines will be removed. Default: True |
| unwanted_classnames | Union[Tuple[str, ...], List[str]] | A list of class names to be removed from the HTML. Default: () |
| remove_comments | bool | If set to True, comments will be removed. Default: False |
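
To make the parameter semantics concrete, here is a minimal sketch that uses only the Python standard library. It is not the library's implementation: the `Document` dataclass, the function name `transform_documents`, the `_Cleaner` helper, and the exact whitespace handling behind `remove_lines` are all assumptions made for illustration. Only the parameter names and defaults come from the table.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Sequence, Tuple, Union

TagList = Union[List[str], Tuple[str, ...]]
# Elements that never have a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


@dataclass
class Document:
    """Hypothetical stand-in for the library's Document type."""
    page_content: str


class _Cleaner(HTMLParser):
    """Collects text inside wanted tags, skipping unwanted tags and classes."""

    def __init__(self, unwanted_tags, tags_to_extract,
                 unwanted_classnames, remove_comments):
        super().__init__()
        self.unwanted_tags = set(unwanted_tags)
        self.tags_to_extract = set(tags_to_extract)
        self.unwanted_classnames = set(unwanted_classnames)
        self.remove_comments = remove_comments
        self.stack = []   # open elements as (tag, skipped, extracted)
        self.chunks = []  # text fragments kept so far

    def _state(self):
        # (skipped, extracted) flags inherited from the innermost open element.
        return self.stack[-1][1:] if self.stack else (False, False)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        skipped, extracted = self._state()
        classes = set((dict(attrs).get("class") or "").split())
        skipped = (skipped or tag in self.unwanted_tags
                   or bool(classes & self.unwanted_classnames))
        extracted = extracted or tag in self.tags_to_extract
        self.stack.append((tag, skipped, extracted))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        skipped, extracted = self._state()
        if extracted and not skipped:
            self.chunks.append(data)

    def handle_comment(self, data):
        # Assumption: a comment that is not removed contributes its text.
        skipped, extracted = self._state()
        if not self.remove_comments and extracted and not skipped:
            self.chunks.append(data)


def transform_documents(
    documents: Sequence[Document],
    unwanted_tags: TagList = ("script", "style"),
    tags_to_extract: TagList = ("p", "li", "div", "a"),
    remove_lines: bool = True,
    unwanted_classnames: TagList = (),
    remove_comments: bool = False,
) -> List[Document]:
    cleaned = []
    for doc in documents:
        parser = _Cleaner(unwanted_tags, tags_to_extract,
                          unwanted_classnames, remove_comments)
        parser.feed(doc.page_content)
        parser.close()
        text = "\n".join(parser.chunks)
        if remove_lines:
            # Assumption: "unnecessary lines" means blank lines and padding.
            text = " ".join(s.strip() for s in text.splitlines() if s.strip())
        cleaned.append(Document(page_content=text))
    return cleaned


docs = [Document("<div><p>Hello</p><script>x()</script>"
                 "<p class='ad'>Buy</p><li>World</li></div>")]
cleaned = transform_documents(docs, unwanted_classnames=("ad",))
print(cleaned[0].page_content)  # Hello World
```

In this sketch, removal takes precedence over extraction: text inside an unwanted tag or class is dropped even when it sits inside a tag listed in `tags_to_extract`. Whether the real implementation resolves that conflict the same way is not stated in the table.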