Transform a list of Document objects by cleaning their HTML content.
transform_documents(
    self,
    documents: Sequence[Document],
    unwanted_tags: Union[List[str], Tuple[str, ...]] = ('script', 'style'),
    tags_to_extract: Union[List[str], Tuple[str, ...]] = ('p', 'li', 'div', 'a'),
    remove_lines: bool = True,
    *,
    unwanted_classnames: Union[Tuple[str, ...], List[str]] = (),
    remove_comments: bool = False,
    **kwargs: Any
) -> Sequence[Document]

| Name | Type | Description |
|---|---|---|
| documents* | Sequence[Document] | A sequence of `Document` objects containing HTML content. |
| unwanted_tags | Union[List[str], Tuple[str, ...]] | Default: `('script', 'style')`. A list of tags to be removed from the HTML. |
| tags_to_extract | Union[List[str], Tuple[str, ...]] | Default: `('p', 'li', 'div', 'a')`. A list of tags whose content will be extracted. |
| remove_lines | bool | Default: `True`. If set to `True`, unnecessary lines are removed. |
| unwanted_classnames | Union[Tuple[str, ...], List[str]] | Default: `()`. A list of class names to be removed from the HTML. Keyword-only. |
| remove_comments | bool | Default: `False`. If set to `True`, comments are removed. Keyword-only. |
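
To make the roles of `unwanted_tags`, `tags_to_extract`, and `remove_lines` concrete, here is a minimal standard-library sketch of the same idea. It is not the library's implementation: `clean_html` and `_TextExtractor` are hypothetical names introduced for illustration, the sketch operates on a plain HTML string rather than on `Document` objects, and it does not model `unwanted_classnames` or `remove_comments` (comments are always dropped here).

```python
from html.parser import HTMLParser
from typing import List, Sequence


class _TextExtractor(HTMLParser):
    """Hypothetical helper: keep text inside wanted tags, skip unwanted ones."""

    def __init__(self, unwanted_tags: Sequence[str], tags_to_extract: Sequence[str]) -> None:
        super().__init__()
        self.unwanted = set(unwanted_tags)
        self.wanted = set(tags_to_extract)
        self.skip_depth = 0  # > 0 while inside an unwanted tag
        self.keep_depth = 0  # > 0 while inside a tag to extract
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.unwanted:
            self.skip_depth += 1
        elif tag in self.wanted:
            self.keep_depth += 1

    def handle_endtag(self, tag):
        if tag in self.unwanted and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.wanted and self.keep_depth > 0:
            self.keep_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside a wanted tag and not inside an unwanted one.
        if self.skip_depth == 0 and self.keep_depth > 0:
            self.chunks.append(data)


def clean_html(
    html: str,
    unwanted_tags: Sequence[str] = ("script", "style"),
    tags_to_extract: Sequence[str] = ("p", "li", "div", "a"),
    remove_lines: bool = True,
) -> str:
    """Simplified stand-in for the cleaning step (assumption, not the real API)."""
    parser = _TextExtractor(unwanted_tags, tags_to_extract)
    parser.feed(html)
    parser.close()  # flush any buffered text
    text = " ".join(parser.chunks)
    if remove_lines:
        # Collapse newlines and runs of whitespace into single spaces.
        text = " ".join(text.split())
    return text


sample_html = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<script>alert(1)</script>"
    '<p>Hello,   <a href="#">world</a></p>\n'
    "<h1>Title</h1><li>item</li></body></html>"
)

print(clean_html(sample_html))  # → Hello, world item
```

In this sketch the `<h1>` text is dropped because `h1` is not listed in `tags_to_extract`, and the `<script>` and `<style>` bodies are dropped because both tags are listed in `unwanted_tags`. Passing `tags_to_extract=("li",)` would keep only `item`.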