Load GitBook data.

| Name | Type | Description |
|---|---|---|
| web_page* | str | The web page to load or the starting point from where relative paths are discovered. |
| load_all_paths | bool | If set to True, all relative paths in the navbar are loaded instead of only the starting page. When load_all_paths=True, the loader parses XML sitemaps and requires the lxml package to be installed (pip install lxml). Default: False |
| base_url | Optional[str] | If load_all_paths is True, the relative paths are appended to this base URL. Defaults to web_page. Default: None |
| content_selector | str | The CSS selector for the content to load. Default: 'main' |
| continue_on_failure | bool | Whether to continue loading the sitemap if an error occurs while loading a URL, emitting a warning instead of raising an exception. Setting this to True makes the loader more robust, but may also result in missing data. Default: False |
| show_progress | bool | Whether to show a progress bar while loading. Default: True |
| sitemap_url | Optional[str] | Custom sitemap URL to use when load_all_paths is True. Defaults to "{base_url}/sitemap.xml". Default: None |
| allowed_domains | Optional[Set[str]] | Optional set of allowed domains to fetch from. If None (default), the loader restricts crawling to the domain of the web_page URL to prevent potential SSRF vulnerabilities. Provide an explicit set (e.g., {"example.com", "docs.example.com"}) to allow crawling across multiple domains. Use with caution in server environments where users might control the input URLs. Default: None |
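
The documented fallbacks for base_url, sitemap_url, and allowed_domains can be illustrated with a small standard-library sketch. This is not the loader's actual implementation: `resolve_defaults` and `is_allowed` are hypothetical helper names, and stripping a trailing slash and comparing on the URL's network location (host plus optional port) are assumptions made for the example.

```python
from typing import Optional, Set
from urllib.parse import urlparse


def resolve_defaults(
    web_page: str,
    base_url: Optional[str] = None,
    sitemap_url: Optional[str] = None,
    allowed_domains: Optional[Set[str]] = None,
) -> dict:
    """Hypothetical helper mirroring the documented defaults (not library code)."""
    # base_url falls back to web_page when it is not given.
    # Assumption: drop a trailing slash so the sitemap path has a single "/".
    base = (base_url or web_page).rstrip("/")
    # sitemap_url falls back to "{base_url}/sitemap.xml".
    sitemap = sitemap_url or f"{base}/sitemap.xml"
    # allowed_domains falls back to the domain of web_page only.
    if allowed_domains is None:
        allowed_domains = {urlparse(web_page).netloc}
    return {
        "base_url": base,
        "sitemap_url": sitemap,
        "allowed_domains": allowed_domains,
    }


def is_allowed(url: str, allowed_domains: Set[str]) -> bool:
    """Return True if the URL's domain is in the allowed set."""
    return urlparse(url).netloc in allowed_domains


if __name__ == "__main__":
    cfg = resolve_defaults("https://docs.example.com")
    print(cfg["sitemap_url"])
    print(is_allowed("https://docs.example.com/intro", cfg["allowed_domains"]))
    print(is_allowed("https://other.example.net/intro", cfg["allowed_domains"]))
```

Restricting fetches to the starting page's own domain by default means a sitemap that lists URLs on other hosts cannot steer the loader toward them; widening the set is an explicit opt-in.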