WebBaseLoader document loader integration
Load a sitemap and its URLs.
Security Note: This loader can be used to load all URLs specified in a sitemap. If a malicious actor gets access to the sitemap, they could force the server to load URLs from other domains by modifying the sitemap. This could lead to server-side request forgery (SSRF) attacks; e.g., with the attacker forcing the server to load URLs from internal service endpoints that are not publicly accessible. While the attacker may not immediately gain access to this data, this data could leak into downstream systems (e.g., data loader is used to load data for indexing).
This loader is a crawler and web crawlers should generally NOT be deployed
with network access to any internal servers.
Control access to who can submit crawling requests and what network access
the crawler has.
By default, the loader will only load URLs from the same domain as the sitemap
if the site map is not a local file. This can be disabled by setting
restrict_to_same_domain to False (not recommended).
If the site map is a local file, no such risk mitigation is applied by default.
Use the filter URLs argument to limit which URLs can be loaded.
See https://python.langchain.com/docs/security