Fetch all URLs concurrently with rate limiting.
Fetch all URLs, then return soups for all results.
Asynchronously fetch all URLs, then return soups for all results.
Scrape data from a webpage and return it in BeautifulSoup format.
Asynchronously loads data into Document objects.
| Name | Type | Description |
|---|---|---|
| web_path* | str | URL of the sitemap. Can also be a local path. |
| filter_urls | Optional[List[str]] | Default: None. A list of regexes. If specified, only URLs that match one of the filter patterns will be loaded. WARNING: the filter URLs are interpreted as regular expressions. Remember to escape special characters if you do not want them to be interpreted as regular expression syntax. |
| parsing_function | Optional[Callable] | Default: None |
| blocksize | Optional[int] | Default: None |
| blocknum | int | Default: 0 |
| meta_function | Optional[Callable] | Default: None |
| is_local | bool | Default: False |
| continue_on_failure | bool | Default: False |
| restrict_to_same_domain | bool | Default: True |
| max_depth | int | Default: 10 |
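Because filter_urls entries are interpreted as regular expressions, a URL that should be matched literally needs its metacharacters escaped first. A minimal sketch of that escaping (the commented-out SitemapLoader import path and keyword names are assumptions based on the parameter table above):

```python
import re

# A URL we want to match literally: "?" and "." are regex
# metacharacters, so the raw string would match more than intended.
literal_url = "https://example.com/page?id=1"
pattern = re.escape(literal_url)

# The escaped pattern matches only the literal URL.
print(bool(re.fullmatch(pattern, literal_url)))                      # True
print(bool(re.fullmatch(pattern, "https://example.com/pageXid=1")))  # False

# Hypothetical loader usage (not executed here; the import path is an assumption):
# from langchain_community.document_loaders import SitemapLoader
# loader = SitemapLoader(
#     web_path="https://example.com/sitemap.xml",
#     filter_urls=[pattern],
# )
```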
Load a sitemap and its URLs.
Security Note: This loader can be used to load all URLs specified in a sitemap. If a malicious actor gets access to the sitemap, they could force the server to load URLs from other domains by modifying the sitemap. This could lead to server-side request forgery (SSRF) attacks; e.g., with the attacker forcing the server to load URLs from internal service endpoints that are not publicly accessible. While the attacker may not immediately gain access to this data, this data could leak into downstream systems (e.g., data loader is used to load data for indexing).
This loader is a crawler and web crawlers should generally NOT be deployed
with network access to any internal servers.
Control access to who can submit crawling requests and what network access
the crawler has.
By default, the loader will only load URLs from the same domain as the sitemap
if the sitemap is not a local file. This can be disabled by setting
restrict_to_same_domain to False (not recommended).
If the sitemap is a local file, no such risk mitigation is applied by default.
Use the filter URLs argument to limit which URLs can be loaded.
See https://python.langchain.com/docs/security
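The same-domain restriction described above amounts to comparing each candidate URL's network location with the sitemap's. A sketch of that check (an illustration of the behavior, not the loader's actual internals):

```python
from urllib.parse import urlparse

def is_same_domain(url: str, sitemap_url: str) -> bool:
    # Compare network locations (host[:port]); scheme and path are ignored.
    return urlparse(url).netloc == urlparse(sitemap_url).netloc

print(is_same_domain("https://example.com/a", "https://example.com/sitemap.xml"))    # True
print(is_same_domain("https://evil.internal/x", "https://example.com/sitemap.xml"))  # False
```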
Function to parse bs4.Soup output. Default: None.
Number of sitemap locations per block. Default: None.
The number of the block that should be loaded, zero-indexed. Default: 0.
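The blocksize/blocknum pair lets separate workers load disjoint slices of a large sitemap. A sketch of the likely partitioning semantics (an illustration under that assumption, not the loader's actual code):

```python
def select_block(locations: list, blocksize: int, blocknum: int) -> list:
    # Split the sitemap locations into chunks of `blocksize` and
    # keep only chunk number `blocknum` (zero-indexed).
    start = blocknum * blocksize
    return locations[start:start + blocksize]

urls = [f"https://example.com/p{i}" for i in range(7)]
print(select_block(urls, 3, 0))  # first three URLs
print(select_block(urls, 3, 2))  # the final, shorter block
```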
Function to parse bs4.Soup output for metadata. When setting this function, remember to also copy metadata["loc"] to metadata["source"] if you are using this field. Default: None.
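A custom meta_function needs to preserve the source field itself, as noted above. A minimal sketch, assuming the function receives the sitemap entry as a dict plus the parsed page content (the exact signature is an assumption):

```python
def meta_function(meta: dict, _content) -> dict:
    # Keep everything from the sitemap entry, and copy "loc" into
    # "source" so downstream code reading metadata["source"] keeps working.
    return {**meta, "source": meta.get("loc", "")}

print(meta_function({"loc": "https://example.com/a", "lastmod": "2024-01-01"}, None))
```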
Whether the sitemap is a local file. Default: False.
Whether to continue loading the sitemap if an error occurs while loading a URL, emitting a warning instead of raising an exception. Setting this to True makes the loader more robust, but may also result in missing data. Default: False.
Whether to restrict loading to URLs from the same domain as the sitemap. Attention: this is only applied if the sitemap is not a local file! Default: True.
Maximum depth to follow sitemap links. Default: 10.