| Name | Type | Description |
|---|---|---|
| url* | str | The URL to crawl. |
| max_depth | Optional[int] | Default: 2. The max depth of the recursive loading. |
| use_async | Optional[bool] | Default: None |
| extractor | Optional[Callable[[str], str]] | Default: None |
| metadata_extractor | Optional[_MetadataExtractorType] | Default: None |
| exclude_dirs | Optional[Sequence[str]] | Default: () |
| timeout | Optional[int] | Default: 10 |
| prevent_outside | bool | Default: True |
| link_regex | Union[str, re.Pattern, None] | Default: None |
| headers | Optional[dict] | Default: None |
| check_response_status | bool | Default: False |
| continue_on_failure | bool | Default: True |
| base_url | Optional[str] | Default: None |
| autoset_encoding | bool | Default: True |
| encoding | Optional[str] | Default: None |
| proxies | Optional[dict] | Default: None |
| ssl | bool | Default: True |
Recursively load all child links from a root URL.
Security Note: This loader is a crawler that will start crawling at a given URL and then expand to crawl child links recursively.
Web crawlers should generally NOT be deployed with network access
to any internal servers.
Control who can submit crawling requests and what network access
the crawler has.
While crawling, the crawler may encounter malicious URLs that would lead to a
server-side request forgery (SSRF) attack.
To mitigate risks, by default the crawler will only load URLs from the same
domain as the start URL (controlled via the prevent_outside named argument).
This will mitigate the risk of SSRF attacks, but will not eliminate it.
For example, if crawling a host that serves several sites:
https://some_host/alice_site/
https://some_host/bob_site/
A malicious URL on Alice's site could cause the crawler to make a malicious
GET request to an endpoint on Bob's site. Both sites are hosted on the
same host, so such a request would not be prevented by default.
See https://python.langchain.com/docs/security/ for more information.
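As a minimal sketch of the mitigation described above, you can keep the crawl confined to a single site on a shared host by leaving prevent_outside enabled and pointing base_url at that site. The host and paths below are the placeholders from the example, not real sites:
.. code-block:: python
from langchain_community.document_loaders import RecursiveUrlLoader

# Confine the crawl to Alice's site on the shared host.
loader = RecursiveUrlLoader(
    "https://some_host/alice_site/",
    prevent_outside=True,                      # default: refuse links outside base_url
    base_url="https://some_host/alice_site/",  # treat Alice's site as the boundary
)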
Setup:
This class has no required additional dependencies. You can optionally install
``beautifulsoup4`` for richer default metadata extraction:
.. code-block:: bash
pip install -U beautifulsoup4
Instantiate:
.. code-block:: python
from langchain_community.document_loaders import RecursiveUrlLoader
loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    # max_depth=2,
    # use_async=False,
    # extractor=None,
    # metadata_extractor=None,
    # exclude_dirs=(),
    # timeout=10,
    # check_response_status=True,
    # continue_on_failure=True,
    # prevent_outside=True,
    # base_url=None,
    # ...
)
Lazy load:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()

# async variant:
# docs_lazy = await loader.alazy_load()

for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
.. code-block:: python
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" /><
{'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.19 Documentation', 'language': None}
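Because lazy_load() yields one Document at a time, you can also process documents in batches as you go rather than accumulating everything in memory. A minimal sketch, where index_batch is a hypothetical placeholder for whatever you do with each batch:
.. code-block:: python
batch = []
for doc in loader.lazy_load():
    batch.append(doc)
    if len(batch) >= 10:
        index_batch(batch)  # hypothetical helper, e.g. write the batch to a vector store
        batch = []
if batch:
    index_batch(batch)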
Async load:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
.. code-block:: python
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" /><
{'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.19 Documentation', 'language': None}
Content parsing / extraction:
By default the loader sets the raw HTML from each link as the Document page
content. To parse this HTML into a more human/LLM-friendly format you can pass
in a custom extractor method:
.. code-block:: python
# This example uses `beautifulsoup4` and `lxml`
import re
from bs4 import BeautifulSoup
def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    extractor=bs4_extractor,
)
print(loader.load()[0].page_content[:200])
.. code-block:: python
3.9.19 Documentation
Download
Download these documents
Docs by version
Python 3.13 (in development)
Python 3.12 (stable)
Python 3.11 (security-fixes)
Python 3.10 (security-fixes)
Python 3.9 (securit
Metadata extraction:
Similarly to content extraction, you can specify a metadata extraction function to customize how Document metadata is extracted from the HTTP response.
.. code-block:: python
import aiohttp
import requests
from typing import Union
def simple_metadata_extractor(
    raw_html: str, url: str, response: Union[requests.Response, aiohttp.ClientResponse]
) -> dict:
    content_type = getattr(response, "headers").get("Content-Type", "")
    return {"source": url, "content_type": content_type}

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    metadata_extractor=simple_metadata_extractor,
)
loader.load()[0].metadata
.. code-block:: python
{'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html'}
Filtering URLs:
You may not always want to pull every URL from a website. There are four parameters
that allow us to control what URLs we pull recursively. First, we can set the
prevent_outside parameter to prevent URLs outside of the base_url from
being pulled. Note that the base_url does not need to be the same as the URL we
pass in, as shown below. We can also use link_regex and exclude_dirs to be
more specific with the URLs that we select. In this example, we only pull pages
from the Python docs whose URLs contain the string "index" and that are not
located in the FAQ section of the site.
.. code-block:: python
loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    prevent_outside=True,
    base_url="https://docs.python.org",
    link_regex=r'<a\s+(?:[^>]*?\s+)?href="([^"]*(?=index)[^"]*)"',
    exclude_dirs=['https://docs.python.org/3.9/faq'],
)
docs = loader.load()
[doc.metadata["source"] for doc in docs]
.. code-block:: python
['https://docs.python.org/3.9/',
'https://docs.python.org/3.9/py-modindex.html',
'https://docs.python.org/3.9/genindex.html',
'https://docs.python.org/3.9/tutorial/index.html',
'https://docs.python.org/3.9/using/index.html',
'https://docs.python.org/3.9/extending/index.html',
'https://docs.python.org/3.9/installing/index.html',
'https://docs.python.org/3.9/library/index.html',
'https://docs.python.org/3.9/c-api/index.html',
'https://docs.python.org/3.9/howto/index.html',
'https://docs.python.org/3.9/distributing/index.html',
'https://docs.python.org/3.9/reference/index.html',
'https://docs.python.org/3.9/whatsnew/index.html']
use_async: Whether to use asynchronous loading. If True, lazy_load() will not actually be lazy, but it will still work as expected.
extractor: A function to extract document contents from raw HTML. If the extractor returns an empty string, the document is ignored. Defaults to returning the raw HTML.
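The empty-string behavior can be used to drop pages you don't want to keep. A minimal sketch; the "404 Not Found" check is just an illustrative heuristic, not part of the library:
.. code-block:: python
def skip_error_pages(html: str) -> str:
    # Returning "" tells the loader to ignore this document entirely.
    if "404 Not Found" in html:
        return ""
    return html

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    extractor=skip_error_pages,
)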
metadata_extractor: A function to extract metadata from the following arguments, in that order: the raw HTML, the source URL, and the requests.Response / aiohttp.ClientResponse object. The default extractor will attempt to use BeautifulSoup4 to extract the title, description, and language of the page. For example:
.. code-block:: python
from typing import Union

import requests
import aiohttp

def simple_metadata_extractor(
    raw_html: str, url: str, response: Union[requests.Response, aiohttp.ClientResponse]
) -> dict:
    content_type = getattr(response, "headers").get("Content-Type", "")
    return {"source": url, "content_type": content_type}
exclude_dirs: A list of subdirectories to exclude.
timeout: The timeout for the requests, in seconds. If None, the connection will not time out.
prevent_outside: If True, prevent loading from URLs that are not children of the root URL.
link_regex: Regex for extracting sub-links from the raw HTML of a web page.
headers: Default request headers to use for all requests.
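Some sites reject requests that lack a browser-like User-Agent, so a common use of headers is to set one. A hedged sketch; the header value is only an illustration:
.. code-block:: python
loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    # Example header; adjust to whatever your target site expects.
    headers={"User-Agent": "Mozilla/5.0 (compatible; MyCrawler/1.0)"},
)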
check_response_status: If True, check the HTTP response status and skip URLs with error responses (400-599).
continue_on_failure: If True, continue if getting or parsing a link raises an exception. Otherwise, raise the exception.
base_url: The base URL to check outside links against.
autoset_encoding: Whether to automatically set the encoding of the response. If True, the encoding of the response will be set to the apparent encoding, unless the encoding argument has already been explicitly set.
encoding: The encoding of the response. If set manually, the encoding is set to the given value regardless of the autoset_encoding argument.
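As a hedged illustration, if a site's encoding is mis-detected you can pin it explicitly; the value here is just an example:
.. code-block:: python
loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    # An explicitly set encoding takes precedence over autoset_encoding.
    encoding="utf-8",
)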
proxies: A dictionary mapping protocol names to the proxy URLs to be used for requests. This allows the crawler to route its requests through specified proxy servers. If None, no proxies will be used and requests will go directly to the target URL. Example usage:
.. code-block:: python
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "https://10.10.1.10:1080",
}
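To apply these, pass the dictionary via the proxies argument (the proxy addresses above are placeholders):
.. code-block:: python
loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    proxies=proxies,
)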
ssl: Whether to verify SSL certificates during requests.
By default, SSL certificate verification is enabled (ssl=True),
ensuring secure HTTPS connections. Setting this to False disables SSL
certificate verification, which can be useful when crawling internal
services, development environments, or sites with misconfigured or
self-signed certificates.
Use with caution: Disabling SSL verification exposes your crawler to man-in-the-middle (MitM) attacks, data tampering, and potential interception of sensitive information. This significantly compromises the security and integrity of the communication. It should never be used in production or when handling sensitive data.
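If you do need to crawl, say, an internal staging server with a self-signed certificate, the flag is passed like any other keyword argument. The URL below is hypothetical, and this should never be done against production or sensitive endpoints:
.. code-block:: python
loader = RecursiveUrlLoader(
    "https://internal-staging.example/",  # hypothetical self-signed host
    ssl=False,  # disables certificate verification; use with caution
)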