| Name | Type | Description |
|---|---|---|
| web_paths | Sequence[str] | Default: (). Web paths to load from. |
| requests_per_second | int | Default: 2. Max number of concurrent requests to make. |
| default_parser | str | Default: 'html.parser' |
| requests_kwargs | Optional[Dict[str, Any]] | Default: None |
| raise_for_status | bool | Default: False |
| bs_get_text_kwargs | Optional[Dict[str, Any]] | Default: None |
| bs_kwargs | Optional[Dict[str, Any]] | Default: None |
| show_progress | bool | Default: True |
| trust_env | bool | Default: False |

| Name | Type |
|---|---|
| web_path | Union[str, Sequence[str]] |
| header_template | Optional[dict] |
| verify_ssl | bool |
| proxies | Optional[dict] |
| continue_on_failure | bool |
| autoset_encoding | bool |
| encoding | Optional[str] |
| web_paths | Sequence[str] |
| requests_per_second | int |
| default_parser | str |
| requests_kwargs | Optional[Dict[str, Any]] |
| raise_for_status | bool |
| bs_get_text_kwargs | Optional[Dict[str, Any]] |
| bs_kwargs | Optional[Dict[str, Any]] |
| session | Any |
| show_progress | bool |
| trust_env | bool |
WebBaseLoader document loader integration

Setup:
    Install ``langchain_community``.

    .. code-block:: bash

        pip install -U langchain_community

Instantiate:
    .. code-block:: python

        from langchain_community.document_loaders import WebBaseLoader

        loader = WebBaseLoader(
            web_path="https://www.espn.com/",
            # header_template=None,
            # verify_ssl=True,
            # proxies=None,
            # continue_on_failure=False,
            # autoset_encoding=True,
            # encoding=None,
            # web_paths=(),
            # requests_per_second=2,
            # default_parser="html.parser",
            # requests_kwargs=None,
            # raise_for_status=False,
            # bs_get_text_kwargs=None,
            # bs_kwargs=None,
            # session=None,
            # show_progress=True,
            # trust_env=False,
        )

Lazy load:
    .. code-block:: python

        docs = []
        for doc in loader.lazy_load():
            docs.append(doc)
        print(docs[0].page_content[:100])
        print(docs[0].metadata)

    .. code-block:: python

        ESPN - Serving Sports Fans. Anytime. Anywhere.

        {'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}

Async load:
    .. code-block:: python

        docs = []
        async for doc in loader.alazy_load():
            docs.append(doc)
        print(docs[0].page_content[:100])
        print(docs[0].metadata)

    .. code-block:: python

        ESPN - Serving Sports Fans. Anytime. Anywhere.

        {'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}

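
The ``async for`` loop above only runs as written where top-level ``await`` is allowed, such as a notebook. In a plain script the loop has to live inside a coroutine. The following is a minimal sketch of that pattern using only the standard library. ``StubLoader`` is a hypothetical stand-in that yields plain strings rather than ``Document`` objects; it is not part of ``langchain_community``, and ``collect`` is an illustrative helper name.

```python
import asyncio
from typing import AsyncIterator, List


class StubLoader:
    """Hypothetical stand-in for a loader that exposes ``alazy_load``."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages

    async def alazy_load(self) -> AsyncIterator[str]:
        for page in self.pages:
            # Yield control to the event loop, as real network I/O would.
            await asyncio.sleep(0)
            yield page


async def collect(loader: StubLoader) -> List[str]:
    """Drain the async generator into a list inside a coroutine."""
    docs: List[str] = []
    async for doc in loader.alazy_load():
        docs.append(doc)
    return docs


# asyncio.run creates an event loop, runs the coroutine, and closes the loop.
docs = asyncio.run(collect(StubLoader(["page one", "page two"])))
print(docs)
```

Assuming a real loader exposes ``alazy_load`` as shown in the example above, the same ``collect`` pattern should apply with the stub swapped out.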
.. versionchanged:: 0.3.14

    Deprecated ``aload`` (which was not async) and implemented a native async
    ``alazy_load``. Expand below for more details.

    .. dropdown:: How to update ``aload``

        Instead of using ``aload``, you can use ``load`` for synchronous loading or
        ``alazy_load`` for asynchronous lazy loading.

        Example using ``load`` (synchronous):

        .. code-block:: python

            docs: List[Document] = loader.load()

        Example using ``alazy_load`` (asynchronous):

        .. code-block:: python

            docs: List[Document] = []
            async for doc in loader.alazy_load():
                docs.append(doc)

        This is in preparation for accommodating an asynchronous ``aload`` in the
        future:

        .. code-block:: python

            docs: List[Document] = await loader.aload()

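- ``default_parser``: Default parser to use for BeautifulSoup.
- ``requests_kwargs``: Keyword arguments for ``requests``.
- ``raise_for_status``: Raise an exception if the HTTP status code denotes an error.
- ``bs_get_text_kwargs``: Keyword arguments for the beautifulsoup4 ``get_text`` call.
- ``bs_kwargs``: Keyword arguments for beautifulsoup4 web page parsing.
- ``show_progress``: Show a progress bar when loading pages.
- ``trust_env``: Set to True if using a proxy to make web requests, for example via the ``http(s)_proxy`` environment variables. Defaults to False.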
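
The parameter table describes ``requests_per_second`` as the maximum number of concurrent requests. One common way to enforce such a cap is an ``asyncio.Semaphore``. The sketch below illustrates that general technique with the standard library only and is not presented as the loader's actual implementation. ``fetch``, ``fetch_all``, ``max_concurrent``, and the ``example.com`` URLs are all hypothetical, and the HTTP request is replaced by a short sleep.

```python
import asyncio
from typing import Dict, List, Tuple


async def fetch(url: str, limiter: asyncio.Semaphore, state: Dict[str, int]) -> str:
    """Pretend to fetch ``url`` while holding one slot of the semaphore."""
    async with limiter:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # placeholder for the real HTTP request
        state["active"] -= 1
        return f"<html>{url}</html>"


async def fetch_all(urls: List[str], max_concurrent: int = 2) -> Tuple[List[str], int]:
    """Fetch all URLs with at most ``max_concurrent`` requests in flight."""
    limiter = asyncio.Semaphore(max_concurrent)
    state = {"active": 0, "peak": 0}
    # gather returns results in the same order as the input URLs.
    pages = await asyncio.gather(*(fetch(u, limiter, state) for u in urls))
    return list(pages), state["peak"]


urls = [f"https://example.com/{i}" for i in range(6)]
pages, peak = asyncio.run(fetch_all(urls, max_concurrent=2))
print(len(pages), peak)
```

The ``peak`` counter exists only to make the cap observable: it records the largest number of simulated requests that were in flight at once, which should never exceed ``max_concurrent``.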