Document Loaders are classes to load Documents.
Document Loaders are typically used to load many Documents in a single run.
Class hierarchy:
.. code-block::
BaseLoader --> <name>Loader # Examples: TextLoader, UnstructuredFileLoader
Main helpers:
.. code-block::
Document, <name>TextSplitter
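For orientation, a minimal usage sketch with TextLoader (the file path is illustrative, not part of the original docstrings):
.. code-block:: python
from langchain_community.document_loaders import TextLoader

# Load a plain-text file into a list of Document objects.
loader = TextLoader("example.txt", encoding="utf-8")
docs = loader.load()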
Load acreom vault from a directory.
Load with an Airbyte source connector implemented using the CDK.
Load from Gong using an Airbyte source connector.
Load from Hubspot using an Airbyte source connector.
Load from Salesforce using an Airbyte source connector.
Load from Shopify using an Airbyte source connector.
Load from Stripe using an Airbyte source connector.
Load from Typeform using an Airbyte source connector.
Load from Zendesk Support using an Airbyte source connector.
Load local Airbyte JSON files.
Load Airtable tables.
Load records from an ArcGIS FeatureLayer.
Load a query result from Arxiv.
The loader converts the original PDF format into plain text.
Load AssemblyAI audio transcripts.
It uses the AssemblyAI API to get an existing transcription and loads the transcribed text into one or more Documents, depending on the specified format.
Load AssemblyAI audio transcripts.
It uses the AssemblyAI API to transcribe audio files and loads the transcribed text into one or more Documents, depending on the specified format.
To use, you should have the assemblyai python package installed, and the
environment variable ASSEMBLYAI_API_KEY set with your API key.
Alternatively, the API key can also be passed as an argument.
Audio files can be specified via a URL or a local file path.
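A minimal sketch, assuming the assemblyai package is installed and ASSEMBLYAI_API_KEY is set; the audio URL is illustrative:
.. code-block:: python
from langchain_community.document_loaders import AssemblyAIAudioTranscriptLoader

# Transcribes the audio and returns the transcript as Document(s).
loader = AssemblyAIAudioTranscriptLoader("https://example.org/audio.mp3")
docs = loader.load()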
Load HTML asynchronously.
Load documents from AWS Athena.
Each document represents one row of the result.
By default, all columns are written into the page_content of the document
and none into the metadata. If metadata_columns are provided, those columns
are written into the metadata of the document while the rest of the columns
are written into the page_content of the document.
To authenticate, the AWS client uses the following method to automatically load credentials: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
If a specific credential profile should be used, you must pass the name of the profile from the ~/.aws/credentials file that is to be used.
Make sure the credentials / roles used have the required policies to access the Amazon Athena service.
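For illustration, a minimal AthenaLoader sketch; the query, database, S3 output URI, and profile name are placeholders:
.. code-block:: python
from langchain_community.document_loaders.athena import AthenaLoader

loader = AthenaLoader(
    query="SELECT * FROM my_table LIMIT 10;",  # illustrative query
    database="my_database",
    s3_output_uri="s3://my-bucket/athena-results/",
    profile_name="default",
)
docs = loader.load()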
Load AZLyrics webpages.
Load from Azure AI Data.
Load a bibtex file.
Each document represents one entry from the bibtex file.
If a PDF file is present in the file bibtex field, the original PDF
is loaded into the document text. If no such file entry is present,
the abstract field is used instead.
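A minimal sketch (the .bib path is illustrative):
.. code-block:: python
from langchain_community.document_loaders import BibtexLoader

# Each entry in the .bib file becomes one Document.
loader = BibtexLoader("./example_data/references.bib")
docs = loader.load()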
Load transcripts from BiliBili videos.
Load a Blackboard course.
This loader is not compatible with all Blackboard courses. It is only compatible with courses that use the new Blackboard interface. To use this loader, you must have the BbRouter cookie. You can get this cookie by logging into the course and then copying the value of the BbRouter cookie from the browser's developer tools.
Load blobs from a cloud URL or a file:// path.
Example:
.. code-block:: python
from langchain_community.document_loaders.blob_loaders import CloudBlobLoader

loader = CloudBlobLoader("s3://mybucket/id")
for blob in loader.yield_blobs():
    print(blob)
Load blobs in the local file system.
Example:
.. code-block:: python
from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader
loader = FileSystemBlobLoader("/path/to/directory")
for blob in loader.yield_blobs():
    print(blob)
Load YouTube URLs as audio file(s).
Load elements from a blockchain smart contract.
See supported blockchains here: https://python.langchain.com/v0.2/api_reference/community/document_loaders/langchain_community.document_loaders.blockchain.BlockchainType.html
If no BlockchainType is specified, the default is Ethereum mainnet.
The loader uses the Alchemy API to interact with the blockchain. The ALCHEMY_API_KEY environment variable must be set to use this loader.
The API returns 100 NFTs per request and can be paginated using the startToken parameter.
If get_all_tokens is set to True, the loader will fetch all tokens on the contract. Note that for contracts with a large number of tokens, this may take a long time (e.g., 10k tokens requires 100 requests). For this reason, the default value is False.
The max_execution_time (sec) can be set to limit the execution time of the loader.
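A minimal sketch, assuming ALCHEMY_API_KEY is set; the contract address is illustrative:
.. code-block:: python
from langchain_community.document_loaders.blockchain import (
    BlockchainDocumentLoader,
    BlockchainType,
)

# Fetch NFT metadata for a contract on Ethereum mainnet (the default chain).
loader = BlockchainDocumentLoader(
    contract_address="0x1a92f7381b9f03921564a437210bb9396471050c",
    blockchainType=BlockchainType.ETH_MAINNET,
)
docs = loader.load()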
Load with Brave Search engine.
Load pre-rendered web pages using a headless browser hosted on Browserbase.
Depends on browserbase and playwright packages.
Get your API key from https://browserbase.com
Load webpages with Browserless /content endpoint.
Load conversations from exported ChatGPT data.
Load CHM files using Unstructured.
CHM means Microsoft Compiled HTML Help.
from langchain_community.document_loaders import UnstructuredCHMLoader
loader = UnstructuredCHMLoader("example.chm")
docs = loader.load()
https://github.com/dottedmag/pychm http://www.jedrea.com/chmlib/
Scrape HTML pages from URLs using a headless instance of Chromium.
Load College Confidential webpages.
Load and parse Documents concurrently.
Load Confluence pages.
Port of https://llamahub.ai/l/confluence. This currently supports username/api_key, OAuth2 login, personal access token, and cookie-based authentication.
Specify a list of page_ids and/or a space_key to load the corresponding pages into Document objects; if both are specified, the union of both sets will be returned.
You can also specify a boolean include_attachments to include attachments; this
is set to False by default. If set to True, all attachments will be downloaded and
ConfluenceLoader will extract the text from the attachments and add it to the
Document object. Currently supported attachment types are: PDF, PNG, JPEG/JPG,
SVG, Word, and Excel.
The Confluence API supports different formats of page content. The storage format is the
raw XML representation for storage. The view format is the HTML representation for
viewing, with macros rendered as they appear to users. You can pass
an enum content_format argument to specify the content format; this is
set to ContentFormat.STORAGE by default. The supported values are:
ContentFormat.EDITOR, ContentFormat.EXPORT_VIEW,
ContentFormat.ANONYMOUS_EXPORT_VIEW, ContentFormat.STORAGE,
and ContentFormat.VIEW.
Hint: space_key and page_id can both be found in the URL of a page in Confluence
Load CoNLL-U files.
Load documents from Couchbase.
Each document represents one row of the result. The page_content_fields are
written into the page_content of the document. The metadata_fields are written
into the metadata of the document. By default, all columns are written into
the page_content and none into the metadata.
Load a CSV file into a list of Document objects.
Each document represents one row of the CSV file. Every row is converted into a key/value pair and outputted to a new line in the document's page_content.
The source for each document loaded from csv is set to the value of the
file_path argument for all documents by default.
You can override this by setting the source_column argument to the
name of a column in the CSV file.
The source of each document will then be set to the value of the column
with the name specified in source_column.
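A minimal sketch; the file and column names are placeholders:
.. code-block:: python
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(
    file_path="./example_data/articles.csv",
    source_column="url",  # optional: use this column as each Document's source
)
docs = loader.load()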
Load CSV files using Unstructured.
Like other Unstructured loaders, UnstructuredCSVLoader can be used in both "single" and "elements" mode. If you use the loader in "elements" mode, the CSV file will be a single Unstructured Table element, and an HTML representation of the table will be available in the "text_as_html" key in the document metadata.
from langchain_community.document_loaders.csv_loader import UnstructuredCSVLoader
loader = UnstructuredCSVLoader("stanley-cups.csv", mode="elements")
docs = loader.load()
Load Cube semantic layer metadata.
Load Datadog logs.
Logs are written into the page_content and into the metadata.
Load Pandas DataFrame.
Load files using dedoc API.
The file loader automatically detects the file type (even with the wrong extension).
By default, the loader makes a call to the locally hosted dedoc API.
More information about dedoc API can be found in dedoc documentation:
https://dedoc.readthedocs.io/en/latest/dedoc_api_usage/api.html
Please see the documentation of DedocBaseLoader to get more details.
DedocFileLoader document loader integration to load files using dedoc.
The file loader automatically detects the file type (with the correct extension). The list of supported file types is given at https://dedoc.readthedocs.io/en/latest/index.html#id1. Please see the documentation of DedocBaseLoader for more details.
Load Diffbot json file.
Load from a directory.
Load Discord chat logs.
Load a PDF with Azure Document Intelligence.
Load from Docusaurus Documentation.
It leverages the SitemapLoader to loop through the generated pages of a
Docusaurus Documentation website and extracts the content by looking for specific
HTML tags. By default, the parser searches for the main content of the Docusaurus
page, which is normally the <article> tag.
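A minimal sketch; the site URL is illustrative:
.. code-block:: python
from langchain_community.document_loaders import DocusaurusLoader

# Walks the site's sitemap and extracts the main content of each page.
loader = DocusaurusLoader("https://python.langchain.com")
docs = loader.load()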
Load files from Dropbox.
In addition to common files such as text and PDF files, it also supports Dropbox Paper files.
Load from DuckDB.
Each document represents one row of the result. The page_content_columns
are written into the page_content of the document. The metadata_columns
are written into the metadata of the document. By default, all columns
are written into the page_content and none into the metadata.
Loads Outlook Message files using extract_msg.
Load email files using Unstructured.
Works with both .eml and .msg files. You can process attachments in addition to the e-mail message itself by passing process_attachments=True into the constructor for the loader. By default, attachments will be processed with the unstructured partition function. If you already know the document types of the attachments, you can specify another partitioning function with the attachment_partitioner kwarg.
from langchain_community.document_loaders import UnstructuredEmailLoader
loader = UnstructuredEmailLoader("example_data/fake-email.eml", mode="elements") loader.load()
from langchain_community.document_loaders import UnstructuredEmailLoader
loader = UnstructuredEmailLoader( "example_data/fake-email-attachment.eml", mode="elements", process_attachments=True, ) loader.load()
Load EPub files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredEPubLoader
loader = UnstructuredEPubLoader(
    "example.epub", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-epub
Load transactions from Ethereum mainnet.
The loader uses the Etherscan API to interact with Ethereum mainnet.
The ETHERSCAN_API_KEY environment variable must be set to use this loader.
Document loader for EverNote ENEX export files.
Loads EverNote notebook export files (.enex format) into LangChain Documents.
Extracts plain text content from HTML and preserves note metadata including
titles, timestamps, and attachments. Uses secure XML parsing to prevent
vulnerabilities.
The loader supports two modes: loading all notes into a single document (the default) or loading each note as a separate document.
`Instructions for creating ENEX files <https://help.evernote.com/hc/en-us/articles/209005557-Export-notes-and-notebooks-as-ENEX-or-HTML>`__
Example:
.. code-block:: python
from langchain_community.document_loaders import EverNoteLoader
# Load all notes as a single document
loader = EverNoteLoader("my_notebook.enex")
documents = loader.load()
# Load each note as a separate document:
# documents = [ document1, document2, ... ]
loader = EverNoteLoader("my_notebook.enex", load_single_document=False)
documents = loader.load()
# Lazy loading for large files
for doc in loader.lazy_load():
    print(f"Title: {doc.metadata.get('title', 'Untitled')}")
    print(f"Content: {doc.page_content[:100]}...")
Load Microsoft Excel files using Unstructured.
Like other Unstructured loaders, UnstructuredExcelLoader can be used in both "single" and "elements" mode. If you use the loader in "elements" mode, each sheet in the Excel file will be an Unstructured Table element, and an HTML representation of the table will be available in the "text_as_html" key in the document metadata.
from langchain_community.document_loaders.excel import UnstructuredExcelLoader
loader = UnstructuredExcelLoader("stanley-cups.xlsx", mode="elements")
docs = loader.load()
Load Facebook Chat messages directory dump.
Load from FaunaDB.
Load Figma file.
FireCrawlLoader document loader integration
Load geopandas Dataframe.
Load Git repository files.
The repository can be local, on disk at repo_path, or remote, at clone_url,
in which case it will be cloned to repo_path.
Currently, only text files are supported.
Each document represents one file in the repository. The path points to
the local Git repository, and the branch specifies the branch to load
files from. By default, it loads from the main branch.
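A minimal sketch; the clone URL, local path, and branch are illustrative:
.. code-block:: python
from langchain_community.document_loaders import GitLoader

loader = GitLoader(
    clone_url="https://github.com/langchain-ai/langchain",  # illustrative repo
    repo_path="./example_data/langchain_repo/",
    branch="master",
)
docs = loader.load()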
Load GitBook data.
When load_all_paths=True, the loader parses XML sitemaps and requires the
lxml package to be installed (pip install lxml).
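A minimal sketch; the site URL is illustrative:
.. code-block:: python
from langchain_community.document_loaders import GitbookLoader

# load_all_paths=True walks the sitemap, which requires lxml.
loader = GitbookLoader("https://docs.gitbook.com", load_all_paths=True)
docs = loader.load()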
Load a GitHub file.
Load issues of a GitHub repository.
Load table schemas from AWS Glue.
This loader fetches the schema of each table within a specified AWS Glue database. The schema details include column names and their data types, similar to pandas dtype representation.
AWS credentials are automatically loaded using boto3, following the standard AWS method: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
If a specific AWS profile is required, it can be specified and will be used to establish the session.
Load from Gutenberg.org.
Load Hacker News data.
It loads data from either main page results or the comments page.
Load HTML files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader(
    "example.html", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-html
Load from Hugging Face Hub datasets.
Load model information from Hugging Face Hub, including README content.
This loader interfaces with the Hugging Face Models API to fetch and load model metadata and README files. The API allows you to search and filter models based on specific criteria such as model tags, authors, and more.
API URL: https://huggingface.co/api/models DOC URL: https://huggingface.co/docs/hub/en/api
Examples:
.. code-block:: python
from langchain_community.document_loaders import HuggingFaceModelLoader
# Initialize the loader with search criteria
loader = HuggingFaceModelLoader(search="bert", limit=10)
# Load models
documents = loader.load()
# Iterate through the fetched documents
for doc in documents:
print(doc.page_content) # README content of the model
print(doc.metadata) # Metadata of the model
Load iFixit repair guides, device wikis and answers.
iFixit is the largest open repair community on the web. The site contains nearly 100k repair manuals, 200k Questions & Answers on 42k devices, and all the data is licensed under CC-BY.
This loader allows you to download the text of repair guides, Q&As, and device wikis from iFixit using their open APIs and web scraping.
Load PNG and JPG files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredImageLoader
loader = UnstructuredImageLoader(
    "example.png", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-image
Load image captions.
By default, the loader utilizes the pre-trained Salesforce BLIP image captioning model. https://huggingface.co/Salesforce/blip-image-captioning-base
Load IMSDb webpages.
Load from IUGU.
Load notes from Joplin.
In order to use this loader, you need to have Joplin running with the Web Clipper enabled (look for "Web Clipper" in the app settings).
To get the access token, go to the Web Clipper options; under "Advanced Options" you will find the access token.
You can find more information about the Web Clipper service here: https://joplinapp.org/clipper/
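A minimal sketch, assuming Joplin is running locally with the Web Clipper service enabled; the token is a placeholder:
.. code-block:: python
from langchain_community.document_loaders import JoplinLoader

loader = JoplinLoader(access_token="<joplin-access-token>")
docs = loader.load()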
Load a JSON file using a jq schema.
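A minimal sketch, assuming the jq package is installed; the file path and schema are illustrative:
.. code-block:: python
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path="./example_data/chat.json",
    jq_schema=".messages[].content",  # jq expression selecting the text
)
docs = loader.load()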
Load from Kinetica API.
Each document represents one row of the result. The page_content_columns
are written into the page_content of the document. The metadata_columns
are written into the metadata of the document. By default, all columns
are written into the page_content and none into the metadata.
Load from lakeFS.
Load from LarkSuite (FeiShu).
Load Documents using LLMSherpa.
LLMSherpaFileLoader uses LayoutPDFReader, which is part of the LLMSherpa library. This tool is designed to parse PDFs while preserving their layout information, which is often lost when using most PDF-to-text parsers.
from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader
loader = LLMSherpaFileLoader(
    "example.pdf",
    strategy="chunks",
    llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()
Load Markdown files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
Load Mastodon 'toots'.
Load from Alibaba Cloud MaxCompute table.
Load MediaWiki dump from an XML file.
Merge documents from a list of loaders.
Parse MHTML files with BeautifulSoup.
Load from Modern Treasury.
Load MongoDB documents.
NeedleLoader is a document loader for managing documents stored in a collection.
Load news articles from URLs using Unstructured.
Load Jupyter notebook (.ipynb) files.
Load Notion directory dump.
Load from Notion DB.
Reads content from pages within a Notion Database.
Args:
    integration_token (str): Notion integration token.
    database_id (str): Notion database id.
    request_timeout_sec (int): Timeout for Notion requests in seconds. Defaults to 10.
    filter_object (Dict[str, Any]): Filter object used to limit returned entries based on specified criteria. E.g.: {"timestamp": "last_edited_time", "last_edited_time": {"on_or_after": "2024-02-07"}} will only return entries that were last edited on or after 2024-02-07. Notion docs: https://developers.notion.com/reference/post-database-query-filter. Defaults to None, which will return ALL entries.
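A minimal sketch; the token and database id are placeholders:
.. code-block:: python
from langchain_community.document_loaders import NotionDBLoader

loader = NotionDBLoader(
    integration_token="<notion-integration-token>",
    database_id="<notion-database-id>",
    request_timeout_sec=30,
)
docs = loader.load()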
Load from Huawei OBS directory.
Load from the Huawei OBS file.
Load Obsidian files from directory.
Load OpenOffice ODT files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredODTLoader
loader = UnstructuredODTLoader( "example.odt", mode="elements", strategy="fast", ) docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-odt
Load documents from Microsoft OneDrive.
Uses SharePointLoader under the hood.
Load a file from Microsoft OneDrive.
Load from Open City.
Load Org-Mode files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredOrgModeLoader
loader = UnstructuredOrgModeLoader( "example.org", mode="elements", strategy="fast", ) docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-org
Load PDF files from a local file system, HTTP or S3.
To authenticate, the AWS client uses the following methods to automatically load credentials: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
If a specific credential profile should be used, you must pass the name of the profile from the ~/.aws/credentials file that is to be used.
Make sure the credentials / roles used have the required policies to access the Amazon Textract service.
DedocPDFLoader document loader integration to load PDF files using dedoc.
The file loader can automatically detect the correctness of a textual layer in the
PDF document.
Note that the __init__ method supports parameters that differ from those of
DedocBaseLoader.
Load PDF files using Mathpix service.
Load online PDF.
Load and parse a PDF file using 'pdfminer.six' library.
This class provides methods to load and parse PDF documents, supporting various
configurations such as handling password-protected files, extracting images, and
defining extraction mode. It integrates the pdfminer.six library for PDF
processing and offers both synchronous and asynchronous document loading.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pdfminer.six
Instantiate the loader:
.. code-block:: python
from langchain_community.document_loaders import PDFMinerLoader
loader = PDFMinerLoader(
    file_path="./example_data/layout-parser-paper.pdf",
    # headers=None,
    # password=None,
    mode="single",
    pages_delimiter="\n\f",
    # extract_images=True,
    # images_to_text=convert_images_to_text_with_tesseract(),
)
Lazy load documents:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load documents asynchronously:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load PDF files as HTML content using PDFMiner.
Load PDF files using pdfplumber.
Load and parse a PDF file using 'PyMuPDF' library.
This class provides methods to load and parse PDF documents, supporting various
configurations such as handling password-protected files, extracting tables,
extracting images, and defining extraction mode. It integrates the PyMuPDF
library for PDF processing and offers both synchronous and asynchronous document
loading.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pymupdf
Instantiate the loader:
.. code-block:: python
from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader(
    file_path="./example_data/layout-parser-paper.pdf",
    # headers=None,
    # password=None,
    mode="single",
    pages_delimiter="\n\f",
    # extract_images=True,
    # images_parser=TesseractBlobParser(),
    # extract_tables="markdown",
    # extract_tables_settings=None,
)
Lazy load documents:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load documents asynchronously:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load and parse a directory of PDF files using 'pypdf' library.
This class provides methods to load and parse multiple PDF documents in a directory,
supporting options for recursive search, handling password-protected files,
extracting images, and defining extraction modes. It integrates the pypdf library
for PDF processing and offers synchronous document loading.
Load and parse a PDF file using the pypdfium2 library.
This class provides methods to load and parse PDF documents, supporting various
configurations such as handling password-protected files, extracting images, and
defining extraction mode.
It integrates the pypdfium2 library for PDF processing and offers both
synchronous and asynchronous document loading.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pypdfium2
Instantiate the loader:
.. code-block:: python
from langchain_community.document_loaders import PyPDFium2Loader
loader = PyPDFium2Loader(
    file_path="./example_data/layout-parser-paper.pdf",
    # headers=None,
    # password=None,
    mode="single",
    pages_delimiter="\n\f",
    # extract_images=True,
    # images_to_text=convert_images_to_text_with_tesseract(),
)
Lazy load documents:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load documents asynchronously:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load and parse a PDF file using 'pypdf' library.
This class provides methods to load and parse PDF documents, supporting various
configurations such as handling password-protected files, extracting images, and
defining extraction mode. It integrates the pypdf library for PDF processing and
offers both synchronous and asynchronous document loading.
Examples: Setup:
.. code-block:: bash
pip install -U langchain-community pypdf
Instantiate the loader:
.. code-block:: python
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(
    file_path="./example_data/layout-parser-paper.pdf",
    # headers=None,
    # password=None,
    mode="single",
    pages_delimiter="\n\f",
    # extract_images=True,
    # images_parser=RapidOCRBlobParser(),
)
Lazy load documents:
.. code-block:: python
docs = []
docs_lazy = loader.lazy_load()
for doc in docs_lazy:
    docs.append(doc)
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load documents asynchronously:
.. code-block:: python
docs = await loader.aload()
print(docs[0].page_content[:100])
print(docs[0].metadata)
Load PDF files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredPDFLoader
loader = UnstructuredPDFLoader(
    "example.pdf", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-pdf
Pebblo Safe Loader class is a wrapper around document loaders enabling the data to be scrutinized.
Loader for text data.
Since PebbloSafeLoader is a wrapper around document loaders, this loader is used to load text data directly into Documents.
Load Polars DataFrame.
Load Microsoft PowerPoint files using Unstructured.
Works with both .ppt and .pptx files. You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredPowerPointLoader
loader = UnstructuredPowerPointLoader(
    "example.pptx", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-pptx
Load from Psychic.dev.
Load from the PubMed biomedical library.
Load PySpark DataFrames.
Load Python files, respecting any non-default encoding if specified.
Load ReadTheDocs documentation directory.
Recursively load all child links from a root URL.
Security Note: This loader is a crawler that will start crawling at a given URL and then expand to crawl child links recursively.
Web crawlers should generally NOT be deployed with network access
to any internal servers.
Control access to who can submit crawling requests and what network access
the crawler has.
While crawling, the crawler may encounter malicious URLs that would lead to a
server-side request forgery (SSRF) attack.
To mitigate risks, the crawler by default will only load URLs from the same
domain as the start URL (controlled via prevent_outside named argument).
This will mitigate the risk of SSRF attacks, but will not eliminate it.
For example, if crawling a host which hosts several sites:
https://some_host/alice_site/
https://some_host/bob_site/
A malicious URL on Alice's site could cause the crawler to make a malicious
GET request to an endpoint on Bob's site. Both sites are hosted on the
same host, so such a request would not be prevented by default.
See https://python.langchain.com/docs/security/
Setup:
This class has no required additional dependencies. You can optionally install
``beautifulsoup4`` for richer default metadata extraction:
.. code-block:: bash
pip install -U beautifulsoup4
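A minimal sketch; the start URL and depth are illustrative:
.. code-block:: python
from langchain_community.document_loaders import RecursiveUrlLoader

# prevent_outside defaults to True, keeping the crawl on the start domain.
loader = RecursiveUrlLoader("https://docs.python.org/3.13/", max_depth=2)
docs = loader.load()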
Load Reddit posts.
Read posts from a subreddit. First, you need to go to https://www.reddit.com/prefs/apps/ and create your application.
Load Roam files from a directory.
Load from a Rockset database.
To use, you should have the rockset python package installed.
Load news articles from RSS feeds using Unstructured.
Load RST files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredRSTLoader
loader = UnstructuredRSTLoader(
    "example.rst", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-rst
Load RTF files using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredRTFLoader
loader = UnstructuredRTFLoader(
    "example.rtf", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-rtf
Load from Amazon AWS S3 directory.
Load from Amazon AWS S3 file.
Turn a URL into LLM-accessible markdown with Scrapfly.io.
For further details, visit: https://scrapfly.io/docs/sdk/python
Turn a URL into LLM-accessible markdown with ScrapingAnt.
For further details, visit: https://docs.scrapingant.com/python-client
Load from SharePoint.
Load a sitemap and its URLs.
Security Note: This loader can be used to load all URLs specified in a sitemap. If a malicious actor gets access to the sitemap, they could force the server to load URLs from other domains by modifying the sitemap. This could lead to server-side request forgery (SSRF) attacks; e.g., with the attacker forcing the server to load URLs from internal service endpoints that are not publicly accessible. While the attacker may not immediately gain access to this data, this data could leak into downstream systems (e.g., data loader is used to load data for indexing).
This loader is a crawler and web crawlers should generally NOT be deployed
with network access to any internal servers.
Control access to who can submit crawling requests and what network access
the crawler has.
By default, the loader will only load URLs from the same domain as the sitemap
if the sitemap is not a local file. This can be disabled by setting
restrict_to_same_domain to False (not recommended).
If the sitemap is a local file, no such risk mitigation is applied by default.
Use the filter_urls argument to limit which URLs can be loaded.
See https://python.langchain.com/docs/security
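A minimal sketch; the sitemap URL and filter are illustrative:
.. code-block:: python
from langchain_community.document_loaders.sitemap import SitemapLoader

loader = SitemapLoader(
    web_path="https://python.langchain.com/sitemap.xml",
    filter_urls=["https://python.langchain.com/docs/"],  # limit what gets loaded
)
docs = loader.load()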
Load from a Slack directory dump.
Load from Snowflake API.
Each document represents one row of the result. The page_content_columns
are written into the page_content of the document. The metadata_columns
are written into the metadata of the document. By default, all columns
are written into the page_content and none into the metadata.
Load web pages as Documents using Spider AI.
Must have the Python package spider-client installed and a Spider API key.
See https://spider.cloud for more.
Load from Spreedly API.
Load documents by querying database tables supported by SQLAlchemy.
For talking to the database, the document loader uses the SQLDatabase
utility from the LangChain integration toolkit.
Each document represents one row of the result.
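A minimal sketch; the connection URI and query are placeholders:
.. code-block:: python
from langchain_community.document_loaders.sql_database import SQLDatabaseLoader
from langchain_community.utilities.sql_database import SQLDatabase

db = SQLDatabase.from_uri("sqlite:///example.db")  # illustrative database
loader = SQLDatabaseLoader(query="SELECT * FROM documents LIMIT 10;", db=db)
docs = loader.load()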
Load .srt (subtitle) files.
Load from Stripe API.
Load SurrealDB documents.
Load Telegram chat json directory dump.
Load from Telegram chat dump.
Load from Tencent Cloud COS directory.
Load from Tencent Cloud COS file.
Load from TensorFlow Dataset.
Load text file.
Load documents from TiDB.
Load HTML using 2markdown API.
Load TOML files.
It can load a single source file or several files in a single directory.
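A minimal sketch; the file path is illustrative:
.. code-block:: python
from langchain_community.document_loaders import TomlLoader

loader = TomlLoader("./example_data/config.toml")
docs = loader.load()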
Load cards from a Trello board.
Load TSV files using Unstructured.
Like other Unstructured loaders, UnstructuredTSVLoader can be used in both "single" and "elements" mode. If you use the loader in "elements" mode, the TSV file will be a single Unstructured Table element, and an HTML representation of the table will be available in the "text_as_html" key in the document metadata.
from langchain_community.document_loaders.tsv import UnstructuredTSVLoader
loader = UnstructuredTSVLoader("stanley-cups.tsv", mode="elements")
docs = loader.load()
Load Twitter tweets.
Read tweets of the user's Twitter handle.
First, you need to go to
https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api
to get your token, and create a v2 version of the app.
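A minimal sketch; the bearer token and handle are placeholders:
.. code-block:: python
from langchain_community.document_loaders import TwitterTweetLoader

loader = TwitterTweetLoader.from_bearer_token(
    oauth2_bearer_token="<bearer-token>",
    twitter_users=["example_user"],
    number_tweets=50,
)
docs = loader.load()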
Load files from remote URLs using Unstructured.
Use the unstructured partition function to detect the MIME type and route the file to the appropriate partitioner.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredURLLoader
loader = UnstructuredURLLoader(
    urls=["<url-1>", "<url-2>"], mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition
Load HTML pages with Playwright and parse with Unstructured.
This is useful for loading pages that require javascript to render.
Load HTML pages with Selenium and parse with Unstructured.
This is useful for loading pages that require javascript to render.
Load weather data with Open Weather Map API.
Reads the forecast & current weather of any location using OpenWeatherMap's free API. Check out https://openweathermap.org/appid for more on how to generate a free OpenWeatherMap API key.
WebBaseLoader document loader integration
Load WhatsApp messages text file.
Load from Wikipedia.
The hard limit on the length of the query is 300 characters for now.
Each wiki page represents one Document.
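A minimal sketch; the query is illustrative:
.. code-block:: python
from langchain_community.document_loaders import WikipediaLoader

loader = WikipediaLoader(query="LangChain", load_max_docs=2)
docs = loader.load()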
Load a DOCX file using docx2txt and chunk at the character level.
Defaults to checking for a local file, but if the file is a web path, it will download it to a temporary file, use that, and then clean up the temporary file after completion.
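A minimal sketch; the file path is illustrative:
.. code-block:: python
from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("./example_data/fake.docx")
docs = loader.load()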
Load Microsoft Word file using Unstructured.
Works with both .docx and .doc files. You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredWordDocumentLoader
loader = UnstructuredWordDocumentLoader(
    "example.docx", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-docx
Load XML file using Unstructured.
You can run the loader in one of two modes: "single" and "elements". If you use "single" mode, the document will be returned as a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
from langchain_community.document_loaders import UnstructuredXMLLoader
loader = UnstructuredXMLLoader(
    "example.xml", mode="elements", strategy="fast",
)
docs = loader.load()
https://unstructured-io.github.io/unstructured/bricks.html#partition-xml
Load Xorbits DataFrame.
Generic Google API Client.
To use, you should have the google_auth_oauthlib, youtube_transcript_api, and google
Python packages installed.
As the Google API expects credentials, you need to set up a Google account and
register your service: https://developers.google.com/docs/api/quickstart/python
Security Note: Note that parsing of the transcripts relies on the standard xml library but the input is viewed as trusted in this case.
Load all Videos from a YouTube Channel.
To use, you should have the googleapiclient and youtube_transcript_api
Python packages installed.
As the service needs a google_api_client, you first have to initialize
the GoogleApiClient.
Additionally, you have to either provide a channel name or a list of video ids: https://developers.google.com/docs/api/quickstart/python
Load YouTube video transcripts.
Load documents from Yuque.
Load datasets from the Apify web scraping, crawling, and data extraction platform.
For details, see https://docs.apify.com/platform/integrations/langchain
Load from Azure Blob Storage container.
Load from Azure Blob Storage files.
Load from the Google Cloud Platform BigQuery.
Each document represents one row of the result. The page_content_columns
are written into the page_content of the document. The metadata_columns
are written into the metadata of the document. By default, all columns
are written into the page_content and none into the metadata.
Load from Docugami.
To use, you should have the dgml-utils python package installed.
Load from GCS directory.
Load from GCS file.
Loader for Google Cloud Speech-to-Text audio transcripts.
It uses the Google Cloud Speech-to-Text API to transcribe audio files and loads the transcribed text into one or more Documents, depending on the specified format.
To use, you should have the google-cloud-speech python package installed.
Audio files can be specified via a Google Cloud Storage uri or a local file path.
For a detailed explanation of Google Cloud Speech-to-Text, refer to the product documentation. https://cloud.google.com/speech-to-text
Load Google Docs from Google Drive.
Load from Oracle Autonomous Database (ADB).
Autonomous Database connection can be made by either connection_string
or tns name. wallet_location and wallet_password are required
for TLS connection.
Each document will represent one row of the query result.
Columns are written into the page_content, and the 'metadata' given in the
constructor is written into the document's 'metadata';
by default, 'metadata' is None.
Read documents using OracleDocLoader.
Args:
    conn: Oracle connection.
    params: Loader parameters.
Splitting text using Oracle chunker.
Send file-like objects with the unstructured-client SDK to the Unstructured API.
By default, the loader makes a call to the hosted Unstructured API. If you are running the unstructured API locally, you can change the API URL by passing in the url parameter when you initialize the loader. The hosted Unstructured API requires an API key. See the links below to learn more about our API offerings and get an API key.
You can run the loader in different modes: "single", "elements", and "paged". The default "single" mode will return a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText and return those as individual langchain Document objects. In addition to these post-processing modes (which are specific to the LangChain Loaders), Unstructured has its own "chunking" parameters for post-processing elements into more useful chunks for use cases such as Retrieval Augmented Generation (RAG). You can pass in additional unstructured kwargs to configure different unstructured settings.
from langchain_community.document_loaders import UnstructuredAPIFileIOLoader

with open("example.pdf", "rb") as f:
    loader = UnstructuredAPIFileIOLoader(
        f, mode="elements", strategy="fast", api_key="MY_API_KEY",
    )
    docs = loader.load()
https://docs.unstructured.io/api-reference/api-services/sdk https://docs.unstructured.io/api-reference/api-services/overview https://docs.unstructured.io/open-source/core-functionality/partitioning https://docs.unstructured.io/open-source/core-functionality/chunking
Load file-like objects opened in read mode using Unstructured.
The file loader uses the unstructured partition function and will automatically detect the file type. You can run the loader in different modes: "single", "elements", and "paged". The default "single" mode will return a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText and return those as individual langchain Document objects. In addition to these post-processing modes (which are specific to the LangChain Loaders), Unstructured has its own "chunking" parameters for post-processing elements into more useful chunks for use cases such as Retrieval Augmented Generation (RAG). You can pass in additional unstructured kwargs to configure different unstructured settings.
from langchain_community.document_loaders import UnstructuredFileIOLoader
with open("example.pdf", "rb") as f: loader = UnstructuredFileIOLoader( f, mode="elements", strategy="fast", ) docs = loader.load()
https://docs.unstructured.io/open-source/core-functionality/partitioning https://docs.unstructured.io/open-source/core-functionality/chunking
Load files using Unstructured.
The file loader uses the unstructured partition function and will automatically detect the file type. You can run the loader in different modes: "single", "elements", and "paged". The default "single" mode will return a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText and return those as individual langchain Document objects. In addition to these post-processing modes (which are specific to the LangChain Loaders), Unstructured has its own "chunking" parameters for post-processing elements into more useful chunks for use cases such as Retrieval Augmented Generation (RAG). You can pass in additional unstructured kwargs to configure different unstructured settings.
from langchain_community.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader(
    "example.pdf", mode="elements", strategy="fast",
)
docs = loader.load()
https://docs.unstructured.io/open-source/core-functionality/partitioning https://docs.unstructured.io/open-source/core-functionality/chunking
Loads .ipynb notebook files.
Loads data from OneNote Notebooks.
Pebblo's safe dataloader is a wrapper for document loaders.
Loads Word documents.
Loads rich text files.
Loader that uses unstructured to load HTML files.
Web base loader class.
Document Loader for ArcGIS FeatureLayers.
Loads YouTube transcript.
Base class for all loaders that uses O365 Package
Document loader for EverNote ENEX export files.
This module provides functionality to securely load and parse EverNote notebook
export files (.enex format) into LangChain Document objects.
Simple reader that reads weather data from the OpenWeatherMap API.
Document loader helpers.
Loader that loads data from a SharePoint Document Library.
Scrapfly Web Reader.
Loads Microsoft Excel files.
Loader that uses Playwright to load a page, then uses unstructured to parse html.
Load Documents from Docusaurus Documentation.
Loads RST files.
Loader that uses unstructured to load HTML files.
Loader that uses Selenium to load a page, then uses unstructured to load the html.
Loader that uses unstructured to load files.
ScrapingAnt Web Extractor.
Loads Microsoft Excel files.
Load files using Unstructured API.
By default, the loader makes a call to the hosted Unstructured API. If you are running the unstructured API locally, you can change the API URL by passing in the url parameter when you initialize the loader. The hosted Unstructured API requires an API key. See the links below to learn more about our API offerings and get an API key.
You can run the loader in different modes: "single", "elements", and "paged". The default "single" mode will return a single langchain Document object. If you use "elements" mode, the unstructured library will split the document into elements such as Title and NarrativeText and return those as individual langchain Document objects. In addition to these post-processing modes (which are specific to the LangChain Loaders), Unstructured has its own "chunking" parameters for post-processing elements into more useful chunks for use cases such as Retrieval Augmented Generation (RAG). You can pass in additional unstructured kwargs to configure different unstructured settings.
Examples
--------
from langchain_community.document_loaders import UnstructuredAPIFileLoader
loader = UnstructuredAPIFileLoader(
    "example.pdf", mode="elements", strategy="fast", api_key="MY_API_KEY",
)
docs = loader.load()
References
----------
https://docs.unstructured.io/api-reference/api-services/sdk
https://docs.unstructured.io/api-reference/api-services/overview
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking