Document Loaders

OBSFileLoader

Load from a Huawei OBS file.

ParseOracleDocMetadata

Parse Oracle doc metadata...

OracleDocReader

Read a file

NotebookLoader

Load Jupyter notebook (.ipynb) files.

S3DirectoryLoader

Load from Amazon AWS S3 directory.

TiDBLoader

Load documents from TiDB.

AirbyteJSONLoader

Load local Airbyte JSON files.

SitemapLoader

Load a sitemap and its URLs.

TensorflowDatasetLoader

Load from TensorFlow Dataset.

UnstructuredCHMLoader

Load CHM files using Unstructured.

CHMParser

Microsoft Compiled HTML Help (CHM) Parser.

UnstructuredOrgModeLoader

Load Org-Mode files using Unstructured.

HuggingFaceDatasetLoader

Load from Hugging Face Hub datasets.

RoamLoader

Load Roam files from a directory.

BaseDataFrameLoader

DataFrameLoader

Load Pandas DataFrame.

DropboxLoader

Load files from Dropbox.

OneNoteLoader

Load pages from OneNote notebooks.

TelegramChatFileLoader

Load from Telegram chat dump.

TelegramChatApiLoader

Load Telegram chat JSON directory dump.

LLMSherpaFileLoader

Load Documents using LLMSherpa.

PebbloSafeLoader

Pebblo Safe Loader class is a wrapper around document loaders enabling the data

PebbloTextLoader

Loader for text data.

IFixitLoader

Load iFixit repair guides, device wikis and answers.

Docx2txtLoader

Load a DOCX file using docx2txt and chunk it at character level.

UnstructuredWordDocumentLoader

Load Microsoft Word file using Unstructured.

OBSDirectoryLoader

Load from Huawei OBS directory.

SnowflakeLoader

Load from Snowflake API.

HuggingFaceModelLoader

Load model information from Hugging Face Hub, including README content.

CassandraLoader

MintbaseDocumentLoader

Load elements from a blockchain smart contract.

UnstructuredRTFLoader

Load RTF files using Unstructured.

UnstructuredPDFLoader

Load PDF files using Unstructured.

BasePDFLoader

Base Loader class for PDF files.

OnlinePDFLoader

Load online PDF.

PyPDFLoader

Load and parse a PDF file using 'pypdf' library.

PyPDFium2Loader

Load and parse a PDF file using the pypdfium2 library.

PyPDFDirectoryLoader

Load and parse a directory of PDF files using 'pypdf' library.

PDFMinerLoader

Load and parse a PDF file using 'pdfminer.six' library.

PDFMinerPDFasHTMLLoader

Load PDF files as HTML content using PDFMiner.

PyMuPDFLoader

Load and parse a PDF file using 'PyMuPDF' library.

MathpixPDFLoader

Load PDF files using Mathpix service.

PDFPlumberLoader

Load PDF files using pdfplumber.

AmazonTextractPDFLoader

Load PDF files from a local file system, HTTP or S3.

DedocPDFLoader

DedocPDFLoader document loader integration to load PDF files using dedoc.

DocumentIntelligenceLoader

Load a PDF with Azure Document Intelligence.

ZeroxPDFLoader

Document loader utilizing the Zerox library.

YuqueLoader

Load documents from Yuque.

OpenCityDataLoader

Load from Open City.

XorbitsLoader

Load Xorbits DataFrame.

LakeFSClient

Client for lakeFS.

LakeFSLoader

Load from lakeFS.

UnstructuredLakeFSLoader

Load from lakeFS as unstructured data.

AthenaLoader

Load documents from AWS Athena.

OneDriveLoader

Load documents from Microsoft OneDrive.

BaiduBOSFileLoader

Load from Baidu Cloud BOS file.

UnstructuredEPubLoader

Load EPub files using Unstructured.

ChatGPTLoader

Load conversations from exported ChatGPT data.

BrowserlessLoader

Load webpages with Browserless /content endpoint.

AsyncChromiumLoader

Scrape HTML pages from URLs using a

AsyncHtmlLoader

Load HTML asynchronously.

CoNLLULoader

Load CoNLL-U files.

UnstructuredURLLoader

Load files from remote URLs using Unstructured.

ImageCaptionLoader

Load image captions.

NotionDirectoryLoader

Load Notion directory dump.

IuguLoader

Load from IUGU.

AzureAIDataLoader

Load from Azure AI Data.

FaunaLoader

Load from FaunaDB.

MongodbLoader

Load MongoDB documents.

WebBaseLoader

WebBaseLoader document loader integration

DirectoryLoader

Load from a directory.

ArcGISLoader

Load records from an ArcGIS FeatureLayer.

QuipLoader

Load Quip pages.

ConcurrentLoader

Load and parse Documents concurrently.

TranscriptFormat

Transcript format to use for the document loader.

AssemblyAIAudioTranscriptLoader

Load AssemblyAI audio transcripts.

AssemblyAIAudioLoaderById

Load AssemblyAI audio transcripts.

TomlLoader

Load TOML files.

AirtableLoader

Load Airtable tables.

CollegeConfidentialLoader

Load College Confidential webpages.

PolarsDataFrameLoader

Load Polars DataFrame.

GeoDataFrameLoader

Load a geopandas GeoDataFrame.

GenericLoader

Generic Document Loader.

SurrealDBLoader

Load SurrealDB documents.

ArxivLoader

Load a query result from Arxiv.

BibtexLoader

Load a bibtex file.

GoogleApiClient

Generic Google API Client.

TranscriptFormat

Output formats of transcripts from YoutubeLoader.

YoutubeLoader

Load YouTube video transcripts.

GoogleApiYoutubeLoader

Load all Videos from a YouTube Channel.

RSSFeedLoader

Load news articles from RSS feeds using Unstructured.

CubeSemanticLoader

Load Cube semantic layer metadata.

LarkSuiteDocLoader

Load from LarkSuite (FeiShu).

LarkSuiteWikiLoader

Load from LarkSuite (FeiShu) wiki.

JoplinLoader

Load notes from Joplin.

MaxComputeLoader

Load from Alibaba Cloud MaxCompute table.

TwitterTweetLoader

Load Twitter tweets.

DatadogLogsLoader

Load Datadog logs.

CouchbaseLoader

Load documents from Couchbase.

SpreedlyLoader

Load from Spreedly API.

SQLDatabaseLoader

Load documents by querying database tables supported by SQLAlchemy.

IMSDbLoader

Load IMSDb webpages.

FigmaFileLoader

Load Figma file.

O365BaseLoader

Base class for all loaders that use the O365 package.

ContentFormat

Enumerator of the content formats of a Confluence page.

ConfluenceLoader

Load Confluence pages.

AirbyteCDKLoader

Load with an Airbyte source connector implemented using the CDK.

CDKIntegration

A wrapper around the CDK integration.

AirbyteHubspotLoader

Load from Hubspot using an Airbyte source connector.

AirbyteStripeLoader

Load from Stripe using an Airbyte source connector.

AirbyteTypeformLoader

Load from Typeform using an Airbyte source connector.

AirbyteZendeskSupportLoader

Load from Zendesk Support using an Airbyte source connector.

AirbyteShopifyLoader

Load from Shopify using an Airbyte source connector.

AirbyteSalesforceLoader

Load from Salesforce using an Airbyte source connector.

AirbyteGongLoader

Load from Gong using an Airbyte source connector.

ReadTheDocsLoader

Load ReadTheDocs documentation directory.

SlackDirectoryLoader

Load from a Slack directory dump.

AZLyricsLoader

Load AZLyrics webpages.

KineticaLoader

Load from Kinetica API.

AzureAIDocumentIntelligenceLoader

Load a PDF with Azure Document Intelligence.

ObsidianLoader

Load Obsidian files from directory.

EverNoteLoader

Document loader for EverNote ENEX export files.

PythonLoader

Load Python files, respecting any non-default encoding if specified.

HNLoader

Load Hacker News data.

UnstructuredMarkdownLoader

Load Markdown files using Unstructured.

WeatherDataLoader

Load weather data with Open Weather Map API.

FileEncoding

File encoding as a NamedTuple.

NeedleLoader

NeedleLoader is a document loader for managing documents stored in a collection.

SharePointLoader

Load from SharePoint.

VsdxLoader

NucliaLoader

Load from any file type using Nuclia Understanding API.

UnstructuredPowerPointLoader

Load Microsoft PowerPoint files using Unstructured.

DedocBaseLoader

Base Loader that uses dedoc (https://dedoc.readthedocs.io).

DedocFileLoader

DedocFileLoader document loader integration to load files using dedoc.

DedocAPIFileLoader

Load files using dedoc API.

SRTLoader

Load .srt (subtitle) files.

DiffbotLoader

Load a Diffbot JSON file.

TencentCOSDirectoryLoader

Load from Tencent Cloud COS directory.

PySparkDataFrameLoader

Load PySpark DataFrames.

ColumnNotFoundError

Column not found error.

RocksetLoader

Load from a Rockset database.

ScrapflyLoader

Turn a URL into LLM-accessible Markdown with Scrapfly.io.

DuckDBLoader

Load from DuckDB.

GitbookLoader

Load GitBook data.

CSVLoader

Load a CSV file into a list of Document objects.
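
To make the row-to-document idea concrete, here is a minimal standard-library sketch. It is not the loader's implementation: the `Doc` class, the `csv_to_docs` function, the "column: value" text layout and the `source`/`row` metadata keys are all assumptions made for this example.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: text plus metadata."""

    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def csv_to_docs(text: str, source: str) -> List[Doc]:
    """Turn each CSV data row into one Doc (assumed layout: 'column: value' lines)."""
    reader = csv.DictReader(io.StringIO(text))
    docs: List[Doc] = []
    for row_number, row in enumerate(reader):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        docs.append(Doc(content, {"source": source, "row": row_number}))
    return docs


csv_docs = csv_to_docs("name,team\nAda,core\nLin,docs\n", source="people.csv")
```

Each data row yields exactly one `Doc`, so a two-row file produces two documents, and the header row supplies the column names.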

UnstructuredCSVLoader

Load CSV files using Unstructured.

BlackboardLoader

Load a Blackboard course.

GutenbergLoader

Load from Gutenberg.org.

AcreomLoader

Load acreom vault from a directory.

StripeLoader

Load from Stripe API.

UnstructuredXMLLoader

Load XML file using Unstructured.

MergedDataLoader

Merge documents from a list of loaders.

BaiduBOSDirectoryLoader

Load from Baidu BOS directory.

FacebookChatLoader

Load Facebook Chat messages directory dump.

BSHTMLLoader

BSHTMLLoader document loader integration

UnstructuredTSVLoader

Load TSV files using Unstructured.

S3FileLoader

Load from Amazon AWS S3 file.

UnstructuredImageLoader

Load PNG and JPG files using Unstructured.

JSONLoader

Load a JSON file using a jq schema.

PlaywrightEvaluator

Abstract base class for all evaluators.

UnstructuredHtmlEvaluator

Evaluate the page HTML content using the unstructured library.

PlaywrightURLLoader

Load HTML pages with Playwright and parse with Unstructured.

ToMarkdownLoader

Load HTML using 2markdown API.

BlockchainType

Enumerator of the supported blockchains.

BlockchainDocumentLoader

Load elements from a blockchain smart contract.

DocusaurusLoader

Load from Docusaurus Documentation.

OneDriveFileLoader

Load a file from Microsoft OneDrive.

MWDumpLoader

Load MediaWiki dump from an XML file.

UnstructuredRSTLoader

Load RST files using Unstructured.

MastodonTootsLoader

Load the Mastodon 'toots'.

RecursiveUrlLoader

Recursively load all child links from a root URL.
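
The traversal can be sketched without any network access. The example below is an illustration only: `collect_child_links`, the in-memory `link_graph`, the `max_depth` meaning (number of link levels visited, counting the root) and the "stay under the root URL" rule are assumptions for this sketch, not the loader's documented behavior.

```python
from typing import Dict, List


def collect_child_links(
    root: str, link_graph: Dict[str, List[str]], max_depth: int = 2
) -> List[str]:
    """Visit pages level by level, following only links that start with `root`.

    `link_graph` maps a URL to the links found on that page (hypothetical input
    standing in for real HTTP fetches).
    """
    visited: List[str] = []
    seen = {root}
    frontier = [root]
    for _ in range(max_depth):
        next_frontier: List[str] = []
        for url in frontier:
            visited.append(url)
            for child in link_graph.get(url, []):
                # Skip links outside the root and pages already scheduled.
                if child.startswith(root) and child not in seen:
                    seen.add(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return visited


site = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.org/",
    ],
    "https://example.com/a": ["https://example.com/a/deep", "https://example.com/"],
}
shallow = collect_child_links("https://example.com/", site, max_depth=2)
deeper = collect_child_links("https://example.com/", site, max_depth=3)
```

The `seen` set keeps the back-link from `/a` to the root from causing a loop.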

TextLoader

Load text file.

MHTMLLoader

Parse MHTML files with BeautifulSoup.

GitLoader

Load Git repository files.

WikipediaLoader

Load from Wikipedia.

UnstructuredODTLoader

Load OpenOffice ODT files using Unstructured.

FireCrawlLoader

FireCrawlLoader document loader integration

NewsURLLoader

Load news articles from URLs using Unstructured.

RedditPostsLoader

Load Reddit posts.

SeleniumURLLoader

Load HTML pages with Selenium and parse with Unstructured.

TrelloLoader

Load cards from a Trello board.

ModernTreasuryLoader

Load from Modern Treasury.

PubMedLoader

Load from the PubMed biomedical library.

UnstructuredBaseLoader

Base Loader that uses Unstructured.

EtherscanLoader

Load transactions from Ethereum mainnet.

UnstructuredHTMLLoader

Load HTML files using Unstructured.

WhatsAppChatLoader

Load WhatsApp messages text file.

UnstructuredEmailLoader

Load email files using Unstructured.

OutlookMessageLoader

Load Outlook Message files using extract_msg.

GlueCatalogLoader

Load table schemas from AWS Glue.

RSpaceLoader

Load content from RSpace notebooks, folders, documents or PDF Gallery files.

BraveSearchLoader

Load with Brave Search engine.

NotionDBLoader

Load from Notion DB.

TencentCOSFileLoader

Load from Tencent Cloud COS file.

DiscordChatLoader

Load Discord chat logs.

PsychicLoader

Load from Psychic.dev.

ScrapingAntLoader

Turn a URL into LLM-accessible Markdown with ScrapingAnt.

BaseGitHubLoader

Load GitHub repository Issues.

GitHubIssuesLoader

Load issues of a GitHub repository.

GithubFileLoader

Load a GitHub file.

BiliBiliLoader

Load transcripts from BiliBili videos.

BrowserbaseLoader

Load pre-rendered web pages using a headless browser hosted on Browserbase.

SpiderLoader

Load web pages as Documents using Spider AI.

UnstructuredExcelLoader

Load Microsoft Excel files using Unstructured.

CloudBlobLoader

Load blobs from a cloud URL or a file: URL.

YoutubeAudioLoader

Load YouTube urls as audio file(s).

FileSystemBlobLoader

Load blobs in the local file system.

MsWordParser

Parse the Microsoft Word documents from a blob.

PyPDFParser

Parse a blob from a PDF using pypdf library.

PDFMinerParser

Parse a blob from a PDF using pdfminer.six library.

PyMuPDFParser

Parse a blob from a PDF using PyMuPDF library.

PyPDFium2Parser

Parse a blob from a PDF using PyPDFium2 library.

PDFPlumberParser

Parse PDF with PDFPlumber.

AmazonTextractPDFParser

Send PDF files to Amazon Textract and parse them.

DocumentIntelligenceParser

Load a PDF with Azure Document Intelligence.

AzureOpenAIWhisperParser

Transcribe and parse audio files using Azure OpenAI Whisper.

OpenAIWhisperParser

Transcribe and parse audio files.

OpenAIWhisperParserLocal

Transcribe and parse audio files with OpenAI Whisper model.

YandexSTTParser

Transcribe and parse audio files.

FasterWhisperParser

Transcribe and parse audio files with faster-whisper.

DocumentLoaderAsParser

A wrapper class that adapts a document loader to function as a parser.

MimeTypeBasedParser

Parser that uses mime-types to parse a blob.
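
Dispatching on mime-type can be sketched as a lookup table from mime-type to handler. Everything named below (`Blob`, `Handler`, `parse_by_mime_type`, the `fallback` argument) is hypothetical and only illustrates the pattern; it is not this class's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Blob:
    """Hypothetical stand-in for a blob: raw bytes plus a declared mime-type."""

    data: bytes
    mimetype: str


Handler = Callable[[Blob], str]


def parse_by_mime_type(
    blob: Blob, handlers: Dict[str, Handler], fallback: Optional[Handler] = None
) -> str:
    """Pick the handler registered for the blob's mime-type, else the fallback."""
    handler = handlers.get(blob.mimetype, fallback)
    if handler is None:
        raise ValueError(f"Unsupported mime-type: {blob.mimetype}")
    return handler(blob)


handlers: Dict[str, Handler] = {
    "text/plain": lambda blob: blob.data.decode("utf-8"),
    # Toy CSV handler: just swaps commas for tabs.
    "text/csv": lambda blob: blob.data.decode("utf-8").replace(",", "\t"),
}

plain = parse_by_mime_type(Blob(b"hello", "text/plain"), handlers)
tabbed = parse_by_mime_type(Blob(b"a,b", "text/csv"), handlers)
```

A blob whose mime-type has no registered handler raises `ValueError` unless a fallback is supplied.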

DocAIParsingResults

Dataclass to store Document AI parsing results.

AzureAIDocumentIntelligenceParser

Load a PDF with Azure Document Intelligence.

TextParser

Parser for text blobs.

VsdxParser

Parser for vsdx files.

BaseImageBlobParser

Abstract base class for parsing image blobs into text.

RapidOCRBlobParser

Parser for extracting text from images using the RapidOCR library.

TesseractBlobParser

Parser for extracting text from images using the Tesseract OCR library.

LLMImageBlobParser

Parser for analyzing images using a language model (LLM).

ServerUnavailableException

Exception raised when the Grobid server is unavailable.

GrobidParser

Load article PDF files using Grobid.

GoSegmenter

Code segmenter for Go.

PHPSegmenter

Code segmenter for PHP.

LanguageParser

Parse using the respective programming language syntax.

CSegmenter

Code segmenter for C.

LuaSegmenter

Code segmenter for Lua.

ScalaSegmenter

Code segmenter for Scala.

RubySegmenter

Code segmenter for Ruby.

TypeScriptSegmenter

Code segmenter for TypeScript.

SQLSegmenter

Code segmenter for SQL.

PythonSegmenter

Code segmenter for Python.

CSharpSegmenter

Code segmenter for C#.

CobolSegmenter

Code segmenter for COBOL.

CodeSegmenter

Abstract class for the code segmenter.

JavaSegmenter

Code segmenter for Java.

ElixirSegmenter

Code segmenter for Elixir.

JavaScriptSegmenter

Code segmenter for JavaScript.

TreeSitterSegmenter

Abstract class for CodeSegmenters that use the tree-sitter library.

PerlSegmenter

Code segmenter for Perl.

KotlinSegmenter

Code segmenter for Kotlin.

RustSegmenter

Code segmenter for Rust.

CPPSegmenter

Code segmenter for C++.

BS4HTMLParser

Parse HTML files using Beautiful Soup.

VolcengineRerank

Document compressor that uses Volcengine Rerank API.

FlashrankRerank

Document compressor using Flashrank interface.

JinaRerank

Document compressor that uses Jina Rerank API.

RerankRequest

Request for reranking.

OpenVINOReranker

OpenVINO rerank models.

RankLLMRerank

Document compressor using RankLLM.

ModelType

LLMLinguaCompressor

Compress using LLMLingua Project.

InfinityRerank

Document compressor that uses Infinity Rerank API.

DashScopeRerank

Document compressor that uses DashScope Rerank API.

BeautifulSoupTransformer

Transform HTML content by extracting specific tags and removing unwanted ones.

NucliaTextTransformer

Nuclia Text Transformer.

Html2TextTransformer

Convert HTML documents to plain text using html2text.

LongContextReorder

Reorder long context.
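
One common reordering for long contexts moves the most relevant items to the two ends of the list and the least relevant to the middle. The sketch below assumes that is the intent here; `reorder_for_long_context` is a hypothetical name, and the input is assumed to be sorted from most to least relevant.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def reorder_for_long_context(ranked: Sequence[T]) -> List[T]:
    """Place highly ranked items at both ends and low-ranked items in the middle.

    `ranked` is assumed to be ordered from most to least relevant.
    """
    reordered: List[T] = []
    # Walk from least to most relevant, alternating between front and back.
    for i, item in enumerate(reversed(ranked)):
        if i % 2 == 1:
            reordered.append(item)
        else:
            reordered.insert(0, item)
    return reordered


docs = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
result = reorder_for_long_context(docs)
# result == ["d1", "d3", "d5", "d4", "d2"]
```

The two strongest items, `d1` and `d2`, end up first and last, while the weakest, `d5`, sits in the middle.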

OpenAIMetadataTagger

Extract metadata tags from document contents using OpenAI functions.

DoctranPropertyExtractor

Extract properties from text documents using doctran.

DoctranTextTranslator

Translate text documents using doctran.

MarkdownifyTransformer

Converts HTML documents to Markdown format with customizable options for handling

DoctranQATransformer

Extract QA from text documents using doctran.

EmbeddingsRedundantFilter

Filter that drops redundant documents by comparing their embeddings.
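
A minimal sketch of the idea, assuming cosine similarity and a greedy "keep the first of each near-duplicate group" rule. The function names and the 0.95 threshold are illustrative assumptions, not the filter's documented behavior.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.95
) -> List[int]:
    """Return indices of embeddings that are not too similar to an earlier kept one."""
    kept: List[int] = []
    for i, vector in enumerate(embeddings):
        if all(cosine_similarity(vector, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


vectors = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
kept_indices = drop_redundant(vectors)
# kept_indices == [0, 2]: the second vector is nearly parallel to the first
```

The returned indices can then be used to select the corresponding documents.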

EmbeddingsClusteringFilter

Perform K-means clustering on document vectors.

TelegramChatLoader

Load Telegram conversations to LangChain chat messages.

LangSmithRunChatLoader

Load chat sessions from a list of LangSmith "llm" runs.

LangSmithDatasetChatLoader

Load chat sessions from a LangSmith dataset with the "chat" data type.

IMessageChatLoader

Load chat sessions from the iMessage chat.db SQLite file.

SlackChatLoader

Load Slack conversations from a dump zip file.

WhatsAppChatLoader

Load WhatsApp conversations from a dump zip file or directory.

SingleFileFacebookMessengerChatLoader

Load Facebook Messenger chat data from a single file.

FolderFacebookMessengerChatLoader

Load Facebook Messenger chat data from a folder.

GMailLoader

Load chat sessions from Gmail.

OracleAutonomousDatabaseLoader

Load from Oracle Autonomous Database (ADB).

OracleDocLoader

Read documents using OracleDocLoader.

OracleTextSplitter

Split text using the Oracle chunker.

AzureBlobStorageContainerLoader

Load from Azure Blob Storage container.

BigQueryLoader

Load from Google Cloud Platform BigQuery.

GCSFileLoader

Load from GCS file.

GoogleDriveLoader

Load Google Docs from Google Drive.

GCSDirectoryLoader

Load from GCS directory.

GoogleSpeechToTextLoader

Loader for Google Cloud Speech-to-Text audio transcripts.

AzureBlobStorageFileLoader

Load from Azure Blob Storage files.

ApifyDatasetLoader

Load datasets from Apify web scraping, crawling, and data extraction platform.

AstraDBLoader

UnstructuredFileLoader

Load files using Unstructured.

UnstructuredAPIFileLoader

Load files using Unstructured API.

UnstructuredFileIOLoader

Load file-like objects opened in read mode using Unstructured.

UnstructuredAPIFileIOLoader

Send file-like objects with unstructured-client sdk to the Unstructured API.

DocugamiLoader

Load from Docugami.

DocAIParser

Google Cloud Document AI parser.

GoogleTranslateTransformer

Translate text documents using Google Cloud Translation.

Functions

concatenate_cells

Combine cells information in a readable format ready to be used.

remove_newlines

Recursively remove newlines, no matter the data structure they are stored in.
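
The recursion can be illustrated with a short sketch. This is not the library's implementation: the name `strip_newlines` and the choice to handle only `str`, `list`, `tuple` and `dict` are assumptions made for this example.

```python
from typing import Any


def strip_newlines(value: Any) -> Any:
    """Return a copy of `value` with newline characters removed from every string.

    Hypothetical helper: handles str, list, tuple and dict; any other type is
    returned unchanged.
    """
    if isinstance(value, str):
        return value.replace("\n", "")
    if isinstance(value, list):
        return [strip_newlines(item) for item in value]
    if isinstance(value, tuple):
        return tuple(strip_newlines(item) for item in value)
    if isinstance(value, dict):
        return {key: strip_newlines(item) for key, item in value.items()}
    return value


cell = {"source": ["line one\n", "line two\n"], "outputs": [{"text": ("a\nb", 3)}]}
cleaned = strip_newlines(cell)
```

Because each container is rebuilt rather than modified in place, the input structure is left untouched.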

concatenate_rows

Combine message information in a readable format ready to be used.

text_to_docs

Convert a string or list of strings to a list of Documents with metadata.

default_loader_func

concatenate_rows

Combine message information in a readable format ready to be used.

fetch_mime_types

Fetch the mime types for the specified file types.

fetch_extensions

Fetch the file extensions for the specified file types.

detect_file_encodings

Try to detect the file encoding.

default_joiner

Default joiner for content columns.

concatenate_rows

Combine message information in a readable format ready to be used.

satisfies_min_unstructured_version

Check if the installed Unstructured version exceeds the minimum version

validate_unstructured_version

Raise an error if the Unstructured version does not exceed the

get_elements_from_api

Retrieve a list of elements from the Unstructured API.

concatenate_rows

Combine message information in a readable format ready to be used.

extract_from_images_with_rapidocr

Extract text from images with RapidOCR.

get_parser

Get a parser by parser name.

require_model_export

get_navigable_strings

Get all navigable strings from a BeautifulSoup element.

create_metadata_tagger

Create a DocumentTransformer that uses an OpenAI function chain to automatically

get_stateful_documents

Convert a list of documents to a list of documents with state.

merge_chat_runs_in_session

Merge chat runs together in a chat session.

merge_chat_runs

Merge chat runs together.
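
A sketch of one plausible reading, in which a "run" is a sequence of consecutive messages from the same sender. The dictionary message shape, the `merge_runs` name and the blank-line separator are assumptions for illustration only.

```python
from typing import Dict, List

Message = Dict[str, str]


def merge_runs(messages: List[Message], separator: str = "\n\n") -> List[Message]:
    """Collapse consecutive messages from the same sender into a single message."""
    merged: List[Message] = []
    for message in messages:
        if merged and merged[-1]["sender"] == message["sender"]:
            # Same sender as the previous message: extend that message's text.
            merged[-1]["text"] += separator + message["text"]
        else:
            # New sender: start a new run with a copy, leaving the input untouched.
            merged.append(dict(message))
    return merged


session = [
    {"sender": "alice", "text": "hi"},
    {"sender": "alice", "text": "are you there?"},
    {"sender": "bob", "text": "yes"},
]
merged_session = merge_runs(session)
```

Here Alice's two consecutive messages become one, and Bob's reply stays separate.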

map_ai_messages_in_session

Convert messages from the specified 'sender' to AI messages.

map_ai_messages

Convert messages from the specified 'sender' to AI messages.
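
As a rough illustration, the mapping can be sketched on plain dictionaries. The `mark_ai_messages` name, the message shape and the `role` key are hypothetical; the library works with its own message types.

```python
from typing import Dict, List


def mark_ai_messages(messages: List[Dict[str, str]], sender: str) -> List[Dict[str, str]]:
    """Return copies of the messages, tagging those from `sender` with role 'ai'."""
    marked: List[Dict[str, str]] = []
    for message in messages:
        copy = dict(message)
        if copy.get("sender") == sender:
            copy["role"] = "ai"
        marked.append(copy)
    return marked


chat = [{"sender": "alice", "text": "hi"}, {"sender": "bot", "text": "hello"}]
marked_chat = mark_ai_messages(chat, sender="bot")
```

Messages from other senders are copied through unchanged.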