# Document loaders

`langchain_core.document_loaders`
## BaseLoader

Bases: `ABC`

Interface for Document Loader.

Implementations should implement the lazy-loading method using generators to avoid loading all documents into memory at once.

`load` is provided just for user convenience and should not be overridden.
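The contract above can be sketched as follows. This is a minimal illustration, not the real implementation: `Document` here is a stand-in dataclass (the real class lives in `langchain_core.documents`), and `LineLoader` is a hypothetical loader invented for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Document:
    """Minimal stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class BaseLoader(ABC):
    """Sketch of the BaseLoader contract."""

    @abstractmethod
    def lazy_load(self) -> Iterator[Document]:
        """Yield documents one at a time, so memory stays bounded."""

    def load(self) -> list[Document]:
        # Eager convenience wrapper; per the docs, not meant to be overridden.
        return list(self.lazy_load())


class LineLoader(BaseLoader):
    """Hypothetical loader that yields one Document per line of text."""

    def __init__(self, text: str) -> None:
        self.text = text

    def lazy_load(self) -> Iterator[Document]:
        for i, line in enumerate(self.text.splitlines()):
            yield Document(page_content=line, metadata={"line": i})


docs = LineLoader("alpha\nbeta").load()
```

Because `lazy_load` is a generator, consumers can iterate over documents one at a time; `load` simply drains that generator into a list.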
| METHOD | DESCRIPTION |
|---|---|
| `load` | Load data into `Document` objects. |
| `aload` | Load data into `Document` objects. |
| `load_and_split` | Load `Document` and split into chunks. |
| `lazy_load` | A lazy loader for `Document`. |
| `alazy_load` | A lazy loader for `Document`. |
### load

load() -> list[Document]

Load data into `Document` objects.
### aload (async)

aload() -> list[Document]

Load data into `Document` objects.
### load_and_split

load_and_split(text_splitter: TextSplitter | None = None) -> list[Document]

Load `Document` and split into chunks. Chunks are returned as `Document` objects.

**Danger:** Do not override this method. It should be considered deprecated!
| PARAMETER | DESCRIPTION |
|---|---|
| `text_splitter` | `TextSplitter` instance to use for splitting documents. TYPE: `TextSplitter \| None` DEFAULT: `None` |

| RAISES | DESCRIPTION |
|---|---|
| `ImportError` | If `langchain-text-splitters` is not installed and no `text_splitter` is provided. |

| RETURNS | DESCRIPTION |
|---|---|
| `list[Document]` | List of `Document` chunks. |
### alazy_load (async)

alazy_load() -> AsyncIterator[Document]

A lazy loader for `Document`.

| YIELDS | DESCRIPTION |
|---|---|
| `AsyncIterator[Document]` | The `Document` objects produced by the loader, one at a time. |
## BaseBlobParser

Bases: `ABC`

Abstract interface for blob parsers.

A blob parser provides a way to parse raw data stored in a blob into one or more `Document` objects.

The parser can be composed with blob loaders, making it easy to reuse a parser independent of how the blob was originally loaded.
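The composition described above can be sketched with self-contained stand-ins (`Blob` and `Document` below are minimal dataclasses, not the real `langchain_core` classes, and `LineBlobParser` is a hypothetical parser invented for the example):

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class Blob:
    """Minimal stand-in for langchain_core.documents.base.Blob."""
    data: bytes
    source: str = ""


@dataclass
class Document:
    """Minimal stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineBlobParser:
    """Hypothetical parser: yields one Document per non-empty line."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        for line in blob.data.decode("utf-8").splitlines():
            if line.strip():
                yield Document(page_content=line, metadata={"source": blob.source})

    def parse(self, blob: Blob) -> list[Document]:
        # Eager convenience wrapper over lazy_parse.
        return list(self.lazy_parse(blob))


def parse_all(blobs: Iterable[Blob], parser: LineBlobParser) -> Iterator[Document]:
    # Composition: the parser does not care where the blobs came from.
    for blob in blobs:
        yield from parser.lazy_parse(blob)


blobs = [Blob(b"a,b\n\nc,d", source="mem://x")]
docs = list(parse_all(blobs, LineBlobParser()))
```

The same parser works unchanged whether the blobs come from a filesystem, object storage, or an in-memory list, which is the point of keeping parsing separate from loading.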
| METHOD | DESCRIPTION |
|---|---|
| `lazy_parse` | Lazy parsing interface. |
| `parse` | Eagerly parse the blob into a `Document` or list of `Document` objects. |
### lazy_parse (abstractmethod)

lazy_parse(blob: Blob) -> Iterator[Document]

Lazy parsing interface.
### parse

parse(blob: Blob) -> list[Document]

Eagerly parse the blob into a `Document` or list of `Document` objects.

This is a convenience method for interactive development environments. Production applications should favor the `lazy_parse` method instead.

Subclasses should generally not override this `parse` method.
| PARAMETER | DESCRIPTION |
|---|---|
| `blob` | The `Blob` to parse. TYPE: `Blob` |

| RETURNS | DESCRIPTION |
|---|---|
| `list[Document]` | List of `Document` objects. |
## BlobLoader

Bases: `ABC`

Abstract interface for blob loader implementations.

Implementers should be able to load raw content from a storage system according to some criteria and return the raw content lazily as a stream of blobs.
| METHOD | DESCRIPTION |
|---|---|
| `yield_blobs` | A lazy loader for raw data represented by LangChain's `Blob` object. |
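A `BlobLoader` implementation might look like the following sketch. `Blob` here is again a minimal stand-in dataclass, and `DirectoryBlobLoader` is a hypothetical loader invented for the example (it is not part of `langchain_core`):

```python
from dataclasses import dataclass
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Iterator


@dataclass
class Blob:
    """Minimal stand-in for langchain_core.documents.base.Blob."""
    data: bytes
    source: str


class DirectoryBlobLoader:
    """Hypothetical BlobLoader: lazily yield one Blob per matching file."""

    def __init__(self, root: Path, glob: str = "*.txt") -> None:
        self.root = root
        self.glob = glob

    def yield_blobs(self) -> Iterator[Blob]:
        for path in sorted(self.root.glob(self.glob)):
            # Files are read one at a time, so memory stays bounded
            # no matter how many files the directory contains.
            yield Blob(data=path.read_bytes(), source=str(path))


with TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("hello")
    (root / "b.md").write_text("not matched by the glob")
    blobs = list(DirectoryBlobLoader(root).yield_blobs())
```

The "criteria" mentioned above is the glob pattern here; a real implementation might filter by suffix, size, or modification time instead.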
## LangSmithLoader

Bases: `BaseLoader`

Load LangSmith Dataset examples as `Document` objects.

Loads the example inputs as the `Document` page content and places the entire example into the `Document` metadata. This allows you to easily create few-shot example retrievers from the loaded documents.
Lazy loading example
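A sketch of the lazy-loading pattern, assuming `langchain-core` is installed and a LangSmith API key and dataset are available at run time (the dataset name below is a placeholder):

```python
# Guard the import so this sketch stays importable without the package.
try:
    from langchain_core.document_loaders import LangSmithLoader
except ImportError:
    LangSmithLoader = None


def stream_examples(dataset_name: str, limit: int = 100):
    """Lazily yield Document objects built from LangSmith dataset examples."""
    if LangSmithLoader is None:
        raise RuntimeError("langchain-core is not installed")
    loader = LangSmithLoader(dataset_name=dataset_name, limit=limit)
    # lazy_load() is a generator, so examples are fetched incrementally
    # instead of being read into memory all at once.
    yield from loader.lazy_load()
```

Iterating `stream_examples("my-dataset")` yields one `Document` per example, with the example inputs as page content and the full example in the metadata.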
| METHOD | DESCRIPTION |
|---|---|
| `load` | Load data into `Document` objects. |
| `aload` | Load data into `Document` objects. |
| `load_and_split` | Load `Document` and split into chunks. |
| `alazy_load` | A lazy loader for `Document`. |
| `__init__` | Create a LangSmith loader. |
| `lazy_load` | A lazy loader for `Document`. |
### load

load() -> list[Document]

Load data into `Document` objects.
### aload (async)

aload() -> list[Document]

Load data into `Document` objects.
### load_and_split

load_and_split(text_splitter: TextSplitter | None = None) -> list[Document]

Load `Document` and split into chunks. Chunks are returned as `Document` objects.

**Danger:** Do not override this method. It should be considered deprecated!
| PARAMETER | DESCRIPTION |
|---|---|
| `text_splitter` | `TextSplitter` instance to use for splitting documents. TYPE: `TextSplitter \| None` DEFAULT: `None` |

| RAISES | DESCRIPTION |
|---|---|
| `ImportError` | If `langchain-text-splitters` is not installed and no `text_splitter` is provided. |

| RETURNS | DESCRIPTION |
|---|---|
| `list[Document]` | List of `Document` chunks. |
### alazy_load (async)

alazy_load() -> AsyncIterator[Document]

A lazy loader for `Document`.

| YIELDS | DESCRIPTION |
|---|---|
| `AsyncIterator[Document]` | The `Document` objects produced by the loader, one at a time. |
### `__init__`
__init__(
*,
dataset_id: UUID | str | None = None,
dataset_name: str | None = None,
example_ids: Sequence[UUID | str] | None = None,
as_of: datetime | str | None = None,
splits: Sequence[str] | None = None,
inline_s3_urls: bool = True,
offset: int = 0,
limit: int | None = None,
metadata: dict | None = None,
filter: str | None = None,
content_key: str = "",
format_content: Callable[..., str] | None = None,
client: Client | None = None,
**client_kwargs: Any,
) -> None
Create a LangSmith loader.
| PARAMETER | DESCRIPTION |
|---|---|
| `dataset_id` | The ID of the dataset to filter by. TYPE: `UUID \| str \| None` DEFAULT: `None` |
| `dataset_name` | The name of the dataset to filter by. TYPE: `str \| None` DEFAULT: `None` |
| `content_key` | The inputs key to set as `Document` page content. TYPE: `str` DEFAULT: `''` |
| `format_content` | Function for converting the content extracted from the example inputs into a string. Defaults to JSON-encoding the contents. TYPE: `Callable[..., str] \| None` DEFAULT: `None` |
| `example_ids` | The IDs of the examples to filter by. TYPE: `Sequence[UUID \| str] \| None` DEFAULT: `None` |
| `as_of` | The dataset version tag or timestamp to retrieve the examples as of. Response examples will only be those that were present at the time of the tagged (or timestamped) version. TYPE: `datetime \| str \| None` DEFAULT: `None` |
| `splits` | A list of dataset splits, which are divisions of your dataset such as `'training'`, `'test'`, or `'validation'`. Returns examples only from the specified splits. TYPE: `Sequence[str] \| None` DEFAULT: `None` |
| `inline_s3_urls` | Whether to inline S3 URLs. TYPE: `bool` DEFAULT: `True` |
| `offset` | The offset to start from. TYPE: `int` DEFAULT: `0` |
| `limit` | The maximum number of examples to return. TYPE: `int \| None` DEFAULT: `None` |
| `metadata` | Metadata to filter by. TYPE: `dict \| None` DEFAULT: `None` |
| `filter` | A structured filter string to apply to the examples. TYPE: `str \| None` DEFAULT: `None` |
| `client` | LangSmith `Client`. If not provided, one will be initialized from `client_kwargs`. TYPE: `Client \| None` DEFAULT: `None` |
| `client_kwargs` | Keyword args to pass to the LangSmith client init. Should only be specified if `client` isn't. TYPE: `Any` DEFAULT: `{}` |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If both `client` and `client_kwargs` are specified. |