Document loaders

langchain_core.document_loaders

Document loaders.

BaseLoader

Bases: ABC

Interface for document loaders.

Implementations should implement the lazy-loading method using generators to avoid loading all documents into memory at once.

load is provided just for user convenience and should not be overridden.
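
A minimal sketch of a custom loader that follows this contract. The LineLoader class, its path argument, and the metadata keys are illustrative, not part of the library:

from collections.abc import Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document


class LineLoader(BaseLoader):
    """Hypothetical loader that yields one Document per line of a text file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def lazy_load(self) -> Iterator[Document]:
        # A generator keeps memory flat: only one line is materialized at a time.
        # The default load(), aload(), and alazy_load() implementations build on this method.
        with open(self.path, encoding="utf-8") as f:
            for line_number, line in enumerate(f):
                yield Document(
                    page_content=line,
                    metadata={"source": self.path, "line": line_number},
                )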

METHOD          DESCRIPTION
load            Load data into Document objects.
aload           Load data into Document objects.
load_and_split  Load documents and split them into chunks. Chunks are returned as Document objects.
lazy_load       A lazy loader for Document objects.
alazy_load      A lazy loader for Document objects.

load

load() -> list[Document]

Load data into Document objects.

RETURNS         DESCRIPTION
list[Document]  The documents.

aload async

aload() -> list[Document]

Load data into Document objects.

RETURNS         DESCRIPTION
list[Document]  The documents.
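
A short usage sketch for load and aload, reusing the hypothetical LineLoader from the sketch above:

import asyncio

loader = LineLoader("notes.txt")  # hypothetical loader and file

# Eager loading: every Document is read into memory at once.
docs = loader.load()

# aload is the awaitable counterpart for async code.
docs = asyncio.run(loader.aload())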

load_and_split

load_and_split(text_splitter: TextSplitter | None = None) -> list[Document]

Load documents and split them into chunks. Chunks are returned as Document objects.

Danger

Do not override this method. It should be considered deprecated!

PARAMETER      DESCRIPTION
text_splitter  TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
               TYPE: TextSplitter | None  DEFAULT: None

RAISES       DESCRIPTION
ImportError  If langchain-text-splitters is not installed and no text_splitter is provided.

RETURNS         DESCRIPTION
list[Document]  List of Document objects.
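
A usage sketch for load_and_split, assuming langchain-text-splitters is installed and reusing the hypothetical LineLoader above:

from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = LineLoader("notes.txt")  # hypothetical loader and file
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

# Each returned chunk is itself a Document of at most ~500 characters.
chunks = loader.load_and_split(text_splitter=splitter)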

lazy_load

lazy_load() -> Iterator[Document]

A lazy loader for Document objects.

YIELDS    DESCRIPTION
Document  The Document objects.

alazy_load async

alazy_load() -> AsyncIterator[Document]

A lazy loader for Document objects.

YIELDS    DESCRIPTION
Document  The Document objects.
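
An async iteration sketch, again reusing the hypothetical LineLoader above; stopping early avoids loading the remaining documents:

import asyncio

from langchain_core.documents import Document


async def first_n(n: int) -> list[Document]:
    loader = LineLoader("notes.txt")  # hypothetical loader and file
    docs: list[Document] = []
    async for doc in loader.alazy_load():
        docs.append(doc)
        if len(docs) >= n:
            break
    return docs


docs = asyncio.run(first_n(10))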

BaseBlobParser

Bases: ABC

Abstract interface for blob parsers.

A blob parser provides a way to parse raw data stored in a blob into one or more Document objects.

The parser can be composed with blob loaders, making it easy to reuse a parser independent of how the blob was originally loaded.

METHOD      DESCRIPTION
lazy_parse  Lazy parsing interface.
parse       Eagerly parse the blob into a Document or list of Document objects.

lazy_parse abstractmethod

lazy_parse(blob: Blob) -> Iterator[Document]

Lazy parsing interface.

Subclasses are required to implement this method.

PARAMETER  DESCRIPTION
blob       Blob instance.
           TYPE: Blob

RETURNS             DESCRIPTION
Iterator[Document]  Generator of Document objects.
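
A minimal sketch of a parser implementing lazy_parse. The class name and the paragraph-splitting rule are illustrative only:

from collections.abc import Iterator

from langchain_core.document_loaders import BaseBlobParser, Blob
from langchain_core.documents import Document


class ParagraphParser(BaseBlobParser):
    """Hypothetical parser that emits one Document per blank-line-separated paragraph."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        text = blob.as_string()
        for paragraph in text.split("\n\n"):
            if paragraph.strip():
                yield Document(
                    page_content=paragraph,
                    metadata={"source": blob.source},
                )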

parse

parse(blob: Blob) -> list[Document]

Eagerly parse the blob into a Document or list of Document objects.

This is a convenience method for interactive development environments.

Production applications should favor the lazy_parse method instead.

Subclasses should generally not override this parse method.

PARAMETER  DESCRIPTION
blob       Blob instance.
           TYPE: Blob

RETURNS         DESCRIPTION
list[Document]  List of Document objects.

BlobLoader

Bases: ABC

Abstract interface for blob loader implementations.

Implementations should load raw content from a storage system according to some criteria and return the raw content lazily as a stream of blobs.

METHOD       DESCRIPTION
yield_blobs  A lazy loader for raw data represented by LangChain's Blob object.

yield_blobs abstractmethod

yield_blobs() -> Iterable[Blob]

A lazy loader for raw data represented by LangChain's Blob object.

RETURNS         DESCRIPTION
Iterable[Blob]  A generator over blobs.
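
A sketch of a blob loader that streams .txt files from a local directory, followed by composition with the hypothetical ParagraphParser above; class names and paths are illustrative:

from collections.abc import Iterable
from pathlib import Path

from langchain_core.document_loaders import Blob, BlobLoader


class TxtDirectoryBlobLoader(BlobLoader):
    """Hypothetical loader that streams every .txt file under a directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def yield_blobs(self) -> Iterable[Blob]:
        # One Blob per file, produced lazily so the directory can be arbitrarily large.
        for path in self.root.rglob("*.txt"):
            yield Blob.from_path(path)


# Because loading and parsing are decoupled, the same parser works with any blob loader.
blob_loader = TxtDirectoryBlobLoader("./corpus")
parser = ParagraphParser()

for blob in blob_loader.yield_blobs():
    for doc in parser.lazy_parse(blob):
        ...  # index, embed, or otherwise process each Document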

LangSmithLoader

Bases: BaseLoader

Load LangSmith Dataset examples as Document objects.

Loads the example inputs as the Document page content and places the entire example into the Document metadata. This allows you to easily create few-shot example retrievers from the loaded documents.

Lazy loading example
from langchain_core.document_loaders import LangSmithLoader

loader = LangSmithLoader(dataset_id="...", limit=100)
docs = []
for doc in loader.lazy_load():
    docs.append(doc)
# -> [Document("...", metadata={"inputs": {...}, "outputs": {...}, ...}), ...]

METHOD          DESCRIPTION
load            Load data into Document objects.
aload           Load data into Document objects.
load_and_split  Load documents and split them into chunks. Chunks are returned as Document objects.
alazy_load      A lazy loader for Document objects.
__init__        Create a LangSmith loader.
lazy_load       A lazy loader for Document objects.

load

load() -> list[Document]

Load data into Document objects.

RETURNS         DESCRIPTION
list[Document]  The documents.

aload async

aload() -> list[Document]

Load data into Document objects.

RETURNS         DESCRIPTION
list[Document]  The documents.

load_and_split

load_and_split(text_splitter: TextSplitter | None = None) -> list[Document]

Load documents and split them into chunks. Chunks are returned as Document objects.

Danger

Do not override this method. It should be considered deprecated!

PARAMETER      DESCRIPTION
text_splitter  TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
               TYPE: TextSplitter | None  DEFAULT: None

RAISES       DESCRIPTION
ImportError  If langchain-text-splitters is not installed and no text_splitter is provided.

RETURNS         DESCRIPTION
list[Document]  List of Document objects.

alazy_load async

alazy_load() -> AsyncIterator[Document]

A lazy loader for Document objects.

YIELDS    DESCRIPTION
Document  The Document objects.

__init__

__init__(
    *,
    dataset_id: UUID | str | None = None,
    dataset_name: str | None = None,
    example_ids: Sequence[UUID | str] | None = None,
    as_of: datetime | str | None = None,
    splits: Sequence[str] | None = None,
    inline_s3_urls: bool = True,
    offset: int = 0,
    limit: int | None = None,
    metadata: dict | None = None,
    filter: str | None = None,
    content_key: str = "",
    format_content: Callable[..., str] | None = None,
    client: Client | None = None,
    **client_kwargs: Any,
) -> None

Create a LangSmith loader.

PARAMETER       DESCRIPTION
dataset_id      The ID of the dataset to filter by.
                TYPE: UUID | str | None  DEFAULT: None
dataset_name    The name of the dataset to filter by.
                TYPE: str | None  DEFAULT: None
content_key     The inputs key to set as Document page content. '.' characters are interpreted as nested keys. E.g. content_key="first.second" will result in Document(page_content=format_content(example.inputs["first"]["second"])). A usage sketch follows below.
                TYPE: str  DEFAULT: ''
format_content  Function for converting the content extracted from the example inputs into a string. Defaults to JSON-encoding the contents.
                TYPE: Callable[..., str] | None  DEFAULT: None
example_ids     The IDs of the examples to filter by.
                TYPE: Sequence[UUID | str] | None  DEFAULT: None
as_of           The dataset version tag or timestamp to retrieve the examples as of. Response examples will only be those that were present at the time of the tagged (or timestamped) version.
                TYPE: datetime | str | None  DEFAULT: None
splits          A list of dataset splits, which are divisions of your dataset such as train, test, or validation. Returns examples only from the specified splits.
                TYPE: Sequence[str] | None  DEFAULT: None
inline_s3_urls  Whether to inline S3 URLs.
                TYPE: bool  DEFAULT: True
offset          The offset to start from.
                TYPE: int  DEFAULT: 0
limit           The maximum number of examples to return.
                TYPE: int | None  DEFAULT: None
metadata        Metadata to filter by.
                TYPE: dict | None  DEFAULT: None
filter          A structured filter string to apply to the examples.
                TYPE: str | None  DEFAULT: None
client          LangSmith Client. If not provided, one will be initialized from client_kwargs.
                TYPE: Client | None  DEFAULT: None
client_kwargs   Keyword arguments to pass to the LangSmith Client constructor. Should only be specified if client is not provided.
                TYPE: Any  DEFAULT: {}

RAISES      DESCRIPTION
ValueError  If both client and client_kwargs are provided.
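
A construction sketch showing content_key and format_content together; the dataset name and the "question" input key are placeholders:

from langchain_core.document_loaders import LangSmithLoader

# Hypothetical dataset whose examples look like {"inputs": {"question": "..."}, ...}.
loader = LangSmithLoader(
    dataset_name="my-dataset",
    content_key="question",           # page_content comes from example.inputs["question"]
    format_content=lambda q: str(q),  # replace the default JSON encoding
    limit=50,
)

for doc in loader.lazy_load():
    print(doc.page_content)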

lazy_load

lazy_load() -> Iterator[Document]

A lazy loader for Document objects.

YIELDS    DESCRIPTION
Document  The Document objects.