Document loaders

langchain_core.document_loaders

Document loaders.

BaseLoader

Bases: ABC

Interface for document loaders.

Implementations should implement the lazy-loading method using generators to avoid loading all documents into memory at once.

load is provided just for user convenience and should not be overridden.
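
A minimal sketch of a custom loader that follows this contract. The LineLoader class, its path argument, and the metadata keys are illustrative, not part of the library:

from collections.abc import Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document


class LineLoader(BaseLoader):
    """Hypothetical loader that yields one Document per line of a text file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def lazy_load(self) -> Iterator[Document]:
        # A generator keeps memory flat: only one line is materialized at a time.
        # The default load(), aload(), and alazy_load() implementations build on this method.
        with open(self.path, encoding="utf-8") as f:
            for line_number, line in enumerate(f):
                yield Document(
                    page_content=line,
                    metadata={"source": self.path, "line": line_number},
                )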

METHOD          DESCRIPTION
load            Load data into Document objects.
aload           Load data into Document objects.
load_and_split  Load documents and split them into chunks. Chunks are returned as Document objects.
lazy_load       A lazy loader for Document objects.
alazy_load      A lazy loader for Document objects.

load

load() -> list[Document]

Load data into Document objects.

RETURNS         DESCRIPTION
list[Document]  The documents.

aload async

aload() -> list[Document]

Load data into Document objects.

RETURNS         DESCRIPTION
list[Document]  The documents.
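
A short usage sketch for load and aload, reusing the hypothetical LineLoader from the sketch above:

import asyncio

loader = LineLoader("notes.txt")  # hypothetical loader and file

# Eager loading: every Document is read into memory at once.
docs = loader.load()

# aload is the awaitable counterpart for async code.
docs = asyncio.run(loader.aload())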

load_and_split

load_and_split(text_splitter: TextSplitter | None = None) -> list[Document]

Load documents and split them into chunks. Chunks are returned as Document objects.

Danger

Do not override this method. It should be considered deprecated!

PARAMETER      DESCRIPTION
text_splitter  TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
               TYPE: TextSplitter | None  DEFAULT: None

RAISES       DESCRIPTION
ImportError  If langchain-text-splitters is not installed and no text_splitter is provided.

RETURNS         DESCRIPTION
list[Document]  List of Document objects.
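
A usage sketch for load_and_split, assuming langchain-text-splitters is installed and reusing the hypothetical LineLoader above:

from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = LineLoader("notes.txt")  # hypothetical loader and file
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

# Each returned chunk is itself a Document of at most ~500 characters.
chunks = loader.load_and_split(text_splitter=splitter)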

lazy_load

lazy_load() -> Iterator[Document]

A lazy loader for Document objects.

YIELDS    DESCRIPTION
Document  The Document objects.

alazy_load async

alazy_load() -> AsyncIterator[Document]

A lazy loader for Document objects.

YIELDS    DESCRIPTION
Document  The Document objects.
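
An async iteration sketch, again reusing the hypothetical LineLoader above; stopping early avoids loading the remaining documents:

import asyncio

from langchain_core.documents import Document


async def first_n(n: int) -> list[Document]:
    loader = LineLoader("notes.txt")  # hypothetical loader and file
    docs: list[Document] = []
    async for doc in loader.alazy_load():
        docs.append(doc)
        if len(docs) >= n:
            break
    return docs


docs = asyncio.run(first_n(10))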

BaseBlobParser

Bases: ABC

Abstract interface for blob parsers.

A blob parser provides a way to parse raw data stored in a blob into one or more Document objects.

The parser can be composed with blob loaders, making it easy to reuse a parser independent of how the blob was originally loaded.

METHOD      DESCRIPTION
lazy_parse  Lazy parsing interface.
parse       Eagerly parse the blob into a Document or list of Document objects.

lazy_parse abstractmethod

lazy_parse(blob: Blob) -> Iterator[Document]

Lazy parsing interface.

Subclasses are required to implement this method.

PARAMETER  DESCRIPTION
blob       Blob instance.
           TYPE: Blob

RETURNS             DESCRIPTION
Iterator[Document]  Generator of Document objects.
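
A minimal sketch of a parser implementing lazy_parse. The class name and the paragraph-splitting rule are illustrative only:

from collections.abc import Iterator

from langchain_core.document_loaders import BaseBlobParser, Blob
from langchain_core.documents import Document


class ParagraphParser(BaseBlobParser):
    """Hypothetical parser that emits one Document per blank-line-separated paragraph."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        text = blob.as_string()
        for paragraph in text.split("\n\n"):
            if paragraph.strip():
                yield Document(
                    page_content=paragraph,
                    metadata={"source": blob.source},
                )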

parse

parse(blob: Blob) -> list[Document]

Eagerly parse the blob into a Document or list of Document objects.

This is a convenience method for interactive development environments.

Production applications should favor the lazy_parse method instead.

Subclasses should generally not override this parse method.

PARAMETER  DESCRIPTION
blob       Blob instance.
           TYPE: Blob

RETURNS         DESCRIPTION
list[Document]  List of Document objects.

BlobLoader

Bases: ABC

Abstract interface for blob loader implementations.

Implementations should load raw content from a storage system according to some criteria and return the raw content lazily as a stream of blobs.

METHOD       DESCRIPTION
yield_blobs  A lazy loader for raw data represented by LangChain's Blob object.

yield_blobs abstractmethod

yield_blobs() -> Iterable[Blob]

A lazy loader for raw data represented by LangChain's Blob object.

RETURNS         DESCRIPTION
Iterable[Blob]  A generator over blobs.
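
A sketch of a blob loader that streams .txt files from a local directory, followed by composition with the hypothetical ParagraphParser above; class names and paths are illustrative:

from collections.abc import Iterable
from pathlib import Path

from langchain_core.document_loaders import Blob, BlobLoader


class TxtDirectoryBlobLoader(BlobLoader):
    """Hypothetical loader that streams every .txt file under a directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def yield_blobs(self) -> Iterable[Blob]:
        # One Blob per file, produced lazily so the directory can be arbitrarily large.
        for path in self.root.rglob("*.txt"):
            yield Blob.from_path(path)


# Because loading and parsing are decoupled, the same parser works with any blob loader.
blob_loader = TxtDirectoryBlobLoader("./corpus")
parser = ParagraphParser()

for blob in blob_loader.yield_blobs():
    for doc in parser.lazy_parse(blob):
        ...  # index, embed, or otherwise process each Document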

LangSmithLoader

Bases: BaseLoader

Load LangSmith Dataset examples as Document objects.

Loads the example inputs as the Document page content and places the entire example into the Document metadata. This allows you to easily create few-shot example retrievers from the loaded documents.

Lazy loading example
from langchain_core.document_loaders import LangSmithLoader

loader = LangSmithLoader(dataset_id="...", limit=100)
docs = []
for doc in loader.lazy_load():
    docs.append(doc)
# -> [Document("...", metadata={"inputs": {...}, "outputs": {...}, ...}), ...]

METHOD          DESCRIPTION
load            Load data into Document objects.
aload           Load data into Document objects.
load_and_split  Load documents and split them into chunks. Chunks are returned as Document objects.
alazy_load      A lazy loader for Document objects.
__init__        Create a LangSmith loader.
lazy_load       A lazy loader for Document objects.

load

load() -> list[Document]

Load data into Document objects.

RETURNS         DESCRIPTION
list[Document]  The documents.

aload async

aload() -> list[Document]

Load data into Document objects.

RETURNS         DESCRIPTION
list[Document]  The documents.

load_and_split

load_and_split(text_splitter: TextSplitter | None = None) -> list[Document]

Load documents and split them into chunks. Chunks are returned as Document objects.

Danger

Do not override this method. It should be considered deprecated!

PARAMETER      DESCRIPTION
text_splitter  TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
               TYPE: TextSplitter | None  DEFAULT: None

RAISES       DESCRIPTION
ImportError  If langchain-text-splitters is not installed and no text_splitter is provided.

RETURNS         DESCRIPTION
list[Document]  List of Document objects.

alazy_load async

alazy_load() -> AsyncIterator[Document]

A lazy loader for Document objects.

YIELDS    DESCRIPTION
Document  The Document objects.

__init__

__init__(
    *,
    dataset_id: UUID | str | None = None,
    dataset_name: str | None = None,
    example_ids: Sequence[UUID | str] | None = None,
    as_of: datetime | str | None = None,
    splits: Sequence[str] | None = None,
    inline_s3_urls: bool = True,
    offset: int = 0,
    limit: int | None = None,
    metadata: dict | None = None,
    filter: str | None = None,
    content_key: str = "",
    format_content: Callable[..., str] | None = None,
    client: Client | None = None,
    **client_kwargs: Any,
) -> None

Create a LangSmith loader.

PARAMETER       DESCRIPTION
dataset_id      The ID of the dataset to filter by.
                TYPE: UUID | str | None  DEFAULT: None
dataset_name    The name of the dataset to filter by.
                TYPE: str | None  DEFAULT: None
content_key     The inputs key to set as Document page content. '.' characters are interpreted as nested keys. E.g. content_key="first.second" will result in Document(page_content=format_content(example.inputs["first"]["second"])). A usage sketch follows below.
                TYPE: str  DEFAULT: ''
format_content  Function for converting the content extracted from the example inputs into a string. Defaults to JSON-encoding the contents.
                TYPE: Callable[..., str] | None  DEFAULT: None
example_ids     The IDs of the examples to filter by.
                TYPE: Sequence[UUID | str] | None  DEFAULT: None
as_of           The dataset version tag or timestamp to retrieve the examples as of. Response examples will only be those that were present at the time of the tagged (or timestamped) version.
                TYPE: datetime | str | None  DEFAULT: None
splits          A list of dataset splits, which are divisions of your dataset such as train, test, or validation. Returns examples only from the specified splits.
                TYPE: Sequence[str] | None  DEFAULT: None
inline_s3_urls  Whether to inline S3 URLs.
                TYPE: bool  DEFAULT: True
offset          The offset to start from.
                TYPE: int  DEFAULT: 0
limit           The maximum number of examples to return.
                TYPE: int | None  DEFAULT: None
metadata        Metadata to filter by.
                TYPE: dict | None  DEFAULT: None
filter          A structured filter string to apply to the examples.
                TYPE: str | None  DEFAULT: None
client          LangSmith Client. If not provided, one will be initialized from client_kwargs.
                TYPE: Client | None  DEFAULT: None
client_kwargs   Keyword arguments to pass to the LangSmith Client constructor. Should only be specified if client is not provided.
                TYPE: Any  DEFAULT: {}

RAISES      DESCRIPTION
ValueError  If both client and client_kwargs are provided.
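
A construction sketch showing content_key and format_content together; the dataset name and the "question" input key are placeholders:

from langchain_core.document_loaders import LangSmithLoader

# Hypothetical dataset whose examples look like {"inputs": {"question": "..."}, ...}.
loader = LangSmithLoader(
    dataset_name="my-dataset",
    content_key="question",           # page_content comes from example.inputs["question"]
    format_content=lambda q: str(q),  # replace the default JSON encoding
    limit=50,
)

for doc in loader.lazy_load():
    print(doc.page_content)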

lazy_load

lazy_load() -> Iterator[Document]

A lazy loader for Document objects.

YIELDS    DESCRIPTION
Document  The Document objects.