Module contains logic for indexing documents into vector stores.
Index data from the loader into the vector store.
Indexing functionality uses a manager to keep track of which documents are in the vector store.
This allows us to keep track of which documents were updated, which were deleted, and which should be skipped.
For the time being, documents are indexed using their hashes, and users are not able to specify the uid of the document.
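As a rough illustration of hash-based indexing, the sketch below derives a deterministic digest from a document's text and metadata and uses it to skip documents that have already been seen. The function names, the SHA-256 choice, and the JSON serialization are assumptions made for this example; they are not taken from the library's implementation.

```python
import hashlib
import json


def document_hash(page_content: str, metadata: dict) -> str:
    """Return a deterministic hex digest for a document's content and metadata."""
    # Sort keys so that metadata dicts with the same items hash identically.
    payload = json.dumps(
        {"page_content": page_content, "metadata": metadata}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def split_new_and_seen(docs, seen_hashes):
    """Partition (content, metadata) pairs into new documents and skipped ones."""
    new_docs, skipped = [], []
    for content, metadata in docs:
        digest = document_hash(content, metadata)
        if digest in seen_hashes:
            skipped.append((content, metadata))
        else:
            seen_hashes.add(digest)
            new_docs.append((content, metadata))
    return new_docs, skipped
```

Because the identifier is derived from the content itself, re-submitting an unchanged document yields the same digest and the document can be skipped, which is also why a caller-chosen uid is not needed in this scheme.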
langchain-core 0.3.25: Added scoped_full cleanup mode.
scoped_full mode is suitable if determining an appropriate batch size
is challenging or if your data loader cannot return the entire dataset at
once. This mode keeps track of source IDs in memory, which should be fine
for most use cases. If your dataset is large (10M+ docs), you will likely
need to parallelize the indexing process regardless.
Asynchronously index data from the loader into the vector store.
Indexing functionality uses a manager to keep track of which documents are in the vector store.
This allows us to keep track of which documents were updated, which were deleted, and which should be skipped.
For the time being, documents are indexed using their hashes, and users are not able to specify the uid of the document.
langchain-core 0.3.25: Added scoped_full cleanup mode.
scoped_full mode is suitable if determining an appropriate batch size
is challenging or if your data loader cannot return the entire dataset at
once. This mode keeps track of source IDs in memory, which should be fine
for most use cases. If your dataset is large (10M+ docs), you will likely
need to parallelize the indexing process regardless.
Interface for document loader.
Implementations should implement the lazy-loading method using generators to avoid loading all documents into memory at once.
load is provided just for user convenience and should not be overridden.
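The lazy-loading pattern described above can be sketched without any library dependency. LineLoader is a hypothetical class invented for this example, and plain dicts stand in for documents; only the split between a generator-based lazy_load and a convenience load mirrors the text.

```python
from typing import Iterator


class LineLoader:
    """Hypothetical loader that yields one record per non-empty line of text."""

    def __init__(self, text: str) -> None:
        self.text = text

    def lazy_load(self) -> Iterator[dict]:
        # A generator: each record is produced only when the caller asks for it.
        for number, line in enumerate(self.text.splitlines(), start=1):
            if line.strip():
                yield {"page_content": line.strip(), "metadata": {"line": number}}

    def load(self) -> list:
        # Convenience wrapper; subclasses would override lazy_load, not this.
        return list(self.lazy_load())
```

Iterating over lazy_load() keeps only one record in memory at a time, while load() materializes the whole list and is best reserved for small inputs.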
Class for storing a piece of text and associated metadata.
Document is for retrieval workflows, not chat I/O. For sending text
to an LLM in a conversation, use message types from langchain.messages.
General LangChain exception.
A document retriever that supports indexing operations.
This indexing interface is designed to be a generic abstraction for storing and querying documents that have an ID and associated metadata.
The interface is designed to be agnostic to the underlying implementation of the indexing system.
The interface is designed to support the following operations:
Abstract base class representing the interface for a record manager.
The record manager abstraction is used by the langchain indexing API.
The record manager keeps track of which documents have been
written into a VectorStore and when they were written.
The indexing API computes hashes for each document and stores the hash together with the write time and the source id in the record manager.
On subsequent indexing runs, the indexing API can check the record manager to determine which documents have already been indexed and which have not.
This allows the indexing API to avoid re-indexing documents that have already been indexed, and to only index new documents.
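A minimal sketch of that bookkeeping, assuming an in-memory dictionary as the backing store. TinyRecordManager and keys_to_index are hypothetical names for this example, and the keys stand for document hashes; the real record manager interface is richer than this.

```python
import time


class TinyRecordManager:
    """Hypothetical in-memory record manager: key -> (write time, source id)."""

    def __init__(self) -> None:
        self.records = {}

    def exists(self, keys):
        # One boolean per key, in the same order as the input.
        return [key in self.records for key in keys]

    def update(self, keys, source_ids):
        # Store the write time and source id alongside each key.
        now = time.time()
        for key, source_id in zip(keys, source_ids):
            self.records[key] = (now, source_id)


def keys_to_index(manager, keys, source_ids):
    """Return only the keys not yet recorded, then record all keys as written."""
    flags = manager.exists(keys)
    fresh = [key for key, seen in zip(keys, flags) if not seen]
    manager.update(keys, source_ids)
    return fresh
```

On a second run with overlapping keys, only the keys that were never recorded are returned, so only those documents would need to be embedded and written.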
The main benefit of this abstraction is that it works across many vectorstores.
To be supported, a VectorStore only needs the ability to add and
delete documents by ID. Using the record manager, the indexing API will
be able to delete outdated documents and avoid redundant indexing of documents
that have already been indexed.
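To make the add-and-delete-by-ID requirement concrete, here is a simplified sketch in which a plain dict stands in for the vector store and a run counter stands in for write times. It assumes every run sees the entire dataset, so any key not seen in the current run is treated as outdated. The sync function and the result key names are illustrative, not the library's API.

```python
def sync(store: dict, records: dict, run_id: int, docs: dict) -> dict:
    """Index docs (key -> text) for one run, then delete keys the run did not see.

    store stands in for a vector store that supports add and delete by ID;
    records maps each key to the run in which it was last seen.
    """
    added = 0
    for key, text in docs.items():
        if key not in records:
            store[key] = text  # add by ID
            added += 1
        records[key] = run_id  # refresh "last seen" even when skipping
    # Anything last seen in an earlier run is outdated under this assumption.
    stale = [key for key, seen_in in records.items() if seen_in < run_id]
    for key in stale:
        del store[key]  # delete by ID
        del records[key]
    return {
        "num_added": added,
        "num_skipped": len(docs) - added,
        "num_deleted": len(stale),
    }
```

Nothing in the sketch depends on how the store embeds or searches documents, which is the sense in which the approach carries over to many vector stores.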
The main constraints of this abstraction are:
VectorStore fails.
Interface for vector store.
Raised when an indexing operation fails.
Return a detailed breakdown of the result of the indexing operation.