Async index data from the loader into the vector store.

Indexing uses a record manager to keep track of which documents are in the vector store, which makes it possible to detect which documents were updated, which were deleted, and which should be skipped. For the time being, documents are indexed using their hashes, and users are not able to specify the uid of the document.
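A minimal sketch of the record-manager flow described above, using langchain-core's in-memory implementations; the namespace, embedding size, and sample documents are illustrative assumptions, not part of the API:

```python
import asyncio

from langchain_core.documents import Document
from langchain_core.embeddings import DeterministicFakeEmbedding
from langchain_core.indexing import InMemoryRecordManager, aindex
from langchain_core.vectorstores import InMemoryVectorStore

async def main() -> None:
    # The record manager remembers document hashes between runs.
    record_manager = InMemoryRecordManager(namespace="demo/docs")
    await record_manager.acreate_schema()
    vector_store = InMemoryVectorStore(DeterministicFakeEmbedding(size=128))

    docs = [
        Document(page_content="alpha", metadata={"source": "a.txt"}),
        Document(page_content="beta", metadata={"source": "b.txt"}),
    ]

    # First run: both documents are new, so both are added.
    result = await aindex(
        docs, record_manager, vector_store,
        cleanup="incremental", source_id_key="source",
    )
    print(result)  # e.g. {'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

    # Second run with unchanged content: hashes match, so both are skipped.
    result = await aindex(
        docs, record_manager, vector_store,
        cleanup="incremental", source_id_key="source",
    )
    print(result)  # e.g. {'num_skipped': 2, ...}

asyncio.run(main())
```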
Added in langchain-core 0.3.25: the scoped_full cleanup mode.

scoped_full mode is suitable if determining an appropriate batch size is challenging or if your data loader cannot return the entire dataset at once. This mode keeps track of source IDs in memory, which should be fine for most use cases. If your dataset is large (10M+ docs), you will likely need to parallelize the indexing process regardless.
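As a sketch of the case the note describes, the following pairs scoped_full with a streaming loader; stream_docs is a hypothetical async generator standing in for a loader that cannot return the entire dataset at once:

```python
import asyncio
from collections.abc import AsyncIterator

from langchain_core.documents import Document
from langchain_core.embeddings import DeterministicFakeEmbedding
from langchain_core.indexing import InMemoryRecordManager, aindex
from langchain_core.vectorstores import InMemoryVectorStore

async def stream_docs() -> AsyncIterator[Document]:
    # Stand-in for a loader that yields documents as they become available.
    for i in range(1_000):
        yield Document(
            page_content=f"chunk {i}",
            metadata={"source": f"file-{i % 10}.txt"},
        )

async def main() -> None:
    record_manager = InMemoryRecordManager(namespace="demo/stream")
    await record_manager.acreate_schema()
    vector_store = InMemoryVectorStore(DeterministicFakeEmbedding(size=128))

    result = await aindex(
        stream_docs(),
        record_manager,
        vector_store,
        cleanup="scoped_full",  # clean up stale docs only for sources seen in this run
        source_id_key="source",
    )
    print(result)

asyncio.run(main())
```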
```python
aindex(
    docs_source: BaseLoader | Iterable[Document] | AsyncIterator[Document],
    record_manager: RecordManager,
    vector_store: VectorStore | DocumentIndex,
    *,
    batch_size: int = 100,
    cleanup: Literal['incremental', 'full', 'scoped_full'] | None = None,
    source_id_key: str | Callable[[Document], str] | None = None,
    cleanup_batch_size: int = 1000,
    force_update: bool = False,
    key_encoder: Literal['sha1', 'sha256', 'sha512', 'blake2b'] | Callable[[Document], str] = 'sha1',
    upsert_kwargs: dict[str, Any] | None = None,
) -> IndexingResult
```

| Name | Type | Description |
|---|---|---|
| `docs_source`* | `BaseLoader \| Iterable[Document] \| AsyncIterator[Document]` | Data loader or iterable of documents to index. |
| `record_manager`* | `RecordManager` | Timestamped set that keeps track of which documents were updated. |
| `vector_store`* | `VectorStore \| DocumentIndex` | VectorStore or DocumentIndex to index the documents into. |
| `batch_size` | `int` | Default: `100`. Batch size to use when indexing. |
| `cleanup` | `Literal['incremental', 'full', 'scoped_full'] \| None` | Default: `None`. How to handle cleanup of documents. |
| `source_id_key` | `str \| Callable[[Document], str] \| None` | Default: `None`. Optional key that helps identify the original source of the document. |
| `cleanup_batch_size` | `int` | Default: `1000`. Batch size to use when cleaning up documents. |
| `force_update` | `bool` | Default: `False`. Force update of documents even if they are present in the record manager. Useful if you are re-indexing with updated embeddings. |
| `key_encoder` | `Literal['sha1', 'sha256', 'sha512', 'blake2b'] \| Callable[[Document], str]` | Default: `'sha1'`. Hashing algorithm to use for hashing the document content and metadata. If not provided, a default encoder using SHA-1 is used. SHA-1 is not collision-resistant, and a motivated attacker could craft two different texts that hash to the same cache key, so new applications should use `'sha256'`, `'sha512'`, `'blake2b'`, or a custom, strong key encoder function. When changing the key encoder, you must change the index as well to avoid duplicated documents in the cache. |
| `upsert_kwargs` | `dict[str, Any] \| None` | Default: `None`. Additional keyword arguments to pass to the `add_documents` method of the `VectorStore` or the `upsert` method of the `DocumentIndex`. |
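Given the SHA-1 caveat on `key_encoder`, a sketch of passing a stronger custom encoder; `blake2b_key` is a hypothetical helper (the built-in `'blake2b'` option achieves the same without custom code), and it reuses `record_manager`, `vector_store`, and `docs` from the first sketch above. Remember that switching encoders on an existing index will duplicate documents, so start from a fresh index:

```python
import hashlib

from langchain_core.documents import Document

def blake2b_key(doc: Document) -> str:
    # Hash both content and metadata so a change to either triggers re-indexing.
    h = hashlib.blake2b()
    h.update(doc.page_content.encode("utf-8"))
    h.update(repr(sorted(doc.metadata.items())).encode("utf-8"))
    return h.hexdigest()

# Inside the same async context as the first sketch:
result = await aindex(
    docs,
    record_manager,
    vector_store,
    cleanup="incremental",
    source_id_key="source",
    key_encoder=blake2b_key,  # or simply key_encoder="blake2b"
)
```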