# Embeddings

Modules:

| Name | Description |
| --- | --- |
| `base` | Factory functions for embeddings. |
| `cache` | Code for a cache-backed embedder. |
Classes:

| Name | Description |
| --- | --- |
| `Embeddings` | Interface for embedding models. |
| `CacheBackedEmbeddings` | Interface for caching results from embedding models. |

Functions:

| Name | Description |
| --- | --- |
| `init_embeddings` | Initialize an embeddings model from a model name and optional provider. |
## `Embeddings`

Bases: `ABC`
Interface for embedding models.
This is an interface meant for implementing text embedding models.
Text embedding models are used to map text to a vector (a point in n-dimensional space).
Texts that are similar will usually be mapped to points that are close to each other in this space. The exact details of what's considered "similar" and how "distance" is measured in this space are dependent on the specific embedding model.
This abstraction contains a method for embedding a list of documents and a method for embedding a query text. The embedding of a query text is expected to be a single vector, while the embedding of a list of documents is expected to be a list of vectors.
Usually the query embedding is identical to the document embedding, but the abstraction allows treating them independently.
In addition to the synchronous methods, this interface also provides asynchronous versions of them.
By default, the asynchronous methods are implemented in terms of the synchronous ones; however, implementations may override them with natively async implementations for performance reasons.
Methods:

| Name | Description |
| --- | --- |
| `embed_documents` | Embed search docs. |
| `embed_query` | Embed query text. |
| `aembed_documents` | Asynchronously embed search docs. |
| `aembed_query` | Asynchronously embed query text. |
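As a sketch of what an implementation looks like, the following toy subclass satisfies the interface. The class name and hashing scheme are made up for illustration; the import path assumes the interface lives in `langchain_core.embeddings`:

```python
from langchain_core.embeddings import Embeddings


class CharBucketEmbeddings(Embeddings):
    """Toy deterministic embedder: buckets character codes into a fixed-size vector."""

    def __init__(self, size: int = 8) -> None:
        self.size = size

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # Document and query embeddings are computed the same way here,
        # as is typical (the interface allows them to differ).
        return [self.embed_query(text) for text in texts]

    def embed_query(self, text: str) -> list[float]:
        vector = [0.0] * self.size
        for i, char in enumerate(text):
            vector[i % self.size] += ord(char) / 255.0
        return vector
```

Note that `aembed_documents` and `aembed_query` need not be overridden: the base class provides default implementations that delegate to the synchronous methods.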
## `CacheBackedEmbeddings`

Bases: `Embeddings`
Interface for caching results from embedding models.
The interface works with any store that implements the abstract store interface, accepting keys of type `str` and values of type `list[float]`.
If need be, the interface can be extended to accept other implementations of the value serializer and deserializer, as well as the key encoder.
Note that by default only document embeddings are cached. To cache query embeddings too, pass a `query_embedding_store` to the constructor.
Examples:

```python
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_community.embeddings import OpenAIEmbeddings

store = LocalFileStore("./my_cache")
underlying_embedder = OpenAIEmbeddings()
embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embedder, store, namespace=underlying_embedder.model
)

# Embeddings are computed and cached
embeddings = embedder.embed_documents(["hello", "goodbye"])

# Embeddings are retrieved from the cache; no computation is done
embeddings = embedder.embed_documents(["hello", "goodbye"])
```
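Query embeddings can be cached too. A minimal variant of the example above (reusing the same `store` and `underlying_embedder`), sharing the document cache for queries:

```python
embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embedder,
    store,
    namespace=underlying_embedder.model,
    query_embedding_cache=True,  # reuse the document cache for queries
)

# First call computes and caches; the second is a cache hit
vector = embedder.embed_query("hello")
vector = embedder.embed_query("hello")
```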
Methods:

| Name | Description |
| --- | --- |
| `__init__` | Initialize the embedder. |
| `embed_documents` | Embed a list of texts. |
| `aembed_documents` | Asynchronously embed a list of texts. |
| `embed_query` | Embed query text. |
| `aembed_query` | Asynchronously embed query text. |
| `from_bytes_store` | On-ramp that adds the necessary serialization and encoding to the store. |
### `__init__`

```python
__init__(
    underlying_embeddings: Embeddings,
    document_embedding_store: BaseStore[str, list[float]],
    *,
    batch_size: int | None = None,
    query_embedding_store: BaseStore[str, list[float]] | None = None,
) -> None
```

Initialize the embedder.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `underlying_embeddings` | `Embeddings` | The embedder to use for computing embeddings. | required |
| `document_embedding_store` | `BaseStore[str, list[float]]` | The store to use for caching document embeddings. | required |
| `batch_size` | `int \| None` | The number of documents to embed between store updates. | `None` |
| `query_embedding_store` | `BaseStore[str, list[float]] \| None` | The store to use for caching query embeddings. If `None`, query embeddings are not cached. | `None` |
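For illustration, direct construction might look like the following sketch. `MyEmbeddings` is a hypothetical `Embeddings` implementation, and `InMemoryStore` (a simple in-memory `BaseStore`) is assumed to be importable from `langchain.storage`:

```python
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import InMemoryStore

underlying = MyEmbeddings()  # hypothetical Embeddings implementation

embedder = CacheBackedEmbeddings(
    underlying,
    InMemoryStore(),                        # document embedding cache
    batch_size=32,                          # write to the store every 32 docs
    query_embedding_store=InMemoryStore(),  # also cache query embeddings
)
```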
### `embed_documents`
Embed a list of texts.
The method first checks the cache for the embeddings. If the embeddings are not found, the method uses the underlying embedder to embed the documents and stores the results in the cache.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `texts` | `list[str]` | A list of texts to embed. | required |

Returns:

| Type | Description |
| --- | --- |
| `list[list[float]]` | A list of embeddings for the given texts. |
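For example, continuing with the `embedder` from the examples above, a call that mixes cached and new texts only computes the new ones:

```python
embedder.embed_documents(["hello", "goodbye"])   # both computed and cached
embedder.embed_documents(["hello", "new text"])  # only "new text" is computed
```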
### `aembed_documents` `async`
Embed a list of texts.
The method first checks the cache for the embeddings. If the embeddings are not found, the method uses the underlying embedder to embed the documents and stores the results in the cache.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `texts` | `list[str]` | A list of texts to embed. | required |

Returns:

| Type | Description |
| --- | --- |
| `list[list[float]]` | A list of embeddings for the given texts. |
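Usage mirrors `embed_documents` but must be awaited; a minimal sketch assuming an `embedder` constructed as above:

```python
import asyncio


async def main() -> None:
    vectors = await embedder.aembed_documents(["hello", "goodbye"])
    print(len(vectors), len(vectors[0]))


asyncio.run(main())
```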
### `embed_query`

Embed query text.

By default, this method does not cache queries. To enable query caching, pass a `query_embedding_store` to the constructor, or set `query_embedding_cache` when creating the embedder with `from_bytes_store`.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | The text to embed. | required |

Returns:

| Type | Description |
| --- | --- |
| `list[float]` | The embedding for the given text. |
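In contrast to `embed_documents`, the result is a single vector:

```python
vector = embedder.embed_query("What is LangChain?")
# A single embedding: list[float], not list[list[float]].
# Without a query embedding cache, each call recomputes this.
```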
### `aembed_query` `async`

Embed query text.

By default, this method does not cache queries. To enable query caching, pass a `query_embedding_store` to the constructor, or set `query_embedding_cache` when creating the embedder with `from_bytes_store`.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | The text to embed. | required |

Returns:

| Type | Description |
| --- | --- |
| `list[float]` | The embedding for the given text. |
### `from_bytes_store` `classmethod`

```python
from_bytes_store(
    underlying_embeddings: Embeddings,
    document_embedding_cache: ByteStore,
    *,
    namespace: str = "",
    batch_size: int | None = None,
    query_embedding_cache: bool | ByteStore = False,
    key_encoder: Callable[[str], str]
    | Literal["sha1", "blake2b", "sha256", "sha512"] = "sha1",
) -> CacheBackedEmbeddings
```

On-ramp that adds the necessary serialization and encoding to the store.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `underlying_embeddings` | `Embeddings` | The embedder to use for embedding. | required |
| `document_embedding_cache` | `ByteStore` | The cache to use for storing document embeddings. | required |
| `namespace` | `str` | The namespace to use for the document cache. This namespace is used to avoid collisions with other caches. For example, set it to the name of the embedding model used. | `''` |
| `batch_size` | `int \| None` | The number of documents to embed between store updates. | `None` |
| `query_embedding_cache` | `bool \| ByteStore` | The cache to use for storing query embeddings. `True` to use the same cache as document embeddings; `False` to not cache query embeddings. | `False` |
| `key_encoder` | `Callable[[str], str] \| Literal['sha1', 'blake2b', 'sha256', 'sha512']` | Optional callable to encode keys. If not provided, a default encoder using SHA-1 is used. SHA-1 is not collision-resistant, and a motivated attacker could craft two different texts that hash to the same cache key. New applications should use one of the alternative encoders or provide a custom, strong key encoder function to avoid this risk. If you change the key encoder on an existing cache, consider creating a new cache instead, to avoid collisions with existing keys or duplicate entries for the same text. | `'sha1'` |
Returns:

| Type | Description |
| --- | --- |
| `CacheBackedEmbeddings` | An instance of `CacheBackedEmbeddings` that uses the provided cache. |
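As a sketch, both encoder styles might be used as follows, reusing `underlying_embedder` and `store` from the earlier example; `my_key_encoder` is a name made up here:

```python
import hashlib

from langchain.embeddings import CacheBackedEmbeddings

# Built-in alternative to the default SHA-1
embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embedder,
    store,
    namespace=underlying_embedder.model,
    key_encoder="sha256",
)


# Custom key encoder function
def my_key_encoder(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embedder,
    store,
    namespace=underlying_embedder.model,
    key_encoder=my_key_encoder,
)
```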
## `init_embeddings`

```python
init_embeddings(
    model: str,
    *,
    provider: str | None = None,
    **kwargs: Any,
) -> Embeddings | Runnable[Any, list[float]]
```

Initialize an embeddings model from a model name and optional provider.

Note: You must have the integration package corresponding to the model provider installed.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `str` | Name of the model to use. Can be either a provider-qualified string like `"openai:text-embedding-3-small"`, or just the model name if `provider` is specified. | required |
| `provider` | `str \| None` | Optional explicit provider name. If not specified, the provider is parsed from the model string prefix. Must be one of the supported providers, each of which requires its integration package. | `None` |
| `**kwargs` | `Any` | Additional model-specific parameters passed to the embedding model. These vary by provider; see the provider-specific documentation for details. | `{}` |
Returns:

| Type | Description |
| --- | --- |
| `Embeddings \| Runnable[Any, list[float]]` | An `Embeddings` instance that can generate embeddings for text. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the model provider is not supported or cannot be determined. |
| `ImportError` | If the required provider package is not installed. |
Example usage:

```python
# Using a model string
model = init_embeddings("openai:text-embedding-3-small")
model.embed_query("Hello, world!")

# Using an explicit provider
model = init_embeddings(model="text-embedding-3-small", provider="openai")
model.embed_documents(["Hello, world!", "Goodbye, world!"])

# With additional parameters
model = init_embeddings("openai:text-embedding-3-small", api_key="sk-...")
```

Added in version 0.3.9.
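Since initialization can fail per the Raises table above, callers may want to handle both error types; a sketch (`langchain-openai` being the usual OpenAI integration package):

```python
try:
    model = init_embeddings("openai:text-embedding-3-small")
except ImportError:
    # Provider package missing, e.g. `pip install langchain-openai`
    raise
except ValueError:
    # Provider not supported or could not be parsed from the model string
    raise
```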