Callback Handler that prints to std out.
For tracking model usage and costs, you can use the get_usage_callback context manager to
track token information, similar to get_openai_callback. Additionally, you can specify
custom price mappings as necessary (price_map argument) or provide a custom callback
handler for advanced use cases (callback argument).
This feature is currently not supported in streaming modes, but works fine
for non-streaming invoke/ainvoke queries.
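The full example below only exercises the default handler, so here is a separate minimal sketch of the callback argument for advanced use cases. The handler class name (UsageCallbackHandler) and its zero-argument constructor are assumptions; check the callbacks module in your installed version for the exact interface.

from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_nvidia_ai_endpoints.callbacks import (
    UsageCallbackHandler,  # assumed name; verify against your installed version
    get_usage_callback,
)

price_map = {"mixtral_8x7b": 0.00060}
llm = ChatNVIDIA(model="mixtral_8x7b", temperature=0.1)
tracker = UsageCallbackHandler()  # one handler instance reused across scopes

with get_usage_callback(price_map=price_map, callback=tracker) as cb:
    llm.invoke("Tell me a joke")

with get_usage_callback(price_map=price_map, callback=tracker) as cb:
    llm.invoke("Tell me another joke")

print(tracker)  # statistics accumulated across both scopes

The walkthrough below uses the default handler and the price_map argument only.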
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings
from langchain_nvidia_ai_endpoints.callbacks import get_usage_callback

## Assume a price map per 1K tokens for a particular deployment plan
price_map = {
    "mixtral_8x7b": 0.00060,
    "gemma_7b": 0.0002,
    "nvolveqa_40k": 0.000016,
}
llm_large = ChatNVIDIA(model="mixtral_8x7b", temperature=0.1)
llm_small = ChatNVIDIA(model="gemma_7b", temperature=0.1)
embedding = NVIDIAEmbeddings(model="nvolveqa_40k")
models = [llm_large, llm_small, embedding]
with get_usage_callback(price_map=price_map) as cb:
    ## Reset either at beginning or end. Statistics will run until cleared
    cb.reset()

    llm_large.invoke("Tell me a joke")
    print(cb, end="\n")

    # llm_large.invoke("Tell me a short joke")
    # print(cb, end="\n")

    # ## Tracking through streaming coming soon
    # [_ for _ in llm_small.stream("Tell me a joke")]
    # print(cb, end="\n[Should not change yet]\n")
    ## Tracking for embeddings supported
    embedding.embed_query("What a nice day :D")
    print(cb, end="\n")

    # ## Sanity check. Should still be tracked fine
    # llm_small.invoke("Tell me a long joke")
    # print(cb, end="\n")
## Out of scope. Will not be tracked
llm_small.invoke("Tell me a short joke")
print(cb, end="\n[Should not change ever]\n")
cb.model_usage

Standardize the model name to a format that can be used in the OpenAI API.
Get the cost in USD for a given model and number of tokens.
Get the OpenAI callback handler in a context manager.
Exposes token and cost information.
Callback Handler that tracks OpenAI info.
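For reference, a price map expressed per 1K tokens implies a cost calculation along the following lines. The helper below is a hypothetical illustration of that arithmetic, not the package's actual implementation.

def estimate_cost_usd(model: str, num_tokens: int, price_map: dict) -> float:
    """Hypothetical helper: price_map values are USD per 1K tokens."""
    price_per_1k = price_map.get(model, 0.0)  # models missing from the map cost nothing
    return (num_tokens / 1000) * price_per_1k

# e.g. 2,500 mixtral_8x7b tokens at $0.00060 per 1K tokens -> $0.0015
print(estimate_cost_usd("mixtral_8x7b", 2500, {"mixtral_8x7b": 0.00060}))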