# Evaluation

## evaluation

Evaluation Helpers.

| FUNCTION | DESCRIPTION |
|---|---|
| aevaluate | Evaluate an async target system on a given dataset. |
| aevaluate_existing | Evaluate existing experiment runs asynchronously. |
| evaluate | Evaluate a target system on a given dataset. |
| evaluate_comparative | Evaluate existing experiment runs against each other. |
| evaluate_existing | Evaluate existing experiment runs. |
| run_evaluator | Create a run evaluator from a function (see the sketch below). |
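As a quick orientation, here is a hedged sketch of the most common flow: wrap a plain function with `run_evaluator` (used here as a decorator) and pass it to `evaluate`. The dataset name "My Dataset" and the trivial lambda target are placeholders, not part of the API.

```python
# Hedged sketch: "My Dataset" and the lambda target are placeholders;
# substitute your own dataset and application under test.
from langsmith.evaluation import evaluate, run_evaluator
from langsmith.schemas import Example, Run


@run_evaluator
def exact_match(run: Run, example: Example) -> dict:
    # Compare the target's output against the dataset's reference answer.
    prediction = (run.outputs or {}).get("output", "")
    reference = (example.outputs or {}).get("answer", "")
    return {"key": "exact_match", "score": prediction == reference}


results = evaluate(
    lambda inputs: {"output": "Yes"},  # the target system under test
    data="My Dataset",
    evaluators=[exact_match],
)
```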
## EvaluationResult

Bases: BaseModel

Evaluation result.

| METHOD | DESCRIPTION |
|---|---|
| check_value_non_numeric | Check that the value is not numeric. |
### score

The numeric score for this evaluation.
### value

The value for this evaluation, if not numeric.
### comment

```python
comment: str | None = None
```

An explanation regarding the evaluation.
### correction

```python
correction: dict | None = None
```

What the correct value should be, if applicable.
### evaluator_info

Additional information about the evaluator.
### feedback_config

```python
feedback_config: FeedbackConfig | dict | None = None
```

The configuration used to generate this feedback.
### source_run_id

The ID of the trace of the evaluator itself.
### target_run_id

The ID of the trace this evaluation is applied to.

If none is provided, the evaluation feedback is applied to the root trace being evaluated.
### extra

```python
extra: dict | None = None
```

Metadata for the evaluator run.
### Config

Pydantic model configuration.
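To show how these fields fit together, here is a minimal, hedged sketch of a row-level evaluator that returns a populated EvaluationResult. The required `key` field names the feedback metric; the key name "correctness" and the comparison logic are purely illustrative.

```python
# Illustrative only: the key name and the comparison logic are arbitrary
# choices; score, comment, and correction are the optional fields shown above.
from langsmith.evaluation import EvaluationResult
from langsmith.schemas import Example, Run


def correctness(run: Run, example: Example) -> EvaluationResult:
    prediction = (run.outputs or {}).get("output", "")
    reference = (example.outputs or {}).get("answer", "")
    matched = prediction.lower() == reference.lower()
    return EvaluationResult(
        key="correctness",
        score=float(matched),
        comment="Case-insensitive match against the reference answer.",
        correction=None if matched else {"output": reference},
    )
```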
## EvaluationResults

Bases: TypedDict
Batch evaluation results.
This makes it easy for your evaluator to return multiple metrics at once.
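Assuming the TypedDict carries a `"results"` list of EvaluationResult objects (the shape used by the SDK), a single evaluator can emit several metrics per run, as in this sketch; the metric keys are illustrative.

```python
# Sketch assuming EvaluationResults is a dict with a "results" list of
# EvaluationResult objects; both metric keys here are illustrative.
from langsmith.evaluation import EvaluationResult, EvaluationResults
from langsmith.schemas import Example, Run


def multi_metric(run: Run, example: Example) -> EvaluationResults:
    prediction = (run.outputs or {}).get("output", "")
    reference = (example.outputs or {}).get("answer", "")
    return {
        "results": [
            EvaluationResult(key="exact_match", score=prediction == reference),
            EvaluationResult(key="non_empty", score=bool(prediction)),
        ]
    }
```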
## RunEvaluator

Evaluator interface class.

| METHOD | DESCRIPTION |
|---|---|
| evaluate_run | Evaluate an example. |
| aevaluate_run | Evaluate an example asynchronously. |
### evaluate_run

*abstractmethod*

```python
evaluate_run(
    run: Run, example: Example | None = None, evaluator_run_id: UUID | None = None
) -> EvaluationResult | EvaluationResults
```

Evaluate an example.
### aevaluate_run

*async*

```python
aevaluate_run(
    run: Run, example: Example | None = None, evaluator_run_id: UUID | None = None
) -> EvaluationResult | EvaluationResults
```

Evaluate an example asynchronously.
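Per the method listing above, only `evaluate_run` is abstract, so a custom evaluator can be a small subclass along the lines of this hedged sketch; the length-budget metric is invented for illustration and is not part of the library.

```python
# Sketch of a custom RunEvaluator subclass; the length-budget metric is
# purely illustrative. Only evaluate_run is implemented here, matching the
# abstract method listed above.
from uuid import UUID

from langsmith.evaluation import EvaluationResult, RunEvaluator
from langsmith.schemas import Example, Run


class OutputLengthEvaluator(RunEvaluator):
    """Scores a run by whether its output stays within a character budget."""

    def __init__(self, max_chars: int = 500) -> None:
        self.max_chars = max_chars

    def evaluate_run(
        self,
        run: Run,
        example: Example | None = None,
        evaluator_run_id: UUID | None = None,
    ) -> EvaluationResult:
        output = str((run.outputs or {}).get("output", ""))
        return EvaluationResult(
            key="within_length_budget",
            score=len(output) <= self.max_chars,
            comment=f"{len(output)} characters (budget: {self.max_chars}).",
        )
```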
## LangChainStringEvaluator
A class for wrapping a LangChain StringEvaluator.
Requires the langchain package to be installed.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| evaluator | The underlying StringEvaluator. |

| METHOD | DESCRIPTION |
|---|---|
| as_run_evaluator | Convert the LangChainStringEvaluator to a RunEvaluator. |
Examples:

Converting a LangChainStringEvaluator to a RunEvaluator:

```python
from langsmith.evaluation import LangChainStringEvaluator
from langchain_openai import ChatOpenAI

evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "usefulness": "The prediction is useful if"
            " it is correct and/or asks a useful followup question."
        },
        "llm": ChatOpenAI(model="gpt-4o"),
    },
)
run_evaluator = evaluator.as_run_evaluator()
run_evaluator  # doctest: +ELLIPSIS
# <DynamicRunEvaluator ...>
```
Customizing the LLM model used by the evaluator:

```python
from langsmith.evaluation import LangChainStringEvaluator
from langchain_anthropic import ChatAnthropic

evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "usefulness": "The prediction is useful if"
            " it is correct and/or asks a useful followup question."
        },
        "llm": ChatAnthropic(model="claude-3-opus-20240229"),
    },
)
run_evaluator = evaluator.as_run_evaluator()
run_evaluator  # doctest: +ELLIPSIS
# <DynamicRunEvaluator ...>
```
Using the evaluate API with different evaluators:

```python
import re

from langchain_anthropic import ChatAnthropic

from langsmith import Client
from langsmith.evaluation import LangChainStringEvaluator, evaluate
from langsmith.schemas import Example, Run


def prepare_data(run: Run, example: Example):
    # Convert the evaluation data into the format expected by the evaluator.
    # Only required for datasets with multiple input/output keys.
    return {
        "prediction": run.outputs["prediction"],
        "reference": example.outputs["answer"],
        "input": str(example.inputs),
    }


criteria_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "usefulness": "The prediction is useful if it is correct"
            " and/or asks a useful followup question."
        },
        "llm": ChatAnthropic(model="claude-3-opus-20240229"),
    },
    prepare_data=prepare_data,
)
embedding_evaluator = LangChainStringEvaluator("embedding_distance")
exact_match_evaluator = LangChainStringEvaluator("exact_match")
regex_match_evaluator = LangChainStringEvaluator(
    "regex_match", config={"flags": re.IGNORECASE}, prepare_data=prepare_data
)
scoring_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        "criteria": {
            "accuracy": "Score 1: Completely inaccurate\nScore 5: Somewhat accurate\nScore 10: Completely accurate"
        },
        "normalize_by": 10,
        "llm": ChatAnthropic(model="claude-3-opus-20240229"),
    },
    prepare_data=prepare_data,
)
string_distance_evaluator = LangChainStringEvaluator(
    "string_distance",
    config={"distance_metric": "levenshtein"},
    prepare_data=prepare_data,
)

client = Client()
results = evaluate(
    lambda inputs: {"prediction": "foo"},
    data=client.list_examples(dataset_name="Evaluate Examples", limit=1),
    evaluators=[
        embedding_evaluator,
        criteria_evaluator,
        exact_match_evaluator,
        regex_match_evaluator,
        scoring_evaluator,
        string_distance_evaluator,
    ],
)  # doctest: +ELLIPSIS
```
### __init__

```python
__init__(
    evaluator: StringEvaluator | str,
    *,
    config: dict | None = None,
    prepare_data: Callable[[Run, Optional[Example]], SingleEvaluatorInput]
    | None = None,
)
```

Initialize a LangChainStringEvaluator.
| PARAMETER | DESCRIPTION |
|---|---|
| evaluator | The underlying StringEvaluator to wrap, or the name of the LangChain evaluator to load. |
### as_run_evaluator

```python
as_run_evaluator() -> RunEvaluator
```
Convert the LangChainStringEvaluator to a RunEvaluator.
This is the object used in the LangSmith evaluate API.
| RETURNS | DESCRIPTION |
|---|---|
| RunEvaluator | The converted RunEvaluator. |
## aevaluate

*async*

```python
aevaluate(
    target: ATARGET_T | AsyncIterable[dict] | Runnable | str | UUID | TracerSession,
    /,
    data: DATA_T | AsyncIterable[Example] | Iterable[Example] | None = None,
    evaluators: Sequence[EVALUATOR_T | AEVALUATOR_T] | None = None,
    summary_evaluators: Sequence[SUMMARY_EVALUATOR_T] | None = None,
    metadata: dict | None = None,
    experiment_prefix: str | None = None,
    description: str | None = None,
    max_concurrency: int | None = 0,
    num_repetitions: int = 1,
    client: Client | None = None,
    blocking: bool = True,
    experiment: TracerSession | str | UUID | None = None,
    upload_results: bool = True,
    error_handling: Literal["log", "ignore"] = "log",
    **kwargs: Any,
) -> AsyncExperimentResults
```
Evaluate an async target system on a given dataset.
| PARAMETER | DESCRIPTION |
|---|---|
| target | The target system or experiment(s) to evaluate. Can be an async function that takes a dict of inputs and returns a dict of outputs, a Runnable, or an existing experiment referenced by name, UUID, or TracerSession. |
| data | The dataset to evaluate on. Can be a dataset name, a list of examples, an async generator of examples, or an async iterable of examples. |
| evaluators | A list of evaluators to run on each example. |
| summary_evaluators | A list of summary evaluators to run on the entire dataset. |
| metadata | Metadata to attach to the experiment. |
| experiment_prefix | A prefix to provide for your experiment name. |
| description | A description of the experiment. |
| max_concurrency | The maximum number of concurrent evaluations to run. If None, no limit is set; if 0, no concurrency is used. |
| num_repetitions | The number of times to run the evaluation. Each item in the dataset will be run and evaluated this many times. |
| client | The LangSmith client to use. |
| blocking | Whether to block until the evaluation is complete. |
| experiment | An existing experiment to extend. If provided, experiment_prefix is ignored. For advanced usage only. |
| error_handling | How to handle individual run errors. |
| RETURNS | DESCRIPTION |
|---|---|
| AsyncExperimentResults | An async iterator over the experiment results. |
Environment:

- LANGSMITH_TEST_CACHE: If set, API calls will be cached to disk to save time and cost during testing. It is recommended to commit the cache files to your repository for faster CI/CD runs. Requires the 'langsmith[vcr]' package to be installed.
Examples:
>>> from typing import Sequence
>>> from langsmith import Client, aevaluate
>>> from langsmith.schemas import Example, Run
>>> client = Client()
>>> dataset = client.clone_public_dataset(
... "https://smith.langchain.com/public/419dcab2-1d66-4b94-8901-0357ead390df/d"
... )
>>> dataset_name = "Evaluate Examples"
Basic usage:
>>> def accuracy(run: Run, example: Example):
... # Row-level evaluator for accuracy.
... pred = run.outputs["output"]
... expected = example.outputs["answer"]
... return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
... # Experiment-level evaluator for precision.
... # TP / (TP + FP)
... predictions = [run.outputs["output"].lower() for run in runs]
... expected = [example.outputs["answer"].lower() for example in examples]
... # yes and no are the only possible answers
... tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
... fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
... return {"score": tp / (tp + fp)}
>>> import asyncio
>>> async def apredict(inputs: dict) -> dict:
... # This can be any async function or just an API call to your app.
... await asyncio.sleep(0.1)
... return {"output": "Yes"}
>>> results = asyncio.run(
... aevaluate(
... apredict,
... data=dataset_name,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Experiment",
... description="Evaluate the accuracy of the model asynchronously.",
... metadata={
... "my-prompt-version": "abcd-1234",
... },
... )
... )
View the evaluation results for experiment:...
Evaluating over only a subset of the examples using an async generator:
>>> async def example_generator():
... examples = client.list_examples(dataset_name=dataset_name, limit=5)
... for example in examples:
... yield example
>>> results = asyncio.run(
... aevaluate(
... apredict,
... data=example_generator(),
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Subset Experiment",
... description="Evaluate a subset of examples asynchronously.",
... )
... )
View the evaluation results for experiment:...
Streaming each prediction to more easily + eagerly debug.
>>> results = asyncio.run(
... aevaluate(
... apredict,
... data=dataset_name,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Streaming Experiment",
... description="Streaming predictions for debugging.",
... blocking=False,
... )
... )
View the evaluation results for experiment:...
>>> async def aenumerate(iterable):
... async for elem in iterable:
... print(elem)
>>> asyncio.run(aenumerate(results))
Running without concurrency:
>>> results = asyncio.run(
... aevaluate(
... apredict,
... data=dataset_name,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Experiment Without Concurrency",
... description="This was run without concurrency.",
... max_concurrency=0,
... )
... )
View the evaluation results for experiment:...
Using Async evaluators:
>>> async def helpfulness(run: Run, example: Example):
... # Row-level evaluator for helpfulness.
... await asyncio.sleep(5) # Replace with your LLM API call
... return {"score": run.outputs["output"] == "Yes"}
>>> results = asyncio.run(
... aevaluate(
... apredict,
... data=dataset_name,
... evaluators=[helpfulness],
... summary_evaluators=[precision],
... experiment_prefix="My Helpful Experiment",
... description="Applying async evaluators example.",
... )
... )
View the evaluation results for experiment:...
Behavior changed in langsmith 0.2.0: the 'max_concurrency' default was updated from None (no limit on concurrency) to 0 (no concurrency at all).
## aevaluate_existing

*async*

```python
aevaluate_existing(
    experiment: str | UUID | TracerSession,
    /,
    evaluators: Sequence[EVALUATOR_T | AEVALUATOR_T] | None = None,
    summary_evaluators: Sequence[SUMMARY_EVALUATOR_T] | None = None,
    metadata: dict | None = None,
    max_concurrency: int | None = 0,
    client: Client | None = None,
    load_nested: bool = False,
    blocking: bool = True,
) -> AsyncExperimentResults
```
Evaluate existing experiment runs asynchronously.
| PARAMETER | DESCRIPTION |
|---|---|
| experiment | The identifier of the experiment to evaluate. |
| evaluators | Optional sequence of evaluators to use for individual run evaluation. |
| summary_evaluators | Optional sequence of evaluators to apply over the entire dataset. |
| metadata | Optional metadata to include in the evaluation results. |
| max_concurrency | The maximum number of concurrent evaluations to run. If None, no limit is set; if 0, no concurrency is used. |
| client | Optional LangSmith client to use for evaluation. |
| load_nested | Whether to load all child runs for the experiment. Default is to only load the top-level root runs. |
| blocking | Whether to block until evaluation is complete. |

| RETURNS | DESCRIPTION |
|---|---|
| AsyncExperimentResults | An async iterator over the experiment results. |
Examples:
Define your evaluators
>>> from typing import Sequence
>>> from langsmith.schemas import Example, Run
>>> def accuracy(run: Run, example: Example):
... # Row-level evaluator for accuracy.
... pred = run.outputs["output"]
... expected = example.outputs["answer"]
... return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
... # Experiment-level evaluator for precision.
... # TP / (TP + FP)
... predictions = [run.outputs["output"].lower() for run in runs]
... expected = [example.outputs["answer"].lower() for example in examples]
... # yes and no are the only possible answers
... tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
... fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
... return {"score": tp / (tp + fp)}
Load the experiment and run the evaluation.
>>> import asyncio
>>> import uuid
>>> from langsmith import Client, aevaluate, aevaluate_existing
>>> client = Client()
>>> dataset_name = "__doctest_aevaluate_existing_" + uuid.uuid4().hex[:8]
>>> dataset = client.create_dataset(dataset_name)
>>> example = client.create_example(
... inputs={"question": "What is 2+2?"},
... outputs={"answer": "4"},
... dataset_id=dataset.id,
... )
>>> async def apredict(inputs: dict) -> dict:
... await asyncio.sleep(0.001)
... return {"output": "4"}
>>> results = asyncio.run(
... aevaluate(
... apredict, data=dataset_name, experiment_prefix="doctest_experiment"
... )
... )
View the evaluation results for experiment:...
>>> experiment_id = results.experiment_name
>>> # Consume all results to ensure evaluation is complete
>>> async def consume_results():
... result_list = [r async for r in results]
... return len(result_list) > 0
>>> asyncio.run(consume_results())
True
>>> import time
>>> time.sleep(3)
>>> results = asyncio.run(
... aevaluate_existing(
... experiment_id,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... )
... )
View the evaluation results for experiment:...
>>> client.delete_dataset(dataset_id=dataset.id)
## evaluate

```python
evaluate(
    target: TARGET_T | Runnable | EXPERIMENT_T | tuple[EXPERIMENT_T, EXPERIMENT_T],
    /,
    data: DATA_T | None = None,
    evaluators: Sequence[EVALUATOR_T] | Sequence[COMPARATIVE_EVALUATOR_T] | None = None,
    summary_evaluators: Sequence[SUMMARY_EVALUATOR_T] | None = None,
    metadata: dict | None = None,
    experiment_prefix: str | None = None,
    description: str | None = None,
    max_concurrency: int | None = 0,
    num_repetitions: int = 1,
    client: Client | None = None,
    blocking: bool = True,
    experiment: EXPERIMENT_T | None = None,
    upload_results: bool = True,
    error_handling: Literal["log", "ignore"] = "log",
    **kwargs: Any,
) -> ExperimentResults | ComparativeExperimentResults
```
Evaluate a target system on a given dataset.
| PARAMETER | DESCRIPTION |
|---|---|
| target | The target system or experiment(s) to evaluate. Can be a function that takes a dict of inputs and returns a dict of outputs, a Runnable, an existing experiment, or a two-tuple of existing experiments. |
| data | The dataset to evaluate on. Can be a dataset name, a list of examples, or a generator of examples. |
| evaluators | A list of evaluators to run on each example. The evaluator signature depends on the target type. |
| summary_evaluators | A list of summary evaluators to run on the entire dataset. Should not be specified if comparing two existing experiments. |
| metadata | Metadata to attach to the experiment. |
| experiment_prefix | A prefix to provide for your experiment name. |
| description | A free-form text description for the experiment. |
| max_concurrency | The maximum number of concurrent evaluations to run. If None, no limit is set; if 0, no concurrency is used. |
| client | The LangSmith client to use. |
| blocking | Whether to block until the evaluation is complete. |
| num_repetitions | The number of times to run the evaluation. Each item in the dataset will be run and evaluated this many times. |
| experiment | An existing experiment to extend. If provided, experiment_prefix is ignored. For advanced usage only. Should not be specified if target is an existing experiment or a two-tuple of experiments. |
| error_handling | How to handle individual run errors. |

| RETURNS | DESCRIPTION |
|---|---|
| ExperimentResults | Returned when target is a function, Runnable, or single existing experiment. |
| ComparativeExperimentResults | Returned when target is a two-tuple of existing experiments. |
Examples:
Prepare the dataset:
>>> from typing import Sequence
>>> from langsmith import Client
>>> from langsmith.evaluation import evaluate
>>> from langsmith.schemas import Example, Run
>>> client = Client()
>>> dataset = client.clone_public_dataset(
... "https://smith.langchain.com/public/419dcab2-1d66-4b94-8901-0357ead390df/d"
... )
>>> dataset_name = "Evaluate Examples"
Basic usage:
>>> def accuracy(run: Run, example: Example):
... # Row-level evaluator for accuracy.
... pred = run.outputs["output"]
... expected = example.outputs["answer"]
... return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
... # Experiment-level evaluator for precision.
... # TP / (TP + FP)
... predictions = [run.outputs["output"].lower() for run in runs]
... expected = [example.outputs["answer"].lower() for example in examples]
... # yes and no are the only possible answers
... tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
... fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
... return {"score": tp / (tp + fp)}
>>> def predict(inputs: dict) -> dict:
... # This can be any function or just an API call to your app.
... return {"output": "Yes"}
>>> results = evaluate(
... predict,
... data=dataset_name,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Experiment",
... description="Evaluating the accuracy of a simple prediction model.",
... metadata={
... "my-prompt-version": "abcd-1234",
... },
... )
View the evaluation results for experiment:...
Evaluating over only a subset of the examples
>>> experiment_name = results.experiment_name
>>> examples = client.list_examples(dataset_name=dataset_name, limit=5)
>>> results = evaluate(
... predict,
... data=examples,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... experiment_prefix="My Experiment",
... description="Just testing a subset synchronously.",
... )
View the evaluation results for experiment:...
Streaming each prediction to more easily + eagerly debug.
>>> results = evaluate(
... predict,
... data=dataset_name,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... description="I don't even have to block!",
... blocking=False,
... )
View the evaluation results for experiment:...
>>> for i, result in enumerate(results):
... pass
Using the evaluate API with an off-the-shelf LangChain evaluator:
>>> from langsmith.evaluation import LangChainStringEvaluator
>>> from langchain_openai import ChatOpenAI
>>> def prepare_criteria_data(run: Run, example: Example):
... return {
... "prediction": run.outputs["output"],
... "reference": example.outputs["answer"],
... "input": str(example.inputs),
... }
>>> results = evaluate(
... predict,
... data=dataset_name,
... evaluators=[
... accuracy,
... LangChainStringEvaluator("embedding_distance"),
... LangChainStringEvaluator(
... "labeled_criteria",
... config={
... "criteria": {
... "usefulness": "The prediction is useful if it is correct"
... " and/or asks a useful followup question."
... },
... "llm": ChatOpenAI(model="gpt-4o"),
... },
... prepare_data=prepare_criteria_data,
... ),
... ],
... description="Evaluating with off-the-shelf LangChain evaluators.",
... summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
Evaluating a LangChain object:
>>> from langchain_core.runnables import chain as as_runnable
>>> @as_runnable
... def nested_predict(inputs):
... return {"output": "Yes"}
>>> @as_runnable
... def lc_predict(inputs):
... return nested_predict.invoke(inputs)
>>> results = evaluate(
... lc_predict.invoke,
... data=dataset_name,
... evaluators=[accuracy],
... description="This time we're evaluating a LangChain object.",
... summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
Behavior changed in langsmith 0.2.0: the 'max_concurrency' default was updated from None (no limit on concurrency) to 0 (no concurrency at all).
## evaluate_comparative

```python
evaluate_comparative(
    experiments: tuple[EXPERIMENT_T, EXPERIMENT_T],
    /,
    evaluators: Sequence[COMPARATIVE_EVALUATOR_T],
    experiment_prefix: str | None = None,
    description: str | None = None,
    max_concurrency: int = 5,
    client: Client | None = None,
    metadata: dict | None = None,
    load_nested: bool = False,
    randomize_order: bool = False,
) -> ComparativeExperimentResults
```
Evaluate existing experiment runs against each other.
This lets you use pairwise preference scoring to generate more reliable feedback in your experiments.
| PARAMETER | DESCRIPTION |
|---|---|
| experiments | The identifiers of the experiments to compare. |
| evaluators | A list of evaluators to run on each example. |
| experiment_prefix | A prefix to provide for your experiment name. |
| description | A free-form text description for the experiment. |
| max_concurrency | The maximum number of concurrent evaluations to run. |
| client | The LangSmith client to use. |
| metadata | Metadata to attach to the experiment. |
| load_nested | Whether to load all child runs for the experiment. Default is to only load the top-level root runs. |
| randomize_order | Whether to randomize the order of the outputs for each evaluation. |

| RETURNS | DESCRIPTION |
|---|---|
| ComparativeExperimentResults | The results of the comparative evaluation. |
Examples:
Suppose you want to compare two prompts to see which one is more effective. You would first prepare your dataset:
>>> from typing import Sequence
>>> from langsmith import Client
>>> from langsmith.evaluation import evaluate
>>> from langsmith.schemas import Example, Run
>>> client = Client()
>>> dataset = client.clone_public_dataset(
... "https://smith.langchain.com/public/419dcab2-1d66-4b94-8901-0357ead390df/d"
... )
>>> dataset_name = "Evaluate Examples"
Then you would run your different prompts:
>>> import functools
>>> import openai
>>> from langsmith.evaluation import evaluate
>>> from langsmith.wrappers import wrap_openai
>>> oai_client = openai.Client()
>>> wrapped_client = wrap_openai(oai_client)
>>> prompt_1 = "You are a helpful assistant."
>>> prompt_2 = "You are an exceedingly helpful assistant."
>>> def predict(inputs: dict, prompt: str) -> dict:
... completion = wrapped_client.chat.completions.create(
... model="gpt-4o-mini",
... messages=[
... {"role": "system", "content": prompt},
... {
... "role": "user",
... "content": f"Context: {inputs['context']}"
... f"\n\ninputs['question']",
... },
... ],
... )
... return {"output": completion.choices[0].message.content}
>>> results_1 = evaluate(
... functools.partial(predict, prompt=prompt_1),
... data=dataset_name,
... description="Evaluating our basic system prompt.",
... blocking=False, # Run these experiments in parallel
... )
View the evaluation results for experiment:...
>>> results_2 = evaluate(
... functools.partial(predict, prompt=prompt_2),
... data=dataset_name,
... description="Evaluating our advanced system prompt.",
... blocking=False,
... )
View the evaluation results for experiment:...
>>> results_1.wait()
>>> results_2.wait()
Finally, you would compare the two prompts directly:
>>> import json
>>> from langsmith.evaluation import evaluate_comparative
>>> from langsmith import schemas
>>> def score_preferences(runs: list, example: schemas.Example):
... assert len(runs) == 2 # Comparing 2 systems
... assert isinstance(example, schemas.Example)
... assert all(run.reference_example_id == example.id for run in runs)
... pred_a = runs[0].outputs["output"] if runs[0].outputs else ""
... pred_b = runs[1].outputs["output"] if runs[1].outputs else ""
... ground_truth = example.outputs["answer"] if example.outputs else ""
... tools = [
... {
... "type": "function",
... "function": {
... "name": "rank_preferences",
... "description": "Saves the prefered response ('A' or 'B')",
... "parameters": {
... "type": "object",
... "properties": {
... "reasoning": {
... "type": "string",
... "description": "The reasoning behind the choice.",
... },
... "preferred_option": {
... "type": "string",
... "enum": ["A", "B"],
... "description": "The preferred option, either 'A' or 'B'",
... },
... },
... "required": ["preferred_option"],
... },
... },
... }
... ]
... completion = openai.Client().chat.completions.create(
... model="gpt-4o-mini",
... messages=[
... {"role": "system", "content": "Select the better response."},
... {
... "role": "user",
... "content": f"Option A: {pred_a}"
... f"\n\nOption B: {pred_b}"
... f"\n\nGround Truth: {ground_truth}",
... },
... ],
... tools=tools,
... tool_choice={
... "type": "function",
... "function": {"name": "rank_preferences"},
... },
... )
... tool_args = completion.choices[0].message.tool_calls[0].function.arguments
... loaded_args = json.loads(tool_args)
... preference = loaded_args["preferred_option"]
... comment = loaded_args["reasoning"]
... if preference == "A":
... return {
... "key": "ranked_preference",
... "scores": {runs[0].id: 1, runs[1].id: 0},
... "comment": comment,
... }
... else:
... return {
... "key": "ranked_preference",
... "scores": {runs[0].id: 0, runs[1].id: 1},
... "comment": comment,
... }
>>> def score_length_difference(runs: list, example: schemas.Example):
... # Just return whichever response is longer.
... # Just an example, not actually useful in real life.
... assert len(runs) == 2 # Comparing 2 systems
... assert isinstance(example, schemas.Example)
... assert all(run.reference_example_id == example.id for run in runs)
... pred_a = runs[0].outputs["output"] if runs[0].outputs else ""
... pred_b = runs[1].outputs["output"] if runs[1].outputs else ""
... if len(pred_a) > len(pred_b):
... return {
... "key": "length_difference",
... "scores": {runs[0].id: 1, runs[1].id: 0},
... }
... else:
... return {
... "key": "length_difference",
... "scores": {runs[0].id: 0, runs[1].id: 1},
... }
>>> results = evaluate_comparative(
... [results_1.experiment_name, results_2.experiment_name],
... evaluators=[score_preferences, score_length_difference],
... client=client,
... )
View the pairwise evaluation results at:...
>>> eval_results = list(results)
>>> assert len(eval_results) >= 10
>>> assert all(
... "feedback.ranked_preference" in r["evaluation_results"]
... for r in eval_results
... )
>>> assert all(
... "feedback.length_difference" in r["evaluation_results"]
... for r in eval_results
... )
## evaluate_existing

```python
evaluate_existing(
    experiment: str | UUID | TracerSession,
    /,
    evaluators: Sequence[EVALUATOR_T] | None = None,
    summary_evaluators: Sequence[SUMMARY_EVALUATOR_T] | None = None,
    metadata: dict | None = None,
    max_concurrency: int | None = 0,
    client: Client | None = None,
    load_nested: bool = False,
    blocking: bool = True,
) -> ExperimentResults
```
Evaluate existing experiment runs.
| PARAMETER | DESCRIPTION |
|---|---|
| experiment | The identifier of the experiment to evaluate. |
| evaluators | Optional sequence of evaluators to use for individual run evaluation. |
| summary_evaluators | Optional sequence of evaluators to apply over the entire dataset. |
| metadata | Optional metadata to include in the evaluation results. |
| max_concurrency | The maximum number of concurrent evaluations to run. If None, no limit is set; if 0, no concurrency is used. |
| client | Optional LangSmith client to use for evaluation. |
| load_nested | Whether to load all child runs for the experiment. Default is to only load the top-level root runs. |
| blocking | Whether to block until evaluation is complete. |

| RETURNS | DESCRIPTION |
|---|---|
| ExperimentResults | The evaluation results. |
Environment:

- LANGSMITH_TEST_CACHE: If set, API calls will be cached to disk to save time and cost during testing. It is recommended to commit the cache files to your repository for faster CI/CD runs. Requires the 'langsmith[vcr]' package to be installed.
Examples:
Define your evaluators
>>> from typing import Sequence
>>> from langsmith.schemas import Example, Run
>>> def accuracy(run: Run, example: Example):
... # Row-level evaluator for accuracy.
... pred = run.outputs["output"]
... expected = example.outputs["answer"]
... return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
... # Experiment-level evaluator for precision.
... # TP / (TP + FP)
... predictions = [run.outputs["output"].lower() for run in runs]
... expected = [example.outputs["answer"].lower() for example in examples]
... # yes and no are the only possible answers
... tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
... fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
... return {"score": tp / (tp + fp)}
Load the experiment and run the evaluation.
>>> import uuid
>>> from langsmith import Client
>>> from langsmith.evaluation import evaluate, evaluate_existing
>>> client = Client()
>>> dataset_name = "__doctest_evaluate_existing_" + uuid.uuid4().hex[:8]
>>> dataset = client.create_dataset(dataset_name)
>>> example = client.create_example(
... inputs={"question": "What is 2+2?"},
... outputs={"answer": "4"},
... dataset_id=dataset.id,
... )
>>> def predict(inputs: dict) -> dict:
... return {"output": "4"}
>>> # First run inference on the dataset
... results = evaluate(
... predict, data=dataset_name, experiment_prefix="doctest_experiment"
... )
View the evaluation results for experiment:...
>>> experiment_id = results.experiment_name
>>> # Wait for the experiment to be fully processed and check if we have results
>>> len(results) > 0
True
>>> import time
>>> time.sleep(2)
>>> results = evaluate_existing(
... experiment_id,
... evaluators=[accuracy],
... summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
>>> client.delete_dataset(dataset_id=dataset.id)