Evaluate a target system on a given dataset.

The `max_concurrency` default was updated from `None` (no limit on concurrency) to `0` (no concurrency at all).

| Name | Type | Description |
|---|---|---|
| `target`\* | `TARGET_T \| Runnable \| EXPERIMENT_T \| Tuple[EXPERIMENT_T, EXPERIMENT_T]` | The target system or experiment(s) to evaluate. Can be a function that takes a dict and returns a |
| `data` | `DATA_T` | The dataset to evaluate on. Can be a dataset name, a list of examples, or a generator of examples. Default: `None`. |
| `evaluators` | `Sequence[EVALUATOR_T] \| Sequence[COMPARATIVE_EVALUATOR_T] \| None` | A list of evaluators to run on each example. The evaluator signature depends on the target type. Default: `None`. |
| `summary_evaluators` | `Sequence[SUMMARY_EVALUATOR_T] \| None` | A list of summary evaluators to run on the entire dataset. Should not be specified if comparing two existing experiments. Default: `None`. |
| `metadata` | `dict \| None` | Metadata to attach to the experiment. Default: `None`. |
| `experiment_prefix` | `str \| None` | A prefix for your experiment name. Default: `None`. |
| `description` | `str \| None` | A free-form text description for the experiment. Default: `None`. |
| `max_concurrency` | `int \| None` | The maximum number of concurrent evaluations to run. If `None`, no limit is set. If `0`, evaluations run without concurrency. Default: `0`. |
| `client` | `langsmith.Client \| None` | The LangSmith client to use. Default: `None`. |
| `blocking` | `bool` | Whether to block until the evaluation is complete. Default: `True`. |
| `num_repetitions` | `int` | The number of times to run the evaluation. Each item in the dataset will be run and evaluated this many times. Default: `1`. |
| `experiment` | `schemas.TracerSession \| None` | An existing experiment to extend. If provided, `experiment_prefix` is ignored. For advanced usage only. Should not be specified if `target` is an existing experiment or a two-tuple of experiments. Default: `None`. |
| `error_handling` | `str` | How to handle individual run errors. `'log'` will trace the runs with the error message as part of the experiment; `'ignore'` will not count the run as part of the experiment at all. Default: `'log'`. |
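
The parameters above can be combined roughly as in the following sketch. It is illustrative only: `target`, `exact_match`, `accuracy`, `run_experiment`, and the `"exact-match-demo"` prefix are hypothetical names, and the `(run, example)` evaluator form and `(runs, examples)` summary evaluator form are assumptions that should be checked against the installed `langsmith` version. The `evaluate` call needs the `langsmith` package and API credentials, so it is wrapped in a function that is not invoked here; the last lines exercise the evaluators offline with stand-in objects.

```python
from types import SimpleNamespace


def target(inputs: dict) -> dict:
    """Hypothetical target system: takes a dict of inputs, returns a dict."""
    return {"answer": inputs["question"].strip().lower()}


def exact_match(run, example) -> dict:
    """Per-example evaluator (assumed form): compare a run to its reference."""
    predicted = (run.outputs or {}).get("answer")
    expected = (example.outputs or {}).get("answer")
    return {"key": "exact_match", "score": int(predicted == expected)}


def accuracy(runs, examples) -> dict:
    """Summary evaluator (assumed form): aggregate over the entire dataset."""
    scores = [exact_match(r, e)["score"] for r, e in zip(runs, examples)]
    return {"key": "accuracy", "score": sum(scores) / len(scores) if scores else 0.0}


def run_experiment(dataset_name: str):
    """Run the evaluation; requires langsmith and credentials (not called here)."""
    from langsmith import evaluate  # imported lazily so the rest runs offline

    return evaluate(
        target,
        data=dataset_name,                 # a dataset name, per the table above
        evaluators=[exact_match],
        summary_evaluators=[accuracy],
        experiment_prefix="exact-match-demo",
        max_concurrency=0,                 # 0 means no concurrency
        num_repetitions=1,
        blocking=True,
    )


# Offline check of the evaluators using stand-ins for a run and an example.
run = SimpleNamespace(outputs=target({"question": "  Paris "}))
example = SimpleNamespace(outputs={"answer": "paris"})
print(exact_match(run, example))
print(accuracy([run], [example]))
```

Keeping the evaluators as plain functions makes them easy to unit-test before any experiment is created; only `run_experiment` touches the LangSmith service.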