Evaluate a target system on a given dataset.
evaluate(
    self,
    target: Union[TARGET_T, Runnable, EXPERIMENT_T, tuple[EXPERIMENT_T, EXPERIMENT_T]],
    /,
    data: Optional[DATA_T] = None,
    evaluators: Optional[Union[Sequence[EVALUATOR_T], Sequence[COMPARATIVE_EVALUATOR_T]]] = None,
    summary_evaluators: Optional[Sequence[SUMMARY_EVALUATOR_T]] = None,
    metadata: Optional[dict] = None,
    experiment_prefix: Optional[str] = None,
    description: Optional[str] = None,
    max_concurrency: Optional[int] = 0,
    num_repetitions: int = 1,
    blocking: bool = True,
    experiment: Optional[EXPERIMENT_T] = None,
    upload_results: bool = True,
    error_handling: Literal['log', 'ignore'] = 'log',
    **kwargs: Any,
) -> Union[ExperimentResults, ComparativeExperimentResults]
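A minimal sketch of the call shape, before the parameter details below. The target function `my_app`, the evaluator `exact_match`, and the dataset name `"my-dataset"` are hypothetical stand-ins; the evaluator uses the classic `(run, example)` signature and returns a `{"key", "score"}` dict.

```python
from langsmith import Client
from langsmith.schemas import Example, Run

client = Client()

# Hypothetical target: any callable mapping an example's inputs dict to an outputs dict.
def my_app(inputs: dict) -> dict:
    return {"answer": inputs["question"].strip().lower()}

# Hypothetical row-level evaluator: compares the target's output with the reference output.
def exact_match(run: Run, example: Example) -> dict:
    match = (run.outputs or {}).get("answer") == (example.outputs or {}).get("answer")
    return {"key": "exact_match", "score": int(match)}

results = client.evaluate(
    my_app,
    data="my-dataset",            # hypothetical dataset name
    evaluators=[exact_match],
    experiment_prefix="baseline",
    max_concurrency=4,
)
```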
| Name | Type | Description |
|---|---|---|
| target* | Union[TARGET_T, Runnable, EXPERIMENT_T, Tuple[EXPERIMENT_T, EXPERIMENT_T]] | The target system or experiment(s) to evaluate. Can be a function that takes a dict of inputs and returns a dict of outputs, a Runnable, an existing experiment, or a two-tuple of experiments to compare. |
| data | Optional[DATA_T] | Default: None. The dataset to evaluate on. Can be a dataset name, a list of examples, or a generator of examples. |
| evaluators | Optional[Union[Sequence[EVALUATOR_T], Sequence[COMPARATIVE_EVALUATOR_T]]] | Default: None. A list of evaluators to run on each example. The evaluator signature depends on the target type. |
| summary_evaluators | Optional[Sequence[SUMMARY_EVALUATOR_T]] | Default: None. A list of summary evaluators to run on the entire dataset. Should not be specified when comparing two existing experiments. |
| metadata | Optional[dict] | Default: None. Metadata to attach to the experiment. |
| experiment_prefix | Optional[str] | Default: None. A prefix to provide for your experiment name. |
| description | Optional[str] | Default: None. A free-form text description for the experiment. |
| max_concurrency | Optional[int] | Default: 0. The maximum number of concurrent evaluations to run. If None, no limit is set; if 0, evaluations run sequentially. |
| num_repetitions | int | Default: 1. The number of times to run the evaluation. Each item in the dataset will be run and evaluated this many times. |
| blocking | bool | Default: True. Whether to block until the evaluation is complete. |
| experiment | Optional[EXPERIMENT_T] | Default: None. An existing experiment to extend. For advanced usage only. Should not be specified if target is an existing experiment or a two-tuple of experiments. |
| upload_results | bool | Default: True. Whether to upload the results to LangSmith. |
| error_handling | Literal['log', 'ignore'] | Default: 'log'. How to handle individual run errors: record them as part of the experiment ('log') or ignore them ('ignore'). |
| **kwargs | Any | Additional keyword arguments to pass to the evaluator. |
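The dataset-level options compose with the per-row ones. The sketch below shows summary_evaluators together with num_repetitions, using the `(runs, examples)` summary-evaluator signature; `my_app` and the dataset name are the same hypothetical stand-ins as in the earlier sketch.

```python
from langsmith import Client
from langsmith.schemas import Example, Run

client = Client()

# Hypothetical target, reused from the sketch above.
def my_app(inputs: dict) -> dict:
    return {"answer": inputs["question"].strip().lower()}

# Hypothetical summary evaluator: dataset-level metric computed over all runs at once.
def pass_rate(runs: list[Run], examples: list[Example]) -> dict:
    passed = sum(
        (run.outputs or {}).get("answer") == (example.outputs or {}).get("answer")
        for run, example in zip(runs, examples)
    )
    return {"key": "pass_rate", "score": passed / max(len(runs), 1)}

results = client.evaluate(
    my_app,
    data="my-dataset",               # hypothetical dataset name
    summary_evaluators=[pass_rate],
    num_repetitions=3,               # each example is run and scored 3 times
    metadata={"revision": "nightly"},
    description="Nightly regression sweep",
    max_concurrency=4,
)
```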