Evaluate an async target system on a given dataset.
aevaluate(
    self,
    target: Union[ATARGET_T, AsyncIterable[dict], Runnable, str, uuid.UUID, schemas.TracerSession],
    data: Union[DATA_T, AsyncIterable[schemas.Example], Iterable[schemas.Example], None] = None,
    evaluators: Optional[Sequence[Union[EVALUATOR_T, AEVALUATOR_T]]] = None,
    summary_evaluators: Optional[Sequence[SUMMARY_EVALUATOR_T]] = None,
    metadata: Optional[dict] = None,
    experiment_prefix: Optional[str] = None,
    description: Optional[str] = None,
    max_concurrency: Optional[int] = 0,
    num_repetitions: int = 1,
    blocking: bool = True,
    experiment: Optional[Union[schemas.TracerSession, str, uuid.UUID]] = None,
    upload_results: bool = True,
    error_handling: Literal['log', 'ignore'] = 'log',
    **kwargs: Any,
) -> AsyncExperimentResults
Environment:
LANGSMITH_TEST_CACHE: If set, API calls will be cached to disk to save time and cost during testing. Recommended to commit the cache files to your repository for faster CI/CD runs. Requires the 'langsmith[vcr]' package to be installed.
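For orientation, here is a minimal usage sketch. The dataset name `my-dataset`, the `my_target` coroutine, and the `exact_match` evaluator are illustrative placeholders, not part of the API.

```python
import asyncio

from langsmith import Client
from langsmith.schemas import Example, Run

client = Client()

# Hypothetical async target: receives an example's inputs dict, returns an outputs dict.
async def my_target(inputs: dict) -> dict:
    return {"answer": inputs["question"].strip().lower()}

# Hypothetical row-level evaluator using the (run, example) calling convention.
def exact_match(run: Run, example: Example) -> dict:
    predicted = (run.outputs or {}).get("answer")
    expected = (example.outputs or {}).get("answer")
    return {"key": "exact_match", "score": int(predicted == expected)}

async def main() -> None:
    # Runs my_target over every example in the dataset and scores each run.
    results = await client.aevaluate(
        my_target,
        data="my-dataset",            # assumed dataset name in your workspace
        evaluators=[exact_match],
        experiment_prefix="baseline",
        max_concurrency=4,            # evaluate up to 4 examples concurrently
    )

asyncio.run(main())
```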
| Name | Type | Default | Description |
|---|---|---|---|
| target* | Union[ATARGET_T, AsyncIterable[dict], Runnable, str, uuid.UUID, TracerSession] | — | The target system or experiment(s) to evaluate. Can be an async function that takes a dict and returns a dict, a langchain Runnable, or a reference to an existing experiment (str, uuid.UUID, or TracerSession). |
| data | Union[DATA_T, AsyncIterable[Example], Iterable[Example], None] | None | The dataset to evaluate on. Can be a dataset name, a list of examples, an async generator of examples, or an async iterable of examples. |
| evaluators | Optional[Sequence[Union[EVALUATOR_T, AEVALUATOR_T]]] | None | A list of evaluators to run on each example. |
| summary_evaluators | Optional[Sequence[SUMMARY_EVALUATOR_T]] | None | A list of summary evaluators to run on the entire dataset. |
| metadata | Optional[dict] | None | Metadata to attach to the experiment. |
| experiment_prefix | Optional[str] | None | A prefix to provide for your experiment name. |
| description | Optional[str] | None | A description of the experiment. |
| max_concurrency | Optional[int] | 0 | The maximum number of concurrent evaluations to run. If None, no limit is set; if 0, no concurrency (evaluations run sequentially). |
| num_repetitions | int | 1 | The number of times to run the evaluation. Each item in the dataset will be run and evaluated this many times. |
| blocking | bool | True | Whether to block until the evaluation is complete. |
| experiment | Optional[Union[TracerSession, str, uuid.UUID]] | None | An existing experiment to extend. If provided, experiment_prefix is ignored. For advanced usage only. |
| upload_results | bool | True | Whether to upload the results to LangSmith. |
| error_handling | Literal['log', 'ignore'] | 'log' | How to handle individual run errors. |
| **kwargs | Any | — | Additional keyword arguments to pass to the evaluator. |
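Building on the names from the sketch above (`client`, `my_target`, `exact_match`), the following hedged example exercises the dataset-level options. The `pass_rate` summary evaluator, the metadata values, and the dataset name are illustrative assumptions, not part of the API.

```python
from langsmith.schemas import Example, Run

# Hypothetical summary evaluator: receives every run and example in the experiment
# and returns a single aggregate metric for the whole dataset.
def pass_rate(runs: list[Run], examples: list[Example]) -> dict:
    passed = sum(1 for run in runs if run.error is None)
    return {"key": "pass_rate", "score": passed / max(len(runs), 1)}

async def run_repeated_experiment() -> None:
    await client.aevaluate(
        my_target,
        data="my-dataset",                  # assumed dataset name
        evaluators=[exact_match],
        summary_evaluators=[pass_rate],
        num_repetitions=3,                  # run and score each example 3 times
        max_concurrency=8,
        metadata={"revision": "baseline"},  # illustrative experiment metadata
        description="Repeated baseline run with an aggregate pass rate.",
    )
```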