langsmith.client.Client.evaluate
Method • Since v0.2

evaluate

Evaluate a target system on a given dataset.

evaluate(
  self,
  target: Union[TARGET_T, Runnable, EXPERIMENT_T, tuple[EXPERIMENT_T, EXPERIMENT_T]],
  /,
  data: Optional[DATA_T] = None,
  evaluators: Optional[Union[Sequence[EVALUATOR_T], Sequence[COMPARATIVE_EVALUATOR_T]]] = None,
  summary_evaluators: Optional[Sequence[SUMMARY_EVALUATOR_T]] = None,
  metadata: Optional[dict] = None,
  experiment_prefix: Optional[str] = None,
  description: Optional[str] = None,
  max_concurrency: Optional[int] = 0,
  num_repetitions: int = 1,
  blocking: bool = True,
  experiment: Optional[EXPERIMENT_T] = None,
  upload_results: bool = True,
  error_handling: Literal['log', 'ignore'] = 'log',
  **kwargs: Any
) -> Union[ExperimentResults, ComparativeExperimentResults]
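
A minimal usage sketch, for orientation: a plain Python function as the target, a simple row-level evaluator, and a dataset referenced by name. The dataset name, target logic, and evaluator below are illustrative placeholders, not part of the API.

from langsmith import Client

client = Client()

# Hypothetical target: takes an example's inputs dict and returns an outputs dict.
def my_target(inputs: dict) -> dict:
    return {"answer": inputs["question"].strip().lower()}

# Hypothetical row-level evaluator: compares the target's outputs to the
# example's reference outputs; a bool return value is treated as a score.
def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["answer"] == reference_outputs["answer"]

results = client.evaluate(
    my_target,
    data="my-qa-dataset",          # assumed existing dataset name
    evaluators=[exact_match],
    experiment_prefix="baseline",
    max_concurrency=4,
)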

Used in Docs

  • How to add evaluators to an existing experiment (Python only)
  • How to define a code evaluator
  • How to define an LLM-as-a-judge evaluator
  • How to evaluate an LLM application
  • How to evaluate with repetitions

Parameters

target (required) : Union[TARGET_T, Runnable, EXPERIMENT_T, tuple[EXPERIMENT_T, EXPERIMENT_T]]

The target system or experiment(s) to evaluate. Can be a function that takes a dict and returns a dict, a langchain Runnable, an existing experiment ID, or a two-tuple of experiment IDs for comparative evaluation (see the sketches after this parameter list).

data : Optional[DATA_T], default: None

The dataset to evaluate on. Can be a dataset name, a list of examples, or a generator of examples.

evaluators : Optional[Union[Sequence[EVALUATOR_T], Sequence[COMPARATIVE_EVALUATOR_T]]], default: None

A list of evaluators to run on each example. The required evaluator signature depends on the target type (see the sketches after this parameter list).

summary_evaluators : Optional[Sequence[SUMMARY_EVALUATOR_T]], default: None

A list of summary evaluators to run on the entire dataset. Should not be specified when comparing two existing experiments.

metadata : Optional[dict], default: None

Metadata to attach to the experiment.

experiment_prefix : Optional[str], default: None

A prefix for your experiment name.

description : Optional[str], default: None

A free-form text description for the experiment.

max_concurrency : Optional[int], default: 0

The maximum number of concurrent evaluations to run. If None, no limit is set; if 0, evaluations run without concurrency.

num_repetitions : int, default: 1

The number of times to run the evaluation. Each item in the dataset is run and evaluated this many times.

blocking : bool, default: True

Whether to block until the evaluation is complete.

experiment : Optional[EXPERIMENT_T], default: None

An existing experiment to extend. If provided, experiment_prefix is ignored. For advanced usage only; should not be specified if target is an existing experiment or a two-tuple of experiments.

upload_results : bool, default: True

Whether to upload the results to LangSmith.

error_handling : Literal['log', 'ignore'], default: 'log'

How to handle individual run errors. 'log' traces failed runs, with their error messages, as part of the experiment; 'ignore' does not count failed runs as part of the experiment at all.

**kwargs : Any

Additional keyword arguments to pass to the evaluator.
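
The evaluator shapes referenced above, sketched for illustration and continuing from the sketch beneath the signature (same client and my_target). These are hedged examples: the dataset and experiment names, score keys, and evaluator bodies are placeholders, and the simplified keyword-argument signatures shown here (inputs / outputs / reference_outputs) are one accepted style; see the evaluation guides linked under "Used in Docs" for the full set of supported signatures.

# Row-level evaluator for a function/Runnable target: receives one example's
# inputs, the target's outputs, and the reference outputs, and returns a score.
def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    return {"key": "correctness", "score": outputs["answer"] == reference_outputs["answer"]}

# Summary evaluator: receives the outputs for the whole dataset and returns one
# aggregate metric for the experiment.
def accuracy_summary(outputs: list[dict], reference_outputs: list[dict]) -> dict:
    correct = sum(o["answer"] == r["answer"] for o, r in zip(outputs, reference_outputs))
    return {"key": "accuracy", "score": correct / len(outputs)}

results = client.evaluate(
    my_target,
    data="my-qa-dataset",               # placeholder dataset name
    evaluators=[correctness],
    summary_evaluators=[accuracy_summary],
)

# Comparative evaluation: pass a two-tuple of existing experiment names/IDs as
# the target, with comparative evaluators that score each example's outputs from
# both experiments (placeholder names; summary_evaluators are not used here).
def prefer_shorter(inputs: dict, outputs: list[dict]) -> list[float]:
    lengths = [len(o["answer"]) for o in outputs]
    return [1.0 if n == min(lengths) else 0.0 for n in lengths]

comparison = client.evaluate(
    ("experiment-a", "experiment-b"),   # placeholder experiment names/IDs
    evaluators=[prefer_shorter],
)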

View source on GitHub