Evaluate existing experiment runs against each other.
This lets you use pairwise preference scoring to generate more reliable feedback in your experiments.
    evaluate_comparative(
        experiments: tuple[EXPERIMENT_T, EXPERIMENT_T],
        /,
        evaluators: Sequence[COMPARATIVE_EVALUATOR_T],
        experiment_prefix: Optional[str] = None,
        description: Optional[str] = None,
        max_concurrency: int = 5,
        client: Optional[langsmith.Client] = None,
        metadata: Optional[dict] = None,
        load_nested: bool = False,
        randomize_order: bool = False
    ) -> ComparativeExperimentResults

| Name | Type | Description |
|---|---|---|
| experiments* | Tuple[Union[str, uuid.UUID], Union[str, uuid.UUID]] | The identifiers of the experiments to compare. |
| evaluators* | Sequence[COMPARATIVE_EVALUATOR_T] | A list of evaluators to run on each example. |
| experiment_prefix | Optional[str] | Default: None. A prefix to provide for your experiment name. |
| description | Optional[str] | Default: None. A free-form text description for the experiment. |
| max_concurrency | int | Default: 5. The maximum number of concurrent evaluations to run. |
| client | Optional[langsmith.Client] | Default: None. The LangSmith client to use. |
| metadata | Optional[dict] | Default: None. Metadata to attach to the experiment. |
| load_nested | bool | Default: False. Whether to load all child runs for the experiment. By default, only the top-level root runs are loaded. |
| randomize_order | bool | Default: False. Whether to randomize the order of the outputs for each evaluation. |
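
Below is a minimal usage sketch, assuming the `langsmith.evaluation` import path and a comparative evaluator that receives the candidate runs plus the dataset example and returns a feedback dict keyed by run ID. The experiment names and the length-based preference heuristic are placeholders; in practice you would typically use an LLM judge to pick the preferred output.

```python
from langsmith.evaluation import evaluate_comparative


def prefer_longer_answer(runs, example):
    # Hypothetical pairwise evaluator: give the preferred run a score of 1
    # and the other run 0, keyed by run ID. Replace the length heuristic
    # with an LLM judge for real preference scoring.
    outputs = [str(run.outputs or "") for run in runs]
    winner = 0 if len(outputs[0]) >= len(outputs[1]) else 1
    return {
        "key": "preferred",
        "scores": {run.id: int(i == winner) for i, run in enumerate(runs)},
    }


results = evaluate_comparative(
    # Placeholder identifiers: names or UUIDs of two existing experiments
    # that were run over the same dataset.
    ("my-experiment-a", "my-experiment-b"),
    evaluators=[prefer_longer_answer],
    randomize_order=True,  # shuffle output order to reduce position bias
)
```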