Function · Since v0.1

evaluate_comparative

evaluate_comparative(
    experiments: tuple[EXPERIMENT_T, EXPERIMENT_T],
    /,
    evaluators: Sequence[COMPARATIVE_EVALUATOR_T],
    experiment_prefix: Optional[str] = None,
    description: Optional[str] = None,
    max_concurrency: int = 5,
    client: Optional[langsmith.Client] = None,
    metadata: Optional[dict] = None,
    load_nested: bool = False,
    randomize_order: bool = False,
) -> ComparativeExperimentResults

Evaluate existing experiment runs against each other.

This lets you use pairwise preference scoring to generate more reliable feedback in your experiments.

Parameters

Name | Type | Default | Description
experiments* | Tuple[Union[str, uuid.UUID], Union[str, uuid.UUID]] | (required) | The identifiers of the experiments to compare.
evaluators* | Sequence[COMPARATIVE_EVALUATOR_T] | (required) | A list of evaluators to run on each example.
experiment_prefix | Optional[str] | None | A prefix to provide for your experiment name.
description | Optional[str] | None | A free-form text description for the experiment.
max_concurrency | int | 5 | The maximum number of concurrent evaluations to run.
client | Optional[langsmith.Client] | None | The LangSmith client to use.
metadata | Optional[dict] | None | Metadata to attach to the experiment.
load_nested | bool | False | Whether to load all child runs for the experiment. By default, only the top-level root runs are loaded.
randomize_order | bool | False | Whether to randomize the order of the outputs for each evaluation.
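
As a rough illustration of how these parameters fit together, here is a minimal sketch of a pairwise evaluator and a call to evaluate_comparative. It assumes the evaluator form that receives the list of runs plus the example and returns a dict with a "key" and per-run "scores"; check that against the SDK version you have installed. StubRun, prefer_shorter, compare, the "answer" output field, and the experiment prefix are hypothetical names chosen for this example, not part of LangSmith.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Optional, Sequence


@dataclass
class StubRun:
    """Hypothetical stand-in for a LangSmith run, exposing only `id` and `outputs`."""

    outputs: dict
    id: uuid.UUID = field(default_factory=uuid.uuid4)


def prefer_shorter(runs: Sequence[Any], example: Optional[Any] = None) -> dict:
    """Pairwise evaluator: score 1 for the shorter answer, 0 for the other.

    Assumes each run stores its text under outputs["answer"] (an assumption
    made for this sketch). Ties go to the first run.
    """
    lengths = [len(str((run.outputs or {}).get("answer", ""))) for run in runs]
    winner = 0 if lengths[0] <= lengths[1] else 1
    return {
        "key": "prefer_shorter",
        "scores": {run.id: int(i == winner) for i, run in enumerate(runs)},
    }


def compare(experiment_a: str, experiment_b: str):
    """Compare two existing experiments; needs langsmith installed and an API key."""
    # Imported lazily so the evaluator above can be tested offline.
    from langsmith.evaluation import evaluate_comparative

    return evaluate_comparative(
        (experiment_a, experiment_b),  # names or IDs of existing experiments
        evaluators=[prefer_shorter],
        experiment_prefix="shorter-vs-longer",  # hypothetical prefix
        max_concurrency=5,
        randomize_order=True,  # shuffle output order for each evaluation
    )
```

The evaluator can be exercised locally with two StubRun objects before it is passed to evaluate_comparative. Setting randomize_order=True is a common choice when the judge might favor whichever output it sees first.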