Function · Since v0.1

evaluate

evaluate(
    target: Union[TARGET_T, Runnable, EXPERIMENT_T, tuple[EXPERIMENT_T, EXPERIMENT_T]],
    /,
    data: Optional[DATA_T] = None,
    evaluators: Optional[Union[Sequence[EVALUATOR_T], Sequence[COMPARATIVE_EVALUATOR_T]]] = None,
    summary_evaluators: Optional[Sequence[SUMMARY_EVALUATOR_T]] = None,
    metadata: Optional[dict] = None,
    experiment_prefix: Optional[str] = None,
    description: Optional[str] = None,
    max_concurrency: Optional[int] = 0,
    num_repetitions: int = 1,
    client: Optional[langsmith.Client] = None,
    blocking: bool = True,
    experiment: Optional[EXPERIMENT_T] = None,
    upload_results: bool = True,
    error_handling: Literal['log', 'ignore'] = 'log',
    **kwargs: Any,
) -> Union[ExperimentResults, ComparativeExperimentResults]

Evaluate a target system on a given dataset.

Used in Docs

  • How to add evaluators to an existing experiment (Python only)
  • How to define a code evaluator
  • How to define an LLM-as-a-judge evaluator
  • How to evaluate an LLM application
  • How to evaluate with repetitions
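A minimal usage sketch of the call above, assuming a recent langsmith SDK: my_app, exact_match, the dataset name "my-dataset", and the experiment prefix are placeholder names, and the (inputs, outputs, reference_outputs) evaluator signature is one of several accepted forms (older SDKs expect (run, example)).

from langsmith.evaluation import evaluate

# Placeholder target: any callable that maps an example's inputs dict to an outputs dict.
def my_app(inputs: dict) -> dict:
    return {"answer": inputs["question"].strip().lower()}

# Placeholder row-level evaluator returning a named boolean score.
def exact_match(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    return {"key": "exact_match", "score": outputs["answer"] == reference_outputs["answer"]}

results = evaluate(
    my_app,                           # target is positional-only
    data="my-dataset",                # assumed dataset name in your LangSmith workspace
    evaluators=[exact_match],
    experiment_prefix="my-app-baseline",
    max_concurrency=4,                # default 0 evaluates examples serially
)

Because max_concurrency now defaults to 0 (no concurrency), pass a positive value, or None for no limit, if you want examples evaluated in parallel.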

Parameters

target* (TARGET_T | Runnable | EXPERIMENT_T | tuple[EXPERIMENT_T, EXPERIMENT_T])
  The target system or experiment(s) to evaluate. Can be a function that takes a dict and returns a dict, a langchain Runnable, an existing experiment ID, or a two-tuple of experiment IDs.

data (DATA_T | None), default: None
  The dataset to evaluate on. Can be a dataset name, a list of examples, or a generator of examples.

evaluators (Sequence[EVALUATOR_T] | Sequence[COMPARATIVE_EVALUATOR_T] | None), default: None
  A list of evaluators to run on each example. The evaluator signature depends on the target type.

summary_evaluators (Sequence[SUMMARY_EVALUATOR_T] | None), default: None
  A list of summary evaluators to run on the entire dataset. Should not be specified if comparing two existing experiments.

metadata (dict | None), default: None
  Metadata to attach to the experiment.

experiment_prefix (str | None), default: None
  A prefix to provide for your experiment name.

description (str | None), default: None
  A free-form text description for the experiment.

max_concurrency (int | None), default: 0
  The maximum number of concurrent evaluations to run. If None then no limit is set. If 0 then no concurrency. Note: the default was updated from None (no limit on concurrency) to 0 (no concurrency at all).

client (langsmith.Client | None), default: None
  The LangSmith client to use.

blocking (bool), default: True
  Whether to block until the evaluation is complete.

num_repetitions (int), default: 1
  The number of times to run the evaluation. Each item in the dataset will be run and evaluated this many times.

experiment (schemas.TracerSession | None), default: None
  An existing experiment to extend. If provided, experiment_prefix is ignored. For advanced usage only. Should not be specified if target is an existing experiment or a two-tuple of experiments.

error_handling (Literal['log', 'ignore']), default: 'log'
  How to handle individual run errors. 'log' will trace the runs with the error message as part of the experiment; 'ignore' will not count the run as part of the experiment at all.
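
A sketch of the summary-evaluator and repetition parameters described above, again with placeholder names (my_app, accuracy_summary, "my-dataset"); the (runs, examples) summary-evaluator signature shown here is the long-standing form, and the aggregate score it returns applies to the experiment as a whole rather than to individual rows.

from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run

# Placeholder target, as above.
def my_app(inputs: dict) -> dict:
    return {"answer": inputs["question"].strip().lower()}

# Placeholder summary evaluator: receives every run and example in the experiment
# and returns one aggregate metric for the whole dataset.
def accuracy_summary(runs: list[Run], examples: list[Example]) -> dict:
    correct = sum(
        (run.outputs or {}).get("answer") == (example.outputs or {}).get("answer")
        for run, example in zip(runs, examples)
    )
    return {"key": "accuracy", "score": correct / max(len(runs), 1)}

results = evaluate(
    my_app,
    data="my-dataset",
    summary_evaluators=[accuracy_summary],
    num_repetitions=3,       # each example is run and scored three times
    blocking=False,          # return immediately; results populate as runs finish
    error_handling="log",    # failed runs are traced as part of the experiment
)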