Compare the output of two models (or two outputs of the same model).
Grade, tag, or otherwise evaluate predictions relative to their inputs and/or reference labels.
Evaluate the perplexity of a predicted string.
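
For reference, perplexity is conventionally the exponential of the average negative log-likelihood per token. The following is a minimal sketch of that calculation, assuming per-token natural-log probabilities for the predicted string are already available from some language model. The function name `perplexity` and its input format are hypothetical and are not taken from any particular library.

```python
import math


def perplexity(token_logprobs):
    """Return the perplexity of a string given its per-token log-probabilities.

    `token_logprobs` is a hypothetical input: a sequence of natural-log
    probabilities, one per token of the predicted string.
    """
    if not token_logprobs:
        raise ValueError("at least one token log-probability is required")
    # Average negative log-likelihood per token.
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    # Perplexity is the exponential of that average.
    return math.exp(avg_nll)


if __name__ == "__main__":
    # Four tokens, each assigned probability 0.25 by the (assumed) model.
    print(perplexity([math.log(0.25)] * 4))
```

Lower values mean the model found the string less surprising. A string whose tokens each receive probability 1/k has perplexity k, which is a useful sanity check.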