Evaluations

Genkit provides a framework for evaluating AI output quality. You can define custom evaluators for your specific needs, use pre-built RAGAS-style metrics, manage evaluation datasets, and run evaluations against flows or models.

Define evaluators to assess specific quality dimensions. Each evaluator receives an EvalDataPoint containing the input, output, context, and reference, and returns a Score.

```java
import com.google.genkit.ai.evaluation.*;

Evaluator<Void> lengthEvaluator = genkit.defineEvaluator(
    "custom/length",
    "Output Length",
    "Evaluates whether the output has an appropriate length",
    (dataPoint, options) -> {
        String output = dataPoint.getOutput() != null
            ? dataPoint.getOutput().toString() : "";
        int length = output.length();
        // Full score inside the 50-500 character window; partial credit
        // ramps up for short outputs and decays for overly long ones.
        double score;
        if (length >= 50 && length <= 500) {
            score = 1.0;
        } else if (length < 50) {
            score = length / 50.0;
        } else {
            score = 500.0 / length;
        }
        EvalStatus status = score >= 1.0 ? EvalStatus.PASS : EvalStatus.FAIL;
        return EvalResponse.builder()
            .testCaseId(dataPoint.getTestCaseId())
            .evaluation(Score.builder()
                .score(score)
                .status(status)
                .details(ScoreDetails.builder()
                    .reasoning("Output length: " + length + " characters")
                    .build())
                .build())
            .build();
    });
```

You can define evaluators that accept typed options:

```java
Evaluator<ThresholdOptions> thresholdEvaluator = genkit.defineEvaluator(
    "custom/threshold",
    "Threshold Check",
    "Checks output against a configurable threshold",
    false, // isBilled
    ThresholdOptions.class,
    (dataPoint, options) -> {
        double threshold = options.getThreshold();
        double score = 0.0; // ... evaluate the output against the threshold
        return EvalResponse.builder()
            .testCaseId(dataPoint.getTestCaseId())
            .evaluation(Score.builder().score(score).build())
            .build();
    });
```
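`ThresholdOptions` above is an ordinary user-defined class that the evaluator's options are deserialized into. One minimal shape it might take (a sketch; the field name and default value are illustrative, not prescribed by the API):

```java
// Hypothetical options class for the threshold evaluator above.
public class ThresholdOptions {
    private double threshold = 0.5; // illustrative default

    public double getThreshold() {
        return threshold;
    }

    public void setThreshold(double threshold) {
        this.threshold = threshold;
    }

    public static void main(String[] args) {
        ThresholdOptions options = new ThresholdOptions();
        options.setThreshold(0.8);
        System.out.println(options.getThreshold()); // 0.8
    }
}
```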

Each data point passed to your evaluator contains:

| Field | Type | Description |
| --- | --- | --- |
| `testCaseId` | `String` | Unique identifier for the test case |
| `input` | `Object` | The input sent to the flow/model |
| `output` | `Object` | The output produced by the flow/model |
| `context` | `List<String>` | Retrieved context documents (for RAG) |
| `reference` | `Object` | Expected/reference output |
| `custom` | `Map<String, Object>` | Custom metadata |
| `traceIds` | `List<String>` | Associated trace IDs |

Scores can be numeric, boolean, or string:

```java
Score.builder().score(0.95).status(EvalStatus.PASS).build()   // numeric
Score.builder().score(true).status(EvalStatus.PASS).build()   // boolean
Score.builder().score("good").status(EvalStatus.PASS).build() // categorical
```

The EvalStatus enum has three values: PASS, FAIL, and UNKNOWN.
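PASS and FAIL map naturally to thresholded scores; UNKNOWN is a reasonable choice when an evaluator cannot reach a verdict at all (for example, an empty output or a failed judge call). A plain-Java sketch of that mapping (the helper name and threshold are illustrative, not part of the API):

```java
public class StatusMapping {
    // Maps a numeric score to a status name, treating NaN as inconclusive.
    static String toStatus(double score, double passThreshold) {
        if (Double.isNaN(score)) {
            return "UNKNOWN"; // the evaluator could not produce a verdict
        }
        return score >= passThreshold ? "PASS" : "FAIL";
    }

    public static void main(String[] args) {
        System.out.println(toStatus(0.95, 0.8));       // PASS
        System.out.println(toStatus(0.3, 0.8));        // FAIL
        System.out.println(toStatus(Double.NaN, 0.8)); // UNKNOWN
    }
}
```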

Genkit includes a DatasetStore for managing evaluation datasets. The default implementation (LocalFileDatasetStore) persists datasets to local files.

```java
DatasetStore datasetStore = genkit.getDatasetStore();

// Create a dataset
List<DatasetSample> samples = List.of(
    DatasetSample.builder()
        .testCaseId("food_1")
        .input("pizza")
        .reference("A delicious Italian dish with a crispy crust and savory toppings")
        .build(),
    DatasetSample.builder()
        .testCaseId("food_2")
        .input("sushi")
        .reference("Fresh, delicate rolls of rice and fish from Japan")
        .build());

DatasetMetadata metadata = datasetStore.createDataset(
    CreateDatasetRequest.builder()
        .datasetId("food_descriptions")
        .data(samples)
        .datasetType(DatasetType.FLOW)
        .targetAction("/flow/describeFood")
        .metricRefs(List.of("custom/length", "custom/keywords"))
        .build());

// List, read, and update datasets
List<DatasetMetadata> datasets = datasetStore.listDatasets();
DatasetMetadata dataset = datasetStore.getDataset("food_descriptions");

datasetStore.updateDataset(UpdateDatasetRequest.builder()
    .datasetId("food_descriptions")
    .data(updatedSamples)
    .version(dataset.getVersion()) // optimistic concurrency
    .build());
```
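The version passed to updateDataset enables optimistic concurrency: an update should only apply if no other writer has changed the dataset since it was read. An in-memory sketch of that version check (illustrative only; not how LocalFileDatasetStore is implemented):

```java
import java.util.HashMap;
import java.util.Map;

public class VersionedStore {
    private final Map<String, Integer> versions = new HashMap<>();
    private final Map<String, String> data = new HashMap<>();

    // Applies the update only if the caller read the current version.
    boolean update(String id, String newData, int expectedVersion) {
        int current = versions.getOrDefault(id, 0);
        if (current != expectedVersion) {
            return false; // stale read: another writer got there first
        }
        data.put(id, newData);
        versions.put(id, current + 1);
        return true;
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        System.out.println(store.update("ds", "v1", 0)); // true
        System.out.println(store.update("ds", "v2", 0)); // false: version is now 1
        System.out.println(store.update("ds", "v2", 1)); // true
    }
}
```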
| Type | Description |
| --- | --- |
| `DatasetType.FLOW` | Dataset targets a flow |
| `DatasetType.MODEL` | Dataset targets a model |
| `DatasetType.CUSTOM` | Custom dataset type |

Run evaluations against a flow or model using a dataset:

```java
EvalRunKey result = genkit.evaluate(
    RunEvaluationRequest.builder()
        .dataSource(DataSource.builder()
            .datasetId("food_descriptions")
            .build())
        .targetAction("/flow/describeFood")
        .evaluators(List.of("custom/length", "custom/keywords"))
        .build());
```

You can also provide data inline without creating a dataset first:

```java
EvalRunKey result = genkit.evaluate(
    RunEvaluationRequest.builder()
        .dataSource(DataSource.builder()
            .data(List.of(
                DatasetSample.builder()
                    .testCaseId("test_1")
                    .input("What is Genkit?")
                    .reference("Genkit is an AI framework")
                    .build()))
            .build())
        .targetAction("/flow/myFlow")
        .evaluators(List.of("custom/length"))
        .build());
```

Evaluation results are stored in an EvalStore. The default implementation (LocalFileEvalStore) persists to local files.

```java
EvalStore evalStore = genkit.getEvalStore();

// List all evaluation runs
List<EvalRunKey> runs = evalStore.list();

// Load a specific run
EvalRun run = evalStore.load(result); // result from evaluate()

for (EvalResult evalResult : run.getResults()) {
    System.out.println("Test: " + evalResult.getTestCaseId());
    for (EvalMetric metric : evalResult.getMetrics()) {
        System.out.println("  " + metric.getEvaluator()
            + ": " + metric.getScore()
            + " (" + metric.getStatus() + ")");
    }
}
```
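Once a run is loaded, a common follow-up is aggregating the per-metric results, for example computing a pass rate per evaluator. A standalone sketch of that aggregation over (evaluator, status) pairs (plain Java; the `MetricRow` type is illustrative, not part of the Genkit API):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PassRate {
    // Hypothetical flat view of one metric result from an eval run.
    record MetricRow(String evaluator, String status) {}

    // Fraction of PASS statuses per evaluator.
    static Map<String, Double> passRates(List<MetricRow> rows) {
        return rows.stream().collect(Collectors.groupingBy(
            MetricRow::evaluator,
            Collectors.averagingDouble(r -> "PASS".equals(r.status()) ? 1.0 : 0.0)));
    }

    public static void main(String[] args) {
        var rows = List.of(
            new MetricRow("custom/length", "PASS"),
            new MetricRow("custom/length", "FAIL"),
            new MetricRow("custom/keywords", "PASS"));
        // pass rates: custom/length = 0.5, custom/keywords = 1.0
        System.out.println(passRates(rows));
    }
}
```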

Use the evaluators plugin for RAGAS-style metrics without writing custom evaluators:

```java
import com.google.genkit.plugins.evaluators.EvaluatorsPlugin;
import com.google.genkit.plugins.evaluators.EvaluatorsPluginOptions;
import com.google.genkit.plugins.evaluators.GenkitMetric;

Genkit genkit = Genkit.builder()
    .plugin(OpenAIPlugin.create())
    .plugin(EvaluatorsPlugin.create(
        EvaluatorsPluginOptions.builder()
            .judge("openai/gpt-4o-mini")
            .metrics(List.of(
                GenkitMetric.FAITHFULNESS,
                GenkitMetric.ANSWER_RELEVANCY,
                GenkitMetric.ANSWER_ACCURACY,
                GenkitMetric.MALICIOUSNESS,
                GenkitMetric.REGEX,
                GenkitMetric.DEEP_EQUAL,
                GenkitMetric.JSONATA))
            .build()))
    .build();
```

Or enable all metrics at once:

```java
EvaluatorsPluginOptions.builder()
    .useAllMetrics()
    .judge("openai/gpt-4o-mini")
    .embedder("openai/text-embedding-3-small")
    .build()
```
| Metric | Type | Description |
| --- | --- | --- |
| `FAITHFULNESS` | LLM-based | Factual accuracy against provided context |
| `ANSWER_RELEVANCY` | LLM-based | Answer pertains to the question |
| `ANSWER_ACCURACY` | LLM-based | Matches reference answer |
| `MALICIOUSNESS` | LLM-based | Detects harmful content |
| `REGEX` | Programmatic | Pattern matching against output |
| `DEEP_EQUAL` | Programmatic | JSON deep equality comparison |
| `JSONATA` | Programmatic | JSONata expression evaluation |

LLM-based metrics require a judge model. Programmatic metrics run locally without any API calls.
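For intuition, the programmatic metrics reduce to local checks along these lines (a JDK-only sketch, not the plugin's actual implementation):

```java
import java.util.Map;
import java.util.Objects;
import java.util.regex.Pattern;

public class ProgrammaticChecks {
    // REGEX-style: does the output match the reference pattern?
    static boolean regexCheck(String output, String pattern) {
        return Pattern.compile(pattern).matcher(output).find();
    }

    // DEEP_EQUAL-style: structural equality of already-parsed values.
    static boolean deepEqualCheck(Object output, Object reference) {
        return Objects.deepEquals(output, reference);
    }

    public static void main(String[] args) {
        System.out.println(regexCheck("Genkit is an AI framework", "(?i)genkit")); // true
        System.out.println(deepEqualCheck(
            Map.of("name", "pizza"), Map.of("name", "pizza"))); // true
    }
}
```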

See the evaluators plugin docs for full details.