Evaluations
Genkit provides a framework for evaluating AI output quality. You can define custom evaluators for your specific needs, use pre-built RAGAS-style metrics, manage evaluation datasets, and run evaluations against flows or models.
Custom evaluators
Define evaluators to assess specific quality dimensions. Each evaluator receives an `EvalDataPoint` containing the input, output, context, and reference, and returns a `Score`.
```java
import com.google.genkit.ai.evaluation.*;

Evaluator<Void> lengthEvaluator = genkit.defineEvaluator(
    "custom/length",
    "Output Length",
    "Evaluates whether the output has an appropriate length",
    (dataPoint, options) -> {
      String output = dataPoint.getOutput() != null ? dataPoint.getOutput().toString() : "";
      int length = output.length();
      double score = (length >= 50 && length <= 500) ? 1.0 : Math.min(length / 50.0, 1.0);
      EvalStatus status = score >= 1.0 ? EvalStatus.PASS : EvalStatus.FAIL;

      return EvalResponse.builder()
          .testCaseId(dataPoint.getTestCaseId())
          .evaluation(Score.builder()
              .score(score)
              .status(status)
              .details(ScoreDetails.builder()
                  .reasoning("Output length: " + length + " characters")
                  .build())
              .build())
          .build();
    });
```

Evaluator with options
You can define evaluators that accept typed options:

```java
Evaluator<ThresholdOptions> thresholdEvaluator = genkit.defineEvaluator(
    "custom/threshold",
    "Threshold Check",
    "Checks output against a configurable threshold",
    false, // isBilled
    ThresholdOptions.class,
    (dataPoint, options) -> {
      double threshold = options.getThreshold();
      double score = 0.0; // ... evaluate against threshold
      return EvalResponse.builder()
          .testCaseId(dataPoint.getTestCaseId())
          .evaluation(Score.builder().score(score).build())
          .build();
    });
```

EvalDataPoint fields
Each data point passed to your evaluator contains:
| Field | Type | Description |
|---|---|---|
| `testCaseId` | `String` | Unique identifier for the test case |
| `input` | `Object` | The input sent to the flow/model |
| `output` | `Object` | The output produced by the flow/model |
| `context` | `List<String>` | Retrieved context documents (for RAG) |
| `reference` | `Object` | Expected/reference output |
| `custom` | `Map<String, Object>` | Custom metadata |
| `traceIds` | `List<String>` | Associated trace IDs |
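
To illustrate how an evaluator might combine these fields, here is a plain-Java sketch (no Genkit types; the class and method names are illustrative) that scores the word overlap between an output and a reference, the same comparison an evaluator could perform on `dataPoint.getOutput()` and `dataPoint.getReference()`:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class KeywordOverlap {
    // Fraction of reference words that also appear in the output (0.0 - 1.0).
    public static double score(String output, String reference) {
        Set<String> outWords = new HashSet<>(Arrays.asList(output.toLowerCase().split("\\W+")));
        String[] refWords = reference.toLowerCase().split("\\W+");
        if (refWords.length == 0) return 0.0;
        long hits = Arrays.stream(refWords).filter(outWords::contains).count();
        return (double) hits / refWords.length;
    }

    public static void main(String[] args) {
        System.out.println(score("Pizza is an Italian dish", "Italian dish")); // prints 1.0
    }
}
```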
Score values
Scores can be numeric, boolean, or string:

```java
Score.builder().score(0.95).status(EvalStatus.PASS).build()   // numeric
Score.builder().score(true).status(EvalStatus.PASS).build()   // boolean
Score.builder().score("good").status(EvalStatus.PASS).build() // categorical
```

The `EvalStatus` enum has three values: `PASS`, `FAIL`, and `UNKNOWN`.
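
When you compute numeric scores, you still need to decide how they map onto a status. A small plain-Java helper sketching one possible mapping (the threshold policy and helper name are illustrative, not a Genkit API):

```java
public class StatusMapper {
    public enum EvalStatus { PASS, FAIL, UNKNOWN }

    // Maps a numeric score in [0, 1] onto a status given a pass threshold;
    // out-of-range scores are treated as UNKNOWN. Illustrative policy only.
    public static EvalStatus fromScore(double score, double passThreshold) {
        if (score < 0.0 || score > 1.0) return EvalStatus.UNKNOWN;
        return score >= passThreshold ? EvalStatus.PASS : EvalStatus.FAIL;
    }

    public static void main(String[] args) {
        System.out.println(fromScore(0.95, 0.8)); // prints PASS
    }
}
```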
Managing datasets
Genkit includes a `DatasetStore` for managing evaluation datasets. The default implementation (`LocalFileDatasetStore`) persists datasets to local files.
Creating a dataset
```java
DatasetStore datasetStore = genkit.getDatasetStore();

List<DatasetSample> samples = List.of(
    DatasetSample.builder()
        .testCaseId("food_1")
        .input("pizza")
        .reference("A delicious Italian dish with a crispy crust and savory toppings")
        .build(),
    DatasetSample.builder()
        .testCaseId("food_2")
        .input("sushi")
        .reference("Fresh, delicate rolls of rice and fish from Japan")
        .build());

DatasetMetadata metadata = datasetStore.createDataset(
    CreateDatasetRequest.builder()
        .datasetId("food_descriptions")
        .data(samples)
        .datasetType(DatasetType.FLOW)
        .targetAction("/flow/describeFood")
        .metricRefs(List.of("custom/length", "custom/keywords"))
        .build());
```

Listing and retrieving datasets
```java
List<DatasetMetadata> datasets = datasetStore.listDatasets();

DatasetMetadata dataset = datasetStore.getDataset("food_descriptions");
```

Updating a dataset

```java
datasetStore.updateDataset(UpdateDatasetRequest.builder()
    .datasetId("food_descriptions")
    .data(updatedSamples)
    .version(dataset.getVersion()) // optimistic concurrency
    .build());
```

Dataset types
| Type | Description |
|---|---|
| `DatasetType.FLOW` | Dataset targets a flow |
| `DatasetType.MODEL` | Dataset targets a model |
| `DatasetType.CUSTOM` | Custom dataset type |
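
One note on the `version(...)` value passed to `updateDataset` above: optimistic concurrency means the store rejects an update whose version is stale because another writer changed the dataset first. A rough plain-Java sketch of that check (illustrative only, not the actual `LocalFileDatasetStore` implementation):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class VersionedStore {
    // datasetId -> current version number
    private final Map<String, Integer> versions = new ConcurrentHashMap<>();

    public int create(String id) {
        versions.put(id, 1);
        return 1;
    }

    // Applies the update only if the caller read the latest version;
    // otherwise the write is rejected and the caller must re-read.
    public synchronized int update(String id, int expectedVersion) {
        int current = versions.getOrDefault(id, 0);
        if (current != expectedVersion) {
            throw new IllegalStateException(
                "Stale version: expected " + expectedVersion + " but store has " + current);
        }
        int next = current + 1;
        versions.put(id, next);
        return next;
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        int v1 = store.create("food_descriptions");
        System.out.println("version after update: " + store.update("food_descriptions", v1));
    }
}
```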
Running evaluations
Run evaluations against a flow or model using a dataset:

```java
EvalRunKey result = genkit.evaluate(
    RunEvaluationRequest.builder()
        .dataSource(DataSource.builder()
            .datasetId("food_descriptions")
            .build())
        .targetAction("/flow/describeFood")
        .evaluators(List.of("custom/length", "custom/keywords"))
        .build());
```

Inline data (no dataset)
You can also provide data inline without creating a dataset first:

```java
EvalRunKey result = genkit.evaluate(
    RunEvaluationRequest.builder()
        .dataSource(DataSource.builder()
            .data(List.of(
                DatasetSample.builder()
                    .testCaseId("test_1")
                    .input("What is Genkit?")
                    .reference("Genkit is an AI framework")
                    .build()))
            .build())
        .targetAction("/flow/myFlow")
        .evaluators(List.of("custom/length"))
        .build());
```

Viewing evaluation results
Evaluation results are stored in an `EvalStore`. The default implementation (`LocalFileEvalStore`) persists to local files.
```java
EvalStore evalStore = genkit.getEvalStore();

// List all evaluation runs
List<EvalRunKey> runs = evalStore.list();

// Load a specific run
EvalRun run = evalStore.load(result); // result from evaluate()

for (EvalResult evalResult : run.getResults()) {
  System.out.println("Test: " + evalResult.getTestCaseId());
  for (EvalMetric metric : evalResult.getMetrics()) {
    System.out.println("  " + metric.getEvaluator() + ": "
        + metric.getScore() + " (" + metric.getStatus() + ")");
  }
}
```

Pre-built evaluators plugin
Use the evaluators plugin for RAGAS-style metrics without writing custom evaluators:
```java
import com.google.genkit.plugins.evaluators.EvaluatorsPlugin;
import com.google.genkit.plugins.evaluators.EvaluatorsPluginOptions;
import com.google.genkit.plugins.evaluators.GenkitMetric;

Genkit genkit = Genkit.builder()
    .plugin(OpenAIPlugin.create())
    .plugin(EvaluatorsPlugin.create(
        EvaluatorsPluginOptions.builder()
            .judge("openai/gpt-4o-mini")
            .metrics(List.of(
                GenkitMetric.FAITHFULNESS,
                GenkitMetric.ANSWER_RELEVANCY,
                GenkitMetric.ANSWER_ACCURACY,
                GenkitMetric.MALICIOUSNESS,
                GenkitMetric.REGEX,
                GenkitMetric.DEEP_EQUAL,
                GenkitMetric.JSONATA))
            .build()))
    .build();
```

Or enable all metrics at once:

```java
EvaluatorsPluginOptions.builder()
    .useAllMetrics()
    .judge("openai/gpt-4o-mini")
    .embedder("openai/text-embedding-3-small")
    .build()
```

Available metrics
Section titled “Available metrics”| Metric | Type | Description |
|---|---|---|
| `FAITHFULNESS` | LLM-based | Factual accuracy against provided context |
| `ANSWER_RELEVANCY` | LLM-based | Answer pertains to the question |
| `ANSWER_ACCURACY` | LLM-based | Matches reference answer |
| `MALICIOUSNESS` | LLM-based | Detects harmful content |
| `REGEX` | Programmatic | Pattern matching against output |
| `DEEP_EQUAL` | Programmatic | JSON deep equality comparison |
| `JSONATA` | Programmatic | JSONata expression evaluation |
LLM-based metrics require a judge model. Programmatic metrics run locally without any API calls.
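
To show what the programmatic category amounts to, a REGEX-style check reduces to a local pattern match in plain Java (the class below is an illustrative sketch, not the plugin's actual implementation):

```java
import java.util.regex.Pattern;

public class RegexMetric {
    // Returns true (PASS) when the output matches the pattern;
    // no model call or network access is involved.
    public static boolean evaluate(String output, String regex) {
        return Pattern.compile(regex).matcher(output).find();
    }

    public static void main(String[] args) {
        System.out.println(evaluate("Genkit is an AI framework", "AI framework")); // prints true
    }
}
```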
See the evaluators plugin docs for full details.
Samples
- evaluations sample — Custom evaluators, dataset management, and running evaluations
- evaluators-plugin sample — Pre-built RAGAS metrics