Running Evaluations

The Klira SDK includes a built-in evaluation framework for testing your AI workflows against datasets of expected inputs and outputs. Evaluations run your workflow function against each test case and collect results with full tracing.

Core Components

evaluate()

The main entry point for running evaluations:

from klira.sdk.evals import evaluate

results = evaluate(
    target=my_workflow,
    data=test_cases,
    dataset_id="my_dataset_v1",
    experiment_id="experiment_001",
)

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| target | Callable | Yes | The workflow or function to evaluate |
| data | list[KliraTestCase] or str | Yes | Test cases (list) or path to a CSV file |
| dataset_id | str | No | Identifier for the dataset |
| experiment_id | str | No | Identifier for this evaluation run |
| api_key | str | No | Override API key for results upload |
| endpoint | str | No | Override endpoint for results upload |

Returns

  • list[KliraEvalResult] — One result per test case

Data Classes

KliraTestCase

Represents a single test case:

from klira.sdk.evals import KliraTestCase

test_case = KliraTestCase(
    input="What are the side effects of ibuprofen?",
    expected_output="Common side effects include stomach pain, nausea...",
    metadata={"category": "medication", "difficulty": "easy"},
)

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| input | str | Yes | Input to pass to the target function |
| expected_output | str | No | Expected output for comparison |
| metadata | dict | No | Additional metadata for filtering and analysis |

KliraEvalResult

Represents the result of evaluating one test case:

result: KliraEvalResult

print(result.input)            # Original input
print(result.expected_output)  # Expected output
print(result.actual_output)    # What the target function returned
print(result.passed)           # Whether evaluation passed
print(result.score)            # Numeric score (0.0–1.0)
print(result.error)            # Error message if the target raised
print(result.latency_ms)       # Execution time in milliseconds
print(result.trace_id)         # OpenTelemetry trace ID for this run

| Field | Type | Description |
| --- | --- | --- |
| input | str | Input that was evaluated |
| expected_output | str | Expected output (if provided) |
| actual_output | str | Actual output from the target |
| passed | bool | Whether the test case passed |
| score | float | Numeric score (0.0–1.0) |
| error | str | Error message if the target raised an exception |
| latency_ms | float | Execution time in milliseconds |
| trace_id | str | OpenTelemetry trace ID |
| metadata | dict | Metadata from the test case |
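
These fields are all you need for local analysis of a run. A minimal sketch of an aggregate summary, using SimpleNamespace stand-ins in place of real KliraEvalResult objects so it stays self-contained (the field names match the table above; the values are illustrative):

```python
from statistics import mean
from types import SimpleNamespace

# Stand-ins for KliraEvalResult objects (illustration only).
results = [
    SimpleNamespace(passed=True, score=0.9, latency_ms=120.0),
    SimpleNamespace(passed=False, score=0.4, latency_ms=310.0),
    SimpleNamespace(passed=True, score=0.8, latency_ms=95.0),
]

pass_rate = sum(r.passed for r in results) / len(results)
avg_score = mean(r.score for r in results)
max_latency = max(r.latency_ms for r in results)

print(f"pass rate: {pass_rate:.0%}, mean score: {avg_score:.2f}, max latency: {max_latency:.0f}ms")
```

With real results from evaluate(), the same three lines work unchanged.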

Using Local Data

List of Test Cases

from klira.sdk.evals import evaluate, KliraTestCase

test_cases = [
    KliraTestCase(
        input="What is hypertension?",
        expected_output="Hypertension is high blood pressure...",
        metadata={"topic": "cardiology"},
    ),
    KliraTestCase(
        input="What is diabetes?",
        expected_output="Diabetes is a metabolic disease...",
        metadata={"topic": "endocrinology"},
    ),
]

results = evaluate(
    target=my_clinical_workflow,
    data=test_cases,
    experiment_id="clinical_qa_v1",
)

for r in results:
    status = "PASS" if r.passed else "FAIL"
    print(f"[{status}] {r.input[:50]}... (score: {r.score:.2f}, {r.latency_ms:.0f}ms)")

CSV File

You can also load test cases from a CSV file with input and expected_output columns:

results = evaluate(
    target=my_workflow,
    data="test_data/clinical_qa.csv",
    dataset_id="clinical_qa",
    experiment_id="run_003",
)

CSV format:

input,expected_output
"What is hypertension?","Hypertension is high blood pressure..."
"What is diabetes?","Diabetes is a metabolic disease..."
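
If you generate datasets programmatically, Python's standard csv module handles the quoting for you. A minimal sketch (the file name is illustrative) that writes the two-column format above and reads it back to confirm the columns:

```python
import csv

# QA pairs to export; in practice these might come from a database or JSON dump.
rows = [
    ("What is hypertension?", "Hypertension is high blood pressure..."),
    ("What is diabetes?", "Diabetes is a metabolic disease..."),
]

with open("clinical_qa.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(["input", "expected_output"])  # header row evaluate() expects
    writer.writerows(rows)

# Read it back to verify the column names survived round-tripping.
with open("clinical_qa.csv", newline="") as f:
    records = list(csv.DictReader(f))

print(records[0]["input"])  # What is hypertension?
```

Using csv.QUOTE_ALL matches the fully quoted style shown above and keeps commas inside questions or answers from breaking the file.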

Configuring Evaluations at Init Time

You can set default evaluation parameters via Klira.init():

from klira.sdk import Klira

Klira.init(
    app_name="EvalRunner",
    api_key="klira_live_your_key",
    evals_run="nightly_2026_03_10",
    dataset_id="clinical_qa_v2",
)

| Parameter | Description |
| --- | --- |
| evals_run | Default experiment identifier applied to all evaluate() calls |
| dataset_id | Default dataset identifier applied to all evaluate() calls |

These defaults can be overridden per evaluate() call.

Complete Working Example

from klira.sdk import Klira
from klira.sdk.decorators import workflow, guardrails
from klira.sdk.evals import evaluate, KliraTestCase

Klira.init(
    app_name="ClinicalQAEval",
    api_key="klira_live_your_key",
    evals_run="qa_regression_v3",
)

@workflow(name="clinical_qa")
@guardrails(domain="healthcare", check_output=True)
async def clinical_qa(question: str) -> str:
    # Your clinical QA logic
    return await generate_answer(question)

# Define test cases
test_cases = [
    KliraTestCase(
        input="What are common symptoms of pneumonia?",
        expected_output="Common symptoms include cough, fever, and difficulty breathing.",
        metadata={"category": "pulmonology", "severity": "moderate"},
    ),
    KliraTestCase(
        input="How is type 2 diabetes diagnosed?",
        expected_output="Type 2 diabetes is diagnosed through blood tests such as A1C, fasting glucose, or oral glucose tolerance test.",
        metadata={"category": "endocrinology", "severity": "low"},
    ),
]

# Run evaluation
results = evaluate(
    target=clinical_qa,
    data=test_cases,
    dataset_id="clinical_qa_v2",
    experiment_id="regression_run_042",
)

# Summarize results
passed = sum(1 for r in results if r.passed)
total = len(results)
avg_latency = sum(r.latency_ms for r in results) / total

print(f"\nResults: {passed}/{total} passed")
print(f"Average latency: {avg_latency:.0f}ms")

for r in results:
    if not r.passed:
        print(f"\nFAILED: {r.input[:60]}...")
        print(f"  Expected: {r.expected_output[:80]}...")
        print(f"  Got: {r.actual_output[:80]}...")
        if r.error:
            print(f"  Error: {r.error}")

Viewing Results

Evaluation results are automatically uploaded to the Klira dashboard when an api_key is configured. Each evaluation run creates traced spans, so you can:

  • View individual test case traces in the Klira dashboard
  • Compare results across experiment runs
  • Filter by metadata (category, severity, etc.)
  • Track score distributions and latency trends over time
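
The same metadata-based slicing works locally on the returned results. A sketch that groups scores by a category key, again using SimpleNamespace stand-ins for KliraEvalResult to stay self-contained (the category values echo the examples above):

```python
from collections import defaultdict
from statistics import mean
from types import SimpleNamespace

# Stand-ins for KliraEvalResult objects (illustration only).
results = [
    SimpleNamespace(score=0.9, metadata={"category": "pulmonology"}),
    SimpleNamespace(score=0.6, metadata={"category": "endocrinology"}),
    SimpleNamespace(score=0.8, metadata={"category": "pulmonology"}),
]

# Bucket scores by the metadata key attached to each test case.
by_category = defaultdict(list)
for r in results:
    by_category[r.metadata.get("category", "unknown")].append(r.score)

for category, scores in sorted(by_category.items()):
    print(f"{category}: mean score {mean(scores):.2f} over {len(scores)} case(s)")
```

Because metadata is an arbitrary dict, the same pattern works for severity, difficulty, or any other key you attach when building test cases.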