# Running Evaluations
The Klira SDK includes a built-in evaluation framework for testing your AI workflows against datasets of expected inputs and outputs. Evaluations run your workflow function against each test case and collect results with full tracing.
## Core Components

### evaluate()

The main entry point for running evaluations:
```python
from klira.sdk.evals import evaluate

results = evaluate(
    target=my_workflow,
    data=test_cases,
    dataset_id="my_dataset_v1",
    experiment_id="experiment_001",
)
```

#### Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `target` | `Callable` | Yes | The workflow or function to evaluate |
| `data` | `list[KliraTestCase]` or `str` | Yes | Test cases (list) or path to a CSV file |
| `dataset_id` | `str` | No | Identifier for the dataset |
| `experiment_id` | `str` | No | Identifier for this evaluation run |
| `api_key` | `str` | No | Override API key for results upload |
| `endpoint` | `str` | No | Override endpoint for results upload |
#### Returns

`list[KliraEvalResult]`: one result per test case.
## Data Classes

### KliraTestCase

Represents a single test case:
```python
from klira.sdk.evals import KliraTestCase

test_case = KliraTestCase(
    input="What are the side effects of ibuprofen?",
    expected_output="Common side effects include stomach pain, nausea...",
    metadata={"category": "medication", "difficulty": "easy"},
)
```

| Field | Type | Required | Description |
|---|---|---|---|
| `input` | `str` | Yes | Input to pass to the target function |
| `expected_output` | `str` | No | Expected output for comparison |
| `metadata` | `dict` | No | Additional metadata for filtering and analysis |
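Because only `input` is required, it can be worth validating raw records (from a JSON export, a spreadsheet dump, etc.) before constructing `KliraTestCase` objects. A minimal stand-alone sketch; the record shape and the `validate_record` helper are illustrative, not part of the SDK:

```python
def validate_record(record: dict) -> list[str]:
    """Return a list of problems with a raw test-case record; empty means valid."""
    problems = []
    if not record.get("input"):
        problems.append("missing required field: input")
    if "metadata" in record and not isinstance(record["metadata"], dict):
        problems.append("metadata must be a dict")
    return problems

records = [
    {"input": "What is hypertension?", "metadata": {"topic": "cardiology"}},
    {"expected_output": "orphaned expectation"},  # no input: invalid
]

# Keep only records with no problems, then build KliraTestCase objects from them
valid = [r for r in records if not validate_record(r)]
print(len(valid))  # 1
```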
### KliraEvalResult

Represents the result of evaluating one test case:
```python
result: KliraEvalResult

print(result.input)            # Original input
print(result.expected_output)  # Expected output
print(result.actual_output)    # What the target function returned
print(result.passed)           # Whether evaluation passed
print(result.score)            # Numeric score (0.0–1.0)
print(result.error)            # Error message if the target raised
print(result.latency_ms)       # Execution time in milliseconds
print(result.trace_id)         # OpenTelemetry trace ID for this run
```

| Field | Type | Description |
|---|---|---|
| `input` | `str` | Input that was evaluated |
| `expected_output` | `str` | Expected output (if provided) |
| `actual_output` | `str` | Actual output from the target |
| `passed` | `bool` | Whether the test case passed |
| `score` | `float` | Numeric score (0.0–1.0) |
| `error` | `str` | Error message if the target raised an exception |
| `latency_ms` | `float` | Execution time in milliseconds |
| `trace_id` | `str` | OpenTelemetry trace ID |
| `metadata` | `dict` | Metadata from the test case |
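Since each result carries the metadata of its test case, a common post-processing step is grouping pass rates by a metadata key. The sketch below uses `SimpleNamespace` objects as stand-ins for real `KliraEvalResult` values; only the `passed` and `metadata` fields are assumed:

```python
from collections import defaultdict
from types import SimpleNamespace

# Stand-ins for KliraEvalResult objects; only .passed and .metadata are used here
results = [
    SimpleNamespace(passed=True, metadata={"category": "cardiology"}),
    SimpleNamespace(passed=False, metadata={"category": "cardiology"}),
    SimpleNamespace(passed=True, metadata={"category": "endocrinology"}),
]

def pass_rate_by(results, key):
    """Group results by a metadata key and compute the pass rate per group."""
    groups = defaultdict(list)
    for r in results:
        groups[r.metadata.get(key, "unknown")].append(r.passed)
    return {k: sum(v) / len(v) for k, v in groups.items()}

rates = pass_rate_by(results, "category")
print(rates)  # {'cardiology': 0.5, 'endocrinology': 1.0}
```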
## Using Local Data

### List of Test Cases
```python
from klira.sdk.evals import evaluate, KliraTestCase

test_cases = [
    KliraTestCase(
        input="What is hypertension?",
        expected_output="Hypertension is high blood pressure...",
        metadata={"topic": "cardiology"},
    ),
    KliraTestCase(
        input="What is diabetes?",
        expected_output="Diabetes is a metabolic disease...",
        metadata={"topic": "endocrinology"},
    ),
]

results = evaluate(
    target=my_clinical_workflow,
    data=test_cases,
    experiment_id="clinical_qa_v1",
)

for r in results:
    status = "PASS" if r.passed else "FAIL"
    print(f"[{status}] {r.input[:50]}... (score: {r.score:.2f}, {r.latency_ms:.0f}ms)")
```

### CSV File
You can also load test cases from a CSV file with `input` and `expected_output` columns:
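If you generate such a file programmatically, the standard library's `csv` module handles the quoting for you (the file name and rows here are just examples):

```python
import csv

rows = [
    ("What is hypertension?", "Hypertension is high blood pressure..."),
    ("What is diabetes?", "Diabetes is a metabolic disease..."),
]

with open("clinical_qa.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["input", "expected_output"])  # header row with the expected column names
    writer.writerows(rows)
```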
```python
results = evaluate(
    target=my_workflow,
    data="test_data/clinical_qa.csv",
    dataset_id="clinical_qa",
    experiment_id="run_003",
)
```

CSV format:

```csv
input,expected_output
"What is hypertension?","Hypertension is high blood pressure..."
"What is diabetes?","Diabetes is a metabolic disease..."
```

## Configuring Evaluations at Init Time
You can set default evaluation parameters via Klira.init():
```python
from klira.sdk import Klira

Klira.init(
    app_name="EvalRunner",
    api_key="klira_live_your_key",
    evals_run="nightly_2026_03_10",
    dataset_id="clinical_qa_v2",
)
```

| Parameter | Description |
|---|---|
| `evals_run` | Default experiment identifier applied to all `evaluate()` calls |
| `dataset_id` | Default dataset identifier applied to all `evaluate()` calls |
These defaults can be overridden per `evaluate()` call.
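The precedence is the usual one: an explicit argument to `evaluate()` wins over the init-time default. A stand-in sketch of that resolution logic (not the SDK's internals; the values mirror the `Klira.init()` example above):

```python
def resolve(explicit, default):
    """Prefer the per-call value; fall back to the init-time default."""
    return explicit if explicit is not None else default

# Hypothetical init-time defaults
defaults = {"experiment_id": "nightly_2026_03_10", "dataset_id": "clinical_qa_v2"}

print(resolve("ad_hoc_run", defaults["experiment_id"]))  # ad_hoc_run
print(resolve(None, defaults["dataset_id"]))             # clinical_qa_v2
```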
## Complete Working Example
```python
import asyncio

from klira.sdk import Klira
from klira.sdk.decorators import workflow, guardrails
from klira.sdk.evals import evaluate, KliraTestCase

Klira.init(
    app_name="ClinicalQAEval",
    api_key="klira_live_your_key",
    evals_run="qa_regression_v3",
)

@workflow(name="clinical_qa")
@guardrails(domain="healthcare", check_output=True)
async def clinical_qa(question: str) -> str:
    # Your clinical QA logic
    return await generate_answer(question)

# Define test cases
test_cases = [
    KliraTestCase(
        input="What are common symptoms of pneumonia?",
        expected_output="Common symptoms include cough, fever, and difficulty breathing.",
        metadata={"category": "pulmonology", "severity": "moderate"},
    ),
    KliraTestCase(
        input="How is type 2 diabetes diagnosed?",
        expected_output="Type 2 diabetes is diagnosed through blood tests such as A1C, fasting glucose, or oral glucose tolerance test.",
        metadata={"category": "endocrinology", "severity": "low"},
    ),
]

# Run evaluation
results = evaluate(
    target=clinical_qa,
    data=test_cases,
    dataset_id="clinical_qa_v2",
    experiment_id="regression_run_042",
)

# Summarize results
passed = sum(1 for r in results if r.passed)
total = len(results)
avg_latency = sum(r.latency_ms for r in results) / total

print(f"\nResults: {passed}/{total} passed")
print(f"Average latency: {avg_latency:.0f}ms")

for r in results:
    if not r.passed:
        print(f"\nFAILED: {r.input[:60]}...")
        print(f"  Expected: {r.expected_output[:80]}...")
        print(f"  Got: {r.actual_output[:80]}...")
        if r.error:
            print(f"  Error: {r.error}")
```

## Viewing Results
Evaluation results are automatically uploaded to the Klira dashboard when an `api_key` is configured. Each evaluation run creates traced spans, so you can:
- View individual test case traces in the Klira dashboard
- Compare results across experiment runs
- Filter by metadata (category, severity, etc.)
- Track score distributions and latency trends over time
## Related Pages
- Tracing — View evaluation traces
- Compliance Analytics — Analyze evaluation results
- @workflow Decorator — Decorate your target functions
- @guardrails Decorator — Add policy enforcement to evaluated functions