# Running Evaluations
The Klira SDK includes a built-in evaluation framework for testing your AI workflows against datasets of expected inputs and outputs. Evaluations run your workflow function against each test case and collect results with full tracing.
## Core Components

### evaluate()

The main entry point for running evaluations:
```python
from klira.sdk.evals import evaluate

results = evaluate(
    target=my_workflow,
    data=test_cases,
    dataset_id="my_dataset_v1",
    experiment_id="experiment_001",
)
```

#### Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `target` | `Callable` | Yes | The workflow or function to evaluate |
| `data` | `list[KliraTestCase]` or `str` | Yes | Test cases (list) or path to a CSV file |
| `dataset_id` | `str` | No | Identifier for the dataset |
| `experiment_id` | `str` | No | Identifier for this evaluation run |
| `api_key` | `str` | No | Override API key for results upload |
| `endpoint` | `str` | No | Override endpoint for results upload |
#### Returns

`list[KliraEvalResult]`: one result per test case.
## Data Classes

### KliraTestCase

Represents a single test case:
```python
from klira.sdk.evals import KliraTestCase

test_case = KliraTestCase(
    input="What are the side effects of ibuprofen?",
    expected_output="Common side effects include stomach pain, nausea...",
    metadata={"category": "medication", "difficulty": "easy"},
)
```

| Field | Type | Required | Description |
|---|---|---|---|
| `input` | `str` | Yes | Input to pass to the target function |
| `expected_output` | `str` | No | Expected output for comparison |
| `metadata` | `dict` | No | Additional metadata for filtering and analysis |
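Because only `input` is required, it can be worth validating raw records (from a JSON export, a spreadsheet dump, etc.) before constructing `KliraTestCase` objects. A minimal stand-alone sketch; the record shape and the `validate_record` helper are illustrative, not part of the SDK:

```python
def validate_record(record: dict) -> list[str]:
    """Return a list of problems with a raw test-case record; empty means valid."""
    problems = []
    if not record.get("input"):
        problems.append("missing required field: input")
    if "metadata" in record and not isinstance(record["metadata"], dict):
        problems.append("metadata must be a dict")
    return problems

records = [
    {"input": "What is hypertension?", "metadata": {"topic": "cardiology"}},
    {"expected_output": "orphaned expectation"},  # no input: invalid
]

# Keep only records with no problems, then build KliraTestCase objects from them
valid = [r for r in records if not validate_record(r)]
print(len(valid))  # 1
```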
### KliraEvalResult

Represents the result of evaluating one test case:
```python
result: KliraEvalResult

print(result.input)            # Original input
print(result.expected_output)  # Expected output
print(result.actual_output)    # What the target function returned
print(result.passed)           # Whether evaluation passed
print(result.score)            # Numeric score (0.0–1.0)
print(result.error)            # Error message if the target raised
print(result.latency_ms)       # Execution time in milliseconds
print(result.trace_id)         # OpenTelemetry trace ID for this run
```

| Field | Type | Description |
|---|---|---|
| `input` | `str` | Input that was evaluated |
| `expected_output` | `str` | Expected output (if provided) |
| `actual_output` | `str` | Actual output from the target |
| `passed` | `bool` | Whether the test case passed |
| `score` | `float` | Numeric score (0.0–1.0) |
| `error` | `str` | Error message if the target raised an exception |
| `latency_ms` | `float` | Execution time in milliseconds |
| `trace_id` | `str` | OpenTelemetry trace ID |
| `metadata` | `dict` | Metadata from the test case |
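Since each result carries the metadata of its test case, a common post-processing step is grouping pass rates by a metadata key. The sketch below uses `SimpleNamespace` objects as stand-ins for real `KliraEvalResult` values; only the `passed` and `metadata` fields are assumed:

```python
from collections import defaultdict
from types import SimpleNamespace

# Stand-ins for KliraEvalResult objects; only .passed and .metadata are used here
results = [
    SimpleNamespace(passed=True, metadata={"category": "cardiology"}),
    SimpleNamespace(passed=False, metadata={"category": "cardiology"}),
    SimpleNamespace(passed=True, metadata={"category": "endocrinology"}),
]

def pass_rate_by(results, key):
    """Group results by a metadata key and compute the pass rate per group."""
    groups = defaultdict(list)
    for r in results:
        groups[r.metadata.get(key, "unknown")].append(r.passed)
    return {k: sum(v) / len(v) for k, v in groups.items()}

rates = pass_rate_by(results, "category")
print(rates)  # {'cardiology': 0.5, 'endocrinology': 1.0}
```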
## Using Local Data

### List of Test Cases
```python
from klira.sdk.evals import evaluate, KliraTestCase

test_cases = [
    KliraTestCase(
        input="What is hypertension?",
        expected_output="Hypertension is high blood pressure...",
        metadata={"topic": "cardiology"},
    ),
    KliraTestCase(
        input="What is diabetes?",
        expected_output="Diabetes is a metabolic disease...",
        metadata={"topic": "endocrinology"},
    ),
]

results = evaluate(
    target=my_clinical_workflow,
    data=test_cases,
    experiment_id="clinical_qa_v1",
)

for r in results:
    status = "PASS" if r.passed else "FAIL"
    print(f"[{status}] {r.input[:50]}... (score: {r.score:.2f}, {r.latency_ms:.0f}ms)")
```

### CSV File
You can also load test cases from a CSV file with `input` and `expected_output` columns:
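If you generate such a file programmatically, the standard library's `csv` module handles the quoting for you (the file name and rows here are just examples):

```python
import csv

rows = [
    ("What is hypertension?", "Hypertension is high blood pressure..."),
    ("What is diabetes?", "Diabetes is a metabolic disease..."),
]

with open("clinical_qa.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["input", "expected_output"])  # header row with the expected column names
    writer.writerows(rows)
```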
```python
results = evaluate(
    target=my_workflow,
    data="test_data/clinical_qa.csv",
    dataset_id="clinical_qa",
    experiment_id="run_003",
)
```

CSV format:

```csv
input,expected_output
"What is hypertension?","Hypertension is high blood pressure..."
"What is diabetes?","Diabetes is a metabolic disease..."
```

## Configuring Evaluations at Init Time
You can set default evaluation parameters via Klira.init():
```python
from klira.sdk import Klira

Klira.init(
    app_name="EvalRunner",
    api_key="klira_live_your_key",
    evals_run="nightly_2026_03_10",
    dataset_id="clinical_qa_v2",
)
```

| Parameter | Description |
|---|---|
| `evals_run` | Default experiment identifier applied to all `evaluate()` calls |
| `dataset_id` | Default dataset identifier applied to all `evaluate()` calls |
These defaults can be overridden per `evaluate()` call.
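The precedence is the usual one: an explicit argument to `evaluate()` wins over the init-time default. A stand-in sketch of that resolution logic (not the SDK's internals; the values mirror the `Klira.init()` example above):

```python
def resolve(explicit, default):
    """Prefer the per-call value; fall back to the init-time default."""
    return explicit if explicit is not None else default

# Hypothetical init-time defaults
defaults = {"experiment_id": "nightly_2026_03_10", "dataset_id": "clinical_qa_v2"}

print(resolve("ad_hoc_run", defaults["experiment_id"]))  # ad_hoc_run
print(resolve(None, defaults["dataset_id"]))             # clinical_qa_v2
```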
## Complete Working Example
```python
import asyncio

from klira.sdk import Klira
from klira.sdk.decorators import workflow, guardrails
from klira.sdk.evals import evaluate, KliraTestCase

Klira.init(
    app_name="ClinicalQAEval",
    api_key="klira_live_your_key",
    evals_run="qa_regression_v3",
)

@workflow(name="clinical_qa")
@guardrails(domain="healthcare", check_output=True)
async def clinical_qa(question: str) -> str:
    # Your clinical QA logic
    return await generate_answer(question)

# Define test cases
test_cases = [
    KliraTestCase(
        input="What are common symptoms of pneumonia?",
        expected_output="Common symptoms include cough, fever, and difficulty breathing.",
        metadata={"category": "pulmonology", "severity": "moderate"},
    ),
    KliraTestCase(
        input="How is type 2 diabetes diagnosed?",
        expected_output="Type 2 diabetes is diagnosed through blood tests such as A1C, fasting glucose, or oral glucose tolerance test.",
        metadata={"category": "endocrinology", "severity": "low"},
    ),
]

# Run evaluation
results = evaluate(
    target=clinical_qa,
    data=test_cases,
    dataset_id="clinical_qa_v2",
    experiment_id="regression_run_042",
)

# Summarize results
passed = sum(1 for r in results if r.passed)
total = len(results)
avg_latency = sum(r.latency_ms for r in results) / total

print(f"\nResults: {passed}/{total} passed")
print(f"Average latency: {avg_latency:.0f}ms")

for r in results:
    if not r.passed:
        print(f"\nFAILED: {r.input[:60]}...")
        print(f"  Expected: {r.expected_output[:80]}...")
        print(f"  Got: {r.actual_output[:80]}...")
        if r.error:
            print(f"  Error: {r.error}")
```

## Viewing Results
Evaluation results are automatically uploaded to the Klira dashboard when an `api_key` is configured. Each evaluation run creates traced spans, so you can:
- View individual test case traces in the Klira dashboard
- Compare results across experiment runs
- Filter by metadata (category, severity, etc.)
- Track score distributions and latency trends over time
## Related Pages
- Tracing — View evaluation traces
- Compliance Analytics — Analyze evaluation results
- @workflow Decorator — Decorate your target functions
- @guardrails Decorator — Add policy enforcement to evaluated functions