AgentReplay Blog

A Guide to AgentReplay's 20+ Evaluators

2026-02-08T00:00:00.000Z

Evaluating AI agents is hard. AgentReplay ships with 20+ built-in evaluators that cover everything from hallucination detection to adversarial testing. Here's how to use them.

The Evaluation Pyramid

Not all evaluations are equal. We organize them in a pyramid from cheap/fast to expensive/thorough:

          ┌──────────────┐
          │    CIP       │  ← Adversarial
          │ (Saboteur)   │     (Most expensive)
          ├──────────────┤
          │  G-Eval      │  ← LLM-as-Judge
          │  RAGAS       │
          ├──────────────┤
          │ Hallucination│  ← Quality Checks
          │ Relevance    │
          │ Toxicity     │
          ├──────────────┤
          │ Latency      │  ← Metrics
          │ Cost         │     (Cheapest)
          │ Perplexity   │
          └──────────────┘

Progressive Evaluation

AgentReplay's ProgressiveEvaluator automatically manages this pyramid:

Phase 1 — Run cheap heuristic checks (latency, cost, perplexity)
Phase 2 — If heuristics pass, run quality checks (hallucination, relevance)
Phase 3 — If quality is uncertain, escalate to LLM-as-judge (G-Eval)

This saves LLM API costs by only running expensive evaluations when needed.
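
To make the escalation logic concrete, here is a minimal sketch of the three phases in plain Python. It is not the ProgressiveEvaluator API: the Trace fields, the thresholds, the "uncertain band", and the llm_judge callback are all assumptions chosen for illustration, and perplexity is left out to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Trace:
    """Hypothetical trace summary; field names are illustrative only."""
    latency_ms: float
    cost_usd: float
    quality_score: float  # stand-in for a hallucination/relevance score in [0, 1]


def progressive_evaluate(
    trace: Trace,
    llm_judge: Callable[[Trace], float],
    max_latency_ms: float = 2000.0,
    max_cost_usd: float = 0.05,
    uncertain_band: Tuple[float, float] = (0.4, 0.7),
) -> Dict[str, object]:
    # Phase 1: cheap heuristics. A failure here stops the evaluation.
    if trace.latency_ms > max_latency_ms or trace.cost_usd > max_cost_usd:
        return {"phase": 1, "passed": False, "judge_calls": 0}

    # Phase 2: quality checks. Clear passes and clear failures stop here.
    low, high = uncertain_band
    if trace.quality_score < low:
        return {"phase": 2, "passed": False, "judge_calls": 0}
    if trace.quality_score >= high:
        return {"phase": 2, "passed": True, "judge_calls": 0}

    # Phase 3: the score is in the uncertain band, so pay for the LLM judge.
    judged = llm_judge(trace)
    return {"phase": 3, "passed": judged >= 0.5, "judge_calls": 1}


# Toy judge that always approves; a real one would call an LLM.
always_approve = lambda trace: 0.9

slow = progressive_evaluate(Trace(5000, 0.01, 0.9), always_approve)
clear = progressive_evaluate(Trace(300, 0.01, 0.9), always_approve)
unsure = progressive_evaluate(Trace(300, 0.01, 0.55), always_approve)
```

Only the third trace reaches the judge, which is where the cost saving comes from.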

Our Favorite Evaluators

G-Eval

G-Eval is our go-to LLM-as-judge evaluator. Given your criteria and rubric, it generates chain-of-thought evaluation steps and scores the trace against them:

curl -X POST http://127.0.0.1:47100/api/v1/evals/geval \
  -H "Content-Type: application/json" \
  -d '{
    "trace_id": "abc-123",
    "criteria": ["relevance", "coherence", "fluency"],
    "rubric": "Score 1-5 based on..."
  }'

CIP (Causal Integrity Protocol)

Our most novel evaluator. Creates "saboteur agents" that:

Perturb the input in subtle ways
Run the agent on perturbed inputs
Check if the output changes appropriately

This tests causal reasoning — does the agent understand why it produces its output, or is it just pattern matching?
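
The perturb/run/check loop above can be sketched in a few lines. This is an illustration of the idea, not CIP's actual implementation: the Perturbation type, the cip_score function, and the toy agents are hypothetical, and perturbations are hand-written here rather than generated by a saboteur agent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Perturbation:
    text: str
    should_change: bool  # True if the edit alters a fact the answer depends on


def cip_score(
    agent: Callable[[str], str],
    original: str,
    perturbations: List[Perturbation],
) -> float:
    """Fraction of perturbations where the output changed exactly when it should."""
    baseline = agent(original)
    correct = 0
    for p in perturbations:
        changed = agent(p.text) != baseline
        if changed == p.should_change:
            correct += 1
    return correct / len(perturbations)


def summing_agent(prompt: str) -> str:
    # Toy agent that actually uses the numbers in the prompt.
    return str(sum(int(tok) for tok in prompt.split() if tok.isdigit()))


def memorizing_agent(prompt: str) -> str:
    # Toy agent that ignores its input and repeats a memorized answer.
    return "5"


original = "Add 2 and 3"
perturbations = [
    Perturbation("Add 2 and 5", should_change=True),          # relevant edit
    Perturbation("Please add 2 and 3", should_change=False),  # cosmetic edit
]

good = cip_score(summing_agent, original, perturbations)
bad = cip_score(memorizing_agent, original, perturbations)
```

The summing agent reacts to the relevant edit and ignores the cosmetic one, so it scores 1.0. The memorizing agent gets the right baseline answer but never reacts to the changed number, so it scores 0.5.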

RAGAS

Comprehensive RAG evaluation:

QAG Faithfulness — Is the answer faithful to the context?
Embedding Answer Relevance — Is the answer relevant to the question?
Claim Verification — Are specific claims supported?
NLI Verdict — Natural language inference check
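
As a rough illustration of how a faithfulness score is computed, the sketch below takes a list of claims extracted from an answer and returns the fraction supported by the context. The function names are hypothetical, and the substring check is a deliberately naive stand-in for the NLI or LLM verdict a real evaluator would use.

```python
from typing import Callable, List


def faithfulness(
    claims: List[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Share of claims that the context supports (0.0 when there are no claims)."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_support(claim: str, context: str) -> bool:
    # Placeholder verifier: case-insensitive substring match.
    # A real pipeline would ask an NLI model or an LLM for a verdict.
    return claim.lower() in context.lower()


context = (
    "Reset your password under Settings > Security. "
    "Reset links expire after 24 hours."
)
claims = [
    "Reset your password under Settings > Security",
    "Reset links expire after 48 hours",  # contradicts the context
]

score = faithfulness(claims, context, naive_support)
```

One of the two claims is supported, so the score is 0.5.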

Running a Full Eval Suite

import requests

BASE_URL = "http://127.0.0.1:47100/api/v1/evals"

# Create a dataset
resp = requests.post(f"{BASE_URL}/datasets", json={
    "name": "support-qa-v1",
    "description": "Customer support Q&A pairs"
})
resp.raise_for_status()  # fail early instead of hitting a KeyError below
dataset = resp.json()

# Add a test case
requests.post(f"{BASE_URL}/datasets/{dataset['id']}/examples", json={
    "input": "How do I reset my password?",
    "expected": "Navigate to Settings > Security > Reset Password"
}).raise_for_status()

# Create and run an evaluation run
resp = requests.post(f"{BASE_URL}/runs", json={
    "dataset_id": dataset["id"],
    "evaluators": ["hallucination", "relevance", "geval", "ragas"]
})
resp.raise_for_status()
run = resp.json()

# Check results
resp = requests.get(f"{BASE_URL}/runs/{run['id']}")
resp.raise_for_status()
results = resp.json()
print(f"Overall score: {results['summary']['mean_score']}")

Explore all evaluators →

Architecture Deep Dive: 16 Rust Crates

2026-02-07T00:00:00.000Z

AgentReplay is built as a modular Rust workspace with 16 focused crates. Here's why we chose this architecture and how each piece fits together.

Why Rust?

We chose Rust for three reasons:

Performance — Sub-millisecond vector search, high-throughput WAL writes, SIMD-optimized embeddings
Memory safety — No garbage collection pauses, no null pointer exceptions
Single binary — Ship a self-contained executable with zero runtime dependencies

The Crate Map

┌─ Transport ──────────────────────────┐
│  agentreplay-server (Axum + gRPC)    │
│  agentreplay-tauri (Desktop)         │
│  agentreplay-cli (CLI)               │
├─ Intelligence ───────────────────────┤
│  agentreplay-evals (20+ evaluators)  │
│  agentreplay-prompts (Versioning)    │
│  agentreplay-query (Search engine)   │
│  agentreplay-memory (Persistence)    │
├─ Storage ────────────────────────────┤
│  agentreplay-storage (SochDB)        │
│  agentreplay-index (HNSW + PQ)       │
├─ Core ───────────────────────────────┤
│  agentreplay-core (Data types)       │
│  agentreplay-observability (OTEL)    │
├─ Extensibility ──────────────────────┤
│  agentreplay-plugins (WASM)          │
│  agentreplay-experiments (A/B)       │
└──────────────────────────────────────┘

Each crate has a single responsibility and clear API boundaries. This lets us:

Test in isolation — Each crate has its own test suite
Compile in parallel — Cargo builds independent crates concurrently
Replace components — Swap the storage engine without touching evaluators

Key Design Decisions

SochDB for Storage

We built on SochDB rather than SQLite or RocksDB because:

ACID transactions with MVCC — concurrent reads during writes
Group Commit WAL — ~10× throughput vs standard WAL
Adaptive sketches — HyperLogLog, CountMinSketch, DDSketch built-in
No external process — runs in-process with zero setup
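
For readers unfamiliar with group commit, the sketch below shows the general technique: buffer several records and pay for one fsync per batch instead of one per record. It is a toy illustration, not SochDB's implementation; the GroupCommitWAL class, the length-prefixed record format, and the batch size are all assumptions.

```python
import os
import tempfile


class GroupCommitWAL:
    """Toy write-ahead log that issues one fsync per batch of records."""

    def __init__(self, path: str, batch_size: int = 64) -> None:
        self._file = open(path, "ab")
        self._pending = []
        self._batch_size = batch_size
        self.fsync_count = 0

    def append(self, record: bytes) -> None:
        # Records are only durable after the next flush(); a real WAL would
        # make callers wait for that flush before acknowledging the write.
        self._pending.append(record)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if not self._pending:
            return
        for record in self._pending:
            # 4-byte little-endian length prefix, then the payload.
            self._file.write(len(record).to_bytes(4, "little") + record)
        self._file.flush()
        os.fsync(self._file.fileno())  # one disk sync for the whole batch
        self.fsync_count += 1
        self._pending.clear()

    def close(self) -> None:
        self.flush()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wal.log")
    wal = GroupCommitWAL(path, batch_size=100)
    for i in range(1000):
        wal.append(b"trace-%04d" % i)
    wal.close()
    size = os.path.getsize(path)
```

Here 1,000 records cost 10 fsync calls rather than 1,000. How much throughput that buys in practice depends on the disk and the batch size.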

HNSW for Vector Search

Our HNSW implementation uses:

Lock-free entry point with packed atomic CAS
CSR graph for cache-efficient traversal
Hot buffer for inserts without graph rebuilds
Product Quantization — 32× memory reduction (15 GB → 480 MB for 10M vectors)
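
The 32× figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes 384-dimensional float32 embeddings and a PQ code of 48 one-byte sub-quantizer indices per vector; those parameters are assumptions picked because they match the quoted sizes, not confirmed settings, and the small shared codebook is ignored.

```python
num_vectors = 10_000_000

# Assumption: 384-dimensional float32 embeddings (4 bytes per dimension).
dims = 384
bytes_per_float = 4
raw_bytes = num_vectors * dims * bytes_per_float

# Assumption: 48 sub-quantizers (8 dimensions each), one 8-bit code per
# sub-quantizer, so each vector compresses to 48 bytes.
subquantizers = 48
bytes_per_code = 1
pq_bytes = num_vectors * subquantizers * bytes_per_code

ratio = raw_bytes / pq_bytes

print(f"raw: {raw_bytes / 1e9:.2f} GB")
print(f"pq:  {pq_bytes / 1e6:.0f} MB")
print(f"compression: {ratio:.0f}x")
```

Under these assumptions each vector shrinks from 1,536 bytes to 48 bytes, which gives roughly 15 GB down to 480 MB.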

Hybrid Logical Clock

We use HLC instead of wall clocks for causal ordering:

Physical component for human-readable timestamps
Logical component for causal ordering when clocks collide
Guaranteed monotonic within a process
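
A hybrid logical clock is small enough to sketch in full. The version below follows the standard HLC update rules with (physical, logical) tuples as timestamps. It is an illustration rather than AgentReplay's Rust implementation: the class and method names are hypothetical, and it is not thread-safe.

```python
import time
from typing import Callable, Tuple

Timestamp = Tuple[int, int]  # (physical milliseconds, logical counter)


class HybridLogicalClock:
    def __init__(
        self, now_ms: Callable[[], int] = lambda: int(time.time() * 1000)
    ) -> None:
        self._now_ms = now_ms
        self._physical = 0
        self._logical = 0

    def now(self) -> Timestamp:
        """Timestamp for a local event or an outgoing message."""
        wall = self._now_ms()
        if wall > self._physical:
            # The wall clock moved forward: adopt it and reset the counter.
            self._physical, self._logical = wall, 0
        else:
            # Same millisecond, or the wall clock went backwards: bump the counter.
            self._logical += 1
        return (self._physical, self._logical)

    def update(self, remote: Timestamp) -> Timestamp:
        """Merge a timestamp received from another process."""
        wall = self._now_ms()
        r_phys, r_log = remote
        new_phys = max(wall, self._physical, r_phys)
        if new_phys == self._physical and new_phys == r_phys:
            new_log = max(self._logical, r_log) + 1
        elif new_phys == self._physical:
            new_log = self._logical + 1
        elif new_phys == r_phys:
            new_log = r_log + 1
        else:
            new_log = 0
        self._physical, self._logical = new_phys, new_log
        return (new_phys, new_log)


# A frozen wall clock makes the logical component visible.
clock = HybridLogicalClock(now_ms=lambda: 1000)
a = clock.now()                 # first event at t=1000
b = clock.now()                 # same millisecond, counter increments
c = clock.update((2000, 5))     # remote timestamp is ahead of us
d = clock.now()                 # wall clock is behind, still monotonic
```

Because timestamps are tuples, ordinary comparison gives the causal order: a < b < c < d holds even though the wall clock never moved.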

Performance Numbers

Metric                        Value
Trace ingestion               50K+ traces/sec
Vector search (1M vectors)    < 1 ms p99
WAL write (group commit)      200K+ writes/sec
Embedding (local ONNX)        ~5 ms per text
PQ compression ratio          32×

What's Next

We're working on:

Distributed mode with Raft consensus
GPU-accelerated embedding pipeline
More evaluator plugins
React Native mobile app

Explore the architecture →

Welcome to AgentReplay

2026-02-01T00:00:00.000Z

We're excited to introduce AgentReplay — the open-source AI observability platform that runs 100% locally.

Why We Built This

As AI agents become more complex, understanding what they do and how well they do it is critical. Existing observability tools typically do at least one of the following:

Send your data to the cloud — raising privacy and compliance concerns
Cost $50–500+/month — making them inaccessible to individual developers and small teams
Provide limited evaluators — typically 3–5 basic metrics

AgentReplay solves all three problems. It runs entirely on your machine, it's free and open source, and it ships with 20+ built-in evaluators.

What You Get

Tracing

OpenTelemetry-native tracing that captures every LLM call, tool invocation, and agent step. Auto-instrument OpenAI, Anthropic, LangChain, and LlamaIndex with zero code changes.

20+ Evaluators

From hallucination detection to RAGAS and G-Eval. Run evaluations locally without sending your data anywhere.

Prompt Management

Semantic versioning, A/B traffic splitting, canary rollouts, and deployment environments. Treat your prompts like production code.

MCP Server

Built-in Model Context Protocol server that lets Claude, Cursor, and other AI tools search and explore your traces.

Desktop App

Native macOS, Windows, and Linux app with an embedded server, OTLP receiver, and full React UI — all in one download.

Get Started

pip install agentreplay

import agentreplay
from openai import OpenAI

agentreplay.init()
client = agentreplay.wrap_openai(OpenAI())
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

That's it. Three added lines (the import, agentreplay.init(), and agentreplay.wrap_openai()) and your OpenAI calls are traced.

Read the docs →
