Seer

Ship retrieval changes with confidence.

Compare embedding models, rerankers, or prompts on real traffic. See which variant wins, with statistical significance, in 1 hour rather than 1 month.

A/B comparison

How it works

  1. Tag your variants. Add a feature flag to your log calls. No code changes to your retrieval logic.
  2. Seer evaluates both. Every query gets scored on recall, precision, and groundedness, automatically.
  3. See the winner. Statistical comparison shows which variant performs better, with confidence intervals.
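
To give a feel for what a statistical comparison with confidence intervals can look like, here is a minimal sketch using a percentile bootstrap on per-query scores. It is an illustration only, not Seer's actual method: the function name `bootstrap_diff_ci`, its parameters, and the sample scores are all assumptions made up for this example.

```python
import random
from statistics import mean


def bootstrap_diff_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_b) - mean(scores_a).

    Hypothetical helper for illustration; not part of any Seer API.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    diffs = []
    for _ in range(n_resamples):
        # Resample each variant's scores with replacement, same sample size.
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        diffs.append(mean(resample_b) - mean(resample_a))
    diffs.sort()
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled differences.
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    observed = mean(scores_b) - mean(scores_a)
    return observed, lower, upper


# Made-up per-query recall scores, for illustration only.
baseline_recall = [0.60, 0.65, 0.70, 0.55, 0.62, 0.68]
new_embeddings_recall = [0.80, 0.85, 0.90, 0.82, 0.88, 0.81]

observed, lower, upper = bootstrap_diff_ci(baseline_recall, new_embeddings_recall)
print(f"difference in mean recall: {observed:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

If the whole interval sits above zero, the second variant scored higher on that metric at the chosen confidence level; an interval that straddles zero means the data does not yet separate the two.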

What you compare

Metrics comparison
Query-level analysis

Integration

Add a feature flag to your logs

# Variant A (baseline)
client.log(task=query, context=docs, metadata={"feature_flag": "baseline"})

# Variant B (new embeddings)
client.log(task=query, context=docs, metadata={"feature_flag": "new-embeddings"})

That's it. Seer automatically groups and compares variants.
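
Conceptually, the grouping step amounts to bucketing logged queries by their `feature_flag` value and summarizing a metric per bucket. The sketch below shows that idea on plain dictionaries. The record shape (`metadata` and `scores` keys), the `"untagged"` fallback, and the helper name `group_scores_by_flag` are assumptions for illustration and do not describe Seer's internal data model.

```python
from collections import defaultdict
from statistics import mean


def group_scores_by_flag(records, metric):
    """Return the mean of `metric` for each feature_flag value.

    Hypothetical helper; assumes each record looks like
    {"metadata": {"feature_flag": ...}, "scores": {metric: float}}.
    """
    grouped = defaultdict(list)
    for record in records:
        # Records logged without a flag fall into an "untagged" bucket.
        flag = record["metadata"].get("feature_flag", "untagged")
        grouped[flag].append(record["scores"][metric])
    return {flag: mean(values) for flag, values in grouped.items()}


# Made-up records, for illustration only.
records = [
    {"metadata": {"feature_flag": "baseline"}, "scores": {"recall": 0.50}},
    {"metadata": {"feature_flag": "baseline"}, "scores": {"recall": 0.75}},
    {"metadata": {"feature_flag": "new-embeddings"}, "scores": {"recall": 1.00}},
    {"metadata": {"feature_flag": "new-embeddings"}, "scores": {"recall": 0.50}},
]

mean_recall_by_flag = group_scores_by_flag(records, "recall")
print(mean_recall_by_flag)
```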

Test your first change

Add feature flags. Compare variants. Ship with confidence.

Read the docs