Eval Dashboard

Astrocyte runs nightly evaluation suites against the built-in Tier 1 pipeline to track retrieval quality, latency, and LLM token usage over time. Results are published here automatically.

The basic suite runs 20 retain + 20 recall + reflect operations against an in-memory vector store with OpenAI embeddings and completions. See the Evaluation design doc for details on suites, metrics, and regression detection.
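To illustrate the shape of a retain/recall run, here is a minimal sketch. The embedding, store, and method names (`embed`, `InMemoryStore`, `retain`, `recall`) are hypothetical stand-ins: the real suite uses OpenAI embeddings and Astrocyte's Tier 1 pipeline, which are replaced below with a bag-of-words similarity so the example is self-contained.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in embedding: bag-of-words term counts.
    # The real suite uses OpenAI embeddings instead.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class InMemoryStore:
    """Hypothetical in-memory vector store used only for this sketch."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def retain(self, text):
        self.items.append((text, embed(text)))

    def recall(self, query, k=3):
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = InMemoryStore()
store.retain("The eval suite runs nightly at 5am UTC")
store.retain("Results are appended to a history file")
results = store.recall("when does the eval suite run")
```

A real run repeats this loop 20 times for retain and 20 times for recall, scoring each recall against a labeled set of relevant memories.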

Loading eval history…
| Metric | What it measures |
| --- | --- |
| Hit rate | Fraction of queries that return at least one relevant result |
| MRR | Mean reciprocal rank: how high the first relevant result ranks |
| NDCG | Normalized discounted cumulative gain: overall ranking quality |
| Precision | Relevant results / total results returned |
| Reflect accuracy | Topic coverage in synthesized answers (keyword overlap) |
| Tokens used | Total LLM tokens consumed during the eval run (not global spend) |
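The retrieval metrics above follow their standard definitions. As a sketch (not Astrocyte's actual scoring code), here is how they can be computed from per-query lists of binary relevance judgments, where `[0, 1, 0]` means the second returned result was relevant:

```python
import math

def hit_rate(runs):
    # Fraction of queries with at least one relevant result.
    return sum(any(rels) for rels in runs) / len(runs)

def mrr(runs):
    # Mean reciprocal rank of the first relevant result (0 if none).
    total = 0.0
    for rels in runs:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(runs)

def ndcg(rels):
    # Binary-relevance NDCG for one ranked list.
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))
    ideal = sorted(rels, reverse=True)
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

def precision(rels):
    # Relevant results / total results returned, for one query.
    return sum(rels) / len(rels)

runs = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
```

For these three example queries, `hit_rate` is 2/3 and `mrr` is 0.5 (1/2 for the first query, 1 for the second, 0 for the third).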
  1. The eval workflow runs nightly (05:00 UTC) or on demand via `gh workflow run eval.yml`
  2. It executes `scripts/run_eval.py` with an inline OpenAI adapter and an in-memory vector store
  3. Results are appended to `docs/public/eval/history.json` and committed
  4. The docs site rebuilds, and this page renders the latest data
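Step 3 can be sketched as a simple append-to-JSON routine. The record schema shown here (`date`, `hit_rate`) is hypothetical; the actual fields in `docs/public/eval/history.json` are defined by `scripts/run_eval.py`:

```python
import json
import pathlib
import tempfile

def append_run(history_path, run):
    """Append one eval run to a JSON history file, creating it if absent."""
    path = pathlib.Path(history_path)
    history = json.loads(path.read_text()) if path.exists() else []
    history.append(run)
    path.write_text(json.dumps(history, indent=2))
    return len(history)

# Demonstrate against a temporary file rather than the real history.
with tempfile.TemporaryDirectory() as d:
    p = pathlib.Path(d) / "history.json"
    append_run(p, {"date": "2024-01-01", "hit_rate": 0.90})
    n = append_run(p, {"date": "2024-01-02", "hit_rate": 0.95})
```

Keeping the file as a flat JSON array means the docs site can render the full trend with a single fetch, at the cost of the file growing with each nightly run.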