Benchmark trajectory
Live time-series of Astrocyte’s LoCoMo and LongMemEval scores. Data is read from the public R2 bucket at runtime — every archived run shows up within seconds of upload, with no docs rebuild needed.
The chart is rendered client-side from trajectory/<bench>.json in the public R2 bucket. See bench-archive for the bucket layout, the archive / fetch tooling, and how the trajectory artifact is regenerated on each run.
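As a rough illustration of the read path described above, the sketch below builds the artifact URL and fetches it. It assumes `R2_PUBLIC_URL` holds the bucket's public base URL and that the bench slugs are `locomo` and `longmemeval`. The helper names are hypothetical, and the shape of the returned JSON is deliberately left opaque because the schema is documented in bench-archive, not here.

```python
import json
import os
import urllib.request


def trajectory_url(base_url: str, bench: str) -> str:
    """Build the public URL of trajectory/<bench>.json (hypothetical helper)."""
    return f"{base_url.rstrip('/')}/trajectory/{bench}.json"


def fetch_trajectory(bench: str) -> object:
    """Download and parse the trajectory artifact for one bench."""
    # Assumption: R2_PUBLIC_URL is the public base URL of the bucket.
    base_url = os.environ["R2_PUBLIC_URL"]
    with urllib.request.urlopen(trajectory_url(base_url, bench), timeout=10) as resp:
        return json.load(resp)
```

Calling `fetch_trajectory("locomo")` would return whatever the archive tooling last wrote. The page itself does the equivalent fetch in the browser.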
README badges link here
The two shields.io endpoint badges at the top of the main README link to this page.
The badge JSON lives at badges/<bench>.json on the same public R2 bucket — regenerated whenever make bench-refresh-labels runs. Its content depends on what BENCH_PARITY.yaml (at the repo root) records:
- If a published release exists for the astrocyte package: badge shows that release’s frozen scores (e.g. LoCoMo (astrocyte v0.14.0) 83.8%). This is what pip install astrocyte actually produces.
- If no release exists yet but a cycle has been marked shipped: badge shows the mean of that cycle’s shipped run pair (e.g. LoCoMo (n=200, M19 × 2 runs) 84.1%).
- Otherwise: falls back to the most recent non-smoke run.
The full cycle-close + release ritual that drives badge changes lives in RELEASING.md at the repo root.
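The three-way fallback above can be sketched as a small function that emits a shields.io endpoint payload (`schemaVersion`, `label`, `message`, `color`). This is a minimal sketch, not the actual badge writer: the function name, the input dictionaries, the label/message split, and the color are all assumptions, and the shipped-run scores in the usage note are illustrative values rather than real results.

```python
from statistics import mean
from typing import Optional


def badge_payload(
    bench: str,
    release: Optional[dict] = None,
    shipped: Optional[dict] = None,
    latest_score: Optional[float] = None,
) -> dict:
    """Pick the badge source in priority order: release, shipped cycle, latest run."""
    if release is not None:
        # Frozen scores of a published release (hypothetical input shape).
        label = f"{bench} (astrocyte v{release['version']})"
        score = release["score"]
    elif shipped is not None:
        # Mean of the cycle's shipped replicate runs (hypothetical input shape).
        runs = shipped["scores"]
        label = f"{bench} (n={shipped['n']}, {shipped['cycle']} × {len(runs)} runs)"
        score = mean(runs)
    elif latest_score is not None:
        # Fallback: most recent non-smoke run.
        label = bench
        score = latest_score
    else:
        raise ValueError("no release, shipped cycle, or archived run to report")
    # shields.io endpoint schema; the color choice is an assumption.
    return {"schemaVersion": 1, "label": label, "message": f"{score:.1f}%", "color": "blue"}
```

For example, `badge_payload("LoCoMo", shipped={"cycle": "M19", "n": 200, "scores": [80.0, 82.0]})` yields the label `LoCoMo (n=200, M19 × 2 runs)` with the message `81.0%`.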
Current operating point (post-M17 close, 2026-05-17)
Conversation Engine + Memory Engine, gpt-4o-mini answerer + judge, --user-profile enabled. Fair subsets: LME-30 via --per-type 5; LoCoMo-200 via MEM0_HARNESS_LOCOMO_MAX_Q=20 (matches baseline subset). 3-run replication.
| Bench | Cutoff | M14 baseline | M17 3-run mean (Conversation Engine) | Δ |
|---|---|---|---|---|
| LME | top_20 | 65.0% ±3.3pp | 75.56% ±5.09pp | +10.56pp (3.2σ) |
| LoCoMo | top_20 | 78.25% ±1.5pp | 80.50% ±1.50pp | +2.25pp (1.5σ) |
M17 routes LME/LoCoMo through the new Conversation Engine (astrocyte/conversations/ + session-aware Hindsight-parity chunking) → ConversationIngestor → retain SPI → existing extraction. The Document Engine (astrocyte/documents/) ships in parallel but is not benched on conversation workloads (its bench is FinanceBench / DoubleBench, future cycle). See m17-pageindex-ingestion.md §8.
LME run-to-run std (±5.09pp) is wider than baseline (±3.3pp), concentrated in the N=5 single-session-preference and single-session-assistant categories. Conservative “central tendency” LME lift, discounting one SSP outlier, is closer to +8pp than the headline +10.56pp. Both LME figures clear the locked 2σ ship gate.
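The Δ and σ columns can be reproduced with a few lines of arithmetic. The sketch below assumes σ is the lift divided by the baseline's run-to-run standard deviation. The page does not state this, but it matches the table values. The function name is hypothetical.

```python
def lift(baseline_mean: float, baseline_std: float, new_mean: float) -> tuple:
    """Return (delta in pp, delta in units of baseline std)."""
    delta = new_mean - baseline_mean
    # Assumption: sigma is measured against the baseline's run-to-run std.
    return delta, delta / baseline_std


rows = [
    ("LME", 65.0, 3.3, 75.56),
    ("LoCoMo", 78.25, 1.5, 80.50),
]
for bench, base_mean, base_std, new_mean in rows:
    delta, sigma = lift(base_mean, base_std, new_mean)
    print(f"{bench}: {delta:+.2f}pp ({sigma:.1f}σ)")
# LME: +10.56pp (3.2σ)
# LoCoMo: +2.25pp (1.5σ)

# The conservative +8pp LME estimate, under the same assumption:
print(f"LME conservative: {8.0 / 3.3:.1f}σ")
# LME conservative: 2.4σ
```

Under this reading, both the headline and the conservative LME lifts sit above 2σ.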
Gap to Hindsight’s published numbers (94.6% LME, 92.0% LoCoMo): −19pp LME, −12pp LoCoMo remaining. M18 quick-wins target the cheapest engine-side levers.
Previous operating point (post-M14 close, 2026-05-15)
Multi-run aggregates on the Mem0 harness, gpt-4o-mini judge + answerer, --user-profile enabled. HEAD was at commit 6ec61ea (revert(m14.2)).
| Bench | Cutoff | 4-run baseline | M14 experiment WIP (3-run mean) |
|---|---|---|---|
| LME | top_20 | 65.0% ±3.3pp | 68.9% ±5.0pp |
| LoCoMo | top_20 | 78.25% ±1.5pp | 75.83% ±1.5pp |
The “M14 experiment WIP” column reflects bench runs of a working-tree implementation that was never committed — the experiment was torn down after replication invalidated the single-run +8.3pp LME lift. Single-run highs from the M14 cycle (LME 73.3%, LoCoMo 81.0%) did NOT replicate; treat them as variance. See m13-m14-roadmap.md §§8.7-8.10 for the full retrospective and null-verdict explanation.
LoCoMo
Set R2_PUBLIC_URL at docs build time (doppler run --config bench -- pnpm build) to enable the locomo chart. See benchmarks-doppler-setup.
LongMemEval
Set R2_PUBLIC_URL at docs build time (doppler run --config bench -- pnpm build) to enable the longmemeval chart. See benchmarks-doppler-setup.
What you’re looking at
Each point on the overall accuracy chart is one archived bench run. Hovering reveals the run’s stage (e.g. pr2-d55-gate, weekly-ci), the git commit it was launched from, and the question count.
The per-category accuracy chart breaks the same runs down by category. For LoCoMo: single-hop, multi-hop, temporal, open-domain, adversarial. For LongMemEval: single-session-user, single-session-assistant, single-session-preference, multi-session, temporal-reasoning, knowledge-update.
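To make the per-category breakdown concrete, here is a minimal sketch of how accuracy per category could be computed from per-question outcomes. The record fields (`category`, `correct`) and the function name are hypothetical; the real result files may be structured differently.

```python
from collections import defaultdict


def accuracy_by_category(records: list) -> dict:
    """Return percent-correct per category from per-question records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        # Hypothetical record shape: {"category": str, "correct": bool}
        totals[rec["category"]] += 1
        hits[rec["category"]] += int(bool(rec["correct"]))
    return {cat: 100.0 * hits[cat] / totals[cat] for cat in totals}
```

With only five questions in a category, a single flipped answer moves that category by 20pp, so small categories swing sharply from run to run.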
How a run becomes a point on the chart
```text
make bench-locomo / bench-longmemeval / bench-parallel
  ↓ writes per-project result JSON under
  ↓ benchmark-results/<harness>/<bench>/<project>/<bench>_results_*.json
  ↓
make bench-archive-rescan
  ↓ scripts/archive_bench_results.py walks the canonical tree,
  ↓ skips projects with _ARCHIVED marker, ingests both schemas
  ↓ (Mem0 metrics_by_cutoff.top_20 OR PageIndex overall_accuracy)
  ↓ gzip + put_object → private bucket
  ↓ patch per-day manifest, regenerate trajectory + badges

private bucket: runs/<date>/<stage>/<bench>/results-*.json.gz
public bucket:  trajectory/<bench>.json   ← this page reads from here
public bucket:  badges/<bench>.json       ← README badges read from here
```

A run that completes without R2 credentials still writes locally; it just doesn’t show up here. To archive everything not already pushed (idempotent: projects with an _ARCHIVED marker are skipped):
```sh
cd astrocyte-py
make bench-archive-rescan
```

Smoke / micro runs (stage containing smoke, or n_questions < 30) are filtered out by default; they would otherwise show up as 0% / 100% extreme points on the trajectory. Pass INCLUDE_SMOKE=1 to archive them anyway.
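The smoke filter described above amounts to a two-condition predicate plus an opt-in override. The sketch below is one way to express it, not the code in scripts/archive_bench_results.py: the function names are hypothetical, and treating the stage match as a case-sensitive substring test is an assumption.

```python
import os
from typing import Optional

MIN_QUESTIONS = 30  # runs with fewer questions count as micro runs


def is_smoke_run(stage: str, n_questions: int) -> bool:
    """True when the stage name contains 'smoke' or the run is too small."""
    # Assumption: case-sensitive substring match on the stage name.
    return "smoke" in stage or n_questions < MIN_QUESTIONS


def should_archive(stage: str, n_questions: int, include_smoke: Optional[bool] = None) -> bool:
    """Apply the default filter unless INCLUDE_SMOKE=1 (or an explicit override)."""
    if include_smoke is None:
        include_smoke = os.environ.get("INCLUDE_SMOKE") == "1"
    return include_smoke or not is_smoke_run(stage, n_questions)
```

Under this rule, a 30-question run such as the LME-30 subset is kept, while a 29-question run is dropped unless the override is set.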
Cycle close → release wiring
When a cycle’s ship-gate condition is identified, mark its replicate runs with a label so the badge writer can pick them out cleanly:
```sh
make bench-mark-shipped PROJECT=m19-b1-dp-rrf-run-1 LABEL=m19 RATIONALE="..."
make bench-mark-shipped PROJECT=m19-b1-dp-rrf-run-2 LABEL=m19 RATIONALE="..."
make bench-refresh-labels         # patches R2 manifests, regenerates badges
make bench-tag-shipped LABEL=m19  # annotated git tag bench/m19 anchors the cycle
```

At release time, release-mark-all links the released package version(s) to the cycle in BENCH_PARITY.yaml and the README badges flip to display the released version’s frozen scores. Full ritual in RELEASING.md.
Related
- Evaluation: what the suites measure and how scoring works.
- Benchmark roadmap: the PR1 / PR2 / PR3 plan whose deltas appear here.
- Bench archive: bucket layout and tooling reference.
- Doppler setup: how the R2_* secrets are provisioned.