Benchmark trajectory

Live time-series of Astrocyte’s LoCoMo and LongMemEval scores. Data is read from the public R2 bucket at runtime — every archived run shows up within seconds of upload, with no docs rebuild needed.

The chart is rendered client-side from trajectory/<bench>.json in the public R2 bucket. See bench-archive for the bucket layout, the archive / fetch tooling, and how the trajectory artifact is regenerated on each run.
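To read the same artifact outside the browser, a minimal sketch could look like the following. It assumes the base URL is the value of the R2_PUBLIC_URL setting mentioned on this page and that the bench slugs are `locomo` and `longmemeval`; `trajectory_url` and `load_trajectory` are hypothetical helper names, and nothing is assumed about the JSON schema.

```python
import json
import urllib.request


def trajectory_url(base_url: str, bench: str) -> str:
    """Build the public URL of trajectory/<bench>.json (path layout per this page)."""
    return f"{base_url.rstrip('/')}/trajectory/{bench}.json"


def load_trajectory(base_url: str, bench: str, timeout: float = 10.0):
    """Fetch and decode the trajectory artifact; the payload schema is not assumed."""
    url = trajectory_url(base_url, bench)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```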

README badges link here

The two shields.io endpoint badges at the top of the main README link to this page:

LoCoMo LongMemEval

The badge JSON lives at badges/<bench>.json on the same public R2 bucket — regenerated whenever make bench-refresh-labels runs. Its content depends on what BENCH_PARITY.yaml (at the repo root) records:

- If a published release exists for the astrocyte package: the badge shows that release’s frozen scores (e.g. LoCoMo (astrocyte v0.14.0) 83.8%). This is what pip install astrocyte actually produces.
- If no release exists yet but a cycle has been marked shipped: the badge shows the mean of that cycle’s shipped run pair (e.g. LoCoMo (n=200, M19 × 2 runs) 84.1%).
- Otherwise: the badge falls back to the most recent non-smoke run.
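The precedence above can be sketched as a small function that emits a shields.io endpoint payload (`schemaVersion`, `label`, `message`). This is an illustration, not the project's badge writer: `badge_payload` and its arguments are hypothetical, the caller is assumed to have already read the relevant values out of BENCH_PARITY.yaml, and the label used for the fallback case is a guess.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def badge_payload(
    bench: str,
    release: Optional[Tuple[str, float]] = None,            # (version, frozen score in %)
    shipped: Optional[Tuple[str, Sequence[float]]] = None,  # (descriptor, run scores in %)
    latest_score: Optional[float] = None,                   # most recent non-smoke run, in %
) -> dict:
    """Pick the badge source in the order described on this page."""
    if release is not None:
        version, score = release
        label, value = f"{bench} (astrocyte v{version})", score
    elif shipped is not None:
        descriptor, scores = shipped
        label, value = f"{bench} ({descriptor})", mean(scores)
    elif latest_score is not None:
        # Assumption: the fallback badge is labelled with the bare bench name.
        label, value = bench, latest_score
    else:
        raise ValueError("no release, shipped cycle, or archived run available")
    # Required fields of the shields.io endpoint-badge JSON schema.
    return {"schemaVersion": 1, "label": label, "message": f"{value:.1f}%"}
```

Calling `badge_payload("LoCoMo", release=("0.14.0", 83.8))` yields the label and message from the first example above; a release always wins over a shipped cycle even when both are supplied.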

The full cycle-close + release ritual that drives badge changes lives in RELEASING.md at the repo root.

Current operating point (post-M17 close, 2026-05-17)

Conversation Engine + Memory Engine, gpt-4o-mini answerer + judge, --user-profile enabled. Fair subsets: LME-30 via --per-type 5; LoCoMo-200 via MEM0_HARNESS_LOCOMO_MAX_Q=20 (matches baseline subset). 3-run replication.

| Bench | Cutoff | M14 baseline | M17 3-run mean (Conversation Engine) | Δ |
|---|---|---|---|---|
| LME | top_20 | 65.0% ±3.3pp | 75.56% ±5.09pp | +10.56pp (3.2σ) |
| LoCoMo | top_20 | 78.25% ±1.5pp | 80.50% ±1.50pp | +2.25pp (1.5σ) |

M17 routes LME/LoCoMo through the new Conversation Engine (astrocyte/conversations/ + session-aware Hindsight-parity chunking) → ConversationIngestor → retain SPI → existing extraction. The Document Engine (astrocyte/documents/) ships in parallel but is not benched on conversation workloads (its bench is FinanceBench / DoubleBench, future cycle). See m17-pageindex-ingestion.md §8.

LME run-to-run std (±5.09pp) is wider than the baseline’s (±3.3pp), with the spread concentrated in the N=5 single-session-preference and single-session-assistant categories. A conservative “central tendency” estimate of the LME lift, discounting one SSP outlier, is closer to +8pp than the headline +10.56pp. Both LME figures clear the locked 2σ ship gate.
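The σ figures in the table are consistent with dividing each lift by the baseline’s run-to-run standard deviation. A short worked check under that assumption follows; `lift_in_sigma` and `clears_gate` are hypothetical names, and treating the gate as inclusive is also an assumption.

```python
def lift_in_sigma(baseline_pct: float, new_pct: float, baseline_std_pp: float):
    """Return (lift in pp, lift expressed in baseline standard deviations)."""
    delta_pp = new_pct - baseline_pct
    return delta_pp, delta_pp / baseline_std_pp


def clears_gate(sigma: float, gate: float = 2.0) -> bool:
    # Assumption: the gate is inclusive (>=); the page only says "2σ ship gate".
    return sigma >= gate


# Values taken from the table and paragraph above.
lme_delta, lme_sigma = lift_in_sigma(65.0, 75.56, 3.3)
print(f"LME headline: +{lme_delta:.2f}pp = {lme_sigma:.1f} sigma")  # +10.56pp = 3.2 sigma

conservative_sigma = 8.0 / 3.3  # the ~+8pp central-tendency estimate
print(f"LME conservative: {conservative_sigma:.1f} sigma")          # 2.4 sigma
```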

Gap to Hindsight’s published numbers (94.6% LME, 92.0% LoCoMo): −19pp LME, −12pp LoCoMo remaining. M18 quick-wins target the cheapest engine-side levers.

Previous operating point (post-M14 close, 2026-05-15)

Multi-run aggregates on the Mem0 harness, gpt-4o-mini judge + answerer, --user-profile enabled. HEAD was at commit 6ec61ea (revert(m14.2)).

| Bench | Cutoff | 4-run baseline | M14 experiment WIP (3-run mean) |
|---|---|---|---|
| LME | top_20 | 65.0% ±3.3pp | 68.9% ±5.0pp |
| LoCoMo | top_20 | 78.25% ±1.5pp | 75.83% ±1.5pp |

The “M14 experiment WIP” column reflects bench runs of a working-tree implementation that was never committed — the experiment was torn down after replication invalidated the single-run +8.3pp LME lift. Single-run highs from the M14 cycle (LME 73.3%, LoCoMo 81.0%) did NOT replicate; treat them as variance. See m13-m14-roadmap.md §§8.7-8.10 for the full retrospective and null-verdict explanation.

LoCoMo

Trajectory chart not configured. Set R2_PUBLIC_URL at docs build time (doppler run --config bench -- pnpm build) to enable the locomo chart. See benchmarks-doppler-setup.

LongMemEval

Trajectory chart not configured. Set R2_PUBLIC_URL at docs build time (doppler run --config bench -- pnpm build) to enable the longmemeval chart. See benchmarks-doppler-setup.

What you’re looking at

Each point on the overall accuracy chart is one archived bench run. Hovering reveals the run’s stage (e.g. pr2-d55-gate, weekly-ci), the git commit it was launched from, and the question count.

The per-category accuracy chart breaks the same runs down by category. For LoCoMo: single-hop, multi-hop, temporal, open-domain, adversarial. For LongMemEval: single-session-user, single-session-assistant, single-session-preference, multi-session, temporal-reasoning, knowledge-update.

How a run becomes a point on the chart

make bench-locomo / bench-longmemeval / bench-parallel
   ↓ writes per-project result JSON under
   ↓ benchmark-results/<harness>/<bench>/<project>/<bench>_results_*.json
   ↓
make bench-archive-rescan
   ↓ scripts/archive_bench_results.py walks the canonical tree,
   ↓ skips projects with _ARCHIVED marker, ingests both schemas
   ↓ (Mem0 metrics_by_cutoff.top_20 OR PageIndex overall_accuracy)
   ↓ gzip + put_object → private bucket
   ↓ patch per-day manifest, regenerate trajectory + badges
private bucket: runs/<date>/<stage>/<bench>/results-*.json.gz
public bucket:  trajectory/<bench>.json   ← this page reads from here
public bucket:  badges/<bench>.json       ← README badges read from here
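The "gzip + put_object" step in the flow above can be sketched roughly as follows. This is not the code in scripts/archive_bench_results.py: the helper names are hypothetical, `run_id` stands in for whatever the `*` in `results-*.json.gz` expands to, and `client` is assumed to be a boto3-style S3 client pointed at the R2 endpoint.

```python
import gzip
import json


def archive_key(date: str, stage: str, bench: str, run_id: str) -> str:
    """Object key following the private-bucket layout shown above."""
    return f"runs/{date}/{stage}/{bench}/results-{run_id}.json.gz"


def pack_result(result: dict) -> bytes:
    """Serialize a result dict to JSON and gzip it."""
    raw = json.dumps(result, sort_keys=True).encode("utf-8")
    return gzip.compress(raw)


def archive_run(client, bucket: str, key: str, result: dict) -> None:
    """Upload one compressed result with an S3-compatible put_object call."""
    client.put_object(Bucket=bucket, Key=key, Body=pack_result(result))
```

Keeping key construction and compression separate from the upload makes both easy to check without credentials, which matters because runs without R2 access still write locally.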

A run that completes without R2 credentials still writes locally; it just doesn’t show up here. To archive everything not already pushed (idempotent — projects with an _ARCHIVED marker are skipped):

cd astrocyte-py
make bench-archive-rescan

Smoke / micro runs (stage containing smoke, or n_questions < 30) are filtered out by default — they would otherwise show up as 0% / 100% extreme points on the trajectory. Pass INCLUDE_SMOKE=1 to archive them anyway.
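The filter described above amounts to a small predicate. The sketch below is an assumption about its shape rather than the archiver's actual code: the function names are hypothetical, the substring match is taken to be case-sensitive, and INCLUDE_SMOKE is read as the literal string "1".

```python
import os


def is_smoke(stage: str, n_questions: int, min_questions: int = 30) -> bool:
    """True for smoke or micro runs: stage contains 'smoke' or fewer than 30 questions."""
    return "smoke" in stage or n_questions < min_questions


def should_archive(stage: str, n_questions: int, env=None) -> bool:
    """Apply the default filter unless INCLUDE_SMOKE=1 is set in the environment."""
    env = os.environ if env is None else env
    include_smoke = env.get("INCLUDE_SMOKE") == "1"
    return include_smoke or not is_smoke(stage, n_questions)
```

Under this reading a 30-question LME run (--per-type 5 across six categories) is kept, because the threshold is strictly less than 30.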

Cycle close → release wiring

When a cycle’s ship-gate condition is identified, mark its replicate runs with a label so the badge writer can pick them out cleanly:

make bench-mark-shipped PROJECT=m19-b1-dp-rrf-run-1 LABEL=m19 RATIONALE="..."
make bench-mark-shipped PROJECT=m19-b1-dp-rrf-run-2 LABEL=m19 RATIONALE="..."
make bench-refresh-labels        # patches R2 manifests, regenerates badges
make bench-tag-shipped LABEL=m19 # annotated git tag bench/m19 anchors the cycle

At release time, release-mark-all links the released package version(s) to the cycle in BENCH_PARITY.yaml, and the README badges switch to showing the released version’s frozen scores. The full ritual is in RELEASING.md.

Evaluation — what the suites measure and how scoring works.
Benchmark roadmap — the PR1 / PR2 / PR3 plan whose deltas appear here.
Bench archive — bucket layout and tooling reference.
Doppler setup — how the R2_* secrets are provisioned.