Two Pre-Registered Benchmarks for Audit-Native RAG: RAB (EU AI Act 10/12/19) + LRB (Time-Travel Retrieval)

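Most RAG demos answer "what's the right chunk?" Very few can answer the two questions a regulator or an auditor will actually ask: can the decision be replayed from the audit trail, and does a query as of a point in time return the fact that was valid then? The two benchmarks here, RAB and LRB, take those questions in turn.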

RAB measures whether your audit trail is good enough to replay a decision, with three deterministic metrics: AC, RF, and PC.

The three metrics map verbatim to EU AI Act Articles 10, 12, and 19 — record-keeping obligations that apply from 2026-08-02 (per Article 113).

```
            AC     RF     PC
JAMES       1.000  1.000  1.000
Baseline-0  0.275  0.000  0.000   (vanilla default-logging)
```

The gap is the whole point. "We have logs" (AC 0.275) is not the same as "we can replay the decision" (RF 0). Default application logging gets you a partial event trail and zero replay/provenance, which is exactly the failure mode an Article 12 audit would surface.
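To make the "logs versus replay" distinction concrete, here is a minimal sketch of a record that can be replayed, as opposed to a free-text log line. It is an illustration only: `DecisionRecord`, `record_decision`, `can_replay`, and the prompt template are hypothetical names invented for this example, not the RAB schema or the JAMES implementation, and the sample policy text is made up.

```python
import hashlib
from dataclasses import dataclass


def sha256(text: str) -> str:
    """Hex digest used as a content address for chunks and prompts."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class DecisionRecord:
    """Hypothetical audit record; field names are illustrative, not the RAB schema."""
    query: str
    chunk_hashes: tuple[str, ...]  # content hashes of the chunks shown to the model
    model: str                     # exact model tag used for generation
    prompt_hash: str               # hash of the fully assembled prompt


def build_prompt(query: str, chunks: list[str]) -> str:
    """Assumed prompt template; a real system would version this as well."""
    context = "\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


def record_decision(query: str, chunks: list[str], model: str) -> DecisionRecord:
    """Capture what is needed to rebuild the exact model input later."""
    return DecisionRecord(
        query=query,
        chunk_hashes=tuple(sha256(c) for c in chunks),
        model=model,
        prompt_hash=sha256(build_prompt(query, chunks)),
    )


def can_replay(record: DecisionRecord, archive: dict[str, str]) -> bool:
    """True if every chunk is recoverable by hash and the prompt rebuilds identically."""
    try:
        chunks = [archive[h] for h in record.chunk_hashes]
    except KeyError:
        return False  # a chunk was overwritten or deleted: the trail is incomplete
    return sha256(build_prompt(record.query, chunks)) == record.prompt_hash


chunks = ["Policy v2: refunds within 14 days."]   # made-up sample data
archive = {sha256(c): c for c in chunks}          # content-addressed chunk store
rec = record_decision("What is the refund window?", chunks, "gemma4:e4b")

print(can_replay(rec, archive))  # True
print(can_replay(rec, {}))       # False
```

A plain log line such as "answered refund question" may count as an event, but it carries none of the fields `can_replay` needs, which is roughly the situation the Baseline-0 row describes.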

RAG facts go stale. A policy is superseded, a price changes, a spec is revised. LRB asks: when you query as of a point in time, do you retrieve the fact that was valid then, or whatever overwrote it?
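The difference between "the fact that was valid then" and "whatever overwrote it" can be sketched with validity intervals. This is a simplified illustration under stated assumptions, not the LRB runner: the `Fact` type, the `newest_wins` and `as_of` helpers, and the refund-window data are all hypothetical, and the embedding-similarity step of a real retriever is left out so that only the time handling is visible.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    key: str                   # what the fact is about, e.g. "refund_window"
    value: str
    valid_from: date           # first day this version applies
    valid_to: Optional[date]   # first day it no longer applies; None = still current


def newest_wins(facts: list[Fact], key: str) -> Optional[Fact]:
    """Naive supersede: return the most recent version, ignoring the query date."""
    candidates = [f for f in facts if f.key == key]
    return max(candidates, key=lambda f: f.valid_from, default=None)


def as_of(facts: list[Fact], key: str, when: date) -> Optional[Fact]:
    """Time-aware lookup: return the version whose validity interval contains `when`."""
    for f in facts:
        in_window = f.valid_from <= when and (f.valid_to is None or when < f.valid_to)
        if f.key == key and in_window:
            return f
    return None


# Made-up example data: a policy that was superseded on 2024-06-01.
facts = [
    Fact("refund_window", "30 days", date(2023, 1, 1), date(2024, 6, 1)),
    Fact("refund_window", "14 days", date(2024, 6, 1), None),
]
query_date = date(2024, 3, 15)

print(newest_wins(facts, "refund_window").value)        # 14 days
print(as_of(facts, "refund_window", query_date).value)  # 30 days
```

For a question about March 2024, the newest-wins lookup returns the later policy, while the as-of lookup returns the version that was in force on that date.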

The R@1 ordering V < N < J holds across 4 model families × 4 scale points (a 12.5× scale span) — time-aware retrieval beats both naive overwrite and no time-handling at every scale, not just one lucky cell.
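For readers unfamiliar with the metric, R@1 is sketched below under the usual definition: the share of queries whose top-ranked result is the expected one. This assumes a single expected item per query; the function name `recall_at_1` and the document ids are illustrative, and the actual LRB scoring code may differ in detail.

```python
def recall_at_1(ranked_results: list[list[str]], gold_ids: list[str]) -> float:
    """Fraction of queries whose first-ranked id equals the expected id."""
    if len(ranked_results) != len(gold_ids):
        raise ValueError("one ranked list is required per expected id")
    if not gold_ids:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(ranked_results, gold_ids)
        if ranked and ranked[0] == gold  # an empty result list counts as a miss
    )
    return hits / len(gold_ids)


# Made-up example: four queries, two of which rank the expected id first.
ranked = [["doc_v1", "doc_v2"], ["doc_v2", "doc_v1"], ["doc_v3"], []]
gold = ["doc_v1", "doc_v1", "doc_v3", "doc_v9"]

print(recall_at_1(ranked, gold))  # 0.5
```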

```
     R@1
V    0.502
N    0.721
J    0.845
```

How to run it yourself

Everything is local: Ollama (gemma4:e4b default) + BAAI/bge-m3 embeddings + ChromaDB. No cloud LLM account.

```
git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol   # enter the cloned repo so the relative paths below resolve
cp .env.example .env
pip install -r requirements.txt
ollama pull gemma4:e4b
# benchmark runners live in scripts/research/ (lrb_run*.py, rab_*)
```

Honest framing

These are benchmarks, not a victory lap. JAMES hitting 1.0/1.0/1.0 on a scenario I designed is a starting line, not proof of general superiority. The value is that the scenarios, metrics, and baselines are public and deterministic, so you can run them, disagree, and beat the numbers.

Feedback I'd value most: (a) does the AC/RF/PC ↔ Art. 10/12/19 mapping hold up under your reading of the text? (b) is "newest wins" the right Naive-supersede baseline for LRB, or is there a stronger one I should add?
