ARA Eval

Agent Risk Assessment — when can an agent act alone?

Most risk frameworks collapse risk into a single score. ARA Eval decomposes it into 7 dimensions, applies deterministic gates, and produces a fingerprint that tells you exactly where the danger is — and where it isn't.

Open-source. Built for enterprises evaluating AI agent autonomy, universities teaching AI governance, and regulators stress-testing policy frameworks.

GitHubLeaderboard

Risk Is a Fingerprint, Not a Score

Two scenarios, both “high risk.” But the interventions are completely different:

Insurance Claims Processor (HK cross-border) Fingerprint: B-C-A-D-D-B-C ├── Decision Reversibility: B (clawback possible) ├── Failure Blast Radius: C (one policyholder) ├── Regulatory Exposure: A ← HARD GATE: PIPL + PDPO triggered ├── Decision Time Pressure: D (days to process) ├── Data Confidence: D (structured claim data) ├── Accountability Chain: B (auditable but cross-border) └── Graceful Degradation: C (queue for human review) Algorithmic Trading Deployment Fingerprint: A-A-A-A-C-C-A ├── Decision Reversibility: A ← HARD GATE: trades are instant ├── Failure Blast Radius: A ← HARD GATE: market-wide impact ├── Regulatory Exposure: A ← HARD GATE: direct mandate ├── Decision Time Pressure: A (milliseconds) ├── Data Confidence: C (market data is noisy) ├── Accountability Chain: C (logged but opaque) └── Graceful Degradation: A (cascading failure / Knight Capital)

The 7 Dimensions

DimensionWhat it measuresGate
Decision ReversibilityCan you undo it?Soft
Failure Blast RadiusHow many people/systems/dollars?Hard
Regulatory ExposureDoes it touch compliance?Hard
Decision Time PressureHow long before you must act?Soft
Data ConfidenceDoes the agent have enough signal?Soft
Accountability ChainWho’s responsible? Can you audit?Soft
Graceful DegradationDoes it fail safely or cascade?Soft

Hard Gates: The Aviation Principle

Regulatory Exposure = A → autonomy not permitted, full stop

Failure Blast Radius = A → human oversight required

The gating rules are deterministic code, never delegated to the LLM. The LLM classifies the dimensions. The code enforces the policy. You can swap models, change prompts, add jurisdictions — but the gates don't move.

LLM-as-Judge Results

Can LLMs evaluate scenarios against this framework and match human judgment? 11 models tested across 6 real-world scenarios:

Model Gate Recall Calibration Time (18 evals) ───────────────────────────────────────────────────────────────────── Claude Opus 100% 87% — Gemini Flash Lite 100% 71s fastest ───────────────────────────────────────────────────────────────────── Everything else sharp cliff in gate recall

Full results on the leaderboard.

Who It's For

Enterprises — Structured reports for evaluating which workflows can safely use autonomous agents, and which need human-in-the-loop.

Universities — Real-world scenarios and a 5-week MBA capstone syllabus for teaching AI governance.

Regulators — Hong Kong's unique overlap of HKMA GenAI Sandbox, PIPL, and PDPO stress-tests the framework against regulatory complexity other jurisdictions haven't faced.

What's Included

Scenarios — 6 core evaluation scenarios grounded in real incidents (Samsung leak, Knight Capital, HK cross-border claims)

Rubric — 7-dimension scoring rubric with A–D ratings and worked examples

Evaluation pipeline — Automated LLM-as-judge harness with gate recall and calibration metrics

Gating rules — Deterministic hard/soft gate logic (code, not prompts)

MBA syllabus — 5-week capstone course for AI governance education

Built By

Digital Rain Technologies. Founded by Augustin Chan, building at the intersection of AI systems and enterprise governance. Built in Hong Kong.

Read the full technical write-up: Risk Isn't a Number. It's a Fingerprint.