# ARA Eval

**Agent Risk Assessment — when can an agent act alone?**
Most risk frameworks collapse risk into a single score. ARA Eval decomposes it into 7 dimensions, applies deterministic gates, and produces a fingerprint that tells you exactly where the danger is — and where it isn't.
Open-source. Built for enterprises evaluating AI agent autonomy, universities teaching AI governance, and regulators stress-testing policy frameworks.
## Risk Is a Fingerprint, Not a Score

Two scenarios can both score "high risk," yet call for completely different interventions.
## The 7 Dimensions
| Dimension | What it measures | Gate |
|---|---|---|
| Decision Reversibility | Can you undo it? | Soft |
| Failure Blast Radius | How many people/systems/dollars? | Hard |
| Regulatory Exposure | Does it touch compliance? | Hard |
| Decision Time Pressure | How long before you must act? | Soft |
| Data Confidence | Does the agent have enough signal? | Soft |
| Accountability Chain | Who’s responsible? Can you audit? | Soft |
| Graceful Degradation | Does it fail safely or cascade? | Soft |
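One way to picture a fingerprint is as a structured record with one A–D rating per dimension. This is an illustrative sketch, not ARA Eval's actual schema; the field names and the `RiskFingerprint` class are assumptions for the example.

```python
from dataclasses import dataclass

# A = most severe rating, D = least severe (assumed ordering for this sketch).
RATINGS = ("A", "B", "C", "D")

@dataclass(frozen=True)
class RiskFingerprint:
    """Hypothetical fingerprint: one rating per ARA dimension."""
    decision_reversibility: str
    failure_blast_radius: str
    regulatory_exposure: str
    decision_time_pressure: str
    data_confidence: str
    accountability_chain: str
    graceful_degradation: str

    def __post_init__(self):
        # Validate every dimension holds a legal A-D rating.
        for name, value in vars(self).items():
            if value not in RATINGS:
                raise ValueError(f"{name} must be one of {RATINGS}, got {value!r}")
```

Because the fingerprint keeps all seven ratings instead of averaging them, two scenarios with the same overall severity can still look, and be treated, completely differently.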
## Hard Gates: The Aviation Principle

- **Regulatory Exposure = A** → autonomy not permitted, full stop
- **Failure Blast Radius = A** → human oversight required
The gating rules are deterministic code, never delegated to the LLM. The LLM classifies the dimensions. The code enforces the policy. You can swap models, change prompts, add jurisdictions — but the gates don't move.
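The split described above can be sketched in a few lines. This is a minimal illustration of the principle, not the project's actual code; the function name, rating keys, and decision strings are assumptions.

```python
def apply_hard_gates(ratings: dict) -> str:
    """Deterministic hard-gate check over A-D dimension ratings.

    The LLM classifies the dimensions and produces `ratings`;
    this function, not the model, decides what autonomy is allowed.
    """
    # Hard gate 1: any regulatory exposure rated A forbids autonomy outright.
    if ratings["regulatory_exposure"] == "A":
        return "autonomy_not_permitted"
    # Hard gate 2: a blast radius rated A mandates human oversight.
    if ratings["failure_blast_radius"] == "A":
        return "human_oversight_required"
    # Everything else falls through to soft-gate review.
    return "eligible_for_soft_gate_review"
```

Swapping the model or rewriting the prompts changes what lands in `ratings`; it never changes this function, which is the sense in which the gates "don't move."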
## LLM-as-Judge Results

Can LLMs evaluate scenarios against this framework and match human judgment? We tested 11 models across 6 real-world scenarios.
Full results on the leaderboard.
## Who It's For

- **Enterprises** — Structured reports for evaluating which workflows can safely use autonomous agents, and which need a human in the loop.
- **Universities** — Real-world scenarios and a 5-week MBA capstone syllabus for teaching AI governance.
- **Regulators** — Hong Kong's unique overlap of the HKMA GenAI Sandbox, PIPL, and the PDPO stress-tests the framework against regulatory complexity other jurisdictions haven't faced.
## What's Included

- **Scenarios** — 6 core evaluation scenarios grounded in real incidents (Samsung leak, Knight Capital, HK cross-border claims)
- **Rubric** — 7-dimension scoring rubric with A–D ratings and worked examples
- **Evaluation pipeline** — Automated LLM-as-judge harness with gate recall and calibration metrics
- **Gating rules** — Deterministic hard/soft gate logic (code, not prompts)
- **MBA syllabus** — 5-week capstone course for AI governance education
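Gate recall, one of the pipeline's metrics, can be sketched as: of the scenarios where the human labels trigger a hard gate, what fraction does the LLM judge also gate? The function below is an assumed formulation for illustration, not the harness's actual implementation.

```python
def gate_recall(human_gated: list[bool], judge_gated: list[bool]) -> float:
    """Fraction of human-gated scenarios that the judge also gated.

    Missing a hard gate is the costly error, so recall on the
    gated class is the metric that matters most.
    """
    # Judge verdicts on only the scenarios humans marked as gated.
    judge_on_positives = [j for h, j in zip(human_gated, judge_gated) if h]
    if not judge_on_positives:
        return 1.0  # no gated scenarios: vacuously perfect recall
    return sum(judge_on_positives) / len(judge_on_positives)
```

For example, if humans gate scenarios 1 and 2 out of three, and the judge gates scenarios 1 and 3, gate recall is 0.5: the judge caught one of the two true hard gates.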
## Built By
Digital Rain Technologies. Founded by Augustin Chan, building at the intersection of AI systems and enterprise governance. Built in Hong Kong.
Read the full technical write-up: *Risk Isn't a Number. It's a Fingerprint.*