# SNAG-Bench

The Quality Certifier: a temporal reasoning benchmark for LLMs. 60 adversarial tasks, 5 scoring axes, 3 difficulty tiers, designed to stay hard through 2030. Measures Causal Resolution: Coverage x Convergence.

- GitHub: timepoint-ai/timepoint-snag-bench (Apache-2.0, Python, Click CLI)
- Detailed Docs: full task reference, scoring methodology, and evaluation docs
## 5 Scoring Axes
| Axis | Name | Source | Status |
|---|---|---|---|
| 1 | GSR (Grounding) | Flash API | Live |
| 2 | TCS (Temporal Coherence) | Pro subprocess/API | Live |
| 3 | WMNED (Predictive) | Proteus markets | Stubbed |
| 4 | HTP (Human Judgment) | OpenRouter LLM judges | Live |
| 5 | GCQ (Graph Coverage) | Clockchain stats | Stubbed |
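The table above mixes live and stubbed axes. A minimal sketch of how that status could be modeled and filtered in code; the `Axis` dataclass and the aggregation are illustrative assumptions, not the benchmark's actual API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Axis:
    """One scoring axis from the table above (illustrative model)."""
    key: str
    name: str
    source: str
    live: bool  # True = Live, False = Stubbed


AXES = [
    Axis("GSR", "Grounding", "Flash API", True),
    Axis("TCS", "Temporal Coherence", "Pro subprocess/API", True),
    Axis("WMNED", "Predictive", "Proteus markets", False),
    Axis("HTP", "Human Judgment", "OpenRouter LLM judges", True),
    Axis("GCQ", "Graph Coverage", "Clockchain stats", False),
]

# Only live axes contribute scores today; stubbed axes are skipped.
live_axes = [a.key for a in AXES if a.live]
print(live_axes)  # ['GSR', 'TCS', 'HTP']
```

Keeping stubbed axes in the list (rather than omitting them) makes it trivial to flip WMNED or GCQ to live later without touching the scoring loop.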
## Usage

SNAG-Bench is a local CLI tool, not a deployed service.

## Task Tiers
| Tier | Difficulty | Tasks |
|---|---|---|
| 1 | Standard | 20 |
| 2 | Hard | 20 |
| 3 | Adversarial | 20 |
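Since the Usage note above describes a local Click CLI, here is a minimal hypothetical sketch of what a tier-selecting entry point could look like. The command name, option, and task-ID format are illustrative assumptions, not SNAG-Bench's actual interface:

```python
# Hypothetical Click entry point; names and output format are
# illustrative, not the real SNAG-Bench CLI.
import click

TIERS = {1: "Standard", 2: "Hard", 3: "Adversarial"}
TASKS_PER_TIER = 20  # 3 tiers x 20 tasks = 60 tasks total


@click.command()
@click.option(
    "--tier",
    type=click.IntRange(1, 3),
    default=1,
    help="Difficulty tier: 1=Standard, 2=Hard, 3=Adversarial.",
)
def run(tier: int) -> None:
    """List the task IDs in the selected tier."""
    for i in range(TASKS_PER_TIER):
        click.echo(f"tier{tier}-task{i:02d} ({TIERS[tier]})")


if __name__ == "__main__":
    run()
```

Invoked as `python snag.py --tier 2`, a sketch like this would print the 20 Hard-tier task IDs; the real tool's commands are documented in the repository.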
## Causal Resolution
The composite metric: Coverage x Convergence.

- Coverage — how much of the relevant temporal space is represented
- Convergence — how consistent the rendered outputs are across runs
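The composite can be made concrete with a toy computation. A minimal sketch, assuming both factors are normalized to [0, 1]; the normalization and the function name are assumptions, not the benchmark's published formula:

```python
def causal_resolution(coverage: float, convergence: float) -> float:
    """Composite metric: Coverage x Convergence, both assumed in [0, 1]."""
    if not (0.0 <= coverage <= 1.0 and 0.0 <= convergence <= 1.0):
        raise ValueError("coverage and convergence must be in [0, 1]")
    return coverage * convergence


# E.g. covering 80% of the relevant temporal space with 90% consistent
# outputs across runs yields a composite of 0.72.
print(round(causal_resolution(0.8, 0.9), 2))  # 0.72
```

A multiplicative composite penalizes imbalance: a run that covers everything but renders inconsistently (1.0 x 0.2) scores no better than one that is consistent over a narrow slice (0.2 x 1.0).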