Overall Performance
All models evaluated at high reasoning effort
Ebla: 25.4%
Opus 4.6: 20.1%
Sonnet 4.6: 19.3%
GPT-5.4: 17.8%
Gemini 3.1 Pro: 12.2%
GPT-5.2: 11.3%
Grok 4.1 Fast: 8.2%
GPT-OSS-120b: 7.1%
Gemini 3 Flash: 6.3%
Aviro benchmarks measure how well frontier models handle document-heavy workflows, grounded reasoning, and multi-step tool use inside enterprise environments.