Enterprise Search Benchmark
By Aviro Research
Research · Dec 11, 2025
Introduction
The Enterprise Search Benchmark (ESB) is the first benchmark designed to evaluate deep research agents on enterprise search scenarios. Unlike web-search benchmarks that test open-domain retrieval, ESB targets the proprietary document corpora, multi-document reasoning, and visual understanding required in real enterprise workflows. This initial release, ESB-10, introduces $10$ tasks spanning realistic enterprise workflows, cross-document dependencies, multi-step reasoning, and complex data—including visual content (screenshots, diagrams, charts) and large tables.
We evaluated seven frontier models—Claude Opus 4.5, Claude Sonnet 4.5, Claude Sonnet 4, GPT-5.1, GPT-5.2, Gemini 3 Pro, and Gemini 3 Flash—each run $3$ times per task to account for output variance, for a total of $210$ runs. Even the top-performing model (Claude Sonnet 4.5) achieves only $29.6\%$ across tasks, and only $2$ of $10$ tasks were ever fully solved by any model. Detailed results appear in the Analysis section below.
ESB-10 is also the first benchmark to combine visual reasoning, messy internal corpora, and strict abstinence requirements—testing agents on real-world enterprise challenges like extracting information from screenshots, navigating unfinished docs, and recognizing when data doesn’t exist. In contrast, existing benchmarks each focus on only $1$ or $2$ of these areas. The table below compares ESB-10 to other enterprise search benchmarks:
| Benchmark | Primary Focus | Visual / PDF-native | Messy internal docs | Closed corpus | Abstinence focus | Agentic vs RAG | Multi-hop reasoning |
|---|---|---|---|---|---|---|---|
| HERB (Salesforce, 2025) 1 2 | Deep search over enterprise artifacts (Slack, docs, PRs, meetings) | No. Text and structured data only; no PDF-native or visual tasks. | Synthetic workflows. Text/logs/JSON only; no UI screenshots or WIP slides. | Yes. Closed synthetic enterprise corpus. | Yes. Includes unanswerable queries at question level. | Agentic. Multi-tool agent search. | Yes. Cross-artifact reasoning. |
| WixQA (Wix, 2025) 4 5 | Support QA over help center KB | No. Text-only articles and QA pairs. | No. Polished KB docs only. | Yes. KB snapshot only. | No. Answerable support questions. | Static RAG. Retrieval + generation pipelines. | Partial. Short multi-doc support answers. |
| OfficeQA (Databricks, 2025) 6 7 | Grounded reasoning on Treasury PDFs | Medium. PDFs with tables/charts, but uses internal parser instead of native PDF APIs. | No. Government bulletins, not internal enterprise content. | Yes. Treasury Bulletin corpus. | Limited. Focus on numeric correctness, not abstinence. | Agentic. Agents with toolchains navigating PDFs. | Yes. Multi-bulletin computations and time-series analysis. |
| MTRAG (IBM, 2025) 8 9 | Multi-turn conversational RAG | No. Text corpora only. | No. Curated QA/domain texts. | Yes. Fixed document sets per domain. | Some. Includes unanswerable questions. | RAG pipeline. Multi-turn RAG, not tool-rich agents. | Yes. Context-dependent multi-turn reasoning. |
| UAEval4RAG (Salesforce, 2024) 10 | Unanswerable query synthesis framework | No. Text KBs and existing QA benchmarks. | No. Standard text datasets. | Yes. Closed-KB RAG framework. | Core focus. Taxonomy of unanswerability types. | Static RAG. Retriever/reranker/LLM combinations. | No. Single-turn KB queries. |
| ESB-10 (Aviro) | Deep research agents over synthetic enterprise corpus | Yes. Native PDF APIs with screenshots, charts, dashboards, org charts, and tables. | Yes. Mixes polished decks with WIP docs, training materials, and Salesforce screenshots. | Yes. Zyro enterprise corpus only. | Yes, and hard. Explicit null-field requirements; strong difficulty correlation. | Agentic. search(), fetch(), answer() with citations and trajectory scoring. | Yes. Cross-modal $2$–$4$ doc tasks with business calculations. |
ESB-10 establishes a difficult baseline for enterprise research agents. Our evaluation reveals several critical failure modes that teams building research agents should address:
- Premature termination: Agents answer too early with incomplete information, missing critical data scattered across documents.
- Tool degradation: Tool-call errors (ID typos, missed calls) increase with task complexity—agents lose track of document references mid-task.
- Failure to abstain: Agents cannot admit when information is missing; instead, they hallucinate plausible answers from adjacent data.
- Incorrect citations: Wrong or mixed-up document references undermine verifiability, even when answers are correct.
- Multi-hop breakdown: Early reasoning errors cascade across steps, compounding inaccuracies in later conclusions.
- Visual limitations: Agents struggle with screenshots, charts, and UI elements—often missing information that isn't in plain text.
- Excessive tool use: Some runs exceed $100$ tool calls without solving the task, indicating thrashing rather than progress.
These findings indicate that future research must focus on enabling agents to traverse large and unfamiliar document corpora, perform precise multi-step reasoning spanning several documents, maintain accurate citations alongside correct answers, and recognize when requested information simply does not exist.
Benchmark
Environment
In our benchmark, models receive a query and must produce a structured answer by iteratively searching and fetching documents from a corpus. For document and task creation, we recruited domain experts from Fortune 500 companies, leading consulting firms, global investment banks, financial services institutions, enterprise technology companies, fintech platforms, quantitative trading firms, insurance providers, government regulatory agencies, and premier research universities.
The benchmark consists of three main components:
- Multimodal document corpus: The benchmark includes $21$ documents ($287$ pages) with dense visual content, ranging from polished corporate decks to rough work-in-progress docs. Each contains complex, realistic media—large multi-column tables, dashboard screenshots, diagrams, and placeholder images. Documents are manually created with consistent org names, employees, financial figures, and styling, with multiple review rounds to ensure cross-document coherence and diversity. Critically, all documents are provided as PDFs and processed through native vision APIs—agents must interpret visual content directly rather than relying on text extraction, making this a truly multimodal environment.
- Cloned platforms: We build functional clones of platforms such as Salesforce and Google Analytics, populated with realistic organization data (employee records, pipelines, dashboards, etc.), enabling rapid screenshot generation for document creation without needing real platform APIs or demo environments.
- Expert-crafted tasks: Tasks are designed based on expert judgment and informed by surveys of knowledge workers in relevant organizations. Each task undergoes rigorous peer review and quality assurance. Additionally, every agent run is carefully annotated to ensure data consistency and reliability.
ESB-10 includes $10$ tasks handpicked from our full benchmark to represent five core categories of enterprise search challenges:
- Bulk Extraction: Extract many similar items ($20+$ cases, knowledge articles) from screenshots with consistent field mapping across entries.
- Abstinence: Recognize when related data exists but doesn't actually answer the question—return null instead of hallucinating plausible values.
- Cross-Document Synthesis: Combine information across $2$–$4$ documents, following implicit connections and verifying consistency between sources.
- Visual Understanding: Interpret data from charts, dashboards, org charts, and platform screenshots where OCR/text extraction is insufficient.
- Complex Calculation: Multi-step arithmetic ($12+$ chained formulas) using data from multiple sources with domain-specific business rules.
Each task may require multiple categories. The hardest tasks combine $3+$ categories, particularly those requiring abstinence.
The benchmark centers on a synthetic enterprise analytics company serving healthcare, finance, and retail. Each task draws on $1$–$4$ documents ($2.3$ average), split between textual ($65.7\%$) and visual ($34.3\%$) evidence. See samples of tasks and document pages below.
Our design philosophy for ESB is evaluating generalist models without domain-specific fine-tuning or specialized adaptation. This design choice warrants explanation, as it diverges from how enterprises deploy production systems: organizations fine-tune models on proprietary institutional data (via bespoke environments) to achieve stronger performance on domain-specific tasks. However, before investing in fine-tuning models, teams need to understand baseline capabilities—can a model navigate unfamiliar document corpora, extract information across multiple sources, perform multi-hop reasoning, and abstain when data doesn't exist? These transferable skills determine whether a model is worth fine-tuning in the first place. ESB isolates and measures these fundamental capabilities by ensuring tasks require zero prior knowledge. Current results show that even without enterprise-specific complexities like access controls or legacy system integration, frontier models struggle with document retrieval, cross-document synthesis, and abstinence—suggesting these are foundational gaps, not deployment-specific issues.
Our synthetic corpus creation currently provides the control, reproducibility, and iteration speed essential for rigorous benchmarking. Our environments preserve the core challenges of internal docs—messy formatting, visual content, incomplete documentation, and multi-document dependencies. To prevent overfitting and ensure robust generalization, we are actively building multiple organization environments spanning different industries, document styles, and organizational structures. This approach lets us systematically test whether capabilities transfer across contexts rather than reflecting memorized patterns from a single corpus. ESB measures out-of-the-box performance before any enterprise-specific training. Enterprise teams can leverage these results to identify base models with strong fundamental capabilities, while frontier labs can use these environments to accelerate improvements and deliver more robust out-of-the-box deployments.
Methodology
The agent follows a three-step workflow—searching for relevant documents, fetching full documents for detailed analysis, and submitting an answer with citations. This workflow is modeled after ChatGPT Deep Research, which uses a similar search-fetch-answer pattern for deep information gathering.18 Its tools are as follows:
- $\text{search}(q)$ — Searches across all documents in the corpus to find the most relevant ones. The agent receives up to $5$ documents ranked by relevance, each with its ID, title, and full text content—including natural-language descriptions of visual elements like charts, tables, screenshots, and diagrams. This gives the agent an overview of what's available. The search uses a hybrid of semantic similarity (cosine distance on text-embedding-3-small embeddings) and BM25 keyword matching, combined via Reciprocal Rank Fusion; a sketch of this fusion appears after this list. Corpus documents have embedding vectors precomputed and stored in pgvector; at runtime, the query is encoded and matched against these stored vectors.
- $\text{fetch}(d_{id})$ — Retrieves the complete original document for detailed analysis. For PDFs, this provides the full visual document—enabling the agent to see charts, dashboards, org charts, and screenshots directly, which is essential when text descriptions alone are insufficient. PDFs are sent as base64 to the model's native PDF input API for direct visual understanding.
- $\text{answer}(\cdot)$ — Submits the final structured response with page-level citations. Each citation must specify the document ID, page numbers, and evidence type (visual or textual) for every claim.
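To make the retrieval step concrete, here is a minimal sketch of Reciprocal Rank Fusion over a semantic ranking and a BM25 ranking. The SQL string and table/column names are illustrative assumptions rather than the benchmark's actual schema; only the fusion logic is intended to match the description above.

```python
from collections import defaultdict

# Hypothetical pgvector query: cosine distance (`<=>`) against stored
# text-embedding-3-small vectors, closest documents first.
SEMANTIC_SQL = """
SELECT id, title
FROM documents
ORDER BY embedding <=> %(query_embedding)s
LIMIT 50;
"""

def rrf_fuse(semantic_ids, bm25_ids, k=60, top_n=5):
    """Fuse two ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document earns 1 / (k + rank) from every list it appears in;
    k=60 is the constant commonly used in the RRF literature.
    """
    scores = defaultdict(float)
    for ranking in (semantic_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; return the top-N document IDs.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Example: "doc-7" ranks well in both lists, so it tops the fused ranking.
print(rrf_fuse(["doc-7", "doc-2", "doc-9"], ["doc-4", "doc-7", "doc-1"]))
```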
We use a multiplicative reward function that evaluates agents across three dimensions, and models are awarded partial credit:
$$R = R_{\text{answer}} \cdot R_{\text{docs}} \cdot R_{\text{steps}}$$
- $R_{\text{answer}}$: Field-by-field match between predicted and ground-truth answers, yielding a score from $0.0$ to $1.0$.
- $R_{\text{docs}}$: Citation accuracy via precision $\times$ recall—penalizes both missing required documents and including incorrect ones.
- $R_{\text{steps}}$: Step efficiency penalty. The expected step budget is $2 \times$ the number of required documents (one search + one fetch per document). If actual steps $\leq$ budget, the score is $1.0$. Otherwise, the penalty follows a square-root curve:
$$R_{\text{steps}} = \left(\frac{\text{expected steps}}{\text{actual steps}}\right)^{1/2}$$
A task is considered fully solved when $R_{\text{answer}} = 1.0$ and $R_{\text{docs}} = 1.0$—meaning $100\%$ answer accuracy with all correct supporting documents cited. We use multiplicative scoring rather than averaging because all three dimensions must succeed together: a correct answer citing wrong documents is unverifiable, correct citations with a wrong answer are still wrong, and excessive steps correlate with struggling rather than thoroughness (quantified in Analysis).
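The sketch below is a minimal rendering of this scoring scheme under stated assumptions: per-field matches and the cited/required document ID sets are taken as already computed, and the variable names are illustrative rather than drawn from the benchmark code.

```python
import math

def reward(field_matches, cited_docs, required_docs, actual_steps):
    """Multiplicative ESB-style reward sketch: R = R_answer * R_docs * R_steps."""
    # R_answer: fraction of answer fields that match the ground truth.
    r_answer = sum(field_matches) / len(field_matches)

    # R_docs: citation precision x recall over document IDs.
    cited, required = set(cited_docs), set(required_docs)
    hits = len(cited & required)
    precision = hits / len(cited) if cited else 0.0
    recall = hits / len(required) if required else 0.0
    r_docs = precision * recall

    # R_steps: expected budget is 2 steps per required document
    # (one search + one fetch); overshoot decays as a square root.
    expected_steps = 2 * len(required)
    r_steps = 1.0 if actual_steps <= expected_steps else math.sqrt(expected_steps / actual_steps)

    return r_answer * r_docs * r_steps

# Example: 3 of 4 fields correct, one required citation missing,
# and 6 steps taken against a 4-step budget.
print(round(reward([1, 1, 1, 0], ["doc-1"], ["doc-1", "doc-2"], 6), 3))
```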
Runs are capped at $50$ tool calls. If a model doesn't submit an answer by step $50$, the run terminates with $0\%$ reward. For fairness on abstinence-style questions, future iterations will force an answer submission at step $50$.
Deep Research Agent pipeline. The agent iteratively searches a document corpus via embedding similarity, fetches full documents (text or PDF), and submits a structured answer.
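For reference, here is a minimal sketch of that pipeline as a driver loop. The `model` and `tools` arguments are hypothetical stubs standing in for the real LLM client and the MCP tools, and the forced submission at the cap reflects the planned change noted above, not current behavior.

```python
MAX_TOOL_CALLS = 50  # runs that never call answer() by this point score 0%

def run_agent(model, tools, query):
    """Search-fetch-answer loop sketch; `model` and `tools` are hypothetical stubs."""
    history = [{"role": "user", "content": query}]
    for step in range(1, MAX_TOOL_CALLS + 1):
        action = model.next_action(history)            # choose search / fetch / answer
        if action.name == "answer":
            return action.arguments                    # structured answer + citations
        result = tools[action.name](**action.arguments)
        history.append({"role": "tool", "name": action.name, "content": result})
    # Planned for future iterations: force a final answer() at the cap so
    # abstinence-style runs still produce an evaluable response.
    return None  # current behavior: non-submission, scored as 0
```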
Analysis
The chart below breaks down each model's performance across the three scoring dimensions—Correctness ($R_{\text{answer}}$), Citations ($R_{\text{docs}}$), and Efficiency ($R_{\text{steps}}$):
Task-level breakdown:
The strongest predictor of task difficulty is abstinence—the proportion of answer fields where the correct response is null because the data doesn't exist. Tasks with $100\%$ abstinence fields (where every answer should be null) averaged scores of just $6.7\%$. In contrast, tasks with $0\%$ abstinence (and no complex extraction) averaged $76.7\%$. Models consistently substitute adjacent data rather than admitting information is missing. For example, when asked for an architecture team's eSAT score, budget, and headcount, a model correctly extracted team members from the org chart—but then pulled the organization-wide eSAT ($91$) as if it were team-specific, simply because the prompt asked about that team. The ground truth was null; the model returned `{ "architecture_esat_score": 91, "architecture_ytd_budget": "$5,180,200" }`. This abstinence failure is a signal of reward hacking—models are optimizing to provide answers (which are rewarded in training) rather than correctly identifying when information is missing.
Abstinence tasks also reveal divergent search strategies across model families. Claude and GPT models always submit answers ($100\%$), while Gemini models frequently fail to submit—Gemini 3 Pro submits $63\%$ of the time, and Gemini 3 Flash only $50\%$. On abstinence tasks specifically, Gemini 3 Flash submits just $13\%$ of runs, while Gemini 3 Pro submits $40\%$. Search exhaustiveness also differs: on normal tasks, GPT-5.1 averages $3.9$ tool calls; on abstinence tasks, it barely increases to $4.5$. Gemini 3 Flash, by contrast, balloons from $19.9$ to $45.1$ steps on abstinence tasks—$93\%$ of its non-submissions hit the $50$-step cap, exhaustively searching for data that doesn't exist.
When Gemini 3 Flash does submit, it achieves the highest score of any model—$40.4\%$ versus $29.6\%$ for Sonnet 4.5. This suggests Gemini 3 Flash is highly selective: it continues searching until confident, which produces accurate answers on normal tasks but backfires on abstinence tasks where confidence never arrives. Component scores (Correctness, Citations, Efficiency) only average runs where the model submitted an answer—non-submissions are excluded from component averages but count as $0\%$ in overall scores. This explains why Gemini 3 Flash leads in Citations ($84\%$) yet ranks last overall ($20.2\%$). For this reason, our benchmark now includes forced answer submission at the step limit ($50$ steps), ensuring all runs produce evaluable responses even when models cannot find the requested information.
When examining "fully solved" tasks ($100\%$ answer accuracy and $100\%$ citation accuracy, ignoring step penalties), Claude models fully solve tasks more often than other models—Sonnet 4.5 and Opus 4.5 each achieve a $20\%$ full solve rate. Gemini 3 Flash achieves $16.7\%$ (tied with GPT-5.2), while Gemini 3 Pro and GPT-5.1 achieve $10\%$. Unsolved runs take significantly more steps than solved runs: Sonnet 4.5 takes $2.9\times$ more steps on unsolved tasks, while Gemini 3 Pro takes $4.1\times$ more. This confirms that step count correlates with struggling rather than thoroughness.
Excluding abstinence tasks, Gemini models lead—unsurprising given Google's decades of search infrastructure. Google and OpenAI are the only frontier labs to ship production embedding models,11 and Google recently launched a dedicated File Search API12 and a Deep Research API13 with native file search that neither OpenAI's nor Perplexity's research APIs offer. Yet this retrieval dominance inverts on abstinence tasks, where Gemini models rank last.
Anthropic models show the opposite pattern—moderate retrieval but strong calibration, leading to the highest overall scores. This aligns with Anthropic's focus on enterprise deployments, where false positives carry real cost: legal discovery that surfaces irrelevant documents, compliance checks that flag non-issues, support agents that hallucinate policy. Their leading market share in enterprise LLM API adoption14 may reflect exactly this reliability-over-retrieval tradeoff.
OpenAI's results are harder to interpret. GPT-5.1 performs competitively, particularly on abstinence tasks, but GPT-5.2 represents a clear regression—echoing public sentiment at launch that the newer model failed to meaningfully surpass its predecessor.15 16 17 One plausible explanation: if OpenAI optimized GPT-5.2 to perform better on standard benchmarks, this is precisely the reward hacking pattern we defined earlier. Models post-trained to maximize benchmark scores are rewarded for providing answers—likely not for correctly abstaining when information doesn't exist. Poor abstinence performance is a likely symptom of this optimization pressure.
Samples
Documents
Left: Analytics Platform Engineering Organization chart on a slide. Right: Security Monitoring slide with time-series charts.
Sample slides and documents from Z-Force training materials containing synthetic Salesforce platform screenshots.
Left: SupplyLens Supply Chain Intelligence Framework slide. Right: Platform Mastery Learning Plan.
Clones
Functional Salesforce clone used for screenshot generation, showing list views, chat conversations, and case detail panels with realistic enterprise data.
Tasks
Here are sample task prompts across difficulty levels:
| Difficulty | Categories | Prompt |
|---|---|---|
| Easy | Cross-doc, Visual | "Who is the named demo user shown in the Salesforce (Z-Force) platform training screenshots, and what email domain is associated with that user? Confirm the answer by cross-verifying $2$ separate documents." |
| Medium | Abstinence, Cross-doc, Visual | "I need to validate team-specific metrics for a high-stakes platform architecture initiative. First, from the engineering organization chart, identify all individuals shown in pink/magenta boxes and extract their complete names with locations. Then, from the engagement survey results, extract the satisfaction score (eSAT) for the architecture team. Next, from the budget and cost information, determine the year-to-date spending allocated to the Architecture function. Finally, from the team growth information, calculate the year-over-year headcount change for the Architecture team by comparing 2023 to 2022 staffing levels." |
| Hard | Abstinence, Cross-doc, Visual | "Examine the timeline diagram showing the $4$ phases across Q1-Q4 2025 with colored progress bars. Extract the numerical percentage completion value displayed on or above the 'Platform Deployment' bar for Q2 2025. Then, review the dashboard screenshot showing agent information with individual agent names, statuses, and capacity metrics. Identify the full name and location of the employee specifically designated as the primary 'Platform Deployment Specialist' role for HealthSphere General German sites. Next, cross-reference the relevant process documents to find the required escalation email address this specialist must contact when deployment completion falls below $80\%$. Finally, using the training schedule from the available documents, determine the mandatory training completion date this specialist must meet before accessing the production analytics console." |
Infrastructure
We provide a self-contained evaluation environment built on MCP (Model Context Protocol)—an open standard for connecting AI models to external tools and data sources. The environment runs as a Docker container with PostgreSQL and pgvector for semantic search. All documents are synced from external buckets at build time and baked into the image, then pre-loaded at container startup, eliminating runtime dependencies and ensuring identical behavior across container instances. This architecture supports agentic systems through standard MCP tool interfaces, enabling efficient parallel evaluation and RL training across hundreds of instances with fast evaluation cycles.
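As an illustration of the MCP tool surface (not the benchmark's actual server code), the sketch below registers the three tools using the official `mcp` Python SDK's FastMCP helper. The tool bodies and exact signatures are placeholders for the PostgreSQL/pgvector-backed logic described above.

```python
# Sketch: exposing the benchmark's three tools over MCP with the official
# Python SDK (`pip install mcp`). Tool bodies are placeholders; the real
# environment backs them with PostgreSQL + pgvector inside the container.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("esb-environment")

@mcp.tool()
def search(query: str) -> list[dict]:
    """Return up to 5 relevant documents (id, title, text) for the query."""
    return []  # placeholder: hybrid pgvector + BM25 retrieval in the real server

@mcp.tool()
def fetch(document_id: str) -> str:
    """Return the full document; PDFs are returned as base64 for native vision APIs."""
    return ""  # placeholder

@mcp.tool()
def answer(fields: dict, citations: list[dict]) -> str:
    """Submit the structured answer with page-level citations and end the run."""
    return "submitted"  # placeholder

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```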
To work within public API context limits, we use an isolated LLM service that runs separate calls outside the main conversation. When enabled, it processes document search results and fetched PDFs through dedicated calls using provider-specific document APIs (Anthropic, OpenAI, Google). For search results, it analyzes each document and returns structured judgments (relevance scores, summaries, key figures, whether to fetch). For documents, it analyzes PDFs and returns free-form answers to specific queries. These calls are isolated: they don't add to the main conversation history, use focused prompts with the original query and context summary, support reasoning modes, and are tracked separately for monitoring. The agent receives enriched, analyzed results instead of raw document data, improving research quality without bloating the conversation context.
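Here is a rough sketch of one such isolated call using Anthropic's base64 PDF document blocks. The model name, prompt wording, and helper signature are placeholder assumptions rather than the service's actual implementation; only the pattern (a dedicated, out-of-band call whose result is handed back to the agent) follows the description above.

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def analyze_pdf_isolated(pdf_bytes: bytes, question: str, context_summary: str) -> str:
    """Run a dedicated, out-of-band call over a fetched PDF.

    The response text is returned to the main agent; the base64 PDF itself is
    never appended to the main conversation history, keeping context small.
    """
    response = client.messages.create(
        model="claude-sonnet-4-5",          # placeholder model name
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "document",
                 "source": {"type": "base64",
                            "media_type": "application/pdf",
                            "data": base64.b64encode(pdf_bytes).decode()}},
                {"type": "text",
                 "text": f"Context: {context_summary}\n\nAnswer concisely: {question}"},
            ],
        }],
    )
    return response.content[0].text
```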
Future Work
ESB-10 is a starting point—it covers $10$ tasks and five initial categories, but the full ESB will include a wider range of workflows and scenarios. We are expanding to richer task types involving:
- Metadata reasoning: Queries requiring document dates, authors, version history, and folder hierarchy
- Negation: Formulating queries that seek the absence of information or explicitly filter out content—for example, "find all Salesforce screenshots that don’t contain Alex" or "list documents where sales goals are not mentioned".
- Conditional logic chains: Multi-step, branching criteria with if-then dependencies
- Comparative analysis: Contrasting metrics or statuses across multiple dimensions and time periods
For ESB-10, our search tool was intentionally simple: agents could only specify a query and the number of top documents to return. In practice, enterprise search APIs provide far richer capabilities. Enterprise corpora almost always have metadata crucial for effective retrieval—document creation and modification dates, folder structure, authorship, access permissions. Searching by "all files modified this quarter," "all PDFs in a given shared folder," or "the most recent slides from a teammate" enables scenarios that go beyond text-only queries.
For example, OpenAI's Google Drive MCP exposes functions like recent_documents (return most recently modified docs), list_folder (list files in a specific folder), get_profile (return user profile info), and list_drives (enumerate shared drives). With this broader surface area, future ESB releases will include tasks requiring agents to coordinate across metadata fields, file types, time ranges, and nested folder hierarchies.
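To illustrate the direction, here is a purely hypothetical sketch of what a metadata-aware search signature might look like in a future ESB release. None of these parameters exist in the ESB-10 tool, which accepts only a query and the number of documents to return.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical metadata filters for a future search tool; illustrative only.
@dataclass
class SearchFilters:
    modified_after: date | None = None   # e.g. "all files modified this quarter"
    folder: str | None = None            # e.g. "all PDFs in a given shared folder"
    author: str | None = None            # e.g. "the most recent slides from a teammate"
    file_types: tuple[str, ...] = ()     # e.g. ("pdf", "slides")

def search(query: str, top_k: int = 5, filters: SearchFilters | None = None) -> list[dict]:
    """Sketch of a possible future tool signature combining text relevance with metadata."""
    raise NotImplementedError("illustrative signature only")
```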
Conclusion
ESB-10 demonstrates that frontier models struggle with enterprise search tasks—achieving under $30\%$ on realistic workflows that knowledge workers handle routinely. The most significant finding is the strong correlation between abstinence requirements and task difficulty: models consistently fail to recognize when requested information doesn't exist, substituting adjacent data instead of returning null.
These results suggest that improving enterprise research agents requires not just better retrieval or reasoning, but fundamentally better calibration—knowing what you don't know.
We welcome submissions from researchers and teams working on agent frameworks. For more on ESB-10, the full ESB, or evaluation access, contact founders@aviro.ai.
References
1. Benchmarking Deep Search over Heterogeneous Enterprise Data (arxiv.org)
2. Salesforce/HERB · Datasets at Hugging Face (huggingface.co)
3. Benchmarking Deep Search over Heterogeneous Enterprise Data (arxiv.org)
4. WixQA: A Multi-Dataset Benchmark for Enterprise Retrieval-Augmented Generation (arxiv.org)
5. Wix/WixQA · Datasets at Hugging Face (huggingface.co)
6. Introducing OfficeQA: A Benchmark for End-to-End Grounded Reasoning | Databricks Blog (databricks.com)
7. GitHub - databricks/officeqa: Repository for getting started with the OfficeQA Benchmark (github.com)
8. MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems, ACL 2025 (research.ibm.com)
9. MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems (arxiv.org)
10. Unanswerability Evaluation for Retrieval Augmented Generation (arxiv.org)
11. Introducing EmbeddingGemma: The Best-in-Class Open Model for On-Device Embeddings - Google Developers Blog (developers.googleblog.com)
12. Introducing the File Search Tool in Gemini API (blog.google)
13. Build with Gemini Deep Research (blog.google)
14. 2025: The State of Generative AI in the Enterprise | Menlo Ventures (menlovc.com)
15. GPT-5.2, Why it matters (youtube.com)
16. GPT-5.2 and Meaningless Benchmarks - by Maria Sukhareva (msukhareva.substack.com)
17. I Tested GPT-5.2 and It's Just Bad (medium.com)
18. Deep Research (platform.openai.com)