Dec 11, 2025

Enterprise Search Benchmark

By Aviro Research

Introduction

The Enterprise Search Benchmark (ESB) is the first benchmark designed to evaluate deep research agents on enterprise search scenarios. Unlike web-search benchmarks that test open-domain retrieval, ESB targets the proprietary document corpora, multi-document reasoning, and visual understanding required in real enterprise workflows.

This initial release, ESB-10, introduces 10 tasks encompassing realistic enterprise workflows, cross-document dependencies, and demanding data complexity—including visual content (screenshots, diagrams, charts), large tables, and multi-step reasoning.

We evaluated six frontier models—Claude Opus 4.5, Claude Sonnet 4.5, Claude Sonnet 4, GPT-5.1, GPT-5.2, and Gemini 3 Pro—each run 3 times per task to account for output variance, for a total of 180 runs. Even the top-performing model (Claude Sonnet 4.5) achieves only $29.6%$ across tasks, and only 2 of 10 tasks were ever fully solved by any model.

Explore the results below—click on the different categories to view performance scores by group:

Or view all category scores at a glance:

ESB-10 is also the first benchmark to combine visual reasoning, messy internal corpora, and strict abstinence requirements—testing agents on real-world enterprise challenges like extracting info from screenshots, navigating unfinished docs, and recognizing when data doesn’t exist. In contrast, other benchmarks each only focus on one or two areas. The table below compares ESB-10 to other enterprise RAG and reasoning benchmarks:

Benchmark	Primary Focus	Visual / PDF-native	Messy internal docs	Closed corpus	Abstinence focus	Agentic vs RAG	Multi-hop reasoning
HERB (Salesforce, 2025) ¹ ²	Deep search over enterprise artifacts (Slack, docs, PRs, meetings)	No. Text and structured data only; no PDF-native or visual tasks.	Synthetic workflows. Text/logs/JSON only; no UI screenshots or WIP slides.	Yes. Closed synthetic enterprise corpus.	Yes. Includes unanswerable queries at question level.	Agentic. Multi-tool agent search.	Yes. Cross-artifact reasoning.
WixQA (Wix, 2025) ⁴ ⁵	Support QA over help center KB	No. Text-only articles and QA pairs.	No. Polished KB docs only.	Yes. KB snapshot only.	No. Answerable support questions.	Static RAG. Retrieval + generation pipelines.	Partial. Short multi-doc support answers.
OfficeQA (Databricks, 2025) ⁶ ⁷	Grounded reasoning on Treasury PDFs	Medium. PDFs with tables/charts, but uses internal parser instead of native PDF APIs.	No. Government bulletins, not internal enterprise content.	Yes. Treasury Bulletin corpus.	Limited. Focus on numeric correctness, not abstinence.	Agentic. Agents with toolchains navigating PDFs.	Yes. Multi-bulletin computations and time-series analysis.
MTRAG (IBM, 2025) ⁸ ⁹	Multi-turn conversational RAG	No. Text corpora only.	No. Curated QA/domain texts.	Yes. Fixed document sets per domain.	Some. Includes unanswerable questions.	RAG pipeline. Multi-turn RAG, not tool-rich agents.	Yes. Context-dependent multi-turn reasoning.
UAEval4RAG (Salesforce, 2024) ¹⁰	Unanswerable query synthesis framework	No. Text KBs and existing QA benchmarks.	No. Standard text datasets.	Yes. Closed-KB RAG framework.	Core focus. Taxonomy of unanswerability types.	Static RAG. Retriever/reranker/LLM combinations.	No. Single-turn KB queries.
ESB-10 (Aviro)	Deep research agents over synthetic enterprise corpus	Yes. Native PDF APIs with screenshots, charts, dashboards, org charts, and tables.	Yes. Mixes polished decks with WIP docs, training materials, and Salesforce screenshots.	Yes. Zyro enterprise corpus only.	Yes, and hard. Explicit null-field requirements; strong difficulty correlation.	Agentic. `search()`, `fetch()`, `answer()` with citations and trajectory scoring.	Yes. Cross-modal 2-4 doc tasks with business calculations.

Key Findings

ESB-10 establishes a difficult baseline for enterprise research agents. Our evaluation reveals several critical failure modes that teams building research agents should address:

Premature termination: Agents answer too early with incomplete information, missing critical data scattered across documents.
Tool degradation: Tool-call errors (ID typos, missed calls) increase with task complexity—agents lose track of document references mid-task.
Failure to abstain: Agents cannot admit when information is missing; instead, they hallucinate plausible answers from adjacent data.
Incorrect citations: Wrong or mixed-up document references undermine verifiability, even when answers are correct.
Multi-hop breakdown: Early reasoning errors cascade across steps, compounding inaccuracies in later conclusions.
Visual limitations: Agents struggle with screenshots, charts, and UI elements—often missing information that isn't in plain text.
Excessive tool use: Some runs exceed 100 tool calls without solving the task, indicating thrashing rather than progress.

These findings indicate that future research must focus on enabling agents to traverse large and unfamiliar document corpora, perform precise multi-step reasoning spanning several documents, maintain accurate citations alongside correct answers, and recognize when requested information simply does not exist.

Benchmark

ESB-10 evaluates agents on enterprise search: given a query, the agent must produce a structured answer by iteratively searching and fetching documents from a corpus.

Tasks

We selected 10 tasks reflecting real-world enterprise queries from internal surveys of knowledge workers across finance, consulting, and technology firms. We specifically chose categories that existing benchmarks fail to test—most current deep-research evals focus on web search, missing the multi-document reasoning, visual understanding, and domain-specific logic required in enterprise workflows. The selected tasks span five main categories:

Bulk Extraction: Extract many similar items (20+ cases, knowledge articles) from screenshots with consistent field mapping across entries.
Abstinence: Recognize when related data exists but doesn't actually answer the question—return null instead of hallucinating plausible values.
Cross-Document Synthesis: Combine information across 2–4 documents, following implicit connections and verifying consistency between sources.
Visual Understanding: Interpret data from charts, dashboards, org charts, and platform screenshots where OCR/text extraction is insufficient.
Complex Calculation: Multi-step arithmetic (12+ chained formulas) using data from multiple sources with domain-specific business rules.

Each task may require multiple categories. The hardest tasks combine 3+ categories, particularly those requiring abstinence.

Corpus

We built a platform that automatically generates large volumes of synthetic enterprise content—documents, spreadsheets, org charts, system screenshots, dashboards, and more. For example, one component is Z-Force, a complete clone of Salesforce that we use to generate realistic platform screenshots, scenarios, and workflows. To ensure quality and realism, we worked with domain experts in healthcare, finance, and retail to guide task and document creation.

ESB-10 is centered on Zyro, a synthetic enterprise analytics company serving healthcare (clinical trials, patient outcomes, regulatory compliance), finance (risk analytics, fraud detection, regulatory reporting), and retail (supply chain intelligence, consumer insights, brand analytics). We curated a set of 21 documents (287 pages total) spanning multiple formats. Each task draws on 1–4 documents (2.3 on average), with a representative split of textual (65.7%) and visual (34.3%) evidence. The corpus includes both polished corporate presentations and rough, work-in-progress documentation with handwritten-style annotations, ensuring agents must handle realistic formatting variation. See Samples below.

A natural question is whether a single synthetic organization limits generalization. We designed ESB tasks to test general retrieval and reasoning capabilities—not proprietary business knowledge. The tasks require multi-document synthesis, visual understanding, and knowing when to abstain, skills that transfer across any enterprise corpus. A frontier model shouldn't need to "know" Zyro to find a team's headcount in an org chart or recognize when a requested metric doesn't exist. That said, we are building multiple organization environments beyond Zyro spanning different industries and document styles. ESB will offer diverse corpora to prevent overfitting to any single corpus structure—important for teams training agents, not just evaluating them.

Methodology

The agent follows a three-step workflow—searching for relevant documents, fetching full documents for detailed analysis, and submitting an answer with citations. Its tools are as follows:

$\text{search}(q)$ — Searches across all documents in the corpus to find the most relevant ones. The agent receives up to 5 documents ranked by relevance, each with its ID, title, and full text content—including natural-language descriptions of visual elements like charts, tables, screenshots, and diagrams. This gives the agent an overview of what's available.

The search uses a hybrid of semantic similarity (cosine distance on text-embedding-3-small embeddings) and BM25 keyword matching, combined via Reciprocal Rank Fusion. Corpus documents have embedding vectors precomputed and stored in pgvector; at runtime, the query is encoded and matched against these stored vectors.
$\text{fetch}(d_{id})$ — Retrieves the complete original document for detailed analysis. For PDFs, this provides the full visual document—enabling the agent to see charts, dashboards, org charts, and screenshots directly, which is essential when text descriptions alone are insufficient. PDFs are sent as base64 to the model's native PDF input API for direct visual understanding.
$\text{answer}(\cdot)$ — Submits the final structured response with page-level citations. Each citation must specify the document ID, page numbers, and evidence type (visual or textual) for every claim.

We use a multiplicative reward function that evaluates agents across three dimensions:

$$R = R_{\text{answer}} \cdot R_{\text{docs}} \cdot R_{\text{steps}}$$

$R_{\text{answer}}$: Field-by-field JSON match between predicted and ground truth answers, yielding a score from 0.0 to 1.0.
$R_{\text{docs}}$: Citation accuracy via precision × recall—penalizes both missing required documents and including incorrect ones.
$R_{\text{steps}}$: Step efficiency penalty. The expected step budget is $2 \times$ the number of required documents (one search + one fetch per document). If actual steps $\leq$ budget, the score is 1.0. Otherwise, the penalty follows a square root curve:

$$R_{\text{steps}} = \sqrt{\frac{\text{expected steps}}{\text{actual steps}}}$$

A task is considered fully solved when $R_{\text{answer}} = 1.0$ and $R_{\text{docs}} = 1.0$—meaning 100% answer accuracy with all correct supporting documents cited. We use multiplicative scoring rather than averaging because all three dimensions must succeed together: a correct answer citing wrong documents is unverifiable, correct citations with a wrong answer is still wrong, and excessive steps correlate with struggling rather than thoroughness (quantified in Analysis).

Deep Research Agent pipeline. The agent iteratively searches a document corpus via embedding similarity, fetches full documents (text or PDF), and submits a structured answer.

Analysis

The chart below breaks down each model's performance across the three scoring dimensions:

Task-level breakdown:

We also found the strongest predictor of task difficulty is abstinence—the proportion of answer fields where the correct response is null because the data doesn't exist. Tasks with 100% abstinence fields (where every answer should be null) averaged just 7.8% scores. In contrast, tasks with 0% abstinence (and no complex extraction) averaged 77.0%. Models consistently substitute adjacent data rather than admitting information is missing.

For example, when asked for an architecture team's eSAT score, budget, and headcount, a model correctly extracted team members from the org chart—but then pulled the organization-wide eSAT (91) as if it were team-specific, simply because the prompt asked about that team. The ground truth was null; the model returned { "architecture_esat_score": 91, "architecture_ytd_budget": "$5,180,200" }.

When examining "fully solved" tasks (100% answer accuracy and 100% citation accuracy, ignoring step penalties), Claude models fully solve tasks more often than GPT or Gemini—Sonnet 4.5 and Opus 4.5 each achieve a 20% full solve rate. Unsolved runs take significantly more steps than solved runs: Sonnet 4.5 takes 2.9× more steps on unsolved tasks, while Gemini takes 4.1× more. This confirms that step count correlates with struggling rather than thoroughness.

GPT-5.2's underwhelming performance relative to GPT-5.1 is particularly noteworthy. Whereas GPT-5.1 delivers strong, competitive results on enterprise search tasks, GPT-5.2 noticeably lags behind, representing a regression. This observation mirrors the broader community feedback at launch—many users expressed disappointment that GPT-5.2 did not meet expectations or meaningfully surpass its predecessor.¹² Additionally, Anthropic models' strong results on this benchmark further reflect their leading market share in enterprise LLM API adoption.¹¹

Samples

Documents

Sample document pages showing diagrams and charts

Left: Analytics Platform Engineering Organization chart on a slide. Right: Security Monitoring slide with time-series charts.

Sample slides and documents with Salesforce screenshots

Sample slides and documents from Z-Force training materials containing synthetic Salesforce platform screenshots.

Sample documents showing complex tabular data and learning plans

Left: SupplyLens Supply Chain Intelligence Framework slide. Right: Platform Mastery Learning Plan.

Tasks

Here are sample task prompts across difficulty levels:

Difficulty	Categories	Prompt
Easy	Cross-doc, Visual	"Who is the named demo user shown in the Salesforce (Z-Force) platform training screenshots, and what email domain is associated with that user? Confirm the answer by cross-verifying two separate documents."
Medium	Abstinence, Cross-doc, Visual	"I need to validate team-specific metrics for a high-stakes platform architecture initiative. First, from the engineering organization chart, identify all individuals shown in pink/magenta boxes and extract their complete names with locations. Then, from the engagement survey results, extract the satisfaction score (eSAT) for the architecture team. Next, from the budget and cost information, determine the year-to-date spending allocated to the Architecture function. Finally, from the team growth information, calculate the year-over-year headcount change for the Architecture team by comparing 2023 to 2022 staffing levels."
Hard	Abstinence, Cross-doc, Visual	"Examine the timeline diagram showing the four phases across Q1-Q4 2025 with colored progress bars. Extract the numerical percentage completion value displayed on or above the 'Platform Deployment' bar for Q2 2025. Then, review the dashboard screenshot showing agent information with individual agent names, statuses, and capacity metrics. Identify the full name and location of the employee specifically designated as the primary 'Platform Deployment Specialist' role for HealthSphere General German sites. Next, cross-reference the relevant process documents to find the required escalation email address this specialist must contact when deployment completion falls below 80%. Finally, using the training schedule from the available documents, determine the mandatory training completion date this specialist must meet before accessing the production analytics console."

Infrastructure

We provide a self-contained evaluation environment built on MCP (Model Context Protocol)—an open standard for connecting AI models to external tools and data sources. The environment enables efficient parallel evaluation and RL training.

The environment runs as a Docker container with PostgreSQL and pgvector for semantic search. All documents are synced from external buckets at build time and baked into the image, eliminating runtime dependencies and ensuring identical behavior across container instances. This architecture supports agentic systems through standard MCP tool interfaces, enabling parallel evaluation across hundreds of instances. Documents are pre-loaded at container startup, providing fast evaluation cycles without external API calls.

To work within public API context limits, we implement content redaction. The agent tracks which tool results it has already seen via a unique ID. The first time a result is returned, it's sent in full. On subsequent API calls (when the conversation history is resent), previously-seen results are redacted:

Search results: Full document content is replaced with [redacted], but document IDs and titles are preserved—allowing the agent to reference documents by name without re-consuming context.
Fetch results: The base64-encoded PDF is stripped entirely and replaced with a placeholder.

This forces the agent to extract and retain key information on first viewing, since it won't have access to the raw content later. The redaction can be toggled off for teams training locally without API context limits.

We found that explicitly instructing agents to maintain a "cheat sheet"—a running summary of key facts, numbers, and relationships extracted from documents—significantly improved performance on long-horizon tasks.

Future Work

ESB-10 is a starting point—it covers 10 tasks and five initial categories, but the full ESB will include a wider range of workflows and scenarios. We are expanding to richer task types involving:

Metadata reasoning: Queries requiring document dates, authors, version history, and folder hierarchy
Negation: Formulating queries that seek the absence of information or explicitly filter out content—for example, "find all Salesforce screenshots that don’t contain Alex" or "list documents where sales goals are not mentioned".
Conditional logic chains: Multi-step, branching criteria with if-then dependencies
Comparative analysis: Contrasting metrics or statuses across multiple dimensions and time periods

For ESB-10, our search tool was intentionally simple: agents could only specify a query and the number of top documents to return. In practice, enterprise search APIs provide far richer capabilities. Enterprise corpora almost always have metadata crucial for effective retrieval—document creation and modification dates, folder structure, authorship, access permissions. Searching by "all files modified this quarter," "all PDFs in a given shared folder," or "the most recent slides from a teammate" enables scenarios that go beyond text-only queries.

For example, OpenAI's Google Drive MCP exposes functions like recent_documents (return most recently modified docs), list_folder (list files in a specific folder), get_profile (return user profile info), and list_drives (enumerate shared drives). With this broader surface area, future ESB releases will include tasks requiring agents to coordinate across metadata fields, file types, time ranges, and nested folder hierarchies.

Conclusion

ESB-10 demonstrates that frontier models struggle with enterprise search tasks—achieving under 30% on realistic workflows that knowledge workers handle routinely. The most significant finding is the strong correlation between abstinence requirements and task difficulty: models consistently fail to recognize when requested information doesn't exist, substituting adjacent data instead of returning null.

These results suggest that improving enterprise research agents requires not just better retrieval or reasoning, but fundamentally better calibration—knowing what you don't know.

We welcome submissions from researchers and teams working on agent frameworks. For more on ESB-10, the full ESB, or evaluation access, contact founders@aviro.ai.

References

1.

Benchmarking Deep Search over Heterogeneous Enterprise Data

arxiv.org
2.

Salesforce/HERB · Datasets at Hugging Face

huggingface.co
3.

Benchmarking Deep Search over Heterogeneous Enterprise Data

arxiv.org
4.

WixQA: A Multi-Dataset Benchmark for Enterprise Retrieval-Augmented Generation

arxiv.org
5.

Wix/WixQA · Datasets at Hugging Face

huggingface.co
6.

Introducing OfficeQA: A Benchmark for End-to-End Grounded Reasoning | Databricks Blog

databricks.com
7.

GitHub - databricks/officeqa: Repository for getting started with the OfficeQA Benchmark.

github.com
8.

MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems for ACL 2025

research.ibm.com
9.

MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems

arxiv.org
10.

Unanswerability Evaluation for Retrieval Augmented Generation

arxiv.org
11.

2025: The State of Generative AI in the Enterprise | Menlo Ventures

menlovc.com
12.

OpenAI GPT-5.2: What this means

youtube.com