On Search Tasks

Sep 8, 2025

By Aviro Research

We evaluated OpenAI Deep Research alongside our in-house deep search agent powered by Cortex (a lightweight, state-aware workflow model) in two demanding settings: Microsoft LiveDRBench and Aviro's Enterprise Search benchmark. Across both, Cortex consistently improved research-task success rates and retrieval grounding while keeping operational overhead low.

Before diving into results, we address common objections to SOTA performance claims. We did not train on LiveDRBench or Aviro evaluation items: the few-shot corpus used to bootstrap Cortex explicitly excluded all evaluation tasks and close variants, and the Cortex workflow model was frozen before evaluation. We enforced budget parity across systems (same model, temperature, max steps/tool calls, and token caps), stopping runs at identical ceilings. In practice, Cortex reduced redundant exploration and often converged earlier rather than consuming more tokens. Aside from Cortex's state-aware workflow model, all agents used the same system and user prompts, and where stochasticity mattered we ran multiple seeds and reported aggregate metrics.
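
To make the budget-parity setup concrete, here is a minimal sketch of how identical ceilings might be enforced across agents. The RunBudget fields and their default values are illustrative, not our actual configuration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunBudget:
    """One shared budget applied identically to every agent under test (illustrative values)."""
    model: str = "same-backbone-model"   # identical backbone model for all systems
    temperature: float = 0.2             # identical sampling temperature
    max_steps: int = 30                  # hard ceiling on agent steps
    max_tool_calls: int = 60             # hard ceiling on tool invocations
    max_tokens: int = 200_000            # hard ceiling on total tokens per run

def should_continue(steps: int, tool_calls: int, tokens: int, budget: RunBudget) -> bool:
    """A run stops the moment any ceiling is reached, regardless of which agent is running."""
    return (
        steps < budget.max_steps
        and tool_calls < budget.max_tool_calls
        and tokens < budget.max_tokens
    )
```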

Deep Research

We began with LangChain's Open Deep Research [1] as our baseline, intentionally using it in its most minimal, unenhanced form to establish a clear point of comparison. Cortex was then introduced not as a replacement but as a distinct layer operating on top of this baseline agent. In other words, Cortex augments the original Open Deep Research agent with a state-aware workflow model that guides and structures the agent's actions without altering the underlying retrieval or reasoning stack. This setup ensured that any observed improvements could be attributed directly to Cortex's workflow intelligence rather than to changes in the core agent. To train Cortex, we relied on a small, carefully curated set of few-shot examples, entirely separate from the evaluation tasks, designed to encode reusable strategies such as planning, branching, recovery, and convergence. The emphasis was on the depth and applicability of each example rather than sheer volume, so that Cortex could inject meaningful workflow knowledge atop the otherwise barebones research agent.
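
As a rough illustration of this layering, the sketch below wraps a baseline agent with a workflow model that proposes the next move at each step. The WorkflowLayer class, the workflow-model interface (initial_state, suggest_next_action, update, converged), and the agent's step/final_answer methods are hypothetical names chosen for exposition; they are not the Open Deep Research API.

```python
class WorkflowLayer:
    """Illustrative wrapper: a state-aware workflow model advising a baseline research agent.

    The baseline agent's retrieval and reasoning stack is untouched; the layer only
    proposes the next workflow move (plan, branch, recover, converge) at each step.
    """

    def __init__(self, baseline_agent, workflow_model):
        self.agent = baseline_agent      # e.g. an Open Deep Research-style agent (interface assumed)
        self.workflow = workflow_model   # frozen before evaluation

    def run(self, task: str, max_steps: int = 30) -> str:
        state = self.workflow.initial_state(task)
        for _ in range(max_steps):
            move = self.workflow.suggest_next_action(state)     # plan / branch / recover / converge
            observation = self.agent.step(task, guidance=move)  # baseline stack does the actual work
            state = self.workflow.update(state, move, observation)
            if self.workflow.converged(state):
                break
        return self.agent.final_answer(task)
```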

For deep research evaluation, LiveDRBench stresses tasks where an agent must compose a workflow from scratch, adapt it over time, and reuse prior experience. These GAIA-style multi-hop tasks are well suited to evaluating Cortex's ability to encode and reuse workflow-level knowledge. We optimized for research-task execution rather than the long-form report polish emphasized by benchmarks such as Deep Research Bench. Critically, we did not preload workflows for the deep research tasks; Cortex built task-specific workflows dynamically. This resulted in higher task success on problems requiring iterative workflow refinement, and more stable convergence when sources were noisy or partially contradictory.

Enterprise Search

For enterprise search, we evaluated 99 tasks across three curated environments (Finance, Consulting, and Accounting, 33 each), using internal corpora that included knowledge bases, code documentation, legal/compliance content, SOPs, and operations/support materials. We created three representative companies with rich internal repositories to mirror realistic enterprise conditions. OpenAI Deep Research used MCP for broad, high-recall exploration, while our Cortex-backed agent connected directly to sources and consulted its workflow model throughout the run, incorporating workflow learnings as environments evolved. Cortex was supplied with the hypothetical enterprises' SOPs beforehand.
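
For orientation, the evaluation suite can be pictured roughly as follows. The environment names and task counts match the setup above; the per-environment corpus labels are illustrative assumptions, not the actual benchmark schema.

```python
# Rough shape of the enterprise evaluation suite (corpus labels are assumptions).
ENTERPRISE_SUITE = {
    "Finance":    {"tasks": 33, "corpora": ["knowledge_base", "sops", "legal_compliance"]},
    "Consulting": {"tasks": 33, "corpora": ["knowledge_base", "code_docs", "operations"]},
    "Accounting": {"tasks": 33, "corpora": ["knowledge_base", "sops", "support_materials"]},
}

# Sanity check: 3 environments x 33 tasks = 99 tasks total.
assert sum(env["tasks"] for env in ENTERPRISE_SUITE.values()) == 99
```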

With these SOPs at hand, Cortex would naturally be expected to outpace OpenAI Deep Research on enterprise search tasks; the purpose of the experiment was to measure how large that gap is against a SOTA system. Beyond the SOPs, we preloaded only enterprise-generic workflow primitives (plan, branch, recover, converge), with no task answers or any other material from the dataset. The result was better retrieval grounding in enterprise settings, fewer unnecessary tool calls, and more efficient navigation of complex enterprise knowledge structures.
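
As a concrete reading of "enterprise-generic workflow primitives," one might represent them as a small fixed vocabulary of moves. The enum below and its comments are ours, for exposition only.

```python
from enum import Enum

class WorkflowPrimitive(Enum):
    """Generic moves preloaded into the workflow model; no task answers or dataset content."""
    PLAN = "plan"          # decompose the request, e.g. against known SOP structure
    BRANCH = "branch"      # fork exploration across candidate sources or sub-questions
    RECOVER = "recover"    # back off after a dead end or a contradictory source
    CONVERGE = "converge"  # consolidate evidence and stop exploring
```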

To read more about the environments we created, see our environments essay.

Launch Query

In our launch evaluation, we posed a realistic services-discovery question about Palo Alto Networks, using the web interface of each agent and reporting pass@3 as our main metric. The task was challenging but well within the expected capabilities of a strong research agent. At pass@1, agents identified 25 of the 30 services; by pass@3, all services were eventually found. The main failure mode was that agents quickly reached the A-Z products page (a straightforward find) and then became stuck, circling that page without progressing to the actual services list. Cybersecurity companies typically divide their offerings into solutions, products, and services, three distinct categories, and the deep research agents repeatedly confused products with services, missing the nuance that would have guided them to the correct answer. Even in a generic enterprise context, an agent equipped with a layer of workflow or domain knowledge would have recognized the need to look beyond the products page and seek out the services section.
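
One way to read the pass@k numbers above is as coverage over the union of the first k attempts (25/30 services after one attempt, 30/30 after three). The helper below is a minimal sketch under that reading, not our actual evaluation harness.

```python
def coverage_at_k(attempts: list[set[str]], expected: set[str], k: int) -> float:
    """Fraction of expected items found by the union of the first k attempts."""
    found: set[str] = set().union(*attempts[:k])
    return len(found & expected) / len(expected)

# Hypothetical usage mirroring the launch-query numbers:
# coverage_at_k(runs, services, k=1) -> 25/30, coverage_at_k(runs, services, k=3) -> 1.0
```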

If you're interested in learning more about our choice of evaluation questions, the reasoning behind their validity, or the common pitfalls we observed in deep research tasks, we welcome you to book time with us for a deeper discussion. You can schedule a meeting here.

For a visual demonstration of our approach, see our video below:

[Video: Cortex by Aviro]

Conclusion

Across both enterprise and deep research environments, Cortex’s integration has reliably improved task success, retrieval grounding, and convergence, all while keeping the system lightweight and practical. These results highlight the impact of equipping agents with reusable workflow strategies that can be applied flexibly as new challenges arise.

Looking ahead, we are developing a new series of rich enterprise-focused puzzles and environments—more intricate, layered tasks that reflect the real complexities of organizational work. By expanding our evaluation to include these richer scenarios, we aim to test how well agents can leverage accumulated workflow know-how to navigate and solve the kinds of multi-step, nuanced problems that define enterprise knowledge work. This ongoing effort will help us understand how such capabilities enable agents to adapt, generalize, and excel as the demands of their environment become ever more sophisticated.

References

  1. LangChain, Open Deep Research, GitHub repository: https://github.com/langchain-ai/open_deep_research