Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Bowei Xia, Tao Sun, Ziyu Ye, Zhaoxuan Jin, Yingru Li, Zeyu Zhang, Yifeng Wang, Qianshuo Ye, Ping Luo, Guohao Li

Project Repo: https://github.com/camel-ai/owl


![GAIA benchmark results](professional_gaia_benchmark.png)


🤝🏻 Citation

Please cite our report if you find it helpful for your research.

@misc{owl2025,
  title={OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation},
  author={Mengkang Hu and Yuhang Zhou and Wendong Fan and Yuzhou Nie and Bowei Xia and Tao Sun and Ziyu Ye and Zhaoxuan Jin and Yingru Li and Zeyu Zhang and Yifeng Wang and Qianshuo Ye and Ping Luo and Guohao Li},
  url={https://github.com/camel-ai/owl},
  year={2025}
}

💡 Introduction

Large Language Models (LLMs) have undergone a period of rapid advancement, evolving from simple text predictors into powerful general-purpose reasoners. Modern LLMs are increasingly used as the brains of autonomous agents, capable of planning, tool use, and multi-step reasoning to complete complex tasks in the wild. Recently, there has been a shift from demonstrating such agents in controlled sandbox settings to deploying them in realistic open-world environments. For example, while LLMs often surpass human experts on academic benchmarks, they struggle with open-ended tasks that require active information gathering and tool use: GPT-4 with basic plugins achieved only about 15% on the GAIA real-world benchmark, compared to human performance of 92%. This stark gap highlights the need for agent architectures that can robustly handle unconstrained, dynamic problems beyond the training distribution.

Emerging LLM-based agent systems demonstrate both the potential and the limitations of current approaches. OpenAI’s Deep Research agent, for instance, uses a single advanced model to autonomously search, analyze, and synthesize information from the web. It follows a single-agent paradigm in which one LLM handles all aspects of planning and acting. This monolithic design showcases impressive capabilities, but relying on one generalist model makes it difficult to coordinate complex subtasks or integrate specialized expertise.

In this work, we present Workforce, a general-purpose hierarchical multi-agent framework powered by LLMs. Workforce is designed to address the aforementioned limitations by organizing a team of specialized agents into a coordinated hierarchy. At the top level, a Coordinator agent interprets the user’s request and oversees the overall problem-solving process. The Coordinator delegates planning to a dedicated Planner agent, which breaks down the goal into a structured sequence of subtasks or queries. These subtasks are then assigned to a pool of Worker agents, each expert in a particular domain or tool (e.g. web search, code execution, data analysis). All agents communicate via natural-language messages, enabling dynamic collaboration and error-correction. This design is inspired by human organizations: the Coordinator (manager) orchestrates multiple domain specialists, allowing complex tasks to be tackled through division of labor and iterative feedback. Notably, Workforce is built for realistic, tool-augmented environments – workers can invoke external APIs, search engines, or other tools as needed, and the framework naturally handles multi-modal information by delegating to appropriate specialist agents.
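The Coordinator–Planner–Worker hierarchy described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual camel-ai/owl API: all class names, the prefix-based routing, and the fixed three-step plan are hypothetical stand-ins for LLM-driven components.

```python
# Minimal sketch of the Workforce hierarchy: a Coordinator delegates
# planning to a Planner and routes each subtask to a specialist Worker.
# All names and logic here are illustrative placeholders, not the real API.
from dataclasses import dataclass


@dataclass
class Worker:
    """A specialist agent for one domain (e.g. web search, code execution)."""
    name: str
    domain: str

    def solve(self, subtask: str) -> str:
        # A real worker would invoke an LLM plus external tools here.
        return f"[{self.name}] result for: {subtask}"


@dataclass
class Planner:
    """Decomposes a goal into an ordered list of subtasks."""

    def plan(self, goal: str) -> list[str]:
        # Placeholder decomposition; the real planner queries an LLM.
        return [f"search: {goal}", f"analyze: {goal}", f"summarize: {goal}"]


@dataclass
class Coordinator:
    """Interprets the request, delegates planning, assigns subtasks to workers."""
    planner: Planner
    workers: dict[str, Worker]  # domain -> worker

    def run(self, goal: str) -> list[str]:
        results = []
        for subtask in self.planner.plan(goal):
            domain = subtask.split(":", 1)[0]  # crude routing by subtask prefix
            worker = self.workers[domain]
            results.append(worker.solve(subtask))
        return results


coordinator = Coordinator(
    planner=Planner(),
    workers={
        "search": Worker("searcher", "search"),
        "analyze": Worker("analyst", "analyze"),
        "summarize": Worker("writer", "summarize"),
    },
)
results = coordinator.run("example user request")
```

In the real system each agent is an LLM with its own prompt and tool set, subtasks flow as natural-language messages rather than strings, and the Coordinator can re-plan on worker failure; the sketch only shows the division of labor.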

We evaluate Workforce on the GAIA benchmark – a challenging suite of real-world tasks requiring reasoning, web browsing, multi-modal understanding, and tool use. Workforce achieves state-of-the-art results among open-source systems, with an average success rate of 69.70% on GAIA. This substantially outperforms prior open-sourced agent frameworks, and even surpasses strong baselines like OpenAI’s Deep Research in overall score.

All code and prompts for Workforce are open-sourced in the project repository linked above.

📖 Methodology

System Architecture

Overall

We propose the Workforce system, which implements a hierarchical multi-agent architecture designed for collaborative task solving. The system consists of three main components: