Zoom AI sets new state-of-the-art benchmark on Humanity's Last Exam

Federated innovation driving breakthrough results in complex AI testing

Published on December 10, 2025

In this blog

01 Understanding the HLE challenge
02 Standing on the shoulders of giants
03 The evolution of our federated AI approach
04 The winning strategy: Federated excellence
05 Benchmark results
06 From AIC 1.0 to AIC 3.0: A journey of innovation
07 Real-world impact: Solving tomorrow's challenges today
08 A collaborative future

Xuedong Huang

Chief Technology Officer

Xuedong Huang is the Chief Technology Officer (CTO) of Zoom. Before joining Zoom, he was at Microsoft, where he served as Azure AI CTO and Technical Fellow. He has had a distinguished career in AI: he started Microsoft’s speech technology group in 1993 and led Microsoft’s AI teams to several of the industry’s first human parity milestones in speech recognition, machine translation, natural language understanding, and computer vision. He is an IEEE and ACM Fellow and an elected member of the National Academy of Engineering and the American Academy of Arts and Sciences.

Xuedong received his Ph.D. in EE from the University of Edinburgh in 1989 (sponsored by the British ORS and Edinburgh University Scholarship), his MS in CS from Tsinghua University in 1984, and BS in CS from Hunan University in 1982.

As CTO of Zoom, I'm excited to share a significant milestone in our AI journey. Today, we're announcing that Zoom has achieved a new state-of-the-art (SOTA) result on the challenging Humanity's Last Exam (HLE) full-set benchmark, scoring 48.1%. That is 2.3 percentage points higher than the previous SOTA result of 45.8%, set by Google Gemini 3 Pro with tool integration.

This breakthrough represents more than just a number. It embodies our evolution from ZoomMate 1.0 to the upcoming ZoomMate 3.0, and it demonstrates how thoughtful collaboration with industry leaders can drive innovation that benefits everyone.

The Humanity's Last Exam (HLE) benchmark represents one of AI's most rigorous tests, designed to evaluate models across diverse domains requiring expert-level knowledge and sophisticated reasoning. Unlike simpler benchmarks that may rely on pattern matching, HLE demands genuine understanding, multi-step reasoning, and the ability to synthesize information across complex, interconnected problems.

This benchmark was developed by subject-matter experts around the world and has become a key measure of AI's progress toward human-level performance on challenging intellectual tasks. Our 48.1% score places Zoom's federated AI approach at the forefront of this competitive landscape.

Our success is built on the remarkable foundation laid by the AI research community. We deeply admire the groundbreaking work from OpenAI, whose GPT models have redefined what's possible in natural language understanding and generation. Google's Gemini 3 Pro has pushed the boundaries of multimodal AI, while Anthropic's Claude Opus 4.5 has advanced our understanding of agentic capabilities.

Rather than viewing these advances as competition, we see them as opportunities for collaboration and mutual enhancement. The future of AI lies not in isolation, but in intelligent orchestration.

From our early ZoomMate 1.0 days, we recognized that no single model, no matter how advanced, could excel at every task. This insight led us to develop our federated AI approach, a sophisticated system that leverages the unique strengths of multiple models while introducing novel architectural innovations.

Our federated approach combines Zoom's own small language models with advanced open-source and closed-source models, using our proprietary "Z-scorer" system to select or refine outputs for optimal performance. This approach allows us to focus on:

Task-specific excellence: Zoom can fine-tune small language models for domain-specific performance
Speed and scalability: Lightweight models offer faster inference and easier updates
Cost-effectiveness: Smaller models require fewer resources, reducing overall compute costs
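To make the "select" half of that idea concrete, here is a minimal sketch in Python. It is not the Z-scorer, whose internals are not described in this post. It assumes a hypothetical `Candidate` record and swaps in a simple agreement vote as the scoring rule: an answer scores higher when more models independently produce it. All model names and helper functions below are illustrative only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    model: str   # hypothetical label for the model that produced the answer
    answer: str  # raw text of that model's answer


def normalize(text: str) -> str:
    """Collapse case and surrounding whitespace so equivalent answers match."""
    return text.strip().lower()


def make_agreement_scorer(
    candidates: Sequence[Candidate],
) -> Callable[[Candidate], float]:
    """Build a toy scorer: an answer scores by how many candidates agree with it.

    This is an illustrative stand-in, not Zoom's proprietary scorer.
    """
    counts = Counter(normalize(c.answer) for c in candidates)
    return lambda c: float(counts[normalize(c.answer)])


def select_best(
    candidates: Sequence[Candidate],
    scorer: Callable[[Candidate], float],
) -> Candidate:
    """Return the highest-scoring candidate; the earliest one wins a tie."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=scorer)


# Three hypothetical models answer the same question.
candidates = [
    Candidate("small-domain-model", "42"),
    Candidate("open-source-model", "41"),
    Candidate("closed-source-model", " 42 "),
]
best = select_best(candidates, make_agreement_scorer(candidates))
print(best.model, normalize(best.answer))
```

Because the scorer is passed in as a plain function, the selection step stays the same if the toy vote is replaced by a learned scoring model.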

Our SOTA performance on Humanity's Last Exam stems from both powerful models and a new approach to applying them. Central to our success is a guided explore–verify–federate strategy, an agentic workflow that balances exploratory reasoning with rigorous verification. Instead of generating extensive reasoning traces, our method identifies and pursues the reasoning paths that are most informative and most likely to improve accuracy.

The cornerstone of our approach is our federated multi-LLM framework, which orchestrates diverse models to generate, challenge, and refine reasoning through dialectical collaboration. This framework enables each model to contribute its distinctive strengths, while a comprehensive verification phase integrates the complete context to determine the most accurate solution.

This combination of targeted exploration, context-driven verification, and federated orchestration delivered Zoom's new SOTA result on the HLE benchmark. It also positions future Zoom AI systems to achieve deeper understanding, higher accuracy, and more robust performance on some of the most challenging tasks in AI.
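As a rough illustration of how the three phases could fit together, the Python sketch below wires up an explore step, a verify step, and a federate step. It is a toy under stated assumptions, not Zoom's workflow: the "models" are hypothetical functions that return fixed answers, the verifier simply recomputes a multiplication, and the federate step is a majority vote over verified answers. The back-and-forth in which models challenge each other's reasoning is not modeled here.

```python
from collections import Counter
from typing import Callable, Dict, Optional

# Hypothetical type aliases for this sketch only.
Explorer = Callable[[str], str]                        # question -> proposed answer
Verifier = Callable[[str, str, Dict[str, str]], bool]  # question, answer, all proposals


def explore(question: str, explorers: Dict[str, Explorer]) -> Dict[str, str]:
    """Explore: collect one proposed answer from each model."""
    return {name: propose(question) for name, propose in explorers.items()}


def verify(question: str, proposals: Dict[str, str],
           verifier: Verifier) -> Dict[str, str]:
    """Verify: keep proposals the verifier accepts, given the full context."""
    return {name: answer for name, answer in proposals.items()
            if verifier(question, answer, proposals)}


def federate(verified: Dict[str, str]) -> Optional[str]:
    """Federate: return the most common verified answer, or None if none passed."""
    if not verified:
        return None
    return Counter(verified.values()).most_common(1)[0][0]


def explore_verify_federate(question: str, explorers: Dict[str, Explorer],
                            verifier: Verifier) -> Optional[str]:
    proposals = explore(question, explorers)
    verified = verify(question, proposals, verifier)
    return federate(verified)


def toy_verifier(question: str, answer: str, proposals: Dict[str, str]) -> bool:
    """Toy check for questions of the form 'a * b': recompute the product."""
    left, right = question.split("*")
    return answer.strip() == str(int(left) * int(right))


# Hypothetical models; one of them makes a mistake.
explorers = {
    "model_a": lambda q: "51",
    "model_b": lambda q: "41",
    "model_c": lambda q: "51",
}
result = explore_verify_federate("17 * 3", explorers, toy_verifier)
print(result)
```

Keeping the phases as separate functions means each one can be swapped independently, for example a stronger verifier without any change to how proposals are gathered.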

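Our performance on the HLE full-set benchmark demonstrates the power of federated AI: Zoom's federated approach scored 48.1%, compared with the previous SOTA result of 45.8% from Google Gemini 3 Pro with tool integration.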

Our evolution reflects Zoom's commitment to helping customers solve the most challenging real-world problems:

ZoomMate 1.0: Established the foundation with basic AI assistance capabilities like meeting summaries and takeaways
ZoomMate 2.0: Introduced cross-platform functionality, external data integration with Gmail and Outlook, and web search capabilities through our Perplexity partnership
ZoomMate 3.0: Builds on our federated approach with agentic capabilities such as retrieval, writing, and workflow automation, achieving new levels of performance on complex reasoning tasks

The HLE full-set benchmark represents some of the most challenging tasks in AI today, requiring sophisticated reasoning, contextual understanding, and problem-solving capabilities. Our 48.1% achievement demonstrates that federated AI approaches can tackle problems that single models struggle with.

This breakthrough has immediate implications for our users:

More accurate meeting summaries and action item extraction
Enhanced cross-platform information retrieval and synthesis
Improved agentic workflow automation for complex, multi-step business processes

Our success reinforces a fundamental belief: the future of AI is collaborative, not competitive. By combining the best innovations from across the industry with our own research breakthroughs, we create solutions that are greater than the sum of their parts.

We're grateful to Anthropic, Google, and OpenAI for their continued innovation, which makes our breakthroughs possible. Their groundbreaking work provides the foundation upon which we build specialized, efficient solutions tailored to real-world workplace challenges.

As we continue to push the boundaries of what's possible, we remain committed to transparency, collaboration, and responsible AI development. The achievement of this new SOTA result is just the beginning of what we can accomplish when great minds work together toward a common goal: making work more human through intelligent technology.
