Lux Surpasses Google Gemini CUA with 83.6% Accuracy on Online Mind2Web Benchmark

Lux: A Foundation Computer Use Model that Tops Online Mind2Web with OSGym At Scale

OpenAGI Foundation launched Lux, a computer use model that scores 83.6% on the Online Mind2Web benchmark, outperforming Google Gemini CUA (69.0%), OpenAI Operator (61.3%), and Anthropic Claude Sonnet 4 (61.0%). The model automates browser and desktop interactions through low-level actions like clicks and keystrokes.

Why This Matters

Lux targets real-world automation by operating on the rendered UI rather than application-specific APIs. Online Mind2Web covers over 300 tasks, and on it Lux leads Gemini CUA by 14.6 percentage points (83.6% versus 69.0%). In production workflows that require hundreds of actions per task, a lead of that size could translate into significant cost savings.
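The size of the lead can be worked out directly from the reported scores. The following is a minimal Python sketch that uses only the four success rates quoted in this article; the dictionary and the `lead_over` helper are illustrative names, not part of any Lux tooling.

```python
# Online Mind2Web success rates (percent), as quoted in this article.
scores = {
    "Lux": 83.6,
    "Google Gemini CUA": 69.0,
    "OpenAI Operator": 61.3,
    "Anthropic Claude Sonnet 4": 61.0,
}


def lead_over(baseline: str, model: str = "Lux") -> tuple[float, float]:
    """Return (lead in percentage points, relative improvement in percent)."""
    points = scores[model] - scores[baseline]
    relative = 100.0 * points / scores[baseline]
    return round(points, 1), round(relative, 1)


for name in scores:
    if name != "Lux":
        points, relative = lead_over(name)
        print(f"Lux vs {name}: +{points} points ({relative}% relative)")
```

On these figures the lead over Gemini CUA is 14.6 percentage points, or roughly 21% in relative terms. The two numbers answer different questions, so it helps to state which one is meant.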

Key Insights

“83.6% success rate on Online Mind2Web benchmark, 2025”
“Three execution modes: Actor (fast UI macros), Thinker (multi-step decomposition), Tasker (deterministic scripting)”
“OSGym, the open-source engine behind Lux, runs 1,000+ OS replicas and generates 1,400+ trajectories/minute”
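To put the OSGym figures in perspective, here is a rough back-of-the-envelope calculation in Python. It treats the quoted "1,000+" and "1,400+" values as lower bounds and assumes the rate is sustained, which the source does not state; the constant names are illustrative.

```python
# Lower bounds quoted above; "+" in the source means the real values may be higher.
REPLICAS = 1_000               # parallel OS replicas
TRAJECTORIES_PER_MIN = 1_400   # aggregate trajectories generated per minute

# Average output of a single replica, assuming load is spread evenly.
per_replica_per_min = TRAJECTORIES_PER_MIN / REPLICAS

# Aggregate output over longer windows, assuming the rate is sustained.
per_hour = TRAJECTORIES_PER_MIN * 60
per_day = per_hour * 24

print(f"{per_replica_per_min:.1f} trajectories per replica per minute")
print(f"{per_hour:,} trajectories per hour")
print(f"{per_day:,} trajectories per day")
```

Under those assumptions each replica averages about 1.4 trajectories per minute, and the whole fleet produces 84,000 per hour, or roughly two million per day.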

Practical Applications

Use Case: Software QA teams automating regression tests across web apps
Pitfall: Over-reliance on UI automation without fallbacks for dynamic page layouts
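One way to guard against that pitfall is to give each automated step an ordered list of strategies and fall through to the next when one fails. The sketch below is a generic Python pattern, not Lux's API: `run_with_fallbacks`, `click_by_ui`, and `submit_via_api` are hypothetical names, and the two strategy functions are stand-ins for real automation calls.

```python
from typing import Callable, Sequence


class StepFailed(Exception):
    """Raised when every strategy for a step has failed."""


def run_with_fallbacks(step_name: str,
                       strategies: Sequence[Callable[[], object]]) -> object:
    """Try each strategy in order and return the first successful result."""
    errors = []
    for strategy in strategies:
        try:
            return strategy()
        except Exception as exc:  # record the failure and try the next strategy
            errors.append(f"{strategy.__name__}: {exc}")
    raise StepFailed(f"{step_name} failed: " + "; ".join(errors))


# Hypothetical stand-ins for real automation calls.
def click_by_ui() -> str:
    # Simulates a UI action that breaks because the page layout changed.
    raise LookupError("button not found in rendered layout")


def submit_via_api() -> str:
    # Simulates a fallback path that does not depend on the layout.
    return "submitted via API"


print(run_with_fallbacks("submit form", [click_by_ui, submit_via_api]))
```

Collecting the error from each failed strategy keeps the final exception informative, which matters when a regression suite needs to distinguish a layout change from a genuine application bug.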

References:

https://www.marktechpost.com/2025/12/05/openagi-foundation-launches-lux-a-foundation-computer-use-model-that-tops-online-mind2web-with-osgym-at-scale/
