AI Models Surpass Human Performance on Complex Reasoning Tasks
OpenAI's GPT-5.4 'Thinking' variant achieved 75% on the OSWorld-Verified benchmark, crossing the human-level performance threshold. At the same time, Google's Gemini 3.1 Ultra scored 94.3% on GPQA Diamond, and Anthropic's Claude Mythos 5 reached 10 trillion parameters, marking a near-simultaneous leap in model capabilities across all major providers.
These concurrent breakthroughs suggest AI systems are entering a new capability tier, one that could fundamentally alter knowledge work and decision-making.
agi
benchmarks
reasoning
performance