Benchmarks

CoT Control, Hidden Beliefs, and Dynamic Agent Benchmarks

New research shows that reasoning models cannot suppress their chain-of-thought, that they commit to answers internally long before their chain-of-thought reveals them, and that static benchmarks are inadequate for measuring real-world agent adaptability.

75% of AI Coding Agents Break Working Code Over Time

Alibaba's SWE-CI benchmark tested 18 AI models on 100 real codebases across 233 days of simulated maintenance. Most agents accumulate technical debt and break previously working code over time. Only Claude Opus keeps its zero-regression rate above 50%.