
Microsoft Phi-4
Microsoft's 14B dense transformer that consistently beats models 5x its size on MATH and GPQA, available under the MIT license for unrestricted commercial use.

NVIDIA's hybrid Mamba2+MoE model packs 31.6B total parameters but activates only 3.2B per token, delivering frontier-class reasoning with 3.3x the throughput of comparable models on a single H200 GPU.

Anthropic's flagship model leads on agentic coding, enterprise knowledge work, and long-context retrieval with a 1M-token window, 128K output, and agent teams at $5/$25 per million tokens.
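To make the quoted $5/$25-per-million-token rates concrete, here is a minimal sketch of the cost arithmetic for a single request. The function name and the example token counts are illustrative assumptions, not part of any official API; only the two rates come from the blurb above.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     in_rate: float = 5.0, out_rate: float = 25.0) -> float:
    """Dollar cost of one request, given per-million-token rates
    ($5/M input, $25/M output, as quoted for the flagship model)."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Illustrative example: a 200K-token prompt with a 4K-token response
# costs $1.00 for input plus $0.10 for output, about $1.10 total.
print(round(request_cost_usd(200_000, 4_000), 2))
```

At these rates, input dominates for long-context workloads: filling even a fifth of the 1M-token window costs roughly a dollar before any output is generated.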

Google DeepMind's Gemini 3.1 Pro leads on 13 of 16 benchmarks with 77.1% ARC-AGI-2, 94.3% GPQA Diamond, and a 1M-token context window at $2/M input.

Google's Gemini 3 Deep Think trades speed for depth, posting record-breaking reasoning benchmark scores, but at a steep price.

Google releases Gemini 3.1 Pro with 77.1% on ARC-AGI-2, more than doubling the reasoning capability of its predecessor and beating Claude Opus 4.6 and GPT-5.2 on most benchmarks.