
MiniMax M2.7 Claims to Automate Its Own Training
MiniMax's new 2,300B MoE model tops the Artificial Analysis Intelligence Index and claims to run 30-50% of its own RL research workflow autonomously.

Cursor launches Composer 2, its first in-house coding model trained via RL on long-horizon tasks, scoring 73.7 on SWE-bench Multilingual at $0.50/M input tokens.

METR found that maintainers would reject roughly half of AI-generated PRs that pass SWE-bench automated grading, a 24-point gap suggesting benchmark scores substantially overstate production readiness.

Claude Opus 4.6 leads SWE-bench Verified at 80.8% while Gemini 3.1 Pro dominates LiveCodeBench Pro with 2887 Elo, making the choice of best coding model a matter of workflow.

Rankings of the best AI models for coding across the SWE-bench, Terminal-Bench, and LiveCodeBench benchmarks, which measure real-world software engineering and algorithmic problem-solving ability.