
METR: Half of SWE-Bench Passes Fail Real Code Review
METR found maintainers would reject roughly half of AI PRs that pass SWE-bench automated grading, with a 24-point gap that suggests benchmark scores substantially overstate production readiness.

METR found maintainers would reject roughly half of AI PRs that pass SWE-bench automated grading, with a 24-point gap that suggests benchmark scores substantially overstate production readiness.

OpenAI's most capable frontier model combines native computer use, 1M-token context, and three variants at $2.50/$15 per million tokens.

GPT-5.3 Instant launched March 3, 2026, cutting hallucinations by 26.8% and overhauling ChatGPT's tone - but with documented safety regressions in the process.

OpenAI ships GPT-5.3 Instant with 27% fewer hallucinations, a less preachy tone, and better web search - available now across all ChatGPT tiers and the API.

A data-driven comparison of xAI's Grok 4 and OpenAI's ChatGPT powered by GPT-5.2, covering benchmarks, pricing, features, and real-world performance.

Two pull requests in OpenAI's public Codex GitHub repo referenced GPT-5.4 before being scrubbed - one adding full-resolution vision support, the other a fast mode toggle. Seven force pushes and a deleted employee screenshot confirm this was not intentional.