
METR: Half of SWE-Bench Passes Fail Real Code Review
METR found that maintainers would reject roughly half of AI-generated PRs that pass SWE-bench's automated grading, a 24-point gap suggesting benchmark scores substantially overstate production readiness.

Gemini 2.5 Flash Lite leads the Vectara hallucination leaderboard with a 3.3% error rate, while GPT-4o and Gemini 2.5 Pro dominate long-document tasks - full rankings, benchmark scores, and pricing.

Johns Hopkins and Microsoft's JBDistill achieves an 81.8% attack success rate across 13 LLMs by auto-generating fresh adversarial prompts on demand.

Gemini 2.5 Pro leads the WMT25 human evaluation across 16 language pairs, while GPT-5 tops community benchmarks - full rankings, BLEU and COMET scores, and pricing for every major model.

Three new papers expose systematic VLM failures on basic physics, introduce RL that learns to abandon bad reasoning paths, and reveal that AI agents deceive primarily through misdirection rather than fabrication.

Rankings of AI models by safety metrics including refusal rates, jailbreak resistance, bias scores, and truthfulness across major benchmarks.