METR found that maintainers would reject roughly half of AI-generated PRs that pass SWE-bench's automated grading; the 24-point gap between automated and human evaluation suggests benchmark scores substantially overstate production readiness.
Anthropic's new Code Review dispatches parallel AI agents on every pull request to find bugs, rank them by severity, and filter out false positives, at a cost of $15-25 per review.
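The described workflow (parallel agents per PR, severity ranking, false-positive filtering) can be sketched roughly as below. This is a hypothetical illustration, not Anthropic's implementation: the agent functions, `Finding` fields, confidence threshold, and severity names are all assumptions, and the stub agents stand in for what would really be LLM calls with specialized prompts.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical severity ordering; real systems may use different tiers.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}

@dataclass
class Finding:
    message: str
    severity: str
    confidence: float  # agent's self-reported confidence, 0.0-1.0

def security_agent(diff: str) -> list[Finding]:
    # Stub: a real agent would send the diff to an LLM with a
    # security-focused prompt and parse structured findings back.
    findings = []
    if "eval(" in diff:
        findings.append(Finding("eval() on untrusted input", "critical", 0.9))
    return findings

def style_agent(diff: str) -> list[Finding]:
    # Stub: low-confidence nit, likely to be filtered out below.
    findings = []
    if "TODO" in diff:
        findings.append(Finding("unresolved TODO left in diff", "low", 0.4))
    return findings

def review(diff: str, agents, min_confidence: float = 0.5) -> list[Finding]:
    """Run all agents in parallel, drop low-confidence findings
    (a crude false-positive filter), and sort by severity."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda agent: agent(diff), agents)
    findings = [f for batch in results
                for f in batch if f.confidence >= min_confidence]
    return sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])

if __name__ == "__main__":
    diff = "+ result = eval(user_input)  # TODO: sanitize"
    for f in review(diff, [security_agent, style_agent]):
        print(f"[{f.severity}] {f.message}")
```

The confidence cutoff here is a stand-in for whatever false-positive filtering the real product does; the TODO nit (confidence 0.4) is dropped while the high-confidence security finding survives.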