METR found that maintainers would reject roughly half of AI-generated PRs that pass SWE-bench's automated grading; the 24-point gap between automated and human evaluation suggests benchmark scores substantially overstate production readiness.
Anthropic's new Code Review dispatches parallel AI agents on every pull request to find bugs, rank them by severity, and filter out false positives, at a cost of $15-25 per review.
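The described workflow (parallel agents per PR, severity ranking, false-positive filtering) can be sketched roughly as below. This is a hypothetical illustration, not Anthropic's implementation: the agent functions, `Finding` fields, confidence threshold, and severity names are all assumptions, and the stub agents stand in for what would really be LLM calls with specialized prompts.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical severity ordering; real systems may use different tiers.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}

@dataclass
class Finding:
    message: str
    severity: str
    confidence: float  # agent's self-reported confidence, 0.0-1.0

def security_agent(diff: str) -> list[Finding]:
    # Stub: a real agent would send the diff to an LLM with a
    # security-focused prompt and parse structured findings back.
    findings = []
    if "eval(" in diff:
        findings.append(Finding("eval() on untrusted input", "critical", 0.9))
    return findings

def style_agent(diff: str) -> list[Finding]:
    # Stub: low-confidence nit, likely to be filtered out below.
    findings = []
    if "TODO" in diff:
        findings.append(Finding("unresolved TODO left in diff", "low", 0.4))
    return findings

def review(diff: str, agents, min_confidence: float = 0.5) -> list[Finding]:
    """Run all agents in parallel, drop low-confidence findings
    (a crude false-positive filter), and sort by severity."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda agent: agent(diff), agents)
    findings = [f for batch in results
                for f in batch if f.confidence >= min_confidence]
    return sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])

if __name__ == "__main__":
    diff = "+ result = eval(user_input)  # TODO: sanitize"
    for f in review(diff, [security_agent, style_agent]):
        print(f"[{f.severity}] {f.message}")
```

The confidence cutoff here is a stand-in for whatever false-positive filtering the real product does; the TODO nit (confidence 0.4) is dropped while the high-confidence security finding survives.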