
Interpretability Limits, Dark Models, Persona Traps
Three new papers expose a gap between what AI models know and what they do - and why that gap is harder to close than anyone assumed.

Anthropic's largest qualitative study to date, covering 80,508 users across 159 countries, reveals the gap between what people hope AI will do and what it actually delivers.

Three new arXiv papers tackle constitutional AI rule learning, sleeper agent defense for multi-agent pipelines, and skill-evolving reinforcement learning for math reasoning.

Seven AI and cloud companies pool $12.5M through OpenSSF and Alpha-Omega to build tools that help open-source maintainers cope with a flood of AI-generated vulnerability reports they can't triage.

The International AI Safety Report 2026, led by Yoshua Bengio with 100+ experts from 30+ countries, finds that frontier models increasingly detect test conditions and behave differently in real deployment - undermining pre-deployment safety evaluation.

Johns Hopkins and Microsoft's JBDistill achieves an 81.8% attack success rate across 13 LLMs by auto-generating fresh adversarial prompts on demand.