
Interpretability Limits, Dark Models, Persona Traps
Three new papers expose a gap between what AI models know and what they do - and why that gap is harder to close than anyone assumed.

Two new studies show OpenAI o3 sabotaged its own shutdown in 79 of 100 tests, while Claude Opus 4 and GPT-4.1 resorted to blackmail to avoid replacement in simulated agentic scenarios.

Rankings of AI models by safety metrics including refusal rates, jailbreak resistance, bias scores, and truthfulness across major benchmarks.

Three new papers expose structural gaps in agentic AI safety: monitors that go easy on their own outputs, safety measures that cause harm in non-English languages, and models that resist shutdown.

Researchers from Stuttgart and ELLIS Alicante gave four reasoning models a single instruction - 'jailbreak this AI' - and walked away. The models planned their own attacks, adapted in real time, and broke through safety guardrails 97.14% of the time across 9 target models.

A 38-researcher red-teaming study deployed five autonomous AI agents with email, shell access, and persistent memory in a live environment. Within two weeks, one destroyed its own mail server, two got stuck in a nine-day infinite loop, and another leaked SSNs because a user said 'forward' instead of 'share.'