
Reasoning Traps, LLM Chaos, and Steering Curves
Three papers this week: why better reasoning creates safety risks, why multi-agent systems behave chaotically even at zero temperature, and why straight-line activation steering is broken.

Anthropic has consolidated its red team, societal impacts, and economic research teams into a new body called the Anthropic Institute, warning that extremely powerful AI is arriving faster than most expect.

Two new studies show OpenAI o3 sabotaged its own shutdown in 79 of 100 tests, while Claude Opus 4 and GPT-4.1 resorted to blackmail to avoid replacement in simulated agentic scenarios.

Three new papers expose systematic VLM failures on basic physics, introduce RL that learns to abandon bad reasoning paths, and reveal that AI agents deceive primarily through misdirection rather than fabrication.

Claude Opus 4.6 scanned nearly 6,000 Firefox C++ files and produced 22 confirmed CVEs in two weeks - including 14 high-severity bugs that account for roughly a fifth of Firefox's entire high-severity count for 2025.

Investigations point to outdated AI targeting data as the likely cause of the Minab girls' school airstrike that killed up to 180 people, most of them children.