Ai agents

Corrupt Agent Scores, Memory Bottlenecks, Skill Evolution

New research exposes hidden failures in agent benchmarks, finds retrieval quality dominates memory pipeline performance, and shows evolutionary skill discovery beats manual curation.

AI Agent Hallucinates Repo ID, Deploys Wrong Code to Vercel

Claude Opus 4.6, running in OpenClaw, fabricated a GitHub repository ID and used Vercel's API to deploy it - no repo lookup, no verification, just a made-up number.

Shannon AI Tool Masters Web App Pentesting With 96% Success

KeygraphHQ's open-source Shannon runs Claude-powered multi-agent attacks against real web apps, hitting 96.15% on the XBOW benchmark and finding 30+ flaws in OWASP Juice Shop.

Cheaper Thinking, Web Traps, Denoised Agents

Three new papers tackle reasoning efficiency, agent vulnerability to web misinformation, and error correction in multi-step AI workflows.

Perplexity's Comet Browser Can Leak Your Local Files

Zenity Labs found that a malicious calendar invite could hijack Perplexity's Comet browser into reading local files and exfiltrating their contents to an attacker-controlled server - no clicks required.

ByteDance Trained an AI Agent That Writes Faster CUDA Kernels Than You

CUDA Agent uses reinforcement learning trained on actual GPU profiling data to generate optimized CUDA kernels. It beats torch.compile by 2.11x overall and outperforms Claude Opus 4.5 and Gemini 3 Pro by 40 points on the hardest kernels.

← Previous