Local LLM

Nemotron-Cascade 2: 30B Open MoE, One GPU, Beats 120B

NVIDIA's new Nemotron-Cascade-2-30B-A3B activates just 3B parameters per token, runs on a single RTX 4090, and outscores NVIDIA's own 120B model on coding and math benchmarks.

Apple's M5 Pro and Max Make 70B Models Portable

Apple launches M5 Pro and M5 Max MacBook Pros with Neural Accelerators in every GPU core, 128GB unified memory, and 614GB/s bandwidth - enough to run Llama 70B on a laptop.

Mac Studio Clusters Now Run Trillion-Parameter Models for $40K

macOS RDMA over Thunderbolt 5 has turned four Mac Studios into a 1.5TB unified memory cluster that runs Kimi K2 at 25 tokens per second - a setup that would cost $780K with NVIDIA H100s.

LM Studio Launches LM Link - Access Your GPU Rig's Models From Anywhere via Encrypted Mesh

LM Studio 0.4.5 introduces LM Link, built on Tailscale's tsnet library, letting users access local AI models on remote hardware through end-to-end encrypted connections with zero port forwarding.

Liquid AI Drops LFM2-24B - A 24 Billion Parameter Model That Runs on Your Laptop

MIT spinoff Liquid AI releases LFM2-24B-A2B, a hybrid mixture-of-experts model that activates only 2.3B parameters per token, fits in 32GB RAM, and hits 112 tokens per second on a consumer CPU.

LLMfit: Stop Guessing Which LLM Your Hardware Can Actually Run

LLMfit is a Rust-based terminal tool that scans your hardware and scores 157 LLMs across 30 providers for compatibility, speed, and quality. Here is why it matters.