
GPT-4 to Self-Hosted Llama 4 Migration Guide
Switch from OpenAI's GPT-4 API to self-hosted Llama 4 with near-zero code changes, but plan carefully for hardware, EU licensing, and real context window limits.
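The "near-zero code changes" claim rests on self-hosted servers such as vLLM exposing an OpenAI-compatible API: the same SDK and call sites keep working, and only the client configuration changes. A minimal sketch, assuming a local server at `http://localhost:8000/v1` (the endpoint, dummy API key, and model name are placeholders, not values from the guide):

```python
import os

def client_config(self_hosted: bool) -> dict:
    """Return kwargs for openai.OpenAI().

    When migrating to a self-hosted backend, the only change is the
    base_url (plus a dummy api_key, which the local server ignores);
    every call site stays the same.
    """
    if self_hosted:
        return {
            "base_url": "http://localhost:8000/v1",  # assumed local vLLM endpoint
            "api_key": "not-needed",                 # ignored by the local server
        }
    # Hosted OpenAI: default base_url, real key from the environment.
    return {"api_key": os.environ.get("OPENAI_API_KEY", "")}

# Usage (unchanged request code, illustrative model name):
#   client = openai.OpenAI(**client_config(self_hosted=True))
#   client.chat.completions.create(model="llama-4", messages=[...])
```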


Real costs of self-hosting Llama, Mistral, Qwen, and DeepSeek models on cloud GPUs vs. API access - with break-even analysis and hardware pricing.
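The core of a break-even analysis like this reduces to one division: the monthly cost of a dedicated GPU against the per-token price of the API. A minimal sketch, with purely hypothetical prices for illustration (not figures from the article):

```python
def breakeven_tokens_per_month(gpu_cost_per_hour: float,
                               api_cost_per_mtok: float,
                               hours_per_month: float = 730.0) -> float:
    """Monthly token volume at which a dedicated GPU matches API spend.

    Below this volume the pay-per-token API is cheaper; above it,
    self-hosting wins (ignoring ops overhead and utilization gaps).
    """
    monthly_gpu_cost = gpu_cost_per_hour * hours_per_month
    return monthly_gpu_cost / api_cost_per_mtok * 1_000_000

# Hypothetical example: a $2/hr cloud GPU vs. an API at $3 per million
# output tokens breaks even near 487M tokens per month.
volume = breakeven_tokens_per_month(gpu_cost_per_hour=2.0,
                                    api_cost_per_mtok=3.0)
```

Real comparisons also need to account for partial GPU utilization and input/output token price splits, which shift the break-even point considerably.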

IBM's new 1B-parameter speech model claims the top spot on the Open ASR Leaderboard while running on consumer hardware, beating Whisper Large V3 by 25% on word error rate.

vLLM v0.17.0 adds FlashAttention 4, elastic expert parallelism for live MoE rescaling, full Qwen3.5 support, and a performance-mode flag, all in 699 commits from 272 contributors.

Alibaba releases official FP8-quantized weights for the Qwen 3.5 flagship and 27B dense model, cutting memory requirements roughly in half and enabling deployment on 8x H100 GPUs with native vLLM and SGLang support.