Apple AI Updates: May 6, 2026

1. Apple Research: Stochastic KV Routing Cuts Transformer Cache Footprint Along the Depth Axis

A new paper from Apple ML Research targets the memory cost of KV caching in transformer LLMs by optimizing along the depth dimension rather than the temporal one. The authors argue that maintaining a full cache for every layer is redundant and propose adaptive depth-wise sharing via stochastic routing, trading a small amount of fidelity for substantial memory savings during inference. Relevant for anyone running long-context models on memory-constrained Apple silicon.
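
To get a feel for why the depth axis matters, here is a rough back-of-the-envelope sketch of KV cache size when several layers share one cache. It is not the paper's method: the stochastic routing itself is not modeled, and the function names, the `layer_to_cache` mapping, and the model dimensions (32 layers, 8 KV heads, head size 128, 32,768-token context, fp16) are all hypothetical values chosen for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Size of a standard per-layer KV cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem * batch


def shared_kv_cache_bytes(layer_to_cache, num_kv_heads, head_dim, seq_len,
                          bytes_per_elem=2, batch=1):
    """Cache size when layers are mapped onto a smaller set of shared caches.

    layer_to_cache is a hypothetical list: entry i is the id of the cache
    that layer i reads and writes. Only distinct cache ids cost memory.
    """
    distinct_caches = len(set(layer_to_cache))
    return kv_cache_bytes(distinct_caches, num_kv_heads, head_dim, seq_len,
                          bytes_per_elem, batch)


# Illustrative (assumed) model shape, not taken from the paper.
LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN = 32, 8, 128, 32_768

full = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)

# Assumed sharing pattern: each pair of adjacent layers shares one cache.
pairwise = [layer // 2 for layer in range(LAYERS)]
shared = shared_kv_cache_bytes(pairwise, KV_HEADS, HEAD_DIM, SEQ_LEN)

GIB = 1024 ** 3
print(f"full cache:   {full / GIB:.1f} GiB")
print(f"shared cache: {shared / GIB:.1f} GiB")
```

Under these assumed numbers the per-layer cache comes to 4 GiB and pairwise sharing halves it to 2 GiB. The saving scales with the number of distinct caches, independent of context length, which is what distinguishes a depth-wise approach from techniques that shrink the cache along the sequence.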