Hugging Face AI Updates: May 15, 2026

1. IBM Granite Embedding Multilingual R2 lands as best sub-100M retrieval model

Hugging Face. IBM released granite-embedding-97m-multilingual-r2 (384-dim) and granite-embedding-311m-multilingual-r2 (768-dim) on the Hub under Apache 2.0. Both are built on ModernBERT with a 32,768-token context (64x the R1 generation) and cover 200+ languages plus nine programming languages. The 97M variant scores 60.3 on MTEB Multilingual Retrieval, the top score among open sub-100M models, and the 311M variant scores 65.2. Both gain 31 to 34 points on LongEmbed versus R1, and ONNX and OpenVINO weights ship for CPU inference. Source
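For readers who want to try an embedding model like this for retrieval, the sketch below ranks documents by cosine similarity to a query. It is a minimal illustration, not IBM's reference usage: the Hub repo id in ASSUMED_MODEL_ID is an assumption inferred from the model name and should be checked on the Hub, and load_encoder is a hypothetical helper that relies on the sentence-transformers package being installed.

```python
import math
from typing import Callable, List, Sequence, Tuple

# ASSUMPTION: repo id inferred from the model name in the item above; verify on the Hub.
ASSUMED_MODEL_ID = "ibm-granite/granite-embedding-97m-multilingual-r2"


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec: Sequence[float],
         doc_vecs: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, most similar first."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def load_encoder(model_id: str = ASSUMED_MODEL_ID) -> Callable[[List[str]], List[List[float]]]:
    """Hypothetical helper: wrap a sentence-transformers model as a text -> vectors function.

    Downloads weights on first call, so it is not invoked at import time.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)
    return lambda texts: model.encode(texts, normalize_embeddings=True).tolist()
```

Typical use would be to call load_encoder() once, encode the query and the documents, and pass the resulting vectors to rank. The ranking code itself does not depend on the embedding width, so it should work unchanged with either the 384-dim or the 768-dim variant.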

2. Hugging Face unlocks asynchronous continuous batching for a 22% inference speedup

Hugging Face. A new engineering post details how ContinuousBatchingAsyncIOs in the transformers library overlaps CPU batch preparation with GPU compute. It uses three CUDA streams (host-to-device copy, compute, device-to-host copy), CUDA events, and dual A/B input buffers, lifting GPU utilization from 76% to 99.4% on an 8B model generating 8K tokens. Total generation time drops from 300.6s to 234.5s with zero model changes, a 22% reduction in wall-clock time (about 1.28x the throughput). A carry-over mechanism transfers each batch’s output tokens into the next batch’s inputs without stalling either device. Source
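The effect of overlapping CPU prep with GPU compute can be illustrated with a small timing model. This is a sketch under simplifying assumptions, not the transformers implementation: it uses no CUDA, treats copies as free, and assumes the CPU can start preparing batch i+1 as soon as the GPU starts batch i (which is what two input buffers allow). All function names are illustrative.

```python
from typing import Sequence


def serial_time(prep: Sequence[float], compute: Sequence[float]) -> float:
    """Total time when each batch is prepared on the CPU, then computed on the GPU."""
    return sum(p + c for p, c in zip(prep, compute))


def overlapped_time(prep: Sequence[float], compute: Sequence[float]) -> float:
    """Total time with two input buffers: prep of batch i+1 overlaps compute of batch i."""
    if not compute:
        return 0
    cpu_ready = prep[0]  # time at which batch 0's inputs are ready
    gpu_free = 0         # time at which the GPU finishes its current batch
    for i, c in enumerate(compute):
        start = max(cpu_ready, gpu_free)  # GPU waits for inputs, or inputs wait for GPU
        gpu_free = start + c
        if i + 1 < len(prep):
            # The other buffer is free once batch i starts, so prep of i+1 begins here.
            cpu_ready = start + prep[i + 1]
    return gpu_free


def utilization(compute: Sequence[float], total: float) -> float:
    """Fraction of wall-clock time the GPU spends computing."""
    return sum(compute) / total


def time_reduction(before: float, after: float) -> float:
    """Fractional reduction in wall-clock time."""
    return 1 - after / before


def speedup(before: float, after: float) -> float:
    """Throughput ratio (how many times faster)."""
    return before / after
```

With prep=[1, 1] and compute=[3, 3], serial_time gives 8 and overlapped_time gives 7: only the first batch's prep remains exposed. Applied to the figures reported above, time_reduction(300.6, 234.5) is about 0.22 and speedup(300.6, 234.5) is about 1.28.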