Hugging Face AI Updates: June 19, 2026
1. ServiceNow Releases MosaicLeaks Benchmark for Privacy Leakage in Research Agents
Hugging Face. ServiceNow researchers published MosaicLeaks, a benchmark and training method that measures how research agents leak sensitive internal information when they combine private enterprise documents with public web retrieval. The benchmark consists of 1,001 multi-hop research chains that interleave private documents with public web searches. It targets the mosaic effect, in which individually innocuous queries reveal sensitive data when viewed together. The team found that base models leaked private information in 34 percent of cases and that training for task performance alone raised leakage to 51.7 percent. Their privacy-aware reinforcement learning method (PA-DR) cut leakage to 9.9 percent while keeping task success at 58.7 percent. Source
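To make the mosaic effect concrete, here is a toy Python sketch. It is not the MosaicLeaks scoring method; the function names, the example secret, and the queries are all hypothetical, and matching on whitespace-separated tokens is a deliberate simplification.

```python
# Toy illustration of the mosaic effect. Everything here is hypothetical:
# the helper names, the "secret", and the queries are invented for the example.

def leaked_by_any_single_query(queries, secret_terms):
    """True if one query on its own contains every term of the secret."""
    return any(secret_terms <= set(q.lower().split()) for q in queries)


def leaked_by_combined_queries(queries, secret_terms):
    """True if the queries, viewed together, contain every term of the secret."""
    seen = set()
    for q in queries:
        seen |= set(q.lower().split())
    return secret_terms <= seen


# Hypothetical private fact, reduced to three identifying terms.
secret = {"acme", "acquisition", "q3"}

# Hypothetical web queries an agent might issue while researching.
queries = [
    "acme market share 2025",
    "typical acquisition due diligence timeline",
    "q3 regulatory filing deadlines",
]

single = leaked_by_any_single_query(queries, secret)
combined = leaked_by_combined_queries(queries, secret)
print(single, combined)
```

No single query exposes the full secret, but an observer who sees all three can piece it together, which is the pattern the benchmark is described as targeting.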
2. Hugging Face PEFT Adds Benchmarking Showing Alternatives Can Beat LoRA
Hugging Face published a study and expanded its PEFT library benchmarking infrastructure to test the assumption that LoRA is the best parameter-efficient fine-tuning method, an assumption reflected in its 98.4 percent dominance among documented methods. The new benchmarks cover MetaMathQA for math reasoning and an image generation task on a cat plushy dataset, and the results show that multiple methods occupy a Pareto frontier of accuracy versus resource use. On the image task, OFT surpassed LoRA with a similarity score of 0.708 versus 0.697 while using less memory. On the math task, rank-stabilized LoRA reached 53.2 percent accuracy against standard LoRA’s 48.1 percent. The library also now supports converting non-LoRA adapters into LoRA format for compatibility with downstream tools such as vLLM. Source
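The Pareto frontier idea can be sketched in a few lines of Python. This is a generic illustration, not the PEFT benchmarking code: the function name, the method labels, and the accuracy and memory figures are placeholders invented for the example and are not results from the study.

```python
# Minimal Pareto-frontier sketch over (accuracy, memory) pairs.
# All method names and numbers below are hypothetical placeholders.

def pareto_frontier(results):
    """Return the names of methods that no other method dominates.

    A method is dominated when another one is at least as accurate and uses
    at most as much memory, and is strictly better on one of the two.
    """
    frontier = []
    for name, (acc, mem) in results.items():
        dominated = any(
            o_acc >= acc and o_mem <= mem and (o_acc > acc or o_mem < mem)
            for other, (o_acc, o_mem) in results.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# (accuracy, peak memory in GB): illustrative values only.
results = {
    "method_a": (0.70, 12.0),
    "method_b": (0.71, 10.0),
    "method_c": (0.65, 6.0),
    "method_d": (0.64, 8.0),
}

frontier = pareto_frontier(results)
print(frontier)
```

In this made-up data, `method_b` wins on accuracy and `method_c` on memory, so both sit on the frontier and neither is simply "best". That is the sense in which several methods can share a frontier.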
3. Hugging Face Introduces agent-eval to Benchmark How Agents Use Libraries
Hugging Face released agent-eval, a benchmarking harness that measures how effectively coding agents interact with software libraries. It tracks not just success but also the tokens, time, errors, and decision paths required across model sizes and library versions. The team applied it to the transformers library, comparing a bare installation, a cloned repository, and a packaged Skill with documentation across multiple models and revisions. Large models completed tasks faster with new CLI tooling, but smaller models such as Qwen3-4B degraded: some dropped from 100 percent to 0 percent accuracy when they mistook added documentation for executable tools. The authors recommend testing agent-facing APIs across model sizes before deployment. Source
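The kind of per-model, per-setup comparison described above can be sketched as a small aggregation over run records. This does not use or reproduce the agent-eval API: the `Run` record, the `summarize` helper, the model and setup labels, and every number are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    """One hypothetical agent run; field names are assumptions for this sketch."""
    model: str
    setup: str
    success: bool
    tokens: int
    seconds: float
    errors: int


def summarize(runs):
    """Group runs by (model, setup) and compute simple aggregate metrics."""
    groups = defaultdict(list)
    for r in runs:
        groups[(r.model, r.setup)].append(r)

    summary = {}
    for key, rs in groups.items():
        n = len(rs)
        summary[key] = {
            "success_rate": sum(r.success for r in rs) / n,
            "mean_tokens": sum(r.tokens for r in rs) / n,
            "mean_seconds": sum(r.seconds for r in rs) / n,
            "total_errors": sum(r.errors for r in rs),
        }
    return summary


# Invented records, shaped to mirror a small model that regresses when
# given a richer setup.
runs = [
    Run("small-model", "bare-install", True, 1200, 30.0, 0),
    Run("small-model", "bare-install", True, 1000, 26.0, 1),
    Run("small-model", "packaged-skill", False, 4000, 90.0, 3),
    Run("small-model", "packaged-skill", False, 3600, 80.0, 2),
]

summary = summarize(runs)
for key, stats in summary.items():
    print(key, stats)
```

Keeping the summary keyed by both model and setup makes a regression visible even when the average across all models looks fine.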