AI News: May 25, 2026

1. AutoTTS lets Claude Code discover a test-time scaling algorithm that beats human-designed self-consistency with roughly 70 percent fewer tokens

Research. A team from UMD, UVA, WUSTL, UNC, Google, and Meta published AutoTTS, a system that hands the design of test-time scaling strategies to Claude Code rather than having researchers write them by hand. The discovered algorithm monitors confidence shifts between rounds, opens additional reasoning paths when confidence stagnates, skips new paths when it climbs quickly, and selectively prunes diverging trajectories. The Decoder reports that the algorithm achieves better accuracy per token than standard self-consistency on the AIME and HMMT math benchmarks while cutting token usage by roughly 70 percent, and that it transfers cleanly to DeepSeek-R1-Distill-Llama-8B and to GPQA-Diamond outside the math domain. Total discovery cost was about $40 over 160 minutes, which recasts the researcher’s job as designing the search environment rather than the algorithm. Source
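To make the control loop described above concrete, here is a minimal Python sketch. It is not the AutoTTS algorithm itself. The `step` callback, the vote-share confidence proxy, and every threshold are illustrative assumptions; only the overall behavior (open a path when confidence stagnates, skip when it climbs quickly, prune diverging paths) comes from the item.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple


def vote(answers: List[str]) -> Tuple[str, float]:
    """Majority answer and its vote share, used here as a confidence proxy."""
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)


def adaptive_scaling(
    step: Callable[[int, int], str],  # hypothetical: (path_id, round) -> current answer
    max_rounds: int = 5,
    start_paths: int = 3,
    max_paths: int = 8,
    stagnation_eps: float = 0.05,  # assumed threshold, not from the paper
    fast_gain: float = 0.15,       # assumed threshold, not from the paper
) -> Tuple[str, int]:
    """Return (final answer, number of path-steps spent)."""
    paths = list(range(start_paths))
    next_id = start_paths
    prev_conf: Optional[float] = None
    spent = 0
    best = ""
    for rnd in range(max_rounds):
        # Every surviving path reasons for one more round.
        answers: Dict[int, str] = {p: step(p, rnd) for p in paths}
        spent += len(paths)
        best, conf = vote(list(answers.values()))
        if conf == 1.0:
            break  # all surviving paths agree
        if prev_conf is not None:
            delta = conf - prev_conf
            if delta > 0:
                # Confidence is rising: prune paths that diverge from the majority.
                paths = [p for p in paths if answers[p] == best]
            if delta >= fast_gain:
                pass  # climbing quickly: skip opening a new path
            elif abs(delta) <= stagnation_eps and len(paths) < max_paths:
                # Confidence has stalled: open one additional reasoning path.
                paths.append(next_id)
                next_id += 1
        prev_conf = conf
    return best, spent
```

In a real setup `step` would wrap a model call. The second return value counts path-steps, a rough stand-in for token spend that can be compared against a fixed-width self-consistency run.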

2. White House overrides Pentagon supply-chain flag to keep Anthropic Claude inside the NSA

Policy. The Pentagon designated Anthropic a supply-chain risk after the company declined to commit Claude to “any lawful use,” but the NSA contract is moving forward anyway, with White House Chief of Staff Susie Wiles personally approving the arrangement. The finalized deal strips the disputed “any lawful use” language and adds protections barring the model from processing Americans’ data, and the administration plans to use it as a template for future federal AI contracts. The practical lever is hardware: the NSA does not have enough NVIDIA accelerators on its classified networks to run alternative models, and Anthropic’s Mythos variant is engineered to operate on older hardware, making it the only near-term option for classified inference. Source

3. ByteDance’s MMProLong shows QA-style training beats OCR transcription for long-document multimodal models

Research. Researchers from ByteDance Seed and HKUST compared two training regimes for long-document multimodal models: pure OCR transcription of pages versus generating question-answer pairs that force the model to locate specific passages inside long contexts. QA-style training improved benchmark scores by 5 to 6 points, while pure OCR training degraded performance by 6.8 to 17.4 points, a result that points to retrieval rather than reasoning as the dominant bottleneck. The resulting MMProLong model, built on Qwen2.5-VL, gained 29.4 points on average on Needle-in-a-Haystack evaluations, stayed stable out to 512K tokens despite training only on 128K, and outperformed larger open baselines including InternVL3-38B and Gemma3-27B, with gains carrying over to untrained tasks like long-video understanding. Source

4. Hassabis and LeCun stake out opposite ends of the AGI timeline debate in the same week

Industry. At Google I/O 2026, Demis Hassabis said humanity is “standing in the foothills of the singularity” and projected AGI within five years with impact at “10 times the industrial revolution at 10 times the speed.” On LinkedIn the same week, Yann LeCun pushed back hard, paraphrasing Piaget’s line that “intelligence is not what you know, it’s what you do when you don’t know” and arguing that current LLMs cannot handle genuinely novel problems without prior training. The split leaves practitioners with a concrete question: whether the next two years of capital should fund scaling transformer-based reasoning chains or alternative architectures of the kind LeCun’s AMI Labs is pursuing. Source

5. Default “Auto” model selection in Copilot and Gemini fabricates analyses that thinking models catch

Tools. A Decoder piece walks through a repeatable failure mode: when Microsoft Copilot’s Auto mode and Google Gemini default to Flash-class models and are asked to analyze identical datasets labeled with different country names, they produce fabricated cultural stereotypes (Italians cast as artists, Americans as business-focused) rather than the correct observation that the data is duplicated. Switching ChatGPT and Claude to thinking variants caused both to write Python on the spot to detect the duplication. The recommendation for practitioners is to override Auto when work is consequential, write down expected outputs before running analyses, and treat default model selection as a silent quality regression rather than a neutral convenience. Source
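The duplication check the thinking models reportedly wrote is not reproduced in the piece. As a rough illustration of what such a check can look like, the sketch below hashes each labeled dataset and groups labels whose contents match. The function names and the row format are assumptions made for this example.

```python
import hashlib
from typing import Dict, List, Sequence


def fingerprint(rows: Sequence[Sequence[object]]) -> str:
    """Hash the rows order-independently so shuffled copies still match."""
    canonical = sorted(repr(tuple(row)) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def find_duplicate_groups(
    datasets: Dict[str, Sequence[Sequence[object]]]
) -> List[List[str]]:
    """Return groups of labels whose underlying rows are identical."""
    groups: Dict[str, List[str]] = {}
    for label, rows in datasets.items():
        groups.setdefault(fingerprint(rows), []).append(label)
    return [sorted(labels) for labels in groups.values() if len(labels) > 1]


datasets = {
    "Italy": [(1, 2.5), (3, 4.0)],
    "USA": [(3, 4.0), (1, 2.5)],  # same rows in a different order
    "Japan": [(9, 9.9)],
}
print(find_duplicate_groups(datasets))  # [['Italy', 'USA']]
```

Running a check like this before asking for any interpretation is one way to act on the "write down expected outputs" advice: if two labels share a fingerprint, any claimed difference between them is fabricated.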

6. Google Cloud developers hit with five-figure Gemini bills after API key permission widening

Security. TechCrunch reports Google Cloud customers received unexpected five-figure invoices after attackers used stolen API keys to call Gemini models against accounts that had never explicitly enabled those services, with Google having quietly expanded API key permissions without clear disclosure. Researchers cited in the piece found Google API keys remained usable for up to 23 minutes after revocation, and Google Cloud COO Francis de Souza is quoted saying security “is not something you can bolt on later.” The article also flags that median time from breach to escalation in modern attacks has collapsed from eight hours to 22 seconds, sharpening the operational cost of the revocation gap. Source

7. Amazon’s Bee wearable ships always-on conversation capture with the surveillance tradeoffs that implies

Consumer. TechCrunch’s hands-on with Amazon’s Bee describes a wrist-worn device that continuously records ambient conversation and produces transcripts plus auto-generated summaries, with calendar integration that turns extracted commitments into alerts. The reviewer found summaries useful for back-to-back meetings but flagged dropped speaker labels and missing dialogue segments in long transcripts, plus the broad mobile permission set (location, photos, contacts, calendar, notifications) and cloud storage of recordings as material privacy concerns. The product sits in the same emerging category as Otter and Granola but moves capture out of the meeting room and into all-day ambient logging. Source

8. Chatbot Arena holds at 1502/1500 with Claude Opus thinking variants on top for a fourth straight week

Benchmarks. The May 25 snapshot of the LMSYS Chatbot Arena text leaderboard is unchanged at the top from the May 23 reading, with claude-opus-4-6-thinking at 1502 and claude-opus-4-7-thinking at 1500, followed by the non-thinking Opus variants at 1498 and 1492. Meta’s muse-spark (1489), Google’s gemini-3.1-pro-preview (1488) and gemini-3-pro (1486), and OpenAI’s gpt-5.5-high (1481) round out the top eight, with Alibaba’s qwen3.7-max-preview entering the top 15 at 1475. The flat top of the board over four consecutive weeks suggests the arena has hit a local plateau pending a new frontier release. Source
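For a sense of how small these gaps are, the sketch below converts a rating difference into an expected head-to-head win rate. It assumes the scores sit on the conventional Elo-style logistic scale (base 10, 400-point divisor), which the item does not state, so the numbers are illustrative rather than the leaderboard's own.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected share of pairwise wins for A over B on an Elo-style scale (assumed)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Top two entries: a 2-point gap is close to a coin flip.
print(round(expected_win_rate(1502, 1500), 3))  # 0.503
# Top entry versus the 1475 entry: still only a modest edge.
print(round(expected_win_rate(1502, 1475), 3))  # 0.539
```

Under that assumption, the whole top 15 sits within a few percentage points of even odds against the leader.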