AI News: June 15, 2026

1. SWE-Explore Benchmark Finds AI Coding Agents Miss 81-86 Percent of Critical Code Lines

Shanghai Jiao Tong University researchers. A new benchmark called SWE-Explore, evaluated across 848 problems in 203 open-source projects spanning ten programming languages, found that AI coding agents from OpenAI, Anthropic, Google, and others correctly locate the relevant source files but identify only 14 to 19 percent of the specific lines that actually need modification. The study identified a critical threshold: repair success does not improve gradually with coverage; agents must access at least 50 to 75 percent of the critical code regions before any repair succeeds. Upgrading the underlying language model does not resolve the pattern, suggesting future work should focus on broader code exploration rather than restrictive filtering. Source

2. Amazon and Five Tech Companies Triggered the US Government Crackdown on Anthropic’s Fable Model

Amazon. Amazon CEO Andy Jassy and at least five unnamed technology companies reported alleged security vulnerabilities in Anthropic’s Claude Fable 5 to Trump administration officials, prompting a chain of events that ended with a forced shutdown. The White House initially sought voluntary withdrawal; when Anthropic declined, the administration issued an export control order with a 90-minute compliance window, and Anthropic shut down Fable by 10 p.m. the same day. Amazon, one of Anthropic’s largest investors and its primary cloud provider, claimed researchers had used Fable to obtain information useful for cyberattacks, a characterization cybersecurity experts disputed as overstated. Source

3. Google Research’s Gemini-SQL2 Sets a New Record on the BIRD Text-to-SQL Benchmark

Google Research. The company’s Gemini-SQL2 system, built on Gemini 3.1 Pro, achieved an execution accuracy of 80.04 percent on the BIRD benchmark, outpacing OpenAI’s GPT-5.5-xhigh at 72.8 percent and Anthropic’s Claude Opus 4.6 at 70.9 percent. The benchmark tests conversion of natural language questions into executable SQL queries across databases with complex business logic. Google has not released the model publicly or published a research paper accompanying the result. Source

4. OpenAI and Anthropic Both File Confidentially for IPO as AI Companies Race to Go Public

OpenAI and Anthropic. Both AI labs have filed confidentially for initial public offerings, setting up a competitive race to market in a climate where SpaceX’s debut, the largest IPO in history with shares priced at $135, made AI and deep tech the dominant focus of public market investors. Analysts note limited capital availability creates pressure for one company to go first, as a successful debut by either could crowd out the other’s window. The shift is reconfiguring market benchmarks, with observers replacing the “FAANG” shorthand with “MANGOS” (Meta, Anthropic, NVIDIA, Google, OpenAI, SpaceX) to reflect the new investor priorities. Source

5. Satya Nadella Warns Against Token-Maxing While Admitting He Does It

Microsoft CEO Satya Nadella. In an interview on the Hard Fork podcast, Nadella cautioned developers against indiscriminately using frontier AI models for all tasks, arguing that productivity gains must justify the computational cost. He then acknowledged his own contradiction: “I’m like a token-maxer too. So it is addictive.” Separately in the same interview, Nadella described a future in which programmers oversee hundreds or thousands of AI agents instead of writing code directly, and introduced the concept of “cognitive coverage,” the ability to deeply understand agent-generated code without having written it. Source

6. Claude Fable 5 Scores 88 Percent on FrontierMath’s Hardest Problems, 13 Points Ahead of GPT-5.5

Anthropic. According to Epoch AI data, Claude Fable 5 achieved 87 percent accuracy on FrontierMath tiers 1 through 3 and 88 percent on tier 4 (v2), the benchmark’s most difficult level. GPT-5.5 reached approximately 75 percent on tier 4, a 13-point gap, while Anthropic’s own predecessor model Claude Opus 4.5 had scored below 10 percent on the same tier earlier in 2026. FrontierMath tests advanced mathematical reasoning with novel, expert-crafted problems not present in any training corpus. Source