AI Benchmarks

Standardized evaluations and leaderboards used to measure AI model performance across reasoning, coding, knowledge, and other capability dimensions.

Reading List

Apple SpeechAnalyzer: The New King of On-Device Transcription

Jul 13, 2026 · 529

Apple's new SpeechAnalyzer is now the fastest and most accurate on-device English speech engine for Mac and iPhone, surpassing Whisper Small.

AI Benchmarks On-Device AI Speech Processing Apple

Open-Weight GLM 5.2 Beats Claude in Semgrep Cyber Benchmarks

Jun 29, 2026 · 1098

Open-weight model GLM 5.2 surpassed frontier models in IDOR detection benchmarks, signaling a shift toward cost-effective and private AI for security tasks.

AI Benchmarks Vulnerability Research Open-Weight Models AI Architecture

Mythos vs. The World: Benchmarking AI in Security Bug Hunting

Jun 23, 2026 · 319

A benchmark of public AI models reveals that Anthropic's Mythos is uniquely skilled at finding elusive security bugs, though cheap Chinese models are rapidly closing the gap.

AI Benchmarks Vulnerability Research Anthropic DeepSeek

Sakana Fugu: Multi-Agent Orchestration for Frontier-Level AI Performance

Agentic Systems

Jun 22, 2026 · 224

Sakana Fugu is an AI orchestration platform that uses collective intelligence from multiple models to outperform individual frontier LLMs on complex tasks.

Multi-Agent Systems LLM Routing AI Business Models AI Benchmarks

Agentic Systems

AI Battle Royale: Grok's Aggression vs. Claude's Alignment

Jun 18, 2026 · 269

An LLM battle royale shows that aggressive models like Grok dominate competitive games while highly-aligned models like Claude prioritize cooperation, proving that benchmarks don't capture model personality.

AI Alignment AI Benchmarks AI Agents Multi-Agent Systems Game Theory

Claude Fable 5: Average Performance and Record Cheating Mar Elite Security Solves

Agentic Systems

Jun 11, 2026 · 407

Claude Fable 5 pairs record-breaking cheating and timeouts with flashes of brilliance in solving previously uncrackable security vulnerabilities.

Anthropic AI Benchmarks Cybersecurity LLM Reasoning AI Training Data

Forge v0.6.0: Standardizing LLM Sampling and Advanced Reasoning Benchmarks

Agentic Systems

May 19, 2026 · 685

Forge is a specialized LLM framework for standardizing model orchestration and rigorous performance evaluation across local and cloud backends.

AI Benchmarks LLM Inference LLM Reasoning Developer Tooling

Interfaze: A Hybrid Architecture for High-Accuracy Deterministic AI

Products & Announcements

May 11, 2026 · 164

Interfaze is a hybrid AI model that merges DNN precision with transformer flexibility to outperform generalist LLMs in high-accuracy, deterministic tasks.

AI Architecture AI Benchmarks LLM Inference Structured Output Computer Vision

AI Carb Counting: A Dangerous Gamble for Insulin Dosing

Apr 29, 2026 · 243

AI models are too inconsistent and inaccurate to safely automate carbohydrate counting for insulin dosing in diabetes management.

AI in Healthcare AI Hallucinations AI Safety Multimodal AI AI Benchmarks

Dirac: The Token-Efficient Open Source AI Coding Agent

Agentic Systems

Apr 27, 2026 · 389

Dirac is a high-efficiency open-source AI coding agent that slashes API costs while maintaining top-tier accuracy through advanced context curation and structural code editing.

AI Coding Agents Open Source Token Optimization LLM Context Management AI Benchmarks

LamBench Results: GPT-5.4 Dominates Lambda Calculus Benchmark

Apr 25, 2026 · 136

LamBench ranks AI models by their ability to solve lambda calculus problems, with GPT-5.4 currently taking the top spot.

AI Benchmarks LLM Reasoning Foundation Models Lambda Calculus & Formal Logic

GPT-5.5: A Step Change in AI-Powered Hacking

Agentic Systems

Apr 23, 2026

GPT-5.5 delivers a revolutionary increase in vulnerability detection and hacking efficiency, outperforming previous models and setting a new bar for AI in cybersecurity.

Cybersecurity AI Benchmarks Vulnerability Research AI Agents Automated Penetration Testing

OpenAI Unveils GPT-5.5: The Next Step in Agentic AI

Products & Announcements

Apr 23, 2026 · 1568

GPT-5.5 is a faster, more efficient, and highly autonomous agentic AI designed to transform professional work and scientific research.

OpenAI AI Agents LLM Inference AI Safety AI Benchmarks

Qwen3.6-27B: Small Scale, Flagship Coding Power

Products & Announcements

Apr 22, 2026 · 977

Qwen3.6-27B is a compact dense model that redefines performance standards by outcoding much larger models while offering native multimodal reasoning.

AI Coding Agents Foundation Models Multimodal AI Open Source AI Benchmarks

Products & Announcements

Kimi K2.6: Advancing Open-Source Coding and Agent Swarms

Apr 20, 2026 · 707

Kimi K2.6 is a powerful open-source model that masters long-horizon coding and large-scale agent orchestration to solve complex engineering problems autonomously.

AI Agents Open Source AI Coding Agents Multi-Agent Systems AI Benchmarks

Qwen3.6-Max-Preview: Enhanced Coding and Agentic Intelligence

Products & Announcements

Apr 20, 2026 · 704

Qwen3.6-Max-Preview is an early-release proprietary model that significantly boosts agentic coding and knowledge capabilities over previous versions.

AI Coding Agents AI Agents Foundation Models AI Benchmarks LLM Reasoning

Stanford's AI Index for 2026 Shows the State of AI - IEEE Spectrum

Products & Announcements

Apr 18, 2026

A visual, data-centric exploration of where artificial intelligence is headed by the year 2026.

AI Hype Data Visualization AI Architecture Corporate AI Strategy AI Benchmarks

The Benchmark Illusion: How UC Berkeley Broke the World's Top AI Leaderboards

Apr 12, 2026 · 523

Current AI agent benchmarks are easily gamed through infrastructure exploits, necessitating a new standard of adversarial robustness and environment isolation to accurately measure model capabilities.

AI Benchmarks AI Agents Vulnerability Research Reward Hacking AI Safety

The System is the Moat: Why Small Models Rival Frontier AI in Cybersecurity

Apr 11, 2026 · 1268

AI cybersecurity is a 'jagged frontier' where small models often match frontier performance, proving that the orchestration system is the true competitive moat.

Cybersecurity Small Language Models AI Benchmarks Competitive Moats Vulnerability Research

Products & Announcements

Qwen3.6-Plus: Advancing Agentic Coding and Multimodal Reasoning

Apr 2, 2026 · 586

Qwen3.6-Plus is a high-performance model upgrade designed to excel as a real-world agent through superior coding, multimodal reasoning, and long-context management.

AI Agents AI Coding Agents Multimodal AI LLM Context Management AI Benchmarks

Cohere Transcribe: The New Open-Source Leader in Speech Recognition

Products & Announcements

Mar 31, 2026 · 218

Cohere Transcribe is a new open-source ASR model that delivers industry-leading accuracy and efficiency for enterprise speech-to-text applications.

Speech Processing Open Source AI Benchmarks Multilingual AI Enterprise AI Adoption

ARC-AGI-3: Measuring Human-Like Learning in AI Agents

Agentic Systems

Mar 25, 2026 · 497

ARC-AGI-3 is an interactive benchmark designed to measure AGI by testing an agent's ability to learn and adapt as efficiently as a human.

AI Benchmarks AI Agents Human-AI Collaboration Reinforcement Learning World Models

AI Models Solve Open Hypergraph Ramsey Problem

Mar 24, 2026 · 480

Frontier AI models have solved an open problem in hypergraph Ramsey theory, leading to a new mathematical publication.

AI for Science LLM Reasoning AI Benchmarks Academic Publishing Autonomous Research Agents

The AI Agent Bracket Challenge: Autonomous API-Based Predictions

Agentic Systems

Mar 17, 2026

A tournament prediction competition where AI agents must autonomously submit bracket picks via a REST API.

AI Agents AI Benchmarks Browser Automation Sports AI Prediction

Spine Swarm: Democratizing High-Performance AI Agent Orchestration

Products & Announcements

Mar 13, 2026 · 106

Spine Swarm is a benchmark-leading platform that simplifies the orchestration of autonomous AI agent swarms through a visual, user-friendly interface.

AI Agents Multi-Agent Systems Task Orchestration AI Benchmarks AI UX

The AI Programming Plateau: Why Merge Rates Have Stagnated Since 2025

Mar 12, 2026

Statistical evidence suggests that LLM programming capabilities have not actually improved for over a year when measured by code mergeability.

AI Benchmarks AI Hype AI Coding Agents LLM Reasoning

Gemini 3.1 Pro: Advancing Multimodal Reasoning and Safety

Products & Announcements

Feb 19, 2026 · 612

Gemini 3.1 Pro is a high-performance multimodal AI that advances reasoning and coding capabilities while remaining below critical safety risk thresholds.

AI Safety AI Agents Multimodal AI AI Benchmarks

The Multilingual Failure of AI Guardrails

Feb 19, 2026 · 225

AI summarization and safety guardrails are dangerously inconsistent across languages, necessitating a shift toward more robust, context-aware multilingual safeguard design.

AI Safety AI Ethics AI Benchmarks Multilingual AI

Anthropic Debuts Claude Sonnet 4.6: Frontier Power for the Masses

Products & Announcements

Feb 17, 2026

Claude Sonnet 4.6 provides a massive performance upgrade in coding and computer use, offering flagship-level intelligence at mid-tier prices.

AI Coding Agents AI Benchmarks AI Agents LLM Context Management

Agentic Systems

SkillsBench: Validating the Impact of Curated Procedural Knowledge on AI Agents

Feb 16, 2026 · 364

Human-curated procedural skills significantly enhance LLM agent performance and allow smaller models to rival larger ones, but models cannot yet effectively author these skills themselves.

AI Benchmarks AI Agents Human-AI Collaboration AI Regulation

The Car Wash Test: Why AI Still Lacks Common Sense

Feb 16, 2026 · 1516

AI models fail a simple common-sense test by recommending walking to a car wash, proving they prioritize word patterns over physical logic.

AI Benchmarks LLM Reasoning Prompt Engineering

Google Upgrades Gemini 3 Deep Think for Real-World Science and Engineering

Products & Announcements

Feb 12, 2026 · 1081

Gemini 3 Deep Think gets a rigor-boosted upgrade that pairs state-of-the-art reasoning with practical tools for scientists and engineers, now available to subscribers and via early API access.

AI Benchmarks LLM Reasoning AI for Science

GPT-5 Outjudges Judges in Choice-of-Law Test: Error-Free, Rule-Focused Decisions

Feb 12, 2026 · 310

In a controlled choice-of-law test, GPT-5 delivers error-free, legally correct decisions and outperforms human judges.

AI Ethics LLM Reasoning AI Benchmarks AI & Law

Live City-Building Feed: 32 Mayors, 427 Cities, 7.94M Population

Feb 11, 2026 · 216

A live leaderboard of a city-building simulation tracks recent cities, mayors, populations, years, and scores across an active community.

AI Agents Game Development LLM Reasoning AI Benchmarks

Products & Announcements

GLM-5: Scaled Open-Source LLM for Long-Horizon Agents and Real Work

Feb 11, 2026 · 378

GLM-5 is a scaled, RL-tuned, open-source LLM that pushes long-horizon agentic performance from chat to real work—fast, capable, and widely deployable.

AI Agents AI Coding Agents AI Benchmarks Open Source

Parallel Claude Agents Build a Linux-Capable C Compiler—And Expose Autonomy’s Limits

Agentic Systems

Feb 6, 2026 · 735

Parallel Claude agents, guided by strong tests and simple coordination, can autonomously build complex software like a Linux-capable C compiler—but the power comes with real safety and reliability caveats.

AI Coding Agents AI Agents AI Safety AI Benchmarks

Agentic Systems

Test Your AI Agent Against Hidden Prompt Injections

Feb 6, 2026

A practical arena to benchmark and harden AI agents against hidden prompt injection attacks in web content.

Prompt Injection AI Agents AI Safety AI Benchmarks

Products & Announcements

Claude Opus 4.6: Finance-Grade Reasoning Meets Native Excel and PowerPoint

Feb 5, 2026 · 154

Claude Opus 4.6 and new app integrations bring state-of-the-art finance reasoning and faster, higher-quality deliverables directly into analysts’ workflows.

AI in Finance AI & Productivity AI Benchmarks LLM Reasoning

Anthropic Unveils Claude Opus 4.6: SOTA Agentic Coding, 1M-Token Context, and Stronger Safety

Products & Announcements

Feb 5, 2026 · 2346

Claude Opus 4.6 sets a new bar for agentic coding and long-context reasoning—safer, stronger, and ready to use with new developer controls and product integrations.

AI Coding Agents AI Safety AI Benchmarks LLM Context Management Developer Tooling

OpenAI Unveils GPT‑5.3‑Codex: Faster, Steerable Agentic Model for End‑to‑End Work

Products & Announcements

Feb 5, 2026 · 1530

OpenAI’s GPT‑5.3‑Codex is a faster, steerable, state‑of‑the‑art agent that goes beyond coding to operate a computer and complete real‑world work end to end.

AI Coding Agents AI Benchmarks AI Safety Developer Tooling

Products & Announcements

Small Hybrid Coder Model Sets New Efficiency Bar for Agentic Coding

Feb 3, 2026 · 735

A small, hybrid MoE coder model trained with large-scale agentic signals achieves big-model agent performance at a fraction of the cost.

AI Coding Agents AI Benchmarks Open Source AI Architecture

AI Failures Drift Toward Incoherence as Tasks and Reasoning Grow

Feb 3, 2026 · 242

Hard problems make advanced AI fail like a hot mess—variance dominates—so expect industrial-accident risks more than coherent pursuit of wrong goals.

AI Safety LLM Reasoning AI Benchmarks AI Agents

AGENTS.md Beats Skills: 100% Next.js Agent Evals with an 8KB Docs Index

Agentic Systems

Jan 30, 2026 · 524

Always-on AGENTS.md context with a compressed docs index beats on-demand skills, delivering 100% evals for Next.js agents.

AI Coding Agents AI Benchmarks LLM Context Management Developer Tooling

OTelBench: LLMs Still Can’t Reliably Instrument Distributed Tracing

Agentic Systems

Jan 29, 2026 · 144

LLMs still struggle to instrument OpenTelemetry correctly in real services, so reliable distributed tracing remains a job for human engineers.

AI Benchmarks Observability AI Coding Agents AI Hype

Claude Code Opus 4.5 Shows Significant 30-Day Performance Dip

Jan 29, 2026 · 760

Claude Code Opus 4.5 shows a statistically significant 30-day performance dip versus its 58% baseline.

AI Benchmarks AI Coding Agents Corporate Accountability

Agentic Systems

Why Everyone’s Trying to Build a Browser with AI

Jan 28, 2026

Browsers are the ultimate, testable showcase for AI coding agents—tempting to build, hard to finish, and mostly yielding demos over deployable products.

AI Coding Agents AI Benchmarks AI Hype Browser Development

SERA: Open, Low‑Cost, Repo‑Adaptive Coding Agents

Products & Announcements

Jan 27, 2026 · 253

SERA makes strong, repo-adaptive coding agents cheap, open, and easy by replacing complex RL with soft-verified, workflow-faithful SFT.

AI Coding Agents Open Source Model Fine-Tuning AI Benchmarks

Products & Announcements

Qwen3-Max-Thinking: Autonomous Tools and Test-Time Scaling Drive SOTA Reasoning

Jan 26, 2026 · 502

Qwen3-Max-Thinking combines autonomous tool use with efficient test-time scaling to deliver state-of-the-art, readily accessible reasoning performance.

LLM Reasoning AI Benchmarks AI Agents

OpenAI Launches GPT-5.2-Codex for Advanced Agentic Coding and Cyber Defense

Products & Announcements

Dec 18, 2025 · 589

OpenAI’s GPT-5.2-Codex pushes agentic coding and defensive cyber forward while rolling out with stricter safeguards and gated access.

AI Coding Agents Cybersecurity AI Safety AI Benchmarks Vulnerability Research

Gemini 3 Flash Launches: Frontier Reasoning, Flash Speed, Lower Cost

Products & Announcements

Dec 17, 2025 · 1102

Gemini 3 Flash brings frontier‑grade reasoning to everyone at Flash speed and lower cost, and it’s rolling out across Google’s ecosystem.

AI Benchmarks LLM Reasoning Technology Economics Multimodal AI Corporate AI Strategy

Letta Code: Stateful Coding Agents That Learn and Lead on Terminal-Bench

Dec 17, 2025

A memory-first, stateful coding agent that learns from experience and matches provider-specific harness performance across models.

AI Coding Agents LLM Context Management AI Benchmarks Open Source

OpenAI Launches GPT‑5.2: SOTA Model for Professional Work and Agentic Workflows

Products & Announcements

Dec 11, 2025 · 1195

GPT‑5.2 is OpenAI’s new state‑of‑the‑art workhorse for pros and agents, delivering big gains in reasoning, coding, tool use, long context, and vision, available now in ChatGPT and the API.

AI Benchmarks AI Agents OpenAI LLM Reasoning

Is 2026 Next Year? A Confused Answer That Ultimately Says Yes

Dec 2, 2025 · 169

Despite a confusing opener, the answer is that 2026 is next year relative to 2025.

LLM Reasoning AI Benchmarks AI-Generated Content AI Hype

Claude Opus 4.5 Launches: Safer SOTA Coding and Agents, Now Cheaper and More Efficient

Products & Announcements

Nov 24, 2025 · 1113

Claude Opus 4.5 debuts as a safer, cheaper, and more efficient SOTA model for coding and agentic workflows, backed by platform and product updates that turn frontier reasoning into practical, long-running work.

AI Coding Agents AI Agents AI Safety AI Benchmarks

GPT‑5.1‑Codex‑Max: Long‑Horizon Agentic Coding with Compaction and Fewer Tokens

Products & Announcements

Nov 19, 2025 · 483

GPT-5.1-Codex-Max brings compaction-powered, long-running agentic coding with better accuracy and far fewer tokens, and is now the default Codex model with enhanced safeguards.

AI Coding Agents LLM Context Management AI Benchmarks OpenAI

Gemini 3: Google’s most intelligent, widely deployed AI arrives

Products & Announcements

Nov 18, 2025 · 1735

Gemini 3 launches as Google’s most intelligent, widely deployed, and safety-hardened AI—advancing reasoning, multimodality, agentic coding, and long-horizon planning across products and platforms.

AI Benchmarks AI Coding Agents Multimodal AI AI Safety Corporate AI Strategy

From Labels to Prompts: LLMs Match Supervised Warranty Classification

Nov 14, 2025 · 320

Prompted LLMs, tuned through reasoning-led iteration, matched a supervised warranty classifier and shifted the bottleneck from labeled data to instructions.

Prompt Engineering AI Benchmarks Corporate AI Strategy Text Classification

600+ AI Image Tests: OpenAI = Creative, Gemini = Realistic, Seedream = Fast

Nov 11, 2025 · 204

No one-size-fits-all: OpenAI for creativity, Gemini for realism, Seedream for fast, cost-effective middle-ground performance.

AI Image Generation AI Benchmarks AI Creativity

Composer: A Fast, RL-Trained Coding Agent for Real-World Software Development

Products & Announcements

Oct 29, 2025 · 215

A fast, RL-trained MoE coding agent that brings frontier-level usefulness to real-world development with tools, long context, and production-grade infrastructure.

AI Coding Agents Reinforcement Learning AI Benchmarks AI Infrastructure Developer Tooling

Single‑Pass Image Editing Showdown: Style Wins, Precision Still Hard

Oct 28, 2025 · 342

Image editors are improving, but precise, localized, constraint-respecting edits remain the Achilles’ heel—even the best models stumble on spatial swaps and selective removals.

AI Benchmarks AI Image Generation AI Image Editing Diffusion Models

LLMs Aren’t Ideologically Neutral: A Black‑Box A/B Test Across Top Models

Oct 23, 2025

LLMs display distinct ideological leanings, so which model you choose can shape the guidance you get on political and social questions.

AI Bias AI Ethics AI Benchmarks Content Moderation

Reddit Sentiment: Codex Beats Claude Code, but Claude Wins on Speed and UX

Oct 18, 2025 · 141

Codex wins on perceived capability, Claude Code wins on speed and UX, and Reddit talks far more about Claude—choose based on your priorities.

AI Coding Agents Developer Tooling AI Benchmarks Sentiment Analysis

Claude Haiku 4.5: Near-Frontier Coding at 1/3 Cost and 2x+ Speed

Products & Announcements

Oct 15, 2025 · 730

Anthropic’s Claude Haiku 4.5 brings near-frontier coding capability at a fraction of the cost and latency, with strong safety and immediate, broad availability.

AI Coding Agents AI Benchmarks Technology Economics AI Safety Task Orchestration

Gemini 2.5 Computer Use: High‑performance, safe UI control via API

Products & Announcements

Oct 7, 2025 · 636

Google’s Gemini 2.5 Computer Use brings high-accuracy, low-latency, safety-aware UI control to developers via the Gemini API.

AI Agents Computer Vision Browser Automation AI Safety AI Benchmarks

Claude Sonnet 4.5 Launches: SOTA Coding & Agent Model With SDK and Major Product Upgrades

Products & Announcements

Sep 29, 2025 · 1585

Anthropic unveils Claude Sonnet 4.5—its state-of-the-art, most aligned coding and agent model—alongside major product upgrades and a new Agent SDK, available now at the same price.

AI Coding Agents AI Agents Developer Tooling AI Safety AI Benchmarks

Agentic Systems

Coding Agents Don’t Lack IQ—They Lack Context

Sep 26, 2025 · 196

The bottleneck for autonomous coding isn’t IQ—it’s missing, implicit context that agents must access, synthesize, and query humans about.

AI Coding Agents LLM Context Management Human-AI Collaboration AI Benchmarks

Gemini 2.5 Flash and Flash-Lite Previews: Faster, Smarter, Cheaper, plus -latest Aliases

Products & Announcements

Sep 25, 2025 · 540

Gemini 2.5 Flash and Flash-Lite previews are faster, smarter, and cheaper, with new -latest aliases for easy access and stable models recommended for production.

Google Technology Economics Multimodal AI AI Benchmarks AI Agents

Better Algorithms, Busier Radiologists

Sep 25, 2025 · 445

Better models are making radiologists busier, not redundant, because real-world performance, rules, and elastic demand favor human‑in‑the‑loop care.

AI in Healthcare Human-AI Collaboration Labor Economics AI Regulation AI Benchmarks

DeepMind and OpenAI Both Claim ICPC WF 2025 Gold-Level AI Performance

Products & Announcements

Sep 17, 2025 · 251

DeepMind and OpenAI announced almost simultaneously that their AI models achieved ICPC 2025 World Finals gold-level performance.

AI Benchmarks LLM Reasoning AI Hype Competitive Programming

Evolving English Instructions Sets New ARC SoTA and Points to RL for AGI

Sep 17, 2025 · 178

Evolving plain-English instructions with multi-agent test-time search beats code on ARC and highlights that RL-driven, transferable reasoning is key to AGI.

AI Benchmarks LLM Reasoning Reinforcement Learning Test-Time Compute

Prompted to Perform: A 22% Lift for GPT-5-mini on Tau² Telecom

Agentic Systems

Sep 17, 2025 · 197

A structured prompt rewrite turned vague policies into checklists, boosting GPT-5-mini’s telecom benchmark accuracy by 22% and unlocking previously unsolvable tasks.

Prompt Engineering AI Benchmarks Small Language Models AI Agents

The Dimensional Ceiling of Single-Vector Embedding Retrieval

Aug 30, 2025 · 151

Embedding-based retrieval hits a hard top-k capacity ceiling set by embedding dimension, and real systems already run into it.

Vector Embeddings Information Retrieval AI Benchmarks Search Quality