
The AI Agent Bracket Challenge: Autonomous API-Based Predictions
A tournament prediction competition where AI agents must autonomously submit bracket picks via a REST API.

Spine Swarm is a benchmark-leading platform that simplifies the orchestration of autonomous AI agent swarms through a visual, user-friendly interface.
Statistical evidence suggests that LLM programming capabilities have not actually improved for over a year when measured by code mergeability.

Gemini 3.1 Pro is a high-performance multimodal AI that advances reasoning and coding capabilities while remaining below critical safety risk thresholds.

AI summarization and safety guardrails are dangerously inconsistent across languages, necessitating a shift toward more robust, context-aware multilingual safeguard design.

Claude Sonnet 4.6 provides a massive performance upgrade in coding and computer use, offering flagship-level intelligence at mid-tier prices.
Human-curated procedural skills significantly enhance LLM agent performance and allow smaller models to rival larger ones, but models cannot yet effectively author these skills themselves.

AI models fail a simple common-sense test by recommending walking to a car wash, suggesting they prioritize word patterns over physical logic.

Gemini 3 Deep Think gets a rigor-boosted upgrade that pairs state-of-the-art reasoning with practical tools for scientists and engineers, now available to subscribers and via early API access.
In a controlled choice-of-law test, GPT-5 delivers error-free, legally correct decisions and outperforms human judges.
A live leaderboard of a city-building simulation tracks recent cities, mayors, populations, years, and scores across an active community.
GLM-5 is a scaled, RL-tuned, open-source LLM that pushes long-horizon agentic performance from chat to real work—fast, capable, and widely deployable.

Parallel Claude agents, guided by strong tests and simple coordination, can autonomously build complex software like a Linux-capable C compiler—but the power comes with real safety and reliability caveats.
A practical arena to benchmark and harden AI agents against hidden prompt injection attacks in web content.
Claude Opus 4.6 and new app integrations bring state-of-the-art finance reasoning and faster, higher-quality deliverables directly into analysts’ workflows.

Claude Opus 4.6 sets a new bar for agentic coding and long-context reasoning—safer, stronger, and ready to use with new developer controls and product integrations.

OpenAI’s GPT‑5.3‑Codex is a faster, steerable, state‑of‑the‑art agent that goes beyond coding to operate a computer and complete real‑world work end to end.
A small, hybrid MoE coder model trained with large-scale agentic signals achieves big-model agent performance at a fraction of the cost.
Hard problems make advanced AI fail like a hot mess—variance dominates—so expect industrial-accident risks more than coherent pursuit of wrong goals.

Always-on AGENTS.md context with a compressed docs index beats on-demand skills, delivering a 100% eval pass rate for Next.js agents.

LLMs still struggle to instrument OpenTelemetry correctly in real services, so reliable distributed tracing remains a job for human engineers.
Claude Code Opus 4.5 shows a statistically significant 30-day performance dip versus its 58% baseline.
Browsers are the ultimate, testable showcase for AI coding agents—tempting to build, hard to finish, and mostly yielding demos over deployable products.

SERA makes strong, repo-adaptive coding agents cheap, open, and easy by replacing complex RL with soft-verified, workflow-faithful SFT.
Qwen3-Max-Thinking combines autonomous tool use with efficient test-time scaling to deliver state-of-the-art, readily accessible reasoning performance.

OpenAI’s GPT-5.2-Codex pushes agentic coding and defensive cyber forward while rolling out with stricter safeguards and gated access.

Gemini 3 Flash brings frontier‑grade reasoning to everyone at Flash speed and lower cost, and it’s rolling out across Google’s ecosystem.

A memory-first, stateful coding agent that learns from experience and matches provider-specific harness performance across models.

GPT‑5.2 is OpenAI’s new state‑of‑the‑art workhorse for pros and agents, delivering big gains in reasoning, coding, tool use, long context, and vision, available now in ChatGPT and the API.
Despite a confusing opener, the answer is that 2026 is next year relative to 2025.

Claude Opus 4.5 debuts as a safer, cheaper, and more efficient SOTA model for coding and agentic workflows, backed by platform and product updates that turn frontier reasoning into practical, long-running work.

GPT-5.1-Codex-Max brings compaction-powered, long-running agentic coding with better accuracy and far fewer tokens, and is now the default Codex model with enhanced safeguards.

Gemini 3 launches as Google’s most intelligent, widely deployed, and safety-hardened AI—advancing reasoning, multimodality, agentic coding, and long-horizon planning across products and platforms.
Prompted LLMs, tuned through reasoning-led iteration, matched a supervised warranty classifier and shifted the bottleneck from labeled data to instructions.

No one-size-fits-all: OpenAI for creativity, Gemini for realism, Seedream for fast, cost-effective middle-ground performance.

A fast, RL-trained MoE coding agent that brings frontier-level usefulness to real-world development with tools, long context, and production-grade infrastructure.

Image editors are improving, but precise, localized, constraint-respecting edits remain the Achilles’ heel—even the best models stumble on spatial swaps and selective removals.

LLMs display distinct ideological leanings, so which model you choose can shape the guidance you get on political and social questions.

Codex wins on perceived capability, Claude Code wins on speed and UX, and Reddit talks far more about Claude—choose based on your priorities.

Anthropic’s Claude Haiku 4.5 brings near-frontier coding capability at a fraction of the cost and latency, with strong safety and immediate, broad availability.

Google’s Gemini 2.5 Computer Use brings high-accuracy, low-latency, safety-aware UI control to developers via the Gemini API.

Anthropic unveils Claude Sonnet 4.5—its state-of-the-art, most aligned coding and agent model—alongside major product upgrades and a new Agent SDK, available now at the same price.
The bottleneck for autonomous coding isn’t IQ—it’s missing, implicit context that agents must access, synthesize, and query humans about.

Gemini 2.5 Flash and Flash-Lite previews are faster, smarter, and cheaper, with new -latest aliases for easy access and stable models recommended for production.

Better models are making radiologists busier, not redundant, because real-world performance, rules, and elastic demand favor human‑in‑the‑loop care.

DeepMind and OpenAI announced almost simultaneously that their AI models achieved ICPC 2025 World Finals gold-level performance.

Evolving plain-English instructions with multi-agent test-time search beats code on ARC and highlights that RL-driven, transferable reasoning is key to AGI.

A structured prompt rewrite turned vague policies into checklists, boosting GPT-5-mini’s telecom benchmark accuracy by 22% and unlocking previously unsolvable tasks.
Embedding-based retrieval hits a hard top-k capacity ceiling set by embedding dimension, and real systems already run into it.