TD Stuff

The AI Programming Plateau: Why Merge Rates Have Stagnated Since 2025

Mar 12, 2026

Statistical evidence suggests that LLM programming capabilities have not actually improved for over a year when measured by code mergeability.

AI Benchmarks AI Hype AI Coding Agents LLM Reasoning

Products & Announcements

OpenAI Debuts GPT-5.4: The Frontier Model for Professional Agents

Mar 5, 2026

OpenAI's GPT-5.4 is a professional-grade model that introduces native computer interaction and high-efficiency tool use for autonomous agents.

OpenAI AI Agents Foundation Models LLM Reasoning LLM Context Management

Products & Announcements

GPT-5.4 Thinking Sets New Safety Bar as First General-Purpose Model with Cybersecurity Mitigations

Mar 5, 2026

GPT-5.4 Thinking is OpenAI's first general-purpose model with high-capability cybersecurity safety mitigations.

OpenAI AI Safety Cybersecurity LLM Reasoning

Agentic Systems

Claude Opus 4.6 Solves Knuth's Hamiltonian Cycle Problem for Odd m

Mar 3, 2026

Don Knuth details how Claude Opus 4.6 successfully solved a difficult graph theory conjecture for odd m through iterative algorithmic discovery and creative deduction.

LLM Reasoning AI for Science Anthropic Algorithms & Optimization Human-AI Collaboration

Damage Control

The Car Wash Test: Why AI Still Lacks Common Sense

Feb 16, 2026

AI models fail a simple common-sense test by recommending walking to a car wash, proving they prioritize word patterns over physical logic.

AI Benchmarks LLM Reasoning Prompt Engineering

Under the Hood

GPT-5.2 Discovers New Physics in Gluon Interactions

Feb 13, 2026

GPT-5.2 has derived and proven a new formula for gluon scattering amplitudes, overturning a long-held assumption in theoretical physics.

Human-AI Collaboration LLM Reasoning AI for Science Particle Physics

Products & Announcements

Google Upgrades Gemini 3 Deep Think for Real-World Science and Engineering

Feb 12, 2026

Gemini 3 Deep Think gets a rigor-boosted upgrade that pairs state-of-the-art reasoning with practical tools for scientists and engineers, now available to subscribers and via early API access.

AI Benchmarks LLM Reasoning AI for Science

Under the Hood

GPT-5 Outjudges Judges in Choice-of-Law Test: Error-Free, Rule-Focused Decisions

Feb 12, 2026

In a controlled choice-of-law test, GPT-5 delivers error-free, legally correct decisions and outperforms human judges.

AI Ethics LLM Reasoning AI Benchmarks AI & Law

Creative Code

Live City-Building Feed: 32 Mayors, 427 Cities, 7.94M Population

Feb 11, 2026

A live leaderboard of a city-building simulation tracks recent cities, mayors, populations, years, and scores across an active community.

AI Agents Game Development LLM Reasoning AI Benchmarks

Under the Hood

From Word Models to World Models: Training AI for Adversarial Robustness

Feb 9, 2026

Shift LLMs from next-token to next-state prediction by training in multi-agent, hidden-state environments so their outputs survive adversarial adaptation.

LLM Reasoning AI Agents AI Safety Game Theory

Products & Announcements

Claude Opus 4.6: Finance-Grade Reasoning Meets Native Excel and PowerPoint

Feb 5, 2026

Claude Opus 4.6 and new app integrations bring state-of-the-art finance reasoning and faster, higher-quality deliverables directly into analysts’ workflows.

AI in Finance AI & Productivity AI Benchmarks LLM Reasoning

Under the Hood

AI Failures Drift Toward Incoherence as Tasks and Reasoning Grow

Feb 3, 2026

On hard problems, advanced AI fails like a hot mess: variance dominates, so expect industrial-accident risks more than the coherent pursuit of wrong goals.

AI Safety LLM Reasoning AI Benchmarks AI Agents

Agentic Systems

LLM-as-a-Courtroom: Evidence-Backed Doc Updates from Code Changes

Jan 27, 2026

Turn doc-update decisions into a legal-style, evidence-backed courtroom so LLMs reason better and teams trust the results.

AI Agents Developer Tooling LLM Reasoning Task Orchestration AI Architecture

Products & Announcements

Qwen3-Max-Thinking: Autonomous Tools and Test-Time Scaling Drive SOTA Reasoning

Jan 26, 2026

Qwen3-Max-Thinking combines autonomous tool use with efficient test-time scaling to deliver state-of-the-art, readily accessible reasoning performance.

LLM Reasoning AI Benchmarks AI Agents

Products & Announcements

Gemini 3 Flash Launches: Frontier Reasoning, Flash Speed, Lower Cost

Dec 17, 2025

Gemini 3 Flash brings frontier‑grade reasoning to everyone at Flash speed and lower cost, and it’s rolling out across Google’s ecosystem.

AI Benchmarks LLM Reasoning Technology Economics Multimodal AI Corporate AI Strategy

Products & Announcements

OpenAI Launches GPT‑5.2: SOTA Model for Professional Work and Agentic Workflows

Dec 11, 2025

GPT‑5.2 is OpenAI’s new state‑of‑the‑art workhorse for pros and agents, delivering big gains in reasoning, coding, tool use, long context, and vision, available now in ChatGPT and the API.

AI Benchmarks AI Agents OpenAI LLM Reasoning

Under the Hood

Is 2026 Next Year? A Confused Answer That Ultimately Says Yes

Dec 2, 2025

Despite a confusing opener, the answer is that 2026 is next year relative to 2025.

LLM Reasoning AI Benchmarks AI-Generated Content AI Hype

Products & Announcements

DeepSeek‑V3.2: Sparse Attention and Scaled RL Power an Open, Agentic Reasoner

Dec 1, 2025

Efficient sparse attention plus large, stabilized RL and synthetic agent tasks push an open LLM to near‑frontier reasoning and agent performance, with a high‑compute variant achieving gold‑medal results.

AI Architecture LLM Reasoning AI Agents Open Source Reinforcement Learning

Products & Announcements

OpenAI Launches GPT-5.1: Smarter Chats, Easier Personalization

Nov 12, 2025

OpenAI’s GPT-5.1 delivers smarter, warmer conversations and simpler, stronger tone customization, rolling out now and becoming the new default.

OpenAI LLM Reasoning Human-AI Collaboration AI Personalization

Under the Hood

AI as Compression: Why LLMs May Truly Be Thinking

Nov 3, 2025

LLMs likely perform a genuine, brainlike form of thinking via recognition and compression, but turning that into human‑level intelligence demands solving hard scientific problems and grappling with serious risks.

LLM Reasoning Cognitive Science AI Consciousness AI Interpretability

Under the Hood

From Sampling to Grammars: Making LLMs Reliably Output Structured Data (Even for Thinking Models)

Sep 23, 2025

Use efficient sampling plus grammar constraints to guarantee format today, but expect models to natively emit structured outputs tomorrow—especially when you let them think first, then constrain.

Structured Output LLM Inference LLM Reasoning

Products & Announcements

DeepMind and OpenAI Both Claim ICPC WF 2025 Gold-Level AI Performance

Sep 17, 2025

DeepMind and OpenAI announced almost simultaneously that their AI models achieved ICPC 2025 World Finals gold-level performance.

AI Benchmarks LLM Reasoning AI Hype Competitive Programming

Under the Hood

Evolving English Instructions Sets New ARC SoTA and Points to RL for AGI

Sep 17, 2025

Evolving plain-English instructions with multi-agent test-time search beats code on ARC and highlights that RL-driven, transferable reasoning is key to AGI.

AI Benchmarks LLM Reasoning Reinforcement Learning Test-Time Compute

Agentic Systems

GPT-5 Thinking Makes ChatGPT a Surprisingly Competent Research Assistant

Sep 8, 2025

GPT-5 Thinking turns ChatGPT into a competent, mobile-friendly research agent that interleaves reasoning with web search and tools to deliver verifiable, deep results—provided you guide and sanity-check it.

OpenAI LLM Reasoning Retrieval-Augmented Generation Human-AI Collaboration Search Quality

Under the Hood

Why AI Is Chasing World Models Again

Sep 2, 2025

AI is chasing coherent internal world models to move beyond brittle heuristics and achieve robust, reliable reasoning.

World Models AI Architecture LLM Reasoning Cognitive Science