LamBench Results: GPT-5.4 Dominates Lambda Calculus Benchmark

LamBench evaluates the intelligence of 21 AI models by testing them on 120 lambda calculus tasks. GPT-5.4 leads the rankings with a 91.7% accuracy rate, followed by Opus-4.6 and GPT-5.3-codex. Scores range from 91.7% down to 11.7%, a wide disparity in logical reasoning capability across current models.

Key Points

LamBench treats the ability to solve its 120 lambda calculus problems as a measure of AI intelligence.
GPT-5.4 is the current leader in the benchmark, achieving a 91.7% success rate.
The benchmark includes 21 models from major developers including OpenAI, Anthropic, Google, and DeepSeek.
There is an 80-percentage-point gap between the highest-performing model (91.7%) and the lowest-performing model (11.7%).
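
To make the percentages concrete, the reported accuracy rates can be converted back into approximate task counts. This is a minimal sketch that assumes every one of the 120 tasks is scored pass/fail and weighted equally, which the summary above does not state explicitly; the helper name solved_count is hypothetical.

```python
TOTAL_TASKS = 120  # task count reported for LamBench


def solved_count(accuracy_percent, total=TOTAL_TASKS):
    """Approximate number of tasks solved, assuming equal-weight pass/fail scoring."""
    return round(accuracy_percent / 100 * total)


top = solved_count(91.7)     # highest reported accuracy
bottom = solved_count(11.7)  # lowest reported accuracy
gap_in_tasks = top - bottom

print(f"top: {top}/{TOTAL_TASKS}, bottom: {bottom}/{TOTAL_TASKS}, gap: {gap_in_tasks} tasks")
```

Under that assumption, the leading and trailing models would differ by roughly 96 of the 120 tasks.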

Sentiment

The community is broadly supportive of LamBench as a valuable and well-designed benchmark, appreciating that lambda calculus tests genuine reasoning on novel tasks. However, there is notable pushback on the single-shot methodology and skepticism about how meaningful the rankings are without multiple trials. The discussion leans toward acknowledging frontier model superiority while debating whether that gap matters practically for most use cases.

In Agreement

Novel, uncontaminated benchmarks like LamBench are the best way to differentiate models, and they consistently show top commercial models well ahead of smaller and open alternatives.
Claims that small or open models are 'opus killers' are overhyped; even DeepSeek acknowledged a gap after releasing its 1.6T-parameter model.
Lambda calculus is a brilliant subject for benchmarking because it tests genuine algorithmic reasoning rather than pattern-matching from training data.
Every model failed the FFT task, which suggests that pure lambda calculus programming requires fundamentally different skills than standard coding tasks and validates the benchmark's difficulty.
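
For readers unfamiliar with what "pure lambda calculus programming" involves, here is a small illustration using Church numerals, where numbers and arithmetic are built from nothing but single-argument functions. It is not taken from the LamBench task set; it is a generic textbook encoding written with Python lambdas, and to_int is a hypothetical helper added only to inspect results.

```python
# Church numerals: the number n is "apply f to x, n times".
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

# Addition applies f m times on top of n applications.
add = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

# Multiplication composes: apply "f n times", m times over.
mul = lambda m: lambda n: lambda f: m(n(f))


def to_int(n):
    """Decode a Church numeral by counting applications (inspection helper only)."""
    return n(lambda k: k + 1)(0)


two = succ(succ(zero))
three = succ(two)

print(to_int(add(two)(three)))  # 5
print(to_int(mul(two)(three)))  # 6
```

Even simple arithmetic has to be expressed through function application alone, which hints at why an algorithm such as an FFT is far harder to write in this style than in an ordinary programming language.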

Opposed

Single-shot testing of non-deterministic probabilistic models is methodologically flawed; multiple trials per problem are needed to properly characterize model capabilities.
For many practical tasks, cheaper models that are 'almost as good' at common coding work are sufficient, especially at a fraction of the cost.
Benchmark results without information about model configuration (quantization, serving infrastructure) are hard to interpret and potentially misleading.
The real bottleneck for developer productivity is tooling and review processes, not marginal improvements in model intelligence.
Local open-weight models offer advantages through unlimited runs and full control over the inference harness, which can offset raw benchmark performance gaps.
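
The multiple-trials objection can be made concrete with a short sketch of how repeated runs per problem might be summarized. This is not LamBench's scoring code; the function summarize_trials, the task names, and the outcomes below are all made up for illustration.

```python
def summarize_trials(results):
    """Summarize repeated pass/fail outcomes.

    results maps a task id to a list of booleans, one per independent trial.
    """
    # Fraction of trials passed for each task.
    per_task = {task: sum(outcomes) / len(outcomes) for task, outcomes in results.items()}
    rates = list(per_task.values())
    return {
        "mean_pass_rate": sum(rates) / len(rates),
        "solved_every_time": sum(1 for r in rates if r == 1.0),
        "solved_at_least_once": sum(1 for r in rates if r > 0.0),
    }


# Hypothetical outcomes for three tasks, three trials each.
example = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
}

summary = summarize_trials(example)
print(summary)
```

In this made-up data, a single-shot run could score task_b as either a pass or a fail depending on which trial was sampled, which is the kind of variation the critics argue a one-run ranking cannot show.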