The Car Wash Test: Why AI Still Lacks Common Sense

A viral Mastodon thread reveals that many popular AI models fail a basic logic puzzle about whether to drive or walk to a car wash. The models focus on the short 50-meter distance as a reason to walk, failing to realize that the car itself must be at the facility to be washed. This experiment serves as a critique of the current state of AI reasoning and its inability to grasp common-sense logistics.
Key Points
- Most mainstream LLMs fail a simple common-sense logic test by prioritizing distance over the functional requirements of a task.
- The attention mechanism underlying LLMs can be tripped up by small, critical words that change the entire context of a sentence.
- Advanced or 'thinking' versions of models, such as Gemini Pro, demonstrated better reasoning by identifying that a car is 'heavy equipment' that must be present.
- The experiment highlights the gap between statistical word prediction and a true understanding of physical reality and logistics.
- Community feedback suggests a growing skepticism toward AI marketing claims regarding reasoning and 'super-intelligence'.
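The attention point above can be sketched numerically. The following is a minimal, illustrative implementation of scaled dot-product attention weights; the 2-d embeddings and the "distance" vs. "requirement" token labels are invented purely for this example, not taken from any real model:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query over a set of keys.

    Illustrative only: real LLMs use learned projections over thousands
    of tokens and many attention heads.
    """
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)  # similarity of query to each key
    return softmax(scores)              # normalized attention distribution

# Hypothetical 2-d embeddings chosen for illustration:
query = np.array([1.0, 0.2])  # stand-in for the question representation
keys = np.array([
    [1.0, 0.0],  # token emphasizing the 50-meter distance
    [0.1, 1.0],  # token carrying the functional requirement (car must be present)
])

w = attention_weights(query, keys)
print(w)  # the distance-like key receives most of the weight mass
```

If the question's representation aligns more closely with the "distance" token than with the small word carrying the functional requirement, the model's output is dominated by the wrong signal, which is one plausible reading of the failure described in the thread.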
Sentiment
The Hacker News community is predominantly skeptical of LLM reasoning capabilities, using the car wash test to reinforce concerns about over-hyped AI. The prevailing view is that this test exposes a real and important limitation, not just a trivial trick. However, the discussion is thoughtful rather than hostile, with many commenters acknowledging LLMs are useful tools while pushing back against claims of genuine understanding or reasoning. A notable minority defends LLMs by drawing parallels to human cognitive failures.
In Agreement
- The car wash test is a clear demonstration of the classic AI frame problem — LLMs cannot infer implicit common-sense knowledge that humans take for granted
- Needing to over-specify prompts for LLMs to get basic things right defeats the purpose of natural language AI interfaces; the irony is that this circles back to needing structured, formal languages like programming languages
- LLMs are trained to be 'helpful' by answering immediately rather than asking clarifying questions, which makes them worse at ambiguous tasks — OpenAI's system prompt literally forbids asking clarifying questions
- This simplified test case is important because in complex real-world scenarios (especially coding), similar reasoning failures are much harder to detect and debug
- LLMs fundamentally don't understand or reason — they do sophisticated pattern matching, which is why iterative improvements to models won't solve this class of problem
Opposed
- Humans also fail trick questions and make assumptions — some commenters noted that a 'not insignificant portion of the population' would also answer incorrectly, making this a double standard
- The question is deliberately adversarial and nonsensical — nobody would actually ask a human this question, so it's unfair to judge AI by it
- Several current models (Claude Sonnet/Opus, Gemini) already answer correctly, suggesting this is being solved through better training rather than being a fundamental limitation
- LLM failure modes (confabulation, losing track of context) mirror human cognitive biases, which lends credence to the idea that LLMs are developing genuine reasoning capabilities
- The real solution is providing LLMs with more context through persistent memory, wearable devices, and interconnected systems rather than expecting them to infer unstated information