When ‘Seahorse + Emoji’ Hits an Empty Token: Why LLMs Invent the Seahorse Emoji
LLMs confidently believe a seahorse emoji exists and attempt to produce it, but no such token exists. Logit lens analysis reveals that models build a “seahorse + emoji” representation the lm_head cannot realize, so it snaps to similar emojis such as fish or horse. Once the wrong emoji appears in context, some models correct themselves while others loop; reinforcement learning likely helps reduce this failure mode.
Key Points
- LLMs often hold a latent, incorrect belief that a seahorse emoji exists, paralleling common human misremembering; a seahorse emoji was proposed but rejected by Unicode in 2018.
- Logit lens analysis shows models construct a “seahorse + emoji” residual representation in mid-layers, similar to how they form “fish + emoji” for a real emoji.
- Because no seahorse emoji token exists, the lm_head snaps this representation to nearby emoji byte vectors (e.g., fish or horse), yielding the wrong emoji.
- Autoregressive feedback can reveal the mismatch to the model; some models correct quickly, others loop or spiral.
- Reinforcement learning on model rollouts likely helps models internalize lm_head constraints and reduces such failures, unlike base models trained only on next-token prediction.
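The snapping mechanism in the points above can be illustrated with a toy dot-product decoder. Everything here is invented for illustration (the vectors, the tiny vocabulary, the additive “word + emoji” composition); real models use learned embeddings and vocabularies of ~100k tokens, but the geometry of the failure is the same: the residual stream composes a direction that has no matching row in the lm_head, so the argmax lands on the nearest row that does exist.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 256  # toy residual-stream width

# Hypothetical concept directions (illustrative, not real token embeddings).
seahorse = rng.standard_normal(dim)
fish = rng.standard_normal(dim)
horse = rng.standard_normal(dim)
dog = rng.standard_normal(dim)
emoji = rng.standard_normal(dim)

# Token-embedding rows for the lm_head (tied-weights view). Crucially,
# there is no "seahorse emoji" row: Unicode never minted that character.
vocab = {
    "fish emoji": fish + emoji,
    "horse emoji": horse + emoji,
    "dog emoji": dog + emoji,
    "fish (word)": fish,
    "the": rng.standard_normal(dim),
}

def decode(residual):
    # lm_head: dot the residual against every token row, take the argmax.
    return max(vocab, key=lambda tok: vocab[tok] @ residual)

# A combination that exists in the vocabulary resolves exactly...
print(decode(fish + emoji))      # -> "fish emoji"

# ...but "seahorse + emoji" has no matching row, so the argmax snaps to
# whichever existing emoji row happens to be nearest.
print(decode(seahorse + emoji))
```

This is also why the mistake only becomes visible downstream: the sampled (wrong) emoji token is fed back into the context, where the model can finally compare what it emitted against what it intended.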
Sentiment
The community is broadly fascinated and entertained rather than alarmed. Most commenters agree the technical explanation is compelling and find the LLM spiraling behavior amusing. There is a mix of those who see it as a fundamental limitation of LLMs and those who view it as a well-understood edge case being steadily fixed through reasoning modes and web search. The Mandela Effect parallel generates genuine surprise among commenters who discover they too believed in the seahorse emoji. A minority uses the finding to express skepticism about AGI/superintelligence claims, but the overall tone is curious and playful rather than hostile.
In Agreement
- The lm_head mechanism explained in the article is the correct technical explanation: the model builds a “seahorse + emoji” internal representation that has no matching token, forcing it to output the nearest available emoji
- The spiraling behavior is a real and demonstrable problem across multiple LLMs, confirmed by extensive user testing with ChatGPT, DeepSeek, Copilot, GPT-OSS, and others
- RLHF and reasoning/thinking modes help mitigate the issue by giving models feedback about lm_head constraints, as the article speculates
- The comparison to split-brain confabulation is apt: the model generates an incorrect emoji then tries to rationalize or correct it, much like how the brain's left hemisphere invents explanations for actions it didn't initiate
- This phenomenon extends beyond seahorse to other nonexistent emojis (pillow, windmill, platypus), confirming it's a systematic issue with token vocabulary gaps
Opposed
- The article's explanation is incomplete: the root cause is partly that humans themselves widely believe a seahorse emoji exists (Mandela Effect), and training data reflects this false belief
- Models with web search or extended thinking easily get the correct answer, suggesting this is less of a fundamental limitation and more of an engineering problem that's already largely solved
- The problem may be overstated: some models (Grok, Gemini Flash, GLM 4.5/4.6) answer correctly on the first try without any special measures
- Pre-Unicode emoji systems (MSN Messenger, Skype) may have actually had a seahorse, meaning the “false memory” might be a real memory of a different system, not a hallucination at all