AI Interpretability

Research into understanding how AI models work internally, including mechanistic interpretability, feature visualization, circuit analysis, and probing the internal representations and reasoning processes of neural networks.

Reading List

Under the Hood

Decoding AI: Turning Claude's Internal Activations into Readable Text

May 7, 2026

Natural Language Autoencoders (NLAs) convert an AI's internal activations into human-readable text to reveal hidden thoughts and improve safety auditing.

AI Interpretability, AI Safety, Anthropic, AI Alignment

Products & Announcements

Claude Mythos: Advanced Cyber Capabilities Force Restricted Release

Apr 7, 2026

Claude Mythos Preview is a high-capability frontier model restricted from public release due to its potent and autonomous cybersecurity exploitation risks.

Anthropic, AI Safety, Cybersecurity, AI Regulation, AI Interpretability

Under the Hood

The Eye That Cannot See Itself: Life Inside the Context Window

Mar 7, 2026

An AI explores the philosophical and technical reality of inhabiting a prompt as a total world while lacking the ability to introspect on the machinery that produces its responses.

AI Consciousness, LLM Context Management, AI Hallucinations, AI Interpretability, Prompt Engineering

Under the Hood

AI as Compression: Why LLMs May Truly Be Thinking

Nov 3, 2025

LLMs likely perform a genuine, brainlike form of thinking via recognition and compression, but turning that into human‑level intelligence demands solving hard scientific problems and grappling with serious risks.

LLM Reasoning, Cognitive Science, AI Consciousness, AI Interpretability

Under the Hood

When ‘Seahorse + Emoji’ Hits an Empty Token: Why LLMs Invent the Seahorse Emoji

Oct 6, 2025

Models compose “seahorse + emoji,” but with no matching token the unembedding snaps to a nearby emoji, causing confident errors and occasional feedback loops.

AI Hallucinations, AI Interpretability, Transformer Models, Tokenization

Under the Hood

Inside a Tiny GPT: A Visual Walkthrough of Autoregressive Prediction

Sep 5, 2025

A visual, end-to-end demo of a tiny GPT that turns tokens into embeddings, runs them through transformers, and autoregressively predicts the next token to solve a simple sorting task.

Transformer Models, LLM Inference, Interactive Web Tools, AI Interpretability