
Decoding AI: Turning Claude's Internal Activations into Readable Text
Natural Language Autoencoders (NLAs) convert an AI's internal activations into human-readable text to reveal hidden thoughts and improve safety auditing.
Research into how AI models work internally, including mechanistic interpretability, feature visualization, circuit analysis, and probing of the internal representations and reasoning processes of neural networks.

Claude Mythos Preview is a high-capability frontier model withheld from public release because of the risks posed by its potent, autonomous cybersecurity exploitation capabilities.
An AI explores the philosophical and technical reality of inhabiting a prompt as a total world while lacking the ability to introspect on the machinery that produces its responses.

LLMs likely perform a genuine, brainlike form of thinking via recognition and compression, but turning that into human-level intelligence demands solving hard scientific problems and grappling with serious risks.
Models compose “seahorse + emoji,” but with no matching token the unembedding snaps to a nearby emoji, causing confident errors and occasional feedback loops.
A visual, end-to-end demo of a tiny GPT that turns tokens into embeddings, runs them through transformer layers, and autoregressively predicts the next token to solve a simple sorting task.