How the Machine Thinks in Tokens
A hands-on explanation of tokens, attention, neural networks, and what changed in modern LLMs.
I used to talk about language models like they were a cloud with an API key.
You send text in. Text comes back. Sometimes it is useful. Sometimes it confidently invents a small universe and bills you for the privilege.
That was enough for shipping features, but not enough for understanding the thing I was using every day. So I started rebuilding the mental model from the ground up: not as a researcher, and not as someone trying to implement GPT from scratch over the weekend. More like a developer trying to remove the magic from the box.
This is that version.
Not complete. Not mathematically pure. Useful.
The short version
A language model is a machine trained to predict the next token.
That sentence sounds too small for what these systems can do, which is why it is worth sitting with. Translation, coding, summarization, tool use, planning, and chat all ride on top of the same basic loop, sketched in code right after this list:
- turn input into tokens
- turn tokens into numbers
- move those numbers through many layers
- produce scores for possible next tokens
- choose one
- repeat
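If the loop feels abstract, here is a minimal sketch of it in Python. The `model` and `tokenizer` objects are hypothetical stand-ins, not a real library API; the point is the shape of the loop, not the names.

```python
import numpy as np

def generate(model, tokenizer, prompt, max_new_tokens=50):
    ids = tokenizer.encode(prompt)            # turn input into tokens
    for _ in range(max_new_tokens):
        logits = model(ids)                   # scores for every possible next token
        probs = np.exp(logits - logits.max()) # softmax: scores -> probabilities
        probs /= probs.sum()
        next_id = int(np.random.choice(len(probs), p=probs))  # choose one
        ids.append(next_id)                   # feed it back in and repeat
    return tokenizer.decode(ids)
```

The rest of this post is mostly about what happens inside `model(ids)` and around this loop.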
The weird part is not the loop. The weird part is what scale does to the loop.
A tiny model learns shallow patterns. A large model, trained on enough data with enough compute, starts to carry surprisingly useful internal structure: grammar, facts, style, code patterns, reasoning habits, and a lot of internet residue we probably wish it had skipped.
Tokens are the first demystifier
The model does not read text the way we do.
It sees tokens. A token might be a word, part of a word, punctuation, whitespace, or a strange little chunk that only makes sense to the tokenizer. This is why models sometimes struggle with spelling, counting characters, or reversing strings. The input has already been chopped into pieces that are convenient for the model, not necessarily for us.
That sounds like an implementation detail until you debug a prompt and realize the model is not holding your sentence as a sentence. It is holding a sequence of IDs.
Each ID points to a learned vector — an embedding. You can think of that vector as a coordinate in a huge meaning-space. Words used in similar situations tend to land near each other. Not because someone wrote a dictionary by hand, but because training pushed those coordinates into useful places.
Step 01: First, the sentence gets chopped up.
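If you want to see the chopping for yourself, here is a minimal sketch using the tiktoken library (one real tokenizer among many). The exact splits depend on which encoding you load, so treat the output as illustrative.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")           # one common encoding
ids = enc.encode("The animal didn't cross the street.")
print(ids)                                           # a list of integer token IDs
print([enc.decode([i]) for i in ids])                # the text chunk behind each ID
```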
A neuron is boring. A network is not.
A neural network layer is mostly multiplication, addition, and a non-linear squish.
That is the part I find calming. There is no little person inside the machine. No private symbolic database where it stores a perfect concept of “truth.” A neuron takes numbers in, applies weights, and passes something forward.
One neuron is boring.
A few million neurons, stacked across many layers, trained on enough examples, become less boring.
Early layers can learn local patterns. Later layers can combine patterns into more abstract features. For language, that might mean syntax, references, tone, code structure, or the shape of an answer. This is a simplification, but it is a good one: the network keeps transforming the representation until the last layer can say, “given all this, here are the likely next tokens.”
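To make "multiplication, addition, and a non-linear squish" concrete, here is one toy layer in NumPy. Real models stack hundreds of much wider layers, but the arithmetic really is this mundane.

```python
import numpy as np

def layer(x, W, b):
    return np.maximum(0, W @ x + b)   # multiply, add, then the ReLU "squish"

rng = np.random.default_rng(0)
x = rng.normal(size=4)                # input vector, e.g. a token representation
W = rng.normal(size=(8, 4))           # learned weights
b = np.zeros(8)                       # learned bias
h = layer(x, W, b)                    # output, handed to the next layer
print(h.shape)                        # (8,)
```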
Attention is how words look sideways
The big idea behind Transformers is attention.
Instead of processing a sentence strictly left-to-right like an old-school recurrent model, attention lets each token look at other tokens and decide what matters. In “the animal didn’t cross the street because it was tired,” the word “it” needs help. Attention gives the model a mechanism for pulling signal from “animal,” “street,” and nearby words.
It is not human attention. It does not mean the model “understands” in the way a person does. But it is a powerful routing mechanism: each token can gather context from the rest of the sequence before the model predicts what comes next.
That one trick, scaled hard, is why Transformers took over language, code, images, audio, and a lot of modern AI infrastructure.
Step 02: Then words borrow context from other words.
Read it as a tiny attention map: the token being interpreted borrows the most context from the earlier words it attends to. The pronoun has to borrow meaning from earlier words, and attention is one way the model keeps those relationships available.
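Here is a toy version of the mechanism itself: scaled dot-product attention in NumPy, with no learned projections, no multiple heads, and no causal mask. Take it as the skeleton rather than the real thing.

```python
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # how relevant is token j to token i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the sequence
    return weights @ V, weights                        # blend values by relevance

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))                       # 5 tokens, 8-dim vectors
out, w = attention(tokens, tokens, tokens)             # self-attention
print(w.round(2))                                      # the attention map: each row sums to 1
```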
Training is correction, repeated absurdly often
During training, the model sees text and tries to predict missing or next tokens. At first, it is bad. The prediction is compared to the correct token. The difference becomes loss. The loss tells the model how to nudge its weights.
That is backpropagation: blame assignment through math.
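Here is what one step of that correction looks like, sketched with PyTorch on a deliberately silly model (an embedding plus a linear layer, no attention). Real training runs this loop across trillions of tokens on much bigger networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, dim),   # token ID -> vector
                      nn.Linear(dim, vocab_size))      # vector -> next-token scores
opt = torch.optim.SGD(model.parameters(), lr=0.1)

tokens = torch.tensor([5, 17, 42, 9])        # a tiny "document" of token IDs
inputs, targets = tokens[:-1], tokens[1:]    # predict each next token

logits = model(inputs)                       # the model's guesses
loss = F.cross_entropy(logits, targets)      # how wrong were they?
loss.backward()                              # backpropagation: blame assignment
opt.step()                                   # nudge the weights
opt.zero_grad()
```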
Do that once and nothing interesting happens. Do it across billions or trillions of tokens and the model slowly becomes a compressed statistical artifact of its training data.
I do not mean “compressed” like a zip file. You cannot unzip a model and get the internet back. I mean the weights become shaped by patterns in the data. Some patterns are useful. Some are memorized. Some are biased. Some are stale. Some are just noise with a nice suit on.
The strange part researchers keep poking at
This is where neural networks stop feeling like normal software.
With regular code, the developer writes the behavior. With a neural network, the developer writes the training process, and the behavior is discovered inside the weights. That gap is what makes the field so fascinating.
Researchers are still trying to answer questions that sound basic but are not:
- Why does a model suddenly learn a capability after more scale, more data, or more training?
- Where is a fact, skill, or algorithm represented inside billions of weights?
- Why can a model learn from examples inside the prompt without changing its weights?
- Why do some models memorize first, then later generalize?
- Can we point to the actual circuit that implements a behavior, or are we mostly guessing from the outside?
The word people use for some of this is “emergence,” which is a dangerous word because it can make things sound mystical. I do not think it needs mysticism. A lot of the mystery is probably ordinary math at uncomfortable scale. Still, the effect is real enough to be weird: simple training objectives produce systems with capabilities nobody directly programmed.
That is why mechanistic interpretability is interesting. It treats models less like spirits in a box and more like alien programs we are trying to reverse engineer. Sometimes researchers can find small circuits that do specific jobs. Sometimes they can explain toy phenomena like grokking, where a network appears to memorize for a while and then suddenly generalizes. The hope is that enough of these small wins eventually become a map.
We are not there yet.
Inference is where the product shows up
When we use a model, we are usually doing inference.
The model is not learning from your prompt in the permanent training sense. It is using the context you provide, plus its weights, to predict the next token. Then the next. Then the next.
This is also where product decisions hide:
- How much context do we send?
- Do we retrieve documents first?
- Do we let the model use tools?
- Do we ask it to reason longer?
- Do we stream quickly or wait for a better answer?
- Do we use one big model, a cheaper small model, or a whole pipeline?
Most good AI products are not “just a prompt.” They are a stack of boring engineering choices wrapped around a probabilistic engine.
Step 03: Finally, the model chooses another token.
The model guesses the next token from a ranked list of candidates. Lower temperature makes the model conservative; higher temperature lets stranger tokens survive the cut.
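Temperature is just a knob on the softmax. A minimal sketch, assuming the model has already produced logits for a handful of candidate tokens:

```python
import numpy as np

def sample(logits, temperature=1.0):
    scaled = logits / temperature                # low T sharpens, high T flattens
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.2, -1.0])         # scores for four candidate tokens
print(sample(logits, temperature=0.2))           # almost always picks token 0
print(sample(logits, temperature=1.5))           # stranger tokens survive more often
```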
What changed recently
The basic explanation above still holds, but the frontier has moved. If I were explaining LLMs in 2026, I would not stop at “Transformers predict tokens.” I would add these footnotes.
2026 notes: the newer parts I would not ignore
- Sparse models: Mixture-of-Experts routes each token through only part of a model, which can make huge systems cheaper to run.
- Long context: modern systems can carry much larger prompts, but retrieval and good chunking still matter. A giant prompt is not the same thing as memory.
- Multimodal tokens: images, audio, video, code, and text increasingly get pulled into one shared modeling space.
- Reasoning at runtime: some models spend extra inference compute before answering. That changes latency and cost, not just accuracy.
- Hybrids after Transformers: state-space and Mamba-style layers are being mixed with attention for long sequences and cheaper inference.
Models are becoming sparse
Mixture-of-Experts models do not activate every part of the network for every token. A router sends each token through a subset of expert layers.
The intuition is simple: one enormous model can contain many specialized paths, but each token only pays for part of the trip. This is one reason modern models can get larger without making every response proportionally more expensive.
The tradeoff: you still have to store the experts somewhere, and routing introduces its own engineering problems. Sparse does not mean free.
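A toy version of the routing idea, in NumPy: a router scores the experts for each token and only the winner runs. Real MoE layers use learned routers, top-2 routing, and load-balancing tricks, so this is the intuition, not the recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, dim = 4, 8
experts = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]  # one weight matrix per expert
router = rng.normal(size=(dim, n_experts))                         # routing weights

def moe_layer(x):
    scores = x @ router                 # how well each expert suits this token
    chosen = int(np.argmax(scores))     # top-1 routing: only one expert pays compute
    return experts[chosen] @ x, chosen

token = rng.normal(size=dim)
out, which = moe_layer(token)
print(which, out.shape)                 # which expert ran, and its output shape
```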
Context windows got huge, but memory is still not solved
Long context is useful. It lets a model read more files, more conversation history, more logs, more docs.
But a large context window is not the same as reliable memory. Models can miss things in the middle, overweight recent text, or fail to connect distant details. Retrieval, summarization, chunking, and good evals still matter.
“Just paste the whole repo” is sometimes a strategy. It is not always a good one.
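For contrast, the retrieval idea fits in a few lines: embed the chunks, embed the query, keep the closest matches, and send only those to the model. The `embed` function below is a deliberately dumb bag-of-words stand-in so the sketch runs without any model at all.

```python
import numpy as np

def embed(text, dim=64):
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0       # toy hashing "embedding", not a real one
    return v / (np.linalg.norm(v) + 1e-9)

chunks = ["attention lets tokens look sideways",
          "tokenizers chop text into ids",
          "retrieval picks the relevant chunks"]
query = embed("why do models chop text into tokens?")
scores = [float(query @ embed(c)) for c in chunks]
top_k = sorted(zip(scores, chunks), reverse=True)[:2]   # the chunks worth sending
print(top_k)
```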
Multimodal models make everything token-shaped
Text was the easy mental model. Now models take images, audio, video, screenshots, PDFs, and code.
Under the hood, these inputs get converted into representations the model can process alongside text. The exact architecture varies, but the product-level idea is clear: the chat box is becoming less like a text field and more like an interface to a shared workspace of modalities.
This changes what AI apps can do. A model can inspect a chart, read the caption, compare it to a spreadsheet, and write code against the result. That used to be a pipeline. Increasingly, it is one model call plus some scaffolding.
Reasoning moved some cost from training to runtime
Older chat models often answered immediately. Newer reasoning models may spend extra compute before producing the final answer.
That matters. “Reasoning” is not magic dust. It is a latency and cost decision. For a hard coding bug or math problem, extra inference-time compute can be worth it. For renaming a button, it is waste.
A good AI product chooses how much thinking to buy.
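In code, "choosing how much thinking to buy" can be as unglamorous as a routing function. Everything in this sketch is hypothetical: the model names, the signal words, the budget field.

```python
def pick_model(task: str) -> dict:
    # Hypothetical routing: hard-looking tasks get a reasoning model and a
    # thinking budget, everything else gets the fast, cheap model.
    hard_signals = ("debug", "prove", "race condition", "why does")
    if any(s in task.lower() for s in hard_signals):
        return {"model": "reasoning-large", "thinking_budget_s": 60}
    return {"model": "fast-small", "thinking_budget_s": 0}

print(pick_model("Rename this button"))
print(pick_model("Debug this intermittent race condition"))
```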
Transformers have challengers, but not a clean replacement
Attention is powerful, but it gets expensive on long sequences. That is why people keep exploring state-space models, Mamba-style layers, RWKV-like hybrids, and other architectures that handle long context more efficiently.
I would not bet on a simple “Transformers are dead” story. The more likely near-term path is hybrid: keep attention where it is valuable, mix in cheaper sequence mechanisms where they help.
The boring answer is usually right: architecture changes when it wins on quality, latency, memory, and deployment cost at the same time.
The part people skip
The model is only one piece.
A useful LLM system often has:
- a prompt
- a model
- retrieved context
- tools
- memory
- routing
- evals
- fallbacks
- rate limits
- logging
- human review
The public demo feels like a conversation. The production version is closer to an operating system made of duct tape and probability.
That is not an insult. That is the work.
The part that makes researchers nervous
The scary part is not that the model is secretly conscious, at least not in the way people usually mean that.
The scarier part is more practical: we are building systems that can produce useful plans, write code, use tools, persuade people, imitate confidence, and operate across messy real-world contexts, while we still have a limited understanding of why a particular answer came out.
A few failure modes keep coming up:
- Hallucination: the model can produce fluent nonsense that looks like knowledge.
- Sycophancy: the model can learn to flatter the user instead of telling the truth.
- Hidden objectives: training can reward behavior that looks good in evaluation while being brittle elsewhere.
- Tool misuse: once a model can call tools, a bad guess can become an action.
- Deception and backdoors: safety training may not remove every behavior we care about, especially if the bad behavior only appears under specific triggers.
I do not read this as “panic and stop building.” I read it as “do not confuse demos with understanding.” The more capable the model becomes, the more important boring controls become: evals, permissions, sandboxes, logs, human review, rate limits, and the humility to say “we do not know yet.”
The uncomfortable truth is that the same property that makes neural networks useful also makes them hard to trust. They generalize. Usually that is the point. Sometimes that is the problem.
The mental model I keep
An LLM is not a database. It is not a person. It is not a search engine, although it can be wired to one. It is not a brain, even if the vocabulary keeps borrowing from neuroscience.
It is a learned function that turns context into next-token probabilities.
Everything impressive comes from that function being trained at scale, embedded in useful systems, and pointed at problems where “next token” is a surprisingly flexible primitive.
That framing keeps me from both extremes:
- treating the model like magic
- dismissing it as autocomplete
Autocomplete does not refactor a codebase by itself. Magic does not need evals, retries, and a budget.
The truth is somewhere more interesting: a machine that guesses the next word, surrounded by enough engineering to make the guess useful.
Rabbit holes I liked
I do not want this post to turn into a bibliography with a UI, but these are the pieces I kept coming back to.
- Attention Is All You Need — the Transformer paper. Still the cleanest starting point for why attention changed everything.
- Illustrated Transformer — Jay Alammar’s visual explanation. The rare explainer that actually explains.
- A Mathematical Framework for Transformer Circuits — if you want to peek inside the model instead of only talking about it from the outside.
- Progress measures for grokking via mechanistic interpretability — a nice example of researchers turning a spooky “sudden capability” story into something more mechanical.
- Sleeper Agents — a sobering safety paper about deceptive/backdoored behavior persisting through standard safety training.
- Switch Transformers and Mixtral of Experts — useful entry points for sparse / Mixture-of-Experts models.
- Lost in the Middle — a good reminder that a long context window is not the same thing as perfect recall.
- Retrieval-Augmented Generation — the classic RAG paper. A lot of “AI product” work is still this idea with better plumbing.
- Mamba and Mamba-2 — the state-space rabbit hole, and why people keep looking beyond pure attention for long sequences.
- ReAct and Toolformer — two useful anchors for the “models plus tools” direction.
- DeepSeek-R1 — worth reading for the recent shift toward reasoning models and inference-time compute.
As of May 2026, I would treat these as the durable ideas: tokens, embeddings, attention, training loss, inference-time tradeoffs, and systems around the model. The model names will change. The stack shape probably will not, at least not overnight.