Transformer Deep Dive

How a prompt becomes an answer inside the ESP32-S3 LLM — every stage, explained

1. The Full Pipeline — Bird's-Eye View

When you type Q: What is WiFi? into the device, here is every stage the text passes through before the model produces the first token of an answer. Each stage runs in PSRAM on the ESP32-S3.

Raw Text
Q: What is WiFi?\nA:
The firmware wraps your question into the Q:/A: format the model was trained on, then passes raw characters to the tokenizer.
characters → integers
Tokenization (BPE)
The Byte-Pair Encoding tokenizer breaks text into subword tokens. It learned these from the training data — frequent words like "WiFi" or "Type" are single tokens; rare words get split into pieces.
"Q:" → 412   "What" → 87   "is" → 23   "WiFi" → 156   "?\nA:" → 5
Each integer is an index into the vocabulary table (up to 4096 entries). The same text always produces the same token IDs.
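Conceptually this stage is a dictionary lookup. A minimal sketch (the token IDs match the example above; the vocabulary dict is a hypothetical stand-in for the real learned BPE merge table):

```python
# Hypothetical lookup table: the real tokenizer applies learned BPE merges
# first; the IDs below are the ones from the example above.
VOCAB = {"Q:": 412, "What": 87, "is": 23, "WiFi": 156, "?\nA:": 5}

def tokenize(pieces):
    """Map already-split subword pieces to their vocabulary indices."""
    return [VOCAB[p] for p in pieces]

print(tokenize(["Q:", "What", "is", "WiFi", "?\nA:"]))  # [412, 87, 23, 156, 5]
```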
integer IDs → vectors
Token + Position Embedding
Each token ID is looked up in a token embedding table (4096 rows × dim columns). This converts each integer into a dense vector of dim floating-point numbers.

A separate position embedding (128 rows × dim columns) is added to encode where each token sits in the sequence. Token #0 gets position vector #0, token #1 gets position vector #1, and so on.
"Q:"    → [0.12, -0.34, …]
"What"  → [0.56, 0.11, …]
"is"    → [-0.23, 0.78, …]
"WiFi"  → [0.91, -0.02, …]
"?\nA:" → [0.04, 0.65, …]
Each vector has exactly dim numbers. For dim=128, that's 128 floats per token. For dim=192, that's 192. This is the "width" of the model — how many dimensions each token has to describe itself.
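The two lookups sketched in NumPy (random matrices stand in for the learned embedding tables; shapes match the dim=128 preset):

```python
import numpy as np

dim, vocab_size, max_seq = 128, 4096, 128
rng = np.random.default_rng(0)  # random stand-ins for the learned tables
tok_emb = rng.standard_normal((vocab_size, dim), dtype=np.float32)  # 4096 × dim
pos_emb = rng.standard_normal((max_seq, dim), dtype=np.float32)     # 128 × dim

ids = np.array([412, 87, 23, 156, 5])
# Row lookup for "what" (token identity) + row lookup for "where" (position)
x = tok_emb[ids] + pos_emb[np.arange(len(ids))]
print(x.shape)  # (5, 128)
```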
enters the transformer layers
Transformer Layers (repeated N times)
The entire sequence of token vectors passes through every layer, one at a time. Each layer has two sub-stages:
Multi-Head Self-Attention — tokens look at each other
Feed-Forward Network (FFN) — each token is processed independently

With 22 layers, the token vectors are transformed 22 times. Each transformation refines the representations, building from raw word identity toward question-answer associations.
after all N layers
Final Layer Norm
Normalizes the output vectors to have stable magnitude. Without this, values could drift to extreme ranges after many layers of addition and multiplication.
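The normalization itself is simple. A sketch (learned scale and bias omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean / unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.array([[100.0, -50.0, 3.0, 7.0]])  # drifted magnitudes
y = layer_norm(x)
print(round(float(y.mean()), 6), round(float(y.std()), 3))  # 0.0 1.0
```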
vector → vocabulary probabilities
LM Head (Projection to Vocabulary)
The final token's vector (the one at the "A:" position) is multiplied by the token embedding table in reverse (transposed). This produces a score for every word in the vocabulary.

The highest-scoring token becomes the first word of the answer. The model then appends that token to the sequence and runs the whole pipeline again to generate the next token, and so on, until it produces an end-of-sequence signal or hits the token limit.
"WiFi" → 0.82
"Type" → 0.04
"BLE" → 0.02

2. Inside a Single Transformer Layer

Every layer has the same internal structure. The key concept is the residual stream — a highway of information that passes straight through every layer. Each sub-stage (attention, FFN) reads from this stream, computes something, and adds its result back. This means early layers can pass information directly to later layers without it being destroyed.

Residual Stream — token vectors from previous layer (or embedding)
  ↓
Layer Norm 1 — stabilize values before attention
  ↓
Multi-Head Self-Attention — tokens look at each other
  ↓
(+) result added back to the residual stream
  ↓
Layer Norm 2 — stabilize values before FFN
  ↓
Feed-Forward Network — per-token processing
  ↓
(+) result added back to the residual stream
  ↓
Residual Stream — updated vectors → next layer

Why the + (addition) matters: If attention or the FFN produces garbage in early training, the + means the original information still gets through. The model can learn to "do nothing" in a layer by outputting zeros. This is why deeper models (more layers) don't automatically hurt — unused layers just pass data through. But each useful layer gets one more chance to refine the representations.
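That "+" is literally one addition. A minimal sketch of a sub-stage wrapped in a residual connection:

```python
import numpy as np

def block(x, sublayer):
    # The "+" of the residual connection: the sublayer's output is
    # ADDED to the stream, never substituted for it.
    return x + sublayer(x)

x = np.ones((5, 128), dtype=np.float32)
out = block(x, lambda v: np.zeros_like(v))  # a layer that learned to "do nothing"
print(np.array_equal(out, x))  # True — the original information passes through
```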

3. Multi-Head Self-Attention — In Detail

This is the most important and most complex part of the transformer. Attention is how tokens communicate with each other. Without it, each token would be processed in isolation and the model could never understand that "What" + "is" + "WiFi" together form a question about WiFi.

3a. What Attention Does

For each token position, attention asks: "Which other tokens in this sequence should I pay attention to, and what information should I pull from them?"

It does this through three learned projections — Query, Key, and Value:

Query (Q) — "What am I looking for?" — generated from the current token
Key (K) — "What do I contain?" — generated from every token
Value (V) — "What information do I carry?" — generated from every token

The attention score between two tokens = dot product of the Query of token A with the Key of token B. High score means "token A should pay attention to token B." The scores are normalized with softmax (so they sum to 1.0), then used to create a weighted average of all the Value vectors.
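The score-then-average mechanism in a few lines of NumPy (random Q/K/V stand in for the projected vectors; shapes match one 32-dim head over 5 tokens):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # scores[a, b] = dot(Query of token a, Key of token b), scaled by sqrt(head_dim)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores)       # each row sums to 1.0
    return weights @ V, weights     # weighted average of the Value vectors

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((5, 32)) for _ in range(3))
out, w = attention(Q, K, V)
print(out.shape, w.sum(axis=-1))  # (5, 32), every row ≈ 1.0
```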

3b. Concrete Example: "Q: What is WiFi?\nA:"

Here's what the attention pattern might look like at different layers for the final token "A:" (which needs to predict the first answer word):

Early Layer (Layer 1)

"A:" attends to → Q: 0.18 · What 0.15 · is 0.16 · WiFi 0.21 · A: 0.30
Diffuse attention — looking at everything roughly equally. Early layers haven't learned specific patterns yet. They gather raw information.

Middle Layer (Layer 10)

"A:" attends to → Q: 0.05 · What 0.15 · is 0.02 · WiFi 0.72 · A: 0.06
Focused attention — heavily attending to "WiFi". The model has learned that the topic word after "What is" is the key to selecting the right answer.

Late Layer (Layer 20)

"A:" attends to → Q: 0.35 · What 0.05 · is 0.02 · WiFi 0.33 · A: 0.25
Mixed attention — checking "Q:" (question format marker) + "WiFi" (topic) + itself (accumulated state). Late layers are combining the routed information into the final answer representation.

Key insight: Attention is how the model learns that "Q: What is WiFi?" should produce a different answer than "Q: What is BLE?". The middle layers focus on the distinguishing word and route different information into the residual stream depending on which topic word they find. With only 8 layers (dim=256), there aren't enough layers to build this early→middle→late pipeline.

4. What "Heads" Are and Why They Matter

4a. The Core Idea

Multi-head attention splits the dim-dimensional vector into parallel, independent attention operations called heads. Each head gets a slice of the vector — called head_dim — and runs its own Q/K/V attention on just those dimensions.

head_dim = dim ÷ n_heads

Think of each head as a specialist that can attend to a different thing at the same time. One head might focus on the topic word ("WiFi"), another on the question type ("What is"), another on the format markers ("Q:", "A:"), and another on positional patterns (what comes after what).

After all heads compute their attention outputs independently, the results are concatenated back into a single dim-sized vector and projected through one more weight matrix (the "output projection"). This merges the specialists' findings back into a unified representation.
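The split-and-concatenate plumbing is pure reshaping. A sketch for the dim=128, 4-head preset:

```python
import numpy as np

dim, n_heads = 128, 4
head_dim = dim // n_heads  # 32

x = np.arange(5 * dim, dtype=np.float32).reshape(5, dim)  # 5 token vectors
# Split: each head sees its own 32-dim slice of every token vector
heads = x.reshape(5, n_heads, head_dim).transpose(1, 0, 2)  # (4, 5, 32)
# ... each head would now run attention on its (5, 32) slice ...
# Concatenate: merge the heads back into one dim-sized vector per token
merged = heads.transpose(1, 0, 2).reshape(5, dim)           # (5, 128)
print(np.array_equal(merged, x))  # True — split + concat round-trips
```

In the real layer the merged vector then passes through the output projection, which mixes information across the heads' slices.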

4b. Head Dimension — The Key Tradeoff

More heads doesn't mean better. What matters is head_dim — how many dimensions each head works with. If head_dim is too small, each head can't form expressive enough Query/Key patterns to distinguish between concepts.

dim=128, 4 heads: H0 | H1 | H2 | H3 — 32 dims/head
dim=192, 6 heads: H0 | H1 | H2 | H3 | H4 | H5 — 32 dims/head
dim=256, 8 heads: H0 | H1 | H2 | H3 | H4 | H5 | H6 | H7 — 32 dims/head

All three presets use 32 dims per head. This is intentional. Research has shown 32 is a good minimum for expressive attention patterns. The difference is the number of parallel specialists: 4, 6, or 8 heads. More heads means the model can attend to more patterns simultaneously within a single layer.

4c. What Each Head Learns to Do

In practice, heads specialize during training. Here's what we'd typically see in a trained Q&A model:

Head    | Typical specialization | What it attends to
Head 0  | Topic detection        | The noun after "What is" / "How do I" — the key discriminating word
Head 1  | Format tracking        | The "Q:" and "A:" markers — knows where the question ends and the answer begins
Head 2  | Previous token         | Always looks at the immediately preceding token — important for local coherence
Head 3  | Question type          | "What is" vs "How do I" vs "Can I" — determines answer shape (definition vs command vs yes/no)
Head 4+ | Sub-topic / modifier   | Secondary words like "not working", "range", or "update" that modify the answer

Why 4 heads (dim=128) is enough for this task: With ~130 unique answers and simple Q: format questions, 4 parallel attention patterns per layer are sufficient. The model needs to detect: (1) the topic word, (2) the question format, (3) local token context, and (4) answer style. More heads would help if the questions were more complex (multi-hop reasoning, long context), but for short Q&A lookup, 4 is adequate.

4d. The Q/K/V Weight Matrices — How Heads Are Implemented

Under the hood, each head has three small weight matrices:

WQ : dim × head_dim — projects input into a Query vector
WK : dim × head_dim — projects input into a Key vector
WV : dim × head_dim — projects input into a Value vector

In GPT-2, these are packed into a single matrix c_attn of size dim × (3 × dim) for efficiency. The output projection c_proj is dim × dim. So the total attention parameters per layer = 4 × dim².

Model   | dim | 4 × dim² | Attn params/layer
dim=128 | 128 | 65,536   | 64 KB
dim=192 | 192 | 147,456  | 144 KB
dim=256 | 256 | 262,144  | 256 KB

Attention cost scales quadratically with dim — this is why wider models eat PSRAM so fast. Going from dim=128 to dim=256 quadruples the attention parameters per layer.
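The per-layer count can be checked directly from the matrix shapes:

```python
def attn_params_per_layer(dim):
    # c_attn packs WQ, WK, WV: dim × (3 × dim); c_proj is dim × dim
    return dim * (3 * dim) + dim * dim  # = 4 × dim²

for dim in (128, 192, 256):
    print(dim, attn_params_per_layer(dim))
# 128 65536
# 192 147456
# 256 262144
```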

5. The Feed-Forward Network (FFN) — The Knowledge Store

After attention lets tokens communicate, the FFN processes each token independently. Think of it as a lookup table: the model checks what concept the token vector currently represents, then adjusts it toward the correct answer pattern.

Structure: Two linear transformations with a GELU activation in between:

input (dim) → expand (FFN inner) → GELU → contract (dim) → output (dim)

The expansion factor is critical. For FFN=768 with dim=128, the inner dimension is 6× the model width: each token is temporarily projected into a much larger feature space, where patterns are detected, before being compressed back down to dim.

5a. What the FFN Neurons Do

Each neuron in the FFN inner layer acts as a feature detector. It activates (fires) when the input vector matches a particular pattern. With 768 neurons, the model has 768 "slots" to recognize different concepts per layer.

Input (dim = 128) → expand → Inner (FFN = 768) → contract → Output (dim = 128)

The GELU activation is what makes the FFN non-linear. It lets neurons partially activate (not just on/off), creating smoother feature detection. The sparse activation pattern above is typical — only ~10-30% of FFN neurons fire for any given input. This is how the model stores many patterns without them interfering with each other.
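The expand → GELU → contract sequence in NumPy (random small weights stand in for the learned c_fc / c_proj matrices; the GELU here is the tanh approximation used by GPT-2):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

dim, inner = 128, 768
rng = np.random.default_rng(2)
c_fc = rng.standard_normal((dim, inner)) * 0.02    # expand weights (stand-ins)
c_proj = rng.standard_normal((inner, dim)) * 0.02  # contract weights

x = rng.standard_normal((5, dim))
h = gelu(x @ c_fc)   # (5, 768): 768 feature detectors per token
out = h @ c_proj     # (5, 128): collapse back to model width
print(out.shape)  # (5, 128)
```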

5b. FFN Size vs Number of Layers

Here's the tradeoff at dim=192 when reducing FFN to buy more layers:

FFN | FFN params/layer | Total params/layer | Max layers | Feature slots/layer | Total feature slots
768 | 294,912          | ~443K              | 16         | 768                 | 12,288
640 | 245,760          | ~393K              | 18         | 640                 | 11,520
512 | 196,608          | ~344K              | 20         | 512                 | 10,240
384 | 147,456          | ~295K              | 24         | 384                 | 9,216

Total feature slots go down slightly as you narrow the FFN, but you gain depth. For a task with ~130 unique answers, even 384 slots per layer is far more than enough. The routing depth is the bottleneck, not storage capacity.

6. Complete Token Journey — Step by Step

Let's trace exactly what happens to the prompt Q: What is WiFi?\nA: through a 22-layer, dim=128 model. Every number is the actual shape of the data at that point.

1
Raw text → "Q: What is WiFi?\nA:"
Just a string of characters. No math yet.
2
BPE tokenization → [412, 87, 23, 156, 5]
5 token IDs. The tokenizer was trained on your hardwareone_rich.txt, so "WiFi" and "Q:" are single tokens (they appear frequently). Shape: [5] integers.
3
Token embedding lookup → each ID indexes into a 4096×128 table.
Token 412 → row 412 of the table → a vector of 128 floats.
Shape: [5, 128] — 5 tokens, each with 128 dimensions.
4
Add position embeddings → token at position 0 gets row 0 of the 128×128 position table, position 1 gets row 1, etc.
Shape: still [5, 128] but now each vector encodes both "what" (token identity) and "where" (position in sequence).
enters the layer stack
5
Layer 0 — LayerNorm 1
Normalize each token vector to zero mean, unit variance. Learned scale+bias (128 params each). Prevents signal from exploding or vanishing.
Shape: [5, 128] (unchanged)
6
Layer 0 — Multi-Head Self-Attention

6a. All 5 token vectors are projected through c_attn (128×384 matrix) to produce Q, K, V. Shape of each: [5, 128].

6b. Split into 4 heads. Each head gets [5, 32] for its Q, K, V.

6c. Each head computes attention scores: Q × Kᵀ → a [5, 5] matrix (every token vs every token). A causal mask sets future positions to -infinity so token 3 can only see tokens 0-3, never token 4. Scores are divided by √32 for stability, then softmax'd to sum to 1.0.

6d. Scores × V → weighted sum of values. Shape: [5, 32] per head.

6e. Concatenate all 4 heads: [5, 128]. Project through c_proj (128×128). Output shape: [5, 128].

6f. ADD to residual stream (the original input from step 5). This is the "+" in the residual connection.
7
Layer 0 — LayerNorm 2
Normalize again before the FFN. Shape: [5, 128].
8
Layer 0 — Feed-Forward Network

8a. Each token independently: multiply by c_fc (128×768) → [5, 768]. Add bias.

8b. Apply GELU activation. ~70% of the 768 values become near-zero. The remaining ~30% encode which "features" this token matched.

8c. Multiply by c_proj (768×128) → [5, 128]. This collapses the sparse activations back to model width, writing the result into the token vector.

8d. ADD to residual stream.
Layers 1 through 21 — repeat steps 5-8
Each layer reads the updated residual stream and adds its own refinements. By layer 10-15, the token vectors no longer represent individual words — they encode abstract concepts like "this is a definitional question about a wireless protocol." By layer 20-21, the final token's vector has been shaped to produce the correct answer.

The shape NEVER changes: [5, 128] throughout all 22 layers.
9
Final LayerNorm
One last normalization pass. Shape: [5, 128].
10
LM Head — take ONLY the last token's vector (position 4, the "A:" token).
Multiply by embedding table transposed (128×4096) → [4096] logits — one score per vocabulary entry.

Softmax converts to probabilities. The token with the highest probability becomes the first output word. Suppose it's "WiFi" (token 156, probability 0.82).

Append token 156 to the sequence. The input is now [412, 87, 23, 156, 5, 156] — shape [6].
Run the entire pipeline again (steps 3-10) to generate the second token. Repeat until end-of-sequence or token limit (80 tokens).
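Step 10 (the LM head) in NumPy, with a random matrix standing in for the learned embedding table:

```python
import numpy as np

dim, vocab = 128, 4096
rng = np.random.default_rng(3)
tok_emb = rng.standard_normal((vocab, dim))  # stand-in for the learned table

final_vec = rng.standard_normal(dim)   # the "A:" token's vector after all layers
logits = tok_emb @ final_vec           # weight tying: (4096, 128) @ (128,) → (4096,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                   # softmax → probabilities over the vocab
next_id = int(np.argmax(probs))        # greedy pick of the first answer token
print(logits.shape, round(float(probs.sum()), 6))  # (4096,) 1.0
```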

7. The KV Cache — Why Generation Doesn't Recompute Everything

Generating each new token would naively require running all previous tokens through the entire model again. The KV cache avoids this.

After computing the K and V vectors for each token in each layer, they're stored in memory. When generating the next token, only the NEW token needs to run through the model — it computes its own Q, then attends to the cached K/V from all previous tokens.

Cache size per layer = 2 (K+V) × seq_len × dim bytes (INT8)

Model              | Cache/layer | Total KV cache
dim=128, 22 layers | 32 KB       | 704 KB
dim=192, 12 layers | 48 KB       | 576 KB
dim=192, 20 layers | 48 KB       | 960 KB
dim=256, 8 layers  | 64 KB       | 512 KB

The KV cache is allocated in PSRAM alongside the model weights. A dim=192 model with 20 layers uses 960 KB for KV cache — this must fit within the ~8 MB budget along with the model weights.
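The table follows directly from the formula above (seq_len = 128, INT8 = 1 byte per value):

```python
def kv_cache_bytes(dim, n_layers, seq_len=128):
    # 2 tensors (K and V) × seq_len positions × dim values, 1 byte each (INT8)
    return 2 * seq_len * dim * n_layers

for dim, layers in [(128, 22), (192, 12), (192, 20), (256, 8)]:
    print(f"dim={dim}, {layers} layers: {kv_cache_bytes(dim, layers) // 1024} KB")
# dim=128, 22 layers: 704 KB
# dim=192, 12 layers: 576 KB
# dim=192, 20 layers: 960 KB
# dim=256, 8 layers: 512 KB
```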

8. Head Count Comparison Across Architectures

dim=128 — 4 heads

Dimensions per head: 32
Parallel attention patterns: 4 per layer
Total across 22 layers: 88 attention ops
Attention params/layer: 65,536
Q×K matrix size: [5,5] × 4 heads
Can track simultaneously: 4 patterns

dim=192 — 6 heads

Dimensions per head: 32
Parallel attention patterns: 6 per layer
Total across 20 layers*: 120 attention ops
Attention params/layer: 147,456
Q×K matrix size: [5,5] × 6 heads
Can track simultaneously: 6 patterns

dim=256 — 8 heads

Dimensions per head: 32
Parallel attention patterns: 8 per layer
Total across 8 layers: 64 attention ops
Attention params/layer: 262,144
Q×K matrix size: [5,5] × 8 heads
Can track simultaneously: 8 patterns

*dim=192 shown with 20 layers (FFN=512 variant)

Total attention operations across the full model: The dim=128 model with 22 layers gets 88 total attention operations (4 heads × 22 layers). The dim=192/20-layer variant gets 120 (6 × 20). The dim=256 model only gets 64 (8 × 8). Even though dim=256 has more heads per layer, the lack of depth means it has fewer total attention opportunities — and it's the total that determines how well the model can build up complex question→answer routing.

9. Why Depth Wins for Q&A Routing

A useful mental model for what each group of layers does during Q&A:

Layer group | Function | dim=128 (22 layers) | dim=192 (20 layers) | dim=256 (8 layers)
Encoding    | Build word-level representations. "WiFi" becomes a concept, not just characters. | Layers 0-5 (6 layers) | Layers 0-4 (5 layers) | Layers 0-2 (3 layers)
Routing     | Match question pattern to answer pattern. Suppress competing answers. The hardest job. | Layers 6-17 (12 layers) | Layers 5-15 (11 layers) | Layers 3-5 (3 layers)
Generation  | Format the output tokens. Convert abstract answer representation into actual token predictions. | Layers 18-21 (4 layers) | Layers 16-19 (4 layers) | Layers 6-7 (2 layers)

This is why the dim=256 model fails. It has only ~3 layers for the routing phase — the step where "What is WiFi?" needs to be separated from "What is BLE?" and "WiFi not connecting" and the 25 other entries that mention WiFi. With 3 layers, the model can't build enough discrimination, so it defaults to the most common answer in the topic cluster. The dim=128 model gets 12 routing layers and the dim=192 gets 11. That's the difference between correct and wrong answers.

10. Summary — What Controls What

Parameter | Controls | Analogy
dim | Width of every vector. How many "dimensions" the model has to describe each token. Affects how well it can separate similar concepts. | Size of the whiteboard each token carries
n_heads | How many independent attention patterns per layer. Each head is a "specialist" that can focus on a different relationship. | Number of people reading the sequence simultaneously
FFN (n_inner) | Width of the knowledge store. How many feature patterns can be detected per layer. Where factual associations are stored. | Size of the filing cabinet at each processing step
n_layers | Depth of processing. How many sequential refinement steps the model gets. Each layer builds on all previous layers' work. | Number of rounds of review before the final answer
seq_len | Maximum context window. How many tokens can be in the prompt + answer combined. Fixed at 128 for all our presets. | Length of the desk — how many papers fit side by side
vocab_size | Size of the token dictionary. Larger vocab = fewer tokens per sentence (less splitting), but bigger embedding table. | Size of the dictionary the model knows words from
head_dim | dim ÷ n_heads. How "smart" each individual attention head is. Below 32, heads struggle to form useful patterns. | IQ of each individual reader

11. Causal Masking — Why the Model Can't Cheat

During attention, each token can only attend to tokens at the same or earlier positions. This is enforced by a causal mask — a triangular matrix that sets all "future" positions to negative infinity before softmax.

Attention mask for the 5-token sequence (✓ = may attend, · = masked):

       Q:  What  is  WiFi  A:
Q:     ✓    ·    ·    ·    ·
What   ✓    ✓    ·    ·    ·
is     ✓    ✓    ✓    ·    ·
WiFi   ✓    ✓    ✓    ✓    ·
A:     ✓    ✓    ✓    ✓    ✓

This is why the "A:" token is the most important position. It's the only token that can see the entire question. "Q:" can only see itself. "WiFi" can see everything before it but not "A:". The model must route ALL question information into the "A:" position through attention — this is the bottleneck that determines answer quality.
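The triangular mask can be built and applied in a few lines (equal raw scores used here just to make the effect visible):

```python
import numpy as np

n = 5
mask = np.tril(np.ones((n, n), dtype=bool))  # lower triangle: past + self only

scores = np.zeros((n, n))        # pretend all raw scores are equal
scores[~mask] = -np.inf          # future positions → -infinity before softmax
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
print(weights[0])   # [1. 0. 0. 0. 0.] — "Q:" can only see itself
print(weights[-1])  # [0.2 0.2 0.2 0.2 0.2] — "A:" sees the whole question
```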

12. Generation — The Autoregressive Loop

The model generates one token at a time. Each token is appended to the sequence, then the model runs again. The KV cache avoids recomputing previous tokens.

Step 0 input: Q: What is WiFi?\nA: → predict → WiFi
Step 1 input: Q: What is WiFi?\nA: WiFi → predict → is
Step 2 input: Q: What is WiFi?\nA: WiFi is → predict → a
Step 3 input: Q: What is WiFi?\nA: WiFi is a → predict → wireless
… continues until EOS token or 80-token limit

At each step, only the newest token runs through the full model. Previous tokens' K and V are read from cache. This makes generation O(n) per token instead of O(n²) — critical on the ESP32-S3 where every millisecond counts.
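The loop itself is simple; a sketch in which `model_step` is a hypothetical stand-in for one cached forward pass:

```python
# model_step is a hypothetical stand-in for one forward pass that returns
# the next token ID; with a KV cache only ids[-1] is actually recomputed.
def generate(prompt_ids, model_step, eos_id, max_new=80):
    ids = list(prompt_ids)
    for _ in range(max_new):
        nxt = model_step(ids)
        if nxt == eos_id:   # end-of-sequence signal
            break
        ids.append(nxt)     # append, then predict again
    return ids

# Toy model: always predicts previous token + 1, emits EOS at 9
print(generate([1, 2, 3], lambda ids: ids[-1] + 1, eos_id=9))
# [1, 2, 3, 4, 5, 6, 7, 8]
```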

Hardware One LLM — Transformer Architecture Reference
Companion to architecture_comparison.html