1. The Full Pipeline — Bird's-Eye View
When you type Q: What is WiFi? into the device, here is every stage the text passes through before the model produces the first token of an answer. Every stage executes on the ESP32-S3, with the model weights and activations held in PSRAM.
The firmware wraps your question into the Q:/A: format the model was trained on, producing Q: What is WiFi?\nA:, then passes the raw characters to the tokenizer.
Multi-Head Self-Attention — tokens look at each other
Feed-Forward Network (FFN) — each token is processed independently
With 22 layers, the token vectors are transformed 22 times. Each transformation refines the representations, building from raw word identity toward question-answer associations.
The highest-scoring token becomes the first word of the answer. The model then appends that token to the sequence and runs the whole pipeline again to generate the next token, and so on, until it produces an end-of-sequence signal or hits the token limit.
2. Inside a Single Transformer Layer
Every layer has the same internal structure. The key concept is the residual stream — a highway of information that passes straight through every layer. Each sub-stage (attention, FFN) reads from this stream, computes something, and adds its result back. This means early layers can pass information directly to later layers without it being destroyed.
token vectors from previous layer (or embedding) → Self-Attention (tokens look at each other) → Feed-Forward Network (per-token processing) → updated vectors → next layer
Why the + (addition) matters: If attention or the FFN produces garbage in early training, the + means the original information still gets through. The model can learn to "do nothing" in a layer by outputting zeros. This is why deeper models (more layers) don't automatically hurt — unused layers just pass data through. But each useful layer gets one more chance to refine the representations.
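The residual update can be sketched in a few lines (NumPy used for illustration; attn, ffn, and the layer-norm arguments are hypothetical stand-ins, not the firmware's actual functions):

```python
import numpy as np

def layer(x, attn, ffn, ln1, ln2):
    # Residual stream: each sub-stage ADDs its output to x rather than
    # replacing it, so information from earlier layers survives intact.
    x = x + attn(ln1(x))   # attention reads the stream, writes back via +
    x = x + ffn(ln2(x))    # FFN does the same
    return x

# A sub-stage that outputs zeros "does nothing": the input passes through unchanged.
identity = lambda v: v
zeros = lambda v: np.zeros_like(v)
x = np.random.randn(5, 128)
assert np.allclose(layer(x, zeros, zeros, identity, identity), x)
```

This is the sense in which an unused layer is harmless: learning to output zeros makes the layer an identity map on the stream.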
3. Multi-Head Self-Attention — In Detail
This is the most important and most complex part of the transformer. Attention is how tokens communicate with each other. Without it, each token would be processed in isolation and the model could never understand that "What" + "is" + "WiFi" together form a question about WiFi.
3a. What Attention Does
For each token position, attention asks: "Which other tokens in this sequence should I pay attention to, and what information should I pull from them?"
It does this through three learned projections — Query, Key, and Value:
Query (Q) — "What am I looking for?" — generated from the current token
Key (K) — "What do I contain?" — generated from every token
Value (V) — "What information do I carry?" — generated from every token
The attention score between two tokens = dot product of the Query of token A with the Key of token B. High score means "token A should pay attention to token B." The scores are normalized with softmax (so they sum to 1.0), then used to create a weighted average of all the Value vectors.
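A minimal NumPy sketch of this computation for one head (random vectors stand in for the real learned Q/K/V projections):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # score[a, b] = how much token a attends to token b
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # dot products, scaled by sqrt(head_dim)
    weights = softmax(scores)                 # each row now sums to 1.0
    return weights @ V                        # weighted average of Value vectors

Q = np.random.randn(5, 32)   # 5 tokens, head_dim = 32
K = np.random.randn(5, 32)
V = np.random.randn(5, 32)
out = attention(Q, K, V)
assert out.shape == (5, 32)
```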
3b. Concrete Example: "Q: What is WiFi?\nA:"
Here's what the attention pattern might look like at different layers for the final token "A:" (which needs to predict the first answer word):
[Attention heatmaps: Early Layer (Layer 1) · Middle Layer (Layer 10) · Late Layer (Layer 20)]
Key insight: Attention is how the model learns that "Q: What is WiFi?" should produce a different answer than "Q: What is BLE?". The middle layers focus on the distinguishing word and route different information into the residual stream depending on which topic word they find. With only 8 layers (dim=256), there aren't enough layers to build this early→middle→late pipeline.
4. What "Heads" Are and Why They Matter
4a. The Core Idea
Multi-head attention splits the dim-dimensional vector into parallel, independent attention operations called heads. Each head gets a slice of the vector — called head_dim — and runs its own Q/K/V attention on just those dimensions.
head_dim = dim ÷ n_heads
Think of each head as a specialist that can attend to a different thing at the same time. One head might focus on the topic word ("WiFi"), another on the question type ("What is"), another on the format markers ("Q:", "A:"), and another on positional patterns (what comes after what).
After all heads compute their attention outputs independently, the results are concatenated back into a single dim-sized vector and projected through one more weight matrix (the "output projection"). This merges the specialists' findings back into a unified representation.
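The split-and-concatenate mechanics can be sketched as follows (NumPy; a placeholder marks where each head's attention would run, and W_O is an illustrative random matrix, not trained weights):

```python
import numpy as np

dim, n_heads = 128, 4
head_dim = dim // n_heads                    # 32 dims per head
x = np.random.randn(5, dim)                  # 5 token vectors

# Split the dim-wide vector into n_heads independent slices.
heads_in = x.reshape(5, n_heads, head_dim)   # [5, 4, 32]

# (each head would run its own Q/K/V attention on its [5, 32] slice here)
heads_out = heads_in                         # placeholder: identity per head

# Concatenate the heads back into one dim-wide vector, then apply the
# output projection to merge the specialists' findings.
concat = heads_out.reshape(5, dim)           # [5, 128]
W_O = np.random.randn(dim, dim) * 0.02       # "c_proj" in GPT-2 naming
merged = concat @ W_O
assert merged.shape == (5, dim)
```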
4b. Head Dimension — The Key Tradeoff
More heads doesn't mean better. What matters is head_dim — how many dimensions each head works with. If head_dim is too small, each head can't form expressive enough Query/Key patterns to distinguish between concepts.
All three presets use 32 dims per head. This is intentional: in practice, head_dim = 32 is a widely used floor for expressive attention patterns. The difference is the number of parallel specialists: 4, 6, or 8 heads. More heads means the model can attend to more patterns simultaneously within a single layer.
4c. What Each Head Learns to Do
In practice, heads specialize during training. Here's what we'd typically see in a trained Q&A model:
| Head | Typical Specialization | What It Attends To |
|---|---|---|
| Head 0 | Topic detection | The noun after "What is" / "How do I" — the key discriminating word |
| Head 1 | Format tracking | The "Q:" and "A:" markers — knows where question ends and answer begins |
| Head 2 | Previous token | Always looks at the immediately preceding token — important for local coherence |
| Head 3 | Question type | "What is" vs "How do I" vs "Can I" — determines answer shape (definition vs command vs yes/no) |
| Head 4+ | Sub-topic / modifier | Secondary words like "not working" or "range" or "update" that modify the answer |
Why 4 heads (dim=128) is enough for this task: With ~130 unique answers and simple Q: format questions, 4 parallel attention patterns per layer are sufficient. The model needs to detect: (1) the topic word, (2) the question format, (3) local token context, and (4) answer style. More heads would help if the questions were more complex (multi-hop reasoning, long context), but for short Q&A lookup, 4 is adequate.
4d. The Q/K/V Weight Matrices — How Heads Are Implemented
Under the hood, each head has three small weight matrices:
WQ : dim × head_dim — projects input into a Query vector
WK : dim × head_dim — projects input into a Key vector
WV : dim × head_dim — projects input into a Value vector
In GPT-2, these are packed into a single matrix c_attn of size dim × (3 × dim) for efficiency. The output projection c_proj is dim × dim. So the total attention parameters per layer = 4 × dim².
| Model | dim | Attn params/layer (4 × dim²) | Size (INT8) |
|---|---|---|---|
| dim=128 | 128 | 65,536 | 64 KB |
| dim=192 | 192 | 147,456 | 144 KB |
| dim=256 | 256 | 262,144 | 256 KB |
Attention cost scales quadratically with dim — this is why wider models eat PSRAM so fast. Going from dim=128 to dim=256 quadruples the attention parameters per layer.
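The quadratic scaling is easy to verify (plain Python; the function name is ours, not from the firmware):

```python
# Attention parameters per layer: c_attn (dim x 3*dim) + c_proj (dim x dim) = 4*dim^2
def attn_params_per_layer(dim):
    return dim * (3 * dim) + dim * dim

assert attn_params_per_layer(128) == 65_536
assert attn_params_per_layer(192) == 147_456
assert attn_params_per_layer(256) == 262_144
# Doubling dim quadruples the cost:
assert attn_params_per_layer(256) == 4 * attn_params_per_layer(128)
```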
5. The Feed-Forward Network (FFN) — The Knowledge Store
After attention lets tokens communicate, the FFN processes each token independently. Think of it as a lookup table: the model checks what concept the token vector currently represents, then adjusts it toward the correct answer pattern.
Structure: Two linear transformations with a GELU activation in between:
input (dim) → expand (FFN inner) → GELU → contract (dim) → output (dim)
The expansion factor is critical. For FFN=768 with dim=128, the inner dimension is 6× the model width. This means:
- The first matrix (c_fc): dim × FFN = 128 × 768 = 98,304 parameters
- The second matrix (c_proj): FFN × dim = 768 × 128 = 98,304 parameters
- Total FFN parameters per layer: 2 × dim × FFN
5a. What the FFN Neurons Do
Each neuron in the FFN inner layer acts as a feature detector. It activates (fires) when the input vector matches a particular pattern. With 768 neurons, the model has 768 "slots" to recognize different concepts per layer.
The GELU activation is what makes the FFN non-linear. It lets neurons partially activate (not just on/off), creating smoother feature detection. A sparse activation pattern is typical: only ~10-30% of FFN neurons fire for any given input. This is how the model stores many patterns without them interfering with each other.
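A sketch of the FFN forward pass (NumPy with random illustrative weights; the GELU here is the tanh approximation GPT-2 uses):

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

dim, ffn = 128, 768
x = np.random.randn(5, dim)
W_fc = np.random.randn(dim, ffn) * 0.02     # c_fc: expand 128 -> 768
W_proj = np.random.randn(ffn, dim) * 0.02   # c_proj: contract 768 -> 128

h = gelu(x @ W_fc)          # [5, 768] - each inner neuron is a feature detector
out = h @ W_proj            # [5, 128] - collapse back to model width
assert h.shape == (5, ffn) and out.shape == (5, dim)
```

Negative pre-activations are squashed toward zero by GELU, which is what produces the sparse firing pattern described above.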
5b. FFN Size vs Number of Layers
Here's the tradeoff at dim=192 when reducing FFN to buy more layers:
| FFN | FFN params/layer | Total params/layer | Max layers | Feature slots/layer | Total feature slots |
|---|---|---|---|---|---|
| 768 | 294,912 | ~443K | 16 | 768 | 12,288 |
| 640 | 245,760 | ~393K | 18 | 640 | 11,520 |
| 512 | 196,608 | ~344K | 20 | 512 | 10,240 |
| 384 | 147,456 | ~295K | 24 | 384 | 9,216 |
Total feature slots shrink as you narrow the FFN, but you gain depth in exchange. For a task with ~130 unique answers, even 384 slots per layer is far more than enough. The routing depth is the bottleneck, not storage capacity.
6. Complete Token Journey — Step by Step
Let's trace exactly what happens to the prompt Q: What is WiFi?\nA: through a 22-layer, dim=128 model. Every number is the actual shape of the data at that point.
Just a string of characters. No math yet.
5 token IDs. The tokenizer was trained on your hardwareone_rich.txt, so "WiFi" and "Q:" are single tokens (they appear frequently). Shape: [5] integers.
Each token ID then selects a row of the 4096 × 128 embedding table, giving a [5, 128] matrix of token vectors.
Normalize each token vector to zero mean, unit variance. Learned scale+bias (128 params each). Prevents the signal from exploding or vanishing. Shape: [5, 128] (unchanged).
6a. All 5 token vectors are projected through c_attn (a 128×384 matrix) to produce Q, K, V. Shape of each: [5, 128].
6b. Split into 4 heads. Each head gets [5, 32] for its Q, K, V.
6c. Each head computes attention scores: Q × Kᵀ → a [5, 5] matrix (every token vs every token). A causal mask sets future positions to -infinity so token 3 can only see tokens 0-3, never token 4. Scores are divided by √32 for stability, then softmax'd to sum to 1.0.
6d. Scores × V → weighted sum of values. Shape: [5, 32] per head.
6e. Concatenate all 4 heads: [5, 128]. Project through c_proj (128×128). Output shape: [5, 128].
6f. ADD to residual stream (the original input from step 5). This is the "+" in the residual connection.
Normalize again before the FFN. Shape: [5, 128].
8a. Each token independently: multiply by c_fc (128×768) → [5, 768]. Add bias.
8b. Apply GELU activation. ~70% of the 768 values become near-zero. The remaining ~30% encode which "features" this token matched.
8c. Multiply by c_proj (768×128) → [5, 128]. This collapses the sparse activations back to model width, writing the result into the token vector.
8d. ADD to residual stream.
Each layer reads the updated residual stream and adds its own refinements. By layer 10-15, the token vectors no longer represent individual words — they encode abstract concepts like "this is a definitional question about a wireless protocol." By layer 20-21, the final token's vector has been shaped to produce the correct answer.
The shape NEVER changes: [5, 128] throughout all 22 layers.
One last normalization pass. Shape: [5, 128].
Multiply by the embedding table transposed (128×4096) → [4096] logits, one score per vocabulary entry.
Softmax converts the logits to probabilities. The token with the highest probability becomes the first output word. Suppose it's "WiFi" (token 156, probability 0.82).
Append token 156 to the sequence. The input is now [412, 87, 23, 156, 5, 156], shape [6].
Run the entire pipeline again (steps 3-10) to generate the second token. Repeat until end-of-sequence or token limit (80 tokens).
7. The KV Cache — Why Generation Doesn't Recompute Everything
Generating each new token would naively require running all previous tokens through the entire model again. The KV cache avoids this.
After computing the K and V vectors for each token in each layer, they're stored in memory. When generating the next token, only the NEW token needs to run through the model — it computes its own Q, then attends to the cached K/V from all previous tokens.
Cache size per layer = 2 (K+V) × seq_len × dim bytes (INT8)
| Model | Cache/layer | Total KV cache |
|---|---|---|
| dim=128, 22 layers | 32 KB | 704 KB |
| dim=192, 12 layers | 48 KB | 576 KB |
| dim=192, 20 layers | 48 KB | 960 KB |
| dim=256, 8 layers | 64 KB | 512 KB |
The KV cache is allocated in PSRAM alongside the model weights. A dim=192 model with 20 layers uses 960 KB for KV cache — this must fit within the ~8 MB budget along with the model weights.
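The cache formula can be checked directly (plain Python; the helper name is ours):

```python
# KV cache size (INT8): 2 vectors (K and V) x seq_len x dim bytes per layer
def kv_cache_bytes(dim, n_layers, seq_len=128):
    per_layer = 2 * seq_len * dim
    return per_layer, per_layer * n_layers

per_layer, total = kv_cache_bytes(dim=192, n_layers=20)
assert per_layer == 48 * 1024    # 48 KB per layer
assert total == 960 * 1024       # 960 KB total, as in the table above
```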
8. Head Count Comparison Across Architectures
dim=128 — 4 heads
dim=192 — 6 heads
dim=256 — 8 heads
*dim=192 shown with 20 layers (FFN=512 variant)
Total attention operations across the full model: The dim=128 model with 22 layers gets 88 total attention operations (4 heads × 22 layers). The dim=192/20-layer variant gets 120 (6 × 20). The dim=256 model only gets 64 (8 × 8). Even though dim=256 has more heads per layer, the lack of depth means it has fewer total attention opportunities — and it's the total that determines how well the model can build up complex question→answer routing.
9. Why Depth Wins for Q&A Routing
A useful mental model for what each group of layers does during Q&A:
| Layer group | Function | dim=128 (22 layers) | dim=192 (20 layers) | dim=256 (8 layers) |
|---|---|---|---|---|
| Encoding | Build word-level representations. "WiFi" becomes a concept, not just characters. | Layers 0-5 (6 layers) | Layers 0-4 (5 layers) | Layers 0-2 (3 layers) |
| Routing | Match question pattern to answer pattern. Suppress competing answers. The hardest job. | Layers 6-17 (12 layers) | Layers 5-15 (11 layers) | Layers 3-5 (3 layers) |
| Generation | Format the output tokens. Convert abstract answer representation into actual token predictions. | Layers 18-21 (4 layers) | Layers 16-19 (4 layers) | Layers 6-7 (2 layers) |
This is why the dim=256 model fails. It has only ~3 layers for the routing phase — the step where "What is WiFi?" needs to be separated from "What is BLE?" and "WiFi not connecting" and the 25 other entries that mention WiFi. With 3 layers, the model can't build enough discrimination, so it defaults to the most common answer in the topic cluster. The dim=128 model gets 12 routing layers and the dim=192 gets 11. That's the difference between correct and wrong answers.
10. Summary — What Controls What
| Parameter | Controls | Analogy |
|---|---|---|
| dim | Width of every vector. How many "dimensions" the model has to describe each token. Affects how well it can separate similar concepts. | Size of the whiteboard each token carries |
| n_heads | How many independent attention patterns per layer. Each head is a "specialist" that can focus on a different relationship. | Number of people reading the sequence simultaneously |
| FFN (n_inner) | Width of the knowledge store. How many feature patterns can be detected per layer. Where factual associations are stored. | Size of the filing cabinet at each processing step |
| n_layers | Depth of processing. How many sequential refinement steps the model gets. Each layer builds on all previous layers' work. | Number of rounds of review before the final answer |
| seq_len | Maximum context window. How many tokens can be in the prompt + answer combined. Fixed at 128 for all our presets. | Length of the desk — how many papers fit side by side |
| vocab_size | Size of the token dictionary. Larger vocab = fewer tokens per sentence (less splitting), but bigger embedding table. | Size of the dictionary the model knows words from |
| head_dim | dim ÷ n_heads. How "smart" each individual attention head is. Below 32, heads struggle to form useful patterns. | IQ of each individual reader |
11. Causal Masking — Why the Model Can't Cheat
During attention, each token can only attend to tokens at the same or earlier positions. This is enforced by a causal mask — a triangular matrix that sets all "future" positions to negative infinity before softmax.
This is why the "A:" token is the most important position. It's the only token that can see the entire question. "Q:" can only see itself. "WiFi" can see everything before it but not "A:". The model must route ALL question information into the "A:" position through attention — this is the bottleneck that determines answer quality.
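A small NumPy sketch of the mask for the 5-token prompt (uniform scores are used so the effect of the mask alone is visible):

```python
import numpy as np

seq_len = 5                      # "Q:", "What", "is", "WiFi?", "A:"
scores = np.zeros((seq_len, seq_len))

# Triangular mask: position i may only attend to positions 0..i.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# After softmax, masked (future) positions get exactly zero weight.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
assert np.allclose(weights[0], [1, 0, 0, 0, 0])   # "Q:" sees only itself
assert np.allclose(weights[-1], [0.2] * 5)        # "A:" sees the whole question
```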
12. Generation — The Autoregressive Loop
The model generates one token at a time. Each token is appended to the sequence, then the model runs again. The KV cache avoids recomputing previous tokens.
At each step, only the newest token runs through the full model. Previous tokens' K and V are read from cache. This makes generation O(n) per token instead of O(n²) — critical on the ESP32-S3 where every millisecond counts.
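The loop can be sketched with a toy stand-in for the model (the cache dict is a placeholder for per-layer K/V storage; a real step function would append the new token's K and V to it):

```python
import numpy as np

def generate(model_step, prompt_ids, max_tokens=80, eos_id=0):
    # Autoregressive loop: feed one NEW token per step; the model would read
    # cached K/V for all earlier tokens instead of recomputing them.
    ids = list(prompt_ids)
    cache = {}                                # per-layer K/V tensors live here
    for tok in ids:                           # prefill: build cache from the prompt
        logits = model_step(tok, cache)
    while len(ids) < max_tokens:
        next_id = int(np.argmax(logits))      # greedy: highest-probability token
        if next_id == eos_id:
            break
        ids.append(next_id)
        logits = model_step(next_id, cache)   # only the new token runs the model
    return ids

# Toy "model": always predicts (token + 1) mod 10; predicting 0 acts as EOS.
toy = lambda tok, cache: np.eye(10)[(tok + 1) % 10]
assert generate(toy, [4], max_tokens=8) == [4, 5, 6, 7, 8, 9]
```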
Companion to architecture_comparison.html