1. The Full Pipeline — Bird's-Eye View
When you type Q: What is WiFi? into the device, here is every stage the text passes through before the model produces the first token of an answer. Every stage executes on the ESP32-S3, with the model weights and activations held in PSRAM.
The firmware wraps your question into the Q:/A: format the model was trained on, producing Q: What is WiFi?\nA:, then passes the raw characters to the tokenizer.
Multi-Head Self-Attention — tokens look at each other
Feed-Forward Network (FFN) — each token is processed independently
With 22 layers, the token vectors are transformed 22 times. Each transformation refines the representations, building from raw word identity toward question-answer associations.
The highest-scoring token becomes the first word of the answer. The model then appends that token to the sequence and runs the whole pipeline again to generate the next token, and so on, until it produces an end-of-sequence signal or hits the token limit.
2. Inside a Single Transformer Layer
Every layer has the same internal structure. The key concept is the residual stream — a highway of information that passes straight through every layer. Each sub-stage (attention, FFN) reads from this stream, computes something, and adds its result back. This means early layers can pass information directly to later layers without it being destroyed.
token vectors from previous layer (or embedding) → Self-Attention (tokens look at each other) → Feed-Forward Network (per-token processing) → updated vectors → next layer
Why the + (addition) matters: If attention or the FFN produces garbage in early training, the + means the original information still gets through. The model can learn to "do nothing" in a layer by outputting zeros. This is why deeper models (more layers) don't automatically hurt — unused layers just pass data through. But each useful layer gets one more chance to refine the representations.
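The residual update can be sketched in a few lines (NumPy used for illustration; attn, ffn, and the layer-norm arguments are hypothetical stand-ins, not the firmware's actual functions):

```python
import numpy as np

def layer(x, attn, ffn, ln1, ln2):
    # Residual stream: each sub-stage ADDs its output to x rather than
    # replacing it, so information from earlier layers survives intact.
    x = x + attn(ln1(x))   # attention reads the stream, writes back via +
    x = x + ffn(ln2(x))    # FFN does the same
    return x

# A sub-stage that outputs zeros "does nothing": the input passes through unchanged.
identity = lambda v: v
zeros = lambda v: np.zeros_like(v)
x = np.random.randn(5, 128)
assert np.allclose(layer(x, zeros, zeros, identity, identity), x)
```

This is the sense in which an unused layer is harmless: learning to output zeros makes the layer an identity map on the stream.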
3. Multi-Head Self-Attention — In Detail
This is the most important and most complex part of the transformer. Attention is how tokens communicate with each other. Without it, each token would be processed in isolation and the model could never understand that "What" + "is" + "WiFi" together form a question about WiFi.
3a. What Attention Does
For each token position, attention asks: "Which other tokens in this sequence should I pay attention to, and what information should I pull from them?"
It does this through three learned projections — Query, Key, and Value:
Query (Q) — "What am I looking for?" — generated from the current token
Key (K) — "What do I contain?" — generated from every token
Value (V) — "What information do I carry?" — generated from every token
The attention score between two tokens = dot product of the Query of token A with the Key of token B. High score means "token A should pay attention to token B." The scores are normalized with softmax (so they sum to 1.0), then used to create a weighted average of all the Value vectors.
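A minimal NumPy sketch of this computation for one head (random vectors stand in for the real learned Q/K/V projections):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # score[a, b] = how much token a attends to token b
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # dot products, scaled by sqrt(head_dim)
    weights = softmax(scores)                 # each row now sums to 1.0
    return weights @ V                        # weighted average of Value vectors

Q = np.random.randn(5, 32)   # 5 tokens, head_dim = 32
K = np.random.randn(5, 32)
V = np.random.randn(5, 32)
out = attention(Q, K, V)
assert out.shape == (5, 32)
```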
3b. Concrete Example: "Q: What is WiFi?\nA:"
Here's what the attention pattern might look like at different layers for the final token "A:" (which needs to predict the first answer word):
[Attention heatmaps: Early Layer (Layer 1) · Middle Layer (Layer 10) · Late Layer (Layer 20)]
Key insight: Attention is how the model learns that "Q: What is WiFi?" should produce a different answer than "Q: What is BLE?". The middle layers focus on the distinguishing word and route different information into the residual stream depending on which topic word they find. With only 8 layers (dim=256), there aren't enough layers to build this early→middle→late pipeline.
4. What "Heads" Are and Why They Matter
4a. The Core Idea
Multi-head attention splits the dim-dimensional vector into parallel, independent attention operations called heads. Each head gets a slice of the vector — called head_dim — and runs its own Q/K/V attention on just those dimensions.
head_dim = dim ÷ n_heads
Think of each head as a specialist that can attend to a different thing at the same time. One head might focus on the topic word ("WiFi"), another on the question type ("What is"), another on the format markers ("Q:", "A:"), and another on positional patterns (what comes after what).
After all heads compute their attention outputs independently, the results are concatenated back into a single dim-sized vector and projected through one more weight matrix (the "output projection"). This merges the specialists' findings back into a unified representation.
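The split-and-concatenate mechanics can be sketched as follows (NumPy; a placeholder marks where each head's attention would run, and W_O is an illustrative random matrix, not trained weights):

```python
import numpy as np

dim, n_heads = 128, 4
head_dim = dim // n_heads                    # 32 dims per head
x = np.random.randn(5, dim)                  # 5 token vectors

# Split the dim-wide vector into n_heads independent slices.
heads_in = x.reshape(5, n_heads, head_dim)   # [5, 4, 32]

# (each head would run its own Q/K/V attention on its [5, 32] slice here)
heads_out = heads_in                         # placeholder: identity per head

# Concatenate the heads back into one dim-wide vector, then apply the
# output projection to merge the specialists' findings.
concat = heads_out.reshape(5, dim)           # [5, 128]
W_O = np.random.randn(dim, dim) * 0.02       # "c_proj" in GPT-2 naming
merged = concat @ W_O
assert merged.shape == (5, dim)
```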
4b. Head Dimension — The Key Tradeoff
More heads doesn't mean better. What matters is head_dim — how many dimensions each head works with. If head_dim is too small, each head can't form expressive enough Query/Key patterns to distinguish between concepts.
All three presets use 32 dims per head. This is intentional: in practice, head_dim = 32 is a widely used floor for expressive attention patterns. The difference is the number of parallel specialists: 4, 6, or 8 heads. More heads means the model can attend to more patterns simultaneously within a single layer.
4c. What Each Head Learns to Do
In practice, heads specialize during training. Here's what we'd typically see in a trained Q&A model:
| Head | Typical Specialization | What It Attends To |
|---|---|---|
| Head 0 | Topic detection | The noun after "What is" / "How do I" — the key discriminating word |
| Head 1 | Format tracking | The "Q:" and "A:" markers — knows where question ends and answer begins |
| Head 2 | Previous token | Always looks at the immediately preceding token — important for local coherence |
| Head 3 | Question type | "What is" vs "How do I" vs "Can I" — determines answer shape (definition vs command vs yes/no) |
| Head 4+ | Sub-topic / modifier | Secondary words like "not working" or "range" or "update" that modify the answer |
Why 4 heads (dim=128) is enough for this task: With ~130 unique answers and simple Q: format questions, 4 parallel attention patterns per layer are sufficient. The model needs to detect: (1) the topic word, (2) the question format, (3) local token context, and (4) answer style. More heads would help if the questions were more complex (multi-hop reasoning, long context), but for short Q&A lookup, 4 is adequate.
4d. The Q/K/V Weight Matrices — How Heads Are Implemented
Under the hood, each head has three small weight matrices:
WQ : dim × head_dim — projects input into a Query vector
WK : dim × head_dim — projects input into a Key vector
WV : dim × head_dim — projects input into a Value vector
In GPT-2, these are packed into a single matrix c_attn of size dim × (3 × dim) for efficiency. The output projection c_proj is dim × dim. So the total attention parameters per layer = 4 × dim².
| Model | dim | Attn params/layer (4 × dim²) | Size (INT8) |
|---|---|---|---|
| dim=128 | 128 | 65,536 | 64 KB |
| dim=192 | 192 | 147,456 | 144 KB |
| dim=256 | 256 | 262,144 | 256 KB |
Attention cost scales quadratically with dim — this is why wider models eat PSRAM so fast. Going from dim=128 to dim=256 quadruples the attention parameters per layer.
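The quadratic scaling is easy to verify (plain Python; the function name is ours, not from the firmware):

```python
# Attention parameters per layer: c_attn (dim x 3*dim) + c_proj (dim x dim) = 4*dim^2
def attn_params_per_layer(dim):
    return dim * (3 * dim) + dim * dim

assert attn_params_per_layer(128) == 65_536
assert attn_params_per_layer(192) == 147_456
assert attn_params_per_layer(256) == 262_144
# Doubling dim quadruples the cost:
assert attn_params_per_layer(256) == 4 * attn_params_per_layer(128)
```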
5. The Feed-Forward Network (FFN) — The Knowledge Store
After attention lets tokens communicate, the FFN processes each token independently. Think of it as a lookup table: the model checks what concept the token vector currently represents, then adjusts it toward the correct answer pattern.
Structure: Two linear transformations with a GELU activation in between:
input (dim) → expand (FFN inner) → GELU → contract (dim) → output (dim)
The expansion factor is critical. For FFN=768 with dim=128, the inner dimension is 6× the model width. This means:
- The first matrix (c_fc): dim × FFN = 128 × 768 = 98,304 parameters
- The second matrix (c_proj): FFN × dim = 768 × 128 = 98,304 parameters
- Total FFN parameters per layer: 2 × dim × FFN
5a. What the FFN Neurons Do
Each neuron in the FFN inner layer acts as a feature detector. It activates (fires) when the input vector matches a particular pattern. With 768 neurons, the model has 768 "slots" to recognize different concepts per layer.
The GELU activation is what makes the FFN non-linear. It lets neurons partially activate (not just on/off), creating smoother feature detection. A sparse activation pattern is typical: only ~10-30% of FFN neurons fire for any given input. This is how the model stores many patterns without them interfering with each other.
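A sketch of the FFN forward pass (NumPy with random illustrative weights; the GELU here is the tanh approximation GPT-2 uses):

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

dim, ffn = 128, 768
x = np.random.randn(5, dim)
W_fc = np.random.randn(dim, ffn) * 0.02     # c_fc: expand 128 -> 768
W_proj = np.random.randn(ffn, dim) * 0.02   # c_proj: contract 768 -> 128

h = gelu(x @ W_fc)          # [5, 768] - each inner neuron is a feature detector
out = h @ W_proj            # [5, 128] - collapse back to model width
assert h.shape == (5, ffn) and out.shape == (5, dim)
```

Negative pre-activations are squashed toward zero by GELU, which is what produces the sparse firing pattern described above.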
5b. FFN Size vs Number of Layers
Here's the tradeoff at dim=192 when reducing FFN to buy more layers:
| FFN | FFN params/layer | Total params/layer | Max layers | Feature slots/layer | Total feature slots |
|---|---|---|---|---|---|
| 768 | 294,912 | ~443K | 16 | 768 | 12,288 |
| 640 | 245,760 | ~393K | 18 | 640 | 11,520 |
| 512 | 196,608 | ~344K | 20 | 512 | 10,240 |
| 384 | 147,456 | ~295K | 24 | 384 | 9,216 |
Total feature slots shrink as you narrow the FFN, but you gain depth in exchange. For a task with ~130 unique answers, even 384 slots per layer is far more than enough. The routing depth is the bottleneck, not storage capacity.
6. Complete Token Journey — Step by Step
Let's trace exactly what happens to the prompt Q: What is WiFi?\nA: through a 22-layer, dim=128 model. Every number is the actual shape of the data at that point.
Just a string of characters. No math yet.
5 token IDs. The tokenizer was trained on your hardwareone_rich.txt, so "WiFi" and "Q:" are single tokens (they appear frequently). Shape: [5] integers.
Each token ID then selects a row of the 4096 × 128 embedding table, giving a [5, 128] matrix of token vectors.
Normalize each token vector to zero mean, unit variance. Learned scale+bias (128 params each). Prevents the signal from exploding or vanishing. Shape: [5, 128] (unchanged).
6a. All 5 token vectors are projected through c_attn (a 128×384 matrix) to produce Q, K, V. Shape of each: [5, 128].
6b. Split into 4 heads. Each head gets [5, 32] for its Q, K, V.
6c. Each head computes attention scores: Q × Kᵀ → a [5, 5] matrix (every token vs every token). A causal mask sets future positions to -infinity so token 3 can only see tokens 0-3, never token 4. Scores are divided by √32 for stability, then softmax'd to sum to 1.0.
6d. Scores × V → weighted sum of values. Shape: [5, 32] per head.
6e. Concatenate all 4 heads: [5, 128]. Project through c_proj (128×128). Output shape: [5, 128].
6f. ADD to residual stream (the original input from step 5). This is the "+" in the residual connection.
Normalize again before the FFN. Shape: [5, 128].
8a. Each token independently: multiply by c_fc (128×768) → [5, 768]. Add bias.
8b. Apply GELU activation. ~70% of the 768 values become near-zero. The remaining ~30% encode which "features" this token matched.
8c. Multiply by c_proj (768×128) → [5, 128]. This collapses the sparse activations back to model width, writing the result into the token vector.
8d. ADD to residual stream.
Each layer reads the updated residual stream and adds its own refinements. By layer 10-15, the token vectors no longer represent individual words — they encode abstract concepts like "this is a definitional question about a wireless protocol." By layer 20-21, the final token's vector has been shaped to produce the correct answer.
The shape NEVER changes: [5, 128] throughout all 22 layers.
One last normalization pass. Shape: [5, 128].
Multiply by the embedding table transposed (128×4096) → [4096] logits, one score per vocabulary entry.
Softmax converts the logits to probabilities. The token with the highest probability becomes the first output word. Suppose it's "WiFi" (token 156, probability 0.82).
Append token 156 to the sequence. The input is now [412, 87, 23, 156, 5, 156], shape [6].
Run the entire pipeline again (steps 3-10) to generate the second token. Repeat until end-of-sequence or token limit (80 tokens).
7. The KV Cache — Why Generation Doesn't Recompute Everything
Generating each new token would naively require running all previous tokens through the entire model again. The KV cache avoids this.
After computing the K and V vectors for each token in each layer, they're stored in memory. When generating the next token, only the NEW token needs to run through the model — it computes its own Q, then attends to the cached K/V from all previous tokens.
Cache size per layer = 2 (K+V) × seq_len × dim bytes (INT8)
| Model | Cache/layer | Total KV cache |
|---|---|---|
| dim=128, 22 layers | 32 KB | 704 KB |
| dim=192, 12 layers | 48 KB | 576 KB |
| dim=192, 20 layers | 48 KB | 960 KB |
| dim=256, 8 layers | 64 KB | 512 KB |
The KV cache is allocated in PSRAM alongside the model weights. A dim=192 model with 20 layers uses 960 KB for KV cache — this must fit within the ~8 MB budget along with the model weights.
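The cache formula can be checked directly (plain Python; the helper name is ours):

```python
# KV cache size (INT8): 2 vectors (K and V) x seq_len x dim bytes per layer
def kv_cache_bytes(dim, n_layers, seq_len=128):
    per_layer = 2 * seq_len * dim
    return per_layer, per_layer * n_layers

per_layer, total = kv_cache_bytes(dim=192, n_layers=20)
assert per_layer == 48 * 1024    # 48 KB per layer
assert total == 960 * 1024       # 960 KB total, as in the table above
```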
8. Head Count Comparison Across Architectures
dim=128 — 4 heads
dim=192 — 6 heads
dim=256 — 8 heads
*dim=192 shown with 20 layers (FFN=512 variant)
Total attention operations across the full model: The dim=128 model with 22 layers gets 88 total attention operations (4 heads × 22 layers). The dim=192/20-layer variant gets 120 (6 × 20). The dim=256 model only gets 64 (8 × 8). Even though dim=256 has more heads per layer, the lack of depth means it has fewer total attention opportunities — and it's the total that determines how well the model can build up complex question→answer routing.
9. Why Depth Wins for Q&A Routing
A useful mental model for what each group of layers does during Q&A:
| Layer group | Function | dim=128 (22 layers) | dim=192 (20 layers) | dim=256 (8 layers) |
|---|---|---|---|---|
| Encoding | Build word-level representations. "WiFi" becomes a concept, not just characters. | Layers 0-5 (6 layers) | Layers 0-4 (5 layers) | Layers 0-2 (3 layers) |
| Routing | Match question pattern to answer pattern. Suppress competing answers. The hardest job. | Layers 6-17 (12 layers) | Layers 5-15 (11 layers) | Layers 3-5 (3 layers) |
| Generation | Format the output tokens. Convert abstract answer representation into actual token predictions. | Layers 18-21 (4 layers) | Layers 16-19 (4 layers) | Layers 6-7 (2 layers) |
This is why the dim=256 model fails. It has only ~3 layers for the routing phase — the step where "What is WiFi?" needs to be separated from "What is BLE?" and "WiFi not connecting" and the 25 other entries that mention WiFi. With 3 layers, the model can't build enough discrimination, so it defaults to the most common answer in the topic cluster. The dim=128 model gets 12 routing layers and the dim=192 gets 11. That's the difference between correct and wrong answers.
10. Summary — What Controls What
| Parameter | Controls | Analogy |
|---|---|---|
| dim | Width of every vector. How many "dimensions" the model has to describe each token. Affects how well it can separate similar concepts. | Size of the whiteboard each token carries |
| n_heads | How many independent attention patterns per layer. Each head is a "specialist" that can focus on a different relationship. | Number of people reading the sequence simultaneously |
| FFN (n_inner) | Width of the knowledge store. How many feature patterns can be detected per layer. Where factual associations are stored. | Size of the filing cabinet at each processing step |
| n_layers | Depth of processing. How many sequential refinement steps the model gets. Each layer builds on all previous layers' work. | Number of rounds of review before the final answer |
| seq_len | Maximum context window. How many tokens can be in the prompt + answer combined. Fixed at 128 for all our presets. | Length of the desk — how many papers fit side by side |
| vocab_size | Size of the token dictionary. Larger vocab = fewer tokens per sentence (less splitting), but bigger embedding table. | Size of the dictionary the model knows words from |
| head_dim | dim ÷ n_heads. How "smart" each individual attention head is. Below 32, heads struggle to form useful patterns. | IQ of each individual reader |
11. Causal Masking — Why the Model Can't Cheat
During attention, each token can only attend to tokens at the same or earlier positions. This is enforced by a causal mask — a triangular matrix that sets all "future" positions to negative infinity before softmax.
This is why the "A:" token is the most important position. It's the only token that can see the entire question. "Q:" can only see itself. "WiFi" can see everything before it but not "A:". The model must route ALL question information into the "A:" position through attention — this is the bottleneck that determines answer quality.
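A small NumPy sketch of the mask for the 5-token prompt (uniform scores are used so the effect of the mask alone is visible):

```python
import numpy as np

seq_len = 5                      # "Q:", "What", "is", "WiFi?", "A:"
scores = np.zeros((seq_len, seq_len))

# Triangular mask: position i may only attend to positions 0..i.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# After softmax, masked (future) positions get exactly zero weight.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
assert np.allclose(weights[0], [1, 0, 0, 0, 0])   # "Q:" sees only itself
assert np.allclose(weights[-1], [0.2] * 5)        # "A:" sees the whole question
```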
12. Generation — The Autoregressive Loop
The model generates one token at a time. Each token is appended to the sequence, then the model runs again. The KV cache avoids recomputing previous tokens.
At each step, only the newest token runs through the full model. Previous tokens' K and V are read from cache. This makes generation O(n) per token instead of O(n²) — critical on the ESP32-S3 where every millisecond counts.
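The loop can be sketched with a toy stand-in for the model (the cache dict is a placeholder for per-layer K/V storage; a real step function would append the new token's K and V to it):

```python
import numpy as np

def generate(model_step, prompt_ids, max_tokens=80, eos_id=0):
    # Autoregressive loop: feed one NEW token per step; the model would read
    # cached K/V for all earlier tokens instead of recomputing them.
    ids = list(prompt_ids)
    cache = {}                                # per-layer K/V tensors live here
    for tok in ids:                           # prefill: build cache from the prompt
        logits = model_step(tok, cache)
    while len(ids) < max_tokens:
        next_id = int(np.argmax(logits))      # greedy: highest-probability token
        if next_id == eos_id:
            break
        ids.append(next_id)
        logits = model_step(next_id, cache)   # only the new token runs the model
    return ids

# Toy "model": always predicts (token + 1) mod 10; predicting 0 acts as EOS.
toy = lambda tok, cache: np.eye(10)[(tok + 1) % 10]
assert generate(toy, [4], max_tokens=8) == [4, 5, 6, 7, 8, 9]
```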
Companion to architecture_comparison.html