ESP32-S3 LLM Architecture Comparison

Same 8 MB PSRAM budget, same training data, different width/depth tradeoffs

[Architecture diagram: three model stacks compared side by side.]

dim=128 (HW1HelpAgent_slim): 22 layers, 4 heads
  Token embedding 4096 × 128 → 22 × [attention (4 heads) + FFN 128→720→128] → LM head 128 → vocab

dim=192 (HW1HelpAgent192): 12 layers, 6 heads
  Token embedding 4096 × 192 → 12 × [attention (6 heads) + FFN 192→768→192] → LM head 192 → vocab

dim=256 (HW1HelpAgent256): 8 layers, 8 heads
  Token embedding 4096 × 256 → 8 × [attention (8 heads) + FFN 256→768→256] → LM head 256 → vocab
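
Expressed as data, the three stacks differ only in a handful of hyperparameters. A minimal sketch (the struct and field names are illustrative, not from any particular inference runtime):

```c
#include <stdio.h>

/* Hyperparameters of the three presets shown above. */
typedef struct {
    const char *name;
    int dim;       /* embedding width           */
    int n_layers;  /* transformer layers        */
    int n_heads;   /* attention heads per layer */
    int ffn_inner; /* FFN hidden width          */
    int vocab;     /* vocabulary size           */
} llm_config_t;

int main(void) {
    const llm_config_t presets[] = {
        { "HW1HelpAgent_slim", 128, 22, 4, 720, 4096 },
        { "HW1HelpAgent192",   192, 12, 6, 768, 4096 },
        { "HW1HelpAgent256",   256,  8, 8, 768, 4096 },
    };
    for (int i = 0; i < 3; i++) {
        const llm_config_t *c = &presets[i];
        /* head_dim = dim / n_heads = 32 for every preset. */
        printf("%s: embed [%d x %d] -> %d x [attn: %d heads x %dd,"
               " FFN %d->%d->%d] -> LM head [%d -> vocab]\n",
               c->name, c->vocab, c->dim, c->n_layers,
               c->n_heads, c->dim / c->n_heads,
               c->dim, c->ffn_inner, c->dim, c->dim);
    }
    return 0;
}
```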

What "dim" means: each token is a vector of N numbers

dim=128: 128 dimensions to encode meaning (the baseline)
dim=192: 192 dimensions, 2.25× more room ((192/128)² = 2.25)
dim=256: 256 dimensions, 4× more room to separate concepts ((256/128)² = 4)

How dim affects topic separation

More dimensions = more room to keep similar concepts apart

[Embedding-space sketch: the same eight topics (WiFi, MQTT, ESP-NOW, BLE, BME, OLED, IMU, GPS) plotted at each width.]

dim=128 (cramped): topics overlap → wrong answers
dim=192 (breathing room): better separation
dim=256 (well separated): clear boundaries → right answers
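
Why width helps here: embeddings of unrelated topics interfere less as the dimension grows, because random directions in higher-dimensional space are closer to orthogonal. A quick illustrative check (synthetic random vectors, not any real model's embeddings):

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Mean |cosine similarity| between random vector pairs of a given
 * width. It shrinks roughly like 1/sqrt(dim): wider embeddings give
 * unrelated concepts more room to stay near-orthogonal. */
static double mean_abs_cos(int dim, int pairs) {
    double total = 0.0;
    for (int p = 0; p < pairs; p++) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < dim; i++) {
            double a = (double)rand() / RAND_MAX - 0.5;
            double b = (double)rand() / RAND_MAX - 0.5;
            dot += a * b;
            na  += a * a;
            nb  += b * b;
        }
        total += fabs(dot) / sqrt(na * nb);
    }
    return total / pairs;
}

int main(void) {
    srand(42);
    const int dims[] = { 128, 192, 256 };
    for (int i = 0; i < 3; i++)
        printf("dim=%d: mean |cos| = %.3f (1/sqrt(dim) = %.3f)\n",
               dims[i], mean_abs_cos(dims[i], 2000),
               1.0 / sqrt((double)dims[i]));
    return 0;
}
```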

Inside one transformer layer

dim=128 layer

Attention QKV: 128×384
Attention Out: 128×128
Heads: 4 × 32d
FFN Up: 128→720
FFN Down: 720→128
Weights/layer: ~250 KB
KV cache/layer: 64 KB
× 22 layers: ~5500 KB weights + 1408 KB KV

dim=192 layer

Attention QKV: 192×576
Attention Out: 192×192
Heads: 6 × 32d
FFN Up: 192→768
FFN Down: 768→192
Weights/layer: ~442 KB
KV cache/layer: 96 KB
× 12 layers: ~5308 KB weights + 1152 KB KV

dim=256 layer

Attention QKV: 256×768
Attention Out: 256×256
Heads: 8 × 32d
FFN Up: 256→768
FFN Down: 768→256
Weights/layer: ~655 KB
KV cache/layer: 128 KB
× 8 layers: ~5243 KB weights + 1024 KB KV
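
The per-layer numbers above follow directly from the matrix shapes, as this small sketch reproduces. One hedge: the KV-cache figures assume a 256-token context with 8-bit K and V entries; the source does not state the context length, but that assumption reproduces the 64/96/128 KB values. Output matches the tables up to rounding (weights in decimal KB, KV in KiB).

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    const char *name;
    int dim, n_layers, ffn_inner;
} llm_config_t;

/* INT8 weight bytes in one layer: QKV projection (dim x 3*dim),
 * output projection (dim x dim), FFN up and down (dim x ffn each). */
static uint32_t layer_weight_bytes(const llm_config_t *c) {
    return (uint32_t)(c->dim * 3 * c->dim           /* attention QKV */
                      + c->dim * c->dim             /* attention out */
                      + 2 * c->dim * c->ffn_inner); /* FFN up + down */
}

/* KV cache bytes in one layer, ASSUMING 256 cached tokens with
 * 8-bit K and V entries (the context length is an assumption here). */
static uint32_t layer_kv_bytes(const llm_config_t *c) {
    return 2u * 256u * (uint32_t)c->dim;
}

int main(void) {
    const llm_config_t presets[] = {
        { "HW1HelpAgent_slim", 128, 22, 720 },
        { "HW1HelpAgent192",   192, 12, 768 },
        { "HW1HelpAgent256",   256,  8, 768 },
    };
    for (int i = 0; i < 3; i++) {
        const llm_config_t *c = &presets[i];
        printf("%-18s %2d layers: weights %4u KB, KV cache %4u KB\n",
               c->name, c->n_layers,
               layer_weight_bytes(c) * (uint32_t)c->n_layers / 1000u,
               layer_kv_bytes(c) * (uint32_t)c->n_layers / 1024u);
    }
    return 0;
}
```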

Full comparison

                  dim=128            dim=192          dim=256
Preset            HW1HelpAgent_slim  HW1HelpAgent192  HW1HelpAgent256
Embedding dim     128                192              256
Layers            22                 12               8
Attention heads   4                  6                8
Head dim          32                 32               32
FFN inner         720                768              768
Parameters        ~5.8M              ~6.1M            ~6.3M
INT8 weights      ~6092 KB           ~6142 KB         ~6336 KB
KV cache          1408 KB            1152 KB          1024 KB
Total PSRAM       ~7822 KB           ~7654 KB         ~7640 KB
Repr. capacity    1×                 2.25×            4×
Processing depth  22 passes          12 passes        8 passes
Best for          Pattern matching   Balanced         Topic separation
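
All three presets land under the 8 MB PSRAM budget, with the remainder left for activations and scratch buffers. On the ESP32-S3, both the weights and the KV cache have to live in external PSRAM, since internal SRAM (512 KB) cannot hold either one. A minimal ESP-IDF sketch of reserving the dim=256 budgets, using the standard heap_caps allocation API:

```c
#include <stdint.h>
#include <stdio.h>
#include "esp_heap_caps.h"

/* Budgets for HW1HelpAgent256, taken from the table above. */
#define WEIGHT_BYTES (6336u * 1000u)  /* ~6336 KB INT8 weights */
#define KV_BYTES     (1024u * 1024u)  /* 1024 KB KV cache      */

void app_main(void) {
    /* MALLOC_CAP_SPIRAM forces the allocation into external PSRAM. */
    uint8_t *weights = heap_caps_malloc(WEIGHT_BYTES, MALLOC_CAP_SPIRAM);
    uint8_t *kv      = heap_caps_malloc(KV_BYTES,     MALLOC_CAP_SPIRAM);

    if (weights == NULL || kv == NULL) {
        printf("PSRAM allocation failed\n");
        return;
    }
    printf("PSRAM still free: %u bytes\n",
           (unsigned)heap_caps_get_free_size(MALLOC_CAP_SPIRAM));
}
```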