Encoder understands, Decoder speaks, Attention connects meaning, FFN refines, Softmax chooses words.
https://www.udemy.com/course/prompt-engineering-for-everyone-bootcamp/
0️⃣ High-level structure (what this diagram shows)
Input → Encoder stack → Context representations
Target (shifted) → Decoder stack → Output probabilities
- Left block → Encoder (understands input)
- Right block → Decoder (generates output)
- Nx → same block repeated N times (usually 6–48+)
1️⃣ Tokenization (implicit, before diagram)
Before anything:
"I love transformers"
→ ["I", "love", "transformers"]
→ [101, 203, 503]
💡 Transformers do not read text, they read token IDs.
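A toy sketch of this step (the vocabulary and IDs below are made up; real models use subword tokenizers such as BPE or WordPiece):

```python
# Toy word-level tokenizer: map each word to an integer ID.
# Real models use subword tokenizers (BPE, WordPiece), not whole words.
vocab = {"<pad>": 0, "I": 101, "love": 203, "transformers": 503}

def tokenize(text: str) -> list[int]:
    return [vocab[word] for word in text.split()]

print(tokenize("I love transformers"))  # [101, 203, 503]
```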
2️⃣ Embedding Layer (Encoder & Decoder)
What it does
- Converts token IDs → dense vectors
- Example:
503 → [0.12, -0.88, 0.34, ...]
Why it matters
- Captures semantic meaning
- Same idea used in vector databases (RAG)
📌 In the diagram:
- Input Embedding → Encoder side
- Output Embedding → Decoder side
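A minimal sketch in PyTorch (the vocabulary size and embedding dimension here are illustrative; each model picks its own):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512            # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[101, 203, 503]])  # (batch=1, seq_len=3)
vectors = embedding(token_ids)               # (1, 3, 512) dense vectors
print(vectors.shape)
```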
3️⃣ Positional Encoding
Problem it solves
Transformers process all tokens in parallel, so they have no built-in sense of word order.
Solution
Add position information:
Final embedding = token_embedding + positional_encoding
Types
- Sinusoidal (original paper)
- Learned (GPT-style models)
📌 Shown at bottom of encoder & decoder inputs.
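A sketch of the sinusoidal variant, assuming PyTorch (the learned variant is just another embedding table indexed by position):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Input to the first layer: token embedding + positional encoding.
# x = embedding(token_ids) + sinusoidal_positional_encoding(seq_len, d_model)
```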
4️⃣ Encoder Stack (repeated Nx times)
Each Encoder block has 2 main sub-layers, each followed by Add & Norm:
4.1️⃣ Multi-Head Self-Attention (Encoder)
What happens
Each token looks at all other tokens in the input sentence.
Example:
"I love transformers"
→ "love" attends strongly to "transformers"Internals (simplified)
- Query (Q), Key (K), Value (V)
- Attention score = similarity(Q, K)
- Weighted sum of V
Multi-Head?
- Multiple attention heads = multiple perspectives:
  - Syntax
  - Semantics
  - Relationships
📌 This block is NOT masked in encoder.
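A sketch of scaled dot-product attention for a single head, assuming PyTorch; multi-head attention runs several of these in parallel on smaller projections and concatenates the results:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5     # similarity(Q, K), scaled
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)               # attention weights
    return weights @ V                                # weighted sum of V

# In the encoder, Q, K, V all come from the same input sequence and no mask is used.
```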
4.2️⃣ Add & Norm (Residual Connection + LayerNorm)
After attention:
output = LayerNorm(input + attention_output)
Why?
- Stabilizes training
- Prevents vanishing gradients
- Allows very deep stacks
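As a tiny PyTorch sketch (post-norm, as in the original paper; `AddNorm` is just an illustrative name):

```python
import torch.nn as nn

class AddNorm(nn.Module):
    """Residual connection followed by LayerNorm (post-norm variant)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer_output):
        return self.norm(x + sublayer_output)
```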
4.3️⃣ Feed-Forward Network (FFN)
What it does
- Applies a non-linear transformation
- Processes each token independently
Formula:
FFN(x) = max(0, xW1 + b1)W2 + b2
💡 Important:
FFN does NOT mix tokens — attention already did that.
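The same formula as a PyTorch sketch (d_model = 512 and d_ff = 2048 are the sizes from the original paper):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every token position independently."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),                   # the max(0, ·) part
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):                # x: (batch, seq_len, d_model)
        return self.net(x)
```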
4.4️⃣ Add & Norm (again)
Same residual logic after FFN.
5️⃣ Encoder Output
After N encoder layers:
- Each token has a context-aware representation
- This output is passed to every decoder layer
Think of it as:
"Fully understood input sentence"
6️⃣ Decoder Input (Shifted Right)
Example:
Target sentence: "Je suis ici"
Decoder input: "<start> Je suis"
Why shifted?
- Prevents cheating
- Forces next-token prediction
📌 This is why decoding is autoregressive.
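A sketch of how the shift is built during training (token strings instead of IDs, purely for readability):

```python
target = ["Je", "suis", "ici"]

# Shift right: prepend <start> and drop the last token.
decoder_input = ["<start>"] + target[:-1]   # ["<start>", "Je", "suis"]
labels = target                             # ["Je", "suis", "ici"], predicted one step ahead
```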
7️⃣ Decoder Stack (repeated Nx times)
Each decoder block has 3 sub-layers (the key difference from the encoder):
7.1️⃣ Masked Multi-Head Self-Attention
Masked means:
- Token can see past tokens only
- Cannot see future words
Example:
"<start> Je"
→ cannot see "suis"
📌 This enables text generation.
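The mask itself is just a lower-triangular matrix passed into the same attention computation sketched earlier, for example:

```python
import torch

seq_len = 3
# 1 = allowed to attend, 0 = blocked (future position)
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)
# tensor([[1., 0., 0.],
#         [1., 1., 0.],
#         [1., 1., 1.]])
# Passed as `mask` to the attention function: future scores become -inf,
# so their softmax weight is 0.
```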
7.2️⃣ Encoder-Decoder Attention (Cross-Attention)
This is the bridge between encoder & decoder.
- Queries → decoder
- Keys & Values → encoder output
Meaning:
"While generating, look back at the input sentence."
This is why translation works:
"I am here" → "Je suis ici"7.3️⃣ Feed-Forward Network (same as encoder)
7.3️⃣ Feed-Forward Network (same as encoder)
Same FFN logic, applied per token.
7.4️⃣ Add & Norm everywhere
Every sub-layer:
input + output → LayerNorm
8️⃣ Linear Layer
- Maps decoder hidden state → vocabulary size
- Example:
[0.21, -0.9, 1.3] → logits for 50k words
9️⃣ Softmax
- Converts logits → probabilities
- Highest probability token = next token
Example:
P("ici") = 0.72
P("là") = 0.181️⃣0️⃣ Output Probabilities → Token Generation
- Pick token (greedy / sampling / beam search)
- Append to input
- Repeat loop
This is how LLMs generate text token by token.
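A sketch of the whole loop (steps 8-10) with a random stand-in for the decoder stack; `fake_decoder` and all sizes are made up for illustration, and greedy argmax is used where sampling or beam search could go:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512            # illustrative sizes
to_logits = nn.Linear(d_model, vocab_size)   # 8️⃣ Linear: hidden state -> vocabulary logits

def fake_decoder(token_ids):                 # stand-in for the real decoder stack
    return torch.randn(1, len(token_ids), d_model)

tokens = [0]                                 # start with <start> (id 0, illustrative)
for _ in range(5):                           # generate 5 tokens
    hidden = fake_decoder(tokens)            # (1, seq_len, d_model)
    logits = to_logits(hidden[:, -1, :])     # only the last position predicts the next token
    probs = torch.softmax(logits, dim=-1)    # 9️⃣ Softmax: logits -> probabilities
    next_token = int(probs.argmax(dim=-1))   # 1️⃣0️⃣ greedy pick of the highest-probability token
    tokens.append(next_token)                # append and repeat

print(tokens)
```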
🔥 Encoder vs Decoder Summary (Interview Gold)
| Component | Encoder | Decoder |
| --------------- | ------------ | ---------- |
| Self-Attention | ✅ (unmasked) | ✅ (masked) |
| Cross-Attention | ❌ | ✅ |
| FFN | ✅ | ✅ |
| Generation      | ❌            | ✅          |
🧠 Why GPT-style LLMs Don't Use Encoder
GPT removes:
- Encoder
- Cross-attention
Keeps:
- Decoder-only stack
- Masked self-attention
- Next-token prediction