Encoder understands, Decoder speaks, Attention connects meaning, FFN refines, Softmax chooses words.

https://www.udemy.com/course/prompt-engineering-for-everyone-bootcamp/

0️⃣ High-level structure (what this diagram shows)

Input → Encoder stack → Context representations
Target (shifted) → Decoder stack → Output probabilities
  • Left block → Encoder (understands input)
  • Right block → Decoder (generates output)
  • Nx → the same block repeated N times (N = 6 in the original paper; modern models stack many more)

1️⃣ Tokenization (implicit, before diagram)

Before anything:

"I love transformers"
→ ["I", "love", "transformers"]
→ [101, 203, 503]

💡 Transformers do not read text, they read token IDs.
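
A minimal sketch of this step with a toy whole-word vocabulary (the IDs above are illustrative, not from a real tokenizer; real models use learned subword tokenizers such as BPE or WordPiece):

```python
# Toy tokenizer: whole words mapped to made-up IDs, for illustration only.
toy_vocab = {"I": 101, "love": 203, "transformers": 503}

def tokenize(text: str) -> list[int]:
    """Split on whitespace and map each token to its ID."""
    return [toy_vocab[token] for token in text.split()]

print(tokenize("I love transformers"))  # [101, 203, 503]
```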

2️⃣ Embedding Layer (Encoder & Decoder)

What it does

  • Converts token IDs → dense vectors
  • Example: 503 → [0.12, -0.88, 0.34, ...]

Why it matters

  • Captures semantic meaning
  • Same idea used in vector databases (RAG)

📌 In the diagram:

  • Input Embedding → Encoder side
  • Output Embedding → Decoder side
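
A minimal sketch of the lookup, with illustrative sizes (vocab_size = 1000, d_model = 4); in a real model the table is a learned weight matrix, not random numbers:

```python
import numpy as np

# An embedding layer is just a lookup table of shape (vocab_size, d_model).
rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 4
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([101, 203, 503])          # "I love transformers"
token_embeddings = embedding_table[token_ids]  # one dense vector per token
print(token_embeddings.shape)                  # (3, 4)
```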

3️⃣ Positional Encoding

Problem it solves

Transformers process tokens in parallel, so they don't know order.

Solution

Add position information:

Final embedding = token_embedding + positional_encoding

Types

  • Sinusoidal (original paper)
  • Learned (GPT-style models)

📌 Shown at bottom of encoder & decoder inputs.
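
A minimal NumPy sketch of the sinusoidal variant from the original paper (assumes an even d_model; sizes are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    positions = np.arange(seq_len)[:, None]      # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]     # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

# Final input to the first layer: token_embedding + positional_encoding
print(sinusoidal_positional_encoding(seq_len=3, d_model=4))
```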

4️⃣ Encoder Stack (repeated Nx times)

Each Encoder block has 2 sub-layers:

4.1️⃣ Multi-Head Self-Attention (Encoder)

What happens

Each token looks at every token in the input sentence, including itself.

Example:

"I love transformers"
→ "love" attends strongly to "transformers"

Internals (simplified)

  • Query (Q), Key (K), Value (V)
  • Attention score = similarity(Q, K)
  • Weighted sum of V

Multi-Head?

  • Multiple attention heads = multiple perspectives: syntax, semantics, relationships

📌 This block is NOT masked in the encoder.
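
A minimal NumPy sketch of one attention head using the paper's scaled dot-product formula, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. The projection matrices here are random placeholders; multi-head attention simply runs several such heads in parallel with their own projections and concatenates the results:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V — the core attention operation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # similarity of every query with every key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)      # blocked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over the keys
    return weights @ V                             # weighted sum of the values

# Toy example: 3 tokens ("I love transformers"), d_model = 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (3, 4): one context-aware vector per token
```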

4.2️⃣ Add & Norm (Residual Connection + LayerNorm)

After attention:

output = LayerNorm(input + attention_output)

Why?

  • Stabilizes training
  • Prevents vanishing gradients
  • Allows very deep stacks
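
A minimal sketch of the residual + LayerNorm step (the learnable gain and bias of a real LayerNorm are omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token's vector to zero mean / unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(sublayer_input, sublayer_output):
    # Residual connection, then layer normalization.
    return layer_norm(sublayer_input + sublayer_output)
```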

4.3️⃣ Feed-Forward Network (FFN)

What it does

  • Applies a non-linear transformation
  • Processes each token independently

Formula:

FFN(x) = max(0, xW1 + b1)W2 + b2

💡 Important:

FFN does NOT mix tokens — attention already did that.
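
A minimal sketch of that formula with illustrative sizes (the original paper uses d_model = 512 and an inner dimension of 2048); note that each token's row is transformed on its own:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each token independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16                         # illustrative sizes
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(3, d_model))             # 3 tokens
print(feed_forward(x, W1, b1, W2, b2).shape)  # (3, 4): no mixing across rows
```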

4.4️⃣ Add & Norm (again)

Same residual logic after FFN.

5️⃣ Encoder Output

After N encoder layers:

  • Each token has a context-aware representation
  • This output is passed to every decoder layer

Think of it as:

"Fully understood input sentence"

6️⃣ Decoder Input (Shifted Right)

Example:

Target sentence: "Je suis ici"
Decoder input: "<start> Je suis"

Why shifted?

  • Prevents cheating
  • Forces next-token prediction

📌 This is why decoding is autoregressive.
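
A small sketch of the shift during training (teacher forcing), using token strings instead of IDs for readability:

```python
# Target sentence and the shifted decoder input built from it.
target = ["Je", "suis", "ici"]

decoder_input = ["<start>"] + target[:-1]   # ["<start>", "Je", "suis"]
labels = target                             # ["Je", "suis", "ici"]

# At position t the decoder sees decoder_input[0..t] and must predict labels[t],
# so it can never copy the very token it is supposed to produce.
```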

7️⃣ Decoder Stack (repeated Nx times)

Each decoder block has 3 sub-layers (the key difference from the encoder):

7.1️⃣ Masked Multi-Head Self-Attention

Masked means:

  • A token can attend to itself and earlier tokens only
  • It cannot see future tokens

Example:

"<start> Je"
→ cannot see "suis"

📌 This enables text generation.
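
A minimal sketch of the causal (look-ahead) mask; True means the position may be attended to, and the mask is applied inside the attention scores exactly as in the encoder sketch above:

```python
import numpy as np

# Lower-triangular mask for 4 decoder positions:
# True = allowed to attend, False = future position, blocked.
seq_len = 4
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(causal_mask)
# [[ True False False False]
#  [ True  True False False]
#  [ True  True  True False]
#  [ True  True  True  True]]

# Used as: scaled_dot_product_attention(Q, K, V, mask=causal_mask)
```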

7.2️⃣ Encoder-Decoder Attention (Cross-Attention)

This is the bridge between encoder & decoder.

  • Queries → decoder
  • Keys & Values → encoder output

Meaning:

"While generating, look back at the input sentence."

This is why translation works:

"I am here" → "Je suis ici"

7.3️⃣ Feed-Forward Network (same as encoder)

Same FFN logic, applied per token.

7.4️⃣ Add & Norm everywhere

Every sub-layer:

input + output → LayerNorm

8️⃣ Linear Layer

  • Maps each decoder hidden state → one logit per vocabulary word
  • Example: a hidden vector [0.21, -0.9, 1.3, ...] → logits for 50k words

9️⃣ Softmax

  • Converts logits → probabilities
  • Highest-probability token = next token (in greedy decoding)

Example:

P("ici") = 0.72
P("là") = 0.18

1️⃣0️⃣ Output Probabilities → Token Generation

  • Pick token (greedy / sampling / beam search)
  • Append to input
  • Repeat loop

This is how LLMs generate text token by token.
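
A pseudocode-level sketch of that loop; decode_step and sample are hypothetical helpers standing in for the model's forward pass and the chosen decoding strategy:

```python
def generate(prompt_ids, decode_step, sample, max_new_tokens=20, eos_id=2):
    """Autoregressive generation: predict, append, repeat."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = decode_step(tokens)   # forward pass over everything generated so far
        next_id = sample(probs)       # greedy, sampling, or beam search lives here
        tokens.append(next_id)        # append the new token and feed it back in
        if next_id == eos_id:         # stop at the (assumed) end-of-sequence ID
            break
    return tokens
```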

🔥 Encoder vs Decoder Summary (Interview Gold)

| Component       | Encoder      | Decoder    |
| --------------- | ------------ | ---------- |
| Self-Attention  | ✅ (unmasked) | ✅ (masked) |
| Cross-Attention | ❌            | ✅          |
| FFN             | ✅            | ✅          |
| Generation      | ❌            | ✅          |

🧠 Why GPT-style LLMs Don't Use Encoder

GPT removes:

  • Encoder
  • Cross-attention

Keeps:

  • Decoder-only stack
  • Masked self-attention
  • Next-token prediction