Encoder understands, Decoder speaks, Attention connects meaning, FFN refines, Softmax chooses words.

https://www.udemy.com/course/prompt-engineering-for-everyone-bootcamp/

0️⃣ High-level structure (what this diagram shows)

Input → Encoder stack → Context representations
Target (shifted) → Decoder stack → Output probabilities
  • Left block → Encoder (understands input)
  • Right block → Decoder (generates output)
  • Nx → the same block repeated N times (N = 6 in the original paper; modern models stack many more)

1️⃣ Tokenization (implicit, before diagram)

Before anything:

"I love transformers"
→ ["I", "love", "transformers"]
→ [101, 203, 503]

💡 Transformers do not read text, they read token IDs.
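
A minimal sketch of this step with a toy whole-word vocabulary (the IDs above are illustrative, not from a real tokenizer; real models use learned subword tokenizers such as BPE or WordPiece):

```python
# Toy tokenizer: whole words mapped to made-up IDs, for illustration only.
toy_vocab = {"I": 101, "love": 203, "transformers": 503}

def tokenize(text: str) -> list[int]:
    """Split on whitespace and map each token to its ID."""
    return [toy_vocab[token] for token in text.split()]

print(tokenize("I love transformers"))  # [101, 203, 503]
```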

2️⃣ Embedding Layer (Encoder & Decoder)

What it does

  • Converts token IDs → dense vectors
  • Example: 503 → [0.12, -0.88, 0.34, ...]

Why it matters

  • Captures semantic meaning
  • Same idea used in vector databases (RAG)

📌 In the diagram:

  • Input Embedding → Encoder side
  • Output Embedding → Decoder side
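
A minimal sketch of the lookup, with illustrative sizes (vocab_size = 1000, d_model = 4); in a real model the table is a learned weight matrix, not random numbers:

```python
import numpy as np

# An embedding layer is just a lookup table of shape (vocab_size, d_model).
rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 4
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([101, 203, 503])          # "I love transformers"
token_embeddings = embedding_table[token_ids]  # one dense vector per token
print(token_embeddings.shape)                  # (3, 4)
```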

3️⃣ Positional Encoding

Problem it solves

Transformers process tokens in parallel, so they don't know order.

Solution

Add position information:

Final embedding = token_embedding + positional_encoding

Types

  • Sinusoidal (original paper)
  • Learned (GPT-style models)

📌 Shown at bottom of encoder & decoder inputs.
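
A minimal NumPy sketch of the sinusoidal variant from the original paper (assumes an even d_model; sizes are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    positions = np.arange(seq_len)[:, None]      # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]     # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

# Final input to the first layer: token_embedding + positional_encoding
print(sinusoidal_positional_encoding(seq_len=3, d_model=4))
```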

4️⃣ Encoder Stack (repeated Nx times)

Each Encoder block has 2 sub-layers:

4.1️⃣ Multi-Head Self-Attention (Encoder)

What happens

Each token looks at every token in the input sentence, including itself.

Example:

"I love transformers"
→ "love" attends strongly to "transformers"

Internals (simplified)

  • Query (Q), Key (K), Value (V)
  • Attention score = similarity(Q, K)
  • Weighted sum of V

Multi-Head?

  • Multiple attention heads = multiple perspectives: syntax, semantics, relationships

📌 This block is NOT masked in the encoder.
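
A minimal NumPy sketch of one attention head using the paper's scaled dot-product formula, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. The projection matrices here are random placeholders; multi-head attention simply runs several such heads in parallel with their own projections and concatenates the results:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V — the core attention operation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # similarity of every query with every key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)      # blocked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over the keys
    return weights @ V                             # weighted sum of the values

# Toy example: 3 tokens ("I love transformers"), d_model = 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (3, 4): one context-aware vector per token
```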

4.2️⃣ Add & Norm (Residual Connection + LayerNorm)

After attention:

output = LayerNorm(input + attention_output)

Why?

  • Stabilizes training
  • Prevents vanishing gradients
  • Allows very deep stacks
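
A minimal sketch of the residual + LayerNorm step (the learnable gain and bias of a real LayerNorm are omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token's vector to zero mean / unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(sublayer_input, sublayer_output):
    # Residual connection, then layer normalization.
    return layer_norm(sublayer_input + sublayer_output)
```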

4.3️⃣ Feed-Forward Network (FFN)

What it does

  • Applies a non-linear transformation
  • Processes each token independently

Formula:

FFN(x) = max(0, xW1 + b1)W2 + b2

💡 Important:

FFN does NOT mix tokens — attention already did that.
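
A minimal sketch of that formula with illustrative sizes (the original paper uses d_model = 512 and an inner dimension of 2048); note that each token's row is transformed on its own:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each token independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16                         # illustrative sizes
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(3, d_model))             # 3 tokens
print(feed_forward(x, W1, b1, W2, b2).shape)  # (3, 4): no mixing across rows
```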

4.4️⃣ Add & Norm (again)

Same residual logic after FFN.

5️⃣ Encoder Output

After N encoder layers:

  • Each token has a context-aware representation
  • This output is passed to every decoder layer

Think of it as:

"Fully understood input sentence"

6️⃣ Decoder Input (Shifted Right)

Example:

Target sentence: "Je suis ici"
Decoder input: "<start> Je suis"

Why shifted?

  • Prevents cheating
  • Forces next-token prediction

📌 This is why decoding is autoregressive.
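
A small sketch of the shift during training (teacher forcing), using token strings instead of IDs for readability:

```python
# Target sentence and the shifted decoder input built from it.
target = ["Je", "suis", "ici"]

decoder_input = ["<start>"] + target[:-1]   # ["<start>", "Je", "suis"]
labels = target                             # ["Je", "suis", "ici"]

# At position t the decoder sees decoder_input[0..t] and must predict labels[t],
# so it can never copy the very token it is supposed to produce.
```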

7️⃣ Decoder Stack (repeated Nx times)

Each decoder block has 3 sub-layers (the key difference from the encoder):

7.1️⃣ Masked Multi-Head Self-Attention

Masked means:

  • A token can attend to itself and earlier tokens only
  • It cannot see future tokens

Example:

"<start> Je"
→ cannot see "suis"

📌 This enables text generation.
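
A minimal sketch of the causal (look-ahead) mask; True means the position may be attended to, and the mask is applied inside the attention scores exactly as in the encoder sketch above:

```python
import numpy as np

# Lower-triangular mask for 4 decoder positions:
# True = allowed to attend, False = future position, blocked.
seq_len = 4
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(causal_mask)
# [[ True False False False]
#  [ True  True False False]
#  [ True  True  True False]
#  [ True  True  True  True]]

# Used as: scaled_dot_product_attention(Q, K, V, mask=causal_mask)
```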

7.2️⃣ Encoder-Decoder Attention (Cross-Attention)

This is the bridge between encoder & decoder.

  • Queries → decoder
  • Keys & Values → encoder output

Meaning:

"While generating, look back at the input sentence."

This is why translation works:

"I am here" → "Je suis ici"

7.3️⃣ Feed-Forward Network (same as encoder)

Same FFN logic, applied per token.

7.4️⃣ Add & Norm everywhere

Every sub-layer:

input + output → LayerNorm

8️⃣ Linear Layer

  • Maps each decoder hidden state → one logit per vocabulary word
  • Example: a hidden vector [0.21, -0.9, 1.3, ...] → logits for 50k words

9️⃣ Softmax

  • Converts logits → probabilities
  • Highest-probability token = next token (in greedy decoding)

Example:

P("ici") = 0.72
P("là") = 0.18

1️⃣0️⃣ Output Probabilities → Token Generation

  • Pick token (greedy / sampling / beam search)
  • Append to input
  • Repeat loop

This is how LLMs generate text token by token.
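
A pseudocode-level sketch of that loop; decode_step and sample are hypothetical helpers standing in for the model's forward pass and the chosen decoding strategy:

```python
def generate(prompt_ids, decode_step, sample, max_new_tokens=20, eos_id=2):
    """Autoregressive generation: predict, append, repeat."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = decode_step(tokens)   # forward pass over everything generated so far
        next_id = sample(probs)       # greedy, sampling, or beam search lives here
        tokens.append(next_id)        # append the new token and feed it back in
        if next_id == eos_id:         # stop at the (assumed) end-of-sequence ID
            break
    return tokens
```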

🔥 Encoder vs Decoder Summary (Interview Gold)

| Component       | Encoder      | Decoder    |
| --------------- | ------------ | ---------- |
| Self-Attention  | ✅ (unmasked) | ✅ (masked) |
| Cross-Attention | ❌            | ✅          |
| FFN             | ✅            | ✅          |
| Generation      | ❌            | ✅          |

🧠 Why GPT-style LLMs Don't Use Encoder

GPT removes:

  • Encoder
  • Cross-attention

Keeps:

  • Decoder-only stack
  • Masked self-attention
  • Next-token prediction