1) Introduction
Text embeddings — vector representations of semantic meaning — are the backbone of modern search and RAG (Retrieval-Augmented Generation) systems. For years, the industry standard for these models was the BERT architecture. However, the release of LLM-based embedding series marks a significant shift, demonstrating that massive, decoder-only Large Language Models (LLMs) can be adapted to outperform dedicated encoders. In this blog, we take a close look at the Qwen3 [1] embedding model.
(Note: I used Gemini Pro Banana for the illustrations, and it does a far better job than a textual explanation!)
The Paradigm Shift: BERT (Encoder) vs. Qwen (Decoder)
To understand why Qwen3 is special, we must first look at how it differs from traditional embedding models.

1. Attention Mechanism (Bi-directional vs. Causal)
- BERT (Encoder-Only): Uses bi-directional attention. When processing the word "Apple" in a sentence, it can simultaneously "see" the words that come before and after it. This is naturally suited for understanding context but is computationally expensive to scale to long documents.
- Qwen (Decoder-Only): Uses causal attention (standard GPT style). Tokens can only attend to previous tokens. While historically considered a limitation for embeddings (as early tokens lack full sentence context), Qwen leverages the last token (EOS). Because of the causal mask, the final token has effectively aggregated the information of the entire sequence.
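The two attention patterns can be visualized as boolean masks. The NumPy sketch below is purely illustrative (not Qwen's implementation): a bi-directional mask lets every token see every other token, while a causal mask is lower-triangular, so only the final token sees the whole sequence.

```python
import numpy as np

seq_len = 4  # e.g., four tokens: "The", "apple", "is", "red"

# Bi-directional (BERT-style): every token attends to every other token.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

# Causal (GPT/Qwen-style): token i attends only to tokens 0..i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask.astype(int))
# The last row is all ones: the final (EOS) token attends to the entire
# sequence, which is why its hidden state can serve as the embedding.
```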
2. Context Window & Scaling
Decoder models like Qwen are highly efficient for large contexts because their training objective, Next Token Prediction, allows them to learn from every single token in the sequence. In contrast, BERT's Masked Language Modeling (MLM) requires processing the full context but only generates a learning signal for the small percentage of masked words. This inherent inefficiency makes training bidirectional models on massive sequences computationally prohibitive compared to decoders.
- BERT: Typically limited to 512 tokens. Scaling a BERT model to 7B+ parameters is inefficient and rare.
- Qwen: Built on a generative LLM backbone, it naturally supports massive context windows (32k to 128k tokens) and scales easily to 8B+ parameters, allowing it to "read" entire reports or code files in a single pass.

Real-World Comparison: Embedding a Product Review
Input: A 5,000-word review of the new iPhone with sections on:
- Design (words 1–800)
- Performance (words 801–1,600)
- Camera (words 1,601–2,400)
- Battery (words 2,401–3,200)
- Software (words 3,201–4,000)
- Conclusion (words 4,001–5,000)
BERT's approach:
Step 1: Read first 512 tokens (words 1-400) → Create embedding
Step 2: Read next 512 tokens (words 401-800) → Create embedding
Step 3: Read next 512 tokens (words 801-1,200) → Create embedding
...continues for ~13 chunks...
Step 13: Somehow combine all these embeddings → Final embedding
Problems:
❌ Each chunk lacks context from other chunks
❌ "Battery" section doesn't know what "Design" section said
❌ Combining 13 separate embeddings is clumsy
Qwen's approach:
Step 1: Process all 5,000 words sequentially
Step 2: Extract the final token's hidden state → Final embedding
Benefits:
✅ The final embedding has "read" the entire review
✅ Understands relationships across all sections
✅ Single coherent embedding, not a franken-combination
2) Model Details
2.1) Architecture
The Qwen3-Embedding model (e.g., Qwen/Qwen3-Embedding-8B) is a dense, decoder-only transformer based on the Qwen3 foundation. It features 36 transformer layers and an embedding dimension of 4096.
1. Last Token Pooling & Causal Masking: Unlike BERT models that rely on a specialized [CLS] token at the start of a sequence, Qwen leverages the nature of autoregressive modeling.
Mechanism: In a standard LLM forward pass, the hidden state of the final token (the End-of-Sequence or [EOS] token) contains the aggregated information of the entire preceding sequence due to the causal attention mask.
Advantages:
- Seamless Adaptation: This allows the embedding model to be trained exactly like a standard LLM, sharing weights and optimizations (like Flash Attention 2) with the generative family.
- Unified Deployment: Because the architecture is identical to the generative Qwen models, the same high-performance inference engines (like vLLM or TGI) can be used to serve embeddings without custom kernels or architecture-specific modifications.
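The pooling mechanism itself is a one-line index operation. Here is a minimal NumPy sketch assuming right padding for clarity (the official snippet in the appendix also handles left-padded batches):

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select each sequence's final non-padding hidden state.

    hidden_states: (batch, seq_len, dim) output of the transformer
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    lengths = attention_mask.sum(axis=1) - 1  # index of each last real token
    return hidden_states[np.arange(hidden_states.shape[0]), lengths]

# Toy batch: 2 sequences, 4 positions, hidden size 3.
h = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],    # 3 real tokens, 1 pad
                 [1, 1, 1, 1]])   # 4 real tokens
pooled = last_token_pool(h, mask)
print(pooled.shape)  # (2, 3): one embedding vector per sequence
```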
2. Matryoshka Representation Learning (MRL)
Storage costs for vector databases can be prohibitive with 4096-dimensional vectors. Qwen3 uses MRL [2], a technique that "nests" information by importance.
- How it works: During training, the loss function isn't just calculated on the full 4096 dimensions. It is simultaneously calculated on the first 512, 1024, and 2048 dimensions.
- The Result: This forces the model to pack the most critical semantic information into the earlier dimensions. At inference time, you can safely truncate the vector to 1024 dimensions to save 75% of your storage cost while retaining ~95% of the retrieval performance.
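At inference time, MRL truncation is just a slice followed by re-normalization (required so cosine similarity remains valid). The helper below is a hypothetical sketch, not an official API:

```python
import numpy as np

def truncate_and_renormalize(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions of an MRL-trained embedding and
    re-apply L2 normalization for cosine-similarity search."""
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

# Simulate a full 4096-dim, L2-normalized embedding.
rng = np.random.default_rng(0)
full = rng.normal(size=4096)
full /= np.linalg.norm(full)

small = truncate_and_renormalize(full, 1024)
print(small.shape)  # (1024,) -> 75% storage saving vs. 4096 dims
```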
3. Rotary Positional Embeddings (RoPE)
To handle long contexts (up to 32k tokens for the 8B model), Qwen3 utilizes RoPE. Unlike absolute position embeddings (which learn a fixed vector for "Position 1" vs "Position 500"), RoPE encodes position by rotating the vector in embedding space. This allows the model to generalize better to sequence lengths it wasn't explicitly trained on, maintaining retrieval accuracy across long legal contracts or research papers.
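The key property of RoPE can be shown in a few lines: because each pair of dimensions is rotated by a position-dependent angle, the dot product between two rotated vectors depends only on their relative offset. This is a simplified toy sketch (half-split pairing, no attention heads), not Qwen3's exact implementation:

```python
import numpy as np

def rope_rotate(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)  # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.array([1.0, 0.0, 1.0, 0.0])
k = np.array([1.0, 0.0, 1.0, 0.0])
# The score depends only on the relative offset (3), not absolute positions.
s1 = rope_rotate(q, 5) @ rope_rotate(k, 2)
s2 = rope_rotate(q, 105) @ rope_rotate(k, 102)
print(np.isclose(s1, s2))  # True
```

This relative-position property is what lets the model generalize to positions beyond those seen in training.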
For more information on RoPE, please check my previous post on this topic (linked in the references).
2.2) Training Pipeline
Qwen3's performance is largely attributed to a sophisticated, multi-stage training process that blends synthetic data with novel merging techniques.
Stage 1: Weakly-Supervised Pre-Training (Synthetic Scale)
The team bypassed the noise of web-scraped data by using a "teacher" model (Qwen3-32B) to synthesize a massive dataset.
Volume: Approximately 150 million synthetic query-document pairs.
Diversity: The generator was prompted to create data across 93 languages and diverse tasks, including:
1. Asymmetric Retrieval: "Write a query that finds this document."
2. Symmetric Semantic Similarity: "Write two sentences that mean the same thing but use different words."
3. Bitext Mining: Translation pairs for cross-lingual retrieval.
Stage 2: Supervised Fine-Tuning (SFT)
The model is then refined using a rigorous contrastive learning setup.
Data Sources: A mix of ~7 million labeled pairs from benchmarks like MS MARCO, HotpotQA, and NLI, combined with the best ~12 million synthetic pairs from Stage 1.
Input Tuple Structure: Training samples are organized not just as pairs, but as complex tuples: {Instruction, Query, Positive_Doc, Hard_Negative_1, ..., Hard_Negative_N}.
- Positive (d+): The ground-truth document that answers the query.
- Hard Negatives (d-): This is where Qwen3 excels. Instead of using random documents as negatives, the team mines "hard" negatives based on high lexical overlap (e.g., BM25 scores) that are semantically incorrect.
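As an illustration, hard-negative mining can be sketched with a crude lexical-overlap score standing in for BM25. The Jaccard helper below is hypothetical, not part of Qwen's actual pipeline:

```python
def lexical_overlap(a: str, b: str) -> float:
    """Crude BM25 stand-in: Jaccard overlap of lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

query = "What is the capital of France?"
positive = "Paris is the capital and largest city of France."
candidates = [
    "Lyon is the third-largest city in France.",   # high overlap, wrong answer
    "Photosynthesis converts light into energy.",  # low overlap, easy negative
]

# Hard negatives = non-answers that score highest on lexical overlap:
# they look relevant on the surface, forcing the model to learn semantics.
hard_negatives = sorted(candidates, key=lambda d: lexical_overlap(query, d),
                        reverse=True)
print(hard_negatives[0])  # the "Lyon" distractor ranks first
```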
Deep Dive: InfoNCE Loss Explained
At the heart of Stage 2 is the InfoNCE (Information Noise-Contrastive Estimation) loss function. This mathematical framework transforms the vague concept of "semantic similarity" into a concrete classification problem: for a given query, the model must identify the one correct document from a batch of impostors.
Imagine the model outputs a "score" based on the dot product of two vectors. This score represents how well the query aligns with the document in high-dimensional space.
- We want the alignment score for the Positive match to be as high as possible (e.g., pulling the vectors together).
- We want the alignment scores for all Negative matches to be as low as possible (e.g., pushing the vectors apart).
For e.g.,
Query (q): "What is the capital of France?"
Positive document (d+): "Paris is the capital and largest city of France."
Hard negatives (d-):
- d1: "Lyon is the third-largest city in France."
- d2: "France is a country in Western Europe."
- d3: "The French Revolution began in 1789."
Mathematically, this is expressed as a softmax function over similarity scores. The numerator represents the strength of the correct match, while the denominator represents the sum of the correct match plus all the incorrect matches (the noise). Minimizing this loss is equivalent to maximizing the ratio of "Signal" (numerator) to "Signal + Noise" (denominator):

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{e^{s(q, d^+)/\tau}}{e^{s(q, d^+)/\tau} + \sum_{i} e^{s(q, d_i^-)/\tau}}$$

where $s(\cdot,\cdot)$ is the similarity score (e.g., cosine similarity) and $\tau$ is a temperature hyperparameter.
The denominator, however, is more than a simple sum of errors. In robust implementations like Qwen3's, it represents the total mass in the vector space, constructed from five distinct terms:
- The Positive Match: Similarity to the correct document (the signal itself also appears once in the denominator).
- Hard Negatives: These are the specific "trap" documents (like the "Lyon" example) that force the model to distinguish nuance. They look correct (high lexical overlap) but are semantically wrong, providing the steepest learning gradient.
- In-Batch Negatives: Every other document ($d_j$) intended for other queries in the same batch acts as a "random" negative. This ensures the model pushes away generally irrelevant content.
- Query-to-Query Repulsion: The loss often includes terms to penalize if the current query (q1) is too similar to other queries (q2, q3) in the batch. This prevents "query collapse," ensuring distinct questions maintain distinct vector footprints.
- Doc-to-Doc Repulsion: Similarly, the positive document (d1) is pushed away from other positive documents (d2,d3) to ensure distinct concepts occupy distinct regions of the hypersphere.
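Putting the core pieces together, here is a minimal NumPy sketch of InfoNCE with in-batch negatives. The temperature value and helper name are assumptions; the production loss additionally includes the hard-negative and repulsion terms described above:

```python
import numpy as np

def info_nce(query_vecs: np.ndarray, doc_vecs: np.ndarray,
             temperature: float = 0.05) -> float:
    """InfoNCE with in-batch negatives: query i's positive is doc i;
    every other document in the batch acts as a negative."""
    # Similarity matrix; vectors are assumed L2-normalized.
    scores = query_vecs @ doc_vecs.T / temperature               # (B, B)
    # Log-softmax over each row, then take the diagonal (correct pairings).
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

aligned = info_nce(docs, docs)         # each query matches its positive
shuffled = info_nce(docs, docs[::-1])  # positives deliberately mismatched
print(aligned < shuffled)  # True: correct pairings yield a lower loss
```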
3) Conclusions
Qwen3-Embedding demonstrates the power of generative data synthesis: by leveraging a stronger "teacher" model to create high-quality training data and adopting a scalable decoder architecture, it removes the context limits of BERT while delivering state-of-the-art retrieval performance across multiple languages.
References
1. Qwen3 Technical Report
2. Matryoshka Representation Learning
3. RoPE Explanation: https://medium.com/@mandeep0405/llama-4s-architecture-deconstructed-moe-irope-and-early-fusion-explained-e58eb9403067
Appendix
- Qwen3 embedding usage with the Hugging Face Transformers package, from the official model card:
```python
# Requires transformers>=4.51.0
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-Embedding-8B', padding_side='left')
model = AutoModel.from_pretrained('Qwen/Qwen3-Embedding-8B')

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModel.from_pretrained('Qwen/Qwen3-Embedding-8B', attn_implementation="flash_attention_2", torch_dtype=torch.float16).cuda()

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict.to(model.device)
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
# [[0.7493016123771667, 0.0750647559762001], [0.08795969933271408, 0.6318399906158447]]
```