Deep Dive · Architecture Analysis

The Big LLM Architecture Comparison

From DeepSeek V3 to GLM-5 — a comprehensive look at how modern open-weight language models are structurally designed, what makes them tick, and where the field is heading.

Last updated March 2026 · Prateek Singh, PhD

It has been seven years since the original GPT architecture was developed. Looking back at GPT-2 (2019) and forward to DeepSeek V3 and Llama 4 (2024–2025), one might be surprised at how structurally similar these models still are at their core.

Positional embeddings evolved from absolute to rotational (RoPE). Multi-Head Attention largely gave way to Grouped-Query Attention. SwiGLU replaced GELU. But beneath these refinements, the foundational transformer decoder architecture remains largely intact — we are polishing, not rebuilding.

This blog focuses on the structural choices that define today's flagship open-weight models — not benchmark scores or training recipes. It covers what LLM developers are actually building in 2025.

Mixture-of-Experts

Sparse activation: activate only a handful of experts per token for massive capacity at low inference cost.

🗜️

KV Cache Compression

MLA, sliding window attention, and partial RoPE all target the memory bottleneck in long-context inference.

📐

Normalization Tricks

Post-Norm vs. Pre-Norm, QK-Norm, and sandwich norms — placement matters for training stability.

🌊

Linear Attention Revival

Gated DeltaNet, Mamba-2 hybrids, and lightning attention offer O(n) alternatives to quadratic attention.

01

DeepSeek V3 / R1

The model that changed everything in early 2025

671B params · 37B active · 128k context · MoE · Sparse

DeepSeek R1 made a massive impact when released in January 2025. It is a reasoning model built on top of the DeepSeek V3 architecture (December 2024), which introduced two key innovations that distinguish it from most other LLMs.

Multi-Head Latent Attention (MLA)

Instead of sharing key/value heads like GQA, MLA compresses the K/V tensors into a lower-dimensional latent space before storing them in the KV cache. At inference, they are projected back to full width. This adds a matrix multiplication but dramatically reduces cache memory, and ablations in the DeepSeek-V2 paper show MLA actually outperforming standard MHA on benchmarks.
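A minimal sketch of the compression idea in NumPy; the matrix names and toy dimensions below are illustrative, not DeepSeek's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq = 64, 16, 8   # hypothetical sizes; real models use far larger dims

# Down-projection compresses hidden states into a small latent; only this is cached.
W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
# Up-projections reconstruct full-width keys and values at attention time.
W_up_k = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)
W_up_v = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)

h = rng.normal(size=(seq, d_model))           # hidden states for 8 cached tokens
kv_cache = h @ W_down                         # (8, 16) -- what actually sits in memory
k, v = kv_cache @ W_up_k, kv_cache @ W_up_v   # (8, 64) each -- recomputed on the fly

# Cache footprint shrinks 8x here vs. storing full K and V tensors.
print(kv_cache.size, k.size + v.size)  # 128 vs 1024
```

The up-projection cost is the extra matrix multiplication the text mentions; in practice it can be folded into the query/output projections.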

Mixture-of-Experts (DeepSeekMoE)

Each FeedForward module is replaced by 256 small experts, with a router selecting 8 of them per token; together with 1 always-on shared expert, 9 experts are active per token. Total parameters: 671B. Active parameters per token: just 37B. The shared expert handles common patterns, freeing the routed experts for niche knowledge.
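The routing scheme can be sketched as follows; the expert count, dimensions, and softmax-over-top-k gating are simplified stand-ins for the real DeepSeekMoE router:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_experts, top_k = 32, 16, 4   # toy sizes; DeepSeek V3 uses 256 experts with top-8

# One tiny FFN (a single matrix here) per expert, plus one always-active shared expert.
experts = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]
shared = rng.normal(size=(d, d)) / np.sqrt(d)
router = rng.normal(size=(d, n_experts)) / np.sqrt(d)

x = rng.normal(size=(d,))                    # a single token's hidden state
logits = x @ router
idx = np.argsort(logits)[-top_k:]            # indices of the top-k routed experts
gates = np.exp(logits[idx]) / np.exp(logits[idx]).sum()  # softmax over the chosen k

# Output = shared expert + gated sum of the selected routed experts.
out = x @ shared + sum(g * (x @ experts[i]) for g, i in zip(gates, idx))
print(out.shape, idx.size)  # (32,) 4
```

Only `top_k + 1` of the expert matrices are ever touched for this token, which is exactly where the total-vs-active parameter gap comes from.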

DeepSeek V3 — Layer Stack (61 transformer blocks, listed top to bottom):

- Linear Output Layer
- Final RMSNorm
- ×61 transformer blocks, each containing:
  - RMSNorm 2 → MoE Layer (256 experts, 9 active; the first 3 blocks use a dense FFN instead)
  - RMSNorm 1 → Multi-Head Latent Attention (+ RoPE)
- Token Embedding Layer

DeepSeek V3 outperformed the 405B-parameter Llama 3 at launch. Its total parameter count is larger (671B), but only 37B parameters are active per token, so its inference cost is far lower. This efficiency/capacity tradeoff is the core insight powering the MoE revival of 2025.

02

OLMo 2

Fully transparent — the blueprint for reproducible LLM research

7B / 13B params · 4k context · Dense · MHA

The Allen Institute's OLMo series is celebrated for its full transparency: training data, code, and checkpoints are all released publicly. OLMo 2 sat at the Pareto frontier of compute-to-performance at the time of its January 2025 release.

The architecture is mostly standard — notably still using traditional Multi-Head Attention (MHA) rather than GQA. The interesting innovations are in normalization placement.

Post-Norm (instead of Pre-Norm)

Most LLMs since GPT-2 place norms before each sub-layer (Pre-Norm). OLMo 2 places them after each sub-layer instead, inside the residual connections. Combined with QK-Norm (an RMSNorm applied to queries and keys inside attention), this yields smoother gradient norms and fewer training loss spikes.

💡

QK-Norm applies an additional RMSNorm to the query and key vectors before RoPE is applied. It stabilizes the dot-product magnitudes that drive attention scores. Not new to OLMo — it dates back to the 2023 Scaling Vision Transformers paper — but it's becoming a standard add-on across the field.
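A toy illustration of the effect, assuming a simplified RMSNorm without learned scales:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned scale, for brevity; real layers include one.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(2)
q = 50.0 * rng.normal(size=(4, 8))   # deliberately large-magnitude queries
k = 50.0 * rng.normal(size=(4, 8))

raw = (q @ k.T) / np.sqrt(8)                         # huge logits -> saturated softmax
normed = (rms_norm(q) @ rms_norm(k).T) / np.sqrt(8)  # bounded logits after QK-Norm

print(np.abs(raw).max() > np.abs(normed).max())  # True
```

After normalization each query/key row has fixed norm sqrt(d), so the scaled dot products are bounded regardless of how activations drift during training.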

03

Gemma 3

Google's underappreciated gem with a clever efficiency trick

1B – 27B · 128k context · Dense · Sliding Window

While others adopted MoE for efficiency, Gemma 3 went a different route: Sliding Window Attention. Instead of every token attending to all others (O(n²)), local attention layers restrict each token's view to a fixed nearby window.

Sliding Window Attention — 5:1 Ratio

Gemma 2 used a 1:1 ratio of local to global attention. Gemma 3 pushes this to 5:1 — five sliding-window layers for every one full-attention layer — and shrinks the window from 4096 (Gemma 2) to just 1024 tokens. Their ablation studies show minimal impact on output quality but massive KV cache memory savings, especially at long context lengths.
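The masking pattern can be sketched directly; the window sizes below are toy values standing in for Gemma 3's 1024-token window:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # Causal mask where each query sees at most `window` most recent tokens.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

full = sliding_window_mask(16, 16)   # ordinary causal attention
local = sliding_window_mask(16, 4)   # window of 4, a stand-in for Gemma 3's 1024

# Attended positions (and hence KV entries a local layer must keep) shrink
# from O(n^2) toward O(n * window).
print(full.sum(), local.sum())  # 136 vs 58
```

With a 5:1 layer ratio, five out of every six layers pay only the local cost, which is where the long-context KV savings come from.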

Pre + Post RMSNorm (Sandwich Norm)

Gemma 3 applies RMSNorm both before and after each attention and feed-forward sub-layer. This "best of both worlds" approach costs very little compute (RMSNorm is cheap) but may improve training stability by combining the benefits of Pre-Norm gradient behavior with the output normalization of Post-Norm.
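A sketch of the block layout, with a tanh standing in for the attention or FFN sub-layer and a scale-free RMSNorm for brevity:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sublayer(x):
    # Stand-in for attention or the FFN; any function of x works here.
    return np.tanh(x)

def sandwich_block(x):
    # Gemma 3 style: normalize both the sub-layer's input and its output,
    # while the residual stream itself passes through untouched.
    return x + rms_norm(sublayer(rms_norm(x)))

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 16))
y = sandwich_block(x)
print(y.shape)  # (4, 16)
```

Note the residual path never goes through a norm, preserving the Pre-Norm gradient highway while still bounding what each sub-layer adds back.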

04

Llama 4 Maverick

Meta's MoE debut with a classic expert design philosophy

400B params · 17B active · 512k context · MoE · GQA

Llama 4 Maverick closely mirrors the DeepSeek V3 architecture — MoE layers, SwiGLU FeedForward, RMSNorm — but with two notable differences:

GQA instead of MLA: Meta kept Grouped-Query Attention rather than adopting DeepSeek's more complex Multi-Head Latent Attention, likely prioritizing implementation simplicity.

Fewer, larger experts: Llama 4 uses a more classic MoE configuration — 2 active experts with hidden size 8,192 each — compared to DeepSeek's 9 active experts with hidden size 2,048. The result: fewer than half the active parameters (17B vs 37B) at roughly 60% of the total size (400B vs 671B).
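A back-of-the-envelope comparison of fine-grained versus coarse-grained expert budgets. The hidden sizes match the text, but the d_model and expert counts below are illustrative, not the real model configs:

```python
def moe_ffn_params(d_model, d_hidden, n_experts, n_active, n_shared=0):
    # A SwiGLU FFN has three weight matrices: up, gate, and down projections.
    per_expert = 3 * d_model * d_hidden
    total = (n_experts + n_shared) * per_expert
    active = (n_active + n_shared) * per_expert
    return total, active

# Fine-grained (DeepSeek-like): many small experts, several active.
fine = moe_ffn_params(d_model=4096, d_hidden=2048, n_experts=256, n_active=8, n_shared=1)
# Coarse-grained (Llama-4-like): few big experts, two active.
coarse = moe_ffn_params(d_model=4096, d_hidden=8192, n_experts=64, n_active=2)

print(fine, coarse)
```

With these toy numbers the two designs land on similar total and active FFN budgets per layer; the difference is how finely the capacity is sliced, which affects routing quality and specialization.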

🔁

Llama 4 also alternates dense and MoE blocks, applying MoE only in every other transformer layer rather than in all of them. This differs from DeepSeek V3, which uses MoE in every layer except the first 3. The performance-versus-efficiency impact of this choice is an open research question.

05

Qwen3 Series

Dense + MoE, from 0.6B to 235B — consistently at the top

0.6B → 235B · 22B active (235B-A22B) · Dense + MoE

The Qwen team delivers consistently strong models. Qwen3 offers 7 dense sizes (0.6B to 32B) and 2 MoE variants (30B-A3B, 235B-A22B), giving practitioners flexibility depending on their inference budget.

The 235B-A22B architecture is remarkably similar to DeepSeek V3 — but notably drops the shared expert. The Qwen3 team explained this was partly due to not seeing significant gains and concerns around inference optimization complexity.

QK-Norm + Deeper Architecture

Compared to Llama 3 1B, Qwen3 0.6B uses a deeper but narrower architecture: more transformer blocks, smaller hidden dimensions, fewer attention heads. This trades some inference speed for a smaller memory footprint, a good fit for edge deployment scenarios.

06

Kimi K2

1 trillion parameters — the largest open-weight model of this generation

1 Trillion params · 32B active · 128k / 256k context · MoE · MLA

Kimi K2 is essentially a scaled-up DeepSeek V3: same MLA + MoE architecture, but pushed to 1 trillion total parameters. The team uses more experts per MoE module and fewer MLA heads compared to DeepSeek, making different capacity/bandwidth tradeoffs.

Notably, Kimi K2 was trained using the Muon optimizer (rather than AdamW) — reportedly the first production model of this scale to do so, yielding impressively smooth loss curves.

🧠

The Kimi K2 Thinking variant (released Nov 2025) shares the same architecture but extends context from 128k to 256k tokens. According to Moonshot AI's benchmarks, it surpasses leading proprietary models on several agentic reasoning and coding tasks.

Architecture Comparison

| Model | Params (Total) | Params (Active) | Attention | FFN | Key Features |
|---|---|---|---|---|---|
| DeepSeek V3 | 671B | 37B | MLA | MoE ×256 | Shared expert, RoPE |
| Llama 4 Maverick | 400B | 17B | GQA | MoE (alternating) | SwiGLU, 512k ctx |
| Gemma 3 27B | 27B | 27B | GQA + SWA | Dense | Pre+Post Norm, 5:1 ratio |
| Qwen3 235B-A22B | 235B | 22B | GQA | MoE ×128 | QK-Norm, no shared expert |
| Kimi K2 | 1T | 32B | MLA | MoE (more experts) | Muon optimizer, 256k ctx |
| OLMo 2 7B | 7B | 7B | MHA | Dense | Post-Norm, QK-Norm |
| GLM-5 | 744B | 40B | MLA + DSA | MoE ×256 | Shared expert, sparse attention |
| Mistral 3 Large | 673B | 41B | MLA | MoE (32 exp.) | Shared expert, vision encoder |

Release Timeline

Architecture milestones across 15+ months of frontier model releases

Dec 2024
DeepSeek V3
The New Baseline

DeepSeek V3 reshapes expectations for open-weight models. At 671B parameters with only 37B active per token, it matches or beats GPT-4 class models while being fully open. The architecture introduces two ideas that will dominate the next year: Multi-Head Latent Attention (MLA) for KV cache compression and a fine-grained MoE with 256 experts and a dedicated shared expert.

DeepSeek V3 · 671B
- MLA replaces GQA for KV compression
- 256 routed experts + 1 shared expert
- 37B active params per token
- Multi-Token Prediction (MTP) training
- 128k context window
Jan 2025
DeepSeek R1 · OLMo 2
Reasoning + Openness

DeepSeek R1 adds chain-of-thought reasoning on top of V3's architecture — same structure, trained with reinforcement learning to think before answering. Meanwhile, OLMo 2 from Allen AI becomes the most transparent model released: fully open weights, training data, and code. Its architectural contribution is Post-Norm placement and QK-Norm, which improve training stability and would be widely copied.

DeepSeek R1 OLMo 2 · 7B / 32B
- R1: RL-trained reasoning on V3 base
- OLMo 2: Post-Norm (RMSNorm inside residual)
- OLMo 2: QK-Norm on Q and K vectors
- Fully open: weights, data, and training code
Mar 2025
Gemma 3 · Mistral Small 3.1
Memory Efficiency

Gemma 3 from Google DeepMind introduces an aggressive 5:1 sliding window attention ratio — for every global attention layer, five layers only attend to a local window of 1024 tokens. This slashes KV cache memory at long context. It also uses both Pre-Norm and Post-Norm around the attention block. Mistral Small 3.1 takes a different approach: standard GQA with no sliding window, focusing on raw inference speed over memory.

Gemma 3 · 1B–27B Gemma 3n Mistral Small 3.1 · 24B
- Gemma 3: Sliding window 5:1, size=1024
- Gemma 3: Pre + Post Norm around attention
- Gemma 3n: Per-Layer Embeddings (PLE) streamed on demand
- Mistral 3.1: Standard GQA, no sliding window
Apr 2025
Llama 4 · Qwen3
MoE Goes Mainstream

Llama 4 Maverick marks Meta's full adoption of MoE — a significant shift from Llama 3's dense architecture. It uses alternating dense and MoE blocks with GQA (not MLA), and a 512k context window via iRoPE. Simultaneously, Qwen3 from Alibaba releases a full dense-to-MoE family spanning 0.6B to 235B. Its key contribution: QK-Norm on both Q and K combined with a deeper (more layers, narrower) architecture versus Llama's approach.

Llama 4 Scout Llama 4 Maverick · ~400B Qwen3 Dense · 0.6B–32B Qwen3 MoE · 235B-A22B
- Llama 4: GQA + alternating dense/MoE blocks
- Llama 4: 512k context via iRoPE
- Qwen3 MoE: 8 active experts, no shared expert
- Qwen3: QK-Norm + deeper architecture vs Llama
May 2025
SmolLM3 · Grok 2.5
Positional Experiments

SmolLM3 from HuggingFace demonstrates that you can skip positional embeddings entirely in some layers — NoPE (No Positional Embeddings) every 4th layer. The causal mask already encodes token order implicitly; removing RoPE from select layers improves length generalization beyond the training context. Grok 2.5 from xAI takes a different MoE bet: only 8 large experts (coarse-grained) versus DeepSeek's 256 small ones, with a dense SwiGLU as its shared expert.

SmolLM3 · 3B Grok 2.5 · ~270B
- SmolLM3: NoPE every 4th layer
- SmolLM3: Better length generalization
- Grok 2.5: Only 8 large experts (coarse MoE)
- Grok 2.5: Dense SwiGLU as shared expert
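The NoPE scheme can be sketched as a per-layer switch. The minimal RoPE implementation and the choice of which layer indices skip it are simplified assumptions, not SmolLM3's exact code:

```python
import numpy as np

def rope(x, base=10000.0):
    # Minimal RoPE on half-split dimension pairs (single head, no caching).
    seq, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(seq), freqs)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate(
        [x1 * np.cos(ang) - x2 * np.sin(ang),
         x1 * np.sin(ang) + x2 * np.cos(ang)], axis=-1)

def maybe_rope(q, layer_idx, nope_every=4):
    # Skip positional encoding on every 4th layer; the causal mask still
    # gives those layers an implicit notion of token order.
    return q if (layer_idx + 1) % nope_every == 0 else rope(q)

rng = np.random.default_rng(4)
q = rng.normal(size=(6, 8))
rotated = [not np.allclose(maybe_rope(q, i), q) for i in range(8)]
print(rotated)  # False at layers 3 and 7: those layers leave q unrotated
```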
Jul 2025
Kimi K2 · GPT-OSS · GLM-4.5
The Trillion Param Era

Kimi K2 from Moonshot AI scales DeepSeek V3's architecture to 1 trillion total parameters — the largest open model yet — keeping 32B active. It validates that MLA + fine-grained MoE scales beyond the original V3 design. GPT-OSS (OpenAI's first open-weight release since GPT-2) introduces attention sinks as learned logit biases and sliding window every other layer. GLM-4.5 from Zhipu adds attention bias units and places 3 dense transformer layers before the MoE blocks begin.

Kimi K2 · 1T total · 32B active GPT-OSS · 20B / 120B GLM-4.5 · 355B / 106B
- Kimi K2: MLA + MoE at 1T scale
- GPT-OSS: Learned attention sink logits
- GPT-OSS: Sliding window every other layer
- GLM-4.5: 3 dense layers before MoE starts
- GLM-4.5: Attention bias + MTP training
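The attention-sink idea from GPT-OSS can be sketched as an extra learned logit column that participates in the softmax but whose probability mass is then discarded. This is a simplified single-head reading of the technique, not OpenAI's implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(5)
logits = rng.normal(size=(4, 6))   # attention scores for 4 queries over 6 keys
sink = np.full((4, 1), 2.0)        # learned per-head sink logit (a scalar here)

# The sink column joins the softmax but its probability is thrown away,
# letting a head attend "nowhere" instead of being forced onto real tokens.
probs = softmax(np.concatenate([sink, logits], axis=-1))[:, 1:]
print(probs.sum(axis=-1))  # each row sums to less than 1
```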
Sep–Dec 2025
Qwen3-Next · MiniMax-M2 · Kimi Linear · Nemotron 3 · DeepSeek V3.2 · Mistral 3
Linear Attention Wave

A wave of models explore moving beyond standard O(n²) attention. Qwen3-Next pairs Gated DeltaNet (linear) with full attention in a 3:1 ratio. Kimi Linear invents its own Kimi Delta Attention with channel-wise gating, combined with MLA in the full-attention layers. Nemotron 3 goes furthest: Mamba-2 state-space layers + sparse MoE + GQA in a single architecture. Meanwhile, MiniMax-M2 introduces Partial RoPE (applied to only half of head dimensions) for better length extrapolation. Mistral 3 Large adopts MLA directly, essentially cloning the DeepSeek V3 design at 673B.

Qwen3-Next · 80B-A3B Kimi Linear · 48B-A3B MiniMax-M2 · 230B Nemotron 3 Nano · 30B-A3B DeepSeek V3.2 · 671B Mistral 3 Large · 673B
- Qwen3-Next: Gated DeltaNet 3:1 hybrid + MTP
- Kimi Linear: Novel channel-wise gated attention + NoPE
- MiniMax-M2: Partial RoPE on 50% of head dims
- Nemotron 3: Mamba-2 + MoE + GQA triple hybrid
- DeepSeek V3.2: Adds DeepSeek Sparse Attention (DSA)
- Mistral 3: MLA adopted, multimodal vision encoder
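Partial RoPE as described for MiniMax-M2 can be sketched by rotating only the first half of the head dimensions. The RoPE layout here is one common convention, not the model's exact implementation:

```python
import numpy as np

def rope(x, base=10000.0):
    # Minimal RoPE on half-split dimension pairs (single head, no caching).
    seq, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(seq), freqs)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate(
        [x1 * np.cos(ang) - x2 * np.sin(ang),
         x1 * np.sin(ang) + x2 * np.cos(ang)], axis=-1)

def partial_rope(q, frac=0.5):
    # Rotate only the first `frac` of head dimensions; the rest stay
    # position-free, which is claimed to help length extrapolation.
    d_rot = int(q.shape[-1] * frac)
    return np.concatenate([rope(q[:, :d_rot]), q[:, d_rot:]], axis=-1)

rng = np.random.default_rng(6)
q = rng.normal(size=(5, 16))
out = partial_rope(q)
print(np.allclose(out[:, 8:], q[:, 8:]))  # True: second half untouched
```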
Feb 2026
GLM-5 · Arcee Trinity · OLMo 3 · Xiaomi MiMo-V2
Architectural Maturation

GLM-5 represents the culmination of the DeepSeek V3 template, adding MLA and DeepSeek Sparse Attention on top of GLM-4.5's MoE foundation: 744B total parameters, 256 experts, 40B active, across 78 layers. Arcee Trinity Large stacks the most techniques of any model: MoE + sliding window 3:1 + NoPE on global layers + gated attention + depth-scaled sandwich norm. OLMo 3 extends OLMo 2 with sliding window attention and YaRN for the global layers. Xiaomi MiMo-V2-Flash pairs an aggressive 5:1 sliding window with a tiny window size of just 128 tokens.

GLM-5 · 744B · 40B active Arcee Trinity · 400B OLMo 3 · 7B / 32B Xiaomi MiMo-V2-Flash · 309B
- GLM-5: MLA + DSA + 256 experts at 744B
- GLM-5: MTP for speculative decoding
- Trinity: NoPE + gated attn + sandwich norm
- OLMo 3: Sliding window 3:1 + fully open
- MiMo-V2: Sliding window 5:1 with size=128