From DeepSeek V3 to GLM-5 — a comprehensive look at how modern open-weight language models are structurally designed, what makes them tick, and where the field is heading.
It has been seven years since the original GPT architecture was developed. Looking back at GPT-2 (2019) and forward to DeepSeek V3 and Llama 4 (2024–2025), one might be surprised at how structurally similar these models still are at their core.
Positional embeddings evolved from absolute to rotational (RoPE). Multi-Head Attention largely gave way to Grouped-Query Attention. SwiGLU replaced GELU. But beneath these refinements, the foundational transformer decoder architecture remains largely intact — we are polishing, not rebuilding.
This blog focuses on the structural architectural choices that define today's flagship open-weight models — not benchmark scores or training recipes. It covers what LLM developers are actually building in 2025.
- **Mixture-of-Experts:** sparse activation — activate only a handful of experts per token for massive capacity at low inference cost.
- **KV-cache efficiency:** MLA, sliding window attention, and partial RoPE all target the memory bottleneck in long-context inference.
- **Normalization:** Post-Norm vs. Pre-Norm, QK-Norm, and sandwich norms — placement matters for training stability.
- **Linear attention:** Gated DeltaNet, Mamba-2 hybrids, and lightning attention offer O(n) alternatives to quadratic attention.
The model that changed everything in early 2025
DeepSeek R1 made a massive impact when released in January 2025. It is a reasoning model built on top of the DeepSeek V3 architecture (December 2024), which introduced two key innovations that distinguish it from most other LLMs.
Instead of sharing key/value heads like GQA, MLA compresses the K/V tensors into a lower-dimensional latent space before storing them in the KV cache, then projects them back up at inference time. This adds a matrix multiplication but dramatically reduces cache memory, and the DeepSeek-V2 paper reports that MLA actually outperforms standard MHA on benchmarks.
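To make the mechanism concrete, here is a minimal NumPy sketch of the MLA idea. The dimensions are tiny stand-ins (not DeepSeek's actual sizes) and the projections are random rather than learned:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq = 64, 16, 8   # tiny stand-ins; real models are far larger

# Learned projections in a real model; random stand-ins here.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)

x = rng.standard_normal((seq, d_model))

# Instead of caching full K and V (2 * seq * d_model floats),
# store only one shared low-dimensional latent per token.
kv_cache = x @ W_down          # (seq, d_latent): this is what the KV cache holds

# At attention time, project the latent back up to K and V.
k = kv_cache @ W_up_k          # (seq, d_model)
v = kv_cache @ W_up_v          # (seq, d_model)

full_floats = 2 * seq * d_model     # what MHA/GQA-style caching would store
latent_floats = seq * d_latent      # what MLA stores
print(full_floats // latent_floats) # 8x smaller at these toy sizes
```

The extra up-projection is the matmul cost mentioned above; the payoff is that only the small latent ever sits in memory.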
Each FeedForward module is replaced by 256 small experts with a router selecting only 9 (1 shared + 8 routed) per token. Total parameters: 671B. Active parameters per step: just 37B. The shared expert — always active — handles common patterns, freeing specialized experts for niche knowledge.
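A toy version of the routing logic looks like this. Sizes are shrunk drastically, and each "expert" is a single matrix rather than a full SwiGLU MLP; a real DeepSeek-style layer has 256 routed experts with top-8 selection:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 32, 16, 2   # toy sizes; DeepSeek V3: 256 experts, top-8

# Each expert is a full SwiGLU MLP in a real model; single matrices here.
experts = rng.standard_normal((n_experts, d_model, d_model)) / np.sqrt(d_model)
shared = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_router = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)

def moe_forward(x):
    logits = x @ W_router                 # router scores, one per expert
    top = np.argsort(logits)[-top_k:]     # pick the top-k routed experts
    w = np.exp(logits[top])
    w = w / w.sum()                       # softmax over the selected experts
    out = x @ shared                      # shared expert: always active
    for weight, i in zip(w, top):         # only the top-k routed experts run
        out = out + weight * (x @ experts[i])
    return out

token = rng.standard_normal(d_model)
y = moe_forward(token)
print(y.shape)   # (32,)
```

All expert weights exist in memory (the total parameter count), but per token only the shared expert plus top-k are multiplied (the active parameter count).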
DeepSeek V3 outperformed the 405B-parameter Llama 3 at launch. Its total parameter count is roughly 1.7× larger (671B), yet it activates far fewer parameters per token at inference time (37B). This efficiency/capacity tradeoff is the core insight powering the MoE revival of 2025.
Fully transparent — the blueprint for reproducible LLM research
The Allen Institute's OLMo series is celebrated for its full transparency: training data, code, and checkpoints are all released publicly. OLMo 2 sat on the Pareto frontier of compute-to-performance at the time of its January 2025 release.
The architecture is mostly standard — notably still using traditional Multi-Head Attention (MHA) rather than GQA. The interesting innovations are in normalization placement.
Most LLMs since GPT-2 place norms before each sub-layer (Pre-Norm). OLMo 2 moves them after the attention and feed-forward modules while keeping them inside the residual connection, unlike the original transformer's Post-LN. Combined with QK-Norm (RMSNorm applied to queries and keys inside attention), this yields smoother gradient norms and fewer training loss spikes.
QK-Norm applies an additional RMSNorm to the query and key vectors before RoPE is applied. It stabilizes the dot-product magnitudes that drive attention scores. Not new to OLMo — it dates back to the 2023 Scaling Vision Transformers paper — but it's becoming a standard add-on across the field.
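A minimal sketch of the effect, assuming a parameter-free RMSNorm (real implementations learn a per-channel scale):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Parameter-free RMSNorm; real implementations learn a per-channel scale.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
head_dim = 16
q = 10.0 * rng.standard_normal((4, 8, head_dim))   # (heads, seq, head_dim), big values
k = 10.0 * rng.standard_normal((4, 8, head_dim))

q_n, k_n = rms_norm(q), rms_norm(k)                # QK-Norm, applied before RoPE

# Pre-softmax logits are now bounded by sqrt(head_dim) regardless of how
# large the raw q/k activations grow, keeping attention scores stable.
scores = q_n @ k_n.transpose(0, 2, 1) / np.sqrt(head_dim)
print(float(np.abs(scores).max()) <= np.sqrt(head_dim))   # True
```

Without the norm, the same inputs would produce logits two orders of magnitude larger, pushing the softmax into saturation.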
Google's underappreciated gem with a clever efficiency trick
While others adopted MoE for efficiency, Gemma 3 went a different route: Sliding Window Attention. Instead of every token attending to all others (O(n²)), local attention layers restrict each token's view to a fixed nearby window.
Gemma 2 used a 1:1 ratio of local to global attention. Gemma 3 pushes this to 5:1 — five sliding-window layers for every one full-attention layer — and shrinks the window from 4096 (Gemma 2) to just 1024 tokens. Their ablation studies show minimal impact on output quality but massive KV cache memory savings, especially at long context lengths.
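A sliding-window causal mask is a one-line change to the standard mask. The sketch below uses a toy window of 3 over 8 tokens rather than Gemma 3's 1024, with the same 5:1 local-to-global layer layout:

```python
import numpy as np

def causal_mask(seq_len, window=None):
    """Boolean mask: True where query position i may attend to key position j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = j <= i                      # causal: never attend to the future
    if window is not None:
        mask &= j > i - window         # local: only the last `window` keys
    return mask

# Gemma-3-style layout: 5 local layers per global layer (layers 5, 11 global).
layer_masks = [causal_mask(8, window=None if l % 6 == 5 else 3)
               for l in range(12)]

full = int(causal_mask(8).sum())             # score entries in a global layer
local = int(causal_mask(8, window=3).sum())  # entries in a window-3 local layer
print(full, local)   # 36 21
```

The saving grows with context: at long sequence lengths the local layers' KV cache stays fixed at the window size while only the rare global layers pay full price.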
Gemma 3 applies RMSNorm both before and after each attention and feed-forward sub-layer. This "best of both worlds" approach costs very little compute (RMSNorm is cheap) but may improve training stability by combining the benefits of Pre-Norm gradient behavior with the output normalization of Post-Norm.
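In code, the sandwich is just one extra normalization on the sublayer output (parameter-free RMSNorm here for brevity):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sandwich_block(x, sublayer):
    # Pre-Norm on the sublayer input AND Post-Norm on its output,
    # both inside the residual connection, as in Gemma 3.
    return x + rms_norm(sublayer(rms_norm(x)))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
y = sandwich_block(x, sublayer=lambda h: 100.0 * h)  # deliberately explosive sublayer

# The output-side norm keeps each residual update O(1) even when the
# sublayer produces huge activations.
print(float(np.abs(y - x).max()) <= 4.0)   # True (bounded by sqrt(dim) = 4)
```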
Meta's MoE debut with a classic expert design philosophy
Llama 4 Maverick closely mirrors the DeepSeek V3 architecture — MoE layers, SwiGLU FeedForward, RMSNorm — but with two notable differences:
GQA instead of MLA: Meta kept Grouped-Query Attention rather than adopting DeepSeek's more complex Multi-Head Latent Attention, likely prioritizing implementation simplicity.
Fewer, larger experts: Llama 4 uses a more classic MoE configuration — 2 active experts with hidden size 8,192 each — compared to DeepSeek's 9 active experts with hidden size 2,048. As a result, it activates fewer than half as many parameters per token (17B vs. 37B), and its roughly 400B total is about 60% of DeepSeek V3's 671B.
Llama 4 also alternates between dense and MoE blocks in every other transformer layer, rather than applying MoE to all layers. This is different from DeepSeek V3 (which only skips the first 3 blocks). The impact on performance versus efficiency is an open research question.
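Some rough per-layer arithmetic shows how the two expert philosophies compare at the FFN level. This counts only expert weights (SwiGLU's three matrices) and ignores attention, routing, and layer counts; the d_model values are assumptions for illustration, not official configs:

```python
# Per-layer active FFN parameters under the two expert philosophies.
# Counts only expert weights; d_model values are illustrative assumptions.

def swiglu_params(d_model, d_hidden):
    return 3 * d_model * d_hidden       # up + gate + down projections

deepseek_per_layer = 9 * swiglu_params(7168, 2048)   # 9 small active experts
llama4_per_layer = 2 * swiglu_params(5120, 8192)     # 2 large active experts

print(deepseek_per_layer, llama4_per_layer)
# Fine-grained routing activates more, smaller matrices; the coarse design
# activates fewer, larger ones. Model totals then scale with depth.
```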
Dense + MoE, from 0.6B to 235B — consistently at the top
The Qwen team delivers consistently strong models. Qwen3 offers six dense sizes (0.6B to 32B) and two MoE variants (30B-A3B, 235B-A22B), giving practitioners flexibility depending on their inference budget.
The 235B-A22B architecture is remarkably similar to DeepSeek V3 — but notably drops the shared expert. The Qwen3 team explained that they saw no significant gains from it and were also wary of the added inference-optimization complexity.
Compared to Llama 3 1B, the Qwen3 0.6B is a deeper but narrower architecture: more transformer blocks, smaller hidden dimensions, fewer attention heads. This trades inference speed for a smaller memory footprint — a good design for edge deployment scenarios.
1 trillion parameters — the largest open-weight model of this generation
Kimi K2 is essentially a scaled-up DeepSeek V3: same MLA + MoE architecture, but pushed to 1 trillion total parameters. The team uses more experts per MoE module and fewer MLA heads compared to DeepSeek, making different capacity/bandwidth tradeoffs.
Notably, Kimi K2 was trained using the Muon optimizer (rather than AdamW) — reportedly the first production model of this scale to do so, yielding impressively smooth loss curves.
The Kimi K2 Thinking variant (released Nov 2025) shares the same architecture but extends context from 128k to 256k tokens. According to Moonshot AI's benchmarks, it surpasses leading proprietary models on several agentic reasoning and coding tasks.
| Model | Params (Total) | Params (Active) | Attention | FFN | Key Features |
|---|---|---|---|---|---|
| DeepSeek V3 | 671B | 37B | MLA | MoE ×256 | Shared expert, RoPE |
| Llama 4 Maverick | 400B | 17B | GQA | MoE alternating | SwiGLU, 512k ctx |
| Gemma 3 27B | 27B | 27B | GQA + SWA | Dense | Pre+Post Norm, 5:1 ratio |
| Qwen3 235B-A22B | 235B | 22B | GQA | MoE ×128 | QK-Norm, no shared exp. |
| Kimi K2 | 1T | 32B | MLA | MoE (more exp.) | Muon opt., 256k ctx (Thinking) |
| OLMo 2 7B | 7B | 7B | MHA | Dense | Post-Norm, QK-Norm |
| GLM-5 | 744B | 40B | MLA + DSA | MoE ×256 | Shared expert, sparse attn |
| Mistral 3 Large | 673B | 41B | MLA | MoE (32 exp.) | Shared expert, vision enc. |
DeepSeek V3's success triggered a wave of MoE adoption: Llama 4, Qwen3, GLM-5, Mistral 3 Large, and Kimi K2 all use it. Fine-grained, many-small-expert designs are preferred over fewer large experts.
MLA (compress K/V), sliding window attention (truncate K/V), and partial RoPE (limit position scaling) all target the same bottleneck: KV cache memory at long context.
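Back-of-the-envelope numbers show why all three tricks chase the same budget. The config below is assumed for the sketch (48 layers, 8 KV heads of dim 128, fp16, 128k context, MLA latent of 512), not taken from any specific model:

```python
# Illustrative fp16 KV-cache budget for one 128k-token request.
# All config values are assumptions for the sketch.

def kv_cache_bytes(layers, kv_heads, head_dim, ctx, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per   # 2 = K and V

ctx = 128_000
full = kv_cache_bytes(layers=48, kv_heads=8, head_dim=128, ctx=ctx)

# MLA: cache one shared 512-dim latent per token per layer instead of K and V.
mla = 48 * 512 * ctx * 2

# Sliding window, 5:1 ratio with window 1024: 40 local layers cap their context.
swa = kv_cache_bytes(8, 8, 128, ctx) + kv_cache_bytes(40, 8, 128, 1024)

for name, b in [("full", full), ("MLA", mla), ("5:1 SWA", swa)]:
    print(f"{name:8s}{b / 1e9:6.2f} GB")
```

At these toy settings the full cache runs to tens of gigabytes per request, while either compression or truncation cuts it by a factor of 4 to 6.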
Post-Norm (OLMo), Pre+Post Norm (Gemma 3), depth-scaled sandwich norm (Trinity), QK-Norm (Qwen3, OLMo, MiniMax) — everyone has an opinion on norm placement now.
Qwen3-Next, Kimi Linear, and Nemotron 3 are betting on O(n) alternatives (Gated DeltaNet, Mamba-2). MiniMax M2 retreated to full attention after linear attention struggled on reasoning tasks.
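Stripped of gating and hardware tricks, the whole O(n) family reduces to carrying a running d×d state instead of an n×n score matrix. A minimal causal linear-attention sketch, with an ad-hoc ReLU feature map standing in for the learned gates of Gated DeltaNet or Mamba-2:

```python
import numpy as np

def linear_attention(q, k, v):
    """Causal linear attention: O(n) via a running (d x d) state."""
    seq, d = q.shape
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # ad-hoc positive feature map
    state = np.zeros((d, d))                    # running sum of phi(k_t) v_t^T
    z = np.zeros(d)                             # running sum of phi(k_t)
    out = np.empty_like(v)
    for t in range(seq):                        # one pass, no n x n matrix
        state += np.outer(phi(k[t]), v[t])
        z += phi(k[t])
        out[t] = (phi(q[t]) @ state) / (phi(q[t]) @ z)
    return out

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 6, 8))   # (seq=6, d=8) each
out = linear_attention(q, k, v)
print(out.shape)   # (6, 8)
```

The fixed-size state is both the appeal (constant memory, O(n) time) and the weakness: it must compress the entire history, which is one plausible reason these models have struggled on long reasoning chains.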
Kimi K2, Mistral 3 Large, and GLM-5 all adopt the DeepSeek V3 MLA + DeepSeekMoE architecture — sometimes almost identically. It's become the de facto reference design for large-scale open models.
Qwen3 goes deeper (more layers, smaller hidden dim). GPT-OSS goes wider (larger hidden dim, fewer layers). Gemma 2 ablations suggest wider is slightly better for a fixed parameter count, but the gap is small.
Architecture milestones across 15+ months of frontier model releases
DeepSeek V3 reshapes expectations for open-weight models. At 671B parameters with only 37B active per token, it matches or beats GPT-4 class models while being fully open. The architecture introduces two ideas that will dominate the next year: Multi-Head Latent Attention (MLA) for KV cache compression and a fine-grained MoE with 256 experts and a dedicated shared expert.
DeepSeek R1 adds chain-of-thought reasoning on top of V3's architecture — same structure, trained with reinforcement learning to think before answering. Meanwhile, OLMo 2 from Allen AI becomes the most transparent model released: fully open weights, training data, and code. Its architectural contribution is Post-Norm placement and QK-Norm, which improve training stability and would be widely copied.
Gemma 3 from Google DeepMind introduces an aggressive 5:1 sliding window attention ratio — for every global attention layer, five layers only attend to a local window of 1024 tokens. This slashes KV cache memory at long context. It also uses both Pre-Norm and Post-Norm around the attention block. Mistral Small 3.1 takes a different approach: standard GQA with no sliding window, focusing on raw inference speed over memory.
Llama 4 Maverick marks Meta's full adoption of MoE — a significant shift from Llama 3's dense architecture. It uses alternating dense and MoE blocks with GQA (not MLA), and a 512k context window via iRoPE. Simultaneously, Qwen3 from Alibaba releases a full dense-to-MoE family spanning 0.6B to 235B. Its key contribution: QK-Norm on both Q and K combined with a deeper (more layers, narrower) architecture versus Llama's approach.
SmolLM3 from HuggingFace demonstrates that you can skip positional embeddings entirely in some layers — NoPE (No Positional Embeddings) every 4th layer. The causal mask already encodes token order implicitly; removing RoPE from select layers improves length generalization beyond the training context. Grok 2.5 from xAI takes a different MoE bet: only 8 large experts (coarse-grained) versus DeepSeek's 256 small ones, with a dense SwiGLU as its shared expert.
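NoPE is less exotic than it sounds: it is simply the absence of RoPE on selected layers. A sketch with an every-4th-layer schedule like SmolLM3's (the RoPE implementation here is the standard pairwise rotation, written out for clarity):

```python
import numpy as np

def rope(x, base=10000.0):
    # Standard rotary embedding over (seq, head_dim) queries or keys.
    seq, d = x.shape
    ang = np.arange(seq)[:, None] * base ** (-np.arange(0, d, 2) / d)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * np.cos(ang) - x[:, 1::2] * np.sin(ang)
    out[:, 1::2] = x[:, 0::2] * np.sin(ang) + x[:, 1::2] * np.cos(ang)
    return out

n_layers = 12
# NoPE schedule: every 4th layer skips the rotation entirely and relies on
# the causal mask alone for order information.
nope_layers = [l for l in range(n_layers) if (l + 1) % 4 == 0]

q = np.ones((5, 16))
per_layer_q = [q if l in nope_layers else rope(q) for l in range(n_layers)]
print(nope_layers)   # [3, 7, 11]
```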
Kimi K2 from Moonshot AI scales DeepSeek V3's architecture to 1 trillion total parameters — the largest open model yet — keeping 32B active. It validates that MLA + fine-grained MoE scales beyond the original V3 design. GPT-OSS (OpenAI's first open-weight release since GPT-2) introduces attention sinks as learned logit biases and sliding window every other layer. GLM-4.5 from Zhipu adds attention bias units and places 3 dense transformer layers before the MoE blocks begin.
A wave of models explore moving beyond standard O(n²) attention. Qwen3-Next pairs Gated DeltaNet (linear) with full attention in a 3:1 ratio. Kimi Linear invents its own Kimi Delta Attention with channel-wise gating, combined with MLA in the full-attention layers. Nemotron 3 goes furthest: Mamba-2 state-space layers + sparse MoE + GQA in a single architecture. Meanwhile, MiniMax-M2 introduces Partial RoPE (applied to only half of head dimensions) for better length extrapolation. Mistral 3 Large adopts MLA directly, essentially cloning the DeepSeek V3 design at 673B.
GLM-5 represents the culmination of the DeepSeek V3 template, adding MLA and DeepSeek Sparse Attention on top of GLM-4.5's MoE foundation — 744B total, 256 experts, 40B active across 78 layers. Arcee Trinity Large stacks the most techniques of any model: MoE + Sliding Window 3:1 + NoPE on global layers + gated attention + depth-scaled sandwich norm. OLMo 3 extends OLMo 2 with sliding window attention and YaRN for global layers. Xiaomi MiMo-V2-Flash adds aggressive 5:1 sliding window with a tiny window size of 128 tokens.