From DeepSeek V3 to GLM-5 — a comprehensive look at how modern open-weight language models are structurally designed, what makes them tick, and where the field is heading.
It has been seven years since the original GPT architecture was developed. Looking back at GPT-2 (2019) and forward to DeepSeek V3 and Llama 4 (2024–2025), one might be surprised at how structurally similar these models still are at their core.
Positional embeddings evolved from absolute to rotational (RoPE). Multi-Head Attention largely gave way to Grouped-Query Attention. SwiGLU replaced GELU. But beneath these refinements, the foundational transformer decoder architecture remains largely intact — we are polishing, not rebuilding.
This blog focuses on the structural architectural choices that define today's flagship open-weight models — not benchmark scores or training recipes. It covers what LLM developers are actually building in 2025.
- **Mixture-of-Experts:** sparse activation — activate only a handful of experts per token for massive capacity at low inference cost.
- **KV-cache efficiency:** MLA, sliding window attention, and partial RoPE all target the memory bottleneck in long-context inference.
- **Normalization:** Post-Norm vs. Pre-Norm, QK-Norm, and sandwich norms — placement matters for training stability.
- **Linear attention:** Gated DeltaNet, Mamba-2 hybrids, and lightning attention offer O(n) alternatives to quadratic attention.
The model that changed everything in early 2025
DeepSeek R1 made a massive impact when released in January 2025. It is a reasoning model built on top of the DeepSeek V3 architecture (December 2024), which introduced two key innovations that distinguish it from most other LLMs.
Instead of sharing key/value heads like GQA, MLA compresses the K/V tensors into a lower-dimensional latent space before storing them in the KV cache, then projects them back up at inference time. This adds a matrix multiplication but dramatically reduces cache memory, and the DeepSeek-V2 paper reports that MLA actually outperforms standard MHA on benchmarks.
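To make the mechanism concrete, here is a minimal NumPy sketch of the MLA idea. The dimensions are tiny stand-ins (not DeepSeek's actual sizes) and the projections are random rather than learned:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq = 64, 16, 8   # tiny stand-ins; real models are far larger

# Learned projections in a real model; random stand-ins here.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)

x = rng.standard_normal((seq, d_model))

# Instead of caching full K and V (2 * seq * d_model floats),
# store only one shared low-dimensional latent per token.
kv_cache = x @ W_down          # (seq, d_latent): this is what the KV cache holds

# At attention time, project the latent back up to K and V.
k = kv_cache @ W_up_k          # (seq, d_model)
v = kv_cache @ W_up_v          # (seq, d_model)

full_floats = 2 * seq * d_model     # what MHA/GQA-style caching would store
latent_floats = seq * d_latent      # what MLA stores
print(full_floats // latent_floats) # 8x smaller at these toy sizes
```

The extra up-projection is the matmul cost mentioned above; the payoff is that only the small latent ever sits in memory.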
Each FeedForward module is replaced by 256 small experts with a router selecting only 9 (1 shared + 8 routed) per token. Total parameters: 671B. Active parameters per step: just 37B. The shared expert — always active — handles common patterns, freeing specialized experts for niche knowledge.
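A toy version of the routing logic looks like this. Sizes are shrunk drastically, and each "expert" is a single matrix rather than a full SwiGLU MLP; a real DeepSeek-style layer has 256 routed experts with top-8 selection:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 32, 16, 2   # toy sizes; DeepSeek V3: 256 experts, top-8

# Each expert is a full SwiGLU MLP in a real model; single matrices here.
experts = rng.standard_normal((n_experts, d_model, d_model)) / np.sqrt(d_model)
shared = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_router = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)

def moe_forward(x):
    logits = x @ W_router                 # router scores, one per expert
    top = np.argsort(logits)[-top_k:]     # pick the top-k routed experts
    w = np.exp(logits[top])
    w = w / w.sum()                       # softmax over the selected experts
    out = x @ shared                      # shared expert: always active
    for weight, i in zip(w, top):         # only the top-k routed experts run
        out = out + weight * (x @ experts[i])
    return out

token = rng.standard_normal(d_model)
y = moe_forward(token)
print(y.shape)   # (32,)
```

All expert weights exist in memory (the total parameter count), but per token only the shared expert plus top-k are multiplied (the active parameter count).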
DeepSeek V3 outperformed the 405B-parameter Llama 3 at launch. Its total parameter count is roughly 1.7× larger (671B), yet it activates far fewer parameters per token at inference time (37B). This efficiency/capacity tradeoff is the core insight powering the MoE revival of 2025.
Fully transparent — the blueprint for reproducible LLM research
The Allen Institute's OLMo series is celebrated for its full transparency: training data, code, and checkpoints are all released publicly. OLMo 2 sat on the Pareto frontier of compute-to-performance at the time of its January 2025 release.
The architecture is mostly standard — notably still using traditional Multi-Head Attention (MHA) rather than GQA. The interesting innovations are in normalization placement.
Most LLMs since GPT-2 place norms before each sub-layer (Pre-Norm). OLMo 2 moves them after the attention and feed-forward modules while keeping them inside the residual connection, unlike the original transformer's Post-LN. Combined with QK-Norm (RMSNorm applied to queries and keys inside attention), this yields smoother gradient norms and fewer training loss spikes.
QK-Norm applies an additional RMSNorm to the query and key vectors before RoPE is applied. It stabilizes the dot-product magnitudes that drive attention scores. Not new to OLMo — it dates back to the 2023 Scaling Vision Transformers paper — but it's becoming a standard add-on across the field.
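A minimal sketch of the effect, assuming a parameter-free RMSNorm (real implementations learn a per-channel scale):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Parameter-free RMSNorm; real implementations learn a per-channel scale.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
head_dim = 16
q = 10.0 * rng.standard_normal((4, 8, head_dim))   # (heads, seq, head_dim), big values
k = 10.0 * rng.standard_normal((4, 8, head_dim))

q_n, k_n = rms_norm(q), rms_norm(k)                # QK-Norm, applied before RoPE

# Pre-softmax logits are now bounded by sqrt(head_dim) regardless of how
# large the raw q/k activations grow, keeping attention scores stable.
scores = q_n @ k_n.transpose(0, 2, 1) / np.sqrt(head_dim)
print(float(np.abs(scores).max()) <= np.sqrt(head_dim))   # True
```

Without the norm, the same inputs would produce logits two orders of magnitude larger, pushing the softmax into saturation.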
Google's underappreciated gem with a clever efficiency trick
While others adopted MoE for efficiency, Gemma 3 went a different route: Sliding Window Attention. Instead of every token attending to all others (O(n²)), local attention layers restrict each token's view to a fixed nearby window.
Gemma 2 used a 1:1 ratio of local to global attention. Gemma 3 pushes this to 5:1 — five sliding-window layers for every one full-attention layer — and shrinks the window from 4096 (Gemma 2) to just 1024 tokens. Their ablation studies show minimal impact on output quality but massive KV cache memory savings, especially at long context lengths.
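A sliding-window causal mask is a one-line change to the standard mask. The sketch below uses a toy window of 3 over 8 tokens rather than Gemma 3's 1024, with the same 5:1 local-to-global layer layout:

```python
import numpy as np

def causal_mask(seq_len, window=None):
    """Boolean mask: True where query position i may attend to key position j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = j <= i                      # causal: never attend to the future
    if window is not None:
        mask &= j > i - window         # local: only the last `window` keys
    return mask

# Gemma-3-style layout: 5 local layers per global layer (layers 5, 11 global).
layer_masks = [causal_mask(8, window=None if l % 6 == 5 else 3)
               for l in range(12)]

full = int(causal_mask(8).sum())             # score entries in a global layer
local = int(causal_mask(8, window=3).sum())  # entries in a window-3 local layer
print(full, local)   # 36 21
```

The saving grows with context: at long sequence lengths the local layers' KV cache stays fixed at the window size while only the rare global layers pay full price.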
Gemma 3 applies RMSNorm both before and after each attention and feed-forward sub-layer. This "best of both worlds" approach costs very little compute (RMSNorm is cheap) but may improve training stability by combining the benefits of Pre-Norm gradient behavior with the output normalization of Post-Norm.
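In code, the sandwich is just one extra normalization on the sublayer output (parameter-free RMSNorm here for brevity):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sandwich_block(x, sublayer):
    # Pre-Norm on the sublayer input AND Post-Norm on its output,
    # both inside the residual connection, as in Gemma 3.
    return x + rms_norm(sublayer(rms_norm(x)))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
y = sandwich_block(x, sublayer=lambda h: 100.0 * h)  # deliberately explosive sublayer

# The output-side norm keeps each residual update O(1) even when the
# sublayer produces huge activations.
print(float(np.abs(y - x).max()) <= 4.0)   # True (bounded by sqrt(dim) = 4)
```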
Meta's MoE debut with a classic expert design philosophy
Llama 4 Maverick closely mirrors the DeepSeek V3 architecture — MoE layers, SwiGLU FeedForward, RMSNorm — but with two notable differences:
GQA instead of MLA: Meta kept Grouped-Query Attention rather than adopting DeepSeek's more complex Multi-Head Latent Attention, likely prioritizing implementation simplicity.
Fewer, larger experts: Llama 4 uses a more classic MoE configuration — 2 active experts with hidden size 8,192 each — compared to DeepSeek's 9 active experts with hidden size 2,048. As a result, it activates fewer than half as many parameters per token (17B vs. 37B), and its roughly 400B total is about 60% of DeepSeek V3's 671B.
Llama 4 also alternates between dense and MoE blocks in every other transformer layer, rather than applying MoE to all layers. This is different from DeepSeek V3 (which only skips the first 3 blocks). The impact on performance versus efficiency is an open research question.
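Some rough per-layer arithmetic shows how the two expert philosophies compare at the FFN level. This counts only expert weights (SwiGLU's three matrices) and ignores attention, routing, and layer counts; the d_model values are assumptions for illustration, not official configs:

```python
# Per-layer active FFN parameters under the two expert philosophies.
# Counts only expert weights; d_model values are illustrative assumptions.

def swiglu_params(d_model, d_hidden):
    return 3 * d_model * d_hidden       # up + gate + down projections

deepseek_per_layer = 9 * swiglu_params(7168, 2048)   # 9 small active experts
llama4_per_layer = 2 * swiglu_params(5120, 8192)     # 2 large active experts

print(deepseek_per_layer, llama4_per_layer)
# Fine-grained routing activates more, smaller matrices; the coarse design
# activates fewer, larger ones. Model totals then scale with depth.
```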
Dense + MoE, from 0.6B to 235B — consistently at the top
The Qwen team delivers consistently strong models. Qwen3 offers six dense sizes (0.6B to 32B) and two MoE variants (30B-A3B, 235B-A22B), giving practitioners flexibility depending on their inference budget.
The 235B-A22B architecture is remarkably similar to DeepSeek V3 — but notably drops the shared expert. The Qwen3 team explained that they saw no significant gains from it and were also wary of the added inference-optimization complexity.
Compared to Llama 3 1B, the Qwen3 0.6B is a deeper but narrower architecture: more transformer blocks, smaller hidden dimensions, fewer attention heads. This trades inference speed for a smaller memory footprint — a good design for edge deployment scenarios.
1 trillion parameters — the largest open-weight model of this generation
Kimi K2 is essentially a scaled-up DeepSeek V3: same MLA + MoE architecture, but pushed to 1 trillion total parameters. The team uses more experts per MoE module and fewer MLA heads compared to DeepSeek, making different capacity/bandwidth tradeoffs.
Notably, Kimi K2 was trained using the Muon optimizer (rather than AdamW) — reportedly the first production model of this scale to do so, yielding impressively smooth loss curves.
The Kimi K2 Thinking variant (released Nov 2025) shares the same architecture but extends context from 128k to 256k tokens. According to Moonshot AI's benchmarks, it surpasses leading proprietary models on several agentic reasoning and coding tasks.
| Model | Params (Total) | Params (Active) | Attention | FFN | Key Features |
|---|---|---|---|---|---|
| DeepSeek V3 | 671B | 37B | MLA | MoE ×256 | Shared expert, RoPE |
| Llama 4 Maverick | 400B | 17B | GQA | MoE alternating | SwiGLU, 512k ctx |
| Gemma 3 27B | 27B | 27B | GQA + SWA | Dense | Pre+Post Norm, 5:1 ratio |
| Qwen3 235B-A22B | 235B | 22B | GQA | MoE ×128 | QK-Norm, no shared exp. |
| Kimi K2 | 1T | 32B | MLA | MoE (more exp.) | Muon opt., 256k ctx (Thinking) |
| OLMo 2 7B | 7B | 7B | MHA | Dense | Post-Norm, QK-Norm |
| GLM-5 | 744B | 40B | MLA + DSA | MoE ×256 | Shared expert, sparse attn |
| Mistral 3 Large | 673B | 41B | MLA | MoE (32 exp.) | Shared expert, vision enc. |
DeepSeek V3's success triggered a wave of MoE adoption: Llama 4, Qwen3, GLM-5, Mistral 3 Large, and Kimi K2 all use it. Fine-grained, many-small-expert designs are preferred over fewer large experts.
MLA (compress K/V), sliding window attention (truncate K/V), and partial RoPE (limit position scaling) all target the same bottleneck: KV cache memory at long context.
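Back-of-the-envelope numbers show why all three tricks chase the same budget. The config below is assumed for the sketch (48 layers, 8 KV heads of dim 128, fp16, 128k context, MLA latent of 512), not taken from any specific model:

```python
# Illustrative fp16 KV-cache budget for one 128k-token request.
# All config values are assumptions for the sketch.

def kv_cache_bytes(layers, kv_heads, head_dim, ctx, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per   # 2 = K and V

ctx = 128_000
full = kv_cache_bytes(layers=48, kv_heads=8, head_dim=128, ctx=ctx)

# MLA: cache one shared 512-dim latent per token per layer instead of K and V.
mla = 48 * 512 * ctx * 2

# Sliding window, 5:1 ratio with window 1024: 40 local layers cap their context.
swa = kv_cache_bytes(8, 8, 128, ctx) + kv_cache_bytes(40, 8, 128, 1024)

for name, b in [("full", full), ("MLA", mla), ("5:1 SWA", swa)]:
    print(f"{name:8s}{b / 1e9:6.2f} GB")
```

At these toy settings the full cache runs to tens of gigabytes per request, while either compression or truncation cuts it by a factor of 4 to 6.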
Post-Norm (OLMo), Pre+Post Norm (Gemma 3), depth-scaled sandwich norm (Trinity), QK-Norm (Qwen3, OLMo, MiniMax) — everyone has an opinion on norm placement now.
Qwen3-Next, Kimi Linear, and Nemotron 3 are betting on O(n) alternatives (Gated DeltaNet, Mamba-2). MiniMax M2 retreated to full attention after linear attention struggled on reasoning tasks.
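Stripped of gating and hardware tricks, the whole O(n) family reduces to carrying a running d×d state instead of an n×n score matrix. A minimal causal linear-attention sketch, with an ad-hoc ReLU feature map standing in for the learned gates of Gated DeltaNet or Mamba-2:

```python
import numpy as np

def linear_attention(q, k, v):
    """Causal linear attention: O(n) via a running (d x d) state."""
    seq, d = q.shape
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # ad-hoc positive feature map
    state = np.zeros((d, d))                    # running sum of phi(k_t) v_t^T
    z = np.zeros(d)                             # running sum of phi(k_t)
    out = np.empty_like(v)
    for t in range(seq):                        # one pass, no n x n matrix
        state += np.outer(phi(k[t]), v[t])
        z += phi(k[t])
        out[t] = (phi(q[t]) @ state) / (phi(q[t]) @ z)
    return out

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 6, 8))   # (seq=6, d=8) each
out = linear_attention(q, k, v)
print(out.shape)   # (6, 8)
```

The fixed-size state is both the appeal (constant memory, O(n) time) and the weakness: it must compress the entire history, which is one plausible reason these models have struggled on long reasoning chains.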
Kimi K2, Mistral 3 Large, and GLM-5 all adopt the DeepSeek V3 MLA + DeepSeekMoE architecture — sometimes almost identically. It's become the de facto reference design for large-scale open models.
Qwen3 goes deeper (more layers, smaller hidden dim). GPT-OSS goes wider (larger hidden dim, fewer layers). Gemma 2 ablations suggest wider is slightly better for a fixed parameter count, but the gap is small.
Architecture milestones across 15+ months of frontier model releases
DeepSeek V3 reshapes expectations for open-weight models. At 671B parameters with only 37B active per token, it matches or beats GPT-4 class models while being fully open. The architecture introduces two ideas that will dominate the next year: Multi-Head Latent Attention (MLA) for KV cache compression and a fine-grained MoE with 256 experts and a dedicated shared expert.
DeepSeek R1 adds chain-of-thought reasoning on top of V3's architecture — same structure, trained with reinforcement learning to think before answering. Meanwhile, OLMo 2 from Allen AI becomes the most transparent model released: fully open weights, training data, and code. Its architectural contribution is Post-Norm placement and QK-Norm, which improve training stability and would be widely copied.
Gemma 3 from Google DeepMind introduces an aggressive 5:1 sliding window attention ratio — for every global attention layer, five layers only attend to a local window of 1024 tokens. This slashes KV cache memory at long context. It also uses both Pre-Norm and Post-Norm around the attention block. Mistral Small 3.1 takes a different approach: standard GQA with no sliding window, focusing on raw inference speed over memory.
Llama 4 Maverick marks Meta's full adoption of MoE — a significant shift from Llama 3's dense architecture. It uses alternating dense and MoE blocks with GQA (not MLA), and a 512k context window via iRoPE. Simultaneously, Qwen3 from Alibaba releases a full dense-to-MoE family spanning 0.6B to 235B. Its key contribution: QK-Norm on both Q and K combined with a deeper (more layers, narrower) architecture versus Llama's approach.
SmolLM3 from HuggingFace demonstrates that you can skip positional embeddings entirely in some layers — NoPE (No Positional Embeddings) every 4th layer. The causal mask already encodes token order implicitly; removing RoPE from select layers improves length generalization beyond the training context. Grok 2.5 from xAI takes a different MoE bet: only 8 large experts (coarse-grained) versus DeepSeek's 256 small ones, with a dense SwiGLU as its shared expert.
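NoPE is less exotic than it sounds: it is simply the absence of RoPE on selected layers. A sketch with an every-4th-layer schedule like SmolLM3's (the RoPE implementation here is the standard pairwise rotation, written out for clarity):

```python
import numpy as np

def rope(x, base=10000.0):
    # Standard rotary embedding over (seq, head_dim) queries or keys.
    seq, d = x.shape
    ang = np.arange(seq)[:, None] * base ** (-np.arange(0, d, 2) / d)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * np.cos(ang) - x[:, 1::2] * np.sin(ang)
    out[:, 1::2] = x[:, 0::2] * np.sin(ang) + x[:, 1::2] * np.cos(ang)
    return out

n_layers = 12
# NoPE schedule: every 4th layer skips the rotation entirely and relies on
# the causal mask alone for order information.
nope_layers = [l for l in range(n_layers) if (l + 1) % 4 == 0]

q = np.ones((5, 16))
per_layer_q = [q if l in nope_layers else rope(q) for l in range(n_layers)]
print(nope_layers)   # [3, 7, 11]
```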
Kimi K2 from Moonshot AI scales DeepSeek V3's architecture to 1 trillion total parameters — the largest open model yet — keeping 32B active. It validates that MLA + fine-grained MoE scales beyond the original V3 design. GPT-OSS (OpenAI's first open-weight release since GPT-2) introduces attention sinks as learned logit biases and sliding window every other layer. GLM-4.5 from Zhipu adds attention bias units and places 3 dense transformer layers before the MoE blocks begin.
A wave of models explore moving beyond standard O(n²) attention. Qwen3-Next pairs Gated DeltaNet (linear) with full attention in a 3:1 ratio. Kimi Linear invents its own Kimi Delta Attention with channel-wise gating, combined with MLA in the full-attention layers. Nemotron 3 goes furthest: Mamba-2 state-space layers + sparse MoE + GQA in a single architecture. Meanwhile, MiniMax-M2 introduces Partial RoPE (applied to only half of head dimensions) for better length extrapolation. Mistral 3 Large adopts MLA directly, essentially cloning the DeepSeek V3 design at 673B.
GLM-5 represents the culmination of the DeepSeek V3 template, adding MLA and DeepSeek Sparse Attention on top of GLM-4.5's MoE foundation — 744B total, 256 experts, 40B active across 78 layers. Arcee Trinity Large stacks the most techniques of any model: MoE + Sliding Window 3:1 + NoPE on global layers + gated attention + depth-scaled sandwich norm. OLMo 3 extends OLMo 2 with sliding window attention and YaRN for global layers. Xiaomi MiMo-V2-Flash adds aggressive 5:1 sliding window with a tiny window size of 128 tokens.