Tamanna Hossain-Kay

Yet Another DeepSeek Overview

2025-02-08


The release of DeepSeek-R1, a Chinese open-weight language model, in January 2025 sent shockwaves through the tech industry, triggering a stunning $1 trillion single-day loss in US stock market value. For the first time in years, an open-weight model has demonstrated the capability to rival proprietary giants like OpenAI’s GPT models. Not only does it offer competitive performance, it achieves this on older GPUs than Western labs use, a consequence of US export restrictions.

R1 combines many efficiency-oriented optimizations with a distinctive training strategy for reasoning. Unlike most recent LLMs, which rely heavily on post-training with labelled data, DeepSeek-R1 has demonstrated remarkable reasoning abilities through verifiable rewards alone.

This breakthrough didn’t happen overnight. Over the past year, 15 research papers have laid the foundation for R1’s capabilities. While this blog won’t cover all of them, we’ll explore the most critical aspects that explain how R1 came to be.

Timeline

January 5, 2024 – DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

January 11, 2024 - DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

April 27, 2024 - DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

May 7, 2024 – DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

December 27, 2024 – DeepSeek-V3 Technical Report: details DeepSeek-V3, a 671B-parameter Mixture-of-Experts model with 37B parameters active per token.

January 22, 2025 – DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeekMoE

In Mixture-of-Experts (MoE) language models, the FFNs in transformer blocks are replaced with MoE layers. Each MoE layer contains multiple “expert” sub-networks and a learned gating function that selects or weights the experts for each token.
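To make the structure concrete, here is a minimal, generic sketch of an MoE layer with learned top-\(K\) gating (plain PyTorch, illustrative only; it is not DeepSeek’s implementation, and real systems add load balancing, capacity limits, and expert parallelism):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal MoE layer: a learned gate routes each token to the top-k of n_experts FFNs."""
    def __init__(self, d_model, d_ff, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                  # x: (n_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)           # gating probabilities per expert
        weights, idx = scores.topk(self.top_k, dim=-1)     # pick the top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                      # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```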

Conventional MoE

Conventional MoE architectures (e.g., GShard, Mixtral) activate the top-\(K\) of \(N\) experts and use a small number (e.g., 8) of large experts. These face the following challenges:

DeepSeekMoE

DeepSeekMoE addresses the issues in conventional MoE architectures in two main ways: fine-grained expert segmentation (many small routed experts instead of a few large ones) and shared expert isolation (a few always-active experts that capture common knowledge).

Validation

The following DeepSeekMoE architecture was trained on 100B tokens from a diverse multilingual corpus:

Dense Upper Bound

Ablations

Expert Specialization

Redundancy

To check for redundancy, different proportions of the top routed experts are disabled and the top-\(K\) experts are chosen from the remainder. Pile loss increases and stays higher than the baseline (GShard ×1.5), suggesting the routed experts are not interchangeable (Fig 4).

Shared Expert

Disabling the shared expert and activating an additional routed expert instead causes Pile loss to rise sharply from 1.808 to 2.414, indicating the shared expert captures knowledge the routed experts cannot substitute.

Accuracy of Knowledge

  • Vary the number of activated routed experts from 3 to 7, and compare to GShard with the same number of activated parameters as DeepSeekMoE with 7 active. DeepSeekMoE matches GShard’s performance with only 4 active experts (Fig 5).

  • Train from scratch and compare with GShard with the same total parameters but double the number of active parameters. DeepSeekMoE still performs better than GShard (Fig 6).

Scalability

16B

  • Total of 16B parameters, with 2.8B active parameters
  • All FFNs except the first layer’s are MoE layers (the first is kept dense because of convergence issues)
  • Each MoE layer:
    • 2 shared experts and 64 routed experts
    • each expert has 0.25× the parameters of a standard FFN
    • 6 active routed experts
  • 28 Transformer layers with hidden dimension 2048
  • 16 attention heads, each with dimension 128

Internal Baseline

  • Compared to DeepSeek-7B-dense
  • Both trained from scratch on 2T tokens

  • DeepSeekMoE-16B uses only about 40% of the computation of DeepSeek-7B-dense but achieves comparable performance
  • DeepSeekMoE-16B is strong in language modeling and knowledge-intensive tasks such as Pile, HellaSwag, TriviaQA, and NaturalQuestions. This is in line with the hypothesis that FFNs in transformers memorize knowledge: MoE greatly expands FFN parameter count relative to attention parameters, so tasks that depend on memorized knowledge benefit most
  • DeepSeekMoE-16B is more limited on multiple-choice tasks, possibly because of its smaller number of attention parameters

LLaMA2-7B

DeepSeekMoE-16B achieves performance comparable to LLaMA2 7B, which has ~2.5 times as many active parameters.

DeepSeekMath

  • Focus on reinforcement learning: introduces a new technique called Group Relative Policy Optimization (GRPO), which is similar to PPO but removes the need for a critic model.

  • Shows that their new method is part of a general RL framework containing other methods like PPO, DPO, and RFT.

Group Relative Policy Optimization (GRPO)

Group Relative Policy Optimization (GRPO) modifies PPO by removing the separate value-function (critic) network and instead using a group-based baseline for policy updates. In practice, multiple responses are sampled for each prompt, and the average reward of these responses serves as the baseline for computing advantages. Where PPO’s value function estimates the expected reward of a generation under the current policy, GRPO replaces that estimate with the average actual reward over multiple generations, i.e., a Monte Carlo estimate of the same expectation.

Removing the value function helps by eliminating a second network of comparable size to the policy (saving memory and compute) and by avoiding the difficulty of training an accurate per-token value estimator.

GRPO optimizes the following policy objective:
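Reproduced here from the DeepSeekMath paper’s formulation (notation lightly condensed, so treat it as a sketch rather than the exact published equation): for each question \(q\), a group of \(G\) outputs is sampled from the old policy, and a PPO-style clipped surrogate is maximized with group-normalized advantages \(\hat{A}_{i,t}\) and a KL penalty toward a reference model:

\[
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min\!\left( \rho_{i,t} \hat{A}_{i,t},\ \operatorname{clip}\!\left(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_{i,t} \right) - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right] \right) \right], \qquad \rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}
\]

Unlike PPO, the KL term is added directly to the loss rather than folded into the reward.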

Advantage Computation

The advantage is computed in one of two ways; a code sketch of the outcome-supervision case follows the two formulations below.

  • Outcome Supervision: For each question \(q\), a group of outputs \(\{ o_1, o_2, \dots, o_G \}\) is sampled from the old policy \(\pi_{\theta_{\text{old}}}\). A reward model assigns scores \(r = \{ r_1, r_2, \dots, r_G \}\) to these outputs. The rewards are then normalized by subtracting the group mean and dividing by the standard deviation to create the advantage:

\[ \hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \text{mean}(r)}{\text{std}(r)} \]

  • Process Supervision: A single reward at the end of each output may be insufficient for complex reasoning tasks. Process supervision instead provides a reward at the end of each reasoning step, giving denser supervision during training. Given a question \(q\) and \(G\) sampled outputs \(\{ o_1, o_2, \dots, o_G \}\), a process reward model scores each step, producing stepwise rewards:

\[ R = \left\{ \{ r_{\text{index}(1)}^1, \dots, r_{\text{index}(K_1)}^1 \}, \dots, \{ r_{\text{index}(1)}^G, \dots, r_{\text{index}(K_G)}^G \} \right\} \]

where \(\text{index}(j)\) is the end token index of step \(j\), and \(K_i\) is the total number of steps in output \(i\). These rewards are then normalized:

\[ \tilde{r}_{\text{index}(j)}^i = \frac{r_{\text{index}(j)}^i - \text{mean}(R)}{\text{std}(R)} \]

The advantage for each token is computed as the sum of normalized rewards from that token’s step onward:

\[ \hat{A}_{i,t} = \sum_{\text{index}(j) \geq t} \tilde{r}_{\text{index}(j)}^i \]
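To make the outcome-supervision case concrete, here is a minimal NumPy sketch (illustrative only; the function and argument names are made up, and a real implementation would operate on batched tensors inside the RL loop):

```python
import numpy as np

def grpo_outcome_advantages(rewards, output_lengths, eps=1e-8):
    """Group-relative advantages with outcome supervision.

    rewards: one scalar reward per sampled output o_1..o_G for the same prompt.
    output_lengths: number of tokens in each output.
    Returns per-token advantage arrays: the normalized reward of an output is
    repeated for every token of that output.
    """
    r = np.asarray(rewards, dtype=float)
    r_norm = (r - r.mean()) / (r.std() + eps)   # subtract group mean, divide by group std
    return [np.full(length, a) for a, length in zip(r_norm, output_lengths)]

# Example: 4 sampled answers to one math question, rewards from a verifier.
adv = grpo_outcome_advantages(rewards=[1.0, 0.0, 0.0, 1.0],
                              output_lengths=[120, 95, 130, 110])
```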

Iterative RL with GRPO

As reinforcement learning progresses, the old reward model may become insufficient for supervising the evolving policy. To address this, iterative GRPO is employed:

  • New training data for the reward model is generated from the latest policy samples.
  • The reward model is continually updated using a replay mechanism that retains 10% of historical data.
  • The reference model is reset to the current policy, and the policy is further trained with the updated reward model.

This iterative process ensures the reward model remains aligned with the evolving policy.

Unified RL Framework

Many RL methods can be seen as special cases of the following unified framework:
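Paraphrasing the paper’s unified view (a sketch; symbols roughly follow the paper): every method is a policy-gradient update determined by three choices: the data source \(\mathcal{D}\), the reward function \(\pi_{rf}\), and a gradient coefficient \(GC_{\mathcal{A}}\):

\[
\nabla_\theta \mathcal{J}_{\mathcal{A}}(\theta) = \mathbb{E}_{(q, o) \sim \mathcal{D}}\!\left[ \frac{1}{|o|} \sum_{t=1}^{|o|} GC_{\mathcal{A}}(q, o, t, \pi_{rf})\, \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}) \right]
\]

Roughly: SFT uses \(GC \equiv 1\) on human-labeled data; RFT uses a 0/1 correctness coefficient on outputs sampled from the SFT model; Online RFT does the same but samples from the current policy; PPO and GRPO use advantage-based coefficients derived from a reward model.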

PPO

The advantage function in PPO is computed using Generalized Advantage Estimation (GAE), which relies on a reward model and a learned value function.

RFT & Online RFT

Rejection-sampling Fine-Tuning (RFT) samples outputs from the SFT model and fine-tunes only on answers verified to be correct; Online RFT does the same but samples from the current policy instead.

Results

RL Ablation: SFT vs. RL

RL improves Maj@K but not Pass@K (p. 21). In other words, RL pushes the correct answer to the top and makes the output distribution more robust, but it does not make the model fundamentally better: the model is not solving problems it could not already solve given enough samples, which is what an improvement in Pass@K would indicate.

RL Ablation: RFT vs. RL

  • Offline RFT performs similarly to online methods early on, when the SFT reference model is still close to the updated policy. However, as training progresses, offline methods lag in performance because offline samples no longer represent the current policy well.

  • GRPO performs better than Online RFT, even though both sample from the current policy. This is because GRPO uses a reward model and so can score individual instances at granular magnitudes, whereas Online RFT can only award 1 for correct and 0 for incorrect.

  • Process supervision performs better than outcome supervision for GRPO, demonstrating that more fine-grained, reasoning step-aware gradient coefficients are helpful.

Iterative RL

Comparison with Other LLMs

Two DeepSeekMath Models are used,

  • DeepSeekMath-Instruct: SFT with mathematical instruction-tuning data

  • DeepSeekMath-RL: GRPO on top of DeepSeekMath-Instruct, presumably with process supervision and iterative RL since those performed best in the ablations (question: is this confirmed in the paper?)

DeepSeekMath-RL performs better than all baselines across math reasoning benchmarks.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

  • Introduces Multi-head Latent Attention (MLA) for compressing the KV cache

  • Uses the MoE architecture introduced in DeepSeekMoE

  • Applies YaRN to extend the context length to 128K

Multi-head Latent Attention (MLA)

In standard MHA, the input embedding \(h\) to the attention mechanism is projected into keys, values, and queries, which are then sliced into per-head sub-keys, sub-values, and sub-queries. However, attention is both memory- and compute-intensive, and the KV cache can become a major bottleneck as sequence length and batch size grow.

In MLA, the keys, values, and queries are compressed, starting by projecting \(h\) a bit differently: \(h\) is first down-projected into low-rank latent vectors; the KV latent is what gets cached, and the full keys and values are reconstructed from it by up-projection when needed.

RoPE

Rotary Positional Embedding (RoPE) integrates positional information by applying a position-dependent rotation to query and key vectors in the attention mechanism. Concretely, for a token at position \(n\), each 2D-subspace \(\left(x_{2 j}, x_{2 j+1}\right)\) of a query or key vector is rotated by an angle \(\theta_n\), yielding: \[ \binom{x_{2 j}^{\prime}}{x_{2 j+1}^{\prime}}=\left(\begin{array}{cc} \cos \theta_n & -\sin \theta_n \\ \sin \theta_n & \cos \theta_n \end{array}\right)\binom{x_{2 j}}{x_{2 j+1}} . \]

Here, \(\theta_n\) grows with the token’s position \(n\), ensuring the model learns both absolute and relative positional relationships in a flexible, extendable way.
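A minimal NumPy sketch of applying this rotation (the per-subspace frequency schedule \(\theta_{n,j} = n \cdot 10000^{-2j/d}\) is the standard RoPE choice; function and variable names are illustrative):

```python
import numpy as np

def apply_rope(x, position, base=10000.0):
    """Rotate each 2D subspace (x_{2j}, x_{2j+1}) of vector x by a position-dependent angle."""
    d = x.shape[-1]
    j = np.arange(d // 2)
    theta = position * base ** (-2.0 * j / d)   # angle grows with token position
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin  # matches the 2x2 rotation matrix above
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

q_rotated = apply_rope(np.random.randn(64), position=5)
```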

RoPE is not applied to the up-projected keys and queries. Instead, decoupled rotary-position queries \(Q^R\) and keys \(K^R\) are created and concatenated to \(Q^C\) and \(K^C\) to form the final queries and keys for the attention mechanism. The RoPE key is shared across heads.

Why decoupled? If RoPE were applied directly to \(K^C\), these keys would have to be materialized during inference. For efficiency, they are not actually materialized; instead the key up-projection can be absorbed into the query projection:

\[
\begin{aligned}
q_{\text{new}}^{\top} k_t &= \left(W_Q h_{\text{new}}\right)^{\top}\left(W_{UK}\, c_{KV_t}\right) && \text{(standard query-key dot product)} \\
&= \left(W_{UK}^{\top} W_Q h_{\text{new}}\right)^{\top} c_{KV_t} && \text{(absorb } W_{UK} \text{ into the query projection)}
\end{aligned}
\]

The decoupled RoPE keys do still need to be computed and cached during inference, but a single, relatively small RoPE key is shared across heads, so the added cost is modest.
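To see why this stays cheap, here is a back-of-the-envelope per-token cache comparison; the dimensions below are assumptions loosely inspired by the DeepSeek-V2 configuration, not values quoted from the paper:

```python
# Per-token KV-cache elements per layer, standard MHA vs. MLA (illustrative numbers).
n_heads, d_head = 128, 128      # assumed attention configuration
d_latent, d_rope = 512, 64      # assumed MLA latent and decoupled-RoPE dimensions

mha_cache = 2 * n_heads * d_head     # full keys + values for every head
mla_cache = d_latent + d_rope        # one compressed KV latent + one shared RoPE key

print(mha_cache, mla_cache, mha_cache / mla_cache)   # 32768 vs. 576 -> roughly 57x smaller
```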

Results

MLA performing even better than MHA on most of these benchmarks seems surprising.

DeepSeek-V3 Technical Report

  • Introduces multi-token prediction (MTP)
  • Uses the auxiliary-loss-free load balancing for MoE introduced by DeepSeek in a recent paper, along with a sequence-wise auxiliary loss
  • Infrastructure improvements for efficiently using older GPUs
  • Otherwise mostly uses an architecture similar to DeepSeek-V2 (MoE with MLA, and GRPO for RL)

Multi-Token prediction (MTP)

Multi-token prediction was introduced in Meta’s Better & Faster Large Language Models via Multi-token Prediction, where the output layer of a language model is replaced with n output heads for predicting n future tokens. In contrast, DeepSeek-V3 uses auxiliary “MTP modules” (essentially additional layers) to predict the n future tokens.

This may improve data efficiency, but that’s not the goal here. The goal is better representations, because MTP encourages the model to pre-plan for tokens further ahead. The additional tokens are predicted sequentially, keeping the complete causal chain at each prediction depth. During inference the additional MTP modules are removed.
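A rough sketch of one MTP depth (this is not the exact DeepSeek-V3 module; the merge step, names, and the shared head/embedding wiring are assumptions for illustration): each module combines the previous depth’s hidden state with the embedding of the next known token, runs one transformer block, and predicts one token further ahead.

```python
import torch
import torch.nn as nn

class MTPModule(nn.Module):
    """One multi-token-prediction depth: refine the previous depth's representation
    with the embedding of the next ground-truth token, then predict one token further ahead."""
    def __init__(self, d_model, transformer_block, shared_head, shared_embed):
        super().__init__()
        self.norm_h = nn.RMSNorm(d_model)
        self.norm_e = nn.RMSNorm(d_model)
        self.proj = nn.Linear(2 * d_model, d_model)   # merge hidden state and token embedding
        self.block = transformer_block                # a standard (causal) transformer layer
        self.head = shared_head                       # output head shared with the main model
        self.embed = shared_embed                     # embedding table shared with the main model

    def forward(self, h_prev, next_token_ids):
        # h_prev: (batch, seq, d_model) from the previous depth; next_token_ids shifted by one step.
        merged = self.proj(torch.cat(
            [self.norm_h(h_prev), self.norm_e(self.embed(next_token_ids))], dim=-1))
        h = self.block(merged)           # keeps the complete causal chain at this depth
        return h, self.head(h)           # logits for tokens one step further ahead

# Example wiring (hypothetical dimensions):
# block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
# mtp = MTPModule(512, block, shared_head=nn.Linear(512, 32000), shared_embed=nn.Embedding(32000, 512))
```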

Auxiliary-Loss-Free Load Balancing

Bias Term

The device- and expert-level auxiliary losses introduced in DeepSeekMoE are replaced here, because adding balancing terms to the loss can hurt language-model training.

Instead, the loss-free strategy from DeepSeek’s paper Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts is used. Each expert has a bias value, added to its gate logit, that decreases when the expert is overused and increases when it is underused. The bias only adjusts top-k selection for load balancing; it is not used when computing the gating weights.
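A minimal sketch of the mechanism (the sign-based update with a small step size `gamma` follows the auxiliary-loss-free paper’s description, but treat the code as illustrative rather than the official implementation; the softmax gating here is a simplification of DeepSeek-V3’s sigmoid-based scores):

```python
import torch

def route_with_bias(gate_logits, bias, top_k):
    """Select experts with bias-adjusted logits, but weight them with the raw gate scores."""
    _, idx = (gate_logits + bias).topk(top_k, dim=-1)              # bias affects selection only
    weights = torch.softmax(gate_logits, dim=-1).gather(-1, idx)   # gating weights ignore the bias
    return idx, weights

def update_bias(bias, idx, n_experts, gamma=0.001):
    """Decrease the bias of overloaded experts, increase it for underloaded ones."""
    load = torch.bincount(idx.flatten(), minlength=n_experts).float()
    avg = load.mean()
    return bias + gamma * torch.sign(avg - load)   # overused -> negative step, underused -> positive
```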

Sequence-Wise Auxiliary Loss

To avoid pathological cases within a batch (e.g., the batch is balanced across experts overall, but all tokens in one sequence use expert X while all tokens in another use expert Y), a complementary sequence-wise auxiliary loss is used.
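Roughly, following the DeepSeek-V3 formulation (reproduced from memory, so the constants and normalization may differ slightly): for a sequence of \(T\) tokens with \(N_r\) routed experts and \(K_r\) experts activated per token,

\[
\mathcal{L}_{\mathrm{Bal}} = \alpha \sum_{i=1}^{N_r} f_i P_i, \qquad
f_i = \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathbb{1}\{\text{token } t \text{ routes to expert } i\}, \qquad
P_i = \frac{1}{T} \sum_{t=1}^{T} s_{i,t},
\]

where \(s_{i,t}\) is the normalized gating score of expert \(i\) for token \(t\) and \(\alpha\) is kept very small. Because the statistics are computed per sequence rather than per batch, a sequence that concentrates its tokens on a few experts is penalized even when the batch as a whole looks balanced.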

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

  • DeepSeek-R1-Zero: the first open-source model to show that models can learn to reason with RL alone, without SFT
  • DeepSeek-R1: a multi-stage training pipeline including some SFT improves readability and reasoning
  • Performance comparable with GPT-4o-0513 and Claude-3.5-Sonnet-1022
  • Many infrastructure optimizations

DeepSeek-R1-Zero

  • DeepSeek-V3-Base is trained using RL only, with GRPO (introduced in DeepSeekMath) as the RL algorithm

  • The reward model is rule-based, checking for accuracy and thinking format. Neural outcome or process reward models are not used because they suffer from reward hacking

  • Training template: a simple template instructs the base model to first produce a reasoning process, then present the final answer. Constraints are kept to this structural format to avoid content-specific bias; the models nonetheless have an “aha” reasoning moment on their own.
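The template looks roughly like this (paraphrased from the R1 paper, so the exact wording may differ; `{prompt}` marks where the question is inserted):

```
A conversation between User and Assistant. The user asks a question, and the
Assistant solves it. The assistant first thinks about the reasoning process in
the mind and then provides the user with the answer. The reasoning process and
answer are enclosed within <think> </think> and <answer> </answer> tags,
respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: {prompt}. Assistant:
```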

DeepSeek-R1-Zero shows consistent improvement during RL training, with its AIME 2024 pass@1 close to OpenAI-o1-0912, and surpassing OpenAI-o1-0912 when majority voting is used.

DeepSeek-R1-Zero’s improvement is intrinsic, driven by extended test-time computation. By generating more reasoning tokens as training progresses, it naturally refines its problem-solving ability for complex tasks.

See an example below of the model learning to revisit and reevaluate its previous steps within a long thinking chain. The exploration of alternative problem-solving approaches arises spontaneously through the RL process.

However, DeepSeek-R1-Zero suffers from readability limitations.

DeepSeek-R1

Training Pipeline

A multi-stage training pipeline is used for improved readability and reasoning patterns:

  • Step 1: Some “cold-start” reasoning-oriented SFT data is collected from DeepSeek-R1-Zero with some human post-processing. DeepSeek-V3-Base is supervised fine-tuned on this data to create data_ckpt_1
  • Step 2: Reasoning-oriented RL (GRPO) on data_ckpt_1 creates data_ckpt_2
  • Step 3: When RL converges, rejection sampling from data_ckpt_2 is used to create more SFT data. Unlike Step 1, this data covers not just reasoning but other general-purpose tasks as well. DeepSeek-V3-Base is supervised fine-tuned on this data to create r1_ckpt_1
    • Reasoning data (600K): includes reasoning tasks that can be evaluated with a rule-based reward as well as ones that need a reward model (DeepSeek-V3 is used to determine the reward). Chains of thought (CoT) are filtered: mixed-language, code-heavy, long-paragraph, and incorrect chains are removed.
    • Non-reasoning data (200K): e.g., writing, factual QA, self-cognition, and translation, with CoT where appropriate.
  • Step 4: RL geared toward improving reasoning and alignment (helpfulness/harmlessness). Reasoning tasks use GRPO with rule-based rewards; non-reasoning tasks presumably use GRPO with a reward model.

Results

  • Education Benchmarks: DeepSeek-R1 outperforms DeepSeek-V3 on MMLU, MMLU-Pro, and GPQA Diamond, especially in STEM subjects.

  • QA & Document Analysis: DeepSeek-R1 excels in long-context QA (FRAMES) and fact-based queries (SimpleQA), surpassing DeepSeek-V3.

  • Format Adherence: Strong performance on IF-Eval reflects improved instruction following.

  • Writing & Open-Domain QA: DeepSeek-R1 significantly outperforms DeepSeek-V3 on AlpacaEval2.0 and ArenaHard, demonstrating better generalization across reasoning and diverse tasks.

  • Concise Summaries: DeepSeek-R1’s outputs are concise, avoiding length bias in GPT-based evaluations (averaging 689 tokens on ArenaHard and 2,218 characters on AlpacaEval 2.0), which helps ensure robustness across tasks.

Limitations