Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models (LLMs). However, prevailing RLVR methods exhibit a systematic bias toward exploitation over exploration, as evidenced by improved pass@1 but reduced pass@K (K>1) performance. To understand this issue, we analyze training dynamics of RLVR methods by tracking the token-level probability distributions over vocabulary candidates. Our analysis reveals a consistent probability concentration effect where the top-1 candidate increasingly accumulates probability mass and suppresses that of other candidates. More importantly, stronger over-concentration correlates with worse pass@K performance. Inspired by this finding, we propose Simple Pass@K Optimization (SimKO), a method designed to mitigate the over-concentration issue, thereby encouraging exploration. SimKO operates in an asymmetrical manner. For verified-correct responses, it boosts the probabilities of the top-K candidates. For verified-incorrect responses, it applies stronger penalties to the top-1 candidate. We observe that this asymmetric design is particularly effective at mitigating over-concentration when applied at tokens with high entropy. Across various math and logical-reasoning benchmarks, SimKO consistently yields higher pass@K for a wide range of K, providing a simple way to improve RLVR’s exploration.
Here is an illustrative example where the model generates the \( l \)-th token given the context \( s_{i,l} = (x_i, y_{<l}) \). In Figure 2(a), two valid reasoning paths begin with \( y_{l}^{(1)} \) and \( y_{l}^{(2)} \), where \( y_{l}^{(k)} \) denotes the rank-\( k \) candidate under the probability distribution \( \pi_\theta(\cdot \mid s_{i,l}) \). A model with strong exploration capability distributes its probability mass more evenly between \( \pi_\theta(y_{l}^{(1)} \mid s_{i,l}) \) and \( \pi_\theta(y_{l}^{(2)} \mid s_{i,l}) \), as shown in the upper panel of Figure 2(b). In contrast, when exploration is poor, the model over-concentrates its probability mass on a single path. Figure 2(c) further shows that entropy alone cannot capture these exploration dynamics: two distributions can have the same entropy, yet the more uniform distribution in the "Good exploration" scenario reflects a healthier exploration state than the "Bad exploration" scenario. This suggests that a more fine-grained metric than entropy is needed to evaluate exploration quality.
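For concreteness, the short sketch below contrasts two illustrative next-token distributions with nearly identical entropy but very different top-1/top-2 balance. The numbers are made up for illustration and are not taken from Figure 2.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

# Illustrative distributions over four vocabulary candidates (not from the paper).
good = [0.45, 0.45, 0.05, 0.05]   # mass split across two valid reasoning paths
bad  = [0.65, 0.15, 0.10, 0.10]   # mass concentrated on a single path

print(f"good: H = {entropy(good):.2f} nats, top-1/top-2 ratio = {good[0]/good[1]:.1f}")
print(f"bad:  H = {entropy(bad):.2f} nats, top-1/top-2 ratio = {bad[0]/bad[1]:.1f}")
# Both entropies are roughly 1.0 nats, yet the second distribution is far more
# concentrated on its top-1 candidate -- a difference entropy alone misses.
```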
We introduce a new metric, \( \Lambda^{(k)} \), to monitor the token probability distribution within the top-K candidates. We also conduct ablation studies with Positive Sample Reinforce (PSR) and Negative Sample Reinforce (NSR) strategies to understand the impact of gradient manipulations. As shown in Figure 3(a), GRPO training results in a heavy concentration of probability mass on a single candidate. This effect is further exacerbated by PSR, while NSR mitigates it to some extent. Crucially, the data (Figure 3(a)-(d)) reveal an inverse relationship between the concentration of probability mass and pass@K performance: as concentration increases, pass@K accuracy decreases. This observation underscores that excessive exploitation limits the exploration of alternative reasoning paths. Based on these insights, we design an algorithm that encourages a more balanced exploration of reasoning paths, improving performance.
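As a rough sketch of how such a metric can be tracked during training, the snippet below averages, over generated tokens, the probability mass assigned to the rank-\( k \) vocabulary candidate. This is one illustrative reading of \( \Lambda^{(k)} \); the exact definition used in the paper may differ.

```python
import torch

def lambda_k(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Average probability of the rank-k candidate over generated tokens.

    logits: (num_tokens, vocab_size) pre-softmax scores at each token position.
    Note: this is an illustrative proxy for Lambda^(k), not its exact definition.
    """
    probs = torch.softmax(logits, dim=-1)
    topk = probs.topk(k, dim=-1).values   # (num_tokens, k), sorted descending
    return topk[:, k - 1].mean()          # probability mass on the rank-k candidate
```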
We present the three key components of our SimKO method: (a) We begin by identifying "forking" tokens, i.e., high-entropy tokens at which the generation can diverge into multiple reasoning paths. (b) For positive samples, we redistribute probability mass from the top-1 candidate to the top-K candidates, mitigating over-concentration. (c) For negative samples, we apply a strong penalty to the top-1 candidate and a weaker penalty to non-top-1 candidates, preventing the squeezing effect and thereby avoiding overly sharp distributions that hinder exploration.
To improve logical inference in RLVR, we identify the subset of tokens that contributes most to reasoning, which largely consists of high-entropy "forking" tokens. These tokens drive the reasoning process, while the remaining tokens mainly maintain grammatical fluency. We therefore focus on tokens whose entropy exceeds a threshold \( \tau \) and replace the standard gradient term \( \gamma_{i,l} \) with a modified term \( \tilde{\gamma}_{i,l} \):
\[ \tilde{\gamma}_{i,l} = \begin{cases} \gamma^{\text{pos}}_{i,l}, & \text{if } \mathcal{H}(\pi_{\theta}(\cdot \mid s_{i,l})) > \tau,~ A_{i,l} > 0, \\ \gamma^{\text{neg}}_{i,l}, & \text{if } \mathcal{H}(\pi_{\theta}(\cdot \mid s_{i,l})) > \tau,~ A_{i,l} < 0, \\ \gamma_{i,l}, & \text{if } \mathcal{H}(\pi_{\theta}(\cdot \mid s_{i,l})) \leq \tau,~ \forall A_{i,l}, \end{cases} \]where \( \gamma_{i,l} = {\pi_\theta\!\big(y_{i,l}\mid s_{i,l}\big)}/{\pi_{\theta_{\text{ref}}}\!\big(y_{i,l}\mid s_{i,l}\big)} \).
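A minimal sketch of this entropy gate is shown below, assuming the standard and modified token-level weights have already been computed. The tensor names and the default threshold are illustrative, not the released implementation.

```python
import torch

def gate_gamma(logits, gamma, gamma_pos, gamma_neg, advantages, tau=1.0):
    """Entropy-gated choice among the standard and SimKO-modified weights.

    logits:     (B, T, V) policy logits at each generated token
    gamma:      (B, T) standard importance ratios gamma_{i,l}
    gamma_pos:  (B, T) positive-token weights gamma^pos_{i,l}
    gamma_neg:  (B, T) negative-token weights gamma^neg_{i,l}
    advantages: (B, T) token-level advantages A_{i,l}
    tau:        entropy threshold (illustrative default, not from the paper)
    """
    log_p = torch.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1)          # (B, T), in nats
    high = entropy > tau
    out = torch.where(high & (advantages > 0), gamma_pos, gamma)
    out = torch.where(high & (advantages < 0), gamma_neg, out)
    return out
```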
For tokens that contribute positively (correct responses), we aim to redistribute the gradient among the top-K candidates to prevent over-concentration of probability on the rank-1 candidate. The gradient for positive tokens is adjusted as follows:
\[ \gamma^\text{pos}_{i,l} = (1 - \alpha)\,\gamma_{i,l} \;+\; \frac{\alpha}{|\mathcal{I}_\text{topk}|} \sum_{k\in\mathcal{I}_\text{topk}} \mathrm{sg}\!\left(\frac{\gamma_{i,l}}{\gamma^{(k)}_{i,l}}\right)\, \gamma^{(k)}_{i,l}, \quad \alpha \in [0,1], \]where \( \gamma^{(k)}_{i,l} \) is the importance ratio of the rank-\( k \) candidate \( y_{l}^{(k)} \) and \( \mathrm{sg}(\cdot) \) is the stop-gradient operator. Since \( \mathrm{sg}\!\big(\gamma_{i,l}/\gamma^{(k)}_{i,l}\big)\,\gamma^{(k)}_{i,l} \) equals \( \gamma_{i,l} \) in value, the weighting is unchanged in the forward pass while the gradient is routed through the top-K candidates. This leads to a smoother probability distribution, encouraging broader exploration and increasing diversity in responses.
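A sketch of this redistribution in PyTorch is given below. It assumes the importance ratios of the sampled token and of the top-K candidates are already available; the value of \( \alpha \) is illustrative.

```python
import torch

def gamma_pos(ratio_token, ratio_topk, alpha=0.1):
    """Redistribute the positive gradient over the top-K candidates.

    ratio_token: (B, T)    importance ratio gamma_{i,l} of the sampled token
    ratio_topk:  (B, T, K) importance ratios gamma^(k)_{i,l} of the top-K candidates
    alpha:       mixing coefficient in [0, 1] (illustrative default)
    """
    # sg(gamma / gamma^(k)) * gamma^(k): the value equals gamma_{i,l}, but the
    # gradient flows through the rank-k candidate's ratio instead of the token's.
    scale = (ratio_token.unsqueeze(-1) / ratio_topk).detach()
    redistributed = (scale * ratio_topk).mean(dim=-1)     # average over the K candidates
    return (1.0 - alpha) * ratio_token + alpha * redistributed
```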
Negative gradients for incorrect responses tend to sharpen the distribution, exacerbating the "squeezing effect" where the rank-1 candidate absorbs most of the probability mass. To counter this, we apply a stronger penalty to the rank-1 candidate, controlling the relative strength of negative gradients:
\[ \gamma^{\text{neg}}_{i,l} = \begin{cases} \lambda \cdot \gamma_{i,l}, & \text{if } y_{i,l} \text{ is the rank-1 candidate}, \\ \gamma_{i,l}, & \text{otherwise}, \end{cases} \quad \lambda > 1. \]In our experiments, we demonstrate SimKO’s effectiveness on multiple challenging math and logic benchmarks, showing consistent improvements in pass@K without sacrificing pass@1.
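For completeness, here is a minimal sketch of the asymmetric penalty above, assuming a precomputed mask marking where the sampled token is the rank-1 candidate; the value of \( \lambda \) is illustrative.

```python
import torch

def gamma_neg(ratio_token, is_rank1, lam=1.5):
    """Apply a stronger penalty when the sampled token is the rank-1 candidate.

    ratio_token: (B, T) importance ratio gamma_{i,l} of the sampled token
    is_rank1:    (B, T) bool mask, True where the sampled token is the top-1 candidate
    lam:         penalty multiplier lambda > 1 (illustrative default)
    """
    # Non-top-1 tokens keep the standard weight, i.e. a weaker relative penalty.
    return torch.where(is_rank1, lam * ratio_token, ratio_token)
```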
@article{peng2025simko,
title={SimKO: Simple Pass@K Policy Optimization},
author={Peng, Ruotian and Ren, Yi and Yu, Zhouliang and Liu, Weiyang and Wen, Yandong},
journal={arXiv preprint arXiv:2510.14807},
year={2025}
}