Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models (LLMs). However, prevailing RLVR methods exhibit a systematic bias toward exploitation over exploration, as evidenced by improved pass@1 but reduced pass@K (K>1) performance. To understand this issue, we analyze training dynamics of RLVR methods by tracking the token-level probability distributions over vocabulary candidates. Our analysis reveals a consistent probability concentration effect where the top-1 candidate increasingly accumulates probability mass and suppresses that of other candidates. More importantly, stronger over-concentration correlates with worse pass@K performance. Inspired by this finding, we propose Simple Pass@K Optimization (SimKO), a method designed to mitigate the over-concentration issue, thereby encouraging exploration. SimKO operates in an asymmetrical manner. For verified-correct responses, it boosts the probabilities of the top-K candidates. For verified-incorrect responses, it applies stronger penalties to the top-1 candidate. We observe that this asymmetric design is particularly effective at mitigating over-concentration when applied at tokens with high entropy. Across various math and logical-reasoning benchmarks, SimKO consistently yields higher pass@K for a wide range of K, providing a simple way to improve RLVR’s exploration.
Here is an illustrative example where the model generates the \( l \)-th token given the context \( s_{i,l} = (x_i, y_{<l}) \). In Figure 2(a), two valid reasoning paths begin with \( y_{l}^{(1)} \) and \( y_{l}^{(2)} \), where \( y_{l}^{(k)} \) denotes the rank-\( k \) candidate under the probability distribution \( \pi_\theta(\cdot \mid s_{i,l}) \). A model with strong exploration capability distributes its probability mass more evenly between \( \pi_\theta(y_{l}^{(1)} \mid s_{i,l}) \) and \( \pi_\theta(y_{l}^{(2)} \mid s_{i,l}) \), as shown in the upper panel of Figure 2(b). In contrast, when exploration is poor, the model over-concentrates its probability mass on a single path. Figure 2(c) further shows that entropy alone cannot capture these exploration dynamics: two distributions can have the same entropy, yet the more uniform distribution in the "Good exploration" scenario reflects a healthier exploration state than the "Bad exploration" scenario. This suggests that a more fine-grained metric than entropy is needed to evaluate exploration quality.
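For concreteness, the short sketch below contrasts two illustrative next-token distributions with nearly identical entropy but very different top-1/top-2 balance. The numbers are made up for illustration and are not taken from Figure 2.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

# Illustrative distributions over four vocabulary candidates (not from the paper).
good = [0.45, 0.45, 0.05, 0.05]   # mass split across two valid reasoning paths
bad  = [0.65, 0.15, 0.10, 0.10]   # mass concentrated on a single path

print(f"good: H = {entropy(good):.2f} nats, top-1/top-2 ratio = {good[0]/good[1]:.1f}")
print(f"bad:  H = {entropy(bad):.2f} nats, top-1/top-2 ratio = {bad[0]/bad[1]:.1f}")
# Both entropies are roughly 1.0 nats, yet the second distribution is far more
# concentrated on its top-1 candidate -- a difference entropy alone misses.
```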
We introduce a new metric, \( \Lambda^{(k)} \), to monitor the token probability distribution within the top-K candidates. We also conduct ablation studies with Positive Sample Reinforce (PSR) and Negative Sample Reinforce (NSR) strategies to understand the impact of gradient manipulations. As shown in Figure 3(a), GRPO training results in a heavy concentration of probability mass on a single candidate. This effect is further exacerbated by PSR, while NSR mitigates it to some extent. Crucially, the data (Figure 3(a)-(d)) reveal an inverse relationship between the concentration of probability mass and pass@K performance: as concentration increases, pass@K accuracy decreases. This observation underscores that excessive exploitation limits the exploration of alternative reasoning paths. Based on these insights, we design an algorithm that encourages a more balanced exploration of reasoning paths, improving performance.
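As a rough sketch of how such a metric can be tracked during training, the snippet below averages, over generated tokens, the probability mass assigned to the rank-\( k \) vocabulary candidate. This is one illustrative reading of \( \Lambda^{(k)} \); the exact definition used in the paper may differ.

```python
import torch

def lambda_k(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Average probability of the rank-k candidate over generated tokens.

    logits: (num_tokens, vocab_size) pre-softmax scores at each token position.
    Note: this is an illustrative proxy for Lambda^(k), not its exact definition.
    """
    probs = torch.softmax(logits, dim=-1)
    topk = probs.topk(k, dim=-1).values   # (num_tokens, k), sorted descending
    return topk[:, k - 1].mean()          # probability mass on the rank-k candidate
```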
We present the three key components of our SimKO method: (a) We begin by identifying "forking" tokens, i.e., high-entropy tokens at which the generation can diverge into multiple reasoning paths. (b) For positive samples, we redistribute probability mass from the top-1 candidate to the top-K candidates, mitigating over-concentration. (c) For negative samples, we apply a strong penalty to the top-1 candidate and a weaker penalty to non-top-1 candidates, preventing the squeezing effect and thereby avoiding overly sharp distributions that hinder exploration.
To improve logical inference in RLVR, we identify the subset of tokens that contributes most to reasoning, which largely consists of high-entropy "forking" tokens. These tokens drive the reasoning process, while the remaining tokens mainly maintain grammatical fluency. We therefore focus on tokens whose entropy exceeds a threshold \( \tau \) and replace the standard gradient term \( \gamma_{i,l} \) with a modified term \( \tilde{\gamma}_{i,l} \):
\[ \tilde{\gamma}_{i,l} = \begin{cases} \gamma^{\text{pos}}_{i,l}, & \text{if } \mathcal{H}(\pi_{\theta}(\cdot \mid s_{i,l})) > \tau,~ A_{i,l} > 0, \\ \gamma^{\text{neg}}_{i,l}, & \text{if } \mathcal{H}(\pi_{\theta}(\cdot \mid s_{i,l})) > \tau,~ A_{i,l} < 0, \\ \gamma_{i,l}, & \text{if } \mathcal{H}(\pi_{\theta}(\cdot \mid s_{i,l})) \leq \tau,~ \forall A_{i,l}, \end{cases} \]where \( \gamma_{i,l} = {\pi_\theta\!\big(y_{i,l}\mid s_{i,l}\big)}/{\pi_{\theta_{\text{ref}}}\!\big(y_{i,l}\mid s_{i,l}\big)} \).
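A minimal sketch of this entropy gate is shown below, assuming the standard and modified token-level weights have already been computed. The tensor names and the default threshold are illustrative, not the released implementation.

```python
import torch

def gate_gamma(logits, gamma, gamma_pos, gamma_neg, advantages, tau=1.0):
    """Entropy-gated choice among the standard and SimKO-modified weights.

    logits:     (B, T, V) policy logits at each generated token
    gamma:      (B, T) standard importance ratios gamma_{i,l}
    gamma_pos:  (B, T) positive-token weights gamma^pos_{i,l}
    gamma_neg:  (B, T) negative-token weights gamma^neg_{i,l}
    advantages: (B, T) token-level advantages A_{i,l}
    tau:        entropy threshold (illustrative default, not from the paper)
    """
    log_p = torch.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1)          # (B, T), in nats
    high = entropy > tau
    out = torch.where(high & (advantages > 0), gamma_pos, gamma)
    out = torch.where(high & (advantages < 0), gamma_neg, out)
    return out
```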
For tokens that contribute positively (correct responses), we aim to redistribute the gradient among the top-K candidates to prevent over-concentration of probability on the rank-1 candidate. The gradient for positive tokens is adjusted as follows:
\[ \gamma^\text{pos}_{i,l} = (1 - \alpha)\,\gamma_{i,l} \;+\; \frac{\alpha}{|\mathcal{I}_\text{topk}|} \sum_{k\in\mathcal{I}_\text{topk}} \mathrm{sg}\!\left(\frac{\gamma_{i,l}}{\gamma^{(k)}_{i,l}}\right)\, \gamma^{(k)}_{i,l}, \quad \alpha \in [0,1], \]where \( \gamma^{(k)}_{i,l} \) is the importance ratio of the rank-\( k \) candidate \( y_{l}^{(k)} \) and \( \mathrm{sg}(\cdot) \) is the stop-gradient operator. Since \( \mathrm{sg}\!\big(\gamma_{i,l}/\gamma^{(k)}_{i,l}\big)\,\gamma^{(k)}_{i,l} \) equals \( \gamma_{i,l} \) in value, the weighting is unchanged in the forward pass while the gradient is routed through the top-K candidates. This leads to a smoother probability distribution, encouraging broader exploration and increasing diversity in responses.
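A sketch of this redistribution in PyTorch is given below. It assumes the importance ratios of the sampled token and of the top-K candidates are already available; the value of \( \alpha \) is illustrative.

```python
import torch

def gamma_pos(ratio_token, ratio_topk, alpha=0.1):
    """Redistribute the positive gradient over the top-K candidates.

    ratio_token: (B, T)    importance ratio gamma_{i,l} of the sampled token
    ratio_topk:  (B, T, K) importance ratios gamma^(k)_{i,l} of the top-K candidates
    alpha:       mixing coefficient in [0, 1] (illustrative default)
    """
    # sg(gamma / gamma^(k)) * gamma^(k): the value equals gamma_{i,l}, but the
    # gradient flows through the rank-k candidate's ratio instead of the token's.
    scale = (ratio_token.unsqueeze(-1) / ratio_topk).detach()
    redistributed = (scale * ratio_topk).mean(dim=-1)     # average over the K candidates
    return (1.0 - alpha) * ratio_token + alpha * redistributed
```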
Negative gradients for incorrect responses tend to sharpen the distribution, exacerbating the "squeezing effect" where the rank-1 candidate absorbs most of the probability mass. To counter this, we apply a stronger penalty to the rank-1 candidate, controlling the relative strength of negative gradients:
\[ \gamma^{\text{neg}}_{i,l} = \begin{cases} \lambda \cdot \gamma_{i,l}, & \text{if } y_{i,l} \text{ is the rank-1 candidate}, \\ \gamma_{i,l}, & \text{otherwise}, \end{cases} \quad \lambda > 1. \]In our experiments, we demonstrate SimKO’s effectiveness on multiple challenging math and logic benchmarks, showing consistent improvements in pass@K without sacrificing pass@1.
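For completeness, here is a minimal sketch of the asymmetric penalty above, assuming a precomputed mask marking where the sampled token is the rank-1 candidate; the value of \( \lambda \) is illustrative.

```python
import torch

def gamma_neg(ratio_token, is_rank1, lam=1.5):
    """Apply a stronger penalty when the sampled token is the rank-1 candidate.

    ratio_token: (B, T) importance ratio gamma_{i,l} of the sampled token
    is_rank1:    (B, T) bool mask, True where the sampled token is the top-1 candidate
    lam:         penalty multiplier lambda > 1 (illustrative default)
    """
    # Non-top-1 tokens keep the standard weight, i.e. a weaker relative penalty.
    return torch.where(is_rank1, lam * ratio_token, ratio_token)
```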
@article{peng2025simko,
title={SimKO: Simple Pass@K Policy Optimization},
author={Peng, Ruotian and Ren, Yi and Yu, Zhouliang and Liu, Weiyang and Wen, Yandong},
journal={arXiv preprint arXiv:2510.14807},
year={2025}
}