Kyle Montgomery, Sijun Tan, Yuqi Chen, Siyuan Zhuang, Tianjun Zhang, Raluca Ada Popa, and Chenguang Wang
October 20, 2026 | 3 min read
<aside> 🔥
TL;DR
Test-time scaling helps models reason better by spending more compute, but how that compute is used matters. We find that lightweight discriminative verifiers, when combined with self-consistency, can match or even outperform generative verifiers under the same compute budget. This finding has implications for how we train and evaluate language agents on a budget.
📃 Paper | 🧑💻 Code | 🤗 HuggingFace
</aside>
Test-time scaling can enhance performance by allocating more compute to a single problem at inference time, for example by sampling multiple candidate solutions and selecting the best one with a verifier. Verifiers come in two forms:

- Discriminative verifiers, which score each candidate with a single forward pass and output a scalar estimate of correctness.
- Generative verifiers, which write a chain-of-thought critique of each candidate and render a verdict at the end of that generation.
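To make the distinction concrete, here is a minimal, hypothetical sketch of the two interfaces; the function names and placeholder scores are illustrative, not our implementation.

```python
# Illustrative stubs only: the model calls below are placeholders, not a real API.

def score_with_reward_model(problem: str, solution: str) -> float:
    # Placeholder for one forward pass of a small discriminative scoring model.
    return 0.7

def judge_with_cot(problem: str, solution: str) -> float:
    # Placeholder for one sampled chain-of-thought critique ending in a 0/1 verdict.
    return 1.0

def discriminative_score(problem: str, solution: str) -> float:
    """Discriminative verification: a single forward pass yields a scalar score."""
    return score_with_reward_model(problem, solution)

def generative_verdict(problem: str, solution: str, m: int = 2) -> float:
    """Generative verification: decode m long CoT critiques and average their
    verdicts, so cost grows linearly with m."""
    verdicts = [judge_with_cot(problem, solution) for _ in range(m)]
    return sum(verdicts) / m
```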
While generative verifiers are generally more expressive, they are substantially more costly, requiring several additional long CoT generations per problem and suffering from decoding bottlenecks. Naturally, this raises the question: How should one split compute between generating and verifying candidates?
In our recent work, we study how to best allocate compute during test-time scaling under various budgets. We observe that under most regimes, sampling more solutions yields greater gains than spending compute on verification. Discriminative techniques seem like a promising alternative to generative verifiers, not because they offer better verification, but because they use compute more efficiently.
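A toy budget calculation illustrates the point. The total budget below matches the equalized budget used in our comparison, but the per-call costs are made-up placeholders chosen only to show the shape of the tradeoff.

```python
# Toy budget split with made-up per-call costs (illustrative, not measured).
BUDGET_FLOPS = 2.2e16        # equalized test-time budget per problem
SOLVE_FLOPS = 1.0e15         # hypothetical cost to sample one candidate solution
GEN_VERIFY_FLOPS = 1.0e15    # hypothetical cost of one generative (CoT) verification
DISC_VERIFY_FLOPS = 1.0e13   # hypothetical cost of one discriminative (scalar) verification
M = 2                        # generative verifications per candidate, as in Figure 1

# How many candidates N fit in the budget under each verification scheme?
n_with_generative = int(BUDGET_FLOPS // (SOLVE_FLOPS + M * GEN_VERIFY_FLOPS))
n_with_discriminative = int(BUDGET_FLOPS // (SOLVE_FLOPS + DISC_VERIFY_FLOPS))

print(n_with_generative)      # 7  candidates under generative verification
print(n_with_discriminative)  # 21 candidates under discriminative verification
```

Under these placeholder costs, the cheaper verifier leaves roughly three times as many candidates to sample from, which is where self-consistency-style aggregation gets its gains.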
Figure 1: Hybrid discriminative verification techniques (e.g., weighted self-consistency (WSC) and pessimistic verification (PV)) outperform generative pessimistic verification (GPV) under equalized compute budgets of less than 22.5 minutes (shaded region). N is doubled at each point along the x-axis. For GPV, each solution is verified twice (M = 2).
To test this, we train a lightweight 1.5B-parameter discriminative verifier, initialized from DeepSeek-R1-Distill-Qwen-1.5B. While the verifier alone performs poorly (BoN@N in Figure 1), augmenting its signal with self-consistency (SC), via techniques like weighted self-consistency (WSC) or pessimistic verification (PV), recovers much of the accuracy of generative verification at significantly lower cost. In fact, under equalized budgets of up to $2.2 \times 10^{16}$ FLOPs (~22.5 minutes on an H100), these hybrid discriminative techniques outperform generative verification on AIME2025.
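The aggregation itself is simple. Below is a minimal sketch of weighted self-consistency next to plain self-consistency and best-of-N, assuming each candidate carries an extracted final answer and a scalar verifier score; the toy candidates and scores are hypothetical.

```python
from collections import defaultdict

def self_consistency(candidates):
    """Plain majority vote over final answers (SC)."""
    votes = defaultdict(int)
    for c in candidates:
        votes[c["answer"]] += 1
    return max(votes, key=votes.get)

def best_of_n(candidates):
    """Pick the single candidate the verifier scores highest (BoN)."""
    return max(candidates, key=lambda c: c["score"])["answer"]

def weighted_self_consistency(candidates):
    """Sum verifier scores per answer instead of counting votes, so both
    agreement and verifier confidence contribute (WSC)."""
    weights = defaultdict(float)
    for c in candidates:
        weights[c["answer"]] += c["score"]
    return max(weights, key=weights.get)

# Hypothetical candidates: answers extracted from sampled solutions, scores from the verifier.
candidates = [
    {"answer": "42", "score": 0.55},
    {"answer": "42", "score": 0.60},
    {"answer": "17", "score": 0.95},
]
print(self_consistency(candidates))           # "42" (majority)
print(best_of_n(candidates))                  # "17" (verifier alone)
print(weighted_self_consistency(candidates))  # "42" (0.55 + 0.60 > 0.95)
```

In the toy example, the verifier's single favorite candidate is an outlier, but agreement among the other samples outweighs it, which is exactly the failure mode that hurts BoN@N in Figure 1.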
DeepSWE showed that language agents can benefit from test-time scaling and external verification, resulting in much higher performance on SWEBench-Verified. Our latest work provides a framework for how to allocate compute most effectively in such systems.
A similar compute tradeoff arises during on-policy reinforcement learning, particularly on hard-to-verify tasks (e.g., open-ended math proofs), where verification naturally becomes a source of supervision. A generative verifier may assign a more reliable reward, but at a higher cost, limiting the number of feasible rollouts per problem. Conversely, discriminative techniques allow for broader exploration but provide noisier feedback. Balancing this tradeoff is key to scaling reinforcement learning for reasoning tasks where correctness itself must be inferred through verification.