Official implementation of ThinkTwice, a two-phase extension of Group Relative Policy Optimization (GRPO) that jointly optimizes LLMs to solve reasoning problems and to refine their answers. In each training cycle, ThinkTwice first trains the model on the reasoning task and then on revising its own responses, using the same correctness reward throughout, without external guidance.
- Hardware: 2+ NVIDIA GPUs (tested on A100/H100)
- Software: Linux, CUDA 12.x, Conda
```bash
conda create -n verl python=3.11 -y
conda activate verl
pip install -e verl/
pip install flash-attn --no-build-isolation
```

The evaluation benchmarks are built from HuggingFace datasets. Run the preparation scripts to generate parquet files under `scratch/`:
```bash
conda activate verl
python math_eval/ppc/math500.py
python math_eval/ppc/aime2024.py
python math_eval/ppc/amc.py
python math_eval/ppc/minerva_math.py
python math_eval/ppc/olympiadbench.py
```

The training data (`scratch/hendrycks_math/train.parquet`) and the combined validation set (`scratch/math_combined/test.parquet`) should also be prepared before training.
Download the base model weights to a local directory:
| Model | HuggingFace ID |
|---|---|
| Qwen3-4B-Instruct-2507 | Qwen/Qwen3-4B-Instruct-2507 |
| OLMo-3-7B-Instruct | allenai/OLMo-3-7B-Instruct |
The model paths are configured at the top of each training script via the `actor_rollout_ref.model.path` override. Update them to point to your local copies.
All training scripts are self-contained and runnable with a single command. Each script activates the conda environment, configures Ray, and launches the trainer with the appropriate Hydra overrides.
Trains Qwen3-4B-Instruct-2507 with ThinkTwice:

```bash
bash verl/run_thinktwice_qwen3.sh
```

Trains OLMo-3-7B-Instruct with ThinkTwice:

```bash
bash verl/run_thinktwice_olmo3.sh
```

Generates multiple samples per problem and estimates pass@k (k = 1, 2, 4, 8, 16, 32, and more) for both base responses and self-refinement responses:
```bash
python math_eval/reward/evaluate_passatk.py
```

Evaluates each model as a refinement model applied to base solutions generated by every other model:
```bash
python math_eval/reward/evaluate_cross_refinement.py
```
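For reference, pass@k over n samples with c correct is usually computed with the standard unbiased estimator, 1 - C(n-c, k)/C(n, k). A self-contained sketch (not necessarily how `evaluate_passatk.py` implements it):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n total (c of them correct) solves the problem."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```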