CSSLab/ThinkTwice

ThinkTwice

The official implementation of ThinkTwice, a two-phase extension of Group Relative Policy Optimization (GRPO) that jointly optimizes LLMs to solve reasoning problems and to refine their own answers. In each training cycle, ThinkTwice first trains the model on the reasoning task and then on revising its responses, using the same correctness reward in both phases, without external guidance.
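The per-cycle structure described above can be sketched as follows. Note that `generate`, `refine`, and `policy_update` are hypothetical stand-ins for the verl/GRPO machinery, not this repo's actual API:

```python
# Illustrative sketch of one ThinkTwice training cycle. The function
# names (generate, refine, policy_update) are hypothetical placeholders,
# not the repo's actual verl-based API.

def correctness_reward(answer: str, gold: str) -> bool:
    # The same correctness reward is used in both phases,
    # with no external guidance signal.
    return answer.strip() == gold.strip()

def thinktwice_cycle(policy, problems, policy_update):
    # Phase 1: GRPO on the base reasoning task.
    drafts = {}
    for i, prob in enumerate(problems):
        drafts[i] = policy.generate(prob["question"])
        reward = correctness_reward(drafts[i], prob["gold"])
        policy_update(policy, prob["question"], drafts[i], reward)

    # Phase 2: GRPO on revising the phase-1 responses,
    # scored with the same correctness reward.
    for i, prob in enumerate(problems):
        revised = policy.refine(prob["question"], drafts[i])
        reward = correctness_reward(revised, prob["gold"])
        policy_update(policy, prob["question"], revised, reward)
```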

Requirements

  • Hardware: 2+ NVIDIA GPUs (tested on A100/H100)
  • Software: Linux, CUDA 12.x, Conda

Setup

1. Create the Training Environment

conda create -n verl python=3.11 -y
conda activate verl
pip install -e verl/
pip install flash-attn --no-build-isolation

2. Prepare Benchmark Data

The evaluation benchmarks are built from HuggingFace datasets. Run the preparation scripts to generate parquet files under scratch/:

conda activate verl
python math_eval/ppc/math500.py
python math_eval/ppc/aime2024.py
python math_eval/ppc/amc.py
python math_eval/ppc/minerva_math.py
python math_eval/ppc/olympiadbench.py

The training data (scratch/hendrycks_math/train.parquet) and combined validation set (scratch/math_combined/test.parquet) should also be prepared before training.
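As a quick sanity check before launching training, you can confirm the prepared files are in place. The path list below simply mirrors the locations mentioned above:

```python
# Sanity check: confirm the prepared parquet files exist on disk.
# Paths mirror those mentioned in this README.
from pathlib import Path

REQUIRED = [
    "scratch/hendrycks_math/train.parquet",
    "scratch/math_combined/test.parquet",
]

def find_missing(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]

if missing := find_missing(REQUIRED):
    print("Missing parquet files:", missing)
```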

3. Download Model Weights

Download the base model weights to a local directory:

Model                     HuggingFace ID
----------------------    ---------------------------
Qwen3-4B-Instruct-2507    Qwen/Qwen3-4B-Instruct-2507
OLMo-3-7B-Instruct        allenai/OLMo-3-7B-Instruct

The model paths are configured at the top of each training script (actor_rollout_ref.model.path). Update them to point to your local copies.
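One way to fetch the weights is with `huggingface_hub` (assumed installed). The local directory names below are arbitrary examples, and the import is deferred so the mapping can be inspected without the dependency:

```python
# Sketch of downloading the base models with huggingface_hub.
# The local directory names are arbitrary examples; point
# actor_rollout_ref.model.path at whichever paths you choose.

MODELS = {
    "Qwen/Qwen3-4B-Instruct-2507": "models/Qwen3-4B-Instruct-2507",
    "allenai/OLMo-3-7B-Instruct": "models/OLMo-3-7B-Instruct",
}

def download_all(models=MODELS):
    # Deferred import so the MODELS mapping is usable
    # even where huggingface_hub is not installed.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in models.items():
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
```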


Training

All training scripts are self-contained and runnable with a single command: each activates the conda environment, configures Ray, and launches the trainer with the appropriate Hydra overrides.

Qwen3-4B

Trains Qwen3-4B-Instruct-2507 with ThinkTwice:

bash verl/run_thinktwice_qwen3.sh

OLMo-3-7B

Trains OLMo-3-7B-Instruct with ThinkTwice:

bash verl/run_thinktwice_olmo3.sh

Evaluation

Pass@k Evaluation for Reasoning and Self-Refinement

Generates multiple samples per problem and estimates pass@k (k = 1, 2, 4, 8, 16, 32, and higher) for both base responses and self-refined responses:

python math_eval/reward/evaluate_passatk.py
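The script's internals are not reproduced here, but pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021): given n samples of which c are correct, pass@k = 1 − C(n−c, k)/C(n, k). A reference implementation:

```python
# Standard unbiased pass@k estimator (Chen et al., 2021), shown for
# reference; evaluate_passatk.py's own implementation may differ in detail.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total samples per problem, c = correct samples, k <= n."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(2, 1, 1))  # 0.5
```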

Cross-Refinement Evaluation

Evaluates each model as a refinement model applied to base solutions generated by every other model.

python math_eval/reward/evaluate_cross_refinement.py
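Conceptually, this yields a generator × refiner accuracy matrix. A minimal sketch of that protocol, with hypothetical function names rather than the script's actual API:

```python
# Illustrative sketch of the cross-refinement protocol: every model
# refines base solutions produced by every model (including itself).
# The names below are hypothetical, not the script's actual API.

def cross_refinement_matrix(models, problems, is_correct):
    """models: {name: (generate_fn, refine_fn)}.
    Returns {(generator, refiner): accuracy of refined answers}."""
    # Each model first produces base solutions for every problem.
    bases = {
        name: [gen(p["question"]) for p in problems]
        for name, (gen, _refine) in models.items()
    }
    matrix = {}
    for gen_name, drafts in bases.items():
        for ref_name, (_gen, refine) in models.items():
            refined = [
                refine(p["question"], d) for p, d in zip(problems, drafts)
            ]
            correct = sum(
                is_correct(a, p["gold"]) for a, p in zip(refined, problems)
            )
            matrix[(gen_name, ref_name)] = correct / len(problems)
    return matrix
```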
