An autonomous code debugging agent that combines Monte Carlo Tree Search (MCTS) with Direct Preference Optimization (DPO) to iteratively improve bug-fixing capabilities. Built on Qwen2.5-Coder-7B-Instruct and evaluated on DebugBench.
| Mode | Base Model | + DPO Round 2 |
|---|---|---|
| Full Rewrite (single pass) | 43.9% | 43.9% |
| MCTS Rewrite | 81.3% (n=123) | 84.0% (n=50) |
DPO gains transfer exclusively to MCTS mode — the policy learns search-compatible improvements rather than memorizing rewrites.
CodeQ distributes inference and training across two H100 nodes with a shared filesystem:
```mermaid
flowchart TB
    subgraph MachineA["Machine A — MCTS Inference (4-bit)"]
        MCTS["MCTS Search Engine"]
        Qwen4["Qwen2.5-Coder-7B\n(4-bit quantized)"]
        Gen["Candidate Generation\n(full-rewrite actions)"]
        Eval["Execution-Based\nVerification"]
        MCTS --> Gen --> Qwen4
        Qwen4 --> Eval
        Eval -->|reward signal| MCTS
    end
    subgraph MachineB["Machine B — DPO Training (bf16)"]
        Pairs["Preference Pair\nConstruction"]
        DPO["DPO Training\n(TRL 0.29.1, bf16)"]
        Ckpt["Checkpoint\nExport"]
        Pairs --> DPO --> Ckpt
    end
    subgraph Shared["Shared Filesystem"]
        Data["MCTS Trajectories\n& Preference Pairs"]
        Model["Model Checkpoints"]
    end
    MCTS -->|winning/losing trajectories| Data
    Data --> Pairs
    Ckpt --> Model
    Model -->|updated policy| Qwen4

    style MachineA fill:#1a1a2e,stroke:#16213e,color:#e0e0e0
    style MachineB fill:#1a1a2e,stroke:#16213e,color:#e0e0e0
    style Shared fill:#0f3460,stroke:#16213e,color:#e0e0e0
```
- MCTS Search — Machine A runs tree search over candidate rewrites, scoring each via execution against test cases
- Trajectory Collection — Winning (pass) and losing (fail) rewrites are paired per problem
- DPO Training — Machine B trains the policy on preference pairs using Direct Preference Optimization
- Policy Update — The improved checkpoint is loaded back into the MCTS inference engine
- Repeat — Each round generates harder preference signal as the policy improves
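The loop above can be sketched in a few lines of Python. This is an illustrative stand-in, not the actual CodeQ API: `run_mcts` and `build_pairs` are hypothetical names, and the real pipeline runs search, verification, and pairing in `src/mcts.py` and `src/preferences.py`.

```python
# Hypothetical sketch of one round of the CodeQ self-improvement loop.
# A "policy" here is any callable mapping a problem to candidate rewrites.

def run_mcts(policy, problems):
    """Search for rewrites; return (problem_id, rewrite, passed) trajectories."""
    trajectories = []
    for prob in problems:
        for rewrite in policy(prob):          # candidate full rewrites
            passed = prob["tests"](rewrite)   # execution-based verification
            trajectories.append((prob["id"], rewrite, passed))
    return trajectories

def build_pairs(trajectories):
    """Pair one passing (chosen) and one failing (rejected) rewrite per problem."""
    by_problem = {}
    for pid, rewrite, passed in trajectories:
        bucket = by_problem.setdefault(pid, {"pass": [], "fail": []})
        bucket["pass" if passed else "fail"].append(rewrite)
    return [
        {"chosen": d["pass"][0], "rejected": d["fail"][0]}
        for d in by_problem.values()
        if d["pass"] and d["fail"]            # need both outcomes to form a pair
    ]
```

The resulting pairs are what DPO training consumes on Machine B; problems where every candidate passes (or every candidate fails) contribute no preference signal, which is why harder problems become the main data source as the policy improves.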
- Full-rewrite action space instead of line-level edits. Line-level edits achieved only a ~10% solve rate due to compounding errors across multi-line fixes. Switching to full-rewrite generation raised the base MCTS rate to 81.3% — the single largest improvement in the project.
- 4-bit inference / bf16 training split. Quantized inference on Machine A keeps MCTS rollouts fast; full-precision training on Machine B preserves gradient quality.
- fp32 logit upcast. bf16 training produced NaN losses from logit overflow. Upcasting the DPO logit computation to fp32 resolved this completely.
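The DPO objective behind the fp32 fix is compact enough to state directly. Below is a minimal pure-Python sketch of the per-pair loss; in the real trainer (TRL) this is computed over sequence token log-probs, with the logit computation upcast to fp32 before the subtractions. Plain Python floats stand in for the fp32 path here.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair (illustrative sketch).

    In bf16, the log-prob differences feeding this computation can
    overflow, producing NaN losses; upcasting to fp32 avoids that.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen rewrite over the rejected one, relative to the reference.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == softplus(-x), computed stably for large |x|
    return math.log1p(math.exp(-x)) if x > -30 else -x
```

At zero margin the loss is log 2; as the policy's preference for the chosen rewrite grows, the loss decays toward zero, so gradient signal concentrates on pairs the policy still gets wrong.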
| Category | Rewrite (base) | MCTS (base) | MCTS (+ DPO) |
|---|---|---|---|
| Syntax | 61.9% | 95.0% | 95.0% |
| Logic | 45.8% | 90.0% | 85.0% |
| Reference | 55.9% | 80.0% | 80.0% |
| Multiple | 31.8% | 90.0% | 85.0% |
MCTS saturates on syntax errors (95%) where the search space is narrow. The largest rewrite→MCTS gains appear on multiple-fault bugs (31.8% → 90%), where iterative search over full rewrites avoids the combinatorial explosion of line-level patches.
| Difficulty | Rewrite (base→DPO) | MCTS (base→DPO) |
|---|---|---|
| Easy | 56.8% → 56.8% | 90% → 90% |
| Medium | 40.7% → 44.4% | 90% → 90% |
| Hard | 34.4% → 34.4% | 80% → 85% |
DPO improves performance where it matters most: hard problems under MCTS search (80% → 85%). Easy and medium problems are already saturated by MCTS alone.
| Rollouts | MCTS (base) | MCTS (+ DPO) |
|---|---|---|
| 1 | 80% | 78% |
| 2 | 80% | 80% |
| 5 | 80% | 82% |
| 10 | 84% | 84% |
| 20 | 84% | 86% |
Performance plateaus at ~10 rollouts. The DPO policy shows a slight advantage at higher rollout budgets (86% at 20 vs 84%), suggesting DPO produces candidates that are more distinguishable under extended search.
| Finding | Impact |
|---|---|
| Full-rewrite action space | 10% → 81.3% solve rate |
| 81% data duplication in training set | Discovered and deduplicated before Round 2 |
| fp32 logit upcast for DPO | Fixed NaN loss under bf16 training |
| TRL pinned to 0.29.1 | Later versions introduced breaking changes to DPO trainer |
| DPO does not transfer to single-pass mode | Training data is MCTS trajectories — policy learns search-compatible improvements, not standalone rewrite quality |
```
codeq/
├── src/                 # Core source modules
│   ├── mcts.py          # MCTS search engine (UCB1, expand, backprop)
│   ├── agent.py         # Full-rewrite action generation and parsing
│   ├── critic.py        # AI self-critique (temp=0.2 scoring)
│   ├── sandbox.py       # Docker sandbox (no net, 512MB, 30s timeout)
│   ├── preferences.py   # Preference pair construction from trajectories
│   ├── train_dpo.py     # DPO training (TRL 0.29.1, fp32 logit upcast)
│   ├── evaluate.py      # DebugBench evaluation harness
│   ├── merge_lora.py    # LoRA adapter merge into base model
│   └── utils.py         # Shared helpers, logging, config loading
├── configs/             # Hyperparameter configs (YAML — never hardcoded)
├── scripts/             # Shell scripts (collect, train, sync, evaluate)
├── tests/               # Unit tests (mock GPU/Docker)
├── reports/             # Ablation results and analysis
└── requirements.txt
```
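`src/mcts.py` is described as using UCB1 selection. A minimal sketch of that rule, assuming each node tracks a visit count and total reward (the field names are illustrative, not CodeQ's actual data structures):

```python
import math

def ucb1(total_reward, visits, parent_visits, c=1.414):
    """UCB1 score: exploitation (mean reward) plus an exploration bonus."""
    if visits == 0:
        return float("inf")   # unvisited children are expanded first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits):
    """Pick the child node with the highest UCB1 score."""
    return max(children, key=lambda ch: ucb1(ch["reward"], ch["visits"], parent_visits))
```

The exploration term lets an under-visited rewrite candidate with a mediocre mean reward outrank a well-explored one, which is what keeps the search from locking onto the first passing-looking branch.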
- 2× NVIDIA H100 GPUs (or equivalent) with shared filesystem
- Python 3.10+
- CUDA 12.1+
```bash
git clone https://github.com/tathadn/codeq.git
cd codeq
pip install -r requirements.txt
```

Note: TRL must be pinned to 0.29.1. Later versions introduce breaking changes to the DPO trainer.
```bash
# On Machine A (4-bit quantized inference)
python -m src.mcts \
    --model_path <checkpoint_dir> \
    --quantize 4bit \
    --rollouts 10 \
    --dataset debugbench
```

```bash
# On Machine B (bf16 full-precision training)
python -m src.train_dpo \
    --pairs_path data/trajectories/round2_pairs.jsonl \
    --base_model Qwen/Qwen2.5-Coder-7B-Instruct \
    --bf16 \
    --fp32_logit_upcast \
    --output_dir checkpoints/round2
```

Evaluated on DebugBench, a benchmark of real-world buggy code spanning syntax, logic, reference, and multi-fault errors across easy, medium, and hard difficulty levels.
CodeQ is part of a portfolio of AI/ML engineering projects:
- CodeQ — Autonomous code debugging via MCTS + DPO (this project)
- VisionTriage — Multimodal bug triage from screenshots (Qwen2.5-VL-7B + QLoRA) (in progress)
- Speculative Decoding — Inference optimization benchmark suite with adaptive draft length (planned)
Together: train models to find and fix bugs → triage bugs from visual reports → serve models efficiently.
- Technical Blog Post — Detailed writeup of the architecture, refactor story, and engineering findings
- Website — Full project portfolio
```bibtex
@misc{debnath2026codeq,
  author = {Tathagata Debnath},
  title  = {CodeQ: Self-Improving Code Debugging via Monte Carlo Tree Search and Direct Preference Optimization},
  year   = {2026},
  url    = {https://github.com/tathadn/codeq}
}
```

MIT