CodeQ: Self-Improving Code Debugging via MCTS and DPO

An autonomous code debugging agent that combines Monte Carlo Tree Search (MCTS) with Direct Preference Optimization (DPO) to iteratively improve bug-fixing capabilities. Built on Qwen2.5-Coder-7B-Instruct and evaluated on DebugBench.

Key Results

Mode                         Base Model      + DPO Round 2
Full Rewrite (single pass)   43.9%           43.9%
MCTS Rewrite                 81.3% (n=123)   84.0% (n=50)

DPO gains transfer exclusively to MCTS mode — the policy learns search-compatible improvements rather than memorizing rewrites.

Architecture

CodeQ distributes inference and training across two H100 nodes with a shared filesystem:

flowchart TB
    subgraph MachineA["Machine A — MCTS Inference (4-bit)"]
        MCTS["MCTS Search Engine"]
        Qwen4["Qwen2.5-Coder-7B\n(4-bit quantized)"]
        Gen["Candidate Generation\n(full-rewrite actions)"]
        Eval["Execution-Based\nVerification"]
        MCTS --> Gen --> Qwen4
        Qwen4 --> Eval
        Eval -->|reward signal| MCTS
    end

    subgraph MachineB["Machine B — DPO Training (bf16)"]
        Pairs["Preference Pair\nConstruction"]
        DPO["DPO Training\n(TRL 0.29.1, bf16)"]
        Ckpt["Checkpoint\nExport"]
        Pairs --> DPO --> Ckpt
    end

    subgraph Shared["Shared Filesystem"]
        Data["MCTS Trajectories\n& Preference Pairs"]
        Model["Model Checkpoints"]
    end

    MCTS -->|winning/losing trajectories| Data
    Data --> Pairs
    Ckpt --> Model
    Model -->|updated policy| Qwen4

    style MachineA fill:#1a1a2e,stroke:#16213e,color:#e0e0e0
    style MachineB fill:#1a1a2e,stroke:#16213e,color:#e0e0e0
    style Shared fill:#0f3460,stroke:#16213e,color:#e0e0e0

Self-Improvement Loop

  1. MCTS Search — Machine A runs tree search over candidate rewrites, scoring each via execution against test cases
  2. Trajectory Collection — Winning (pass) and losing (fail) rewrites are paired per problem
  3. DPO Training — Machine B trains the policy on preference pairs using Direct Preference Optimization
  4. Policy Update — The improved checkpoint is loaded back into the MCTS inference engine
  5. Repeat — Each round generates harder preference signal as the policy improves
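The pairing step (step 2) can be sketched as follows. `pair_trajectories` and its field layout are illustrative stand-ins, not the project's actual API; the real logic lives in src/preferences.py. The output keys follow the common prompt/chosen/rejected convention for DPO preference data.

```python
# Illustrative sketch of step 2: pairing passing (chosen) and failing
# (rejected) full rewrites of the same problem into DPO preference pairs.
# Trajectory dicts here are hypothetical; the project's real schema may differ.

def pair_trajectories(problem_prompt, trajectories):
    """Build preference pairs from execution-verified MCTS trajectories."""
    winners = [t for t in trajectories if t["passed"]]
    losers = [t for t in trajectories if not t["passed"]]
    return [
        {"prompt": problem_prompt, "chosen": w["rewrite"], "rejected": l["rewrite"]}
        for w, l in zip(winners, losers)
    ]
```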

Design Decisions

  • Full-rewrite action space over line-level edits. Line-level edits achieved only ~10% solve rate due to compounding error across multi-line fixes. Switching to full-rewrite generation raised the base MCTS rate to 81.3% — the single largest improvement in the project.
  • 4-bit inference / bf16 training split. Quantized inference on Machine A keeps MCTS rollouts fast; full-precision training on Machine B preserves gradient quality.
  • fp32 logit upcast. bf16 training produced NaN losses from logit overflow. Upcasting the DPO logit computation to fp32 resolved this completely.
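The upcast fix in the last bullet amounts to computing the DPO log-probabilities in fp32 even when the model itself runs in bf16. A minimal sketch, assuming PyTorch; the function name is illustrative and the project's actual implementation sits inside src/train_dpo.py:

```python
import torch

def dpo_token_logps(logits_bf16, labels):
    # Upcast to fp32 before log_softmax: bf16 keeps only ~8 mantissa bits,
    # so large logits lose precision and can produce NaN losses downstream.
    logits = logits_bf16.float()
    logps = torch.log_softmax(logits, dim=-1)
    # Gather the log-probability of each label token.
    return torch.gather(logps, -1, labels.unsqueeze(-1)).squeeze(-1)
```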

Ablation Studies

By Bug Category

Category    Rewrite (base)   MCTS (base)   MCTS (+ DPO)
Syntax      61.9%            95.0%         95.0%
Logic       45.8%            90.0%         85.0%
Reference   55.9%            80.0%         80.0%
Multiple    31.8%            90.0%         85.0%

MCTS saturates on syntax errors (95%) where the search space is narrow. The largest rewrite→MCTS gains appear on multiple-fault bugs (31.8% → 90%), where iterative search over full rewrites avoids the combinatorial explosion of line-level patches.

By Difficulty

Difficulty   Rewrite (base → DPO)   MCTS (base → DPO)
Easy         56.8% → 56.8%          90% → 90%
Medium       40.7% → 44.4%          90% → 90%
Hard         34.4% → 34.4%          80% → 85%

DPO improves performance where it matters most: hard problems under MCTS search (80% → 85%). Easy and medium problems are already saturated by MCTS alone.

By Rollout Budget

Rollouts   MCTS (base)   MCTS (+ DPO)
1          80%           78%
2          80%           80%
5          80%           82%
10         84%           84%
20         84%           86%

Performance plateaus at ~10 rollouts. The DPO policy shows a slight advantage at higher rollout budgets (86% at 20 vs 84%), suggesting DPO produces candidates that are more distinguishable under extended search.

Key Engineering Findings

Finding                                     Impact
Full-rewrite action space                   10% → 81.3% solve rate
81% data duplication in training set        Discovered and deduplicated before Round 2
fp32 logit upcast for DPO                   Fixed NaN loss under bf16 training
TRL pinned to 0.29.1                        Later versions introduced breaking changes to the DPO trainer
DPO does not transfer to single-pass mode   Training data is MCTS trajectories; the policy learns search-compatible improvements, not standalone rewrite quality
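The duplication finding reduces to content-hashing each preference pair before training. A hedged sketch of the idea (`dedup_pairs` is a hypothetical helper, not the project's actual code):

```python
import hashlib
import json

def dedup_pairs(pairs):
    """Drop exact-duplicate preference pairs by hashing their content.
    Illustrative only; the project's real dedup step may key on other fields."""
    seen, unique = set(), []
    for p in pairs:
        key = hashlib.sha256(
            json.dumps([p["prompt"], p["chosen"], p["rejected"]]).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique
```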

Project Structure

codeq/
├── src/                    # Core source modules
│   ├── mcts.py             # MCTS search engine (UCB1, expand, backprop)
│   ├── agent.py            # Full-rewrite action generation and parsing
│   ├── critic.py           # AI self-critique (temp=0.2 scoring)
│   ├── sandbox.py          # Docker sandbox (no net, 512MB, 30s timeout)
│   ├── preferences.py      # Preference pair construction from trajectories
│   ├── train_dpo.py        # DPO training (TRL 0.29.1, fp32 logit upcast)
│   ├── evaluate.py         # DebugBench evaluation harness
│   ├── merge_lora.py       # LoRA adapter merge into base model
│   └── utils.py            # Shared helpers, logging, config loading
├── configs/                # Hyperparameter configs (YAML — never hardcoded)
├── scripts/                # Shell scripts (collect, train, sync, evaluate)
├── tests/                  # Unit tests (mock GPU/Docker)
├── reports/                # Ablation results and analysis
└── requirements.txt
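As a reference point for the UCB1 selection named in src/mcts.py, the standard rule can be sketched as below. The exploration constant c = √2 is the textbook default, an assumption here; the project's actual value would live in configs/.

```python
import math

def ucb1(child_value_sum, child_visits, parent_visits, c=math.sqrt(2)):
    """UCB1 child-selection score: mean reward (exploitation) plus a bonus
    that grows for rarely-visited children (exploration)."""
    if child_visits == 0:
        return float("inf")  # always try unvisited children first
    mean = child_value_sum / child_visits
    return mean + c * math.sqrt(math.log(parent_visits) / child_visits)
```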

Setup

Requirements

  • 2× NVIDIA H100 GPUs (or equivalent) with shared filesystem
  • Python 3.10+
  • CUDA 12.1+

Installation

git clone https://github.com/tathadn/codeq.git
cd codeq
pip install -r requirements.txt

Note: TRL must be pinned to 0.29.1. Later versions introduce breaking changes to the DPO trainer.
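One way to enforce the pin is an exact-version entry in requirements.txt (assuming the project uses pip-style pinning):

```
trl==0.29.1
```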

Running MCTS Inference

# On Machine A (4-bit quantized inference)
python -m src.mcts \
    --model_path <checkpoint_dir> \
    --quantize 4bit \
    --rollouts 10 \
    --dataset debugbench

Running DPO Training

# On Machine B (bf16 full-precision training)
python -m src.train_dpo \
    --pairs_path data/trajectories/round2_pairs.jsonl \
    --base_model Qwen/Qwen2.5-Coder-7B-Instruct \
    --bf16 \
    --fp32_logit_upcast \
    --output_dir checkpoints/round2

Benchmark

Evaluated on DebugBench, a benchmark of real-world buggy code spanning syntax, logic, reference, and multi-fault errors across easy, medium, and hard difficulty levels.

Portfolio Context

CodeQ is part of a portfolio of AI/ML engineering projects:

  1. CodeQ — Autonomous code debugging via MCTS + DPO (this project)
  2. VisionTriage — Multimodal bug triage from screenshots (Qwen2.5-VL-7B + QLoRA) (in progress)
  3. Speculative Decoding — Inference optimization benchmark suite with adaptive draft length (planned)

Together: train models to find and fix bugs → triage bugs from visual reports → serve models efficiently.

Links

  • Technical Blog Post — Detailed writeup of the architecture, refactor story, and engineering findings
  • Website — Full project portfolio

Citation

@misc{debnath2026codeq,
  author       = {Tathagata Debnath},
  title        = {CodeQ: Self-Improving Code Debugging via Monte Carlo Tree Search and Direct Preference Optimization},
  year         = {2026},
  url          = {https://github.com/tathadn/codeq}
}

License

MIT

About

An AI agent that teaches itself to fix bugs — MCTS explores debugging strategies, DPO trains on what works. Pipelined across two H100 nodes: one for 4-bit inference and trajectory collection, one for full bf16 LoRA fine-tuning. Built on Qwen2.5-Coder-7B, evaluated on DebugBench. Inspired by Agent Q (Putta et al., 2024).
