An autonomous code debugging agent that combines Monte Carlo Tree Search (MCTS) with Direct Preference Optimization (DPO) to iteratively improve bug-fixing capabilities. Built on Qwen2.5-Coder-7B-Instruct and evaluated on DebugBench.
| Mode | Base Model | + DPO Round 2 |
|---|---|---|
| Full Rewrite (single pass) | 43.9% | 43.9% |
| MCTS Rewrite | 81.3% (n=123) | 84.0% (n=50) |
DPO gains transfer exclusively to MCTS mode — the policy learns search-compatible improvements rather than memorizing rewrites.
CodeQ distributes inference and training across two H100 nodes with a shared filesystem:
```mermaid
flowchart TB
    subgraph MachineA["Machine A — MCTS Inference (4-bit)"]
        MCTS["MCTS Search Engine"]
        Qwen4["Qwen2.5-Coder-7B\n(4-bit quantized)"]
        Gen["Candidate Generation\n(full-rewrite actions)"]
        Eval["Execution-Based\nVerification"]
        MCTS --> Gen --> Qwen4
        Qwen4 --> Eval
        Eval -->|reward signal| MCTS
    end
    subgraph MachineB["Machine B — DPO Training (bf16)"]
        Pairs["Preference Pair\nConstruction"]
        DPO["DPO Training\n(TRL 0.29.1, bf16)"]
        Ckpt["Checkpoint\nExport"]
        Pairs --> DPO --> Ckpt
    end
    subgraph Shared["Shared Filesystem"]
        Data["MCTS Trajectories\n& Preference Pairs"]
        Model["Model Checkpoints"]
    end
    MCTS -->|winning/losing trajectories| Data
    Data --> Pairs
    Ckpt --> Model
    Model -->|updated policy| Qwen4

    style MachineA fill:#1a1a2e,stroke:#16213e,color:#e0e0e0
    style MachineB fill:#1a1a2e,stroke:#16213e,color:#e0e0e0
    style Shared fill:#0f3460,stroke:#16213e,color:#e0e0e0
```
- MCTS Search — Machine A runs tree search over candidate rewrites, scoring each via execution against test cases
- Trajectory Collection — Winning (pass) and losing (fail) rewrites are paired per problem
- DPO Training — Machine B trains the policy on preference pairs using Direct Preference Optimization
- Policy Update — The improved checkpoint is loaded back into the MCTS inference engine
- Repeat — Each round generates harder preference signal as the policy improves
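The loop above can be sketched in a few lines of Python. This is an illustrative stand-in, not the actual CodeQ API: `run_mcts` and `build_pairs` are hypothetical names, and the real pipeline runs search, verification, and pairing in `src/mcts.py` and `src/preferences.py`.

```python
# Hypothetical sketch of one round of the CodeQ self-improvement loop.
# A "policy" here is any callable mapping a problem to candidate rewrites.

def run_mcts(policy, problems):
    """Search for rewrites; return (problem_id, rewrite, passed) trajectories."""
    trajectories = []
    for prob in problems:
        for rewrite in policy(prob):          # candidate full rewrites
            passed = prob["tests"](rewrite)   # execution-based verification
            trajectories.append((prob["id"], rewrite, passed))
    return trajectories

def build_pairs(trajectories):
    """Pair one passing (chosen) and one failing (rejected) rewrite per problem."""
    by_problem = {}
    for pid, rewrite, passed in trajectories:
        bucket = by_problem.setdefault(pid, {"pass": [], "fail": []})
        bucket["pass" if passed else "fail"].append(rewrite)
    return [
        {"chosen": d["pass"][0], "rejected": d["fail"][0]}
        for d in by_problem.values()
        if d["pass"] and d["fail"]            # need both outcomes to form a pair
    ]
```

The resulting pairs are what DPO training consumes on Machine B; problems where every candidate passes (or every candidate fails) contribute no preference signal, which is why harder problems become the main data source as the policy improves.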
- Full-rewrite action space instead of line-level edits. Line-level edits achieved only a ~10% solve rate due to compounding errors across multi-line fixes. Switching to full-rewrite generation raised the base MCTS rate to 81.3% — the single largest improvement in the project.
- 4-bit inference / bf16 training split. Quantized inference on Machine A keeps MCTS rollouts fast; full-precision training on Machine B preserves gradient quality.
- fp32 logit upcast. bf16 training produced NaN losses from logit overflow. Upcasting the DPO logit computation to fp32 resolved this completely.
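The DPO objective behind the fp32 fix is compact enough to state directly. Below is a minimal pure-Python sketch of the per-pair loss; in the real trainer (TRL) this is computed over sequence token log-probs, with the logit computation upcast to fp32 before the subtractions. Plain Python floats stand in for the fp32 path here.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair (illustrative sketch).

    In bf16, the log-prob differences feeding this computation can
    overflow, producing NaN losses; upcasting to fp32 avoids that.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen rewrite over the rejected one, relative to the reference.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == softplus(-x), computed stably for large |x|
    return math.log1p(math.exp(-x)) if x > -30 else -x
```

At zero margin the loss is log 2; as the policy's preference for the chosen rewrite grows, the loss decays toward zero, so gradient signal concentrates on pairs the policy still gets wrong.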
| Category | Rewrite (base) | MCTS (base) | MCTS (+ DPO) |
|---|---|---|---|
| Syntax | 61.9% | 95.0% | 95.0% |
| Logic | 45.8% | 90.0% | 85.0% |
| Reference | 55.9% | 80.0% | 80.0% |
| Multiple | 31.8% | 90.0% | 85.0% |
MCTS saturates on syntax errors (95%) where the search space is narrow. The largest rewrite→MCTS gains appear on multiple-fault bugs (31.8% → 90%), where iterative search over full rewrites avoids the combinatorial explosion of line-level patches.
| Difficulty | Rewrite (base→DPO) | MCTS (base→DPO) |
|---|---|---|
| Easy | 56.8% → 56.8% | 90% → 90% |
| Medium | 40.7% → 44.4% | 90% → 90% |
| Hard | 34.4% → 34.4% | 80% → 85% |
DPO improves performance where it matters most: hard problems under MCTS search (80% → 85%). Easy and medium problems are already saturated by MCTS alone.
| Rollouts | MCTS (base) | MCTS (+ DPO) |
|---|---|---|
| 1 | 80% | 78% |
| 2 | 80% | 80% |
| 5 | 80% | 82% |
| 10 | 84% | 84% |
| 20 | 84% | 86% |
Performance plateaus at ~10 rollouts. The DPO policy shows a slight advantage at higher rollout budgets (86% at 20 vs 84%), suggesting DPO produces candidates that are more distinguishable under extended search.
| Finding | Impact |
|---|---|
| Full-rewrite action space | 10% → 81.3% solve rate |
| 81% data duplication in training set | Discovered and deduplicated before Round 2 |
| fp32 logit upcast for DPO | Fixed NaN loss under bf16 training |
| TRL pinned to 0.29.1 | Later versions introduced breaking changes to DPO trainer |
| DPO does not transfer to single-pass mode | Training data is MCTS trajectories — policy learns search-compatible improvements, not standalone rewrite quality |
```
codeq/
├── src/                 # Core source modules
│   ├── mcts.py          # MCTS search engine (UCB1, expand, backprop)
│   ├── agent.py         # Full-rewrite action generation and parsing
│   ├── critic.py        # AI self-critique (temp=0.2 scoring)
│   ├── sandbox.py       # Docker sandbox (no net, 512MB, 30s timeout)
│   ├── preferences.py   # Preference pair construction from trajectories
│   ├── train_dpo.py     # DPO training (TRL 0.29.1, fp32 logit upcast)
│   ├── evaluate.py      # DebugBench evaluation harness
│   ├── merge_lora.py    # LoRA adapter merge into base model
│   └── utils.py         # Shared helpers, logging, config loading
├── configs/             # Hyperparameter configs (YAML — never hardcoded)
├── scripts/             # Shell scripts (collect, train, sync, evaluate)
├── tests/               # Unit tests (mock GPU/Docker)
├── reports/             # Ablation results and analysis
└── requirements.txt
```
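`src/mcts.py` is described as using UCB1 selection. A minimal sketch of that rule, assuming each node tracks a visit count and total reward (the field names are illustrative, not CodeQ's actual data structures):

```python
import math

def ucb1(total_reward, visits, parent_visits, c=1.414):
    """UCB1 score: exploitation (mean reward) plus an exploration bonus."""
    if visits == 0:
        return float("inf")   # unvisited children are expanded first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits):
    """Pick the child node with the highest UCB1 score."""
    return max(children, key=lambda ch: ucb1(ch["reward"], ch["visits"], parent_visits))
```

The exploration term lets an under-visited rewrite candidate with a mediocre mean reward outrank a well-explored one, which is what keeps the search from locking onto the first passing-looking branch.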
- 2× NVIDIA H100 GPUs (or equivalent) with shared filesystem
- Python 3.10+
- CUDA 12.1+
```bash
git clone https://github.com/tathadn/codeq.git
cd codeq
pip install -r requirements.txt
```

Note: TRL must be pinned to 0.29.1. Later versions introduce breaking changes to the DPO trainer.
```bash
# On Machine A (4-bit quantized inference)
python -m src.mcts \
    --model_path <checkpoint_dir> \
    --quantize 4bit \
    --rollouts 10 \
    --dataset debugbench
```

```bash
# On Machine B (bf16 full-precision training)
python -m src.train_dpo \
    --pairs_path data/trajectories/round2_pairs.jsonl \
    --base_model Qwen/Qwen2.5-Coder-7B-Instruct \
    --bf16 \
    --fp32_logit_upcast \
    --output_dir checkpoints/round2
```

Evaluated on DebugBench, a benchmark of real-world buggy code spanning syntax, logic, reference, and multi-fault errors across easy, medium, and hard difficulty levels.
CodeQ is part of a portfolio of AI/ML engineering projects:
- CodeQ — Autonomous code debugging via MCTS + DPO (this project)
- VisionTriage — Multimodal bug triage from screenshots (Qwen2.5-VL-7B + QLoRA) (in progress)
- Speculative Decoding — Inference optimization benchmark suite with adaptive draft length (planned)
Together: train models to find and fix bugs → triage bugs from visual reports → serve models efficiently.
- Technical Blog Post — Detailed writeup of the architecture, refactor story, and engineering findings
- Website — Full project portfolio
```bibtex
@misc{debnath2026codeq,
  author = {Tathagata Debnath},
  title  = {CodeQ: Self-Improving Code Debugging via Monte Carlo Tree Search and Direct Preference Optimization},
  year   = {2026},
  url    = {https://github.com/tathadn/codeq}
}
```

MIT