We build reinforcement-learning systems for code generation. A learned policy routes between frozen LLMs — small open-source models, working together — to deliver large-model quality at a fraction of the compute.
An RL layer that sits on top of any foundation model. Independent of whose model wins.
We treat code generation as a reinforcement learning problem — with verifiable rewards from test execution — and learn a small policy that decides which frozen LLM to invoke for each query.
| | Token-level RL (fine-tune the generator) | Routing over frozen LLMs (this system) |
| --- | --- | --- |
| Action space | vocabulary (~128K tokens) | a small pool of open-source LLMs |
| Credit assignment | thousands of tokens per reward | one decision per attempt — dense |
| Training cost | requires fine-tuning the base model | no base-model fine-tuning |
| Model dependency | tied to one model generation | any new LLM plugs in as a new arm |
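Concretely, the router is a contextual bandit over a fixed pool of arms. The sketch below is a minimal illustration rather than the production policy: the arm names, the toy features, the epsilon-greedy rule, and the `generate` / `run_tests` hooks are assumptions standing in for the real components. The grounded part is the loop shape: one routing decision per attempt, one verifiable reward from test execution.

```python
import random

import numpy as np

# Hypothetical pool of frozen code models (arms); the names are placeholders.
ARMS = ["small-coder-a", "small-coder-b", "small-coder-c"]

def featurize(task: str) -> np.ndarray:
    """Toy context features; the real policy would use richer signals."""
    return np.array([len(task) / 1000.0, task.count("def"), task.count("class"), 1.0])

class EpsilonGreedyRouter:
    """Per-arm linear reward estimates with epsilon-greedy exploration."""

    def __init__(self, dim: int, epsilon: float = 0.1, lr: float = 0.05):
        self.w = {arm: np.zeros(dim) for arm in ARMS}
        self.epsilon = epsilon
        self.lr = lr

    def select(self, x: np.ndarray) -> str:
        if random.random() < self.epsilon:
            return random.choice(ARMS)
        return max(ARMS, key=lambda arm: float(self.w[arm] @ x))

    def update(self, arm: str, x: np.ndarray, reward: float) -> None:
        # SGD step on squared error between predicted and observed reward.
        prediction = float(self.w[arm] @ x)
        self.w[arm] += self.lr * (reward - prediction) * x

def route_once(router, task, generate, run_tests):
    """One decision, one verifiable reward: pick an arm, generate, run the tests."""
    x = featurize(task)
    arm = router.select(x)
    candidate = generate(arm, task)          # call the chosen frozen LLM
    reward = 1.0 if run_tests(task, candidate) else 0.0
    router.update(arm, x, reward)
    return arm, reward
```

Because the arms are frozen, a new model is added by appending one name to the pool and learning its estimates, which is the model-dependency row of the table above.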
Different small models have different blind spots — and the blind spots barely overlap. A learned router exploits that diversity. The result is a coalition of small models that beats their bigger siblings, at a fraction of the cost.
From our paper draft (EMNLP 2026 submission, June 2026):
Published and in-progress work behind the system.
Contextual bandits over frozen LLMs deliver large-model quality at small-model cost. Introduces the cross-arm fix-rate analysis and a learned outcome model for predicting which arm best repairs which failure.
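One way to read the "learned outcome model" is as per-arm failure-to-fix classifiers. The sketch below assumes hashed text features over the failure description and one logistic model per arm; the actual features and model class are not specified here, so treat everything except the shape of the prediction task as an assumption.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Failure descriptions (traceback plus failing-test text) hashed to sparse features.
vectorizer = HashingVectorizer(n_features=2**12)

class OutcomeModel:
    """One classifier per arm: P(this arm's retry repairs the failure | features)."""

    def __init__(self, arms):
        self.models = {arm: LogisticRegression(max_iter=1000) for arm in arms}

    def fit(self, arm, failure_texts, fixed_labels):
        # fixed_labels[i] is 1 if this arm's attempt repaired failure_texts[i].
        X = vectorizer.transform(failure_texts)
        self.models[arm].fit(X, fixed_labels)

    def best_arm(self, failure_text):
        # Route the repair attempt to the arm with the highest predicted fix rate.
        X = vectorizer.transform([failure_text])
        scores = {arm: m.predict_proba(X)[0, 1] for arm, m in self.models.items()}
        return max(scores, key=scores.get)
```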
Hidden-state probes for arm selection. Trajectory-aware features for multi-step bandit routing. The bridge from one-shot to interactive coding policies.
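A hidden-state probe can be as small as one linear layer. The sketch below assumes the probe reads a pooled hidden state from a frozen model and emits one success logit per arm, trained against observed test outcomes; the layer choice, pooling, and loss are illustrative assumptions, not the paper's recipe.

```python
import torch
import torch.nn as nn

class ArmProbe(nn.Module):
    """Linear probe: pooled hidden state of a frozen model -> per-arm success logits."""

    def __init__(self, hidden_dim: int, num_arms: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim, num_arms)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, hidden_dim], e.g. the last-token state of a mid-depth layer
        # (which layer and which pooling to use are assumptions here).
        return self.head(hidden)

def probe_step(probe: ArmProbe, optimizer, hidden, outcomes) -> float:
    """One training step; outcomes[i, j] = 1 if arm j's attempt passed the tests."""
    logits = probe(hidden)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, outcomes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```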
A benchmark of operational-error cases that single-shot evaluations miss, with an action space that includes asking for clarification.
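To make the action-space point concrete, here is an illustrative shape for such an interface; the action names and the scoring rule for clarification requests are assumptions, not the benchmark's actual specification.

```python
from enum import Enum

class Action(str, Enum):
    """Illustrative action space: generate with some arm, or ask for clarification."""
    GENERATE_ARM_A = "generate-with-arm-a"
    GENERATE_ARM_B = "generate-with-arm-b"
    ASK_CLARIFICATION = "ask-clarification"

def score_episode(action: Action, tests_pass: bool, spec_was_ambiguous: bool) -> float:
    # Single-shot scoring forces a guess on under-specified tasks; here a
    # clarification request is credited only when the spec really was ambiguous
    # (this scoring rule is an assumption, not the benchmark's published metric).
    if action is Action.ASK_CLARIFICATION:
        return 1.0 if spec_was_ambiguous else 0.0
    return 1.0 if tests_pass else 0.0
```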
Identifies softmax re-normalization (not discarded tokens) as the real source of sparse-attention quality loss. Replaces it with learned scaling (~300 scalars/layer): within 1–2 PPL of full attention at 3% sparsity, 56.5× attention speedup at 1M tokens, length-portable 16K → 1M without retraining.
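The sketch below shows one way to express that idea in code: keep the top-k keys per head, but instead of plainly re-normalizing the softmax over the survivors, correct the normalizer with a learned per-head scalar standing in for the dropped probability mass. The scalar placement and per-head granularity are assumptions; this illustrates the mechanism described above, not the ARTS implementation.

```python
import torch

def sparse_attention_with_learned_scale(q, k, v, topk, alpha):
    """
    q: [heads, d]; k, v: [heads, n, d]; alpha: [heads] learned scalars.
    Keep the top-k keys per head, then inflate the softmax denominator with a
    learned per-head correction instead of re-normalizing over the kept keys.
    """
    scores = torch.einsum("hd,hnd->hn", q, k) / q.shape[-1] ** 0.5
    top_scores, idx = scores.topk(topk, dim=-1)                 # [heads, topk]
    top_scores = top_scores - top_scores.max(dim=-1, keepdim=True).values
    exp = top_scores.exp()
    # Learned correction to the normalizer; alpha = 0 recovers plain re-normalization.
    denom = exp.sum(dim=-1, keepdim=True) * (1.0 + alpha.unsqueeze(-1))
    weights = exp / denom
    v_top = torch.gather(v, 1, idx.unsqueeze(-1).expand(-1, -1, v.shape[-1]))
    return torch.einsum("hk,hkd->hd", weights, v_top)
```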
GAIN (multiplicative domain adaptation, evaluated across 5 models from 774M to 70B), Sparse Focus attention (8.6× speedup at 1M tokens, zero kernel changes), Thin Keys KV-cache reduction (75% savings at ~2% quality cost — ~60% more concurrent users on the same hardware).
Three rungs. Each is a research milestone and a product step.
Reinforcement-learning provenance from the Alberta lineage.
PhD, University of Alberta (2015) — supervised by Rich Sutton (Turing Award 2024) and Csaba Szepesvári (co-author of Bandit Algorithms). Background in model-based reinforcement learning, contextual bandits, and efficient inference.
Prior research contributions include ARTS (a learned-scaling fix for sparse-attention re-normalization, 56.5× speedup at 1M tokens), GAIN (multiplicative domain adaptation), Sparse Focus attention, and Thin Keys KV-cache reduction.
We work with partners across code-assistant companies, cloud inference providers, and the research community.
If you operate a code assistant, a cloud LLM inference service, or a developer platform, we'd like to talk about deployment and joint evaluation.
Reach out →
We're building open evaluation infrastructure and publishing the methodology. Collaborations welcome, especially on bandit theory, world models, and code benchmarks.
Reach out →
Writing about RL for code? Looking at the inference-cost landscape? We're happy to walk through the methodology and the results.
Reach out →