Coding as Reinforcement Learning.

We build reinforcement-learning systems for code generation. A learned policy routes between frozen LLMs — small open-source models, working together — to deliver large-model quality at a fraction of the compute.

An RL layer that sits on top of any foundation model. Independent of whose model wins.

+8 pp above Qwen-2.5-Coder-7B on HumanEval
5-arm coalition: 74.4% vs 66.5%

30% of the 7B's per-query compute
one ~2.1B model runs per query, not the 7B

6× more failures fixed by routing than by self-iteration
same hard cases: 7% self-iteration vs 43% routed

What we do

We treat code generation as a reinforcement learning problem — with verifiable rewards from test execution — and learn a small policy that decides which frozen LLM to invoke for each query.
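
For the technically curious, here is a minimal sketch of that loop in Python, using the LinUCB contextual bandit named in our roadmap below. The arm names and the helpers embed_query, generate, and run_tests are illustrative assumptions rather than our production interfaces; the reward is simply whether the generated code passes its tests.

    import numpy as np

    class LinUCBRouter:
        """Contextual bandit over frozen LLMs: one linear model per arm."""
        def __init__(self, arms, dim, alpha=1.0):
            self.arms = arms
            self.alpha = alpha                        # exploration strength
            self.A = {a: np.eye(dim) for a in arms}   # per-arm covariance
            self.b = {a: np.zeros(dim) for a in arms}

        def choose(self, x):
            # Upper-confidence score per arm; invoke the most promising LLM.
            def ucb(arm):
                A_inv = np.linalg.inv(self.A[arm])
                theta = A_inv @ self.b[arm]
                return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            return max(self.arms, key=ucb)

        def update(self, arm, x, reward):
            self.A[arm] += np.outer(x, x)
            self.b[arm] += reward * x

    def route_once(router, query):
        x = embed_query(query)       # hypothetical: hidden-state features
        arm = router.choose(x)       # the action: which frozen LLM to invoke
        code = generate(arm, query)  # hypothetical: frozen-model inference
        reward = 1.0 if run_tests(code) else 0.0  # verifiable reward from tests
        router.update(arm, x, reward)             # one dense update per attempt
        return code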

Today's default: token-level RL (DeepSeek-R1, o-series)

Action = next token. Policy = the LLM itself.
Action space: vocabulary (~128K tokens)
Credit assignment: thousands of tokens per reward
Training cost: requires fine-tuning the base model
Model dependency: tied to one model generation

Our approach: system-level RL (TeTu AI)

Action = which frozen LLM to invoke. Policy = the router.
Action space: a small pool of open-source LLMs
Credit assignment: one decision per attempt (dense)
Training cost: no base-model fine-tuning
Model dependency: any new LLM plugs in as a new arm

Different small models have different blind spots, and the blind spots barely overlap. A learned router exploits that diversity. The result is a coalition of small models that beats their bigger siblings at a fraction of the cost.
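
A toy illustration of that arithmetic, with made-up pass sets (the task IDs and model names below are hypothetical): even when the best single model solves 60% of a benchmark, an oracle that routes each task to some model that solves it covers the union.

    # Hypothetical pass sets for three small models on a 10-task benchmark.
    solved = {
        "model_a": {0, 1, 2, 3, 4, 5},
        "model_b": {2, 3, 6, 7},
        "model_c": {0, 8, 9},
    }
    best_single = max(len(s) for s in solved.values()) / 10   # 0.6
    coalition = len(set().union(*solved.values())) / 10       # 1.0 with an oracle router

The learned router's job is to approach that union without running every arm on every query.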

Results

The headline numbers above come from our paper draft (EMNLP 2026 submission, June 2026).

Research

Published and in-progress work behind the system.

CORE PAPER

Bandit-Routed Coalitions for Code Generation

EMNLP 2026 (submitted)

Contextual bandits over frozen LLMs deliver large-model quality at small-model cost. Introduces the cross-arm fix-rate analysis and a learned outcome model for predicting which arm best repairs which failure.

METHODS

Learning to Prompt Frozen Code Models

in preparation, NeurIPS 2026 / ICLR 2027 target

Hidden-state probes for arm selection. Trajectory-aware features for multi-step bandit routing. The bridge from one-shot to interactive coding policies.

DATASET

Operational Errors in Coding Agents

in preparation, ICLR 2027 target

A benchmark of operational-error cases that single-shot evaluations miss, with an action space that includes asking for clarification.

EFFICIENT INFERENCE

ARTS: From 25% to 3% Tokens — Breaking the Sparse Attention Barrier

preprint forthcoming; code release pending

Identifies softmax re-normalization (not discarded tokens) as the real source of sparse-attention quality loss. Replaces it with learned scaling (~300 scalars/layer): within 1–2 PPL of full attention at 3% sparsity, 56.5× attention speedup at 1M tokens, length-portable 16K → 1M without retraining.
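
A compressed numpy sketch of the distinction, under simplifying assumptions of our own: a single query vector, plain top-k selection, and one scalar alpha standing in for the ~300 learned per-layer scalars. The real ARTS parameterization differs; this only contrasts re-normalizing over the kept tokens with scaling them.

    import numpy as np

    def sparse_attention(q, K, V, keep_frac=0.03, alpha=None):
        # q: (d,) query; K, V: (n, d) keys and values.
        s = (K @ q) / np.sqrt(len(q))          # raw attention scores
        k = max(1, int(keep_frac * len(s)))    # keep the top ~3% of tokens
        idx = np.argpartition(s, -k)[-k:]
        w = np.exp(s[idx] - s[idx].max())      # unnormalized weights on kept tokens
        if alpha is None:
            w = w / w.sum()    # standard sparse attention: softmax re-normalization
        else:
            w = w * alpha      # ARTS-style (as sketched): a learned scale replaces it
        return w @ V[idx]

On the paper's account, the re-normalizing branch is what costs quality; the learned-scale branch recovers it at 3% sparsity.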

EFFICIENT INFERENCE

Prior work on efficient inference

arXiv 2603–2604 series, 2026

GAIN (multiplicative domain adaptation across 5 models, 774M–70B), Sparse Focus attention (8.6× speedup at 1M tokens, zero kernel changes), and Thin Keys KV-cache reduction (75% savings at ~2% quality cost; ~60% more concurrent users on the same hardware).

Where we are going

Three rungs. Each is a research milestone and a product step.

TODAY
Contextual bandit routing
LinUCB on hidden states, 5–20 arms.
Beats single 7B-class models on HumanEval and MBPP at ~30% of their per-query compute. Live integration with code-assistant partners and cloud inference providers.
NEXT 6 MONTHS
Multi-step MDP
K-step routing with trajectory features.
Cross-arm iteration with full trajectory state — the router learns to switch between arms based on error structure, not just the prompt. Targets harder, multi-step coding tasks.
12 MONTHS
Planning in a learned world model
Dyna-style RL on top of the outcome model.
Simulate which arm will succeed before paying real inference cost. The router becomes a planner. Model-agnostic by design — independent of which foundation model wins.
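
A one-step sketch of that planning move in Python. The outcome_model, its predict method, and the per-arm cost table are hypothetical stand-ins for the learned outcome model from our core paper; only the planner logic is the point.

    def plan_route(features, arms, outcome_model, cost):
        """Dyna-flavored planning: query the learned world model
        before paying for any real generation."""
        def expected_value(arm):
            # Hypothetical interface: P(tests pass | query features, arm).
            p_success = outcome_model.predict(features, arm)
            return p_success - cost[arm]   # trade success odds against compute
        return max(arms, key=expected_value)

    # Only the winning arm runs real inference afterwards.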

Team

Reinforcement-learning roots in the Alberta lineage.

Hengshuai Yao — Founder

PhD, University of Alberta (2015) — supervised by Rich Sutton (Turing Award 2024) and Csaba Szepesvári (co-author of Bandit Algorithms). Background in model-based reinforcement learning, contextual bandits, and efficient inference.

Prior research contributions include ARTS (sparse-attention re-normalization, 56.5× speedup at 1M tokens), GAIN (multiplicative domain adaptation), Sparse Focus attention, and Thin Keys KV-cache reduction.

Get involved

We work with partners across code-assistant companies, cloud inference providers, and the research community.

Partners

If you operate a code assistant, cloud LLM inference, or developer platform — we'd like to talk about deployment and joint evaluation.

Reach out →

Researchers

We're building open evaluation infrastructure and publishing the methodology. Collaborations welcome, especially on bandit theory, world models, and code benchmarks.

Reach out →

Press & everyone else

Writing about RL for code? Looking at the inference-cost landscape? We're happy to walk through the methodology and the results.

Reach out →