We build reinforcement-learning systems for code generation. A learned policy routes between frozen LLMs — small open-source models, working together — to deliver large-model quality at a fraction of the compute.
An RL layer that sits on top of any foundation model. Independent of whose model wins.
We treat code generation as a reinforcement learning problem — with verifiable rewards from test execution — and learn a small policy that decides which frozen LLM to invoke for each query.
| | Token-level RL (fine-tune the generator) | Routing over frozen LLMs (this system) |
| --- | --- | --- |
| Action space | vocabulary (~128K tokens) | a small pool of open-source LLMs |
| Credit assignment | thousands of tokens per reward | one decision per attempt — dense |
| Training cost | requires fine-tuning the base model | no base-model fine-tuning |
| Model dependency | tied to one model generation | any new LLM plugs in as a new arm |
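Concretely, the router is a contextual bandit over a fixed pool of arms. The sketch below is a minimal illustration rather than the production policy: the arm names, the toy features, the epsilon-greedy rule, and the `generate` / `run_tests` hooks are assumptions standing in for the real components. The grounded part is the loop shape: one routing decision per attempt, one verifiable reward from test execution.

```python
import random

import numpy as np

# Hypothetical pool of frozen code models (arms); the names are placeholders.
ARMS = ["small-coder-a", "small-coder-b", "small-coder-c"]

def featurize(task: str) -> np.ndarray:
    """Toy context features; the real policy would use richer signals."""
    return np.array([len(task) / 1000.0, task.count("def"), task.count("class"), 1.0])

class EpsilonGreedyRouter:
    """Per-arm linear reward estimates with epsilon-greedy exploration."""

    def __init__(self, dim: int, epsilon: float = 0.1, lr: float = 0.05):
        self.w = {arm: np.zeros(dim) for arm in ARMS}
        self.epsilon = epsilon
        self.lr = lr

    def select(self, x: np.ndarray) -> str:
        if random.random() < self.epsilon:
            return random.choice(ARMS)
        return max(ARMS, key=lambda arm: float(self.w[arm] @ x))

    def update(self, arm: str, x: np.ndarray, reward: float) -> None:
        # SGD step on squared error between predicted and observed reward.
        prediction = float(self.w[arm] @ x)
        self.w[arm] += self.lr * (reward - prediction) * x

def route_once(router, task, generate, run_tests):
    """One decision, one verifiable reward: pick an arm, generate, run the tests."""
    x = featurize(task)
    arm = router.select(x)
    candidate = generate(arm, task)          # call the chosen frozen LLM
    reward = 1.0 if run_tests(task, candidate) else 0.0
    router.update(arm, x, reward)
    return arm, reward
```

Because the arms are frozen, a new model is added by appending one name to the pool and learning its estimates, which is the model-dependency row of the table above.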
Different small models have different blind spots — and the blind spots barely overlap. A learned router exploits that diversity. The result is a coalition of small models that beats their bigger siblings, at a fraction of the cost.
From our paper draft (EMNLP 2026 submission, June 2026):
Published and in-progress work behind the system.
Contextual bandits over frozen LLMs deliver large-model quality at small-model cost. Introduces the cross-arm fix-rate analysis and a learned outcome model for predicting which arm best repairs which failure.
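One way to read the "learned outcome model" is as per-arm failure-to-fix classifiers. The sketch below assumes hashed text features over the failure description and one logistic model per arm; the actual features and model class are not specified here, so treat everything except the shape of the prediction task as an assumption.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Failure descriptions (traceback plus failing-test text) hashed to sparse features.
vectorizer = HashingVectorizer(n_features=2**12)

class OutcomeModel:
    """One classifier per arm: P(this arm's retry repairs the failure | features)."""

    def __init__(self, arms):
        self.models = {arm: LogisticRegression(max_iter=1000) for arm in arms}

    def fit(self, arm, failure_texts, fixed_labels):
        # fixed_labels[i] is 1 if this arm's attempt repaired failure_texts[i].
        X = vectorizer.transform(failure_texts)
        self.models[arm].fit(X, fixed_labels)

    def best_arm(self, failure_text):
        # Route the repair attempt to the arm with the highest predicted fix rate.
        X = vectorizer.transform([failure_text])
        scores = {arm: m.predict_proba(X)[0, 1] for arm, m in self.models.items()}
        return max(scores, key=scores.get)
```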
Hidden-state probes for arm selection. Trajectory-aware features for multi-step bandit routing. The bridge from one-shot to interactive coding policies.
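A hidden-state probe can be as small as one linear layer. The sketch below assumes the probe reads a pooled hidden state from a frozen model and emits one success logit per arm, trained against observed test outcomes; the layer choice, pooling, and loss are illustrative assumptions, not the paper's recipe.

```python
import torch
import torch.nn as nn

class ArmProbe(nn.Module):
    """Linear probe: pooled hidden state of a frozen model -> per-arm success logits."""

    def __init__(self, hidden_dim: int, num_arms: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim, num_arms)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, hidden_dim], e.g. the last-token state of a mid-depth layer
        # (which layer and which pooling to use are assumptions here).
        return self.head(hidden)

def probe_step(probe: ArmProbe, optimizer, hidden, outcomes) -> float:
    """One training step; outcomes[i, j] = 1 if arm j's attempt passed the tests."""
    logits = probe(hidden)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, outcomes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```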
A benchmark of operational-error cases that single-shot evaluations miss, with an action space that includes asking for clarification.
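To make the action-space point concrete, here is an illustrative shape for such an interface; the action names and the scoring rule for clarification requests are assumptions, not the benchmark's actual specification.

```python
from enum import Enum

class Action(str, Enum):
    """Illustrative action space: generate with some arm, or ask for clarification."""
    GENERATE_ARM_A = "generate-with-arm-a"
    GENERATE_ARM_B = "generate-with-arm-b"
    ASK_CLARIFICATION = "ask-clarification"

def score_episode(action: Action, tests_pass: bool, spec_was_ambiguous: bool) -> float:
    # Single-shot scoring forces a guess on under-specified tasks; here a
    # clarification request is credited only when the spec really was ambiguous
    # (this scoring rule is an assumption, not the benchmark's published metric).
    if action is Action.ASK_CLARIFICATION:
        return 1.0 if spec_was_ambiguous else 0.0
    return 1.0 if tests_pass else 0.0
```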
Identifies softmax re-normalization (not discarded tokens) as the real source of sparse-attention quality loss. Replaces it with learned scaling (~300 scalars/layer): within 1–2 PPL of full attention at 3% sparsity, 56.5× attention speedup at 1M tokens, length-portable 16K → 1M without retraining.
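The sketch below shows one way to express that idea in code: keep the top-k keys per head, but instead of plainly re-normalizing the softmax over the survivors, correct the normalizer with a learned per-head scalar standing in for the dropped probability mass. The scalar placement and per-head granularity are assumptions; this illustrates the mechanism described above, not the ARTS implementation.

```python
import torch

def sparse_attention_with_learned_scale(q, k, v, topk, alpha):
    """
    q: [heads, d]; k, v: [heads, n, d]; alpha: [heads] learned scalars.
    Keep the top-k keys per head, then inflate the softmax denominator with a
    learned per-head correction instead of re-normalizing over the kept keys.
    """
    scores = torch.einsum("hd,hnd->hn", q, k) / q.shape[-1] ** 0.5
    top_scores, idx = scores.topk(topk, dim=-1)                 # [heads, topk]
    top_scores = top_scores - top_scores.max(dim=-1, keepdim=True).values
    exp = top_scores.exp()
    # Learned correction to the normalizer; alpha = 0 recovers plain re-normalization.
    denom = exp.sum(dim=-1, keepdim=True) * (1.0 + alpha.unsqueeze(-1))
    weights = exp / denom
    v_top = torch.gather(v, 1, idx.unsqueeze(-1).expand(-1, -1, v.shape[-1]))
    return torch.einsum("hk,hkd->hd", weights, v_top)
```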
GAIN (multiplicative domain adaptation, evaluated across 5 models from 774M to 70B), Sparse Focus attention (8.6× speedup at 1M tokens, zero kernel changes), Thin Keys KV-cache reduction (75% savings at ~2% quality cost — ~60% more concurrent users on the same hardware).
Three rungs. Each is a research milestone and a product step.
Reinforcement-learning provenance from the Alberta lineage.
PhD, University of Alberta (2015) — supervised by Rich Sutton (Turing Award 2024) and Csaba Szepesvári (co-author of Bandit Algorithms). Background in model-based reinforcement learning, contextual bandits, and efficient inference.
Prior research contributions include ARTS (a learned-scaling fix for sparse-attention re-normalization, 56.5× speedup at 1M tokens), GAIN (multiplicative domain adaptation), Sparse Focus attention, and Thin Keys KV-cache reduction.
We work with partners across code-assistant companies, cloud inference providers, and the research community.
If you operate a code assistant, a cloud LLM inference service, or a developer platform, we'd like to talk about deployment and joint evaluation.
Reach out →
We're building open evaluation infrastructure and publishing the methodology. Collaborations welcome, especially on bandit theory, world models, and code benchmarks.
Reach out →
Writing about RL for code? Looking at the inference-cost landscape? We're happy to walk through the methodology and the results.
Reach out →