Yunhao (Robin) Tang

Selected Publications

Besides building frontier models, I also enjoy scientific research. My past work focused on the understanding and development of deep reinforcement learning algorithms and systems, spanning the non-exhaustive list of topics below.

See Google Scholar for the full list.


LlamaRL

LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training

Bo Wu*, Sid Wang*, Yunhao Tang*, Jia Ding*, et al.

arXiv

LlamaRL is the first distributed asynchronous off-policy RL stack to power large-scale Llama training within Llama research.

KL divergence

On a few pitfalls in KL divergence gradient estimation for RL

Yunhao Tang, Rémi Munos.

arXiv / Tinker citation

We highlight two pitfalls in constructing KL divergence gradient estimates for RL applications, and examine their practical impact.
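
For a flavor of the objects involved, below is a minimal sketch of the k1/k2/k3 Monte Carlo estimators of the KL value (naming follows common usage in the RL-for-LLMs community; this is illustrative, not the paper's code). The subtlety is that an estimator unbiased for the KL value need not yield an unbiased gradient once the dependence of the sampling distribution on the policy parameters is accounted for.

```python
import torch

def kl_estimators(logp: torch.Tensor, logq: torch.Tensor):
    """Monte Carlo estimators of KL(q || p) from samples x ~ q.

    logp, logq: log-probabilities of the sampled tokens under the
    reference policy p and the current policy q. Illustrative only.
    """
    log_ratio = logp - logq                 # log p(x) - log q(x), x ~ q
    k1 = -log_ratio                         # unbiased for the KL value, high variance
    k2 = 0.5 * log_ratio ** 2               # low variance, but biased
    k3 = log_ratio.exp() - 1.0 - log_ratio  # unbiased, low variance, nonnegative
    return k1.mean(), k2.mean(), k3.mean()
```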

JEPO

Beyond Verifiable Rewards: Scaling Reinforcement Learning for Language Models to Unverifiable Data

Yunhao Tang, Sid Wang, Lovish Madaan, Rémi Munos.

arXiv, NeurIPS 2025

We extend RL training beyond verifiable rewards with Jensen's Evidence lower bound Policy Optimization (JEPO).

Asymmetric REINFORCE

Asymmetric REINFORCE for off-policy reinforcement learning: balancing positive and negative rewards

Charles Arnal, Gaëtan Narozniak, Vivien Cabannes, Yunhao Tang, Julia Kempe, Rémi Munos.

arXiv, NeurIPS 2025

We analyze an asymmetric version of the REINFORCE algorithm, which arises naturally in the off-policy case.
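
For intuition, here is a minimal sketch of a REINFORCE-style surrogate that weights positive and negative advantages asymmetrically; the specific weighting scheme (`w_pos`, `w_neg`) is an illustrative assumption, not the paper's exact estimator.

```python
import torch

def asymmetric_reinforce_loss(logprobs, advantages, w_pos=1.0, w_neg=0.5):
    # Down-weight negative-advantage samples; the coefficients here are
    # hypothetical, chosen only to illustrate the asymmetry.
    weights = torch.where(
        advantages >= 0,
        torch.full_like(advantages, w_pos),
        torch.full_like(advantages, w_neg),
    )
    # Score-function surrogate: minimizing this ascends the weighted return.
    return -(weights * advantages.detach() * logprobs).mean()
```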

Inference time RL

Optimizing Language Models for Inference Time Objectives using Reinforcement Learning

Yunhao Tang*, Kunhao Zheng*, Gabriel Synnaeve, Rémi Munos*.

arXiv, ICML 2025

Simple inference-aware fine-tuning algorithms can greatly improve test-time performance.

AGRO

RL-finetuning LLMs from on- and off-policy data with a single algorithm

Yunhao Tang*, Taco Cohen, David W. Zhang, Michal Valko, Rémi Munos*.

arXiv, AISTATS 2026

We propose Any Generation Reward Optimization (AGRO), a fine-tuning algorithm that leverages data from arbitrary sources for model alignment.

Soft Policy Optimization

Soft Policy Optimization: Online Off-Policy RL for Sequence Models

Taco Cohen*, David W. Zhang*, Kunhao Zheng, Yunhao Tang, Rémi Munos, Gabriel Synnaeve.

arXiv

Soft Policy Optimization outperforms PPO on code generation tasks.

Online vs Offline

Understanding the Performance Gap between Online and Offline Alignment Algorithms

Yunhao Tang, Daniel Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi Munos, Bernardo Avila Pires, Michal Valko, Yong Cheng, Will Dabney.

arXiv

Is online RL really necessary for AI alignment, or do offline algorithms suffice?

Scalable oversight

On scalable oversight with weak LLMs judging strong LLMs

Zachary Kenton*, Noah Y. Siegel*, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah.

arXiv, NeurIPS 2024

We benchmark important existing scalable-oversight protocols in a comprehensive suite of QA tasks.

Categorical DRL

Categorical Distributional Reinforcement Learning with Kullback-Leibler Divergence: Convergence and Asymptotics

Tyler Kastner, Mark Rowland, Yunhao Tang, Murat A. Erdogdu, Amir M. Farahmand.

ICML 2025

We seek to close another important theory-practice gap in distributional RL.

DRO

Offline Regularised Reinforcement Learning for Large Language Models Alignment

Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Rémi Munos, Bilal Piot.

arXiv

When human feedback is pointwise rather than pairwise, we propose Direct Reward Optimization (DRO).

Minimax DRL

Near-Minimax-Optimal Distributional Reinforcement Learning with a Generative Model

Mark Rowland, Li Kevin Wenliang, Rémi Munos, Clare Lyle, Yunhao Tang, Will Dabney.

NeurIPS 2024

We present a near-minimax-optimal model-based algorithm for distributional RL.

Online PO

Human Alignment of Large Language Models Through Online Preference Optimization

Daniele Calandriello, Daniel Guo, Rémi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot.

arXiv, ICML 2024

Online preference optimization as an alignment technique is intimately related to finding a Nash equilibrium.

GPO

Generalized Preference Optimization: A Unified Approach to Offline Alignment

Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Avila Pires, Bilal Piot.

arXiv, ICML 2024

GPO unifies alignment algorithms such as DPO, IPO, and SLiC as special cases of a single family of losses.
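
To give a flavor of the unification: offline preference losses can be written as a convex function f of the scaled log-ratio margin between chosen and rejected responses. A minimal sketch follows; the exact scalings for IPO and SLiC are as I recall them and may differ from the paper's parametrization.

```python
import torch
import torch.nn.functional as F

# Convex losses f applied to the scaled log-ratio margin; scalings for
# IPO and SLiC are illustrative.
GPO_LOSSES = {
    "dpo":  lambda x: F.softplus(-x),   # logistic loss: log(1 + exp(-x))
    "ipo":  lambda x: (x - 0.5) ** 2,   # squared loss
    "slic": lambda x: F.relu(1.0 - x),  # hinge loss
}

def gpo_loss(margin: torch.Tensor, beta: float, variant: str) -> torch.Tensor:
    """margin = [log pi(y_w)/pi_ref(y_w)] - [log pi(y_l)/pi_ref(y_l)],
    computed per preference pair."""
    return GPO_LOSSES[variant](beta * margin).mean()
```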

NLHF

Nash Learning from Human Feedback

Rémi Munos*, Michal Valko*, Daniele Calandriello*, Mohammad Gheshlaghi Azar*, Mark Rowland*, Daniel Guo*, Yunhao Tang*, Matthieu Geist*, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup, Bilal Piot*.

arXiv, ICML 2024

In aligning large language models, we search for a Nash equilibrium naturally defined via pairwise human feedback.

Off-policy DQ

Off-policy Distributional Q(λ): Distributional RL without Importance Sampling

Yunhao Tang, Mark Rowland, Rémi Munos, Bernardo Avila Pires, Will Dabney.

arXiv, AISTATS 2026

Another addition to the family of off-policy distributional RL algorithms, without the need for importance sampling.

Distributional SR

A Distributional Analogue to the Successor Representation

Harley Wiltzer*, Jesse Farebrother*, Arthur Gretton, Yunhao Tang, André Barreto, Will Dabney, Marc G. Bellemare, Mark Rowland.

arXiv, ICML 2024

We shed light on what a distributional analogue of the successor representation looks like.

VA-learning

VA-learning as a more efficient alternative to Q-learning

Yunhao Tang, Rémi Munos, Mark Rowland, Michal Valko.

arXiv, ICML 2023

VA-learning is a more sample-efficient alternative to Q-learning, closely related to the dueling architecture.

DoMo-AC

DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm

Yunhao Tang*, Tadashi Kozuno*, Mark Rowland, Anna Harutyunyan, Rémi Munos, Bernardo Avila Pires, Michal Valko.

arXiv, ICML 2023

An off-policy actor-critic algorithm based on multi-step policy improvement and evaluation.

TD representation

Towards a better understanding of representation dynamics under TD-learning

Yunhao Tang, Rémi Munos.

arXiv, ICML 2023

A characterization of how TD-learning learns representations.

QCA

Quantile Credit Assignment

Thomas Mesnard, Wenqi Chen, Alaa Saade, Yunhao Tang, Mark Rowland, Théophane Weber, Clare Lyle, Audrunas Gruslys, Michal Valko, Will Dabney, Georg Ostrovski, Eric Moulines, Rémi Munos.

arXiv, ICML 2023, Oral

Efficient credit assignment should account for external factors outside the agent's control.

Quantile TD

The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation

Mark Rowland, Yunhao Tang, Clare Lyle, Rémi Munos, Marc G. Bellemare, Will Dabney.

arXiv, ICML 2023

Quantile TD can outperform standard TD even for mean value prediction.
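
For concreteness, the quantile TD update on a single transition takes a strikingly simple form; below is a tabular sketch (variable names are mine).

```python
import numpy as np

def qtd_update(theta, r, theta_next, gamma=0.99, alpha=0.1, rng=None):
    """One quantile TD update on a single transition. Each quantile
    estimate theta_i moves up with step alpha * tau_i or down with step
    alpha * (tau_i - 1), depending on whether the sampled bootstrap
    target exceeds it."""
    rng = rng or np.random.default_rng()
    m = len(theta)
    taus = (np.arange(m) + 0.5) / m              # quantile midpoints
    target = r + gamma * rng.choice(theta_next)  # sample one target quantile
    # Subgradient step on the quantile (pinball) loss at each theta_i.
    return theta + alpha * (taus - (target < theta).astype(float))
```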

Self-predictive

Understanding Self-Predictive Learning for Reinforcement Learning

Yunhao Tang, Zhaohan Daniel Guo, Pierre Harvey Richemond, Bernardo Avila Pires, Yash Chandak, Rémi Munos, Mark Rowland, Mohammad Gheshlaghi Azar, Charline Le Lan, Clare Lyle, András György, Shantanu Thakoor, Will Dabney, Bilal Piot, Daniele Calandriello, Michal Valko.

arXiv, ICML 2023

Self-predictive learning is related to gradient-based spectral decomposition of the transition dynamics.

QTD analysis

An Analysis of Quantile Temporal-Difference Learning

Mark Rowland, Rémi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna Harutyunyan, Karl Tuyls, Marc G. Bellemare, Will Dabney.

arXiv, JMLR

A first proof of the convergence of quantile TD-learning.

Multi-step DRL

The Nature of Temporal Difference Errors in Multi-step Distributional Reinforcement Learning

Yunhao Tang, Mark Rowland, Rémi Munos, Bernardo Avila Pires, Will Dabney, Marc G. Bellemare.

arXiv, NeurIPS 2022

Fundamental differences between value-based TD-learning and distributional TD-learning.

BYOL-Explore

BYOL-Explore: Exploration by Bootstrapped Prediction

Zhaohan Daniel Guo*, Shantanu Thakoor*, Miruna Pislar*, Bernardo Avila Pires*, Florent Altché*, Corentin Tallec*, Alaa Saade, Daniele Calandriello, Jean-Bastien Grill, Yunhao Tang, Michal Valko, Rémi Munos, Mohammad Gheshlaghi Azar*, Bilal Piot*.

arXiv, NeurIPS 2022

Self-prediction loss is a surprisingly useful signal for exploration.

Meta RL variance

Biased Gradient Estimate with Drastic Variance Reduction for Meta Reinforcement Learning

Yunhao Tang.

arXiv, ICML 2022

Introducing a deliberate bias into gradient estimators can drastically reduce variance for meta-RL.

Meta RL

Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation

Yunhao Tang*, Tadashi Kozuno*, Mark Rowland, Rémi Munos, Michal Valko.

arXiv / Code, NeurIPS 2021

A unifying framework for high-order derivatives of value functions via off-policy evaluation.

Peng Q-lambda

Revisiting Peng's Q(λ) for Modern Reinforcement Learning

Tadashi Kozuno*, Yunhao Tang*, Mark Rowland, Rémi Munos, Steven Kapturowski, Will Dabney, Michal Valko, David Abel.

arXiv / Code, ICML 2021

Our analysis sheds light on why uncorrected multi-step updates work well in practice.

hEM

Hindsight Expectation Maximization for Goal-conditioned Reinforcement Learning

Yunhao Tang, Alp Kucukelbir.

paper / arXiv, AISTATS 2021

Hindsight Expectation Maximization (hEM) combines supervised learning with hindsight goal sampling.

SIL

Self-Imitation Learning via Generalized Lower Bound Q-learning

Yunhao Tang.

paper / arXiv, NeurIPS 2020

We shed light on the connections between self-imitation learning and n-step Q-learning.

MCTS

Monte-Carlo Tree Search as Regularized Policy Optimization

Jean-Bastien Grill*, Florent Altché*, Yunhao Tang*, Thomas Hubert, Michal Valko, Ioannis Antonoglou, Rémi Munos.

arXiv / video, ICML 2020

An interpretation of MCTS as policy optimization, leading to improvements over AlphaZero and MuZero.
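
Concretely, the paper shows that AlphaZero-style action selection approximates the solution of a KL-regularized policy optimization problem, which has a closed form up to a normalizing constant. Below is a minimal sketch solving for that constant by bisection (variable names are mine; see the paper for the precise statement).

```python
import numpy as np

def regularized_policy(q_values, prior, lam, iters=100):
    """Solve argmax_pi <q, pi> - lam * KL(prior || pi) over the simplex.
    The solution is pi(a) = lam * prior(a) / (alpha - q(a)), with the
    normalizer alpha found by bisection so that pi sums to one."""
    lo, hi = q_values.max() + 1e-12, q_values.max() + lam
    for _ in range(iters):
        alpha = 0.5 * (lo + hi)
        mass = np.sum(lam * prior / (alpha - q_values))
        if mass > 1.0:
            lo = alpha  # pi overshoots the simplex: raise the normalizer
        else:
            hi = alpha
    pi = lam * prior / (hi - q_values)
    return pi / pi.sum()
```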

TayPO

Taylor Expansion Policy Optimization

Yunhao Tang, Michal Valko, Rémi Munos.

arXiv / video / media 1 / media 2, ICML 2020

TayPO generalizes policy optimization objectives to higher-order extensions, yielding gains for large-scale distributed agents.

RL for IP

Reinforcement Learning for Integer Programming: Learning to Cut

Yunhao Tang, Shipra Agrawal, Yuri Faenza.

arXiv / video, ICML 2020

We cast cutting plane selection as sequential decision making for generic integer programming.