Yunhao (Robin) Tang

I am a research scientist at DeepMind London. I am a core contributor to Gemini post-training, and I research the science and engineering of deep reinforcement learning. Previously, I was a two-time intern at DeepMind Paris, hosted by Remi Munos. I obtained my PhD at Columbia University in New York City.

Email  /  CV (updated Oct, 2021)  /  Google Scholar  /  Twitter

News
  • [7/2024] Great to be part of the project led by my excellent collaborators to benchmark scalable-oversight protocols. Check it out here.
  • [5/2024] We have a new paper that hypothesis-tests the importance of on-policy sampling in language model alignment. Check it out here!
  • [5/2024] Four papers are accepted at ICML 2024. Thank you to my coauthors for the heavy lifting.
  • [3/2024] Gemini 1.5 is announced. Check out the tech report here.
  • [12/2023] Gemini is launched. Great to be a core contributor to Gemini, the most powerful multi-modal large language model developed by Google DeepMind. Check out the tech report here.
  • [4/2023] Ten papers are accepted at ICML 2023. Thank you and congratulations to my coauthors.
Research

My recent work focuses on the understanding and development of reinforcement learning algorithms and systems.

  • Reinforcement learning from human feedback [1] - [2] - [3] - [4]
  • Representation learning [1] - [2] - [3]
  • Distributional reinforcement learning [1] - [2] - [3]
  • Search and credit assignment [1] - [2] - [3]
  • Off-policy reinforcement learning [1] - [2] - [3]
  • Stochastic gradient estimation [1] - [2] - [3]

For a selected subset of publications, see highlighted papers below.

Understanding the Performance Gap between Online and Offline Alignment Algorithms
Yunhao Tang, Daniel Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Remi Munos, Bernardo Avila Pires, Michal Valko, Yong Cheng, Will Dabney
arXiv

Is online RL really necessary for AI alignment, or do offline algorithms suffice? Our careful ablations suggest that online sampling is indeed important.

On scalable oversight with weak LLMs judging strong LLMs
Zachary Kenton*, Noah Y. Siegel*, Janos Kramar, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah
arXiv

We benchmark important existing scalable-oversight protocols on a comprehensive suite of QA tasks, opening the path for future investigation.

Offline Regularised Reinforcement Learning for Large Language Models Alignment
Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos, Bilal Piot
arXiv

When human feedback is pointwise rather than pairwise, we propose direct reward optimization (DRO) as the alignment algorithm.

Human Alignment of Large Language Models Through Online Preference Optimization
Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot
arXiv, ICML 2024

Besides being a competitive algorithm for RLHF, online preference optimization as an alignment technique turns out to be intimately related to Nash equilibria.

Generalized Preference Optimization: A Unified Approach to Offline Alignment
Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Remi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Avila Pires, Bilal Piot
arXiv, ICML 2024

GPO unifies alignment algorithms such as DPO, IPO and SLiC as special cases. The insight, interestingly, comes from the classic literature on convex losses for binary classification. At the end of the day, all algorithmic variants have similar performance-regularization trade-offs, though their natural regularization strengths differ.
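The unification can be illustrated with a minimal sketch (my own simplification, not the paper's exact parameterization): each offline alignment method applies a different convex loss to the same pairwise log-likelihood-ratio margin, here denoted rho.

```python
import math

# rho stands for the (scaled) difference of log-likelihood ratios between
# the preferred and dispreferred responses; exact scaling and target
# constants differ across the original papers.

def dpo_loss(rho):
    """Logistic loss, as in DPO."""
    return math.log(1 + math.exp(-rho))

def slic_loss(rho):
    """Hinge loss, as in SLiC."""
    return max(0.0, 1.0 - rho)

def ipo_loss(rho):
    """Squared loss, as in IPO (target constant simplified to 1 here)."""
    return (rho - 1.0) ** 2
```

All three are convex surrogates for the 0-1 classification loss on preference labels, which is the connection to the binary classification literature mentioned above.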

Gemini: A Family of Highly Capable Multimodal Models
Gemini team, Google DeepMind.
Tech report / arXiv

One of the most powerful multi-modal large language models in the world thus far.

Nash Learning from Human Feedback
Remi Munos*, Michal Valko*, Daniele Calandriello*, Mohammad Gheshlaghi Azar*, Mark Rowland*, Daniel Guo*, Yunhao Tang*, Matthieu Geist*, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup and Bilal Piot*
arXiv, ICML 2024

In aligning large language models, we search for the Nash equilibrium naturally defined by pairwise human feedback. This approach is more general-purpose, imposes fewer assumptions on reward modeling, and performs better than canonical RLHF.

A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning
Khimya Khetarpal, Daniel Zhaohan Guo, Bernardo Avila Pires, Yunhao Tang, Clare Lyle, Mark Rowland, Nicolas Heess, Diana Borsa, Arthur Guez, Will Dabney
arXiv

We further bridge the theory-practice gap in action-conditional self-predictive learning for RL applications.

Off-policy Distributional Q(lambda): Distributional RL without Importance Sampling
Yunhao Tang, Mark Rowland, Remi Munos, Bernardo Avila Pires, Will Dabney
arXiv

We introduce another addition to the family of off-policy distributional RL algorithms, importantly, without the need for importance sampling.

A Distributional Analogue to the Successor Representation
Harley Wiltzer*, Jesse Farebrother*, Arthur Gretton, Yunhao Tang, André Barreto, Will Dabney, Marc G. Bellemare, Mark Rowland
arXiv, ICML 2024

We shed light on what a distributional analogue of the successor representation looks like, and on the algorithmic applications arising from these insights.

Near-Minimax-Optimal Distributional Reinforcement Learning with a Generative Model
Mark Rowland, Kevin Wenliang Li, Remi Munos, Clare Lyle, Yunhao Tang, Will Dabney
arXiv

We show how a model-based approach to distributional RL achieves a new minimax sample-complexity bound.

VA-learning as a more efficient alternative to Q-learning
Yunhao Tang, Remi Munos, Mark Rowland, Michal Valko
arXiv, ICML 2023

We propose VA-learning as a more sample-efficient alternative to Q-learning. The sample efficiency stems from value sharing between different actions. Intriguingly, VA-learning closely relates to the dueling architecture.

DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm
Yunhao Tang*, Tadashi Kozuno*, Mark Rowland, Anna Harutyunyan, Remi Munos, Bernardo Avila Pires, Michal Valko
arXiv, ICML 2023

We design an off-policy actor-critic algorithm based on multi-step policy improvement and policy evaluation. The algorithm improves over the state-of-the-art IMPALA baseline.

Towards a better understanding of representation dynamics under TD-learning
Yunhao Tang, Remi Munos
arXiv, ICML 2023

We provide a characterization of how TD-learning learns representations, relating random-reward TD-learning to the spectral decomposition of the transition matrix.

Representations and Exploration for Deep Reinforcement Learning using Singular Value Decomposition
Yash Chandak, Shantanu Thakoor, Daniel Guo, Yunhao Tang, Remi Munos, Will Dabney, Diana Borsa
arXiv, ICML 2023

We build a novel and principled representation learning algorithm based on the singular value decomposition of the transition matrix. It also yields an intrinsic reward naturally related to visitation counts.

Quantile Credit Assignment
Thomas Mesnard, Wenqi Chen, Alaa Saade, Yunhao Tang, Mark Rowland, Theophane Weber, Clare Lyle, Audrunas Gruslys, Michal Valko, Will Dabney, Georg Ostrovski, Eric Moulines, Remi Munos
arXiv, ICML 2023, Oral

Efficient credit assignment should account for external factors outside the agent's control, or, more informally, the level of luck. We formalize this intuition as quantile credit assignment.

The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation
Mark Rowland, Yunhao Tang, Clare Lyle, Remi Munos, Marc G. Bellemare, Will Dabney
arXiv, ICML 2023

We show that in certain cases quantile TD outperforms TD even at mean value prediction. This hints at a broader potential of distributional RL to beat mean-based RL at its own game.
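As a rough illustration of the algorithm being analyzed (a tabular sketch under my own simplifications, not the paper's formulation), quantile TD maintains m quantile estimates per state and moves each toward sampled Bellman targets via quantile regression:

```python
def quantile_td_update(theta, r, gamma, theta_next, alpha=0.1):
    """One quantile TD update for a single state's quantile estimates.

    theta, theta_next: lists of m quantile estimates for the current and
    next state; tau_i = (2i + 1) / (2m) are the quantile midpoints.
    Each estimate moves up with weight tau and down with weight (1 - tau),
    depending on whether it sits below or above each sampled target.
    """
    m = len(theta)
    for i in range(m):
        tau = (2 * i + 1) / (2 * m)
        # average the quantile-regression (pinball) gradient over targets
        grad = sum(tau - (r + gamma * tn < theta[i]) for tn in theta_next) / m
        theta[i] += alpha * grad
    return theta
```

In the tabular setting this converges to the quantiles of the return distribution, whose average can serve as the mean value estimate compared against plain TD.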

The Edge of Orthogonality: A Simple View of What Makes BYOL Tick
Pierre H. Richemond, Allison Tam, Yunhao Tang, Florian Strub, Bilal Piot, Felix Hill
paper / arXiv, ICML 2023

We explain why BYOL works in the supervised learning setting, using relatively simple mathematical tools.

Understanding Self-Predictive Learning for Reinforcement Learning
Yunhao Tang, Zhaohan Daniel Guo, Pierre Harvey Richemond, Bernardo Avila Pires, Yash Chandak, Remi Munos, Mark Rowland, Mohammad Gheshlaghi Azar, Charline Le Lan, Clare Lyle, Andras Gyorgy, Shantanu Thakoor, Will Dabney, Bilal Piot, Daniele Calandriello, Michal Valko
arXiv, ICML 2023

Self-predictive learning is a popular representation learning approach in RL, which learns a latent representation by predicting (bootstrapping) its own future latents. Intuitively, the algorithm should not work, as it can collapse to trivial solutions. We identify the algorithmic components that prevent the collapse and show that self-predictive learning is related to gradient-based spectral decomposition of the transition dynamics.

An Analysis of Quantile Temporal-Difference Learning
Mark Rowland, Remi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna Harutyunyan, Karl Tuyls, Marc G. Bellemare, Will Dabney
arXiv, JMLR

We provide a first proof of convergence for quantile TD-learning, a distributional RL algorithm that has driven multiple recent empirical breakthroughs.

The Nature of Temporal Difference Errors in Multi-step Distributional Reinforcement Learning
Yunhao Tang, Mark Rowland, Remi Munos, Bernardo Avila Pires, Will Dabney, Marc G. Bellemare
arXiv

We identify a few intriguing and fundamental differences between value-based TD-learning and distributional TD-learning.

BYOL-Explore: Exploration by Bootstrapped Prediction
Zhaohan Daniel Guo*, Shantanu Thakoor*, Miruna Pislar*, Bernardo Avila Pires*, Florent Altche*, Corentin Tallec*, Alaa Saade, Daniele Calandriello, Jean-Bastien Grill, Yunhao Tang, Michal Valko, Remi Munos, Mohammad Gheshlaghi Azar*, Bilal Piot*
arXiv

We find that the self-prediction loss is a surprisingly useful signal for exploration in extremely challenging deep RL domains. Our method, BYOL-Explore, partially cracks a wide range of extremely hard exploration problems much more efficiently than prior methods.

KL-Entropy-Regularized Reinforcement Learning with a Generative Model is Minimax Optimal
Tadashi Kozuno, Wenhao Yang, Nino Vieillard, Toshinori Kitamura, Yunhao Tang, Jincheng Mei, Pierre Menard, Mohammad Gheshlaghi Azar, Michal Valko, Remi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvari
paper / arXiv

We show that mirror descent value iteration (MDVI), when combined with a generative model, is minimax optimal.

Biased Gradient Estimate with Drastic Variance Reduction for Meta Reinforcement Learning
Yunhao Tang
International Conference on Machine Learning (ICML), Baltimore, USA, 2022
arXiv

We find that certain deliberate biases in gradient estimators can drastically reduce variance for meta RL.

From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses
Daniil Tiapkin, Denis Belomestny, Eric Moulines, Alexey Naumov, Sergey Samsonov, Yunhao Tang, Michal Valko, Pierre Menard
International Conference on Machine Learning (ICML), Baltimore, USA, 2022
paper / arXiv

We propose Bayes-UCBVI, an efficient exploration algorithm based on Bayesian bootstrap.

Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation
Yunhao Tang*, Tadashi Kozuno*, Mark Rowland, Remi Munos, Michal Valko
Neural Information Processing Systems (NeurIPS), Virtual, 2021
arXiv / code

How do we estimate high-order derivatives of value functions? We propose a unifying framework based on off-policy evaluation: direct differentiation of off-policy estimates produces estimates of high-order derivatives of value functions, and instantiates many prior methods as special cases.

Marginalized Operators for Off-policy Reinforcement Learning
Yunhao Tang, Mark Rowland, Remi Munos, Michal Valko
International Conference on Artificial Intelligence and Statistics (AISTATS), 2022
arXiv

Can we combine marginalized estimation with contractive operators for off-policy learning? We propose marginalized operators, which do exactly that.

Taylor Expansion of Discount Factors
Yunhao Tang, Mark Rowland, Remi Munos, Michal Valko
International Conference on Machine Learning (ICML), Virtual, 2021
arXiv

In deep RL practice, we estimate discounted value functions with a small discount factor, yet at evaluation time we care about the undiscounted objective with a large effective discount factor. We make clear the connections between value functions of different discount factors, and partially justify some ubiquitous deep RL heuristics.

Revisiting Peng's Q($\lambda$) for Modern Reinforcement Learning
Tadashi Kozuno*, Yunhao Tang*, Mark Rowland, Remi Munos, Steven Kapturowski, Will Dabney, Michal Valko, David Abel
International Conference on Machine Learning (ICML), Virtual, 2021
arXiv / code

Uncorrected multi-step updates such as n-step Q-learning are ubiquitous in modern deep RL practice. We revisit Peng's Q($\lambda$), a classic uncorrected multi-step variant. Our analysis sheds light on why uncorrected updates work in practice, and our empirical results suggest significant gains on benchmark tasks.
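For reference, the uncorrected backup can be sketched as follows (a minimal sketch in my own notation, with `bootstrap_q[t]` standing for max_a Q(s_{t+1}, a); not the paper's exact implementation):

```python
def pengs_q_lambda_targets(rewards, bootstrap_q, gamma=0.99, lam=0.7):
    """Compute Peng's Q(lambda) return targets for one trajectory.

    rewards[t] is the reward at step t; bootstrap_q[t] approximates
    max_a Q(s_{t+1}, a). Each target mixes a one-step bootstrap with
    the longer lambda-return, with no importance-sampling correction
    applied (hence "uncorrected").
    """
    G = bootstrap_q[-1]
    targets = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * ((1 - lam) * bootstrap_q[t] + lam * G)
        targets[t] = G
    return targets
```

At lam=0 this reduces to one-step Q-learning targets; at lam=1 it becomes the uncorrected Monte Carlo return.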

Hindsight Expectation Maximization for Goal-conditioned Reinforcement Learning
Yunhao Tang, Alp Kucukelbir
International Conference on Artificial Intelligence and Statistics (AISTATS), Virtual, 2021
paper / arXiv

We propose hindsight expectation maximization (hEM), an EM algorithm for goal-conditioned RL that combines supervised learning in the M-step with hindsight goal sampling in the E-step. We also draw an intimate connection between hindsight replay and importance sampling for rare-event simulation.

Self-Imitation Learning via Generalized Lower Bound Q-learning
Yunhao Tang
Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2020
paper / arXiv

Why is self-imitation learning efficient? We shed light on its connections to n-step Q-learning and show that part of its gains might be attributed to trade-offs in RL operators. We also propose an n-step extension of self-imitation learning that incorporates the strengths of both n-step updates and lower-bound learning.

Monte-Carlo Tree Search as Regularized Policy Optimization
Jean-Bastien Grill*, Florent Altche*, Yunhao Tang*, Thomas Hubert, Michal Valko, Ioannis Antonoglou, Remi Munos
International Conference on Machine Learning (ICML), Vienna, Austria, 2020
paper / arXiv / video

We establish an interpretation of MCTS as policy optimization. This interpretation leads to algorithmic variants which naturally improve over MCTS-based baselines such as AlphaZero and MuZero.

Taylor Expansion Policy Optimization
Yunhao Tang, Michal Valko, Remi Munos
International Conference on Machine Learning (ICML), Vienna, Austria, 2020
paper / arXiv / video / media 1 / media 2

We establish intimate connections between trust-region policy search and off-policy evaluation. The new algorithm, TayPO, generalizes policy optimization objectives to high-order extensions, which leads to gains on large-scale distributed agents.

Reinforcement Learning for Integer Programming: Learning to Cut
Yunhao Tang, Shipra Agrawal, Yuri Faenza
International Conference on Machine Learning (ICML), Vienna, Austria, 2020
paper / arXiv / video

We formulate cutting plane selection as a sequential decision-making problem for generic integer programming. The cutting plane agent learned via RL improves over human-designed heuristics and benefits downstream applications such as branch-and-cut.

Online Hyper-parameter Tuning in Off-policy Learning via Evolutionary Strategies
Yunhao Tang, Krzysztof Choromanski
Workshop on Deep Reinforcement Learning, Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2020
paper / arXiv

We carry out online hyper-parameter adaptation for off-policy learning algorithms with Evolutionary Strategies. The overall algorithm outperforms baselines with static hyper-parameters.

Guiding Evolutionary Strategies with Off-policy Actor-critic
Yunhao Tang
International Conference on Autonomous Agents and Multiagent Systems (AAMAS), London, United Kingdom, 2021
paper

We use off-policy policy gradients to guide the random search of evolutionary algorithms. This provides more informative updates, which outperform baselines.

Learning to Score Behaviors for Guided Policy Optimization
Aldo Pacchiano*, Jack Parker-Holder*, Yunhao Tang*, Anna Choromanska, Krzysztof Choromanski, Michael Jordan
International Conference on Machine Learning (ICML), Vienna, Austria, 2020
paper / arXiv

We propose to semantically score trajectories of RL agents with Wasserstein distance. The new algorithmic variant improves upon previous alternative distance metrics.

Variance Reduction for Evolutionary Strategies via Structured Control Variates
Yunhao Tang, Krzysztof Choromanski, Alp Kucukelbir
International Conference on Artificial Intelligence and Statistics (AISTATS), Palermo, Italy, 2020
paper / arXiv

We propose an RL-specific control variate for variance reduction in Evolutionary Strategies. This method outperforms previous general-purpose variance reduction techniques.

Discrete Action On-Policy Learning with Action-Value Critic
Yuguang Yue, Yunhao Tang, Mingzhang Yin and Mingyuan Zhou
International Conference on Artificial Intelligence and Statistics (AISTATS), Palermo, Italy, 2020
paper / arXiv

We propose a novel variance reduction technique based on the Augment-Reinforce-Swap-Merge (ARSM) framework for on-policy optimization in RL.

Practical Nonisotropic Monte Carlo Sampling in High Dimensions via Determinantal Point Processes
Krzysztof Choromanski*, Aldo Pacchiano*, Jack Parker-Holder*, Yunhao Tang*
International Conference on Artificial Intelligence and Statistics (AISTATS), Palermo, Italy, 2020
paper / arXiv

We propose a sampling scheme that incorporates determinantal point processes into Evolutionary Strategies for variance reduction.

Discretizing Continuous Action Space for On-Policy Optimization
Yunhao Tang, Shipra Agrawal
Association for the Advancement of Artificial Intelligence (AAAI), New York, NY, USA, 2020
paper / arXiv / code

The simple idea of discretizing a continuous control action space into a discrete one works decently well.
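A minimal sketch of the idea (the paper works with per-dimension discrete distributions to avoid the exponential blowup of the joint action set; this naive version enumerates the full Cartesian product for illustration, with bins >= 2):

```python
import itertools

def discretize_action_space(low, high, bins):
    """Discretize a continuous box action space into a finite action set.

    Each dimension gets `bins` evenly spaced values between its low and
    high bounds; the discrete action set is the Cartesian product of the
    per-dimension values.
    """
    per_dim = [
        [lo + (hi - lo) * k / (bins - 1) for k in range(bins)]
        for lo, hi in zip(low, high)
    ]
    return list(itertools.product(*per_dim))

# a 2-D action space in [-1, 1]^2 with 3 bins per dimension -> 9 actions
actions = discretize_action_space([-1.0, -1.0], [1.0, 1.0], bins=3)
```

A discrete-action policy (e.g. a softmax head per dimension) can then be trained with standard on-policy methods over this finite set.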

ES-MAML: Simple Hessian-Free Meta Learning
Xingyou Song, Wenbo Gao, Yuxiang Yang, Krzysztof Choromanski, Aldo Pacchiano, Yunhao Tang
International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 2020
arXiv

Evolutionary Strategies can be applied to meta-learning and work decently compared to popular gradient-based alternatives. The new formulation also scales easily with computational resources.

Augment-Reinforce-Merge Policy Gradient for Binary Stochastic Policy
Yunhao Tang, Mingzhang Yin, Mingyuan Zhou
arXiv, 2019

We show that when combined with the Augment-Reinforce-Merge (ARM) technique for variance reduction, the performance of policy gradient algorithms with binary action space could be significantly improved.

Provably Robust Blackbox Optimization for Reinforcement Learning
Krzysztof Choromanski*, Aldo Pacchiano*, Jack Parker-Holder*, Yunhao Tang, Deepali Jain, Yuxiang Yang, Atil Iscen, Jasmine Hsu, Vikas Sindhwani
Conference on Robot Learning (CoRL), spotlight, Osaka, Japan, 2019
paper / arXiv

We propose to combine robust regression with Evolutionary Strategies. This new formulation naturally entails sample reuse and improves the robustness of the overall algorithm.

From Complexity to Simplicity: Adaptive ES-Active Subspaces for Blackbox Optimization
Krzysztof Choromanski*, Aldo Pacchiano*, Jack Parker-Holder*, Yunhao Tang*, Vikas Sindhwani
Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2019
paper / arXiv

We propose to constrain the random search employed by Evolutionary Strategies into adaptive subspaces, which entails improved sample efficiency.

KAMA-NNs: Low-Dimensional Rotation based Neural Networks
Krzysztof Choromanski*, Aldo Pacchiano*, Jeffrey Pennington*, Yunhao Tang*
International Conference on Artificial Intelligence and Statistics (AISTATS), Okinawa, Japan, 2019
paper / supplementary

We design a novel neural network architecture based on Kac's random walk. The new architecture greatly reduces the number of parameters while retaining competitive performance on RL tasks compared to fully-connected architectures.

Orthogonal Estimation of Wasserstein Distances
Mark Rowland*, Jiri Hron*, Yunhao Tang*, Krzysztof Choromanski, Tamas Sarlos, Adrian Weller
International Conference on Artificial Intelligence and Statistics (AISTATS), Okinawa, Japan, 2019
paper / supplementary / arXiv

We propose an orthogonal sampling method for computing the sliced Wasserstein distance. The new technique entails variance reduction and improves training on downstream tasks.
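For context, a plain Monte Carlo version of the sliced 1-Wasserstein distance between equal-sized point clouds looks like this (i.i.d. Gaussian directions; the paper's contribution is to replace these with orthogonal directions for variance reduction, which this sketch does not implement):

```python
import math
import random

def sliced_wasserstein_1(xs, ys, n_proj=100, seed=0):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance between
    two equal-sized point clouds in R^d.

    For each random unit direction, project both clouds onto the line,
    sort the projections, and average the absolute differences; the 1-D
    Wasserstein distance has this closed form via sorted samples.
    """
    rng = random.Random(seed)
    d = len(xs[0])
    total = 0.0
    for _ in range(n_proj):
        # sample a random unit direction on the sphere
        v = [rng.gauss(0, 1) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        px = sorted(sum(c * x for c, x in zip(v, p)) for p in xs)
        py = sorted(sum(c * y for c, y in zip(v, p)) for p in ys)
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(xs)
    return total / n_proj
```

The variance of this estimator comes from the random directions, which is exactly the component the orthogonal sampling scheme targets.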

Exploration by Distributional Reinforcement Learning
Yunhao Tang, Shipra Agrawal
International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 2018
arXiv / code

We carry out approximate Thompson sampling with a randomized DQN for efficient exploration in RL.

The source code of this website is from here.