Yunhao (Robin) Tang

I am a research scientist at DeepMind London. Previously, I obtained my Ph.D. from Columbia University, where I was very fortunate to be advised by Prof. Shipra Agrawal. Before that, I received my M.S. in Financial Engineering at Columbia University and my B.S. in Physics at Fudan University.

In Summer 2021, I interned with the Deep RL team at DeepMind Paris, working remotely from New York. In Fall 2019, I had a wonderful time working as an intern at DeepMind Paris, hosted by Remi Munos. I worked on topics related to large-scale distributed reinforcement learning agents, mostly off-policy, some model-free and some model-based.

Email  /  CV  /  Google Scholar  /  Twitter

News
  • [9/2021] One paper accepted at NeurIPS 2021. Please check it out here!
  • [8/2021] I defended my PhD thesis. Many thanks to my advisor, my committee members, my mentors and collaborators.
  • [2/2021] Two papers accepted at ICML 2021. Many thanks to my collaborators!
  • [2/2021] I had the great pleasure of presenting our previous work on RL for integer programming at the IPAM workshop held at UCLA. Both the video and slides are available here.
  • [1/2021] One paper accepted at AISTATS 2021. Please check it out here!
  • [9/2020] One paper accepted at NeurIPS 2020. Please check it out here!
  • [6/2020] Four papers accepted at ICML 2020. I am grateful to all the wonderful coauthors!
  • [5/2020] Our paper won the most popular poster award and an honorable mention for the best paper at MIP 2020. The video presentation is here.
Research

I have a broad interest in machine learning, with a particular focus on reinforcement learning.

For a selected subset of publications, see highlighted papers below.

Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation
Yunhao Tang*, Tadashi Kozuno*, Mark Rowland, Remi Munos, Michal Valko
Neural Information Processing Systems (NeurIPS), Virtual, 2021
arXiv

How can we estimate higher-order derivatives of value functions? We propose a unifying framework based on off-policy evaluation: directly differentiating off-policy estimates yields estimates of higher-order derivatives of value functions and recovers many prior methods as special cases.

Marginalized Operators for Off-policy Reinforcement Learning
Yunhao Tang, Mark Rowland, Remi Munos, Michal Valko
Workshop on Reinforcement Learning Theory, International Conference on Machine Learning, 2021
paper

Can we combine marginalized estimation with contractive operators for off-policy learning?

Taylor Expansion of Discount Factors
Yunhao Tang, Mark Rowland, Remi Munos, Michal Valko
International Conference on Machine Learning (ICML), Virtual, 2021
arXiv

In deep RL practice, we estimate discounted value functions with a small discount factor, yet at evaluation time we care about the undiscounted objective, which has a large effective discount factor. We clarify the connections between value functions of different discount factors and partially justify some ubiquitous deep RL heuristics.
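For intuition (my own notation, not necessarily the paper's): writing $P^\pi$ for the transition matrix under policy $\pi$, $r$ for the reward vector, and $V_\gamma = (I - \gamma P^\pi)^{-1} r$, a standard resolvent expansion relates value functions at two discount factors $\gamma \le \gamma' < 1$:

$V_{\gamma'} = \sum_{k \ge 0} \left[ (\gamma' - \gamma) (I - \gamma P^\pi)^{-1} P^\pi \right]^k V_{\gamma}.$

Truncating this series at a finite order gives approximations that sit between the two discount factors, which I take to be the spirit of the connection described above.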

Revisiting Peng's Q($\lambda$) for Modern Reinforcement Learning
Tadashi Kozuno*, Yunhao Tang*, Mark Rowland, Remi Munos, Steven Kapturowski, Will Dabney, Michal Valko, David Abel
International Conference on Machine Learning (ICML), Virtual, 2021
arXiv / code

Uncorrected multi-step updates such as n-step Q-learning are ubiquitous in modern deep RL practice. We revisit Peng's Q($\lambda$), a classic uncorrected multi-step variant. Our analysis sheds light on why uncorrected updates work well in practice, and empirical results show significant gains on benchmark tasks.
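As an unofficial sketch of the quantity being revisited (my own illustration, not the paper's code), Peng's Q($\lambda$) return can be computed backwards over a sampled trajectory, mixing uncorrected n-step returns without importance weights or trace cuts:

def pengs_q_lambda_returns(rewards, bootstrap_values, gamma, lam):
    # rewards[t] is r_t; bootstrap_values[t] is max_a Q(s_{t+1}, a).
    T = len(rewards)
    G = [0.0] * T
    # The last step falls back to the plain one-step (uncorrected) return.
    G[T - 1] = rewards[T - 1] + gamma * bootstrap_values[T - 1]
    for t in reversed(range(T - 1)):
        # Mix the one-step bootstrap with the longer return; no off-policy
        # correction is applied, which is what makes the update "uncorrected".
        G[t] = rewards[t] + gamma * ((1 - lam) * bootstrap_values[t] + lam * G[t + 1])
    return G

Setting lam = 0 recovers one-step Q-learning targets, while lam = 1 gives the full uncorrected return with a final bootstrap.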

Hindsight Expectation Maximization for Goal-conditioned Reinforcement Learning
Yunhao Tang, Alp Kucukelbir
International Conference on Artificial Intelligence and Statistics (AISTATS), Virtual, 2021
paper / arXiv

We propose Hindsight Expectation Maximization (hEM), an EM algorithm for goal-conditioned RL that combines supervised learning in the M-step with hindsight goal sampling in the E-step. We also draw an intimate connection between hindsight replay and importance sampling for rare-event simulation.

Self-Imitation Learning via Generalized Lower Bound Q-learning
Yunhao Tang
Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2020
paper / arXiv

Why is self-imitation learning efficient? We shed light on its connections to n-step Q-learning and show that part of its gains might be attributed to trade-offs in RL operators. We also propose an n-step extension of self-imitation learning that incorporates the strengths of both n-step updates and lower-bound learning.
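As a minimal sketch of the lower-bound flavor (my own illustration, following the original self-imitation loss rather than the paper's generalized operator): Q(s, a) is pushed up only when an observed return exceeds the current estimate.

def lower_bound_loss(q_value, observed_return):
    # Penalize only when the (possibly n-step) return exceeds the current
    # estimate, so the observed return acts as a lower bound on Q(s, a).
    gap = observed_return - q_value
    return 0.5 * max(gap, 0.0) ** 2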

Monte-Carlo Tree Search as Regularized Policy Optimization
Jean-Bastien Grill*, Florent Altche*, Yunhao Tang*, Thomas Hubert, Michal Valko, Ioannis Antonoglou, Remi Munos
International Conference on Machine Learning (ICML), Vienna, Austria, 2020
paper / arXiv / video

We establish an interpretation of MCTS as policy optimization. This interpretation leads to algorithmic variants which naturally improve over MCTS-based baselines such as AlphaZero and MuZero.

Taylor Expansion Policy Optimization
Yunhao Tang, Michal Valko, Remi Munos
International Conference on Machine Learning (ICML), Vienna, Austria, 2020
paper / arXiv / video / media 1 / media 2

We establish intimate connections between trust region policy search and off-policy evaluation. The new algorithm, TayPO, generalizes policy optimization objectives to higher-order extensions, which leads to gains for large-scale distributed agents.

Reinforcement Learning for Integer Programming: Learning to Cut
Yunhao Tang, Shipra Agrawal, Yuri Faenza
International Conference on Machine Learning (ICML), Vienna, Austria, 2020
paper / arXiv / video

We formulate cutting plane algorithms as a sequential decision-making problem for generic integer programming. The cutting plane agent learned via RL improves over human-designed heuristics and benefits downstream applications such as branch-and-cut.

Online Hyper-parameter Tuning in Off-policy Learning via Evolutionary Strategies
Yunhao Tang, Krzysztof Choromanski
Workshop on Deep Reinforcement Learning, Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2020
paper / arXiv

We carry out online hyper-parameter adaptation of off-policy learning algorithms with Evolutionary Strategies. The overall algorithm outperforms baselines with static hyper-parameters.
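For context, such methods build on the generic antithetic ES gradient estimator; the sketch below is a minimal version of that estimator (my own, not the paper's algorithm).

import numpy as np

def es_gradient(theta, fitness, sigma=0.1, n_pairs=16, seed=0):
    # Antithetic Evolution Strategies estimate of the gradient of
    # E[fitness(theta + sigma * eps)] with eps ~ N(0, I).
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.standard_normal(theta.shape)
        grad += (fitness(theta + sigma * eps) - fitness(theta - sigma * eps)) * eps
    return grad / (2 * n_pairs * sigma)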

Guiding Evolutionary Strategies with Off-policy Actor-critic
Yunhao Tang
International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Virtual, 2021
paper

We use off-policy policy gradients to guide the random search of evolutionary algorithms. This provides more informative updates and outperforms baselines.

Learning to Score Behaviors for Guided Policy Optimization
Aldo Pacchiano*, Jack Parker-Holder*, Yunhao Tang*, Anna Choromanska, Krzysztof Choromanski, Michael Jordan
International Conference on Machine Learning (ICML), Vienna, Austria, 2020
paper / arXiv

We propose to semantically score trajectories of RL agents with Wasserstein distance. The new algorithmic variant improves upon previous alternative distance metrics.

Variance Reduction for Evolutionary Strategies via Structured Control Variates
Yunhao Tang, Krzysztof Choromanski, Alp Kucukelbir
International Conference on Artificial Intelligence and Statistics (AISTATS), Palermo, Italy, 2020
paper / arXiv

We propose an RL-specific control variate for variance reduction of Evolutionary Strategies. This method outperforms previous general-purpose variance reduction techniques.

Discrete Action On-Policy Learning with Action-Value Critic
Yuguang Yue, Yunhao Tang, Mingzhang Yin and Mingyuan Zhou
International Conference on Artificial Intelligence and Statistics (AISTATS), Palermo, Italy, 2020
paper / arXiv

We propose a novel variance reduction technique based on the Augment-Reinforce-Swap-Merge (ARSM) framework for on-policy optimization in RL.

Practical Nonisotropic Monte Carlo Sampling in High Dimensions via Determinantal Point Processes
Krzysztof Choromanski*, Aldo Pacchiano*, Jack Parker-Holder*, Yunhao Tang*
International Conference on Artificial Intelligence and Statistics (AISTATS), Palermo, Italy, 2020
paper / arXiv

We propose a sampling scheme that incorporates determinantal point processes into Evolutionary Strategies for variance reduction.

Discretizing Continuous Action Space for On-Policy Optimization
Yunhao Tang, Shipra Agrawal
Association for the Advancement of Artificial Intelligence (AAAI), New York, NY, USA, 2020
paper / arXiv / code

The simple idea of discretizing the continuous control action space into a discrete space works decently.
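As a toy illustration of the idea (my own sketch, not the paper's implementation): each action dimension is discretized into evenly spaced bins, and a discrete per-dimension choice is mapped back to a continuous action.

import numpy as np

def make_action_grid(low, high, num_bins):
    # One array of evenly spaced candidate values per action dimension.
    return [np.linspace(l, h, num_bins) for l, h in zip(low, high)]

def to_continuous_action(bin_indices, grid):
    # Map each dimension's chosen bin index back to a real-valued action.
    return np.array([grid[d][idx] for d, idx in enumerate(bin_indices)])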

ES-MAML: Simple Hessian-Free Meta Learning
Xingyou Song, Wenbo Gao, Yuxiang Yang, Krzysztof Choromanski, Aldo Pacchiano, Yunhao Tang
International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 2020
arXiv

Evolutionary Strategies can be applied to meta-learning and work decently compared to popular gradient-based alternatives. The new formulation also scales easily with computational resources.

Augment-Reinforce-Merge Policy Gradient for Binary Stochastic Policy
Yunhao Tang, Mingzhang Yin, Mingyuan Zhou
arXiv 2019
arXiv

We show that, when combined with the Augment-Reinforce-Merge (ARM) technique for variance reduction, the performance of policy gradient algorithms with binary action spaces can be significantly improved.

Provably Robust Blackbox Optimization for Reinforcement Learning
Krzysztof Choromanski*, Aldo Pacchiano*, Jack Parker-Holder*, Yunhao Tang, Deepali Jain, Yuxiang Yang, Atil Iscen, Jasmine Hsu, Vikas Sindhwani
Conference on Robot Learning (CoRL), spotlight, Osaka, Japan, 2019
paper / arXiv

We propose to combine robust regression with Evolutionary Strategies. This new formulation naturally entails sample reuse and improves the robustness of the overall algorithm.

From Complexity to Simplicity: Adaptive ES-Active Subspaces for Blackbox Optimization
Krzysztof Choromanski*, Aldo Pacchiano*, Jack Parker-Holder*, Yunhao Tang*, Vikas Sindhwani
Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2019
paper / arXiv

We propose to constrain the random search employed by Evolutionary Strategies into adaptive subspaces, which entails improved sample efficiency.

KAMA-NNs: Low-Dimensional Rotation based Neural Networks
Krzysztof Choromanski*, Aldo Pacchiano*, Jeffrey Pennington*, Yunhao Tang*
International Conference on Artificial Intelligence and Statistics (AISTATS), Okinawa, Japan, 2019
paper / supplementary

We design a novel neural network architecture based on Kac's random walk. This new architecture greatly reduces the number of parameters yet retains competitive performance on RL tasks compared to fully-connected architectures.

Orthogonal Estimation of Wasserstein Distances
Mark Rowland*, Jiri Hron*, Yunhao Tang*, Krzysztof Choromanski, Tamas Sarlos, Adrian Weller
International Conference on Artificial Intelligence and Statistics (AISTATS), Okinawa, Japan, 2019
paper / supplementary / arXiv

We propose an orthogonal sampling method for computing sliced Wasserstein distances. This new technique entails variance reduction and improves training on downstream tasks.

Exploration by Distributional Reinforcement Learning
Yunhao Tang, Shipra Agrawal
International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 2018
arXiv / code

We carry out approximate Thompson sampling with randomized DQN for efficient exploration in RL.

Additional Preprints and Peer-reviewed Workshop Papers
Reinforcement Learning with Chromatic Networks
Xingyou Song, Krzysztof Choromanski, Jack Parker-Holder, Yunhao Tang, Wenbo Gao, Aldo Pacchiano, Tamas Sarlos, Deepali Jain, Yuxiang Yang
Workshop on Neural Architecture Search, International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 2020
arXiv

We show that achieving decent performance on continuous control RL benchmark tasks requires only a small number of parameters.

Combining Model-based and Model-free Reinforcement Learning through Evolution
Yunhao Tang
Deep Reinforcement Learning Workshop, Neural Information Processing Systems (NeurIPS), Montreal, Canada, 2018
paper

We show that combining Evolutionary Strategies with model-based RL methods has the potential of bringing the best of both worlds.

Variational Auto-encoding Contexts for Control
Yunhao Tang, Xiya Cao
Inference to Control Workshop, Neural Information Processing Systems (NeurIPS), Montreal, Canada, 2018
paper

We show that incorporating generative modeling techniques into the RL loss improves the performance for certain partially-observable tasks.

Boosting Trust Region Policy Optimization with Normalizing Flows Policy
Yunhao Tang, Shipra Agrawal
Invertible Neural Nets and Normalizing Flows Workshop, International Conference on Machine Learning (ICML), Long Beach, California, USA, 2019
Deep Reinforcement Learning Workshop, Neural Information Processing Systems (NeurIPS), Montreal, Canada, 2018
arXiv

We show that the performance of trust region policy search can be significantly improved via expressive policy classes, e.g., normalizing flows.

Variational Deep Q Network
Yunhao Tang, Alp Kucukelbir
Bayesian Deep Learning Workshop, Neural Information Processing Systems (NeurIPS), Long Beach, California, USA, 2017
arXiv / code

We show that introducing structured randomness in DQN agents entails more efficient exploration.

Implicit Policy for Reinforcement Learning
Yunhao Tang, Shipra Agrawal
arXiv 2018
arXiv

We demonstrate the utility of implicit distributions as RL policies.

Teaching assistant
IEOR 4525, Machine Learning for OR and FE, Spring 2020

IEOR 8100, Reinforcement Learning, Spring 2019 and Spring 2018

IEOR 4525, Machine Learning for OR and FE, Fall 2018

COMS 6998, Probabilistic Programming, Fall 2018

IEOR 6711, Stochastic Modeling I, Fall 2017
The source code of the website is from here.