The “Jungle” of Reinforcement Learning: Policy Evaluation versus Policy Learning when On and Off Policy

Reinforcement learning is a mathematical framework for data-driven control, and defines a situation in which an autonomous agent seeks to learn to take actions based on rewards it observes that come from the environment it traverses. In particular, its a formalism for trying to teach a learning agent to try to follow temporally related incentives, like a trail of bread crumbs.

bread.pngIn light of the smashing success of AlphaGo during the summer of 2016, reinforcement learning has received increasing attention from disparate academic communities such as machine learning, artificial intelligence, control systems, operations research, and applied mathematics. However, the research communities view the same problem fairly differently, and as a result there is a noticeable gap between methods for reinforcement learning which are theoretically sound versus practically useful. Compounding this “practicality gap” is the imprecise language which is used to describe practical approaches reinforcement learning, e.g., when describing “off policy policy evaluation,” what do we really mean, mathematically? The purpose of this blog post is to clarify the taxonomy of different problems in the field, and what are the challenges one must face when in each setting. Before doing so, I’ll review a few basic concepts so we are on the same page.

Reinforcement learning is defined by a setting in which an autonomous agent moves through some environment (state space), which is typically a compact subset of Euclidean space. At some time, the agent takes an action which belongs to some other compact Euclidean subset, and then the agent moves to a new state. The way this transition happens is random, but is dependent on the previous state and action taken. Then a reward is revealed by the environment, which informs us of how good the action was. See the cartoon below, taken from one of my recent presentations:mdp_extended.pngHere I am discussing the most general case where state and action spaces are continuous, which is in contrast to the context in which classical dynamic programming theory was developed. The average accumulation of rewards when starting in a given state is called its value:value.pngThe goal in reinforcement learning is to chose the action sequence so as to accumulate the most rewards on average. Thus, our target is always to find the action sequence which gives us the most value:optimal_value.pngNow that we are all on the same page, we can talk about what are the difficulties of actually determining the optimal policy. I’ll organize the discussion according to the following table:policy_table0.png

  1. Top left: To actually optimize the value function with respect to our action sequence, we need to first have access to the value function, i.e., the above equation requires the one before it as a prerequisite. This is can be stated intuitively as policy improvement cannot happen without first knowing how to evaluate a policy. But to evaluate a policy, the agent must move around its environment according to a fixed policy for awhile, meaning that a lot of behavior occurs that’s not actually valuable. This setting of learning the value associated with a fixed policy is called policy evaluation. This may only be guaranteed to converge when the policy (in the form of state-action-reward triples) we evaluate is associated with value function we are learning, i.e., on-policy. On the positive side, it is possible to find stochastic approximation algorithms that converge almost surely and determine the value function for a fixed policy. This fact fundamentally hinges on the fact that the state-action-reward triples are independent and identically distributed (i.i.d.), and the fact that it’s possible to reformulate this task as a stochastic convex optimization problem.
  2. Bottom left: When trying to determine the value function when the data does not follow the policy we are evaluating, we are in murky territory. People keep around “training data sets” consisting of state-action-reward triples and keep running their value function estimation scheme until “convergence” but it is not at all clear that convergence is possible, since any stochastic descent direction or fixed point iteration will necessarily be biased. This is because when we compute expectations over data, we are computing expectations over data that is different from that which defines the value function! Put another way, we lose the crucial i.i.d. property of our data. A lot of heuristics and complicated tricks are used to circumvent this issue, such as importance sampling, biased replay buffers, probabilistic exploration schemes, encoding Bayesian priors on the value function, etc. But we lack a good theoretical understanding of these tools in terms of how they actually help or hurt us in determining which actions have the most value.

    To summarize, when we are learning the value function associated with the actions the agent actually follows, this is on-policy policy evaluation. When we try to learn the value function not associated with the actions chosen, we’re in the murky territory of stochastic optimization with an expectation over data not seen.RL_table1.png

  3. Now, let’s move to the second column, still in the second row (bottom right). Policy learning when off policy is the province of classical Q learning and TD Backgammon. The idea here is that the agent, by taking random actions, and then learning their associated value, will eventually learn the optimal value function. However, a fundamental difference between this setting and the first column is that all formulations of policy learning eventually boil down to a non-convex optimization problem, with the exception of stochastic fixed point iteration (classical TD(0)-style Q learning), but it’s an open question whether fixed point iteration is stable under function approximation. Function approximation, unfortunately, is required when operating in continuous spaces. This non-convexity, rather than being an artifact of the way we define a nonlinear statistical model, as in deep learning, is instead intrinsic to the problem formulation, as it is for the Traveling Salesman Problem or Cauchy-Schwarz Quadratic Mutual Information.
    • Non-convexity in policy learning is a challenging problem because even if we design a method that converges with probability 1, it could converge to an arbitrarily bad stationary point. To avoid this, people try to tamper with the random i.i.d. nature of the exploratory policy used in off-policy learning by biasing it towards states that are “more likely” to yield high rewards. But doing so can break the stochastic contraction/descent properties of the learning method. This is one example of the explore-exploit tradeoff. To learn a good policy, we have to randomly fully explore the space. But doing so could take a very long time during which we get very little reward. So instead we bias the action distribution. But this bias can cause the learning process to diverge.
  4. Lastly, the top-right corner, policy learning when on-policy, is best equipped to mitigate problems associated with the explore-exploit tradeoff. However, doing so comes at the cost of data now exhibiting temporal dependence. That is, now the learning process must address the fact that it is observing correlated time-series data, a setting under which theoretical understanding of stochastic optimization methods mostly breaks down. Some efforts to understand how to do supervised learning from time-series in very specialized settings, when one limits the amount how non-i.i.d. the data may be, have been developed. But serious mathematical sophistication required to understand this setting, and the analysis pretty much says: we can’t change the policy too quickly while learning, otherwise the value function never converges. This data distribution difficulty compounds the difficulties of non-convexity and/or stochastic fixed point iteration with function approximation that are present in the off-policy setting. However, somewhat remarkably, experimentally, on-policy policy learning can obtain very good results…if it does not diverge.

To sum up the above discussion, let’s now look at the filled in table:RL_table.pngThe takeaway from this blog post is that (1) we would like to overcome non-convexity in policy learning. The only way to do this is with stochastic fixed point iteration, but we don’t know how to make this work with function approximation, which is required for continuous spaces; (2) non-convexity is a major contributor to the phenomenon called the explore-exploit tradeoff, for if we had convexity, through random exploration alone we would always eventually find the best policy; (3) on-policy policy learning can experimentally obtain very good results, but theoretically it requires treating trajectories through the space as time-series, a setting for which most understanding of statistical learning methods breaks down; and (4) hacks such as biasing action selection towards states that yield higher reward (greedy actions) can help us mitigate the explore-exploit tradeoff but we possibly do so at the cost of our learning process diverging.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s