The “Jungle” of Reinforcement Learning: Policy Evaluation versus Policy Learning when On and Off Policy

Reinforcement learning is a mathematical framework for data-driven control, and defines a situation in which an autonomous agent seeks to learn to take actions based on rewards it observes that come from the environment it traverses. In particular, its a formalism for trying to teach a learning agent to try to follow temporally related incentives, like a trail of bread crumbs.

bread.pngIn light of the smashing success of AlphaGo during the summer of 2016, reinforcement learning has received increasing attention from disparate academic communities such as machine learning, artificial intelligence, control systems, operations research, and applied mathematics. However, the research communities view the same problem fairly differently, and as a result there is a noticeable gap between methods for reinforcement learning which are theoretically sound versus practically useful. Compounding this “practicality gap” is the imprecise language which is used to describe practical approaches reinforcement learning, e.g., when describing “off policy policy evaluation,” what do we really mean, mathematically? The purpose of this blog post is to clarify the taxonomy of different problems in the field, and what are the challenges one must face when in each setting. Before doing so, I’ll review a few basic concepts so we are on the same page.

Reinforcement learning is defined by a setting in which an autonomous agent moves through some environment (state space), which is typically a compact subset of Euclidean space. At some time, the agent takes an action which belongs to some other compact Euclidean subset, and then the agent moves to a new state. The way this transition happens is random, but is dependent on the previous state and action taken. Then a reward is revealed by the environment, which informs us of how good the action was. See the cartoon below, taken from one of my recent presentations:mdp_extended.pngHere I am discussing the most general case where state and action spaces are continuous, which is in contrast to the context in which classical dynamic programming theory was developed. The average accumulation of rewards when starting in a given state is called its value:value.pngThe goal in reinforcement learning is to chose the action sequence so as to accumulate the most rewards on average. Thus, our target is always to find the action sequence which gives us the most value:optimal_value.pngNow that we are all on the same page, we can talk about what are the difficulties of actually determining the optimal policy. I’ll organize the discussion according to the following table:policy_table0.png

  1. Top left: To actually optimize the value function with respect to our action sequence, we need to first have access to the value function, i.e., the above equation requires the one before it as a prerequisite. This is can be stated intuitively as policy improvement cannot happen without first knowing how to evaluate a policy. But to evaluate a policy, the agent must move around its environment according to a fixed policy for awhile, meaning that a lot of behavior occurs that’s not actually valuable. This setting of learning the value associated with a fixed policy is called policy evaluation. This may only be guaranteed to converge when the policy (in the form of state-action-reward triples) we evaluate is associated with value function we are learning, i.e., on-policy. On the positive side, it is possible to find stochastic approximation algorithms that converge almost surely and determine the value function for a fixed policy. This fact fundamentally hinges on the fact that the state-action-reward triples are independent and identically distributed (i.i.d.), and the fact that it’s possible to reformulate this task as a stochastic convex optimization problem.
  2. Bottom left: When trying to determine the value function when the data does not follow the policy we are evaluating, we are in murky territory. People keep around “training data sets” consisting of state-action-reward triples and keep running their value function estimation scheme until “convergence” but it is not at all clear that convergence is possible, since any stochastic descent direction or fixed point iteration will necessarily be biased. This is because when we compute expectations over data, we are computing expectations over data that is different from that which defines the value function! Put another way, we lose the crucial i.i.d. property of our data. A lot of heuristics and complicated tricks are used to circumvent this issue, such as importance sampling, biased replay buffers, probabilistic exploration schemes, encoding Bayesian priors on the value function, etc. But we lack a good theoretical understanding of these tools in terms of how they actually help or hurt us in determining which actions have the most value.

    To summarize, when we are learning the value function associated with the actions the agent actually follows, this is on-policy policy evaluation. When we try to learn the value function not associated with the actions chosen, we’re in the murky territory of stochastic optimization with an expectation over data not seen.RL_table1.png

  3. Now, let’s move to the second column, still in the second row (bottom right). Policy learning when off policy is the province of classical Q learning and TD Backgammon. The idea here is that the agent, by taking random actions, and then learning their associated value, will eventually learn the optimal value function. However, a fundamental difference between this setting and the first column is that all formulations of policy learning eventually boil down to a non-convex optimization problem, with the exception of stochastic fixed point iteration (classical TD(0)-style Q learning), but it’s an open question whether fixed point iteration is stable under function approximation. Function approximation, unfortunately, is required when operating in continuous spaces. This non-convexity, rather than being an artifact of the way we define a nonlinear statistical model, as in deep learning, is instead intrinsic to the problem formulation, as it is for the Traveling Salesman Problem or Cauchy-Schwarz Quadratic Mutual Information.
    • Non-convexity in policy learning is a challenging problem because even if we design a method that converges with probability 1, it could converge to an arbitrarily bad stationary point. To avoid this, people try to tamper with the random i.i.d. nature of the exploratory policy used in off-policy learning by biasing it towards states that are “more likely” to yield high rewards. But doing so can break the stochastic contraction/descent properties of the learning method. This is one example of the explore-exploit tradeoff. To learn a good policy, we have to randomly fully explore the space. But doing so could take a very long time during which we get very little reward. So instead we bias the action distribution. But this bias can cause the learning process to diverge.
  4. Lastly, the top-right corner, policy learning when on-policy, is best equipped to mitigate problems associated with the explore-exploit tradeoff. However, doing so comes at the cost of data now exhibiting temporal dependence. That is, now the learning process must address the fact that it is observing correlated time-series data, a setting under which theoretical understanding of stochastic optimization methods mostly breaks down. Some efforts to understand how to do supervised learning from time-series in very specialized settings, when one limits the amount how non-i.i.d. the data may be, have been developed. But serious mathematical sophistication required to understand this setting, and the analysis pretty much says: we can’t change the policy too quickly while learning, otherwise the value function never converges. This data distribution difficulty compounds the difficulties of non-convexity and/or stochastic fixed point iteration with function approximation that are present in the off-policy setting. However, somewhat remarkably, experimentally, on-policy policy learning can obtain very good results…if it does not diverge.

To sum up the above discussion, let’s now look at the filled in table:RL_table.pngThe takeaway from this blog post is that (1) we would like to overcome non-convexity in policy learning. The only way to do this is with stochastic fixed point iteration, but we don’t know how to make this work with function approximation, which is required for continuous spaces; (2) non-convexity is a major contributor to the phenomenon called the explore-exploit tradeoff, for if we had convexity, through random exploration alone we would always eventually find the best policy; (3) on-policy policy learning can experimentally obtain very good results, but theoretically it requires treating trajectories through the space as time-series, a setting for which most understanding of statistical learning methods breaks down; and (4) hacks such as biasing action selection towards states that yield higher reward (greedy actions) can help us mitigate the explore-exploit tradeoff but we possibly do so at the cost of our learning process diverging.


A Digression on Narrow and General Artificial Intelligence

In my last blog post, I went on a diatribe on how far current technology is from anything resembling the machines of Blade Runner or Terminator. Admittedly, some of this diatribe was fueled by disdain for those who make strong declarations about topics for which they are unqualified. But this diatribe was equally fueled by the scientific limits of what is currently possible. Since evidence is more convincing than disdain, I’ve chosen to focus on providing some as the focus of a follow up blog post.

Specifically, I’ll focus on what in the lore we call Narrow Artificial Intelligence (e.g., an iPhone or the software equipped in most modern automobiles) and General Artificial Intelligence (think HAL from 2001: A Space Odyssey), and the technical challenges of transforming the former into the later. What is it that makes HAL so much harder to program than Siri or your learning thermostat? That’s the scope of this discussion. Let’s focus on the broad contours of how speech recognition systems work, due to their increasing ubiquity among consumer electronics, and the fact that most intelligent systems begin from a similar pipeline.

  1. First we need to accumulate training data, possibly on the order of billions of sample points.
    • This is affordable for speech systems. For designing management strategies or robotic controllers, it is definitely not.
  2. Now we take all this data and try to find a more illuminating representation of it (say, using clustering, principle component analysis (PCA), or a multi-layer variant of PCA called an autoencoder). To complete this step, typically one must analyze the sample covariance matrix of the data and solve some numerical optimization problem involving it. Alternatives to PCA/autoencoders include taking all data which is similar according to some distance and putting it in the same bag (k-means/nearest neighbor clustering). This is what Netflix used to do for recommending similar movies to what you’ve watched recently.
    • This signal representation step is called unsupervised learning, and is really about adopting the right “perspective” of the data domain. Think of this as your underlying worldview. It’s a lot easier to make decisions if you first learn to decide what are the defining characteristics of a given context.
  3. Now we need to build the statistical decision maker, i,.e., the supervised learning step. This step was briefly discussed in my last post, and follows ideas dating back to Sir R. A. Fischer. Now that we have designed our data representation (step 1), we need to decide how to map data to decisions. This happens by first making sure that all of our data has labels (apple or orange, happy or sad), or more generically target variables. Then we postulate that our decision function takes a specific form, say, a mixture of Gaussians centered at our input data (kernel methods) or a deep convolutional network. Then, we specify a merit criterion which is small when our decision function maps data to be close to its target variable, and we try to choose our decisions such that this loss function is minimized on average over all data.
    • Actually solving this minimization problem and it’s interplay with the choice of the functional form of our decision function is challenging. As a result, optimization for supervised learning is a very active area of research among scientists spanning the fields of statistics, machine learning, signal processing, controls, operations research, and applied mathematics. I dedicate much of my professional time to this challenge. Typically we build upon iterative numerical schemes such as gradient descent or Newton’s method, and their stochastic variants which operate on small subsets of data per step.

Now that we have outlined what is the standard pipeline for learning systems as they exist today, which people intuitively think of as intelligent, we can begin to broach why this intelligence is Narrow, not General. All of the aforementioned machinery is used to build a decision system that spits out an answer of the form apple or orange, or a sequence of apples versus a sequence of oranges. This works well enough for simple tasks like building up a text response to a spoke statement when we combine it in a piecemeal fashion with other similarly designed statistically generated decisions. However, fundamental challenges remain when trying to “teach” a system higher level knowledge as compared with simple one-layer Q&A:

  1. We don’t know how to combine decision functions (estimators) in a mathematically stable/convergent way. I can teach my learning system to recognize a cat exceptionally well from a bunch of images of cat and no cat. However, we know that cats belong to a phylogenetic tree, which begins at kingdom and proceeds to phylum, …, genus, species. Each of these classifications must be done at a branch of the tree with a distinct decision function that should be daisy-chained from its root to its leaf.  We have NO IDEA how to do this in a way that makes sense from the perspective of mathematical statistics. Some efforts in finance and operations research on two-stage (and multi-stage) stochastic programming have tried to answer this question but can only be addressed for very restrictive cases. But, without devolving into a detailed discussion of multi-stage stochastic programming, this bottleneck exists because of fundamental limits of our current optimization algorithms, which may be insurmountable. The way around these limits that has appeared in practical systems like an Autopilot or Siri is to “hand-tune” when estimators fit into more complicated systems, but this only works well when each component of the system is well-understood, as in classical aerospace applications. For general data-driven learning systems, this is far from true.
  2. Suppose for the sake of argument that issue 1 is solved, and that we can have an agent learn a hierarchical decision system. Then we are faced with how to encode learning higher-level behavior like a principle or a value into statistical learning systems. Here I mean principle or value in the common understanding, as in “our society values ambition and vision.” One possible solution would be to encode an intangible quality like virtue or bravery or honesty mathematically as some sort of constraint on the way the system makes decisions and augmenting the above learning procedures to address these constraints. But then engineers will find themselves in the murky business of mathematically encoding human emotions and principles, which in my humble opinion is impossible. While psychologists have endeavored to find quantitative measures of human emotions, there’s still a huge gap between “we observed something in a controlled lab experiment” and “this is how a certain human behavioral phenomenon occurs in general.” And therefore, most of the quantitative models developed by psychologists cannot be applied to develop learning systems as if they are just another engineering tool.
  3. Once the preceding two problems are taken care of (which may take decades), we are still faced with the open-ended question: if we string together enough binary classifications or regressions together, does higher-level reasoning emerge? In essence, what I am asking is, do most questions of “why?” or “how?” break down into a string of “what?” or “how much?” questions? The answer most people would give is sometimes yes, sometimes no. The answer to “why did the cookies burn?” can be easily addressed with a sequence of “what” questions, but the answer to “why is faith in democracy waning globally?” is much less straightforward. We are arbitrarily far from having a machine learn from experience how to answer the later question, and this is what I claim produces the chasm between Narrow AI and General AI.

I should caveat the preceding discussion by the fact that there are many ways of approaching the construction of machine intelligence from data and experience. I have chosen to adopt a perspective based on mathematical statistics, optimization, and machine learning, since these fields are driving forces behind recent advances at Google, Microsoft, Facebook, etc. The perspective/criticism from members of other research communities, especially computer vision, natural language processing, linguistics, and psychology, would help to qualify the above discussion, and hopefully spawn the development of new ideas to advance the state of machine understanding.

The Chasm between Artificial Intelligence in the Public Imagination and Technology Today

The prognostications of Silicon Valley moguls such as Elon Musk, Peter Thiel, and company would have you believe we are standing on the edge of an artificial intelligence apocalypse, with smart machines coming to eat everyone’s lunch and undermine the viability of the economy as it functions today. This anxiety is exacerbated by recent troubling news of the way Big Tech is manipulating things we hold dear, such as local organic dispensers or civil democratic discourse. Combined with a steady diet of dystopian motion picture depictions of intelligent machines such as 2001: A Space Odyssey, Bladerunner, Her, Ex Machina, The Terminator, AI, Alien: Covenant, and so on, one may come away with the impression that the magnates of Big Tech are correct: the future of humanity is slipping out of the grasp of humans and into that of some nameless,  faceless Machine Learning systems.

I’ve spent the past several years during graduate school and onward becoming intimately familiar with the mathematical underpinnings of learning theory and how it can be used to design autonomous learning systems. What I can tell you is that we are exceptionally far from designing anything resembling intelligence in the popular usage of the word. Today’s learning machines can successfully learn from their sensory feeds to address rudimentary questions such as “am I looking at an apple?” with a high degree of accuracy. This is called supervised learning, or statistical inference, and dates back to Sir R. A. Fischer in 1917. The key difference between then and now is increased data availability and computing infrastructure, but the bones of how the learning occurs is nearly as basic as it was in 1917.

Within the next five to ten years will be able to reliably and from streaming information have a learning system, not just perform object recognition (“is that an apple”?), but learn to autonomously navigate its environment through past trial and error experiences. To be specific, we will be able to teach an autonomous agent to ask the right questions that lead it to find the apple. This is called reinforcement learning, and has been around at least since Richard Bellman in 1954, but the framework has gained salience more recently, again due to increases in computing power and data availability. The folks at Google DeepMind, among others, are hard at work on making this a reality for simple videogames, and from that we will be able to extrapolate the techniques to problems faced by robotic systems.

But what no engineered combination of supervised or reinforcement learning systems can tackle is how to mathematically formulate the meaning of higher level questions or abstract notions. There is an insurmountable barrier between “how should you steer to the apple?” and “what is the meaning or purpose of an apple?” that makes it beyond the realm of engineered systems. Workarounds based on regurgitating information from Wikipedia or Google ignore the issue of how a machine should actually form a representation of higher-level knowledge based on what it’s trying to do. There really is nothing that both makes mathematical sense and captures the way abstract concepts actually can define the real world and our experience of it.

I would like to go out on a limb at this point and claim that we will never be able to assign a numerical value to intangible concepts like trust, friendship, creativity, and so on. Without a formal theory of how learning systems should process and understand abstract concepts, no engineered systems will do so. We may see exceptions within our lifetime, but these will not go beyond a “special sauce” cobbling together of different features to dupe us into believing we are observing intelligence when we really are not. Siri and Alexa don’t understand you. Not really. They just built up an intricate predictive statistical model of your speech patterns and browsing habits. But there is no higher level understanding.

Equipped with this understanding of the limitations of the artificial intelligence of our imagination, we can shift focus to how technology is currently evolving. Products like the iPhone, HomePod, EchoXBox One, etc., promise consumers intelligent companionship that can understand when important events are upcoming in their lives and help them find recipes in real time for things they are cooking. For all sorts of rudimentary tasks like recounting information from the web or inside your smartphone apps, they are well-equipped to do so. But this impressive functionality breaks down when you replace basic informational questions with ones that require any sort of creativity and synthesis. For higher-level knowledge, there’s nothing more than regurgitating Wikipedia.

These products are a lot more rudimentary than the marketing campaigns for Amazon Echo, HomePod, etc. would have you believe. Big Tech has a vested interest in convincing the public that autonomous learning systems are much more advanced than they are, so that in five to ten years we are more likely to buy into the rush towards autonomous cars. Automobile companies are pouring billions of dollars into the development of AI for autonomous driving, ignoring the sheer number of higher-level questions that are required during our daily driving experience, such as how to provide an intuitive and adaptive response to the behavior of a cyclist in a neighboring lane on the road. This should give us cause for concern.

More pernicious than a few malfunctioning fancy cars are the long-term economic effects that industrial automation has wrought on the working classes of Western countries. Automation has been so effective because it allows autonomous systems to operate only within the realm of low-level questions and repetitive mechanical tasks, for which identifying the apple and navigating to it are enough.  As countries have automated factories, warehouses, cashiers, and other components of retail, decent working people have watched their livelihoods evaporate. We cannot ask companies to reduce efficiency and incur greater operating costs just due to our bleeding hearts. The task we are faced with is how to redirect working people to tasks that are well-suited to their skill set, and too challenging for industrial engineers to automate. The AI Apocalypse may not be coming, but the bifurcation of advanced industrialized economies is already here.