In my last blog post, I went on a diatribe about how far current technology is from anything resembling the machines of Blade Runner or Terminator. Admittedly, some of this diatribe was fueled by disdain for those who make strong declarations about topics for which they are unqualified. But it was equally fueled by the scientific limits of what is currently possible. Since evidence is more convincing than disdain, I’ve chosen to make that evidence the focus of this follow-up post.
Specifically, I’ll focus on what in the lore we call Narrow Artificial Intelligence (e.g., an iPhone or the software in most modern automobiles) and General Artificial Intelligence (think HAL from 2001: A Space Odyssey), and the technical challenges of transforming the former into the latter. What is it that makes HAL so much harder to program than Siri or your learning thermostat? That’s the scope of this discussion. Let’s focus on the broad contours of how speech recognition systems work, because they are increasingly ubiquitous in consumer electronics and because most intelligent systems begin from a similar pipeline.
- First we need to accumulate training data, possibly on the order of billions of sample points.
- This is affordable for speech systems. For designing management strategies or robotic controllers, it is definitely not.
- Now we take all this data and try to find a more illuminating representation of it (say, using clustering, principal component analysis (PCA), or a multi-layer variant of PCA called an autoencoder). To complete this step, typically one must analyze the sample covariance matrix of the data and solve some numerical optimization problem involving it. Alternatives to PCA/autoencoders include taking all data which is similar according to some distance and putting it in the same bag (k-means/nearest neighbor clustering). This is what Netflix used to do for recommending movies similar to what you’ve watched recently.
- This signal representation step is called unsupervised learning, and is really about adopting the right “perspective” on the data domain. Think of this as your underlying worldview. It’s a lot easier to make decisions if you first learn what the defining characteristics of a given context are.
- Now we need to build the statistical decision maker, i.e., the supervised learning step. This step was briefly discussed in my last post, and follows ideas dating back to Sir R. A. Fisher. Now that we have designed our data representation (the previous step), we need to decide how to map data to decisions. This happens by first making sure that all of our data has labels (apple or orange, happy or sad), or more generically target variables. Then we postulate that our decision function takes a specific form, say, a mixture of Gaussians centered at our input data (kernel methods) or a deep convolutional network. Then, we specify a loss function (a merit criterion) which is small when our decision function maps data close to its target variable, and we try to choose our decision function such that this loss is minimized on average over all data.
- Actually solving this minimization problem, and handling its interplay with the choice of functional form for our decision function, is challenging. As a result, optimization for supervised learning is a very active area of research among scientists spanning the fields of statistics, machine learning, signal processing, controls, operations research, and applied mathematics. I dedicate much of my professional time to this challenge. Typically we build upon iterative numerical schemes such as gradient descent or Newton’s method, and their stochastic variants which operate on small subsets of data per step.
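To make the pipeline above concrete, here is a minimal sketch of the last two steps in plain Python. Everything in it is an assumption made for illustration: the synthetic two-class data, the function names, and the step size and batch size. It uses power iteration on the sample covariance matrix as a stand-in for PCA, and a simple logistic regression in place of the kernel methods or deep networks mentioned above, trained with mini-batch stochastic gradient descent. It is not how any production speech system is built.

```python
import math
import random


def make_data(n, rng):
    """Synthetic 2-D data with labels 0 ("apple") or 1 ("orange"). Purely illustrative."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        center = 2.0 if y == 1 else -2.0
        # Both coordinates shift with the class, so the classes separate along the diagonal.
        data.append(([center + rng.gauss(0, 1), center + rng.gauss(0, 1)], y))
    return data


def top_principal_direction(points, iters=100):
    """Unsupervised step: leading eigenvector of the sample covariance (power iteration)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def project(x, mean, v):
    """New representation: coordinate of x along the principal direction."""
    return sum((xi - mi) * vi for xi, mi, vi in zip(x, mean, v))


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(features, labels, rng, steps=500, batch=16, lr=0.1):
    """Supervised step: minimize the average logistic loss with mini-batch SGD."""
    w, b = 0.0, 0.0
    indices = list(range(len(features)))
    for _ in range(steps):
        gw = gb = 0.0
        for i in rng.sample(indices, batch):  # small random subset of data per step
            err = sigmoid(w * features[i] + b) - labels[i]  # d(loss)/d(score)
            gw += err * features[i]
            gb += err
        w -= lr * gw / batch
        b -= lr * gb / batch
    return w, b


def accuracy(features, labels, w, b):
    hits = sum((sigmoid(w * f + b) >= 0.5) == (y == 1)
               for f, y in zip(features, labels))
    return hits / len(labels)


rng = random.Random(0)
train, test = make_data(400, rng), make_data(200, rng)

mean, v = top_principal_direction([x for x, _ in train])
z_train = [project(x, mean, v) for x, _ in train]
z_test = [project(x, mean, v) for x, _ in test]

w, b = train_sgd(z_train, [y for _, y in train], rng)
test_acc = accuracy(z_test, [y for _, y in test], w, b)
print("principal direction:", v)
print("held-out accuracy:", test_acc)
```

Note that the output of all this machinery is still a single apple-or-orange decision. The representation is learned first and then frozen before the decision function is fit, which is the piecemeal structure described in the steps above.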
Now that we have outlined the standard pipeline for learning systems as they exist today, which people intuitively think of as intelligent, we can begin to broach why this intelligence is Narrow, not General. All of the aforementioned machinery is used to build a decision system that spits out an answer of the form apple or orange, or a sequence of apples versus a sequence of oranges. This works well enough for simple tasks like building up a text response to a spoken statement when we combine it in a piecemeal fashion with other similarly designed statistically generated decisions. However, fundamental challenges remain when trying to “teach” a system higher-level knowledge as compared with simple one-layer Q&A:
- We don’t know how to combine decision functions (estimators) in a mathematically stable/convergent way. I can teach my learning system to recognize a cat exceptionally well from a bunch of images labeled cat and no cat. However, we know that cats belong to a phylogenetic tree, which begins at kingdom and proceeds to phylum, …, genus, species. Each of these classifications must be done at a branch of the tree with a distinct decision function, and these functions should be daisy-chained from root to leaf. We have NO IDEA how to do this in a way that makes sense from the perspective of mathematical statistics. Some efforts in finance and operations research on two-stage (and multi-stage) stochastic programming have tried to answer this question, but have succeeded only in very restrictive cases. Without devolving into a detailed discussion of multi-stage stochastic programming, this bottleneck exists because of fundamental limits of our current optimization algorithms, which may be insurmountable. The way around these limits that has appeared in practical systems like an Autopilot or Siri is to “hand-tune” how estimators fit into more complicated systems, but this only works well when each component of the system is well-understood, as in classical aerospace applications. For general data-driven learning systems, this is far from true.
- Suppose for the sake of argument that the first issue is solved, and that we can have an agent learn a hierarchical decision system. Then we are faced with how to encode higher-level behavior like a principle or a value into statistical learning systems. Here I mean principle or value in the common understanding, as in “our society values ambition and vision.” One possible solution would be to encode an intangible quality like virtue or bravery or honesty mathematically as some sort of constraint on the way the system makes decisions, and to augment the above learning procedures to address these constraints. But then engineers will find themselves in the murky business of mathematically encoding human emotions and principles, which in my humble opinion is impossible. While psychologists have endeavored to find quantitative measures of human emotions, there’s still a huge gap between “we observed something in a controlled lab experiment” and “this is how a certain human behavioral phenomenon occurs in general.” Therefore, most of the quantitative models developed by psychologists cannot be applied to develop learning systems as if they were just another engineering tool.
- Once the preceding two problems are taken care of (which may take decades), we are still faced with the open-ended question: if we string together enough binary classifications or regressions, does higher-level reasoning emerge? In essence, what I am asking is, do most questions of “why?” or “how?” break down into a string of “what?” or “how much?” questions? The answer most people would give is sometimes yes, sometimes no. The answer to “why did the cookies burn?” can be easily addressed with a sequence of “what” questions, but the answer to “why is faith in democracy waning globally?” is much less straightforward. We are arbitrarily far from having a machine learn from experience how to answer the latter question, and this is what I claim produces the chasm between Narrow AI and General AI.
I should caveat the preceding discussion by noting that there are many ways of approaching the construction of machine intelligence from data and experience. I have chosen to adopt a perspective based on mathematical statistics, optimization, and machine learning, since these fields are driving forces behind recent advances at Google, Microsoft, Facebook, etc. Perspectives and criticism from members of other research communities, especially computer vision, natural language processing, linguistics, and psychology, would help to qualify the above discussion, and hopefully spawn the development of new ideas to advance the state of machine understanding.