On Bayesian vs. Frequentist Learning

The dominant paradigm in modern machine learning is frequentist: we are given a batch of data in the form of feature vectors paired with binary labels or real values, we define some statistical model, and we define a loss function that is small when the estimates given by this model are close to the target variables. The model is then found by minimizing the loss, averaged over the data, with respect to the model parameters. This procedure subsumes model fitting in linear regression and classification, Gaussian mixture models, and deep networks. The reason statisticians would consider this approach frequentist, as opposed to Bayesian, is that information about the data distribution is not used in any way to inform how the model is selected, and we typically assume a uniform prior over distinct data points.
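
To make the recipe concrete, here is a minimal sketch of this loss-minimization procedure for a linear model with squared loss, fit by gradient descent; the data, step size, and iteration count are purely illustrative.

```python
import numpy as np

# Illustrative batch of data: feature vectors X and real-valued targets y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Frequentist fitting: minimize the loss averaged over the batch with
# respect to the model parameters w, here by plain gradient descent.
w = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = (2.0 / len(y)) * X.T @ (X @ w - y)  # gradient of the mean squared error
    w -= lr * grad
```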

In this context, model selection is typically done based on domain expertise, rather than explicit metrics or structural properties of the data distribution. There are a few reasons for this:

  1. One could argue that model selection (neural architecture specification, kernel hyper-parameters, etc.) is a heuristic attempt to encode information about the distribution;
  2. Even if one could encode priors about the data distribution explicitly into the training phase, for instance by weighting data points according to their importance, doing so turns a standard numerical optimization procedure such as the stochastic gradient method into a series of maximum a posteriori (MAP) updates, which may be computationally intractable (see the sketch after this list).

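As a rough illustration of point 2, the sketch below contrasts a plain stochastic gradient step with a MAP-style step that assumes a Gaussian prior on the weights and a per-example importance weight. Both `importance` and `prior_precision` are hypothetical quantities that would have to be supplied by knowledge of the data distribution.

```python
import numpy as np

def sgd_step(w, x, y, lr):
    # Standard stochastic gradient step on the squared loss for one example.
    return w - lr * 2.0 * (x @ w - y) * x

def map_step(w, x, y, lr, importance, prior_precision):
    # MAP-style update: the same gradient, but scaled by an (assumed known)
    # importance weight for this example and pulled toward a zero-mean
    # Gaussian prior with the given precision.
    grad_likelihood = importance * 2.0 * (x @ w - y) * x
    grad_prior = prior_precision * w
    return w - lr * (grad_likelihood + grad_prior)
```
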
Therefore, in order to employ information about arbitrary data distributions correctly in the modern model-fitting paradigm, we need succinct representations of posterior distributions, as well as ways to make MAP updates efficient.

It’s well known that, when the relationship between observations and hidden states is linear and the noise is Gaussian, the posterior mean and covariance of a hidden parameter can be tracked exactly by a Kalman filter. This fact has led to recent efforts to use Kalman filters in machine learning. However, linearity is often a restrictive assumption, and many of the modern advances in perception and speech recognition have been driven by nonlinear modeling. It therefore makes sense to integrate posterior distributions defined by nonlinear input-output relationships into model fitting.
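
As a reference point, here is a minimal sketch of one predict/update cycle of the Kalman filter for a linear-Gaussian state-space model, where A, Q, H, and R denote the usual dynamics, process-noise, observation, and observation-noise matrices:

```python
import numpy as np

def kalman_update(mean, cov, A, Q, H, R, y):
    """One predict/update cycle for the model
    x_t = A x_{t-1} + noise(Q),   y_t = H x_t + noise(R)."""
    # Predict: propagate the posterior mean and covariance through the dynamics.
    mean_pred = A @ mean
    cov_pred = A @ cov @ A.T + Q
    # Update: fold in the new observation y via the Kalman gain.
    S = H @ cov_pred @ H.T + R                # innovation covariance
    K = cov_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    mean_new = mean_pred + K @ (y - H @ mean_pred)
    cov_new = (np.eye(len(mean)) - K @ H) @ cov_pred
    return mean_new, cov_new
```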

When going beyond linear input-output models, tools such as Gaussian processes (GPs) and Monte Carlo (MC) methods become of interest; they facilitate representing Gaussian and arbitrary observation models, respectively. These tools are known to satisfy Bayesian analogues of the universal approximation properties of neural networks. Unfortunately, they suffer from an instance of the curse of dimensionality when the training sample size is large or samples arrive sequentially. In particular, computing the posterior mean and covariance of a Gaussian process has complexity cubic in the sample size. Similarly, the cost of computing the filtering density of a Monte Carlo method, a.k.a. a particle filter, scales with the number of particles, which must go to infinity to attain statistical consistency.
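
The cubic scaling comes from the need to solve an N x N linear system in the training kernel matrix. A minimal sketch of exact GP regression with a squared-exponential kernel (hyper-parameters fixed for illustration) makes the bottleneck explicit:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel between two sets of points.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    # Exact GP regression: the N x N Cholesky factorization below is the
    # O(N^3) bottleneck in the sample size N.
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_test, X_train)
    L = np.linalg.cholesky(K)                                  # O(N^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K^{-1} y
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    cov = rbf_kernel(X_test, X_test) - v.T @ v
    return mean, cov
```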

Therefore, in order to make Bayesian representations of posterior distributions applicable to the era of big data analytics, the computational challenges associated with large sample/particle sizes must be surmounted. In particular, we need finite-memory approximations of GPs and MC methods that come close to statistical consistency. While a number of memory-reduction techniques for both tools exist, their validation is predominantly experimental [see Ch. 8 of the Rasmussen GP textbook], and a conceptual understanding of how to trade off memory and statistical consistency remains an active area of research.
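
To give a flavor of what such memory-reduction techniques look like, here is a sketch in the spirit of the subset-of-regressors (Nyström-type) family of sparse GP approximations treated in that chapter. It reuses the `rbf_kernel` helper from the GP sketch above, and `X_inducing` is a hypothetical, user-chosen set of M inducing points; routing all kernel computations through them makes the dominant solve M x M rather than N x N.

```python
import numpy as np

def sor_gp_mean(X_train, y_train, X_test, X_inducing, noise=1e-2):
    # Subset-of-regressors style predictive mean: all kernel computations go
    # through the M inducing points, so memory and the linear solve scale
    # with M instead of the full sample size N.
    K_mm = rbf_kernel(X_inducing, X_inducing)
    K_nm = rbf_kernel(X_train, X_inducing)
    K_sm = rbf_kernel(X_test, X_inducing)
    A = noise * K_mm + K_nm.T @ K_nm            # M x M system
    return K_sm @ np.linalg.solve(A, K_nm.T @ y_train)
```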

In an upcoming companion blog post, we’ll give some perspective and discuss the conceptual and practical tradeoffs of memory-reduction techniques for Bayesian methods.
