**Autoencoders****Density estimation****Markov chain****Graphical model****Restricted Boltzmann Machines****Training an RBM****Applications of RBM**

*Contributed by: Arun K LinkedIn Profile: https://www.linkedin.com/in/arunsme/ *

In Machine learning, supervised learning methods are used when the objective is to learn mapping between the attributes and the target in the data. When the objective is to identify the underlying structure or the pattern in the data, unsupervised learning methods are useful. Some of the popular unsupervised learning methods are Clustering, Dimensionality reduction, Association mining, Anomaly detection and Generative models. Each of these techniques have a different pattern recognition objective such as identifying latent grouping, identifying latent space, finding irregularities in the data, density estimation or generating new samples from the data. In the current article we will focus on generative models, specifically Boltzmann Machine (BM), its popular variant Restricted Boltzmann Machine (RBM), working of RBM and some of its applications. Before deep-diving into details of BM, we will discuss some of the fundamental concepts that are vital to understanding BM.

**Autoencoders**

An Autoencoder is a neural network that learns two functions; 1) **Encode**: create a compressed or encoded representation of the input data, 2) **Decode**: recreate the input data from the encoded representation. The recreated representation should be close to the original input. The encoder function is typically referred to as reducing the data in observed space to latent space. Figure 1 shows a typical architecture of an autoencoder. In this architecture, it is indicated that the input six-dimensional observed space is reduced to two-dimensional latent space. Autoencoders learn the parameters of the network during back propagation similar to supervised learning networks but the difference is in the cost function. While supervised learning networks use target variable values in the cost function, autoencoders use the input values.

Figure 1. Typical representation of autoencoders

Once an autoencoder is trained, the encoder part of the network can be discarded and the decoder part can be used to generate new data in the observed space by creating random samples of data in latent space and mapping them to observed space. This is the core idea of generative models. A brief account of autoencoders is presented due to the similarity between autoencoders and Boltzmann Machine (BM). Like Autoencoders, BMs are useful to extract latent space from the data. The difference is in the architecture, the representation of the latent space and the training process.

**Density Estimation**

The association between a random continuous variable ‘x’ and the probability of it assuming specific values ‘p(x)’ is referred to as the probability density function or simply ‘density’. Figure 2 shows a typical density function. Knowing the probability density for a random variable can be useful to determine how likely the random variable is to assume a specific value. From the density plot in figure 2 it is easy to know that the variable x is more likely to assume a value of 50 and less likely to assume a value of 65.

Figure 2. example of a density function.

In practice, we may not be able to assess or observe all possible outcomes of a random variable due to which we generally do not know the actual density function. In such conditions, we must rely on approximating the density function from a sample of observations. Approximating a density function using a sample of observations is referred to as ‘Density estimation’. Learning density estimate from the training samples is fundamental to generative models.

Two types of density estimations are generally used in generative models; **Explicit Density** **Estimation (EDE)** and **Implicit Density Estimation (IDE)**. In EDE, predefined density functions are used to approximate the relationship between observations and their probability. The observed data is fit to predefined function by manipulating a fixed set of parameters of the function. An example is trying to fit given data to normal distribution using mean and the standard deviations of the samples. This type of density estimation is also known as parametric density estimation. In IDE, predefined density functions are not used. Instead an algorithm is used to approximate the probability distribution of the data. Kernel density approximation is an example of this type. Though the IDE methods use parameters for approximation, they cannot be directly manipulated the way they are in EDE. Figure 3 shows the taxonomy of different generative models based on the type of density estimation used.

Figure 3. Taxonomy of generative models (Image source [1])

Generative Adversial Network (GAN) is an Implicit density based generative model. Variational Autoencoder (VAE) and Boltzmann Machine (BM) are the explicit density based generative models.

**Markov chain**

A Markov chain is a probabilistic model used to estimate a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. In a Markov chain, the future state depends only on the present state and not on the past states. An example of Markov’s process is show in figure 4. The position of the randomly walking person at instant t+1 is dependent on the current state t and not on the previous states (t-1, t-2, …..). This behavior is referred to as Markov property.

Figure 4. Random walk: Markov process (image source [2])

**Graphical model**

A graphical probabilistic model is a graphical representation used to expresses the conditional dependency between random variables. A graphical model has two components in it; Vertices and edges. The vertices indicate the state of random variable and the edge indicates direction of transformation. Figure 5 shows two main types of computational graphs; directed and undirected.

Figure 5. Directed and Undirected graph models

In directed graph, the state of the variable can transform in one direction. In the directed graph in figure 5, the state of the variable can transform from A to B or C to D, indicated by the direction of the edge and not from D to C or B to A. Edges are directed arrows in Directed graph. In undirected graph, there is no specific direction for the state of the variable to transform. In the undirected graph in figure 5, the state of the variable can transform from A to B or B to A, or from C to D or D to A. Edges are plain arcs in undirected graph. Figure 6 shows an undirected graphical model of a Markov process of diet habit of a baby. The graph model is used to indicate a baby’s choice for the next meal with the associated probabilities. The baby’s choice of next meal depends solely on what it is eating now and not what it ate earlier. The probability of choosing a specific food for next meal is calculated based on historic observations.

Figure 6. Undirected graph model of a Markov process

A set of random variables having Markov property and described by an undirected graph is referred to as **Markov Random Field (MRF)** or **Markov network**. In other words, a random field is said to be a Markov random field if it satisfies Markov property. BM is a type of MRF.

We now have a grasp on some of the fundamental concepts to understand BM. **A Boltzmann Machine (BM) is a probabilistic generative undirected graph model that satisfies Markov property. BMs learn the probability density from the input data** **to generating new samples from the same distribution**. A BM has an input or visible layer and one or several hidden layers. There is no output layer. Figure 6 shows a typical architecture of a BM with single hidden layer.

Figure 6. Typical architecture of Boltzmann Machine

The neurons in the network learn to make stochastic decisions about whether to turn on or off based on the data fed to the network during training. This helps the BM discover and model the complex underlying patterns in the data. A vital difference between BM and other popular neural net architectures is that the neurons in BM are connected not only to neurons in other layers but also to neurons within the same layer. Essentially, every neuron is connected to every other neuron in the network. This imposes a stiff challenge in training a BM and this version of BM, referred to as ‘Unrestricted Boltzmann Machine’ has very little practical use. Eliminating the connections between the neurons in the same layer relaxes the challenges in training the network and such networks are called as Restricted Boltzmann Machine (RBM). In practice, RBMs are used in verity of applications due to simpler training process compared to BMs

**Restricted Boltzmann Machines**

As indicated earlier, RBM is a class of BM with single hidden layer and with a bipartite connection. This means every neuron in the visible layer is connected to every neuron in the hidden layer but the neurons in the same layer are not connected to each other. Figure 7 shows a typical architecture of an RBM. Note the differences in the connections between the neurons in figures 6 and 7. This is the only difference between the unrestricted BM and RBM.

Figure 7. Typical architecture of Restricted Boltzmann Machine

**Training an RBM**

Let’s consider a simple RMB with 3 neurons in the visible layer and 2 neurons in the hidden layer as shown in figure 8. An RBM has two sets of biases; one set for the visible layer represented by ‘*a** _{i}*’ (

*a*

_{1}*, a*

_{2}*, a*

*) and one set for the hidden layer represented by ‘*

_{3}*b*

*’ (*

_{j}*b*

_{1}*, b*

*) in figure 8. The weights of the network are represented by ‘*

_{2}*ω*

*’. There is a total of six weights in the network*

_{ij}**ω**= [

*ω*

_{11}*,*

*ω*

_{12}*,*

*ω*

_{21}*,*

*ω*

_{22}*,*

*ω*

_{31}*,*

*ω*

*]*

_{32}*.*During the forward pass, the latent space output ht is estimated using the value of visible layer from previous iteration vt-1. During the backward pass the visible layer output or the reconstructed values vt is estimated using latent space vector ht. The formula used are shown in the figure and the function ‘

*f*’ is the activation function used (generally sigmoid). ‘

*t*’ is the iteration number. Note that v0 corresponds to the input matrix [x1, x2,x3]. Also, since the network is symmetric the weights ij=ji.

Figure 8. Forward and backward passes in RBM

The difference between the initial input v0 and the reconstructed value vt is referred to as reconstruction error. The learning objective in RBM is to update the weights and biases iteratively such that the reconstruction error is minimized, similar to that in autoencoders. This is diagrammatically represented for a bivariate distribution in figure 9.

Figure 9. Representation of actual and estimated distributions and the reconstruction error

It is important to note that, while the supervised models follow discriminative learning approach in which the model is trained to predict a single value, the RBMs follow a **generative learning** approach in which the **model is trained to predict a set of values or the distribution**.

To quantify the difference between the actual and the estimated distributions, KL-Divergence or Kullback–Leibler divergence score (*D** _{KL}*) is used. The equation to calculate the score is given below. Smaller the reconstruction error, lower the KL-Divergence score.

DKLP(x)||Q([x])=P(x)ln P(x)Q(x)

KL-Divergence measures the non-overlapping areas under the two distributions and the RBM’s optimization algorithm tries to minimize this difference by changing the weights so that the reconstructed distribution matches closely to the input distribution. The cost function used for training RBMs is called ‘Contrastive Divergence’ function. A detailed account of this cost function and process of training RBMs is presented in Geoffrey Hinton’s Guide on training RBMs.

**Applications of RBM**

During the early days of deep learning, RBMs were used to build a variety of applications such as Dimensionality reduction, Recommender systems, Topic modelling. However, in recent times, RBMs have been almost replaced by **Generative Adversarial Networks** (GANs) or Variation Autoencoder (VAEs) in different machine learning applications.