Learning Restricted Boltzmann Machines using Difference of Convex Functions Optimization
Probabilistic generative models learn useful representations from unlabeled data, which can then be used for downstream tasks such as classification, regression, or information retrieval. One such energy-based probabilistic generative model is the restricted Boltzmann machine (RBM), which forms the building block of several deep generative models. However, RBMs are difficult to learn because computing the gradient of the RBM's log-likelihood function involves the intractable partition function (the normalizing constant of the RBM's distribution). Developing efficient algorithms for learning RBMs is therefore an important research direction.
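To make the source of the difficulty concrete, the standard log-likelihood gradient of an energy-based model with hidden units can be written as below. The symbols $E$, $\theta$, $v$, $h$ and $Z$ are notation assumed here for illustration; the abstract itself does not fix any notation.

```latex
% Assumed notation: E(v,h;\theta) is the energy, v the visible units,
% h the hidden units, Z(\theta) the partition function.
\[
p(v;\theta) = \frac{1}{Z(\theta)} \sum_{h} e^{-E(v,h;\theta)},
\qquad
Z(\theta) = \sum_{v,h} e^{-E(v,h;\theta)}
\]
\[
\frac{\partial \log p(v;\theta)}{\partial \theta}
= -\,\mathbb{E}_{p(h \mid v;\theta)}\!\left[\frac{\partial E(v,h;\theta)}{\partial \theta}\right]
+ \mathbb{E}_{p(v,h;\theta)}\!\left[\frac{\partial E(v,h;\theta)}{\partial \theta}\right]
\]
```

The first expectation is cheap because the hidden units are conditionally independent given $v$. The second comes from differentiating $\log Z(\theta)$ and is taken under the full model distribution, which is why it is usually approximated by sampling.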
In this talk, I explore maximum likelihood learning of two types of RBMs: the binary-binary RBM (BB-RBM) and the Gaussian-binary RBM (GB-RBM). First, for the BB-RBM, I will show how to exploit the fact that its log-likelihood function can be expressed as a difference of convex functions w.r.t. its parameters, and devise a stochastic variant of the difference of convex functions (DC) optimization algorithm, termed stochastic-DCP (S-DCP). In this algorithm, the convex optimization problem at each iteration is solved approximately through a few iterations of stochastic gradient descent. I will also discuss how the contrastive divergence (CD) algorithm, the current standard algorithm for learning RBMs, can be derived as a special case of S-DCP, and present a simple extension of S-DCP that accommodates standard techniques for improving RBM learning, such as centering and diagonal scaling.
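The idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the speaker's implementation: the function name `sdcp_update`, the parameter names (`W`, `b`, `c`, `d`, `k`, `lr`), the choice to start the Gibbs chain at the data, and the choice to continue the chain across inner steps are all hypothetical. The negative log-likelihood is treated as log Z minus a data-dependent convex term; the gradient of the data-dependent term is computed once at the current parameters, and the resulting convex subproblem is then approximately minimized with `d` stochastic gradient steps.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sdcp_update(W, b, c, v_data, rng, d=3, k=1, lr=0.1):
    """One outer S-DCP-style iteration on a mini-batch (illustrative sketch).

    W: (n_visible, n_hidden) weights; b: visible biases; c: hidden biases.
    d: inner SGD steps on the convex subproblem (assumed name).
    k: Gibbs steps used to estimate the gradient of log Z (assumed name).
    """
    n = len(v_data)
    # Gradient of the data-dependent convex term at the current parameters.
    # It is computed once and held fixed during the d inner steps.
    ph_data = sigmoid(v_data @ W + c)
    pos_W = v_data.T @ ph_data / n
    pos_b = v_data.mean(axis=0)
    pos_c = ph_data.mean(axis=0)

    W, b, c = W.copy(), b.copy(), c.copy()  # leave the inputs untouched
    v = v_data.copy()                        # assumption: chain starts at the data
    for _ in range(d):
        # Estimate the gradient of log Z with k Gibbs steps under the
        # current inner parameters (assumption: the chain is not restarted).
        for _ in range(k):
            h = (rng.random((n, W.shape[1])) < sigmoid(v @ W + c)).astype(float)
            v = (rng.random(v.shape) < sigmoid(h @ W.T + b)).astype(float)
        ph_model = sigmoid(v @ W + c)
        # SGD step on the subproblem: grad = grad(log Z) - fixed data term.
        W += lr * (pos_W - v.T @ ph_model / n)
        b += lr * (pos_b - v.mean(axis=0))
        c += lr * (pos_c - ph_model.mean(axis=0))
    return W, b, c

# Toy usage on random binary data (sizes are arbitrary).
rng = np.random.default_rng(0)
v_batch = (rng.random((8, 6)) < 0.5).astype(float)
W0 = 0.01 * rng.standard_normal((6, 4))
b0, c0 = np.zeros(6), np.zeros(4)
W1, b1, c1 = sdcp_update(W0, b0, c0, v_batch, rng)
```

With `d=1` the inner loop performs a single step and the update coincides with an ordinary CD-k step, which is the sense in which CD appears as a special case in this sketch.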
Although it is well documented in the literature that learning GB-RBMs is more challenging than learning BB-RBMs, we find that the S-DCP algorithm can be extended to learn GB-RBMs as well. I will first present the proof that the GB-RBM's log-likelihood function is also a difference of convex functions w.r.t. the weights and hidden biases, under the assumption that the conditional distribution of the visible units has a fixed variance. We shall then see how to modify the S-DCP algorithm to learn the variance parameter of the visible units as well, instead of assuming it to be fixed.
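As a rough illustration of why such a decomposition is plausible, consider one common GB-RBM energy parameterization. This form, and the symbols $W$, $b$, $c$, $\sigma$, $g$ and $f$, are assumptions made here; the parameterization used in the talk may differ.

```latex
% Assumed parameterization: b = visible biases, c = hidden biases,
% W = weights with columns w_j, sigma^2 = fixed visible variance.
\[
E(v,h) = \frac{\lVert v - b \rVert^{2}}{2\sigma^{2}} - c^{\top} h - \frac{1}{\sigma^{2}}\, v^{\top} W h
\]
% Average negative log-likelihood over N training vectors v^{(i)}:
\[
-\frac{1}{N}\sum_{i=1}^{N} \log p\bigl(v^{(i)}\bigr) = g(W,c) - f(W,c)
\]
\[
g(W,c) = \log \int \sum_{h} e^{-E(v,h)}\, dv,
\qquad
f(W,c) = \frac{1}{N}\sum_{i=1}^{N}\Bigl[-\frac{\lVert v^{(i)} - b \rVert^{2}}{2\sigma^{2}}
+ \sum_{j} \log\bigl(1 + e^{\,c_j + w_j^{\top} v^{(i)}/\sigma^{2}}\bigr)\Bigr]
\]
```

With $b$ and $\sigma$ held fixed, $-E$ is affine in $(W, c)$, so $g$ is a log-partition function and hence convex in $(W, c)$, while $f$ is a sum of softplus functions of affine arguments plus a constant and is therefore also convex.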
Finally, through extensive empirical studies on a number of benchmark datasets, I shall demonstrate that the S-DCP algorithm and its variants converge faster and achieve a higher log-likelihood than the baseline algorithms for both types of RBMs.
*This seminar is supported by the AMED Brain/MINDS Beyond program.
Public events of RIKEN Center for Advanced Intelligence Project (AIP)