yell / boltzmann-machines

Boltzmann Machines in TensorFlow with examples
MIT License

Boltzmann Machines

This repository implements generic and flexible RBM and DBM models with lots of features and reproduces some experiments from "Deep boltzmann machines" [1], "Learning with hierarchical-deep models" [2], "Learning multiple layers of features from tiny images" [3], and some others.

Table of contents

What's Implemented

Restricted Boltzmann Machines (RBM)

Deep Boltzmann Machines (DBM)

Common features

Examples

1 RBM MNIST: script, notebook

Train Bernoulli RBM with 1024 hidden units on MNIST dataset and use it for classification.

| algorithm | test error, % |
|---|---|
| RBM features + k-NN | 2.88 |
| RBM features + Logistic Regression | 1.83 |
| RBM features + SVM | 1.80 |
| RBM + discriminative fine-tuning | 1.27 |
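As a rough illustration of what training a Bernoulli RBM and extracting its features involves, here is a minimal NumPy sketch of a single contrastive-divergence (CD-1) update. It is not the repository's TensorFlow implementation: the function names, the toy data, and the hyperparameters are all assumptions made for the example, and it omits momentum, weight decay, and sparsity. Like the script's default (`--sample-v-states` off), it uses visible probabilities rather than sampled visible states.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cd1_update(v0, W, vb, hb, lr, rng):
    """One CD-1 step on a mini-batch v0 of shape (batch, n_visible).

    Updates W, vb, hb in place and returns the reconstruction error.
    """
    # Positive phase: hidden probabilities and a binary sample given the data.
    h0_prob = sigmoid(v0 @ W + hb)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(v0.dtype)
    # Negative phase: one Gibbs step down to the visibles and up again.
    v1_prob = sigmoid(h0 @ W.T + vb)
    h1_prob = sigmoid(v1_prob @ W + hb)
    # Gradient estimate: data statistics minus reconstruction statistics.
    n = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / n
    vb += lr * (v0 - v1_prob).mean(axis=0)
    hb += lr * (h0_prob - h1_prob).mean(axis=0)
    return np.mean((v0 - v1_prob) ** 2)


def hidden_features(v, W, hb):
    """Hidden activation probabilities, usable as input to k-NN, SVM, etc."""
    return sigmoid(v @ W + hb)


# Toy data (hypothetical): two 6-bit patterns instead of MNIST digits.
rng = np.random.default_rng(0)
patterns = np.array([[1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0]], dtype=float)
data = patterns[rng.integers(0, 2, size=200)]

n_hidden = 4
W = rng.normal(0.0, 0.01, size=(6, n_hidden))
vb = np.zeros(6)
hb = np.zeros(n_hidden)

errors = [cd1_update(data, W, vb, hb, lr=0.1, rng=rng) for _ in range(300)]
features = hidden_features(data, W, hb)
print(errors[0], errors[-1], features.shape)
```

The `features` array plays the role of the "RBM features" rows in the table: a separate classifier is trained on it, whereas discriminative fine-tuning instead initializes an MLP from the learned weights.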

Another simple experiment illustrates the main idea of the one-shot learning approach proposed in [2]: train a generative neural network (RBM or DBM) on a large corpus of unlabeled data, then fine-tune the model on only a limited amount of labeled data. Of course, [2] does much more than simply pre-train an RBM or DBM, but the difference is already noticeable:

| number of labeled data pairs (train + val) | RBM + fine-tuning | random initialization | gain |
|---|---|---|---|
| 60k (55k + 5k) | 98.73% | 98.20% | +0.53% |
| 10k (9k + 1k) | 97.27% | 94.73% | +2.54% |
| 1k (900 + 100) | 93.65% | 88.71% | +4.94% |
| 100 (90 + 10) | 81.70% | 76.02% | +5.68% |

See here for how to reproduce this table. In these experiments only the RBM was tuned, to have a high pseudo log-likelihood on a held-out validation set. Even better results can be obtained by also tuning the MLP and the other classifiers.
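For readers unfamiliar with pseudo log-likelihood, the sketch below shows one common stochastic estimate for a Bernoulli RBM: flip one randomly chosen visible unit per sample and compare free energies. This is an illustrative NumPy version under my own naming, not the code used in this repository, which may compute the quantity differently.

```python
import numpy as np


def free_energy(v, W, vb, hb):
    # F(v) = -vb.v - sum_j log(1 + exp(hb_j + v.W[:, j]))
    return -(v @ vb) - np.logaddexp(0.0, v @ W + hb).sum(axis=1)


def pseudo_log_likelihood(v, W, vb, hb, rng):
    """Stochastic estimate: n_visible * log P(v_i | v_-i) for one random i."""
    n_samples, n_visible = v.shape
    rows = np.arange(n_samples)
    idx = rng.integers(0, n_visible, size=n_samples)
    v_flip = v.copy()
    v_flip[rows, idx] = 1.0 - v_flip[rows, idx]
    # P(v_i | v_-i) = sigmoid(F(v_flipped) - F(v))
    delta = free_energy(v_flip, W, vb, hb) - free_energy(v, W, vb, hb)
    log_cond = -np.logaddexp(0.0, -delta)  # log sigmoid(delta)
    return float(np.mean(n_visible * log_cond))


rng = np.random.default_rng(0)
v = np.ones((32, 6))  # toy "validation set": every unit is on

# An untrained model (all parameters zero) treats each bit as a coin flip.
W0, vb0, hb0 = np.zeros((6, 3)), np.zeros(6), np.zeros(3)
pll_untrained = pseudo_log_likelihood(v, W0, vb0, hb0, rng)

# A model whose visible biases strongly favour "on" fits this data better.
pll_fitted = pseudo_log_likelihood(v, W0, np.full(6, 5.0), hb0, rng)
print(pll_untrained, pll_fitted)
```

Tracking this value on held-out data is cheap because it needs no partition function, which is why it is a convenient model-selection signal during RBM training.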


2 DBM MNIST: script, notebook

Train 784-512-1024 Bernoulli DBM on MNIST dataset with pre-training and:

| algorithm | # intermediate distributions | proposal (p0) | log Ẑ | log(Ẑ ± σZ) | avg. test ELBO | tightness of test ELBO |
|---|---|---|---|---|---|---|
| [1] | 20'000 | base-rate? [5] | 356.18 | 356.06, 356.29 | -84.62 | about 0.5 nats |
| this example | 200'000 | uniform | 1040.39 | 1040.18, 1040.58 | -86.37 | |
| this example | 20'000 | uniform | 1040.58 | 1039.93, 1041.03 | -86.59 | |

One can probably get better results by tuning the model slightly more. Also, a couple of nats could have been lost because of single precision (used for both training and AIS estimation).
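To make the columns of the table above more concrete, here is a small, self-contained sketch of annealed importance sampling (AIS) with a uniform proposal. To keep it short and checkable it targets a tiny Bernoulli RBM rather than a DBM, so the estimate can be compared with the exact log-partition function obtained by enumeration. All names, sizes, and the linear temperature schedule are assumptions for illustration; this is not the repository's AIS code.

```python
import itertools

import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def log_p_star(v, beta, W, vb, hb):
    """Log unnormalised marginal over v at inverse temperature beta."""
    return beta * (v @ vb) + np.logaddexp(0.0, beta * (v @ W + hb)).sum(axis=1)


def ais_log_z(W, vb, hb, n_betas, n_runs, rng):
    n_v, n_h = W.shape
    betas = np.linspace(0.0, 1.0, n_betas)  # intermediate distributions
    # At beta = 0 the model is uniform, so log Z_0 = (n_v + n_h) * log 2.
    log_z0 = (n_v + n_h) * np.log(2.0)
    v = (rng.random((n_runs, n_v)) < 0.5).astype(float)  # samples from p0
    log_w = np.zeros(n_runs)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # Accumulate the importance weight, then move with a Gibbs step
        # that leaves the distribution at temperature b invariant.
        log_w += log_p_star(v, b, W, vb, hb) - log_p_star(v, b_prev, W, vb, hb)
        h = (rng.random((n_runs, n_h)) < sigmoid(b * (v @ W + hb))).astype(float)
        v = (rng.random((n_runs, n_v)) < sigmoid(b * (h @ W.T + vb))).astype(float)
    m = log_w.max()  # log-mean-exp of the weights, computed stably
    return float(log_z0 + m + np.log(np.mean(np.exp(log_w - m))))


def exact_log_z(W, vb, hb):
    """Brute-force log Z; only feasible for a handful of visible units."""
    all_v = np.array(list(itertools.product([0.0, 1.0], repeat=W.shape[0])))
    lp = log_p_star(all_v, 1.0, W, vb, hb)
    return float(lp.max() + np.log(np.exp(lp - lp.max()).sum()))


rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.5, size=(5, 3))
vb = rng.normal(0.0, 0.5, size=5)
hb = rng.normal(0.0, 0.5, size=3)

estimate = ais_log_z(W, vb, hb, n_betas=1000, n_runs=200, rng=rng)
exact = exact_log_z(W, vb, hb)
print(estimate, exact)
```

With more intermediate distributions the importance weights have lower variance, which is the trade-off between the 20'000 and 200'000 rows of the table.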

| number of labeled data pairs (train + val) | DBM + fine-tuning | random initialization | gain |
|---|---|---|---|
| 60k (55k + 5k) | 98.68% | 98.28% | +0.40% |
| 10k (9k + 1k) | 97.11% | 94.50% | +2.61% |
| 1k (900 + 100) | 93.54% | 89.14% | +4.40% |
| 100 (90 + 10) | 83.79% | 76.24% | +7.55% |

See here for how to reproduce this table.

Again, the MLP is not tuned. With a tuned MLP and a slightly better tuned generative model, [1] reports 0.95% error on the full test set.
Performance on the full training set is slightly worse than with the RBM because the optimization problem is harder and gradients may vanish. For the same reason, the gain from pre-training is typically larger when few labeled data points are used.
The large number of parameters is one of the main reasons why one-shot learning is not (so) successful with deep learning alone. It is much better to combine deep learning with hierarchical Bayesian modeling by putting an HDP prior over the units of the top-most hidden layer, as in [2].


3 DBM CIFAR-10 "Naïve": script, notebook

(Simply) train 3072-5000-1000 Gaussian-Bernoulli-Multinomial DBM on "smoothed" CIFAR-10 dataset (with 1000 least significant singular values removed, as suggested in [3]) with pre-training and:

Despite poor-looking G-RBM features, test accuracy after discriminative fine-tuning is much higher than the accuracy reported in [3] for backprop from random initialization, and about 5 percentage points behind the best reported result using an RBM (which has twice as many hidden units). Note also that the G-RBM is modified for DBM pre-training (see notes or [1] for details):

| algorithm | test accuracy, % |
|---|---|
| Best known MLP w/o data augmentation: 8 layer ZLin net [6] | 69.62 |
| Best known method using RBM (w/o data augmentation?): 10k hiddens + fine-tuning [3] | 64.84 |
| Gaussian RBM + discriminative fine-tuning (this example) | 59.78 |
| Pure backprop 3072-5000-10 on smoothed data (this example) | 58.20 |
| Pure backprop 782-10k-10 on PCA whitened data [3] | 51.53 |
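The "smoothed" dataset used in this example is obtained by removing the least significant singular values. A minimal NumPy sketch of that idea follows. It assumes the images are flattened into the rows of a data matrix and that the matrix is mean-centered before the SVD; both are my assumptions, as are the function name and the toy sizes, and the repository's preprocessing may differ in these details.

```python
import numpy as np


def smooth_by_svd(X, n_removed):
    """Zero out the n_removed smallest singular values of the centered data."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    s = s.copy()
    # np.linalg.svd returns singular values in descending order,
    # so the least significant ones are at the end.
    s[len(s) - n_removed:] = 0.0
    return (U * s) @ Vt + mean


# Toy stand-in for CIFAR-10: 50 "images" with 20 features each,
# instead of 3072 features with 1000 singular values removed.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
X_smooth = smooth_by_svd(X, n_removed=5)
print(X_smooth.shape, np.linalg.matrix_rank(X_smooth - X.mean(axis=0)))
```

Dropping the smallest singular values discards the directions with the least variance, which mostly carry fine pixel-level detail.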


4 DBM CIFAR-10: script, notebook

Train a 3072-7800-512 G-B-M DBM with pre-training on CIFAR-10, augmented (x10) using shifts by 1 pixel in all directions and horizontal mirroring. The G-RBM is trained in a more advanced way: it is initialized from 26 small RBMs pre-trained on image patches, as in [3].
Notice how some of the particles already resemble natural images of horses, cars etc. Note also that the model is trained only on augmented CIFAR-10 (490k images), compared to the 4M images used in [2].
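The x10 augmentation described above can be sketched as follows, reading "all directions" as up, down, left, and right: the original plus four one-pixel shifts gives five variants, and mirroring each doubles that to ten. How the repository fills the border pixels exposed by a shift is not stated here, so repeating the edge pixel is an assumption, as are the function names and the channels-last layout.

```python
import numpy as np


def shift(images, dy, dx):
    """Shift a batch of shape (n, H, W, C) by (dy, dx) pixels, |dy|, |dx| <= 1.

    Exposed border pixels are filled by repeating the edge (an assumption).
    """
    padded = np.pad(images, ((0, 0), (1, 1), (1, 1), (0, 0)), mode="edge")
    h, w = images.shape[1:3]
    return padded[:, 1 - dy:1 - dy + h, 1 - dx:1 - dx + w, :]


def augment_x10(images):
    """Return originals, 4 one-pixel shifts, and mirrored copies of all 5."""
    offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    shifted = [shift(images, dy, dx) for dy, dx in offsets]
    mirrored = [s[:, :, ::-1, :] for s in shifted]  # horizontal flip
    return np.concatenate(shifted + mirrored, axis=0)


# Two tiny 4x4 RGB "images" stand in for 32x32 CIFAR-10 images.
images = np.arange(2 * 4 * 4 * 3, dtype=float).reshape(2, 4, 4, 3)
augmented = augment_x10(images)
print(augmented.shape)
```

Labels, if needed, would simply be tiled ten times in the same order as the concatenated blocks.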

I also trained for longer with

python dbm_cifar.py --small-l2 2e-3 --small-epochs 120 --small-sparsity-cost 0 \
                    --increase-n-gibbs-steps-every 20 --epochs 80 72 200 \
                    --l2 2e-3 0.01 1e-8 --max-mf-updates 70

With these settings all RBMs learn nicer-looking features, but they also overfit more than before, so overall DBM performance is slightly worse.

Training with all the pre-training stages takes quite a lot of time, but once trained, these nets can be reused for other (similar) datasets and tasks.
Discriminative performance of the Gaussian RBM is now very close to the best reported RBM result (with 7800 vs. 10k hidden units), and data augmentation gives about another 4 percentage points of test accuracy:

| algorithm | test accuracy, % |
|---|---|
| Gaussian RBM + discriminative fine-tuning + augmentation (this example) | 68.11 |
| Best known method using RBM (w/o data augmentation?): 10k hiddens + fine-tuning [3] | 64.84 |
| Gaussian RBM + discriminative fine-tuning (this example) | 64.38 |
| Gaussian RBM + discriminative fine-tuning (example #3) | 59.78 |

See here for how to reproduce this table.


How to use examples

Use scripts for training models from scratch, for instance

$ python rbm_mnist.py -h

(...)

usage: rbm_mnist.py [-h] [--gpu ID] [--n-train N] [--n-val N]
                    [--data-path PATH] [--n-hidden N] [--w-init STD]
                    [--vb-init] [--hb-init HB] [--n-gibbs-steps N [N ...]]
                    [--lr LR [LR ...]] [--epochs N] [--batch-size B] [--l2 L2]
                    [--sample-v-states] [--dropout P] [--sparsity-target T]
                    [--sparsity-cost C] [--sparsity-damping D]
                    [--random-seed N] [--dtype T] [--model-dirpath DIRPATH]
                    [--mlp-no-init] [--mlp-l2 L2] [--mlp-lrm LRM [LRM ...]]
                    [--mlp-epochs N] [--mlp-val-metric S] [--mlp-batch-size N]
                    [--mlp-save-prefix PREFIX]

optional arguments:
  -h, --help            show this help message and exit
  --gpu ID              ID of the GPU to train on (or '' to train on CPU)
                        (default: 0)
  --n-train N           number of training examples (default: 55000)
  --n-val N             number of validation examples (default: 5000)
  --data-path PATH      directory for storing augmented data etc. (default:
                        ../data/)
  --n-hidden N          number of hidden units (default: 1024)
  --w-init STD          initialize weights from zero-centered Gaussian with
                        this standard deviation (default: 0.01)
  --vb-init             initialize visible biases as logit of mean values of
                        features, otherwise (if enabled) zero init (default:
                        True)
  --hb-init HB          initial hidden bias (default: 0.0)
  --n-gibbs-steps N [N ...]
                        number of Gibbs updates per weights update or sequence
                        of such (per epoch) (default: 1)
  --lr LR [LR ...]      learning rate or sequence of such (per epoch)
                        (default: 0.05)
  --epochs N            number of epochs to train (default: 120)
  --batch-size B        input batch size for training (default: 10)
  --l2 L2               L2 weight decay coefficient (default: 1e-05)
  --sample-v-states     sample visible states, otherwise use probabilities w/o
                        sampling (default: False)
  --dropout P           probability of visible units being on (default: None)
  --sparsity-target T   desired probability of hidden activation (default:
                        0.1)
  --sparsity-cost C     controls the amount of sparsity penalty (default:
                        1e-05)
  --sparsity-damping D  decay rate for hidden activations probs (default: 0.9)
  --random-seed N       random seed for model training (default: 1337)
  --dtype T             datatype precision to use (default: float32)
  --model-dirpath DIRPATH
                        directory path to save the model (default:
                        ../models/rbm_mnist/)
  --mlp-no-init         if enabled, use random initialization (default: False)
  --mlp-l2 L2           L2 weight decay coefficient (default: 1e-05)
  --mlp-lrm LRM [LRM ...]
                        learning rate multipliers of 1e-3 (default: (0.1,
                        1.0))
  --mlp-epochs N        number of epochs to train (default: 100)
  --mlp-val-metric S    metric on validation set to perform early stopping,
                        {'val_acc', 'val_loss'} (default: val_acc)
  --mlp-batch-size N    input batch size for training (default: 128)
  --mlp-save-prefix PREFIX
                        prefix to save MLP predictions and targets (default:
                        ../data/rbm_)

or download pretrained ones with default parameters using models/fetch_models.sh,
and check the notebooks for the corresponding inference, visualizations etc. Note that training is skipped if there is already a model in model-dirpath, and similarly for the other experiments (you can choose a different location to train another model).
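Entries such as `--lr LR [LR ...]` in the usage text above are how Python's argparse renders options declared with `nargs='+'`, i.e. options that accept one or more values. The sketch below reproduces that pattern with a hypothetical three-option parser. The `per_epoch` helper, which repeats the last value to cover the remaining epochs, is only one possible convention and is an assumption; check the scripts for how they actually expand such sequences.

```python
import argparse

# Hypothetical parser mirroring a few options from the help text above.
parser = argparse.ArgumentParser(
    prog="rbm_sketch.py",
    formatter_class=argparse.ArgumentDefaultsHelpFormatter,  # adds "(default: ...)"
)
parser.add_argument("--n-hidden", type=int, default=1024, metavar="N",
                    help="number of hidden units")
parser.add_argument("--lr", type=float, nargs="+", default=[0.05], metavar="LR",
                    help="learning rate or sequence of such (per epoch)")
parser.add_argument("--epochs", type=int, default=120, metavar="N",
                    help="number of epochs to train")


def per_epoch(values, n_epochs):
    """Expand a short sequence to one value per epoch.

    Assumed convention: the last value is repeated for remaining epochs.
    """
    values = list(values)[:n_epochs]
    return values + [values[-1]] * (n_epochs - len(values))


# Equivalent to: python rbm_sketch.py --lr 0.05 0.01 --epochs 3
args = parser.parse_args(["--lr", "0.05", "0.01", "--epochs", "3"])
schedule = per_epoch(args.lr, args.epochs)
print(args.n_hidden, args.lr, schedule)
```

Passing a single value (`--lr 0.05`) yields a one-element list, so the same code path handles both the scalar and the sequence form.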


Memory requirements


Download models and stuff

All models from all experiments can be downloaded by running models/fetch_models.sh or manually from Google Drive.
Also, you can download additional data (fine-tuned models' predictions, fine-tuned weights, means and standard deviations for datasets for examples #3, #4) using data/fetch_additional_data.sh

TeX notes

Check also my supplementary notes (or dropbox) with some historical outlines, theory, derivations, observations etc.

How to install

By default, the following commands install (among others) tensorflow-gpu~=1.3.0. If you want to install tensorflow without GPU support, replace the corresponding line in requirements.txt. If you already have tensorflow installed, comment out that line.

git clone https://github.com/monsta-hd/boltzmann-machines.git
cd boltzmann-machines
pip install -r requirements.txt

See here how to run from a virtual environment.
See here how to run from a docker container.

To run some notebooks you also need to install JSAnimation:

git clone https://github.com/jakevdp/JSAnimation
cd JSAnimation
python setup.py install

After installation, tests can be run with:

make test

All the necessary data can be downloaded with:

make data

Common installation issues

ImportError: libcudnn.so.6: cannot open shared object file: No such file or directory.
TensorFlow 1.3.0 assumes cuDNN v6.0 by default. If you have a different version installed, you can create a symlink to libcudnn.so.6 in /usr/local/cuda/lib64 or /usr/local/cuda-8.0/lib64. More details here.

Possible future work

Contributing

Feel free to improve the existing code or documentation, or to implement new features (including those listed in Possible future work). Please open an issue to propose your changes if they are big enough.

References

[1] R. Salakhutdinov and G. Hinton. Deep boltzmann machines. In: Artificial Intelligence and Statistics, pages 448–455, 2009. [PDF]

[2] R. Salakhutdinov, J. B. Tenenbaum, and A. Torralba. Learning with hierarchical-deep models. IEEE transactions on pattern analysis and machine intelligence, 35(8):1958–1971, 2013. [PDF]

[3] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009. [PDF]

[4] G. Hinton. A practical guide to training restricted boltzmann machines. Momentum, 9(1):926, 2010. [PDF]

[5] R. Salakhutdinov and I. Murray. On the quantitative analysis of Deep Belief Networks. In A. McCallum and S. Roweis, editors, Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008), pages 872–879. Omnipress, 2008 [PDF]

[6] Z. Lin, R. Memisevic, and K. Konda. How far can we go without convolution: Improving fully-connected networks. ICML 2016. [arXiv]

[7] G. Montavon and K.-R. Müller. Deep boltzmann machines and the centering trick. In Neural Networks: Tricks of the Trade, pages 621–637. Springer, 2012. [PDF]