zhangxiangxiao / Crepe

Character-level Convolutional Networks for Text Classification
BSD 3-Clause "New" or "Revised" License

Replicating on mxnet - too much memory for GPU #20

Closed · ilkarman closed this issue 8 years ago

ilkarman commented 8 years ago

I'm having a bit of difficulty replicating this cool model with the mxnet package (instead of Torch). I'm not sure whether the problem is my implementation or the mxnet package itself, but it uses far more than 3 GB of GPU memory.

My version is here and below is a direct paste of the model:

Edit: fixed with updated params

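    # Assumptions for this excerpt: NUM_FILTERS (256, per the layer-size
    # comments below) and NOUTPUT (the number of classes) are defined earlier
    # in the script, mxnet is imported as mx, and the input is fed as a
    # (batch, 1, 1014, 69) NCHW one-hot tensor.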
    input_x = mx.sym.Variable('data')  # placeholder for input
    input_y = mx.sym.Variable('softmax_label')  # placeholder for output
    #6 Convolutional layers
    #1. alphabet x 1014
    conv1 = mx.symbol.Convolution(
        data=input_x, kernel=(7, 69), num_filter=NUM_FILTERS)
    relu1 = mx.symbol.Activation(
        data=conv1, act_type="relu")
    pool1 = mx.symbol.Pooling(
        data=relu1, pool_type="max", kernel=(3, 1), stride=(3, 1))
    #2. 336 x 256
    conv2 = mx.symbol.Convolution(
        data=pool1, kernel=(7, 1), num_filter=NUM_FILTERS)
    relu2 = mx.symbol.Activation(
        data=conv2, act_type="relu")
    pool2 = mx.symbol.Pooling(
        data=relu2, pool_type="max", kernel=(3, 1), stride=(3, 1))
    #3. 110 x 256
    conv3 = mx.symbol.Convolution(
        data=pool2, kernel=(3, 1), num_filter=NUM_FILTERS)
    relu3 = mx.symbol.Activation(
        data=conv3, act_type="relu")
    #4. 108 x 256
    conv4 = mx.symbol.Convolution(
        data=relu3, kernel=(3, 1), num_filter=NUM_FILTERS)
    relu4 = mx.symbol.Activation(
        data=conv4, act_type="relu")
    #5. 106 x 256
    conv5 = mx.symbol.Convolution(
        data=relu4, kernel=(3, 1), num_filter=NUM_FILTERS)
    relu5 = mx.symbol.Activation(
        data=conv5, act_type="relu")
    #6. 104 x 256
    conv6 = mx.symbol.Convolution(
        data=relu5, kernel=(3, 1), num_filter=NUM_FILTERS)
    relu6 = mx.symbol.Activation(
        data=conv6, act_type="relu")
    pool6 = mx.symbol.Pooling(
        data=relu6, pool_type="max", kernel=(3, 1), stride=(3, 1))
    #34 x 256
    flatten = mx.symbol.Flatten(data=pool6)
    #3 Fully-connected layers
    #7.  8704
    fc1 = mx.symbol.FullyConnected(
        data=flatten, num_hidden=1024)
    act_fc1 = mx.symbol.Activation(
        data=fc1, act_type="relu")
    drop1 = mx.sym.Dropout(act_fc1, p=0.5)
    #8. 1024
    fc2 = mx.symbol.FullyConnected(
        data=drop1, num_hidden=1024)
    act_fc2 = mx.symbol.Activation(
        data=fc2, act_type="relu")
    drop2 = mx.sym.Dropout(act_fc2, p=0.5)
    #9. 1024
    fc3 = mx.symbol.FullyConnected(
        data=drop2, num_hidden=NOUTPUT)
    crepe = mx.symbol.SoftmaxOutput(
        data=fc3, label=input_y, name="softmax")
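
One way to sanity-check the layer sizes annotated in the comments above is to query the symbol's inferred shapes. This is just a quick sketch: the (batch, 1, 1014, 69) NCHW input layout and the batch size of 128 are my assumptions.

    # Print the output shape of every internal node of the network, so the
    # annotated sizes (336 x 256, 110 x 256, ..., 8704) can be verified.
    internals = crepe.get_internals()
    _, out_shapes, _ = internals.infer_shape(
        data=(128, 1, 1014, 69), softmax_label=(128,))
    for name, shape in zip(internals.list_outputs(), out_shapes):
        print(name, shape)
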
zhangxiangxiao commented 8 years ago

I am not an expert in MXNET, but its underlying matrix library MSHADOW uses C++ expression templates extensively to auto-tune numerical computation, which cannot completely avoid temporary allocations. Therefore, it occupies more memory than hand-tuned software like Torch 7.

It is probably better to ask about this in GitHub repos or forums related to MXNET.

ilkarman commented 8 years ago

Thanks Zhang! It turns out that I had mistyped the kernel sizes (e.g. (7, 7) instead of (7, vocab_dim)), and the updated model does seem to run. However, I see what you mean about the extra temporary memory allocations causing bloat. It is also difficult to distribute this across GPUs in mxnet (it seems the model is not really big enough to benefit from that, at least the 256-filter one).

I was curious - did you really run this for 5000 epochs?

I am currently running this on a single Tesla K80 with a bigger batch size of 256 (since the GPU has 11 GB of RAM). It takes me around 750 minutes to run one epoch (on the 3.6 million training observations), so I can get through about 2 epochs a day.

I was also curious whether you remember the training accuracies you were getting in the first few epochs? After the second epoch I am still below 0.55.

zhangxiangxiao commented 8 years ago

A Tesla K80 is roughly equivalent to two down-clocked Tesla K40s, with each core about 20% slower. A Tesla K40 is in turn a down-clocked version of the GeForce GTX Titan Black (with more memory and ECC memory support) that is about 20% slower. And the Titan Black is itself about 20% slower than the old Titan X, and about 50% slower than the new Titan X.

Therefore, it is expected that you cannot run this fast on a Tesla K80, despite the price tag of that chip.

That said, for GPUs with compute capability > 3.0 you can use the CuDNN temporal convolution, which runs about 17 times faster than the default cunn temporal convolution used in the Crepe code.

When I did the experiments there was no CuDNN and no Titan X, and the DBPedia experiment took about a week for 10 eras, each era being 5000 epochs (in the Crepe code's terminology, an epoch is a minibatch).

ilkarman commented 8 years ago

I see, that's very interesting.

At the rate mxnet is going, it takes around 10 hours to do one epoch with 1 GPU and a batch size of 128. If I use 4 GPUs and keep the batch size constant, it actually takes longer.

To hit your 50,000 epochs I would need 500,000 hours = 57 years :)
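
For reference, the data-parallel setup I'm using looks roughly like this; train_iter is a placeholder for whatever DataIter feeds the 'data'/'softmax_label' batches, and the optimizer settings are just my current guesses. Since mxnet splits each batch across the listed devices, keeping the batch size at 128 means each of the 4 GPUs only sees 32 examples, which probably explains the lack of speedup.

    # Data-parallel training across 4 GPUs; each device gets batch_size/4 samples.
    mod = mx.mod.Module(symbol=crepe,
                        context=[mx.gpu(i) for i in range(4)],
                        data_names=['data'],
                        label_names=['softmax_label'])
    mod.fit(train_iter,
            num_epoch=10,
            optimizer='sgd',
            optimizer_params={'learning_rate': 0.01, 'momentum': 0.9},
            eval_metric='acc')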

gheinrich commented 8 years ago

For the record, when I train the network on a Titan X (Maxwell) using the CuDNN library and Torch in DIGITS on the DBPedia dataset, it takes ~15 minutes per epoch (498,400 training samples, 56,000 validation samples, with samples truncated to a maximum of 1024 characters).

miguelgfierro commented 8 years ago

I'm running the code on 4 K80s on Ubuntu and I'm getting the same results as @ilkarman. It's going a little bit faster: 1000 batches takes around 30 minutes, which is roughly similar to his numbers, so I think the speed is in the same ballpark for all of us. xiangxiao, to map the terminology onto mxnet: Crepe's eras => mxnet epochs, and Crepe's epochs => mxnet batches.

However, the accuracy is quite poor. It started at 0.51. Did you guys have this accuracy at the beginning and then see it rise, or was the initial accuracy better?

ilkarman commented 8 years ago

My accuracy at the end of era 2 / epoch 2 is still around 0.51 - I don't know if it's realistic that this jumps up to 0.9 in the next 8 epochs/eras?

gheinrich commented 8 years ago

When I train the model on DBPedia, the accuracy is already higher than 90% at epoch 2.

ilkarman commented 8 years ago

@gheinrich Thanks for that - I tried that dataset (using 4 GPUs and batchsize = 128*8) and the first epoch took 3995 seconds, around 65 minutes. The validation accuracy after the first round was already 0.95.

So I guess that's an easier classification task, and with a lot less data than Amazon.

I would be curious to know what kinds of accuracies to expect for the Amazon data in epochs 1-10.

zhangxiangxiao commented 8 years ago

Note that in the paper, each epoch (an era in the Crepe code) consists of 30,000 minibatches (epochs in the Crepe code) for the Amazon datasets, as in the last column of Table 3. This matters because it means the learning rate is halved every 90,000 minibatches (3 epochs) instead of every 15,000 as for DBPedia. It took me 3 weeks to get results for the two Amazon datasets, but you can probably get it done in 1-2 days using CuDNN.
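
If it helps, the halving schedule can be expressed in mxnet with a FactorScheduler. This is only a sketch; the base learning rate and momentum below are illustrative, not something prescribed in this thread.

    # Amazon datasets: 30,000 minibatches per epoch, learning rate halved
    # every 3 epochs = 90,000 minibatches (for DBPedia the step would be 15,000).
    schedule = mx.lr_scheduler.FactorScheduler(step=3 * 30000, factor=0.5)
    optimizer_params = {'learning_rate': 0.01,
                        'momentum': 0.9,
                        'lr_scheduler': schedule}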

ilkarman commented 8 years ago

Sorry, it turns out the mistake was mine. In my haste I forgot to add a stride of 3 to the max-pooling :) It runs as expected now and the accuracy is high.
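
For anyone else hitting the same accuracy plateau, the fix is just the explicit stride on every max-pooling layer, e.g.:

    # Max-pool over the sequence dimension with stride 3, as in the paper.
    pool1 = mx.symbol.Pooling(
        data=relu1, pool_type="max", kernel=(3, 1), stride=(3, 1))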

I posted some pictures here in the readme; please let me know if that's ok and then I can close this

zhangxiangxiao commented 8 years ago

I will close. The drawings look awesome!