ypxie / pytorch-NeuCom

PyTorch implementation of DeepMind's Differentiable Neural Computer paper.

Serialise and save the `DNC` at checkpoints #2

Closed: AjayTalati closed this issue 7 years ago

AjayTalati commented 7 years ago

Hi,

I think the training progresses well, though there doesn't seem to be any further improvement after 10,000 iterations. Sometimes I get NaNs very early on and have to restart, and I also got NaNs after 40,000 iterations.

May I ask how to serialise and save the DNC at checkpoints? I set a checkpoint after 100 iterations and got the following error:

Using CUDA.
Iteration 0/50000
    Avg. Logistic Loss: 0.6931
Iteration 50/50000
    Avg. Logistic Loss: 0.6674
Iteration 100/50000
    Avg. Logistic Loss: 0.4560

Saving Checkpoint ... Traceback (most recent call last):
  File "train.py", line 183, in <module>
    torch.save(ncomputer, f)
  File "/home/ajay/anaconda3/envs/rllab3/lib/python3.5/site-packages/torch/serialization.py", line 120, in save
    return _save(obj, f, pickle_module, pickle_protocol)
  File "/home/ajay/anaconda3/envs/rllab3/lib/python3.5/site-packages/torch/serialization.py", line 186, in _save
    pickler.dump(obj)
_pickle.PicklingError: Can't pickle <class 'memory.mem_tuple'>: attribute lookup mem_tuple on memory failed

PS - I've seen these NaNs very often with Neural Turing Machines, so I guess it's inherent to this type of model?

ypxie commented 7 years ago

Hi, thanks for pointing this out. The checkpoint error has been fixed. As for the NaN error, I think it might be an optimization issue; you can try a smaller learning rate.
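For reference, one generic workaround for that pickling error (just a sketch, not necessarily the fix applied in this repo) is to checkpoint the module's `state_dict` instead of pickling the whole `ncomputer` object, so module-level classes such as `memory.mem_tuple` never need to be picklable. The helper names below are hypothetical:

```python
import torch

def save_checkpoint(ncomputer, optimizer, iteration, path):
    # Only tensors and plain Python containers end up in the file,
    # so custom classes like memory.mem_tuple never get pickled.
    torch.save({
        'iteration': iteration,
        'model_state': ncomputer.state_dict(),
        'optim_state': optimizer.state_dict(),
    }, path)

def load_checkpoint(ncomputer, optimizer, path):
    checkpoint = torch.load(path)
    ncomputer.load_state_dict(checkpoint['model_state'])
    optimizer.load_state_dict(checkpoint['optim_state'])
    return checkpoint['iteration']
```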

ypxie commented 7 years ago

The NaN error seems to be rooted in the ReLU on the controller output, which can produce an all-zero key and leads to NaN in the backward pass of the cosine distance. The latest version should be more stable.
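For anyone hitting the same thing: the cosine similarity divides by the key's norm, which is zero when the ReLU zeroes out the whole key, so the backward pass blows up. Clamping the norms with a small epsilon is the usual guard; here is a generic sketch (not the exact change in the repo):

```python
import torch

def stable_cosine_similarity(keys, memory, eps=1e-6):
    # keys:   (batch, n_keys, word_size)
    # memory: (batch, n_cells, word_size)
    # Clamping the norms keeps the division (and its gradient) finite
    # even when the controller emits an all-zero key.
    key_norm = torch.clamp(keys.norm(p=2, dim=-1, keepdim=True), min=eps)
    mem_norm = torch.clamp(memory.norm(p=2, dim=-1, keepdim=True), min=eps)
    similarity = torch.matmul(keys, memory.transpose(1, 2))
    return similarity / (key_norm * mem_norm.transpose(1, 2))
```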

AjayTalati commented 7 years ago

Thanks a lot :+1: - will try it, and give you feedback :)

AjayTalati commented 7 years ago

Hi, I keep playing around with hyperparameters, but even with the latest version I still get a lot of runs with NaNs early on. When it works it's cool, but it's a bit tiring restarting again and again.

    Avg. Logistic Loss: 0.6933
Iteration 50/100000
    Avg. Logistic Loss: 0.5672
Iteration 100/100000
    Avg. Logistic Loss: 0.3351
Iteration 150/100000
    Avg. Logistic Loss: 0.2968
Iteration 200/100000
    Avg. Logistic Loss: 0.2923
Iteration 250/100000
    Avg. Logistic Loss: 0.2872
Iteration 300/100000
    Avg. Logistic Loss: 0.2894
Iteration 350/100000
    Avg. Logistic Loss: 0.2839
Iteration 400/100000
    Avg. Logistic Loss: 0.2857
Iteration 450/100000
    Avg. Logistic Loss: nan
Iteration 500/100000
    Avg. Logistic Loss: nan
Iteration 550/100000
    Avg. Logistic Loss: nan
Iteration 600/100000
    Avg. Logistic Loss: nan
Iteration 650/100000
    Avg. Logistic Loss: nan
Iteration 700/100000
    Avg. Logistic Loss: nan
Iteration 750/100000
    Avg. Logistic Loss: nan
Iteration 800/100000
    Avg. Logistic Loss: nan
Iteration 850/100000
    Avg. Logistic Loss: nan
Iteration 900/100000
    Avg. Logistic Loss: nan
Iteration 950/100000
    Avg. Logistic Loss: nan

Have you got any advice on ways to make the training stable / good hyper-parameter settings / weight initialisation? I'll keep playing around with it though. Thanks, Aj
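As a general note (not specific to this repo), the usual first-line measures for NaN-prone recurrent training are a smaller learning rate, gradient-norm clipping, and skipping batches whose loss is already NaN. A minimal sketch, with placeholder names rather than the actual objects in train.py:

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, criterion, input_data, target, max_grad_norm=10.0):
    # One NaN-conscious training step: skip bad batches and clip gradient norms.
    optimizer.zero_grad()
    output, _ = model(input_data)          # the DNC forward returns (output, extras)
    loss = criterion(output, target)
    if torch.isnan(loss).item():
        return None                        # don't push NaN gradients into the weights
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()

# Usage sketch (names are placeholders, not this repo's train.py):
# optimizer = torch.optim.RMSprop(ncomputer.parameters(), lr=1e-4, eps=1e-10)
# criterion = nn.BCEWithLogitsLoss()
# loss = train_step(ncomputer, optimizer, criterion, input_data, target_output)
```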

ypxie commented 7 years ago

That's strange; I tested it several times and it looks quite stable to me, converging to 0.01.
Did you update all the files? You can try a fresh git clone. I will also take a closer look.

AjayTalati commented 7 years ago

Fresh pull from Git, but it's working OK this run!

How many iterations do you need for 0.001? It usually converges to 0.01 for me, and then I get NaNs.

I'm going to try it on a better GPU; maybe it's just this machine. Which GPU are you using?

ypxie commented 7 years ago

Glad to know that it's working :D. Did you get the 0.01 and the NaNs with the older version? I could get 0.001 with a smaller network config (nhid and mem_size = 64, and a shorter sequence). It usually takes more than 15,000 iterations. I am using a laptop GeForce 940M GPU. Let me know if you still get the annoying NaNs.

AjayTalati commented 7 years ago

Erm, I've got 0.01 from the latest version too, so it looks like it's my GPU. What do you get if you use a CPU?

For some reason, when I try it on my CPU it crashes and gives the following:

AssertionError: leaf variable was used in an inplace operation

This is with the latest pull and a fresh PyTorch install. To be honest, I don't understand why I get this on the CPU and not when I run it with CUDA. Very strange.

ypxie commented 7 years ago

I can run it on CPU, and it's much faster than the GPU version. The problem I have is that running in CPU mode gradually consumes more and more memory; some other people have also reported similar issues with PyTorch.

I am using the latest torch.

AjayTalati commented 7 years ago

How did you manage to run it on the CPU? I might try reinstalling PyTorch without CUDA.

Yes, I've seen this memory leak too when running multithreaded A3C! I managed to fix it by getting rid of all logging and anything unnecessary stored in the training loop.

It's a PyTorch thing (not an algorithm thing), as I never get it in TensorFlow. There's some sort of memory leak, which you can plug with garbage collection, but even then memory still grows on long runs.
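For what it's worth, the pattern that helped me is to keep nothing in the loop that holds onto the autograd graph: store plain floats rather than tensors, and only then run the garbage collector. A generic sketch (placeholder names, not this repo's train.py):

```python
import gc

def train_loop(model, optimizer, criterion, data_iter, num_iterations, log_every=100):
    losses = []                           # store plain floats, not tensors,
                                          # so each iteration's graph can be freed
    for iteration in range(num_iterations):
        input_data, target = next(data_iter)
        optimizer.zero_grad()
        output, _ = model(input_data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        losses.append(float(loss))        # float() drops the reference to the graph

        if iteration % log_every == 0:
            print('Iteration %d/%d, loss %.4f' % (iteration, num_iterations, losses[-1]))
            gc.collect()                  # sweep up cycles that keep old graphs alive
    return losses
```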

Have you tried MXNet?

AjayTalati commented 7 years ago

Well, here's my error message with the latest version of the code and a fresh install of PyTorch:

Iteration 0/100000
Traceback (most recent call last):
  File "train.py", line 153, in <module>
    output, _ = ncomputer.forward(input_data)
  File "../../neucom/dnc.py", line 110, in forward
    interface['erase_vector']
  File "../../neucom/memory.py", line 389, in write
    allocation_weight = self.get_allocation_weight(sorted_usage, free_list)
  File "../../neucom/memory.py", line 144, in get_allocation_weight

flat_unordered_allocation_weight.cpu()

  File "/home/ajay/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 636, in scatter_
    return Scatter(dim, True)(self, index, source)
RuntimeError: a leaf Variable that requires grad has been used in an in-place operation.

Have you got any ideas how to fix this?
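For context on what the error means: an in-place op such as `scatter_` is not allowed directly on a leaf tensor that requires grad; the usual workaround is to build a non-leaf tensor from it first, or to use the out-of-place op. A minimal reproduction, written against the current tensor API rather than the old Variable API and unrelated to the repo's actual tensors:

```python
import torch

index = torch.tensor([[0, 2]])
source = torch.ones(1, 2)

leaf = torch.zeros(1, 3, requires_grad=True)
try:
    leaf.scatter_(1, index, source)          # in-place write into a leaf -> RuntimeError
except RuntimeError as e:
    print('fails as in the issue:', e)

# Workaround: derive a non-leaf tensor and scatter into (or out-of-place from) that.
filled = torch.zeros(1, 3) + leaf            # `filled` is non-leaf; grads still reach `leaf`
filled = filled.scatter(1, index, source)    # out-of-place scatter is always safe
filled.sum().backward()
print(leaf.grad)                             # gradient is defined: 1 where not overwritten
```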

ypxie commented 7 years ago

To run on CPU, I just set cuda to False, and it runs seamlessly. Could you try it with Python 2.7? That's the only cause I can think of.

AjayTalati commented 7 years ago

I got it to run on the CPU, i.e. without the leaf-variable error, by reversing the commenting-out of the lines around 144 in neucom/memory.py (the ones with cpu() in them); that allowed it to run under Python 3.6. But it did not train properly: it just converged to about 0.26, which is much worse than on the GPU.

So it does work with PyTorch on both CPU and GPU, but the behaviour is a lot different from TensorFlow, even though the algorithm and code are very similar.

I will give it a go with an install of Anaconda 2.7 and CPU-only PyTorch. I've spent so much time on this, and I really want to get it to work!

ypxie commented 7 years ago

If you changed that line, it will have issues when you run it in GPU mode. At the time I wrote it, that function would complain if its inputs were hosted on the GPU.

You can just use the GPU for now; CPU mode has some memory issues as well, which still need to be fixed.

AjayTalati commented 7 years ago

OK, thanks a lot :)

I won't be able to do any testing today, but please let me know if and when you fix the CPU capability - that would be really, really COOL :+1: