Closed AjayTalati closed 7 years ago
Hi, thanks for pointing this out. The checkpoint error has been fixed. As for the nan error, I think it might be an optimization issue. You can try a smaller learning rate.
The nan error seems to be rooted in the relu on the controller output, which can produce an all-zero key and so results in nan in the backward pass of the cosine distance. The latest version should be more stable.
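To illustrate the all-zero key problem described above, here is a minimal framework-free sketch, not the code used in the repository. It assumes the usual fix of adding a small `eps` under the square root of each norm. The function names and the `eps` value are hypothetical choices for this example. Without `eps`, a zero key has norm 0, and the gradient of the norm divides by it, which shows up as nan in floating-point tensor libraries.

```python
import math

def stable_cosine_similarity(key, row, eps=1e-6):
    """Cosine similarity with eps added under the square root of each norm."""
    dot = sum(k * r for k, r in zip(key, row))
    key_norm = math.sqrt(sum(k * k for k in key) + eps)
    row_norm = math.sqrt(sum(r * r for r in row) + eps)
    return dot / (key_norm * row_norm)

def stable_cosine_grad_wrt_key(key, row, eps=1e-6):
    """Analytic gradient of stable_cosine_similarity with respect to key."""
    dot = sum(k * r for k, r in zip(key, row))
    key_norm = math.sqrt(sum(k * k for k in key) + eps)
    row_norm = math.sqrt(sum(r * r for r in row) + eps)
    # d/dk_i [dot / (|k| |r|)] = r_i / (|k| |r|) - dot * k_i / (|k|^3 |r|)
    return [
        r / (key_norm * row_norm) - dot * k / (key_norm ** 3 * row_norm)
        for k, r in zip(key, row)
    ]

# An all-zero key, as a relu can produce, compared against one memory row.
zero_key = [0.0, 0.0, 0.0]
memory_row = [1.0, 2.0, 3.0]
similarity = stable_cosine_similarity(zero_key, memory_row)
gradient = stable_cosine_grad_wrt_key(zero_key, memory_row)
```

With `eps` inside the square root, both the similarity and its gradient stay finite for the zero key, at the cost of a slightly smaller similarity for ordinary keys.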
Thanks a lot :+1: - will try it, and give you feedback :)
Hi, I keep playing around with hyperparameters, but even with the latest version I still get a lot of runs with NaNs early on. When it works it's cool, but it's a bit tiring restarting again and again.
Avg. Logistic Loss: 0.6933
Iteration 50/100000
Avg. Logistic Loss: 0.5672
Iteration 100/100000
Avg. Logistic Loss: 0.3351
Iteration 150/100000
Avg. Logistic Loss: 0.2968
Iteration 200/100000
Avg. Logistic Loss: 0.2923
Iteration 250/100000
Avg. Logistic Loss: 0.2872
Iteration 300/100000
Avg. Logistic Loss: 0.2894
Iteration 350/100000
Avg. Logistic Loss: 0.2839
Iteration 400/100000
Avg. Logistic Loss: 0.2857
Iteration 450/100000
Avg. Logistic Loss: nan
Iteration 500/100000
Avg. Logistic Loss: nan
Iteration 550/100000
Avg. Logistic Loss: nan
Iteration 600/100000
Avg. Logistic Loss: nan
Iteration 650/100000
Avg. Logistic Loss: nan
Iteration 700/100000
Avg. Logistic Loss: nan
Iteration 750/100000
Avg. Logistic Loss: nan
Iteration 800/100000
Avg. Logistic Loss: nan
Iteration 850/100000
Avg. Logistic Loss: nan
Iteration 900/100000
Avg. Logistic Loss: nan
Iteration 950/100000
Avg. Logistic Loss: nan
Have you got any advice on ways to make the training stable, e.g. good hyperparameter settings or weight initialisation? I'll keep playing around with it though. Thanks, Aj
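One way to avoid restarting by hand is to detect the first non-finite loss and note the last iteration that was still healthy, so a run can be rolled back to a checkpoint from that point. The sketch below is only an illustration under that assumption. `NanGuard` is a hypothetical helper, not part of the repository, and the loss values are the ones from the log above, taking the first value to belong to iteration 0.

```python
import math

class NanGuard:
    """Remembers the last iteration whose loss was finite."""

    def __init__(self):
        self.last_good_iteration = None
        self.first_bad_iteration = None

    def check(self, iteration, loss):
        """Return True if the loss is finite, False once it is nan or inf."""
        if math.isfinite(loss):
            self.last_good_iteration = iteration
            return True
        if self.first_bad_iteration is None:
            self.first_bad_iteration = iteration
        return False

# Loss values copied from the log above, one every 50 iterations.
logged_losses = [0.6933, 0.5672, 0.3351, 0.2968, 0.2923,
                 0.2872, 0.2894, 0.2839, 0.2857, float("nan")]

guard = NanGuard()
for index, loss in enumerate(logged_losses):
    if not guard.check(index * 50, loss):
        break  # stop here and reload the checkpoint from last_good_iteration
```

In a real training loop the `break` would be replaced by reloading saved weights, possibly with a smaller learning rate as suggested earlier in the thread.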
That's strange, I tested it several times, and it looks quite stable to me and could converge to 0.01.
Did you update all the files? You can try with a new git clone.
I will also take a closer look.
Fresh pull from Git, but it's working OK this run!
How many iterations do you need for 0.001? It usually converges to 0.01 for me, and then I get NaNs.
I'm going to try it on a better GPU, maybe it's just this machine. Which GPU are you using?
Glad to know that it is working :D. Did you get 0.01 and nan from the older version? I could get 0.001 with a smaller network config (nhid and mem_size = 64, and a shorter seq). It usually takes more than 15000 iterations. I am using a laptop GeForce 940M GPU. Let me know if you still get the annoying nan.
Erm, I've got 0.01 from the latest version too, so it looks like it's my GPU. What do you get if you use a CPU?
For some reason when I try on my CPU it crashes, and gives the following error,
AssertionError: leaf variable was used in an inplace operation
and this is with the latest pull and a fresh pytorch. To be honest I don't understand why I get this with the CPU and not when I run it with CUDA, very strange ???
I could run it on CPU, and it's much faster than the GPU version. = = The problem I have is that running in CPU mode consumes more and more memory over time; some other people have also reported similar issues with pytorch.
I am using the latest torch.
How did you manage to run it on CPU ????? I might try reinstalling pytorch without CUDA ???
Yes, I've seen this memory leak too when running multithreaded A3C !!! I managed to fix it by getting rid of all logging and any unnecessary things stored in the training loop.
It's a pytorch thing (not an algorithm thing), as I never get it in TensorFlow. There's some sort of memory leak, which you can plug with garbage collection, but even then it still grows for long runs ???
Have you tried MxNet ?
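On the point above about removing logging and stored values from the training loop: one known source of growing memory in PyTorch is accumulating loss tensors, which keeps their autograd history alive, so storing plain Python floats in a bounded buffer is a common precaution. Whether that is the cause of the growth seen in this thread is not established. The sketch below is framework-free, and `RunningLoss` and its `window` parameter are hypothetical names for this example.

```python
from collections import deque

class RunningLoss:
    """Keeps only the most recent losses, stored as plain Python floats."""

    def __init__(self, window=50):
        # deque with maxlen drops the oldest entry automatically when full.
        self.values = deque(maxlen=window)

    def update(self, loss):
        # float() stores a plain number instead of a framework object.
        self.values.append(float(loss))

    def average(self):
        if not self.values:
            return float("nan")
        return sum(self.values) / len(self.values)

# Simulated training loop: memory used by the tracker stays constant.
tracker = RunningLoss(window=50)
for step in range(1000):
    tracker.update(1.0 / (step + 1))
```

The buffer never holds more than `window` entries no matter how long the run is, so the logging itself cannot grow without bound.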
Well here's my error message with the latest version of the code, and a fresh install of PyTorch,
Iteration 0/100000
Traceback (most recent call last):
File "train.py", line 153, in <module>
output, _ = ncomputer.forward(input_data)
File "../../neucom/dnc.py", line 110, in forward
interface['erase_vector']
File "../../neucom/memory.py", line 389, in write
allocation_weight = self.get_allocation_weight(sorted_usage, free_list)
File "../../neucom/memory.py", line 144, in get_allocation_weight
flat_unordered_allocation_weight.cpu()
File "/home/ajay/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 636, in scatter_
return Scatter(dim, True)(self, index, source)
RuntimeError: a leaf Variable that requires grad has been used in an in-place operation.
Have you got any ideas how to fix this?
To run on cpu, I just need to set cuda to False, and it could run seamlessly. Could you try it on python 2.7? I can only think of this as the cause.
I got it to run on the CPU, i.e. without the leaf error, by reversing the commenting-out of the lines around 144 in neucom/memory.py (the ones which have cpu() in them). That allowed it to run using python 3.6. But it did not train, it just converged to about 0.26, which is much worse than GPU ???
So it does work with PyTorch on both CPU and GPU, but it's a lot different to TensorFlow, even though the algorithm and code are very similar ???
I will give it a go with an install of Anaconda 2.7 and PyTorch CPU. Spent so much time on this, really want to get it to work!
If you changed that line, it should cause some issues when you run it in GPU mode. At the time I wrote it, that function would complain if its inputs were hosted on the GPU.
You can just use the GPU; CPU mode has some memory issues as well which need to be fixed.
OK, thanks a lot :)
I won't be able to do any testing today, but please let me know, if and when you fix the CPU capability - that will be really really COOL :+1:
Hi,
I think the training progresses well, and there doesn't seem to be any improvement after 10,000 iterations.
May I ask how to serialise and save the DNC at checkpoints? I set a checkpoint after 100 iterations, and got the following error,
PS - Sometimes I get nans very early on, and have to restart, and I also got nans after 40,000 iterations. I've seen this very often with Neural Turing Machines, so I guess it's inherent with these types of things?