zeyunZhao opened this issue 2 weeks ago
I have changed my environment as follows:
PyTorch
The data fed into the models are:
My results are as follows:
Could you help me out, please?
In cajal:
When I run "mpirun -n 4 python parallel_thresholds.py", I can get the figure that you presented.
But when I run "mpirun -n 64 python parallel_thresholds.py", the results are random and very weird.
I also tried using only 4 CPUs to generate the data and train the model, but I still ran into the same problem.
When I print the value of y in bptt.py while the loss is NaN, I find that y contains NaN values, which means generate_data.py is producing NaN data. Do you know how I can solve this problem?
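For reference, a minimal sketch of such a NaN check, assuming a standard PyTorch training loop; the variable names are placeholders, not the actual ones in bptt.py:

```python
import torch

def assert_finite(name, tensor):
    """Fail fast with a clear message if a batch contains NaN/Inf values."""
    if not torch.isfinite(tensor).all():
        bad = (~torch.isfinite(tensor)).sum().item()
        raise ValueError(f"{name} contains {bad} non-finite values")

# Example usage just before the loss computation in the training loop:
# assert_finite("x", x)
# assert_finite("y", y)
# loss = criterion(model(x), y)
```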
Any reply will be appreciated!
I have reinstalled my environment as you recommended.
python           3.10.14      hd12c33a_0_cpython            conda-forge
python-dateutil  2.9.0.post0  pypi_0                        pypi
python_abi       3.10         5_cp310                       conda-forge
pytorch          2.0.0        py3.10_cuda11.7_cudnn8.5.0_0  pytorch
pytorch-cuda     11.7         h778d358_5                    pytorch
pytorch-mutex    1.0          cuda                          pytorch
openmpi          4.1.2        hbfc84c5_0                    conda-forge
NEURON           8.2.6
But I still cannot generate normal data like the data you provided. The data I generate always contains NaN, no matter how many CPUs I use. I have tried 1, 4, 16, 32, and 64 CPUs, and the results always contain NaN.
I print them as follows:
print(dataset_name, ": ", h5_file[dataset_name][:].shape, h5_file[dataset_name][:].dtype, np.isnan(h5_file[dataset_name][:]).any())
Datasets in the HDF5 file:
activations : (1, 32, 32)           float32  False
diameters   : (1, 32, 32)           float32  False
e           : (1, 32, 32, 53, 1001) float32  True
h           : (1, 32, 32, 53, 1001) float32  True
m           : (1, 32, 32, 53, 1001) float32  True
p           : (1, 32, 32, 53, 1001) float32  True
parameters  : (1, 32, 64)           float32  False
s           : (1, 32, 32, 53, 1001) float32  True
v           : (1, 32, 32, 53, 1001) float32  True
ys          : (1, 32, 32)           float32  False
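For completeness, a self-contained sketch of this inspection, assuming the file was written with h5py and contains only flat datasets; "data.h5" is a placeholder path:

```python
import h5py
import numpy as np

# Open the generated file and report shape, dtype, and NaN presence per dataset.
with h5py.File("data.h5", "r") as h5_file:
    print("Datasets in the HDF5 file:")
    for dataset_name in h5_file.keys():
        arr = h5_file[dataset_name][:]
        print(dataset_name, ": ", arr.shape, arr.dtype, np.isnan(arr).any())
```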
Could you help me out with this?
"This can be a more subtle issue. You can change the denominator to (n + 1e-5), but actually ensuring that your model trains will involve checking that your training data is healthy (as I said in a previous reply, depending on the version of MPI against which you built your parallel hdf5 you could have weird floating point values in your training data that will cause the model to predict NaN), checking that you have an adequate batch size for your data; there may also be differences in the arithmetic between versions of PyTorch. I'm going to close this issue; further discussion about model training will be best done in the "Discussions" page."
I tried to generate data with only one CPU, and the data format is as follows:
Datasets in the HDF5 file:
activations : (1, 1, 32)           float32
diameters   : (1, 1, 32)           float32
e           : (1, 1, 32, 53, 1001) float32
h           : (1, 1, 32, 53, 1001) float32
m           : (1, 1, 32, 53, 1001) float32
p           : (1, 1, 32, 53, 1001) float32
parameters  : (1, 1, 64)           float32
s           : (1, 1, 32, 53, 1001) float32
v           : (1, 1, 32, 53, 1001) float32
ys          : (1, 1, 32)           float32
and I still get the same NaN loss:
tensor(nan, device='cuda:0', grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
Is it really due to MPI?
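One way to narrow down where the NaN first appears during training is PyTorch's built-in anomaly detection; the sketch below uses placeholder training-loop names and should only be enabled for debugging, since it slows training considerably:

```python
import torch

# Report the first autograd operation that produces NaN in the backward pass.
torch.autograd.set_detect_anomaly(True)

# Inside the training loop (placeholder names):
# loss = criterion(model(x), y)
# if torch.isnan(loss):
#     print("NaN loss; finite x:", torch.isfinite(x).all().item(),
#           "finite y:", torch.isfinite(y).all().item())
# loss.backward()
```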