zeyunZhao opened this issue 2 weeks ago
I have changed my environment as follows:
PyTorch
The data fed into the models are:
My results are as follows:
Could you help me out, please?
In cajal:
When I run "mpirun -n 4 python parallel_thresholds.py", I can get the figure that you presented.
But when I run "mpirun -n 64 python parallel_thresholds.py", the results are random and very weird.
I also tried using only 4 CPUs to generate the data and train the model, but I still ran into the same problem.
When I print the value of y in bptt.py while the loss is NaN, I find that y contains NaN values, which means generate_data.py is producing NaN data. Do you know how I can solve this problem?
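For reference, a minimal sketch of such a NaN check, assuming a standard PyTorch training loop; the variable names are placeholders, not the actual ones in bptt.py:

```python
import torch

def assert_finite(name, tensor):
    """Fail fast with a clear message if a batch contains NaN/Inf values."""
    if not torch.isfinite(tensor).all():
        bad = (~torch.isfinite(tensor)).sum().item()
        raise ValueError(f"{name} contains {bad} non-finite values")

# Example usage just before the loss computation in the training loop:
# assert_finite("x", x)
# assert_finite("y", y)
# loss = criterion(model(x), y)
```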
Any reply will be appreciated!
I have reinstalled my environment as you recommended.
python           3.10.14      hd12c33a_0_cpython            conda-forge
python-dateutil  2.9.0.post0  pypi_0                        pypi
python_abi       3.10         5_cp310                       conda-forge
pytorch          2.0.0        py3.10_cuda11.7_cudnn8.5.0_0  pytorch
pytorch-cuda     11.7         h778d358_5                    pytorch
pytorch-mutex    1.0          cuda                          pytorch
openmpi          4.1.2        hbfc84c5_0                    conda-forge
NEURON           8.2.6
But I still cannot generate normal data like the data you provided. The data I generate always contains NaN, no matter how many CPUs I use. I have tried 1, 4, 16, 32, and 64 CPUs, and the results always contain NaN.
I print them as follows:
print(dataset_name, ": ", h5_file[dataset_name][:].shape, h5_file[dataset_name][:].dtype, np.isnan(h5_file[dataset_name][:]).any())
Datasets in the HDF5 file:
activations : (1, 32, 32)           float32  False
diameters   : (1, 32, 32)           float32  False
e           : (1, 32, 32, 53, 1001) float32  True
h           : (1, 32, 32, 53, 1001) float32  True
m           : (1, 32, 32, 53, 1001) float32  True
p           : (1, 32, 32, 53, 1001) float32  True
parameters  : (1, 32, 64)           float32  False
s           : (1, 32, 32, 53, 1001) float32  True
v           : (1, 32, 32, 53, 1001) float32  True
ys          : (1, 32, 32)           float32  False
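For completeness, a self-contained sketch of this inspection, assuming the file was written with h5py and contains only flat datasets; "data.h5" is a placeholder path:

```python
import h5py
import numpy as np

# Open the generated file and report shape, dtype, and NaN presence per dataset.
with h5py.File("data.h5", "r") as h5_file:
    print("Datasets in the HDF5 file:")
    for dataset_name in h5_file.keys():
        arr = h5_file[dataset_name][:]
        print(dataset_name, ": ", arr.shape, arr.dtype, np.isnan(arr).any())
```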
Could you help me out with this?
"This can be a more subtle issue. You can change the denominator to (n + 1e-5), but actually ensuring that your model trains will involve checking that your training data is healthy (as I said in a previous reply, depending on the version of MPI against which you built your parallel hdf5 you could have weird floating point values in your training data that will cause the model to predict NaN), checking that you have an adequate batch size for your data; there may also be differences in the arithmetic between versions of PyTorch. I'm going to close this issue; further discussion about model training will be best done in the "Discussions" page."
I tried to generate data with only one CPU, and the data format is as follows:
Datasets in the HDF5 file:
activations : (1, 1, 32)           float32
diameters   : (1, 1, 32)           float32
e           : (1, 1, 32, 53, 1001) float32
h           : (1, 1, 32, 53, 1001) float32
m           : (1, 1, 32, 53, 1001) float32
p           : (1, 1, 32, 53, 1001) float32
parameters  : (1, 1, 64)           float32
s           : (1, 1, 32, 53, 1001) float32
v           : (1, 1, 32, 53, 1001) float32
ys          : (1, 1, 32)           float32
and I still get the same NaN loss:
tensor(nan, device='cuda:0', grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
Is it really due to MPI?
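One way to narrow down where the NaN first appears during training is PyTorch's built-in anomaly detection; the sketch below uses placeholder training-loop names and should only be enabled for debugging, since it slows training considerably:

```python
import torch

# Report the first autograd operation that produces NaN in the backward pass.
torch.autograd.set_detect_anomaly(True)

# Inside the training loop (placeholder names):
# loss = criterion(model(x), y)
# if torch.isnan(loss):
#     print("NaN loss; finite x:", torch.isfinite(x).all().item(),
#           "finite y:", torch.isfinite(y).all().item())
# loss.backward()
```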