yrj90 / mil-pain

Code and notes for experiments of mil-pain

pytorch #2

Open yrj90 opened 5 years ago

yrj90 commented 5 years ago

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/generic/THCTensorCopy.cpp:70

/opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [1,0,0] Assertion t >= 0 && t < n_classes failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/generic/THCTensorCopy.cpp line=70 error=59 : device-side assert triggered

yrj90 commented 5 years ago

Problem #N-2

/opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [1,0,0] Assertion t >= 0 && t < n_classes failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/generic/THCTensorCopy.cpp line=70 error=59 : device-side assert triggered
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/generic/THCTensorCopy.cpp:70

:+1: ->> This error means that a label falls outside the valid range 0 <= label < n_classes. First check whether the labels contain negative values. Then check whether any label is greater than or equal to the number of classes. My initial labels were -1 and 1. I tried to turn them into 0 and 1 by adding 1, but forgot that an initial label of 1 then becomes 2, which equals the number of classes. That was the cause. -> Solution: add 1 to the initial labels and then divide by 2, so that -1 becomes 0 and 1 becomes 1. Doing this on the whole label array at once saves the time and computation of a for loop.
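The remapping described above can be sketched as follows. This is a minimal example, assuming the labels are held in an integer tensor; the helper names `remap_pm1_to_01` and `check_labels` are made up for illustration and are not part of the repository.

```python
import torch


def remap_pm1_to_01(labels: torch.Tensor) -> torch.Tensor:
    """Map labels in {-1, 1} to {0, 1} on the whole tensor, without a loop."""
    return ((labels + 1) // 2).long()


def check_labels(labels: torch.Tensor, n_classes: int) -> None:
    """Fail early with a readable message instead of a device-side assert."""
    lo, hi = int(labels.min()), int(labels.max())
    if lo < 0 or hi >= n_classes:
        raise ValueError(
            f"labels span [{lo}, {hi}], expected [0, {n_classes - 1}]"
        )


raw = torch.tensor([-1, 1, 1, -1])   # hypothetical initial labels
targets = remap_pm1_to_01(raw)       # -1 -> 0, 1 -> 1
check_labels(targets, n_classes=2)   # passes silently
```

Running the range check on the CPU before the loss call gives an ordinary Python exception, which is usually easier to read than the CUDA assert.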

yrj90 commented 5 years ago

Problem #N

clip = torch.stack(clip, 0).permute(1, 0, 2, 3) RuntimeError: expected a non-empty list of Tensors

:+1: ->> The problem lies in how the image paths are built. Previously I used os.path.join(video_dir_path, video_dir_path.split(sep='/')[-1] + '{:03d}.png'.format(i)), which assumes the frames of every video are numbered consecutively from 001. But in the database, some images with technical errors (black images in which no object appears) have been deleted, so the numbers in the image names are not consecutive. For example, 095/tv095t2aeunaff001.png does not exist; in this video the images begin at number 008. Paths built for missing numbers load nothing, which is how the list of tensors ends up empty.
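One way to avoid depending on consecutive numbering is to list the files that actually exist. A minimal sketch, assuming zero-padded frame numbers so that name order equals frame order; `list_frame_paths` is a hypothetical helper, not the repository's loader.

```python
import os
import tempfile


def list_frame_paths(video_dir_path, ext=".png"):
    """Return the paths of the frames that actually exist, in name order."""
    names = sorted(n for n in os.listdir(video_dir_path) if n.endswith(ext))
    return [os.path.join(video_dir_path, n) for n in names]


# Demo on a throwaway directory whose frames start at 008 instead of 001.
demo_dir = tempfile.mkdtemp()
for name in ["clip010.png", "clip008.png", "clip009.png"]:
    open(os.path.join(demo_dir, name), "w").close()

frame_paths = list_frame_paths(demo_dir)
frame_names = [os.path.basename(p) for p in frame_paths]
```

Frames are then addressed by their position in `frame_paths` (starting at 0) rather than by the number in the file name.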

Stay calm and keep active thinking. :baby:

yrj90 commented 5 years ago

Problem #N+1

Wrong path!!! /home/ryang/cuda-workspace/A3D/data/UNBCdevkit/UNBC_pain_identify/JPEGImages/124-dn124/dn124t1aaunaff/dn124t1aeaff001.png

The directory dn124t1aaunaff and the file dn124t1aeaff001.png do not belong to the same video!!! This indicates that the image paths and the labels are not loaded consistently.

-> The reason is in def generate_seg(): when splitting a video into segments, the for loop used to begin at 1, namely: for i in range(1, (n_frames[j] - step), step). This was so that the frame index could later be used directly to build the image name and image path (the images are named with numbers like 001, 002, so the index could not start from 0).

:+1: ->> After changing how the images are loaded, this loop has to change too: start the index from 0, not 1.
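A minimal sketch of the 0-based version, assuming the frames of a video are already available as a sorted list. The function name `generate_segments` and the frame names are illustrative; only the range bounds follow the loop quoted above.

```python
def generate_segments(n_frames, step):
    """Segment start positions, as 0-based indices into a sorted frame list."""
    # Same bounds as the quoted loop, but starting at 0 instead of 1.
    return list(range(0, n_frames - step, step))


# Hypothetical video whose first existing frame is number 008.
frames = ["f008.png", "f009.png", "f010.png", "f011.png", "f012.png", "f013.png"]
starts = generate_segments(len(frames), step=2)
clips = [frames[s:s + 2] for s in starts]
```

With the original upper bound `n_frames - step`, the final segment of the list is not produced; whether that is intended is worth checking separately.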

yrj90 commented 5 years ago

Problem #N-1

RuntimeError: Expected object of type torch.LongTensor but found type torch.FloatTensor

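:+1: ->> Fixed by casting the targets to long: targets = targets.type(torch.cuda.LongTensor). The shorter targets = targets.long() does the same cast and keeps the tensor on its current device.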

yrj90 commented 5 years ago

Use the following to check whether your code has indentation problems:

python -m tabnanny main.py

yrj90 commented 5 years ago

Problem #N

File "/wrk/yangruij/DONOTREMOVE/git/A3D/train.py", line 40, in train_epoch
losses.update(loss.data[0], inputs.size(0))
IndexError: invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number

:+1: You might have to change loss.data[0] to loss.item() as indicated in the error message.

yrj90 commented 5 years ago

Problem #N

File "/wrk/yangruij/DONOTREMOVE/git/A3D/train.py", line 56, in train_epoch
'lr': optimizer.param_groups[0]['lr']
File "/wrk/yangruij/DONOTREMOVE/git/A3D/utils.py", line 38, in log
assert col in values
AssertionError
srun: error: g105: task 0: Exited with exit code 1

:+1: ->> Check where the batch and epoch loggers are initialized in main.py. The headers are set there. If you delete an item from the logged values in train.py, you also need to delete the corresponding header in main.py where the logger is initialized, because every header column must be present in the values passed to log().
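The relationship between the headers and the logged values can be sketched like this. The `Logger` class here is an assumed reconstruction based only on the `assert col in values` line in the traceback, not a copy of utils.py; it writes to an in-memory buffer to keep the example self-contained.

```python
import csv
import io


class Logger:
    """Tab-separated logger that requires a value for every header column."""

    def __init__(self, stream, header):
        self.writer = csv.writer(stream, delimiter="\t")
        self.header = header
        self.writer.writerow(header)

    def log(self, values):
        # Report which columns are missing instead of a bare AssertionError.
        missing = [col for col in self.header if col not in values]
        assert not missing, f"missing columns: {missing}"
        self.writer.writerow([values[col] for col in self.header])


buf = io.StringIO()
epoch_logger = Logger(buf, ["epoch", "loss", "lr"])
epoch_logger.log({"epoch": 1, "loss": 0.5, "lr": 0.01})
```

Adding a message to the assertion makes it obvious which header no longer has a matching key.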

yrj90 commented 5 years ago

Run code on CSC:

srun -N 1 -n 1 --mem-per-cpu=48000 -t72:00:00 --gres=gpu:p100:1 -p gpu python Blstm_rawJoint.py

yrj90 commented 5 years ago

Problem N

File "/wrk/yangruij/DONOTREMOVE/git/A3D_Regression/UNBC_dataloader.py", line 183, in __getitem__
clip = torch.stack(clip, 0).permute(1, 0, 2, 3)
RuntimeError: expected a non-empty list of Tensors

:+1: -> The image path may be wrong, so no images were loaded into the tensor. Here it was caused by a mistake made when generating ImageSets: the paths in it were video paths, not image paths.

yrj90 commented 5 years ago

Problem # N

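File "main.py", line 103, in trainval
criterion = nn.MSELoss.cuda(args.gpu)
File "/appl/opt/python/mlpython-3.6.3/site-packages/torch/nn/modules/module.py", line 260, in cuda
return self._apply(lambda t: t.cuda(device))
AttributeError: 'int' object has no attribute '_apply'

:+1: -> Instantiate the loss before moving it to the GPU: criterion = nn.MSELoss(); criterion.cuda(args.gpu). Calling nn.MSELoss.cuda(args.gpu) on the class itself passes the integer args.gpu as self, hence the 'int' object error.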

yrj90 commented 5 years ago

Prob # N+2

ret = torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: Expected object of scalar type Float but got scalar type Long for argument #2 'target'

:+1: -> Cast the target to float: target = target.type(torch.cuda.FloatTensor)
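A small device-agnostic sketch of the two fixes for the regression loss: instantiate `nn.MSELoss` before moving it, and cast integer targets to float. The tensors are made-up values, and `.float()` is used here as a shorter alternative to `.type(torch.cuda.FloatTensor)` that also works on the CPU.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create the module instance first, then move it to the device.
criterion = nn.MSELoss().to(device)

outputs = torch.tensor([0.2, 1.7, 3.1], device=device)   # float32 predictions
targets = torch.tensor([0, 2, 3], device=device)         # int64 labels

# MSELoss needs input and target to share a floating-point dtype.
loss = criterion(outputs, targets.float())
loss_value = loss.item()
```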

yrj90 commented 5 years ago

Prob # N+3

No module named sklearn

:+1: -> If the PyDev project is set to the correct Python interpreter, namely the python in the conda environment, then this error means that scikit-learn is not installed in that environment. So first, on the command line, activate the environment with source activate ptenv, then run conda install scikit-learn. After that everything works.

To look into later: Anaconda Navigator

yrj90 commented 5 years ago

Prob # N+4

File "/home/ryang/cuda-workspace/A3D_Classify_v1/train.py", line 49, in train_epoch
outputs.cpu().numpy(), targets.cpu().numpy()))
RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.

:+1: -> As the message suggests, detach the tensor from the graph first: outputs.detach().cpu().numpy()

Notes on two related tensor methods:

  1. tensor.view() :

Returns a new tensor with the same data as the self tensor but of a different shape.

  2. self.expand_as(other)

Expand this tensor to the same size as other. self.expand_as(other) is equivalent to self.expand(other.size()).

expand(*sizes) → Tensor

Returns a new view of the self tensor with singleton dimensions expanded to a larger size. That is, the tensor appears repeated along those dimensions up to the desired size (e.g. from 3x1 -> 3x4), without copying the underlying data.
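The behaviour of `view`, `expand`, and `expand_as` can be checked on a small tensor. This is only an illustration with made-up values; the variable names are arbitrary.

```python
import torch

col = torch.tensor([[1.0], [2.0], [3.0]])   # shape 3x1

row = col.view(1, 3)                         # same data, different shape
wide = col.expand(3, 4)                      # 3x1 -> 3x4, no data copied
same = col.expand_as(torch.zeros(3, 4))      # equivalent to expand(3, 4)

# The expanded tensor still points at the original storage.
shares_memory = wide.data_ptr() == col.data_ptr()

# An expanded tensor is not contiguous, so copy it before flattening with view.
flat = wide.contiguous().view(-1)
```

Because `expand` only creates a view, writing into an expanded tensor in place is best avoided; call `.contiguous()` or `.clone()` first if a real copy is needed.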

yrj90 commented 5 years ago

Prob # N+5

F-score is ill-defined and being set to 0.0 in labels with no true samples. UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.

-> This happens because the array passed to the metric is two-dimensional, e.g. array([[0, 1, 1, 0, 1, 1]]). Only the vector is needed, so use x[0] to take the first row of the two-dimensional array.

yrj90 commented 5 years ago

Experience #1

In the log files, acc is often recorded as a tensor instead of a number. This is caused by the code acc.data[0]. Use acc.item() instead.

Use torch.Tensor.item() to get a Python number from a tensor containing a single value:

>>> x = torch.tensor([[1]])
>>> x
tensor([[ 1]])
>>> x.item()
1

NOTE: This only applies to tensors with exactly one element. A tensor that holds several values needs other code, e.g. tensor.tolist().
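A short sketch of both cases, using made-up accuracy values: `item()` for a single-element tensor and `tolist()` for a tensor with several elements.

```python
import torch

acc = torch.tensor([[0.75]])            # exactly one element
acc_value = acc.item()                  # plain Python float, safe to log

per_class = torch.tensor([0.5, 0.25, 1.0])
per_class_values = per_class.tolist()   # list of Python floats
```

Logging `acc_value` and `per_class_values` writes plain numbers to the log file rather than the tensor representation.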

yrj90 commented 5 years ago

Prob #1

ValueError: Classification metrics can't handle a mix of binary and continuous-multioutput targets

yrj90 commented 5 years ago

Prob # 2

RuntimeWarning: invalid value encountered in float_scalars r = r_num / r_den

This happens when PCC is reported per batch and the ground truth values in a batch are all 0. The ground truth then has zero variance, so the denominator of the correlation is 0 and the division returns 'nan'. It is better to report PCC once per epoch.
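The effect can be reproduced with a plain-Python Pearson correlation. This is a sketch with made-up batches; the `pearson` helper is written for illustration and returns nan explicitly where a library routine would divide by zero.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; nan when either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    r_num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    r_den = math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )
    return r_num / r_den if r_den > 0 else float("nan")


# Two hypothetical batches of (predictions, ground truth).
batches = [([0.1, 0.2], [0.0, 0.0]), ([0.9, 0.4], [1.0, 0.0])]

first_batch_pcc = pearson(*batches[0])   # ground truth all 0 -> nan

# Accumulate over the epoch and compute the correlation once at the end.
preds, truths = [], []
for batch_pred, batch_true in batches:
    preds.extend(batch_pred)
    truths.extend(batch_true)
epoch_pcc = pearson(preds, truths)
```

Over the whole epoch the ground truth is no longer constant, so the correlation is defined.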

yrj90 commented 5 years ago

Experience 2 - running on csc

seff jobnumber to check the CPU and GPU usage and efficiency
sjstat to check the available GPUs

yrj90 commented 5 years ago

Experience 3

srun -N 1 -n 1 --mem-per-cpu=36000 -t36:00:00 --gres=gpu:p100:4 -p gpu python main.py

yrj90 commented 5 years ago

Prob #1

FileExistsError: [Errno 17] File exists:

The code already guards the call with an if statement, but the error still occurs:

if not os.path.exists(person_test_logger): os.makedirs(person_test_logger)

:+1: ->

Since Python 3.2, os.makedirs() accepts a third, optional argument exist_ok:

os.makedirs(mydir, exist_ok=True)
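A minimal sketch of the `exist_ok` form, using a temporary directory so it can be run anywhere. One plausible cause of the original error, stated here as an assumption, is that another process creates the directory between the `os.path.exists` check and the `os.makedirs` call.

```python
import os
import tempfile

log_root = tempfile.mkdtemp()                       # stand-in for the real log dir
person_test_logger = os.path.join(log_root, "person_05", "test")

# No separate existence check needed: an existing directory is not an error.
os.makedirs(person_test_logger, exist_ok=True)
os.makedirs(person_test_logger, exist_ok=True)      # second call is a no-op
```

Note that `exist_ok=True` still raises if the path exists but is a regular file rather than a directory.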

yrj90 commented 5 years ago

Prob #2

on CSC: CUDA out of memory

-> Change

srun -N 1 -n 4 --mem-per-cpu=36000 -t36:00:00 --gres=gpu:p100:4 -p gpu python main.py

to

srun -N 1 -n 2 --mem-per-cpu=36000 -t36:00:00 --gres=gpu:p100:4 -p gpu python main.py

or

srun -N 1 -n 1 --mem-per-cpu=36000 -t36:00:00 --gres=gpu:p100:4 -p gpu python main.py

The -n option sets the number of tasks that srun launches, so a smaller value presumably leaves fewer copies of main.py competing for GPU memory.

However, the official CSC guidance may suggest using n > 1 to prevent congestion on the cluster.

yrj90 commented 5 years ago

Prob #1

TypeError: tensor is not a torch image

:+1: ->

The ToTensor transform should come before the Normalize transform, since Normalize expects a tensor, whereas the Resize transform returns a PIL image. The correct order is: Resize -> ToTensor -> Normalize

yrj90 commented 5 years ago

Prob #2

TypeError: batch must contain tensors, numbers, dicts or lists; found <class 'PIL.Image.Image'>

->

The error states that the DataLoader received a PIL image. This is because no transform is applied to the image (transform=None). The __getitem__ method of MyDataset passes an unprocessed PIL image to the DataLoader, whereas it should pass a tensor.

You can add a transform that creates a tensor from the PIL image, for example transform=transforms.ToTensor().

yrj90 commented 5 years ago

Prob #1

When using 'AdaptiveMaxPool2d':

AttributeError: 'tuple' object has no attribute 'squeeze'

This is because train.py calls outputs = model(inputs), while Networklessparam.py has return pred, h1, h2, so outputs is a tuple containing the three returned elements.

-> Changing the call to outputs, h1, h2 = model(inputs) works.

yrj90 commented 5 years ago

Prob #1

transforms/transforms.py", line 49, in __call__
img = t(img)
TypeError: object() takes no parameters

-> Change transforms.ToTensor to transforms.ToTensor(). The list of transforms needs an instance, not the class itself.

yrj90 commented 5 years ago

Prob #1

t.randomize_parameters()
TypeError: randomize_parameters() missing 1 required positional argument: 'self'

-> Change RandomHorizontalFlip to RandomHorizontalFlip(). As with ToTensor, the class was passed where an instance was expected.
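The difference between passing a class and passing an instance can be reproduced without any library. `RandomFlip` and `apply_all` below are hypothetical stand-ins for the transform and the composing loop, written only to show the error.

```python
class RandomFlip:
    """Hypothetical stand-in for a transform with per-clip random state."""

    def __init__(self):
        self.flip = False

    def randomize_parameters(self):
        self.flip = True          # deterministic here to keep the demo simple

    def __call__(self, img):
        return img[::-1] if self.flip else img


def apply_all(transform_list, img):
    """Mimics a compose loop: randomize each transform, then apply it."""
    for t in transform_list:
        t.randomize_parameters()
        img = t(img)
    return img


result = apply_all([RandomFlip()], [1, 2, 3])        # instance: works

try:
    apply_all([RandomFlip], [1, 2, 3])               # class: no 'self' bound
    class_error = None
except TypeError as exc:
    class_error = str(exc)
```

`class_error` holds the same kind of "missing 1 required positional argument: 'self'" message as the traceback above.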

yrj90 commented 5 years ago

Prob # 2

in cse0003,

` best_matrix[fol, 0] = best_tst_loss

IndexError: index 5 is out of bounds for axis 0 with size 2`

-> This is because we start from person 5 (fol=5), but the rows of best_matrix start at index 0. So we need the position of fol relative to the first fold. Use best_matrix[fol-k, 0] = best_tst_loss instead, where k is the index of the first fold (here 5).
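A small sketch of the relative indexing, using nested lists in place of the array and a placeholder loss value. The names `first_fold` and `folds` are assumptions made for the example.

```python
first_fold = 5                       # k: the fold (person) the run starts from
folds = [5, 6]                       # two folds -> two rows

best_matrix = [[0.0, 0.0] for _ in folds]

for fol in folds:
    best_tst_loss = 0.1 * fol        # placeholder for the real test loss
    # Row index is the fold's offset from the first fold, not the fold id.
    best_matrix[fol - first_fold][0] = best_tst_loss
```

Iterating with `enumerate(folds)` and using the running index for the row would avoid the subtraction altogether.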

yrj90 commented 5 years ago

Experience

Just keep in mind that creating a torch.FloatTensor out of Numpy’s float64 array will be very slow. It’s better to use torch.from_numpy(arr).float(). The .float() call will be a no-op if the array is already of float32 type
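A minimal sketch of the recommended conversion, with random data standing in for real features.

```python
import numpy as np
import torch

arr64 = np.random.rand(4, 3)                  # NumPy defaults to float64
t = torch.from_numpy(arr64).float()           # converted to float32

arr32 = arr64.astype(np.float32)
t32 = torch.from_numpy(arr32).float()         # already float32: no conversion

# In the float32 case the tensor still shares memory with the array.
shares_memory = t32.data_ptr() == torch.from_numpy(arr32).data_ptr()
```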

yrj90 commented 5 years ago

Prob #1

File "/home/ryang/anaconda3/envs/ptenv/lib/python3.6/site-packages/matplotlib/axes/_base.py", line 231, in _xy_from_xy
"have shapes {} and {}".format(x.shape, y.shape))
ValueError: x and y must have same first dimension, but have shapes (10010,) and (3003,)

-> Use inputs.size(0) instead of a fixed integer for the batch size:

print('Plotting PCC CURVES HERE')
x = np.linspace(0, (i+1)*inputs.size(0), (i+1)*inputs.size(0))

yrj90 commented 5 years ago

Prob #1

UnboundLocalError: local variable '' referenced before assignment

-> Define psntstgtlabel = np.zeros(0) outside the function, and declare global psntstgtlabel inside the function; otherwise the assignment makes it a local variable.

yrj90 commented 5 years ago

Prob #1: when return label list (contains multiple label for each frame), dataloader returned value mixed with batchsize

-> e.g. the label list is a 9-dim vector, but the clip is of size 3x9x120x130, so we need to convert the label from size 9 to 1x9, using one of the following methods:

b @ b.view(1,-1).t() # -1 is inferred from the number of elements (here: 3)

b @ b.expand(1,-1).t() # -1 means not changing the size of that dimension (here: stays at 3)

b @ b.unsqueeze(1) # unsqueeze inserts a dimension of size 1 at the given position (here: position 1)
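Applied to a 9-element label vector, the three methods all give a 1x9 tensor. This is a sketch with a dummy label; stacking four of them is shown only to illustrate the batch shape and is not the repository's collate code.

```python
import torch

label = torch.arange(9)              # dummy per-frame labels, shape [9]

a = label.view(1, -1)                # -1 inferred as 9
b = label.unsqueeze(0)               # new size-1 dimension at position 0
c = label.expand(1, -1)              # prepend a dimension, keep the 9

# Concatenating 1x9 labels along dim 0 gives one row per sample.
batch = torch.cat([a, a, a, a], dim=0)
```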