Open yrj90 opened 5 years ago
/opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype , Dtype , Dtype , long , Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [1,0,0] Assertion t >= 0 && t < n_classes failed. THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/generic/THCTensorCopy.cpp line=70 error=59 : device-side assert triggered RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/generic/THCTensorCopy.cpp:70
:+1: ->> In this case the error means that a label is outside the range 0 <= your_label < n_classes. First check whether your labels contain negative values. Then check whether any label value is greater than or equal to the number of classes. My initial labels were '-1 and 1'; I changed them into '0 and 1' by adding 1, but forgot that an initial label of 1 then becomes 2, which is equal to the number of classes. That was the cause. -> Solution: add 1 to the initial labels and then divide them by 2 (done on the whole array at once, which saves the time and computation of a for loop).
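The remapping and the range check can be sketched in plain Python. This is only an illustration: `remap_labels` and `labels_in_range` are hypothetical helper names, and the label values are made up.

```python
def remap_labels(labels):
    """Map labels from {-1, 1} to {0, 1} via (label + 1) // 2."""
    return [(label + 1) // 2 for label in labels]


def labels_in_range(labels, n_classes):
    """Return True only if every label satisfies 0 <= label < n_classes."""
    return all(0 <= label < n_classes for label in labels)


raw = [-1, 1, 1, -1]                         # made-up example labels
shifted_only = [label + 1 for label in raw]  # the mistake: 1 becomes 2
remapped = remap_labels(raw)                 # the fix: add 1, then halve

print(shifted_only, labels_in_range(shifted_only, n_classes=2))
print(remapped, labels_in_range(remapped, n_classes=2))
```

Running a check like `labels_in_range` on the CPU before training gives a readable message instead of the device-side assert.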
clip = torch.stack(clip, 0).permute(1, 0, 2, 3) RuntimeError: expected a non-empty list of Tensors
:+1: ->> The problem lies in the way the image paths are built. Previously I used `os.path.join(video_dir_path, video_dir_path.split(sep='/')[-1] +'{:03d}.png'.format(i))`. But in the database, some images with technical errors (black images in which no subject appears) have been deleted, so the numbers in the image names are not consecutive. For example, `095/tv095t2aeunaff001.png` does not exist; in this video, the images begin with number 008. This is the problem.
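One possible way around the gaps, assuming the fix is to list the frames that actually exist instead of constructing their names from a counter. The helper name `list_frame_paths` and the file names in the demo are hypothetical.

```python
import os
import tempfile


def list_frame_paths(video_dir_path):
    """Return the sorted paths of the PNG frames that exist in the directory."""
    names = sorted(n for n in os.listdir(video_dir_path) if n.endswith('.png'))
    return [os.path.join(video_dir_path, n) for n in names]


# Demo with a throwaway directory in which frames 001-007 are missing.
with tempfile.TemporaryDirectory() as tmp:
    for i in (8, 9, 10):
        open(os.path.join(tmp, 'clip{:03d}.png'.format(i)), 'w').close()
    frames = [os.path.basename(p) for p in list_frame_paths(tmp)]

print(frames)
```

With this approach a clip is selected by position in the returned list, so it no longer matters at which number the file names start.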
Stay calm and keep active thinking. :baby:
Wrong path!!! /home/ryang/cuda-workspace/A3D/data/UNBCdevkit/UNBC_pain_identify/JPEGImages/124-dn124/dn124t1aaunaff/dn124t1aeaff001.png
The directory dn124t1aaunaff and the file dn124t1aeaff001.png do not belong to the same video!!! This indicates that the way the image paths and the labels are loaded is flawed: the two are not consistent.
-> The reason is in `def generate_seg()`: when splitting the segments, the previous for loop began at 1, namely `for i in range(1, (n_frames[j] - step), step)`. That was so the frame index could later be used to build the image name and image path (the images are named with numbers like 001, 002, so the index could not start from 0).
:+1: ->> After changing how the images are loaded, we should also change this loop and start the index from 0 instead of 1.
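A small sketch of the difference, assuming the frames are now held in a list and indexed by position. `segment_starts` is a hypothetical helper that keeps the loop bounds quoted above; the frame names are made up.

```python
def segment_starts(n_frames, step, first_index=0):
    """Start positions of the segments, using the same bounds as the loop above."""
    return list(range(first_index, n_frames - step, step))


# Twelve made-up frames whose file names begin at 008.
paths = ['frame{:03d}.png'.format(n) for n in range(8, 20)]

starts_from_zero = segment_starts(len(paths), step=4)
starts_from_one = segment_starts(len(paths), step=4, first_index=1)

print(starts_from_zero, paths[starts_from_zero[0]])
print(starts_from_one, paths[starts_from_one[0]])
```

Starting at 1 silently skips the first frame of every video once the index is a list position rather than part of the file name.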
RuntimeError: Expected object of type torch.LongTensor but found type torch.FloatTensor
:+1: ->> Fixed by changing the type of the targets: `targets = targets.type(torch.cuda.LongTensor)`
Use the following to check whether your code has indentation problems:
python -m tabnanny main.py
File "/wrk/yangruij/DONOTREMOVE/git/A3D/train.py", line 40, in train_epoch losses.update(loss.data[0], inputs.size(0)) IndexError: invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number
:+1: You might have to change loss.data[0] to loss.item() as indicated in the error message.
File "/wrk/yangruij/DONOTREMOVE/git/A3D/train.py", line 56, in train_epoch
'lr': optimizer.param_groups[0]['lr']
File "/wrk/yangruij/DONOTREMOVE/git/A3D/utils.py", line 38, in log
assert col in values
AssertionError
srun: error: g105: task 0: Exited with exit code 1
:+1: ->> Check where the batch and epoch loggers are initialized in main.py. The headers are set there; if you delete some of the logged items in train.py, you also need to delete the corresponding headers in main.py where the loggers are initialized.
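The mismatch can be reproduced with a minimal logger. This `Logger` class is a hypothetical stand-in, not the one in utils.py; it only mirrors the `assert col in values` check visible in the traceback.

```python
class Logger:
    """Hypothetical logger that expects one value per header column."""

    def __init__(self, header):
        self.header = header
        self.rows = []

    def log(self, values):
        for col in self.header:
            # Same kind of check as in the traceback: every header needs a value.
            assert col in values, 'missing column: {}'.format(col)
        self.rows.append([values[col] for col in self.header])


logger = Logger(['epoch', 'loss', 'lr'])           # headers set at initialization
logger.log({'epoch': 1, 'loss': 0.5, 'lr': 0.1})   # matches the headers

try:
    logger.log({'epoch': 2, 'loss': 0.4})          # 'lr' removed from the call only
    failed = False
except AssertionError:
    failed = True

print(failed, logger.rows)
```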
Run code on csc:
srun -N 1 -n 1 --mem-per-cpu=48000 -t72:00:00 --gres=gpu:p100:1 -p gpu python Blstm_rawJoint.py
File "/wrk/yangruij/DONOTREMOVE/git/A3D_Regression/UNBC_dataloader.py", line 183, in getitem clip = torch.stack(clip, 0).permute(1, 0, 2, 3) RuntimeError: expected a non-empty list of Tensors
:+1: -> Check the image paths: they may be wrong, in which case no images are loaded into the tensor. Here the cause was a mistake made when generating the ImageSets (the paths in them were video paths, not image paths).
File "main.py", line 103, in trainval criterion = nn.MSELoss.cuda(args.gpu) File "/appl/opt/python/mlpython-3.6.3/site-packages/torch/nn/modules/module.py", line 260, in cud return self._apply(lambda t: t.cuda(device)) AttributeError: 'int' object has no attribute '_apply'
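:+1: -> Instantiate the loss first and then move it to the GPU. `nn.MSELoss.cuda(args.gpu)` calls the method on the class itself, so the integer `args.gpu` is taken as `self`:
criterion = nn.MSELoss() ; criterion.cuda(args.gpu)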
ret = torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction)) RuntimeError: Expected object of scalar type Float but got scalar type Long for argument #2 'target
:+1: -> target = target.type(torch.cuda.FloatTensor)
No module named sklearn
:+1: -> If your PyDev project is set up with the correct Python interpreter path, namely the Python in the conda environment, then this means sklearn is not installed in that environment. So first, on the command line, enter your environment with `source activate ptenv`, then run `conda install scikit-learn`. After that everything works.
To look at later: Anaconda Navigator
File "/home/ryang/cuda-workspace/A3D_Classify_v1/train.py", line 49, in train_epoch outputs.cpu().numpy(), targets.cpu().numpy())) RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.
:+1:
`tensor.view()`: Returns a new tensor with the same data as the self tensor but with a different shape.
`self.expand_as(other)`: Expands this tensor to the same size as other. `self.expand_as(other)` is equivalent to `self.expand(other.size())`.
`expand(*sizes) → Tensor`: Returns a new view of the self tensor with singleton dimensions expanded to a larger size. In other words, the existing values appear repeated along the expanded dimensions (e.g. from 3x1 to 3x4); no new memory is allocated, so the data is not actually copied.
F-score is ill-defined and being set to 0.0 in labels with no true samples. UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
-> This happens because of an input like `array([[0, 1, 1, 0, 1, 1]])`, which is a two-dimensional array. We only need the vector, so use `x[0]` to get the first row of the two-dimensional array.
In the log files, acc is often recorded as a tensor instead of a number. This is caused by the code `acc.data[0]`. We should use `acc.item()` instead.
Use `torch.Tensor.item()` to get a Python number from a tensor containing a single value: with `x = torch.tensor([[1]])`, `x` displays as `tensor([[ 1]])` and `x.item()` returns `1`.
NOTE: This only applies to a tensor with exactly one element. If the tensor holds more than one element, another method is needed (for example `tensor.tolist()`).
ValueError: Classification metrics can't handle a mix of binary and continuous-multioutput targets
RuntimeWarning: invalid value encountered in float_scalars r = r_num / r_den
This happens when PCC is reported per batch and the ground truth values in the batch are all 0. The denominator of the correlation is then 0, and dividing by 0 returns 'nan'. So it is better to report PCC after one full epoch.
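A plain-Python sketch of why a constant batch breaks the correlation. The `pearson` function and all numbers are illustrative assumptions, not the project's metric code; the names `r_num` and `r_den` follow the warning above.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns nan when the denominator is zero."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    r_num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r_den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))
    if r_den == 0:
        return float('nan')  # a constant input has zero variance
    return r_num / r_den


# One batch in which every ground-truth value is 0 (made-up numbers).
batch_targets = [0, 0, 0, 0]
batch_outputs = [0.1, 0.3, 0.2, 0.4]

# Targets and outputs accumulated over a whole epoch (made-up numbers).
epoch_targets = [0, 0, 0, 0, 1, 2, 0, 3]
epoch_outputs = [0.1, 0.3, 0.2, 0.4, 1.2, 1.9, 0.0, 2.5]

r_batch = pearson(batch_targets, batch_outputs)
r_epoch = pearson(epoch_targets, epoch_outputs)
print(r_batch, r_epoch)
```

Accumulating targets and outputs over the epoch makes it much less likely that the ground truth is constant.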
`seff jobnumber`: check the CPU and GPU usage and efficiency of a job
`sjstat`: check the available GPUs
srun -N 1 -n 1 --mem-per-cpu=36000 -t36:00:00 --gres=gpu:p100:4 -p gpu python main.py
FileExistsError: [Errno 17] File exists: The code already contains the if statement below, but the error still occurred:
if not os.path.exists(person_test_logger): os.makedirs(person_test_logger)
:+1: ->
Since Python 3.2, os.makedirs() accepts an optional argument exist_ok:
os.makedirs(mydir, exist_ok=True)
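A short demonstration of the behaviour. One likely reason the `if not os.path.exists(...)` guard still failed (an assumption, not confirmed above) is that several processes ran the check at the same time, so the directory appeared between the check and the `makedirs` call. The directory name below is only an example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    log_dir = os.path.join(tmp, 'person_test_logger')  # example directory name

    os.makedirs(log_dir, exist_ok=True)  # creates the directory
    os.makedirs(log_dir, exist_ok=True)  # already exists: no error is raised

    created = os.path.isdir(log_dir)

print(created)
```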
on csc: cuda out of memory
-> change srun -N 1 -n 4 --mem-per-cpu=36000 -t36:00:00 --gres=gpu:p100:4 -p gpu python main.py
to
srun -N 1 -n 2 --mem-per-cpu=36000 -t36:00:00 --gres=gpu:p100:4 -p gpu python main.py
or
srun -N 1 -n 1 --mem-per-cpu=36000 -t36:00:00 --gres=gpu:p100:4 -p gpu python main.py
Note, however, that the official csc guidance seems to suggest using n>1 to prevent congestion on csc.
TypeError: tensor is not a torch image
:+1: ->
The ToTensor transform should come before the Normalize transform, since the latter expects a tensor, but the Resize transform returns an image. The correct order is: Resize -> ToTensor -> Normalize
TypeError: batch must contain tensors, numbers, dicts or lists; found <class 'PIL.Image.Image'>
->
The error states that the DataLoader receives a PIL image. This is because no transforms are applied to the image (transform=None). The `__getitem__` method of MyDataset passes an unprocessed PIL image to the DataLoader, whereas it should receive a tensor.
You can add a transform that creates a tensor from the PIL image, such as `transforms.ToTensor()`.
When using 'AdaptiveMaxPool2d':
AttributeError: 'tuple' object has no attribute 'squeeze'. This is because train.py has `outputs= model(inputs)` while Networklessparam.py has `return pred, h1, h2`, so outputs is a tuple that contains the three returned elements.
-> Changing it to `outputs, h1, h2= model(inputs)` works.
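The same effect shown without PyTorch. The `model` function here is a hypothetical stand-in that simply returns three values, like the `return pred, h1, h2` quoted above.

```python
def model(inputs):
    """Hypothetical stand-in for a network whose forward pass returns three values."""
    pred = [2 * x for x in inputs]
    h1, h2 = 'hidden-1', 'hidden-2'
    return pred, h1, h2


outputs = model([1, 2, 3])             # outputs is the whole 3-tuple
is_tuple = isinstance(outputs, tuple)  # so tensor methods are not available on it

pred, h1, h2 = model([1, 2, 3])        # unpacking gives the prediction itself
print(is_tuple, pred)
```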
transforms/transforms.py", line 49, in call img = t(img) TypeError: object() takes no parameters
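-> Change `transforms.ToTensor` to `transforms.ToTensor()`. The list of transforms needs an instance; without the parentheses the class itself is called with the image as its argument.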
t.randomize_parameters() TypeError: randomize_parameters() missing 1 required positional argument: 'self'
-> Change `RandomHorizontalFlip` to `RandomHorizontalFlip()`, for the same reason: the method is being called on the class rather than on an instance.
in cse0003,
` best_matrix[fol, 0] = best_tst_loss
IndexError: index 5 is out of bounds for axis 0 with size 2`
-> This is because we start from person 5 (fol=5), but the rows of best_matrix start at index 0. So here we need the index of fol relative to the starting offset k. Use `best_matrix[fol-k, 0] = best_tst_loss` instead.
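A sketch of the relative indexing with nested lists. It assumes that k is the number of the first fold (5 here) and uses placeholder loss values; `first_fold` and `n_folds` are hypothetical names.

```python
first_fold = 5   # assumed value of k: the first person/fold that is evaluated
n_folds = 2      # best_matrix has one row per evaluated fold

best_matrix = [[0.0] for _ in range(n_folds)]

for fol in range(first_fold, first_fold + n_folds):
    best_tst_loss = fol / 10                          # placeholder, not a real loss
    best_matrix[fol - first_fold][0] = best_tst_loss  # row index relative to k

print(best_matrix)
```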
Just keep in mind that creating a torch.FloatTensor out of Numpy’s float64 array will be very slow. It’s better to use torch.from_numpy(arr).float(). The .float() call will be a no-op if the array is already of float32 type
File "/home/ryang/anaconda3/envs/ptenv/lib/python3.6/site-packages/matplotlib/axes/_base.py", line 231, in _xy_from_xy "have shapes {} and {}".format(x.shape, y.shape)) ValueError: x and y must have same first dimension, but have shapes (10010,) and (3003,)
-> Fix: use inputs.size(0) instead of a fixed integer:
print('Plotting PCC CURVES HERE')
x = np.linspace(0, (i+1)*inputs.size(0), (i+1)*inputs.size(0))
UnboundLocalError: local variable '' referenced before assignment
-> Define `psntstgtlabel=np.zeros(0)` outside the function, and declare `global psntstgtlabel` inside the function; otherwise an assignment to it inside the function makes it a local variable.
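A minimal reproduction, using a plain list in place of the NumPy array. The names `collected`, `append_broken` and `append_fixed` are hypothetical.

```python
collected = []  # module-level accumulator (stands in for psntstgtlabel)


def append_broken(value):
    # The assignment makes 'collected' local to this function, so reading it
    # on the right-hand side raises UnboundLocalError.
    collected = collected + [value]


def append_fixed(value):
    global collected  # refer to the module-level variable instead
    collected = collected + [value]


try:
    append_broken(1)
    raised = False
except UnboundLocalError:
    raised = True

append_fixed(1)
append_fixed(2)
print(raised, collected)
```

Returning the updated value from the function and reassigning it in the caller would avoid the global altogether, which is usually easier to follow.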
-> e.g. the label list is a 9-dim vector, but the clip is of size 3x9x120x130, so we need to convert the label size from 9 into 1x9, by using one of the following methods:
b @ b.view(1,-1).t() # -1 expands to the number of elements in all existing dimensions (here: [3])
b @ b.expand(1,-1).t() # -1 means not changing size in that dimension (here: stay at 3)
b @ b.unsqueeze(1) # unsqueeze(1) inserts a dimension of size 1 at position 1 (here: [3] becomes [3, 1])
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/generic/THCTensorCopy.cpp:70
/opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype , Dtype , Dtype , long , Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed. THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/generic/THCTensorCopy.cpp line=70 error=59 : device-side assert triggered