yusanshi / news-recommendation

Implementations of some methods in news recommendation.
MIT License

RuntimeError: CUDA error: out of memory #6

Closed Lxhnnn closed 3 years ago

Lxhnnn commented 3 years ago

Using device: cuda:1
Training model LSTUR
LSTUR(
  (news_encoder): NewsEncoder(
    (word_embedding): Embedding(70972, 300, padding_idx=0)
    (category_embedding): Embedding(275, 10, padding_idx=0)
    (title_CNN): Conv2d(1, 10, kernel_size=(3, 300), stride=(1, 1), padding=(1, 0))
    (title_attention): AdditiveAttention(
      (linear): Linear(in_features=10, out_features=200, bias=True)
    )
  )
  (user_encoder): UserEncoder(
    (gru): GRU(30, 30)
  )
  (click_predictor): DotProductClickPredictor()
  (user_embedding): Embedding(50001, 30, padding_idx=0)
)
Load training dataset with size 225201.
Training:   0%|          | 0/28150 [00:00<?, ?it/s]
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=2 : out of memory
Traceback (most recent call last):
  File "./src/train.py", line 297, in <module>
    train()
  File "./src/train.py", line 188, in train
    minibatch["clicked_news"])
  File "/home/ant/anaconda3/envs/lxh1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ant/researchInstitute/luoxianhao/Rec/NewsRecommendation-master/src/model/LSTUR/__init__.py", line 70, in forward
    [self.news_encoder(x) for x in candidate_news], dim=1)
  File "/home/ant/researchInstitute/luoxianhao/Rec/NewsRecommendation-master/src/model/LSTUR/__init__.py", line 70, in <listcomp>
    [self.news_encoder(x) for x in candidate_news], dim=1)
  File "/home/ant/anaconda3/envs/lxh1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ant/researchInstitute/luoxianhao/Rec/NewsRecommendation-master/src/model/LSTUR/news_encoder.py", line 47, in forward
    category_vector = self.category_embedding(news['category'].to(device))
RuntimeError: CUDA error: out of memory
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/ant/anaconda3/envs/lxh1/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ant/anaconda3/envs/lxh1/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ant/anaconda3/envs/lxh1/lib/python3.6/site-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/home/ant/anaconda3/envs/lxh1/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/ant/anaconda3/envs/lxh1/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
    fd = df.detach()
  File "/home/ant/anaconda3/envs/lxh1/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/ant/anaconda3/envs/lxh1/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/ant/anaconda3/envs/lxh1/lib/python3.6/multiprocessing/reduction.py", line 153, in recvfds
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(bytes_size))
ConnectionResetError: [Errno 104] Connection reset by peer

yusanshi commented 3 years ago

Which GPU are you using and how much memory does it have? You may try to reduce the batch size.
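For reference, the batch size is the usual first knob to turn when the GPU or pinned-host allocator runs out of room. A minimal sketch of the kind of change involved, using a dummy dataset (in this repository the value would come from its training config rather than be hard-coded, so the names below are illustrative only):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the project's training dataset, only to show the knob.
train_dataset = TensorDataset(torch.zeros(1000, 10))

# A smaller batch_size lowers both the GPU memory and the pinned host memory
# needed per step; pin_memory=True is what routes batches through the pinned
# host allocator (THCCachingHostAllocator) seen in the traceback above.
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True,
                          num_workers=4, pin_memory=True)

for (batch,) in train_loader:
    pass  # training step would go here
```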

Lxhnnn commented 3 years ago

I changed the batch size to 8, but this error still occurs.

yusanshi commented 3 years ago

Which GPU are you using and how much memory does it have?

Lxhnnn commented 3 years ago

NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
| 66%   84C    P2   249W / 260W |  10746MiB / 11019MiB |     92%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:81:00.0 Off |                  N/A |
| 41%   53C    P2    50W / 260W |   2449MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3130      C   python                                    10735MiB  |
|    1      3130      C   python                                      301MiB  |
|    1     42565      C   python                                     2137MiB  |
+-----------------------------------------------------------------------------+

I use GPU 1.
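For reference, one way to make sure a run only touches the second board is to hide the busy one from CUDA before anything initializes it. A minimal sketch (the environment-variable approach is a general CUDA mechanism, not something taken from this repository's scripts):

```python
import os

# Expose only physical GPU 1 to this process; this must happen before CUDA is
# initialized, i.e. before the first call that touches the GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

print(torch.cuda.device_count())      # 1: only the formerly idle board is visible
print(torch.cuda.get_device_name(0))  # the card nvidia-smi listed as GPU 1
device = torch.device("cuda:0")       # "cuda:0" now maps to that card
```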

Lxhnnn commented 3 years ago

It has 11019MiB in total, and 2449MiB is in use.

yusanshi commented 3 years ago

It's weird. Have you tried other models?

Lxhnnn commented 3 years ago

All of the models run out of memory. I don't know what the problem is; could it be related to the PyTorch version?

yusanshi commented 3 years ago

Sorry, but I don't know. When I was working on the project I used PyTorch 1.5 and 1.6. I think you could try one of those versions and see whether the error still occurs.
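For what it's worth, a minimal sketch for checking which PyTorch and CUDA build an environment actually uses, and how much GPU memory the process currently holds (the memory queries assume PyTorch 1.4 or newer):

```python
import torch

print(torch.__version__)          # installed PyTorch version (1.5/1.6 were used for this repo)
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.is_available())  # whether a usable GPU is visible at all

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    print(torch.cuda.memory_allocated(device))  # bytes allocated by this process
    print(torch.cuda.memory_reserved(device))   # bytes held by the caching allocator
```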

Lxhnnn commented 3 years ago

I switched to another server and that solved the problem. Thank you!