yumeng5 / LOTClass

[EMNLP 2020] Text Classification Using Label Names Only: A Language Model Self-Training Approach
Apache License 2.0

RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable #9

Closed SUFEHeisenberg closed 3 years ago

SUFEHeisenberg commented 3 years ago

Hi @yumeng5! This is a complete and intuitive piece of work!

I have been following your work for the past couple of days. My own Windows 10 PC has only a single Quadro P2000 GPU, and the problems I hit there were solved by issue #2. Thanks a lot for your guidance!

But that GPU is too small to train quickly or to fit a reasonable batch size, so I switched to my school's GPU cluster, which uses IBM's LSF scheduler.

Environment Info

I encountered several errors when running the different shell scripts.

agnews.sh

scripts

#!/bin/sh
#BSUB -gpu "num=2:mode=exclusive_process"
#BSUB -n 2
#BSUB -q gpu
#BSUB -o LOTClass.out
#BSUB -e LOTClass.err
#BSUB -J LOTClass
#BSUB -R "rusage[ngpus_physical=2]"

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1

python src/train.py --dataset_dir datasets/agnews/agnews_data/ \
    --label_names_file label_names.txt \
    --train_file train.txt \
    --test_file test.txt \
    --test_label_file test_labels.txt \
    --max_len 200 \
    --train_batch_size 32 \
    --accum_steps 2 \
    --eval_batch_size 64 \
    --gpus 2 \
    --mcp_epochs 3 \
    --self_train_epochs 1

OUTPUT

Namespace(accum_steps=2, category_vocab_size=100, dataset_dir='datasets/agnews/agnews_data/', dist_port=12345, early_stop=False, eval_batch_size=64, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['politics'], 1: ['sports'], 2: ['business'], 3: ['technology']}
Loading encoded texts from datasets/agnews/agnews_data/train.pt
Loading texts with label names from datasets/agnews/agnews_data/label_name_data.pt
Loading encoded texts from datasets/agnews/agnews_data/test.pt
Loading category vocabulary from datasets/agnews/agnews_data/category_vocab.pt
Class 0 category vocabulary: ['politics', 'political', 'politicians', 'Politics', 'government', 'elections', 'issues', 'history', 'democracy', 'affairs', 'policy', 'politically', 'politician', 'society', 'policies', 'voters', 'people', 'debate', 'election', 'culture', 'economics', 'forces', 'relations', 'governance', 'parliament', 'leadership', 'campaign', 'problems', 'opposition', 'military', 'movements', 'diplomacy', 'war', 'polls', 'congress', 'campaigning', 'nature', 'dynamics', 'debates', 'taxes', 'struggles', 'control', 'campaigns', 'economy', 'officials', 'ideology', 'leaders', 'religion', 'geography', 'state', 'Congress', 'wars', 'corruption', 'roads', 'territory', 'voting', 'climate', 'agriculture', 'balance']

Class 1 category vocabulary: ['sports', 'sport', 'Sports', 'sporting', 'soccer', 'athletics', 'athletic', 'baseball', 'hockey', 'basketball', 'regional', 'travel', 'matches', 'coaches', 'youth', 'Sport', 'health', 'teams', 'recreational', 'team', 'medical', 'match', 'cultural', 'gaming', 'play', 'golf', 'local', 'outdoor', 'tennis', 'schools', 'league', 'radio', 'stadium', 'recreation', 'activities', 'transportation', 'club', 'wrestling', 'rugby', 'everything', 'training', 'fields', 'city', 'fans', 'leagues', 'school', 'safety', 'national', 'aquatic', 'summer', 'track', 'air', 'letters', 'rules', 'championship', 'racing', 'grounds', 'pro', 'arts', 'leisure', 'great', 'clubs', 'broadcast']

Class 2 category vocabulary: ['business', 'trade', 'Business', 'businesses', 'trading', 'commercial', 'market', 'enterprise', 'corporate', 'financial', 'sales', 'commerce', 'job', 'shop', 'economic', 'professional', 'world', 'operation', 'family', 'name', 'line', 'career', 'retail', 'firm', 'operations', 'marketing', 'good', 'work', 'private', 'personal', 'chain', 'time', 'group', 'division', 'investment', 'industrial', 'house', 'side', 'companies', 'store', 'global', 'task', 'consumer', 'shopping', 'street', 'property', 'special', 'merchant', 'part', 'department', 'town', 'real', 'traffic', 'space', 'concern', 'selling']

Class 3 category vocabulary: ['technology', 'technologies', 'Technology', 'tech', 'technological', 'equipment', 'device', 'innovation', 'system', 'information', 'generation', 'infrastructure', 'phone', 'devices', 'energy', 'capability', 'concept', 'systems', 'computer', 'hardware', 'technique', 'Internet', 'design', 'program', 'protocol', 'ability', 'technical', 'platform', 'digital', 'knowledge', 'content', 'method', 'techniques', 'strategy', 'material', 'internet', 'Tech', 'web', 'development', 'invention', 'feature', 'IT', 'project', 'facility', 'intelligence', 'process', 'card', 'wireless', 'car', 'format', 'concepts', 'gene', 'model', 'features', 'smart', 'app', 'computers', 'machine', 'also', 'talent', 'solution', 'idea', 'speed', 'algorithm', 'style']

Loading model trained via masked category prediction from datasets/agnews/agnews_data/mcp_model.pt

Start self-training.
PS: Read file for stderr output of this job.

ERRORS

2021-01-29 11:43:51.382337: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Some weights of the model checkpoint at bert-base-cased/ were not used when initializing LOTClassModel: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/nfsshare/home/usr/NLP/LOTClass/src/trainer.py", line 531, in self_train_dist
    model = self.set_up_dist(rank)
  File "/nfsshare/home/usr/NLP/LOTClass/src/trainer.py", line 69, in set_up_dist
    model = self.model.to(rank)
  File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 607, in to
    return self._apply(convert)
  File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 376, in _apply
    param_applied = fn(param)
  File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 605, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

I also checked nvidia-smi for both GPUs, and no other programs were occupying GPU resources, which confused me all day long. I tried several different approaches but still get the same error.
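To double-check from inside Python (beyond nvidia-smi), a small probe like the following can be used; it simply mirrors the failing model.to(rank) call by trying to allocate a tensor on each visible device. This is just a sketch, not part of the repository:

```python
import torch

# Try to place a tiny tensor on each visible GPU, mirroring the
# model.to(rank) call that fails inside set_up_dist.
print("CUDA available:", torch.cuda.is_available())
print("Visible devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    try:
        torch.ones(1).to(f"cuda:{i}")
        print(f"cuda:{i} ({torch.cuda.get_device_name(i)}): allocation OK")
    except RuntimeError as err:
        print(f"cuda:{i}: allocation FAILED -> {err}")
```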

amazon.sh

scripts

Parameters are the same as yours; the script looks like agnews.sh.

OUTPUT

Namespace(accum_steps=2, category_vocab_size=100, dataset_dir='datasets/amazon/amazon_data/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['bad'], 1: ['good']}
Loading encoded texts from datasets/amazon/amazon_data/train.pt
Loading texts with label names from datasets/amazon/amazon_data/label_name_data.pt
Reading texts from datasets/amazon/amazon_data/test.txt
Converting texts into tensors.
PS: Read file for stderr output of this job.

ERRORS

The same as #8: RuntimeError: The task could not be sent to the workers as it is too large for `send_bytes`.

imdb.sh

scripts

Parameters are the same as yours; the script looks like agnews.sh.

OUTPUT

Namespace(accum_steps=8, category_vocab_size=100, dataset_dir='datasets/imdb/imdb_data/', dist_port=12345, early_stop=False, eval_batch_size=32, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=512, mcp_epochs=4, out_file='out.txt', self_train_epochs=4.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=8, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['bad'], 1: ['good']}
Loading encoded texts from datasets/imdb/imdb_data/train.pt
Loading texts with label names from datasets/imdb/imdb_data/label_name_data.pt
Loading encoded texts from datasets/imdb/imdb_data/test.pt
Contructing category vocabulary.
Class 0 category vocabulary: ['bad', 'Bad', 'wrong', 'nasty', 'worst', 'badly', 'negative', 'sad', 'sorry', 'rotten', 'low', 'violent', 'weird', 'dark', 'shit', 'crazy', 'dirty', 'serious', 'sick', 'small', 'stupid', 'scary', 'dumb', 'much', 'gross', 'foul', 'dangerous', 'crap', 'mixed', 'fast', 'sour', 'miserable', 'severe', 'lost', 'hit', 'dreadful', 'trouble', 'gone']

Class 1 category vocabulary: ['good', 'excellent', 'high', 'Good', 'wonderful', 'amazing', 'fantastic', 'fair', 'positive', 'sure', 'sound', 'quality', 'light', 'solid', 'brilliant', 'awesome', 'smart', 'happy', 'bright', 'safe', 'true', 'clean', 'rich', 'successful', 'full', 'special', 'fun', 'popular', 'sweet', 'superior', 'simple', 'average', 'superb', 'normal', 'important', 'love', 'cool', 'quick', 'easy', 'whole', 'hot', 'interesting', 'damn']

Preparing self supervision for masked category prediction.
Number of documents with category indicative terms found for each category is: {0: 873, 1: 828}
There are totally 1701 documents with category indicative terms.

Training model via masked category prediction.
Epoch 1: Average training loss: 0.5981351137161255
Epoch 2: Average training loss: 0.23333437740802765
Epoch 3: Average training loss: 0.09165686368942261
Epoch 4: Average training loss: 0.056073933839797974

Start self-training.

PS:

Read file for stderr output of this job.

ERROR

The same as agnews.sh: RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

dbpedia.sh

scripts

Parameters are the same as yours; the script looks like agnews.sh.

OUTPUT

Namespace(accum_steps=2, category_vocab_size=100, dataset_dir='datasets/dbpedia/dbpedia_data/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['company'], 1: ['school', 'university'], 2: ['artist'], 3: ['athlete'], 4: ['politics'], 5: ['transportation'], 6: ['building'], 7: ['river', 'mountain', 'lake'], 8: ['village'], 9: ['animal'], 10: ['plant', 'tree'], 11: ['album'], 12: ['film'], 13: ['novel', 'publication', 'book']}
Loading encoded texts from datasets/dbpedia/dbpedia_data/train.pt
Loading texts with label names from datasets/dbpedia/dbpedia_data/label_name_data.pt
Reading texts from datasets/dbpedia/dbpedia_data/test.txt
Converting texts into tensors.

ERROR

The same error as #8: RuntimeError: The task could not be sent to the workers as it is too large for `send_bytes`.

Can you help me deal with these errors?

Sincerely, Heisenberg

AlexYoung757 commented 3 years ago

Maybe the versions, such as CUDA and PyTorch, are not compatible.

SUFEHeisenberg commented 3 years ago

Maybe the versions, such as CUDA and PyTorch, are not compatible.

I don't think that's the reason: I run print(torch.cuda.is_available()) before the experiment, and it returns True whether the CUDA version is 11.0 or 10.1.

I also decreased --train_batch_size to 8 and increased --accum_steps to 8 in agnews.sh and imdb.sh (which keeps the effective batch size at 8 × 8 × 2 = 128), but the same error still comes up.

yumeng5 commented 3 years ago

Hi @SUFEHeisenberg,

Thanks for letting me know about the issue, and I'm sorry for the delay in my response; I've been very busy with several recent deadlines.

First, regarding the amazon and dbpedia errors: these two datasets are probably too large and can cause problems for the joblib package on some machines. You could try removing the joblib part as mentioned here; this disables parallelization for the encoding step, but since encoding only has to run once (as long as your corpus doesn't change), that should still be acceptable.
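For reference, the non-parallel encoding can be as simple as looping over chunks. Here is a standalone sketch (not the actual code in src/trainer.py; the tokenizer settings and chunk size are only illustrative):

```python
import torch
from transformers import BertTokenizer

def encode_corpus_sequential(docs, max_len=200, chunk_size=1000):
    # Sequential, joblib-free encoding: slower, but avoids the
    # "too large for send_bytes" error joblib can hit on huge corpora.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    input_ids, attention_masks = [], []
    for i in range(0, len(docs), chunk_size):
        chunk = tokenizer(docs[i:i + chunk_size], max_length=max_len,
                          padding="max_length", truncation=True,
                          return_tensors="pt")
        input_ids.append(chunk["input_ids"])
        attention_masks.append(chunk["attention_mask"])
    return torch.cat(input_ids), torch.cat(attention_masks)
```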

For the agnews and imdb errors, I believe it's related to your GPU training setup. It looks strange to me because you seemed to complete the MCP training step but got stuck at the self-training step (I noticed your agnews output shows Loading model trained via masked category prediction from datasets/agnews/agnews_data/mcp_model.pt, which means you had already completed MCP training?). Since MCP training is also GPU-parallelized, it is not fundamentally different from the self-training step. So I was wondering if you could re-check whether you can still run MCP training without issues (by removing the cached mcp_model.pt and re-running the code). Perhaps you could also check whether this issue only occurs with LOTClass training (can you run other PyTorch GPU training without issues?)
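As a quick way to test the distributed setup in isolation, a tiny spawn test along these lines (a self-contained sketch, not part of LOTClass; the port number is arbitrary) reproduces the same spawn + model.to(rank) pattern that the self-training step uses:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def worker(rank, world_size):
    # Same pattern as LOTClass's set_up_dist: init a process group,
    # then move a model onto the GPU indexed by the process rank.
    dist.init_process_group("nccl", init_method="tcp://localhost:12355",
                            rank=rank, world_size=world_size)
    model = nn.Linear(10, 10).to(rank)  # fails here if the device is unavailable
    ddp = nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    x = torch.randn(4, 10).to(rank)
    print(f"rank {rank}: forward OK, output sum = {ddp(x).sum().item():.4f}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, nprocs=world_size, args=(world_size,))
```

If this minimal test fails with the same "all CUDA-capable devices are busy or unavailable" error, the problem is in the cluster's GPU/process setup rather than in LOTClass itself.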

Let me know if the problem still persists.

Best, Yu

SUFEHeisenberg commented 3 years ago

Hi @yumeng5, I really appreciate your detailed and comprehensive reply! Apologies for my late response; I was tied up with other things during this period. I can use the GPUs again now, and I tried your solutions, which helped a lot, but I still run into some confusing problems, as follows:

  1. agnews and imdb still produce the same errors even after I delete all of the .pt models under the dataset directory. There is indeed no error in the MCP step, but training gets stuck at self-training. I have uploaded my error and output logs to my GitHub. I can also run my other (non-parallel) PyTorch code smoothly without issues; for agnews and imdb I use the same trainer.py with the joblib part removed, as in #8.

  2. The amazon and dbpedia errors have been solved, but they have turned into another error, the same one as agnews.

It is worth mentioning that the code runs fine on my own single-GPU machine; however, it takes a very long time there to reach the final results.

I would be very grateful if you could look into these issues and reply!

Last but not least, Happy Chinese New Year!

Sincerely, Heisenberg.

yumeng5 commented 3 years ago

Hi @SUFEHeisenberg,

Thanks for providing the details. I have looked at the error logs, but I don't have a specific idea of what might cause the errors. I did some searching, and the case described here seems the most similar to yours. It might be related to your environment/system settings, for which I unfortunately cannot provide further help, because I cannot reproduce the error on my machines.

Since you mentioned that you can run non-parallel training code fine, I was wondering if you could remove all of the PyTorch distributed parallel training parts (not only the joblib part) in LOTClass, as described in this issue? If you are running on a V100 GPU, you probably won't need parallel training, since with 32 GB of GPU memory you can fit a large enough batch (i.e., increase train_batch_size and reduce accum_steps, which will also let you train faster).
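Conceptually, removing the distributed part means replacing the mp.spawn + DistributedDataParallel setup with a plain single-device loop. A generic sketch (illustrative only; model, loader, and compute_loss stand in for the corresponding pieces of trainer.py):

```python
import torch

# Generic single-GPU training loop: what the distributed setup reduces to
# once mp.spawn / DistributedDataParallel is removed.
def train_single_gpu(model, loader, compute_loss, epochs=1, accum_steps=1, lr=1e-5):
    device = torch.device("cuda:0")
    model = model.to(device)                      # instead of model.to(rank) + DDP
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for epoch in range(epochs):
        optimizer.zero_grad()
        for step, batch in enumerate(loader):
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = compute_loss(model, batch) / accum_steps
            loss.backward()
            if (step + 1) % accum_steps == 0:     # gradient accumulation
                optimizer.step()
                optimizer.zero_grad()
```

With gradient accumulation kept, the effective batch size on one GPU is simply train_batch_size × accum_steps, so you can trade the two off against each other depending on memory.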

Happy New Year!

Best, Yu

SUFEHeisenberg commented 3 years ago

Hi @yumeng5 , Thanks A Lot for your careful reply!

I followed your instructions and made the changes in all the appropriate places as in #2, and afterwards I found that the problem does not seem to be related to the parallel part. After reading the URL you showed me, I used a single GPU in the interactive GPU queue (meaning no other user could take up the GPU I was using), but I still encounter exactly the same bug as before, with no trace of a difference. MCP completes while self-training does not. Yet the code runs on my own PC, with the same environment and the same code.

So I suspect this must have something to do with the GPU cluster server, although I'm still not sure where the bug is; maybe some mode that I don't have permission to change? I don't know.

Nevertheless, I will keep looking into this error, and if I figure out how to solve it, I'll close the issue. Thanks a lot for your consistent and detailed guidance and for this marvellous project!

Sincerely, Heisenberg