Closed SUFEHeisenberg closed 3 years ago
Maybe the versions of CUDA and PyTorch are not compatible.
That shouldn't be the reason. I ran print(torch.cuda.is_available()) before the experiment, and it returns True whether the CUDA version is 11.0 or 10.1.
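As a side note, torch.cuda.is_available() only reports that a driver and at least one device are visible; on its own it does not show that a context can actually be created on each device. A minimal sketch of a stricter check, assuming PyTorch is installed (the helper name probe_cuda_devices is hypothetical and not part of LOTClass), tries a tiny allocation on every visible GPU:

```python
try:
    import torch
except ImportError:  # lets the sketch load on machines without PyTorch
    torch = None


def probe_cuda_devices():
    """Try a tiny allocation on every visible GPU and report the outcome.

    Returns a dict mapping device index -> "ok" or the error message.
    An empty dict means PyTorch is missing or sees no CUDA device.
    """
    results = {}
    if torch is None or not torch.cuda.is_available():
        return results
    for index in range(torch.cuda.device_count()):
        try:
            # Allocating forces a CUDA context to be created on this device.
            torch.zeros(1, device=f"cuda:{index}")
            results[index] = "ok"
        except RuntimeError as err:
            results[index] = str(err)
    return results


if __name__ == "__main__":
    for index, status in probe_cuda_devices().items():
        print(f"cuda:{index}: {status}")
```

Running this inside the same batch job as the training script would show whether the failure is specific to LOTClass or already appears on a bare allocation.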
I decreased --train_batch_size to 8 and increased --accum_steps to 8 in agnews.sh and imdb.sh, but I still get the same result.
Hi @SUFEHeisenberg,
Thanks for letting me know the issue and I'm sorry for the delay in my response--I've been very busy with several recent deadlines.
Firstly, for the amazon and dbpedia errors, it's probably because these two datasets are too large and could cause problems for the joblib package on some machines. You could try to remove the joblib part as mentioned here; this will remove the parallelization for the encoding process, but since it will only be conducted once (as long as your corpus doesn't change), this could still be acceptable.
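Removing the joblib part amounts to encoding the chunks one after another in the main process. A minimal sketch of that idea follows; encode_chunk and the chunk size are hypothetical stand-ins for illustration, not the actual LOTClass functions:

```python
def split_into_chunks(docs, chunk_size):
    """Split a list of documents into consecutive chunks."""
    return [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]


def encode_chunk(chunk):
    """Hypothetical stand-in for the real tokenizer call: returns token counts."""
    return [len(doc.split()) for doc in chunk]


def encode_sequentially(docs, chunk_size=2):
    """Encode every chunk in the main process, so that no large task has to be
    serialized and sent to worker processes."""
    encoded = []
    for chunk in split_into_chunks(docs, chunk_size):
        encoded.extend(encode_chunk(chunk))
    return encoded


if __name__ == "__main__":
    sample = ["a b c", "d e", "f", "g h i j", "k"]
    print(encode_sequentially(sample))
```

The trade-off is the one described above: the encoding is slower, but it only runs once per corpus.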
For the agnews and imdb errors, I believe it's somewhat related to your GPU training setup. It looks weird to me because you seemed to be able to complete the MCP training step but got stuck at the self-training step (I noticed that your agnews output shows "Loading model trained via masked category prediction from datasets/agnews/agnews_data/mcp_model.pt", which means you already completed the MCP training?). Since the MCP training is also GPU-parallelized, it is not fundamentally different from the self-training step. So I was wondering if you could re-check whether you can still do the MCP training without issues (by removing the cached mcp_model.pt model and re-running the code). Perhaps you could also check whether this issue only occurs with LOTClass training (can you run other PyTorch GPU training without issues?).
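Removing the cached model can be done by hand; as a small sketch, assuming the checkpoint lives directly under the dataset directory as in the log line above (the helper name remove_cached_model is hypothetical):

```python
from pathlib import Path


def remove_cached_model(dataset_dir, name="mcp_model.pt"):
    """Delete the cached checkpoint if present; return True if a file was removed."""
    path = Path(dataset_dir) / name
    if path.is_file():
        path.unlink()
        return True
    return False


if __name__ == "__main__":
    removed = remove_cached_model("datasets/agnews/agnews_data")
    print("removed cached model" if removed else "no cached model found")
```

With the file gone, the next run should go through MCP training again instead of loading the cached weights.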
Let me know if the problem still persists.
Best, Yu
Hi @yumeng5, I really appreciate your detailed and comprehensive reply! Apologies for my late response; I was busy with other things during this period. I can use the GPU again now, and I have tried your suggestions, which helped me a lot, but I still ran into some confusing problems, as follows:
The agnews and imdb runs still print the same errors even after I deleted all the xxx.pt models under the dataset dir. There is indeed no error in the MCP process, but the run gets stuck at self-training. I have uploaded my error and output files to my GitHub. I can also run my other, earlier PyTorch code smoothly without issues (NOT parallel; for agnews and imdb I use the same trainer.py code with the joblib part removed, just as in #8).
The amazon and dbpedia errors have been solved, but they have turned into another error, the same one as for agnews.
It is worth mentioning that the code runs fine on my own single-GPU machine; however, it takes a very long time to get the final results.
I would be very grateful if you could check and reply to this issue!
Last but not least, Happy Chinese New Year!
Sincerely, Heisenberg.
Hi @SUFEHeisenberg,
Thanks for providing the details. I have looked at the error logs but I don't have a specific idea of what might cause the errors. I did some searching, and it seems that the one described here is the most similar to your case. It might be related to your environment/system settings, with which I unfortunately cannot help further because I cannot reproduce the error on my machines.
Since you mentioned that you could run non-parallel training code fine, I was wondering if you could remove all the PyTorch distributed parallel training parts (not only the joblib part) in LOTClass as described in this issue? If you are running on a V100 GPU, you probably won't need parallel training, since with 32GB of GPU memory you can use a large enough batch size (i.e., increase train_batch_size and reduce accum_steps, which will allow you to train faster).
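The Namespace lines in the logs of this thread are consistent with an effective batch size of train_batch_size × accum_steps × gpus (32 × 2 × 2 = 128 for agnews, 8 × 8 × 2 = 128 for imdb). A small sketch under that assumption, with hypothetical helper names, shows how the settings trade off when moving to a single GPU:

```python
def effective_batch_size(train_batch_size, accum_steps, gpus):
    """Effective batch size, assuming it is the product of the three settings."""
    return train_batch_size * accum_steps * gpus


def accum_steps_for(target, train_batch_size, gpus=1):
    """Smallest accum_steps that reaches the target effective batch size."""
    per_step = train_batch_size * gpus
    return -(-target // per_step)  # ceiling division


if __name__ == "__main__":
    print(effective_batch_size(32, 2, 2))     # agnews settings in this thread
    print(effective_batch_size(8, 8, 2))      # imdb settings in this thread
    print(accum_steps_for(128, 64, gpus=1))   # one GPU with a larger per-step batch
```

Fewer accumulation steps per update means fewer forward/backward passes per optimizer step, which is why a larger per-step batch on one big GPU can train faster.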
Happy New Year!
Best, Yu
Hi @yumeng5, thanks a lot for your careful reply!
I followed your instructions and made the changes in all the appropriate places described in #2. Afterwards I found that the problem has nothing to do with the parallel part. After reading the URL you showed me, I used one GPU in the interactive GPU queue (which means no other user could occupy the GPU I was using), but I still encountered exactly the same bug as before. MCP completes while self-training does not. BUT the code runs on my own PC, which has the same environment and the same code.
So I suspect this has something to do with the GPU cluster server, though I'm still not clear where the bug is; maybe some mode that I don't have permission to activate? I don't know.
Nevertheless, I will keep looking into this error, and if I figure out how to solve it, I'll close the issue. Thanks a lot for your consistent and detailed guidance and marvellous project!
Sincerely, Heisenberg
Hi @yumeng5! A complete and intuitive work!
I have been following your work for the past couple of days. My own Windows 10 PC has only one Quadro P2000 GPU, and the problems there were solved thanks to issue #2. Thanks a lot for your guidance!
But that GPU is too small to sustain a high speed and a large enough batch size, so I use the GPU cluster server at my school, which runs IBM's LSF system.
Environment Info
I encountered some bugs when I ran the different shell scripts.
agnews.sh
scripts
#!/bin/sh
#BSUB -gpu "num=2:mode=exclusive_process"
#BSUB -n 2
#BSUB -q gpu
#BSUB -o LOTClass.out
#BSUB -e LOTClass.err
#BSUB -J LOTClass
#BSUB -R "rusage[ngpus_physical=2]"
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
python src/train.py --dataset_dir datasets/agnews/agnews_data/ \
    --label_names_file label_names.txt \
    --train_file train.txt \
    --test_file test.txt \
    --test_label_file test_labels.txt \
    --max_len 200 \
    --train_batch_size 32 \
    --accum_steps 2 \
    --eval_batch_size 64 \
    --gpus 2 \
    --mcp_epochs 3 \
    --self_train_epochs 1
OUTPUT
Namespace(accum_steps=2, category_vocab_size=100, dataset_dir='datasets/agnews/agnews_data/', dist_port=12345, early_stop=False, eval_batch_size=64, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['politics'], 1: ['sports'], 2: ['business'], 3: ['technology']}
Loading encoded texts from datasets/agnews/agnews_data/train.pt
Loading texts with label names from datasets/agnews/agnews_data/label_name_data.pt
Loading encoded texts from datasets/agnews/agnews_data/test.pt
Loading category vocabulary from datasets/agnews/agnews_data/category_vocab.pt
Class 0 category vocabulary: ['politics', 'political', 'politicians', 'Politics', 'government', 'elections', 'issues', 'history', 'democracy', 'affairs', 'policy', 'politically', 'politician', 'society', 'policies', 'voters', 'people', 'debate', 'election', 'culture', 'economics', 'forces', 'relations', 'governance', 'parliament', 'leadership', 'campaign', 'problems', 'opposition', 'military', 'movements', 'diplomacy', 'war', 'polls', 'congress', 'campaigning', 'nature', 'dynamics', 'debates', 'taxes', 'struggles', 'control', 'campaigns', 'economy', 'officials', 'ideology', 'leaders', 'religion', 'geography', 'state', 'Congress', 'wars', 'corruption', 'roads', 'territory', 'voting', 'climate', 'agriculture', 'balance']
Class 1 category vocabulary: ['sports', 'sport', 'Sports', 'sporting', 'soccer', 'athletics', 'athletic', 'baseball', 'hockey', 'basketball', 'regional', 'travel', 'matches', 'coaches', 'youth', 'Sport', 'health', 'teams', 'recreational', 'team', 'medical', 'match', 'cultural', 'gaming', 'play', 'golf', 'local', 'outdoor', 'tennis', 'schools', 'league', 'radio', 'stadium', 'recreation', 'activities', 'transportation', 'club', 'wrestling', 'rugby', 'everything', 'training', 'fields', 'city', 'fans', 'leagues', 'school', 'safety', 'national', 'aquatic', 'summer', 'track', 'air', 'letters', 'rules', 'championship', 'racing', 'grounds', 'pro', 'arts', 'leisure', 'great', 'clubs', 'broadcast']
Class 2 category vocabulary: ['business', 'trade', 'Business', 'businesses', 'trading', 'commercial', 'market', 'enterprise', 'corporate', 'financial', 'sales', 'commerce', 'job', 'shop', 'economic', 'professional', 'world', 'operation', 'family', 'name', 'line', 'career', 'retail', 'firm', 'operations', 'marketing', 'good', 'work', 'private', 'personal', 'chain', 'time', 'group', 'division', 'investment', 'industrial', 'house', 'side', 'companies', 'store', 'global', 'task', 'consumer', 'shopping', 'street', 'property', 'special', 'merchant', 'part', 'department', 'town', 'real', 'traffic', 'space', 'concern', 'selling']
Class 3 category vocabulary: ['technology', 'technologies', 'Technology', 'tech', 'technological', 'equipment', 'device', 'innovation', 'system', 'information', 'generation', 'infrastructure', 'phone', 'devices', 'energy', 'capability', 'concept', 'systems', 'computer', 'hardware', 'technique', 'Internet', 'design', 'program', 'protocol', 'ability', 'technical', 'platform', 'digital', 'knowledge', 'content', 'method', 'techniques', 'strategy', 'material', 'internet', 'Tech', 'web', 'development', 'invention', 'feature', 'IT', 'project', 'facility', 'intelligence', 'process', 'card', 'wireless', 'car', 'format', 'concepts', 'gene', 'model', 'features', 'smart', 'app', 'computers', 'machine', 'also', 'talent', 'solution', 'idea', 'speed', 'algorithm', 'style']
Loading model trained via masked category prediction from datasets/agnews/agnews_data/mcp_model.pt
Start self-training.
PS:
Read file for stderr output of this job.
ERRORS
2021-01-29 11:43:51.382337: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Some weights of the model checkpoint at bert-base-cased/ were not used when initializing LOTClassModel: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/nfsshare/home/usr/NLP/LOTClass/src/trainer.py", line 531, in self_train_dist
    model = self.set_up_dist(rank)
  File "/nfsshare/home/usr/NLP/LOTClass/src/trainer.py", line 69, in set_up_dist
    model = self.model.to(rank)
  File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 607, in to
    return self._apply(convert)
  File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 376, in _apply
    param_applied = fn(param)
  File "/nfsshare/apps/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 605, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
I checked nvidia-smi for the two GPUs, and no other programs are taking up GPU resources, which has confused me all day. I have tried different methods BUT still get the same problem.
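The job script above requests mode=exclusive_process, so alongside the process list it may be worth looking at the compute mode that nvidia-smi reports for each GPU. A sketch of such a check, assuming nvidia-smi is on PATH and supports the compute_mode query field (the helper names and the sample rows are made up for illustration and are not output from this cluster):

```python
import shutil
import subprocess


def parse_compute_modes(csv_text):
    """Parse 'index, name, compute_mode' CSV rows into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, rest = line.split(",", 1)
        name, mode = rest.rsplit(",", 1)
        rows.append({"index": int(index), "name": name.strip(),
                     "compute_mode": mode.strip()})
    return rows


def query_compute_modes():
    """Ask nvidia-smi for each GPU's compute mode; empty list if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,compute_mode",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_compute_modes(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    # Made-up sample rows, used only to show the parsed structure.
    sample = "0, Example GPU, Exclusive_Process\n1, Example GPU, Default\n"
    print(parse_compute_modes(sample))
    print(query_compute_modes())
```

If a GPU is in an exclusive mode, only one process at a time can hold a context on it, which is one possible (unconfirmed) way to see a "busy or unavailable" error while the process list looks empty.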
amazon.sh
scripts
The parameters are the same as yours, and the script looks like agnews.sh.
OUTPUT
Namespace(accum_steps=2, category_vocab_size=100, dataset_dir='datasets/amazon/amazon_data/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['bad'], 1: ['good']}
Loading encoded texts from datasets/amazon/amazon_data/train.pt
Loading texts with label names from datasets/amazon/amazon_data/label_name_data.pt
Reading texts from datasets/amazon/amazon_data/test.txt
Converting texts into tensors.
PS:
Read file for stderr output of this job.
ERRORS
The same as #8 .
RuntimeError: The task could not be sent to the workers as it is too large for send_bytes.
imdb.sh
scripts
The parameters are the same as yours, and the script looks like agnews.sh.
OUTPUT
Namespace(accum_steps=8, category_vocab_size=100, dataset_dir='datasets/imdb/imdb_data/', dist_port=12345, early_stop=False, eval_batch_size=32, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=512, mcp_epochs=4, out_file='out.txt', self_train_epochs=4.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=8, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['bad'], 1: ['good']}
Loading encoded texts from datasets/imdb/imdb_data/train.pt
Loading texts with label names from datasets/imdb/imdb_data/label_name_data.pt
Loading encoded texts from datasets/imdb/imdb_data/test.pt
Contructing category vocabulary.
Class 0 category vocabulary: ['bad', 'Bad', 'wrong', 'nasty', 'worst', 'badly', 'negative', 'sad', 'sorry', 'rotten', 'low', 'violent', 'weird', 'dark', 'shit', 'crazy', 'dirty', 'serious', 'sick', 'small', 'stupid', 'scary', 'dumb', 'much', 'gross', 'foul', 'dangerous', 'crap', 'mixed', 'fast', 'sour', 'miserable', 'severe', 'lost', 'hit', 'dreadful', 'trouble', 'gone']
Class 1 category vocabulary: ['good', 'excellent', 'high', 'Good', 'wonderful', 'amazing', 'fantastic', 'fair', 'positive', 'sure', 'sound', 'quality', 'light', 'solid', 'brilliant', 'awesome', 'smart', 'happy', 'bright', 'safe', 'true', 'clean', 'rich', 'successful', 'full', 'special', 'fun', 'popular', 'sweet', 'superior', 'simple', 'average', 'superb', 'normal', 'important', 'love', 'cool', 'quick', 'easy', 'whole', 'hot', 'interesting', 'damn']
Preparing self supervision for masked category prediction.
Number of documents with category indicative terms found for each category is: {0: 873, 1: 828}
There are totally 1701 documents with category indicative terms.
Training model via masked category prediction.
Epoch 1: Average training loss: 0.5981351137161255
Epoch 2: Average training loss: 0.23333437740802765
Epoch 3: Average training loss: 0.09165686368942261
Epoch 4: Average training loss: 0.056073933839797974
Start self-training.
PS:
Read file for stderr output of this job.
ERROR
The same as agnews.sh:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
dbpedia.sh
scripts
The parameters are the same as yours, and the script looks like agnews.sh.
OUTPUT
Namespace(accum_steps=2, category_vocab_size=100, dataset_dir='datasets/dbpedia/dbpedia_data/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['company'], 1: ['school', 'university'], 2: ['artist'], 3: ['athlete'], 4: ['politics'], 5: ['transportation'], 6: ['building'], 7: ['river', 'mountain', 'lake'], 8: ['village'], 9: ['animal'], 10: ['plant', 'tree'], 11: ['album'], 12: ['film'], 13: ['novel', 'publication', 'book']}
Loading encoded texts from datasets/dbpedia/dbpedia_data/train.pt
Loading texts with label names from datasets/dbpedia/dbpedia_data/label_name_data.pt
Reading texts from datasets/dbpedia/dbpedia_data/test.txt
Converting texts into tensors.
ERROR
The same error as #8 .
RuntimeError: The task could not be sent to the workers as it is too large for send_bytes.
Can you help me deal with these errors?
Sincerely, Heisenberg