[Closed] anonymous-st4rec closed this issue 9 months ago
Thank you for your excellent work. I encountered some problems while running the code; could you help me resolve them? Here are the training parameters:
```python
import os

root_data_dir = '../../'
dataset = 'dataset/HM'
behaviors = 'hm_50w_users.tsv'
images = 'hm_50w_items.tsv'
lmdb_data = 'hm_50w_items.lmdb'
logging_num = 2
testing_num = 1

CV_resize = 224
CV_model_load = 'swin_tiny'
freeze_paras_before = 0

mode = 'train'
item_tower = 'modal'
epoch = 150
load_ckpt_name = 'None'

l2_weight_list = [0.01]
drop_rate_list = [0.1]
batch_size_list = [16]
lr_list_ct = [(1e-4, 1e-4), (5e-5, 5e-5), (1e-4, 5e-5)]
embedding_dim_list = [512]

for l2_weight in l2_weight_list:
    for batch_size in batch_size_list:
        for drop_rate in drop_rate_list:
            for embedding_dim in embedding_dim_list:
                for lr_ct in lr_list_ct:
                    lr = lr_ct[0]
                    fine_tune_lr = lr_ct[1]
                    label_screen = '{}_bs{}_ed{}_lr{}_dp{}_L2{}_Flr{}'.format(
                        item_tower, batch_size, embedding_dim, lr,
                        drop_rate, l2_weight, fine_tune_lr)
                    run_py = "CUDA_VISIBLE_DEVICES='2,3' \
                    /home/zwy/anaconda3/envs/m/bin/python -m torch.distributed.launch --nproc_per_node 2 --master_port 1289 \
                    run.py --root_data_dir {} --dataset {} --behaviors {} --images {} --lmdb_data {} \
                    --mode {} --item_tower {} --load_ckpt_name {} --label_screen {} --logging_num {} --testing_num {} \
                    --l2_weight {} --drop_rate {} --batch_size {} --lr {} --embedding_dim {} \
                    --CV_resize {} --CV_model_load {} --epoch {} --freeze_paras_before {} --fine_tune_lr {}".format(
                        root_data_dir, dataset, behaviors, images, lmdb_data,
                        mode, item_tower, load_ckpt_name, label_screen, logging_num, testing_num,
                        l2_weight, drop_rate, batch_size, lr, embedding_dim,
                        CV_resize, CV_model_load, epoch, freeze_paras_before, fine_tune_lr)
                    os.system(run_py)
```
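I also noticed the FutureWarning in the log below saying `torch.distributed.launch` is deprecated in favor of `torchrun`. In case it is relevant, the launch string rewritten for `torchrun` would look roughly like this (just a sketch; it assumes `torchrun` from the same conda env is on PATH and that run.py reads the rank from `os.environ['LOCAL_RANK']` rather than a `--local_rank` flag):

```python
# Hypothetical torchrun variant of the launch string above. torchrun sets
# LOCAL_RANK in each worker's environment instead of passing a --local-rank
# argument, so run.py would need to read os.environ['LOCAL_RANK'].
run_py = "CUDA_VISIBLE_DEVICES='2,3' \
torchrun --nproc_per_node 2 --master_port 1289 \
run.py ..."  # same run.py arguments as in the script above
os.system(run_py)
```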
Here is the error that occurred:
```
/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
[2023-10-14 21:32:25,576] torch.distributed.run: [WARNING]
[2023-10-14 21:32:25,576] torch.distributed.run: [WARNING] *****************************************
[2023-10-14 21:32:25,576] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-10-14 21:32:25,576] torch.distributed.run: [WARNING] *****************************************
usage: run.py [-h] [--mode MODE] [--item_tower ITEM_TOWER]
              [--root_data_dir ROOT_DATA_DIR] [--dataset DATASET]
              [--behaviors BEHAVIORS] [--images IMAGES]
              [--lmdb_data LMDB_DATA] [--cold_seqs COLD_SEQS]
              [--new_seqs NEW_SEQS] [--new_items NEW_ITEMS]
              [--new_lmdb_data NEW_LMDB_DATA] [--batch_size BATCH_SIZE]
              [--epoch EPOCH] [--lr LR] [--fine_tune_lr FINE_TUNE_LR]
              [--l2_weight L2_WEIGHT]
              [--fine_tune_l2_weight FINE_TUNE_L2_WEIGHT]
              [--drop_rate DROP_RATE] [--CV_model_load CV_MODEL_LOAD]
              [--freeze_paras_before FREEZE_PARAS_BEFORE]
              [--CV_resize CV_RESIZE] [--embedding_dim EMBEDDING_DIM]
              [--num_attention_heads NUM_ATTENTION_HEADS]
              [--transformer_block TRANSFORMER_BLOCK]
              [--max_seq_len MAX_SEQ_LEN] [--min_seq_len MIN_SEQ_LEN]
              [--num_workers NUM_WORKERS] [--load_ckpt_name LOAD_CKPT_NAME]
              [--label_screen LABEL_SCREEN] [--logging_num LOGGING_NUM]
              [--testing_num TESTING_NUM] [--local_rank LOCAL_RANK]
run.py: error: unrecognized arguments: --local-rank=0
run.py: error: unrecognized arguments: --local-rank=1
[2023-10-14 21:32:30,604] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 3708157) of binary: /home/zwy/anaconda3/envs/m/bin/python
Traceback (most recent call last):
  File "/home/zwy/anaconda3/envs/m/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zwy/anaconda3/envs/m/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zwy/anaconda3/envs/m/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-10-14_21:32:30
  host      : gpuserver
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 3708158)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-14_21:32:30
  host      : gpuserver
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 3708157)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
Looking forward to your reply, thank you.