Closed Larry-Liu02 closed 1 month ago
Hi. For your reference, most experiments can be run on a GPU with 32G of memory. If you run out of memory, consider adjusting the training parameters in the YAML file. For example, in ViT.yaml:
train_batch_size: 16 # Change it to 8 for lower memory usage
fine_tune_arg: {
tune_scale: 165, # Setting tune_scale to 165 tunes the last two layers of the ViT model. For lower memory usage, tune only the last layer by changing tune_scale to 181.
pre_trained: True,
activation: 'relu',
dnn_layers: [],
method: 'mean'
}
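The two tune_scale values above are consistent with reading tune_scale as the number of leading parameter tensors of the image encoder that stay frozen. That reading is an assumption, not something stated in this thread, and so is counting class_embedding among the embedding tensors. Under those assumptions, a minimal sketch of the arithmetic for clip-vit-base-patch32 (the helper name and constants are illustrative):

```python
# Assumed parameter-tensor layout of the CLIP ViT-B/32 vision tower:
#   embeddings: class_embedding, patch_embedding.weight,
#               position_embedding.weight                      -> 3 tensors
#   pre-layernorm: weight, bias                                -> 2 tensors
#   each encoder layer: 4 attention projections, 2 MLP layers,
#               2 LayerNorms, each with weight and bias        -> 16 tensors
PREFIX_TENSORS = 3 + 2
TENSORS_PER_LAYER = 16
NUM_LAYERS = 12


def tune_scale_for_last_layers(trainable_layers: int) -> int:
    """Hypothetical helper: index of the first trainable parameter tensor."""
    if not 0 <= trainable_layers <= NUM_LAYERS:
        raise ValueError("trainable_layers must be between 0 and 12")
    frozen_layers = NUM_LAYERS - trainable_layers
    return PREFIX_TENSORS + frozen_layers * TENSORS_PER_LAYER


print(tune_scale_for_last_layers(2))  # 165
print(tune_scale_for_last_layers(1))  # 181
```

If the reading holds, each extra trainable layer adds 16 tensors whose gradients and optimizer state must fit in GPU memory, which is why a larger tune_scale lowers memory usage.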
Many thanks for your instructions! When I run the code, I hit the problem shown below. I guess it may be caused by my torch version. I am not sure whether you have encountered the same issue before. Only PixelNet has this problem; IDnet and Vinet run successfully.
$ python main.py --device 0 --config_file PixelNet/sasrec.yaml overall/ViT.yaml
15 May 16:12 INFO [Training]: train_batch_size = [4]
15 May 16:12 INFO [Evaluation]: eval_batch_size = [1024]
15 May 16:12 INFO
World_Size = 1
15 May 16:12 INFO
General Hyper Parameters:
model = MOSASRec
seed = 2020
state = INFO
use_modality = True
reproducibility = True
checkpoint_dir = saved
show_progress = False
log_wandb = False
data_path = ../dataset/
Training Hyper Parameters:
epochs = 200
train_batch_size = 4
optim_args = {'modal_lr': 0.0001, 'rec_lr': 0.0001, 'modal_decay': 0, 'rec_decay': 0.1}
eval_step = 1
stopping_step = 30
Evaluation Hyper Parameters:
eval_batch_size = 1024
topk = [5, 10]
metrics = ['Recall', 'NDCG']
valid_metric = NDCG@10
metric_decimal_place = 7
eval_type = EvaluatorType.RANKING
valid_metric_bigger = True
Dataset Hyper Parameters:
MAX_ITEM_LIST_LENGTH = 10
Other Hyper Parameters:
n_layers = 2
n_heads = 4
embedding_size = 512
inner_size = 2
hidden_dropout_prob = 0.1
attn_dropout_prob = 0.1
hidden_act = gelu
layer_norm_eps = 1e-12
initializer_range = 0.02
wandb_project = REC
image_path = ../dataset/image.lmdb
encoder_name = clip-vit-base-patch32
encoder_source = transformers
fine_tune_arg = {'tune_scale': 181, 'pre_trained': True, 'activation': 'relu', 'dnn_layers': [], 'method': 'mean'}
MODEL_INPUT_TYPE = InputType.SEQ
device = cuda:0
15 May 16:12 INFO Pixel200K
The number of users: 200001
Average actions of users: 19.82828
The number of items: 96283
Average actions of items: 41.187927130720176
The number of inters: 3965656
The sparsity of the dataset: 99.9794063532928%
15 May 16:12 INFO MOSASRec( (visual_encoder): MeanItemEncoder( (item_encoder): CLIPVisionModel( (vision_model): CLIPVisionTransformer( (embeddings): CLIPVisionEmbeddings( (patch_embedding): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False) (position_embedding): Embedding(50, 768) ) (pre_layrnorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (encoder): CLIPEncoder( (layers): ModuleList( (0-11): 12 x CLIPEncoderLayer( (self_attn): CLIPAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): Linear(in_features=768, out_features=768, bias=True) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (layer_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (mlp): CLIPMLP( (activation_fn): QuickGELUActivation() (fc1): Linear(in_features=768, out_features=3072, bias=True) (fc2): Linear(in_features=3072, out_features=768, bias=True) ) (layer_norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) ) ) (post_layernorm): Identity() ) ) (rec_fc): Sequential( (0): Linear(in_features=768, out_features=512, bias=True) (1): ReLU() ) ) (position_embedding): Embedding(10, 512) (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) (trm_encoder): TransformerEncoder( (layer): ModuleList( (0-1): 2 x TransformerLayer( (multi_head_attention): MultiHeadAttention( (query): Linear(in_features=512, out_features=512, bias=True) (key): Linear(in_features=512, out_features=512, bias=True) (value): Linear(in_features=512, out_features=512, bias=True) (softmax): Softmax(dim=-1) (attn_dropout): Dropout(p=0.1, inplace=False) (dense): Linear(in_features=512, out_features=512, bias=True) (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (out_dropout): Dropout(p=0.1, inplace=False) ) (feed_forward): FeedForward( (dense_1): Linear(in_features=512, out_features=1024, bias=True) (dense_2): Linear(in_features=1024, out_features=512, bias=True) (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) ) ) )
Trainable parameters: 11693312
15 May 16:18 INFO recsys_lr_params_len: 35 modal_lr_params_len: 18
E0515 16:18:17.162791 140468101243264 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 107760)
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
run.py FAILED
Failures:
Root Cause (first observed failure):
[0]:
time : 2024-05-15_16:18:17
host : kemove-Z790-D-DDR4
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 107760)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 107760
Your requirements list is not compatible with my CUDA version, so I chose these versions instead:
transformers 4.39.0
torch 2.4.0.dev20240505+cu118
torch_geometric 2.5.3
torchaudio 2.2.0.dev20240505+cu118
torchmetrics 0.2.0
torchvision 0.19.0.dev20240505+cu118
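For readers decoding the log above: a negative exit code from a Python-launched subprocess conventionally means the child was terminated by the signal with that number, so -11 corresponds to signal 11. A small sketch using only the standard library (the function name is illustrative and is not part of the repository):

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Translate a subprocess-style exit code into a readable description."""
    if exit_code < 0:
        # A negative code means termination by signal number -exit_code.
        try:
            name = signal.Signals(-exit_code).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-exit_code} ({name})"
    return f"exited with status {exit_code}"


print(describe_exit_code(-11))
```

A segmentation fault is raised in native code rather than in Python, which is why no Python traceback appears in the log.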
Hi, the error message you've provided is a bit brief, so it's difficult for me to determine the cause accurately. The program may be crashing due to insufficient system resources. Try decreasing num_workers to 2 in REC/data/utils.py:
num_workers = 10 # set it to a lower number like 1, 2, 3 ....
rank = torch.distributed.get_rank()
seed = torch.initial_seed()
If that doesn't solve the problem, it would be best if you could look up the approximate code area where the error occurred. Please feel free to reach out if you need further help.
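One possible way to narrow down where a segmentation fault happens is Python's standard faulthandler module, which dumps the Python-level stack of every thread when a fatal signal arrives. The sketch below is an illustration rather than code from the repository: the helper name and log path are made up, and it assumes the call is placed near the top of the script that actually crashes.

```python
import faulthandler


def enable_crash_traces(log_path: str = "crash_trace.log"):
    """Write Python tracebacks of all threads to log_path on a fatal signal."""
    # faulthandler writes through the file descriptor, so the file must stay
    # open for the lifetime of the process; return it so the caller keeps it.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Usage at the top of the entry script (assumed placement):
#     _crash_log = enable_crash_traces()
```

The same handler can be switched on without editing any code by running `python -X faulthandler ...` or by setting the environment variable `PYTHONFAULTHANDLER=1`, which child processes inherit.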
I tried this adjustment, but it doesn't help, and the error doesn't show which line of code fails. I still think the cause is an incompatible torch version. Could you tell me the CUDA version that matches your requirement.txt, and your GPU type? I will try to find the same equipment to run the code. Many thanks!
For your reference:
pytorch==1.10.2+cu111
cudatoolkit==11.2.1
python==3.9.7
Based on our experiments, the code runs well on a 3090 Ti (24G memory), V100 (32G memory), A40 (40G memory), and A100 (80G memory).
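To compare a local environment against the reference versions above, the installed package versions can be listed with the standard library alone. This is a minimal sketch: the function name is illustrative and the package list is simply the set of names mentioned in this thread.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    print("python", platform.python_version())
    names = ["torch", "torchvision", "torchaudio", "transformers"]
    for name, ver in installed_versions(names).items():
        print(name, ver or "not installed")
```

For pip wheels such as the ones quoted earlier, the CUDA build appears as a suffix of the torch version string (for example `+cu118`).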
How large a GPU should I use? I am using one GPU with 24GB, and it reports out of memory when I run:
python main.py --device 0 --config_file PixelNet/sasrec.yaml overall/ViT.yaml