Process with pixelnet - Githubissues

Larry-Liu02 commented 2 months ago

How much GPU memory do I need? I am using a single GPU with 24 GB, and it reports out of memory when I run:

python main.py --device 0 --config_file PixelNet/sasrec.yaml overall/ViT.yaml

hyc9 commented 2 months ago

Hi. For your reference, most experiments can be run on a GPU with 32 GB of memory. If you run out of memory, consider adjusting the training parameters in the yaml file. For example, in ViT.yaml:

train_batch_size: 16     # reduce to 8 to lower memory usage
fine_tune_arg: {
    tune_scale: 165,      # 165 tunes the last two layers of the ViT model; set it to 181 to tune only the last layer and use less memory
    pre_trained: True,
    activation: 'relu',
    dnn_layers: [],
    method: 'mean'
}
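For anyone wondering where 165 and 181 come from, here is a minimal sketch of the arithmetic. It assumes (this is inferred from the numbers, not confirmed by the maintainers) that tune_scale is the index of the first unfrozen tensor in the encoder's parameter list, and that the encoder is the 12-layer CLIP ViT-B/32 vision model from transformers. The constant and function names are hypothetical.

```python
# Assumed layout of the CLIP ViT-B/32 vision encoder's parameter tensors.
# Before the encoder layers: class_embedding, patch_embedding.weight,
# position_embedding.weight, pre_layrnorm.weight, pre_layrnorm.bias.
TENSORS_BEFORE_LAYERS = 5
# Per encoder layer: q/k/v/out projections (weight + bias each) = 8,
# two LayerNorms (weight + bias each) = 4, two MLP linears (weight + bias each) = 4.
TENSORS_PER_LAYER = 16
NUM_LAYERS = 12


def tune_scale_for_last_layers(n_trainable_layers: int) -> int:
    """Index of the first unfrozen tensor (the assumed meaning of tune_scale)."""
    if not 0 <= n_trainable_layers <= NUM_LAYERS:
        raise ValueError("n_trainable_layers must be between 0 and 12")
    first_trainable_layer = NUM_LAYERS - n_trainable_layers
    return TENSORS_BEFORE_LAYERS + TENSORS_PER_LAYER * first_trainable_layer


print(tune_scale_for_last_layers(2))  # 165
print(tune_scale_for_last_layers(1))  # 181
```

Under the same assumption, tune_scale = 181 leaves 16 encoder tensors trainable. Together with the weight and bias of the rec_fc projection, that would give 18 trainable tensors on the image side.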

Larry-Liu02 commented 2 months ago

Many thanks for your instructions! When I run the code, I hit the problem below. I suspect it is caused by my torch version, but I am not sure whether you have encountered the same issue before. Only PixelNet has this problem; IDNet and ViNet run successfully.

$ python main.py --device 0 --config_file PixelNet/sasrec.yaml overall/ViT.yaml

15 May 16:12 INFO [Training]: train_batch_size = [4]
15 May 16:12 INFO [Evaluation]: eval_batch_size = [1024]
15 May 16:12 INFO
World_Size = 1

15 May 16:12 INFO
General Hyper Parameters:
model = MOSASRec
seed = 2020
state = INFO
use_modality = True
reproducibility = True
checkpoint_dir = saved
show_progress = False
log_wandb = False
data_path = ../dataset/

Training Hyper Parameters:
epochs = 200
train_batch_size = 4
optim_args = {'modal_lr': 0.0001, 'rec_lr': 0.0001, 'modal_decay': 0, 'rec_decay': 0.1}
eval_step = 1
stopping_step = 30

Evaluation Hyper Parameters:
eval_batch_size = 1024
topk = [5, 10]
metrics = ['Recall', 'NDCG']
valid_metric = NDCG@10
metric_decimal_place = 7
eval_type = EvaluatorType.RANKING
valid_metric_bigger = True

Dataset Hyper Parameters:
MAX_ITEM_LIST_LENGTH = 10

Other Hyper Parameters:
n_layers = 2
n_heads = 4
embedding_size = 512
inner_size = 2
hidden_dropout_prob = 0.1
attn_dropout_prob = 0.1
hidden_act = gelu
layer_norm_eps = 1e-12
initializer_range = 0.02
wandb_project = REC
image_path = ../dataset/image.lmdb
encoder_name = clip-vit-base-patch32
encoder_source = transformers
fine_tune_arg = {'tune_scale': 181, 'pre_trained': True, 'activation': 'relu', 'dnn_layers': [], 'method': 'mean'}
MODEL_INPUT_TYPE = InputType.SEQ
device = cuda:0

15 May 16:12 INFO Pixel200K
The number of users: 200001
Average actions of users: 19.82828
The number of items: 96283
Average actions of items: 41.187927130720176
The number of inters: 3965656
The sparsity of the dataset: 99.9794063532928%
15 May 16:12 INFO MOSASRec(
  (visual_encoder): MeanItemEncoder(
    (item_encoder): CLIPVisionModel(
      (vision_model): CLIPVisionTransformer(
        (embeddings): CLIPVisionEmbeddings(
          (patch_embedding): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
          (position_embedding): Embedding(50, 768)
        )
        (pre_layrnorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (encoder): CLIPEncoder(
          (layers): ModuleList(
            (0-11): 12 x CLIPEncoderLayer(
              (self_attn): CLIPAttention(
                (k_proj): Linear(in_features=768, out_features=768, bias=True)
                (v_proj): Linear(in_features=768, out_features=768, bias=True)
                (q_proj): Linear(in_features=768, out_features=768, bias=True)
                (out_proj): Linear(in_features=768, out_features=768, bias=True)
              )
              (layer_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (mlp): CLIPMLP(
                (activation_fn): QuickGELUActivation()
                (fc1): Linear(in_features=768, out_features=3072, bias=True)
                (fc2): Linear(in_features=3072, out_features=768, bias=True)
              )
              (layer_norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            )
          )
        )
        (post_layernorm): Identity()
      )
    )
    (rec_fc): Sequential(
      (0): Linear(in_features=768, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (position_embedding): Embedding(10, 512)
  (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (trm_encoder): TransformerEncoder(
    (layer): ModuleList(
      (0-1): 2 x TransformerLayer(
        (multi_head_attention): MultiHeadAttention(
          (query): Linear(in_features=512, out_features=512, bias=True)
          (key): Linear(in_features=512, out_features=512, bias=True)
          (value): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (dense): Linear(in_features=512, out_features=512, bias=True)
          (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
          (out_dropout): Dropout(p=0.1, inplace=False)
        )
        (feed_forward): FeedForward(
          (dense_1): Linear(in_features=512, out_features=1024, bias=True)
          (dense_2): Linear(in_features=1024, out_features=512, bias=True)
          (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
)
Trainable parameters: 11693312
15 May 16:18 INFO recsys_lr_params_len: 35 modal_lr_params_len: 18
E0515 16:18:17.162791 140468101243264 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 107760)
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run.py FAILED

Failures:

Root Cause (first observed failure):
[0]:
  time : 2024-05-15_16:18:17
  host : kemove-Z790-D-DDR4
  rank : 0 (local_rank: 0)
  exitcode : -11 (pid: 107760)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 107760

Your requirements list is not compatible with my CUDA version, so I installed these versions instead:

transformers 4.39.0
torch 2.4.0.dev20240505+cu118
torch_geometric 2.5.3
torchaudio 2.2.0.dev20240505+cu118
torchmetrics 0.2.0
torchvision 0.19.0.dev20240505+cu118
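A note on reading the failure above: the launcher reports exitcode: -11, and a negative exit code conventionally means the process was killed by the signal with that number. The sketch below only decodes such a code with Python's standard signal module; describe_exit_code is a hypothetical helper, not part of the repository.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Turn a process exit code into a short human-readable description."""
    if code < 0:
        try:
            # A negative code is the negated number of the terminating signal.
            name = signal.Signals(-code).name
        except ValueError:
            name = f"unknown signal {-code}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(-11))
```

Signal 11 is SIGSEGV, which matches the traceback line in the log. A segmentation fault is raised in native code rather than in Python, which is why no Python traceback appears.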

hyc9 commented 2 months ago

Hi, the error message you've provided is brief, so it is difficult for me to determine the cause accurately. The program may be crashing due to insufficient system resources. Try decreasing num_workers to 2 in REC/data/utils.py:

    num_workers = 10    # set it to a lower number, e.g. 1, 2, or 3
    rank = torch.distributed.get_rank()
    seed = torch.initial_seed()

If that doesn't solve the problem, it would help if you could locate roughly where in the code the error occurs. Please feel free to reach out if you need further help.
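One way to find roughly where a segmentation fault happens is Python's standard faulthandler module, which dumps the Python stack when the process receives SIGSEGV. This is a minimal sketch, assuming the code is placed at the top of the script that actually runs the training (the log suggests run.py, but that is an assumption); enable_crash_tracebacks is a hypothetical helper name.

```python
import faulthandler
import os


def enable_crash_tracebacks() -> bool:
    """Ask Python to print a traceback if the interpreter crashes."""
    # Processes started later inherit this variable and enable
    # faulthandler themselves at interpreter startup.
    os.environ["PYTHONFAULTHANDLER"] = "1"
    try:
        # Dump the stack of every thread to stderr on SIGSEGV and similar signals.
        faulthandler.enable(all_threads=True)
    except (AttributeError, OSError, RuntimeError, ValueError):
        # stderr is not a real file here, so only child processes are covered.
        pass
    return faulthandler.is_enabled()


enable_crash_tracebacks()
```

Setting PYTHONFAULTHANDLER=1 in the shell before running python main.py should have the same effect without editing any file, since child processes inherit the environment.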

Larry-Liu02 commented 2 months ago

I tried this adjustment, but it did not help, and the error still does not show which line of code fails. I still think the cause is a torch version that is incompatible with the code. Could you tell me the CUDA version that goes with your requirement.txt, and your GPU type? I will try to find the same hardware to run the code. Many thanks!

hyc9 commented 2 months ago

For your reference:

pytorch==1.10.2+cu111
cudatoolkit==11.2.1
python==3.9.7

Based on our experiments, the code runs well on a 3090 Ti (24 GB of memory), V100 (32 GB), A40 (40 GB), and A100 (80 GB).
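To compare a local environment against the versions listed above, a small report like the following may help. It is a sketch using only the standard library; the PACKAGES list is an assumption based on the packages mentioned in this thread, and installed_versions is a hypothetical helper.

```python
import platform
from importlib import metadata

# Packages mentioned in this thread; adjust as needed.
PACKAGES = ["torch", "torchvision", "torchaudio", "transformers"]


def installed_versions(names):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = {"python": platform.python_version(), **installed_versions(PACKAGES)}
for name, version in report.items():
    print(f"{name}=={version if version is not None else 'not installed'}")
```

For pip wheels of torch, a local version suffix such as +cu111 or +cu118 usually indicates the CUDA version the build targets, which makes a mismatch with the reference setup easy to spot.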

westlake-repl / PixelRec

Process with pixelnet #2