wzk1015 / video-bgm-generation

[ACM MM 2021 Best Paper Award] Video Background Music Generation with Controllable Music Transformer
https://wzk1015.github.io/cmt/
MIT License
289 stars 34 forks

Bugs encountered while using the inference code "gen_midi_conditional.py" in "src/" folder #8

Closed shansongliu closed 2 years ago

shansongliu commented 2 years ago

Hi, I encountered some bugs while using the "gen_midi_conditional.py" code to generate midi files for a given video. I installed the Python 2 environment according to the requirements file "py2_requirements.txt" and then used "video2npz.sh" to produce an "xxx.npz" file for the given video. However, when I ran "gen_midi_conditional.py", I ran into problems; the program output and error report are pasted below:

Command I used:

    python3 gen_midi_conditional.py -f ../inference/LGpwmBqJF1Q_HarryPotter2ChamberOfSecrets.npz -c ../exp/train_exp/loss_70_params.pt

Standard output:

    inference
    D_MODEL 512 N_LAYER 12 N_HEAD 8
    DECODER ATTN causal-linear
    [18, 3, 18, 129, 18, 6, 27, 102, 5025]
    [*] load model from: ../exp/train_exp/loss_70_params.pt
    new song
    [vlog_npz matrix print here]
    ------ initiate ------
    tensor([[[17, 1, 10, 0, 0, 0, 0, 1, 0]]])

Error print:

    Traceback (most recent call last):
      File "gen_midi_conditional.py", line 104, in <module>
        generate()
      File "gen_midi_conditional.py", line 85, in generate
        res, err_note_number_list, err_beat_number_list = net(is_train=False, vlog=vlog_npz, C=0.7)
      File "/data/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/data/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
        return self.module(*inputs, **kwargs)
      File "/data/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/group/30042/shansongliu/Projects/VideoMusicRecommend/VideoBGMGenerate/src_mm21_py2/model.py", line 483, in forward
        return self.inference_from_scratch(**kwargs)
      File "/group/30042/shansongliu/Projects/VideoMusicRecommend/VideoBGMGenerate/src_mm21_py2/model.py", line 341, in inference_from_scratch
        h, y_type = self.forward_hidden(input, is_training=False, init_token=pre_init)
      File "/group/30042/shansongliu/Projects/VideoMusicRecommend/VideoBGMGenerate/src_mm21_py2/model.py", line 216, in forward_hidden
        init_emb_linear = self.forward_init_token(init_token)
      File "/group/30042/shansongliu/Projects/VideoMusicRecommend/VideoBGMGenerate/src_mm21_py2/model.py", line 162, in forward_init_token
        emb_genre = self.init_emb_genre(x[..., 0])
      File "/data/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/group/30042/shansongliu/Projects/VideoMusicRecommend/VideoBGMGenerate/src_mm21_py2/utils.py", line 80, in forward
        return self.lut(x) * math.sqrt(self.d_model)
      File "/data/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/data/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
        return F.embedding(
      File "/data/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/nn/functional.py", line 2183, in embedding
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    IndexError: index out of range in self

The inference code, trained model and data (including original video and processed .npz file) are attached in Google drive. Here is the link: https://drive.google.com/drive/folders/1Ch3jjxZrztKAtEvuEhGjxPk2-G0NSYe0?usp=sharing

Could you help me check this? Really appreciate it.

Best regards,

wzk1015 commented 2 years ago

pre_init in model.py holds the init tokens for genre (first column), key (unused, second column), and instrument (third column). In your gen_midi_conditional.py you define the number of embedding classes for them as init_n_token = [1, 1, 1] on line 48, so the values in pre_init are out of range.

You can fix it by making pre_init consistent with init_n_token.

shansongliu commented 2 years ago

Do you mean I should set the pre_init variable ( pre_init = np.array([[5, 0, 0], [0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 3], [0, 0, 4], [0, 0, 5]]) ) to pre_init = np.array([])? I see that in train.py the value of init_n_token is [1, 1, 1].

wzk1015 commented 2 years ago

Yes, set pre_init to np.array([]). You can also try pre_init = np.array([[0, 0, 0]]) if np.array([]) doesn't work well.

init_n_token is not the token itself, but the number of embedding classes for genre, key and instrument.
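
For illustration, here is a minimal sketch (using PyTorch directly, with a placeholder embedding size rather than the repo's actual model code) of why an init token value outside the range given by init_n_token raises "IndexError: index out of range in self", and why a pre_init consistent with [1, 1, 1] works:

    import numpy as np
    import torch
    import torch.nn as nn

    init_n_token = [1, 1, 1]                        # number of embedding classes for genre, key, instrument
    emb_genre = nn.Embedding(init_n_token[0], 512)  # with 1 class, only index 0 is valid

    ok = torch.tensor(np.array([[0, 0, 0]])[:, 0])  # genre column of a consistent pre_init
    print(emb_genre(ok).shape)                      # torch.Size([1, 512])

    bad = torch.tensor(np.array([[5, 0, 0]])[:, 0]) # genre token 5 from the original pre_init
    # emb_genre(bad)                                # -> IndexError: index out of range in self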

shansongliu commented 2 years ago

Thanks for your quick reply. After I set pre_init to np.array([[0, 0, 0]]), the inference program runs without any further error messages (setting pre_init to np.array([]) still triggers an error). What seems strange is that the inference program does not appear to stop: it has been running for about 8 hours on a 2-minute input video. Is this normal? By the way, I haven't seen a midi output yet. Will the midi file be generated in the src/ folder? Thanks again for your patience.

wzk1015 commented 2 years ago

That seems weird. Normally it runs for several minutes on a short video and stops generating automatically via Beat Timing Encoding, or it breaks out of the loop if the music length exceeds the video length (see this).

I am not quite sure about your model settings, but I guess the video2npz pipeline has some problem. You can check the npz file (or vlog in model.py) to see whether its length matches the video length.

For beat timing encoding you can also check the pbeat attribute (see the output when running inference, pbeat is the second column from the right), it should be monotonically increasing from 0 to 99.
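
If it helps, a quick sanity-check sketch for this (the npz path is the one from the command above; the key name is an assumption about the file's contents, so inspect data.files first):

    import numpy as np

    data = np.load("../inference/LGpwmBqJF1Q_HarryPotter2ChamberOfSecrets.npz", allow_pickle=True)
    print(data.files)               # see which arrays the npz actually contains
    vlog = data[data.files[0]]      # hypothetical key: the conditioning (vlog) token matrix
    print(len(vlog))                # compare against the expected length for your video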

Becomebright commented 2 years ago

As stated in README / Directory Structure, the generated midi files will be stored in the inference/ folder.

Becomebright commented 2 years ago

I followed the README instruction and it runs normally. Here are some of the generated and intermediate files: https://drive.google.com/drive/folders/1UtZXXLiY9PNFo-p3lIslQxlEKCQzIcnU?usp=sharing.

shansongliu commented 2 years ago

I followed the README instruction and it runs normally. Here are some of the generated and intermediate files: https://drive.google.com/drive/folders/1UtZXXLiY9PNFo-p3lIslQxlEKCQzIcnU?usp=sharing.

Hi, Shangzhe, it seems that the link requires access permission; I have already requested access. BTW, I did follow the detailed instructions in the README.md. But as I stated, the inference program would not stop (it seemed to run into an infinite loop) after I corrected the pre_init variable as advised by Zhaokai. Did you use the video data, trained model, and inference code in this link https://drive.google.com/drive/folders/1Ch3jjxZrztKAtEvuEhGjxPk2-G0NSYe0?usp=sharing and successfully generate midi files?

Becomebright commented 2 years ago

I used your video, our model, and inference code in this repo without any modification. Perhaps your inference code or model has problems.

shansongliu commented 2 years ago

That seems weird. Normally it runs for several minutes on a short video and stops generating automatically via Beat Timing Encoding, or it breaks out of the loop if the music length exceeds the video length (see this).

I am not quite sure about your model settings, but I guess the video2npz pipeline has some problem. You can check the npz file (or vlog in model.py) to see whether its length matches the video length.

For beat timing encoding you can also check the pbeat attribute (see the output when running inference, pbeat is the second column from the right), it should be monotonically increasing from 0 to 99.

Hi, Zhaokai, could you be more specific about what might be wrong when I use the video2npz pipeline? I followed the inference instructions in README.md. I saw that there are three sub-steps in the video2npz.sh script. The first sub-step, optical_flow.py, generated the optical flow npz file. The second sub-step, video2metadata.py, generated a json file. The last sub-step, metadata2numpy_mix.py, generated an npz data file from the json file produced in the previous step.

Then I used this npz data file together with my self-trained model and the gen_midi_conditional.py in which the decoder_n_class and init_n_token variables were changed in line with the training data (output by train.py). After all this was done, the inference program gen_midi_conditional.py does run; the only problem is that it seems to run into an infinite loop.

Regarding the points you mentioned:

1) I am not quite sure about your model settings, but I guess the video2npz pipeline has some problem. You can check the npz file (or vlog in model.py) to see whether its length matches the video length.

I am not quite sure about the video length you mentioned. Do you mean the number of video frames? Or the dimension of the vlog_npz variable in gen_midi_conditional.py?

2) For beat timing encoding you can also check the pbeat attribute (see the output when running inference, pbeat is the second column from the right), it should be monotonically increasing from 0 to 99.

Could you clarify which line (or which variable) in the source code you are referring to?

Again, many thanks for your patience and kindness. I really appreciate it.

shansongliu commented 2 years ago

I used your video, our model, and inference code in this repo without any modification. Perhaps your inference code or model has problems.

Thanks for your clarification.

wzk1015 commented 2 years ago

  1. You can check the values of n_beat and len(vlog), and also trace the value of cur_vlog to see why this break condition isn't executed.

  2. See the output when running inference; it should look like this:

    [   9   1   6   0   0   3   4  35 216]
    [   3   1  10   0   0   5   1  36 226]
    [   0   2   0  74  16   5   0  36 226]

    the second column from the right (35, 36, 36) indicates pbeat
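
A rough way to check this from the printed rows (a sketch; the three rows below are just the example output above, and the column layout is taken from that output, not from the repo's code):

    # collect some printed token rows and verify that pbeat
    # (second column from the right) keeps increasing
    rows = [
        [9, 1, 6, 0, 0, 3, 4, 35, 216],
        [3, 1, 10, 0, 0, 5, 1, 36, 226],
        [0, 2, 0, 74, 16, 5, 0, 36, 226],
    ]
    pbeat = [r[-2] for r in rows]
    print(pbeat)  # [35, 36, 36]
    # if pbeat gets stuck at one value, generation never reaches the end
    # of the video and the loop looks infinite
    assert all(a <= b for a, b in zip(pbeat, pbeat[1:]))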

shansongliu commented 2 years ago

Thanks for your detailed explanation, I will continue to check.

shansongliu commented 2 years ago

You can check the values of n_beat and len(vlog), and also trace the value of cur_vlog to see why this break condition isn't executed.

I checked the values of n_beat and len(vlog). They are not equal: n_beat=940 > len(vlog)=166. Also, the value of cur_vlog gets stuck at 14 and never proceeds. Does this mean the input npz file for the inference code gen_midi_conditional.py is corrupted?

wzk1015 commented 2 years ago

n_beat > len(vlog) is normal; the former is the total number of beats, while the latter counts the Bar and Beat tokens. Can you provide the standard output of inference?

shansongliu commented 2 years ago

I put the newly generated standard output (stdout_new.txt) in this link https://drive.google.com/drive/folders/1Ch3jjxZrztKAtEvuEhGjxPk2-G0NSYe0

shansongliu commented 2 years ago

For beat timing encoding you can also check the pbeat attribute (see the output when running inference, pbeat is the second column from the right), it should be monotonically increasing from 0 to 99.

Hi, Zhaokai, I observe that my pbeat attribute gets stuck at a number (say 5 or 14) and no longer increases during inference. I think this is why the loop cannot stop. Do you have any idea why this happens?

wzk1015 commented 2 years ago

It seems that this is due to an inconsistency of the init tokens between training and generation, which appears when using another training set. This should be fixed by 8f7922930aa219aa605246ed67a6f98c5c8df0e1.
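
A rough sketch of the consistency requirement (the numbers below are placeholders, not the repo's actual vocabulary sizes): the init-token vocabulary sizes fixed at training time must also bound the init tokens used at generation time.

    # placeholder class counts for genre, key, instrument seen during training
    train_init_n_token = [5, 1, 6]
    gen_init_n_token = train_init_n_token      # must be identical at generation time

    pre_init = [[4, 0, 5]]                     # placeholder init tokens for generation
    for token in pre_init:
        for value, n_class in zip(token, gen_init_n_token):
            # out-of-range values lead to embedding errors or degenerate generation
            assert value < n_class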

shansongliu commented 2 years ago

It seems that this is due to an inconsistency of the init tokens between training and generation, which appears when using another training set. This should be fixed by 8f79229.

Thanks, will try it.

shansongliu commented 2 years ago

I tried the modified version, and now it gives the following error. It seems there is still a dimension problem.

Traceback (most recent call last): File "train.py", line 226, in train_dp() File "train.py", line 169, in train_dp losses = net(is_train=True, x=batch_x, target=batch_y, loss_mask=batch_mask, init_token=batch_init) File "/data/miniconda3/envs/mm21_py3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl return forward_call(*input, kwargs) File "/data/miniconda3/envs/mm21_py3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward return self.module(*inputs[0], *kwargs[0]) File "/data/miniconda3/envs/mm21_py3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl return forward_call(input, kwargs) File "/group/30042/shansongliu/Projects/VideoMusicRecommend/VideoBGMGenerate_new3/src/model.py", line 482, in forward return self.train_forward(**kwargs) File "/group/30042/shansongliu/Projects/VideoMusicRecommend/VideoBGMGenerate_new3/src/model.py", line 450, in train_forward h, y_type = self.forward_hidden(x, memory=None, is_training=True, init_token=init_token) File "/group/30042/shansongliu/Projects/VideoMusicRecommend/VideoBGMGenerate_new3/src/model.py", line 213, in forward_hidden encoder_pos_emb = torch.cat([init_emb_linear, encoder_pos_emb], dim=1) RuntimeError: Tensors must have same number of dimensions: got 2 and 3
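
For reference, a minimal sketch of what that torch.cat error means (shapes here are illustrative, not the model's actual sizes): a 2-D init embedding cannot be concatenated with a 3-D positional embedding along dim=1 until it gets a matching batch dimension.

    import torch

    init_emb_linear = torch.randn(7, 512)       # 2-D: (n_init_tokens, d_model)
    encoder_pos_emb = torch.randn(4, 100, 512)  # 3-D: (batch, seq_len, d_model)

    # torch.cat([init_emb_linear, encoder_pos_emb], dim=1)
    # -> RuntimeError: Tensors must have same number of dimensions: got 2 and 3

    init_emb_3d = init_emb_linear.unsqueeze(0).expand(4, -1, -1)  # add a batch dim
    out = torch.cat([init_emb_3d, encoder_pos_emb], dim=1)
    print(out.shape)  # torch.Size([4, 107, 512])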

shansongliu commented 2 years ago

I just downloaded the newest version of this repo and directly used the train.py there without further modification.

wzk1015 commented 2 years ago

It was a typo; it is fixed by d4a6c33dbd6e1a6f001ce1ba405d09050cb0df2f. You can try the latest version.

shansongliu commented 2 years ago

It was a typo; it is fixed by d4a6c33. You can try the latest version.

It can run now. Thanks. Will check the inference later.