whettenr / bestrq


Accuracy for Best-RQ #1

AraKchrUser opened 1 month ago

AraKchrUser commented 1 month ago

What is the accuracy of identifying the correct codebook entry for Best-RQ? Thank you.

whettenr commented 1 month ago

I was able to get an accuracy of around 20%

AraKchrUser commented 1 month ago

Please tell me, did you measure it during validation or training?

AraKchrUser commented 1 month ago

I am wondering how this squares with the fact that on LS-960, wav2vec 2.0 reaches around 70% accuracy (Section 5.4, https://arxiv.org/pdf/2006.11477). I would be grateful if you could help me figure it out.

whettenr commented 1 month ago

Only on the validation.

Yes, if I remember and understand correctly, wav2vec 2.0 reaches a higher accuracy because its task is contrastive: the model has to choose the right entry from among around 100 distractors.

This is different from BRQ, where the model has to choose the correct entry from a codebook with over 8000 entries.

So it makes sense that finding the correct one out of ~100 is easier to learn than finding the correct one out of 8000+ (i.e. more than 80 times as many options to select from).
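For concreteness, this is roughly what that 8000+-way target looks like in BEST-RQ as described in the paper: the input features go through a frozen random projection and are matched to the nearest entry of a frozen random codebook, and the model has to predict that index for every masked frame. A minimal sketch, assuming PyTorch, with illustrative sizes and names (not taken from this repo):

```python
import torch
import torch.nn.functional as F

# Frozen, randomly initialized projection and codebook (never updated during training).
feat_dim, proj_dim, codebook_size = 320, 16, 8192   # illustrative sizes
torch.manual_seed(0)
projection = torch.empty(feat_dim, proj_dim)
torch.nn.init.xavier_uniform_(projection)
codebook = F.normalize(torch.randn(codebook_size, proj_dim), dim=-1)

def bestrq_targets(features):
    """features: (batch, time, feat_dim) stacked filterbank frames.
    Returns the index of the nearest codebook entry for each frame."""
    projected = F.normalize(features @ projection, dim=-1)   # (batch, time, proj_dim)
    # With L2-normalized vectors, the nearest entry by Euclidean distance
    # is the one with the highest dot product.
    similarity = projected @ codebook.T                      # (batch, time, codebook_size)
    return similarity.argmax(dim=-1)                         # (batch, time), values in [0, 8192)

targets = bestrq_targets(torch.randn(2, 100, feat_dim))
print(targets.shape)  # torch.Size([2, 100])
```

The pre-training objective is then a plain cross-entropy over these 8192 classes on the masked frames, which is where the ~20% top-1 accuracy above comes from.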

AraKchrUser commented 1 month ago

Thank you very much for the reply! I wanted to clarify a couple more points.

1) I would be very grateful if you could share the learning curves (loss and accuracy) with me. Would that be possible?

2) Regarding classification difficulty: have you already run an experiment reducing the number of codewords? Did it improve accuracy during pre-training (possibly at the cost of speech recognition performance)?

whettenr commented 1 month ago

No problem

1. What would be the best way to share them? (Feel free to give me an email.) If you'd like to do a video conference call, I'm available.

2. Yes, I did experiment a bit with this (for example, shrinking the codebook), and it did improve accuracy on the pre-training task, but it did not actually deteriorate performance on the speech recognition task (unless you shrink the codebook by a HUGE amount).

Also, the original paper notes that codebook size doesn't matter too much. In Section 5, when discussing hyperparameters, they say the following:

The pre-training quality is not very sensitive to the codebook vocab size and the codebook dimension, and is more sensitive to the masking probability and the mask length.

In my paper (Table 3), I report some preliminary experiments with changing the codebook; in my experiments the codebook did not change speech recognition performance in any consistent way.

I think that if you are working with more languages, you might get more benefit from a larger codebook or even two codebooks (as in Google USM). But in my experiments, the mask probability, learning rate, and batch size were much more important for downstream performance than the codebook size.

AraKchrUser commented 1 month ago

Yes, sure! My email is armen0101017@gmail.com. I have read your article; thank you very much for your work! I would love to talk via video conference, but unfortunately my English still leaves much to be desired. I still have several open questions that may be of interest to you; we could continue by email. However, our discussion may also be useful for other researchers who are interested in SSL.

AraKchrUser commented 1 month ago

I'm new to SSL, and I don't quite understand how to evaluate my BEST-RQ pre-training. I became concerned about this after reading this post: https://github.com/facebookresearch/fairseq/issues/2949#issuecomment-736841653. Should you focus on accuracy or loss? Is 40% accuracy really better than 20%? And how do you know which metric values are good for a given task, and whether the model is ready for fine-tuning? Feel free to share the contacts of the authors of the original article if you think it's worth consulting them. I'll be very grateful.

AraKchrUser commented 1 month ago

I would like to show you my learning curves, masked spectrograms on LibriSpeech, and the distribution of codeword usage during training. I have a 600M Conformer, x_len_reduce=8, masking probability p=5%, and mask_span=8 frames (about 40% of the audio is masked). Do you have any recommendations on how to set these parameters? I do not quite understand how they are chosen. I think you wrote that there is a dependence on the number of hours in the batch, but I did not find anything about it in the original article.
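(As a rough illustration of how those masking numbers relate, here is a small sketch assuming a simple span-masking scheme in which every frame starts a span with probability p; the actual implementation may differ. With p=5% and 8-frame spans, overlapping spans bring the expected masked fraction to about 34%, close to the ~40% quoted above.)

```python
import torch

def span_mask(num_frames, p=0.05, span=8):
    """Each frame starts a mask span with probability p; the span covers
    the next `span` frames. Overlapping spans merge, so the expected
    masked fraction is 1 - (1 - p) ** span ~= 0.34 rather than p * span = 0.40."""
    starts = torch.rand(num_frames) < p
    mask = torch.zeros(num_frames, dtype=torch.bool)
    for start in torch.nonzero(starts).flatten():
        mask[start:start + span] = True
    return mask

torch.manual_seed(0)
fractions = [span_mask(2000).float().mean().item() for _ in range(50)]
print(sum(fractions) / len(fractions))  # ~0.33-0.34 of frames masked on average
```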

whettenr commented 1 month ago

Should you focus on accuracy or loss? Is 40% accuracy really better than 20%? And how do you know which metric values are good for a given task, and whether the model is ready for fine-tuning?

You should be concerned. This is one of the challenges with SSL: neither the training loss nor the accuracy is a great indicator of what the performance will be on fine-tuning/other downstream tasks. They can both give you some insight into whether the model is training/converging well, but from what I know, most people just run the model on the downstream task (even if only for a few epochs) to get an idea of whether it is performing well. You also don't have to wait until training is finished; for example, you can try the downstream task 1/4 or 1/2 of the way through pre-training to get an idea of whether it is going well.
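(To pin down what "accuracy" means here, a minimal sketch, assuming PyTorch and hypothetical names, of the validation metric being discussed: top-1 accuracy of the 8192-way codebook prediction, counted only on masked frames. Chance level is 1/8192 ≈ 0.01%, so ~20% is far above chance even though it sounds low.)

```python
import torch

def masked_top1_accuracy(logits, targets, mask):
    """logits:  (batch, time, codebook_size) model outputs,
    targets: (batch, time) codebook indices from the frozen quantizer,
    mask:    (batch, time) True where the input frame was masked.
    Only masked frames count, since only those are predicted."""
    correct = (logits.argmax(dim=-1) == targets) & mask
    return correct.sum().float() / mask.sum().clamp(min=1)

# Untrained (random) logits sit near chance level, 1/8192 ~= 0.012%.
logits = torch.randn(4, 200, 8192)
targets = torch.randint(0, 8192, (4, 200))
mask = torch.rand(4, 200) < 0.4
print(masked_top1_accuracy(logits, targets, mask))
```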

whettenr commented 1 month ago

I still have several open questions that may be of interest to you; we could continue by email. However, our discussion may also be useful for other researchers who are interested in SSL.

I think there is a dependence on batch size, but this is just my intuition and I have not experimented thoroughly. It could also depend on dataset size. From the little that I have done, I feel that if you have a really large batch size you don't need a very big mask (for example, in the BEST-RQ paper they use a really small mask, a really big batch size, and a lot of data, and get great performance).

I think if you have a small batch size you will not be able to reach good performance unless you increase the mask (this was my case: I could not get good performance until I increased the mask size, but I was working with a very small batch size and a much smaller dataset due to GPU and resource constraints).

AraKchrUser commented 1 month ago

Thanks for the training logs, I will study them. Please tell me, how do you compute "codebook usage"? Is it something like len(set(step_targets))/8192? I assume these values were obtained during training?

I have built histograms of the predictions (left) and targets (right) using TensorBoard for the training stage (though something tells me that the TensorBoard histograms greatly distort the actual results).

[image: histograms of predictions (left) and targets (right)]

I'm confused by a few things; maybe you've dealt with them?

1) Some codewords are clearly more common than others. On the one hand, this reflects the distribution of our data, but on the other, it seems suboptimal for training. Have you observed codebook collapse?

2) According to your logs, you only use about 20% of the codewords at each step. I computed this metric in my BEST-RQ framework on LibriSpeech for each epoch: on the training set my codebook usage is about 50%, but on validation it is much lower, under 10%! Does this mean I need to adjust the implementation or the preprocessing?
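(One plausible way to compute the "codebook usage" in question, assuming PyTorch and the len(set(step_targets))/8192 definition mentioned above; the logged curves may have been produced differently. Note that this fraction can only grow as more frames are aggregated, so an epoch over the much smaller validation set naturally covers fewer codewords than an epoch over the training set, which may partly explain the gap.)

```python
import torch

def codebook_usage(targets, codebook_size=8192):
    """Fraction of codebook entries that appear at least once in `targets`
    (all target indices collected over a step or over an epoch)."""
    return torch.unique(targets).numel() / codebook_size

one_step = torch.randint(0, 8192, (8, 1000))                 # targets from one batch
print(codebook_usage(one_step))                              # per-step usage

one_epoch = torch.cat([torch.randint(0, 8192, (8, 1000)) for _ in range(200)])
print(codebook_usage(one_epoch))                             # per-epoch usage is higher
```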

xxchauncey commented 1 week ago

Hi, guys

Is there any conclusion? I'm also really curious about how BRQ can work. As I understand it, BRQ just assigns a random target to each frame, and the frame is encoded to a latent that is expected to get closer and closer to that random target. Since the codebook usage is quite imbalanced, as @AraKchrUser points out, and the classification accuracy is not high, I don't quite understand how it can work well on the downstream task.

whettenr commented 1 week ago

@xxchauncey @AraKchrUser I don't know exactly, but I can give you both my thoughts.

Thoughts on codebook usage

I think it is normal for some entries to be used more than others. In natural language (speech and text), certain sounds and words are more common than others. For example, in English text we use the letter "e" a lot more than "z", and the word "the" a lot more than "artificial", and speech is the same (the "z" sound might be less common than the "k" sound... I'm not sure about this particular example, but I think you get the point). This is shown in @AraKchrUser's right-hand histogram, where we see the targets are unbalanced.
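(To put a number on that imbalance, here is a small sketch, assuming PyTorch, of codebook perplexity, a summary statistic often reported for vector-quantization codebooks: it equals the codebook size when usage is perfectly uniform and shrinks as usage concentrates on a few entries.)

```python
import torch

def codebook_perplexity(targets, codebook_size=8192):
    """exp(entropy) of the empirical target distribution."""
    counts = torch.bincount(targets.flatten(), minlength=codebook_size).float()
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]
    return torch.exp(-(nonzero * nonzero.log()).sum())

uniform = torch.randint(0, 8192, (200_000,))
skewed = torch.randint(0, 8192, (200_000,)) % 1000   # only 1000 distinct entries used
print(codebook_perplexity(uniform))                  # close to 8192
print(codebook_perplexity(skewed))                   # roughly 1000
```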

Thoughts on the relationship between codebook usage / accuracy and downstream task

Through my experiments, I have noticed that pre-training loss/accuracy does NOT indicate good or bad downstream performance. I also don't think unbalanced usage necessarily means low accuracy: if the targets are unbalanced, the model should learn that bias and still reach a high accuracy. Maybe some codebook entries are really close to each other and the model has trouble telling them apart.

So what leads to good downstream performance? This is a big question, and I don't think researchers fully understand it. I believe that because the model has to reconstruct masked speech based on the unmasked sections, it is still learning useful transformations/statistics about speech, even if the accuracy stays around 20%.

I'm not sure why it would be so low on the validation set though :(.

AraKchrUser commented 5 days ago

Hi, @whettenr! Thank you for responding to us and helping us understand such a complex and interesting topic. Have you come across any information about the ratio of data that should be used for pre-training and fine-tuning?

whettenr commented 3 days ago

Hey @AraKchrUser, no problem. No, I haven't seen much on the ratio of pre-training to fine-tuning data, sorry... (though there are probably some studies out there).

What I do know is that the closer your pre-training data is to your fine-tuning data, the better.

By closer, I'm talking about how similar the datasets are: are they recorded with the same equipment in similar acoustic environments, what is the speech like (fast or slow, spontaneous or audiobook), what is the language...

To give a simple example: if you pre-train only on English audiobook data, I would expect the model to need much less fine-tuning on English audiobook data and, conversely, much more fine-tuning to obtain good results on spontaneous Portuguese speech.