Question about training curves

Hey, sorry for bothering you with so many questions; hopefully this is one of the last.

I am training PickScore with exactly the configuration you provided (gradient accumulation is finally working!). However, during training I notice a weird staircase-like behaviour in my loss curves:

The sharp dips in loss occur exactly at the beginning of each epoch. I am unsure why this happens, and I have ruled out wandb logging issues as the main cause. One possible explanation is that the model is beginning to memorise the training samples, perhaps because it has very high capacity (980M params). These threads make similar arguments as to why this might be happening:

Did you see similar behaviour while training your original PickScore model on Pick-a-pic-v1? Would it be possible to share your training loss curves if they are available? I want to make sure the model is not fully overfitting to the train set. I did check the validation accuracy and it stays flat rather than getting worse, so perhaps it is fine?
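In case it helps to put a number on the staircase, here is a minimal sketch of how I would quantify it. It assumes the per-step training losses have been exported to a plain Python list and that the number of optimizer steps per epoch is known. The names `boundary_vs_within_drop`, `losses` and `steps_per_epoch` are my own, not part of the PickScore code, and the numbers below are synthetic, not from my run.

```python
from statistics import mean


def boundary_vs_within_drop(losses, steps_per_epoch):
    """Compare the loss drop across epoch boundaries with the drop within epochs.

    losses: per-step training loss values (hypothetical export, e.g. from a logger).
    steps_per_epoch: number of logged steps in one epoch (assumed known, >= 2).
    Returns (mean drop at epoch boundaries, mean drop between other steps).
    """
    if steps_per_epoch < 2 or len(losses) <= steps_per_epoch:
        raise ValueError("need steps_per_epoch >= 2 and more than one epoch of losses")

    boundary, within = [], []
    for step in range(1, len(losses)):
        drop = losses[step - 1] - losses[step]  # positive means the loss went down
        if step % steps_per_epoch == 0:
            boundary.append(drop)  # first step of a new epoch
        else:
            within.append(drop)
    return mean(boundary), mean(within)


# Synthetic staircase: slow decrease inside each epoch, sharp dip at each new epoch.
losses = [1.00, 0.99, 0.98, 0.97,
          0.80, 0.79, 0.78, 0.77,
          0.60, 0.59, 0.58, 0.57]
boundary_drop, within_drop = boundary_vs_within_drop(losses, steps_per_epoch=4)
print(f"boundary: {boundary_drop:.3f}, within: {within_drop:.3f}")
```

A boundary drop much larger than the within-epoch drop, together with a flat validation metric, would be consistent with the memorisation explanation, though it does not prove it.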

Would be great to hear your thoughts on this, thanks again!
