In finetune.py, there is:
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=256*28*28, max_pixels=512*28*28, padding_side="right")
This sets the padding side: when a batch is padded, padding tokens are added to the right of each shorter sample until it matches the length of the longest sample in the batch.
See https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_fast.py#L82 for more details.
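Below is a minimal sketch (not taken from finetune.py) of how this right-padding shows up when the processor tokenizes a text-only batch; the prompts and shapes are made up for illustration:

```python
from transformers import AutoProcessor

# Same arguments as in finetune.py; padding_side is forwarded to the tokenizer.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    min_pixels=256 * 28 * 28,
    max_pixels=512 * 28 * 28,
    padding_side="right",
)

# Two text samples of different lengths (hypothetical prompts).
texts = ["a short prompt", "a noticeably longer prompt with several more tokens"]
batch = processor(text=texts, padding=True, return_tensors="pt")

# Both rows now have the same length; the shorter sample is filled with
# pad tokens on the right, which shows up as trailing zeros in the mask.
print(batch["input_ids"].shape)
print(batch["attention_mask"])
```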