whwu95 / Text4Vis

【AAAI'2023 & IJCV】Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective
MIT License

About the dataloader. #9

ZMHH-H closed this issue 1 year ago

ZMHH-H commented 1 year ago

Hi, thanks for your great work! I notice that there are two ways to load the video data: (1) extracted frames and (2) on-the-fly decoding. Could you please provide some details about the difference in their loading speed? Also, how much disk space do the extracted frames take up (e.g., for K400)?

whwu95 commented 1 year ago

Hi, Minghao,

Thank you for your interest in our work. I haven't specifically benchmarked the two loaders, "extracted frames" and "on-the-fly decoding". In fact, all models in the paper were trained using extracted frames. However, if you are using a solid-state drive (SSD), the speed difference between the two loaders may not be significant. If you choose to train on extracted frames, the K400 dataset requires approximately 1 TB of disk space.
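Since the two loaders were not benchmarked here, one way to compare them on your own hardware is a small timing harness. The following is only a minimal sketch and is not part of Text4Vis: `samples_per_second` and `load_sample` are hypothetical names, and you would need to supply your own function that loads one sample from extracted frames and another that decodes one video on the fly.

```python
import time
from typing import Any, Callable, Iterable


def samples_per_second(
    load_sample: Callable[[Any], Any],
    items: Iterable[Any],
    warmup: int = 2,
) -> float:
    """Return how many samples per second `load_sample` processes.

    `load_sample` is a hypothetical callable that loads ONE sample
    (e.g. reads the frames of one video, or decodes one video file).
    The first `warmup` items are loaded but not timed, so one-off
    costs such as opening files for the first time are excluded.
    """
    items = list(items)
    if len(items) <= warmup:
        raise ValueError("need more items than warmup iterations")

    # Untimed warm-up pass.
    for item in items[:warmup]:
        load_sample(item)

    # Timed pass over the remaining items.
    timed = items[warmup:]
    start = time.perf_counter()
    for item in timed:
        load_sample(item)
    elapsed = time.perf_counter() - start

    return len(timed) / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Stand-in loader for illustration only; replace it with your own
    # frame-reading or video-decoding function.
    def dummy_loader(path: str) -> int:
        time.sleep(0.001)
        return len(path)

    rate = samples_per_second(dummy_loader, [f"video_{i}" for i in range(20)])
    print(f"{rate:.1f} samples/s")
```

Running the same list of videos through both loaders gives a like-for-like comparison. Note that the operating system's file cache can make a second run look faster than the first, so results from a cold start and from repeated runs may differ.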

ZMHH-H commented 1 year ago

Thank you for your reply. Your advice is helpful to me!