zjr2000 / GVL

Official implementation for the paper Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
https://arxiv.org/abs/2303.06378
MIT License
26 stars 6 forks

How to train my own data set? #5

Open lingyixia opened 12 months ago

lingyixia commented 12 months ago

Could you please share a pipeline for preparing a new dataset for training?

zjr2000 commented 12 months ago
  1. Prepare the annotation files:

     - `train_caption_file`: training corpus, refer to this file
     - `val_caption_file`: validation corpus, refer to this file
     - `eval_gt_file_for_grounding`: validation file for video grounding, refer to this file
     - `dict_file`: vocabulary file of your dataset, refer to this file

  2. Prepare the features: Gather each video's features into a single `.npy` file containing an array of shape L × D, where L is the temporal resolution and D is the feature dimension. Store all of these files in a single designated folder for streamlined access.

  3. Prepare the .yaml file: Create a configuration file for training by modifying the existing cfg file. You can start with the template provided at: Configuration File Template and adjust it using the annotation details mentioned above.
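For step 3, a hedged sketch of what the adjusted config might contain, based only on the annotation files named in step 1; the exact key names and the `feature_folder` entry are assumptions, so check them against the repo's template cfg file:

```yaml
# Sketch only -- key names must be verified against the template cfg.
train_caption_file: data/my_dataset/train.json
val_caption_file: data/my_dataset/val.json
eval_gt_file_for_grounding: data/my_dataset/val_grounding.json
dict_file: data/my_dataset/vocabulary.txt
feature_folder: data/my_dataset/features   # folder of per-video .npy files
```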
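For step 2, a minimal sketch of packing per-video features into `.npy` files might look like the following. The function name, the output layout, and the 500-dim example features are assumptions for illustration, not part of the repo:

```python
import os
import numpy as np

def save_video_features(video_id, features, out_dir):
    """Save one video's features as an (L, D) float32 .npy file.

    video_id : str  -- used as the file name (hypothetical convention)
    features : array-like of shape (L, D)
    out_dir  : folder collecting all videos' feature files
    """
    feats = np.asarray(features, dtype=np.float32)
    assert feats.ndim == 2, "expected shape (L, D)"
    os.makedirs(out_dir, exist_ok=True)
    np.save(os.path.join(out_dir, f"{video_id}.npy"), feats)

# Example: 100 temporal steps of hypothetical 500-dim features.
save_video_features("v_example", np.random.randn(100, 500), "features")
```

Each saved array can then be reloaded with `np.load("features/v_example.npy")` and fed to the data loader.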

hipeng-tech commented 3 months ago

Hi, thanks for your work. I have a question: in the train_caption_file, what does the "area" field stand for?