yuvalkirstain / PickScore

MIT License

The meaning of each column in the dataset #7

Closed: kkjh0723 closed this issue 11 months ago

kkjh0723 commented 11 months ago

Thanks for sharing the code and dataset!

I want to train the model with private data, but I cannot understand some of the fields in the pick-a-pic dataset. Specifically, what is the meaning of the following fields? ranking_id, user_id, num_example_per_prompt, __index_level_0__

Also, are all of these fields used to train the PickScore model? (It seems num_example_per_prompt is related to the inverse proportional weighting stated in the paper.)

yuvalkirstain commented 11 months ago

Thanks for reaching out! Of the fields you listed, the only one relevant for training the model is num_example_per_prompt; you can safely ignore the rest or fill in dummy values in their place.

For completeness:

  1. user_id is the id of the user that made the ranking.
  2. ranking_id is the id of the ranking, i.e. of the individual data point.
  3. num_example_per_prompt is how many rankings correspond to the prompt of the specific data point. It is indeed used when calculating the inverse proportional weighting stated in the paper.
  4. __index_level_0__ is an artifact from using HF datasets and can be removed.

We wanted to keep as many fields as possible to enable different use cases.
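
For anyone preparing private data along these lines, here is a minimal sketch in plain Python. It is not taken from the PickScore code: the helper name `add_pickapic_fields` is made up for illustration, and it assumes each row is a dict whose prompt lives under a `caption` key. It counts num_example_per_prompt per prompt, fills user_id and ranking_id with dummy values, and drops __index_level_0__.

```python
from collections import Counter


def add_pickapic_fields(rows, prompt_key="caption"):
    """Return copies of `rows` with the bookkeeping fields filled in.

    `rows` is a list of dicts; `prompt_key` names the field holding the
    prompt (an assumption made for this sketch).
    """
    # Number of rows (rankings) that share each prompt.
    counts = Counter(row[prompt_key] for row in rows)

    prepared = []
    for i, row in enumerate(rows):
        new_row = dict(row)
        # Artifact column; safe to drop if present.
        new_row.pop("__index_level_0__", None)
        # The one field from this list that matters for training.
        new_row["num_example_per_prompt"] = counts[row[prompt_key]]
        # Dummy values: one placeholder user and a running row id.
        new_row.setdefault("user_id", 0)
        new_row.setdefault("ranking_id", i)
        prepared.append(new_row)
    return prepared


rows = [
    {"caption": "a red fox", "__index_level_0__": 17},
    {"caption": "a red fox"},
    {"caption": "a blue car"},
]
prepared = add_pickapic_fields(rows)
print([r["num_example_per_prompt"] for r in prepared])  # [2, 2, 1]
```

Both "a red fox" rows get the value 2 because they share a prompt; the image and preference columns are left untouched.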

kkjh0723 commented 11 months ago

@yuvalkirstain Thanks for the quick reply. I have two follow-up questions regarding your answers.

  1. ranking_id is still unclear to me. How is it different from user_id?
  2. Does a "data point" in your third answer mean a row in the dataset? If so, should all data points with the same caption (prompt) have the same num_example_per_prompt?

yuvalkirstain commented 11 months ago

  1. ranking_id is still unclear to me. How is it different from user_id? - We maintained two tables: a table of users, in which each new user is assigned a user id, and a table of rankings. Each time a user chooses an image, we log the ranking with a new ranking id. You can also consider the ranking id to be the id of the data point/example.
  2. Does a "data point" in your third answer mean a row in the dataset? If so, should all data points with the same caption (prompt) have the same num_example_per_prompt? - Yes to both questions :)
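
To make the weighting concrete, here is a small sketch of how num_example_per_prompt could feed an inverse proportional weight. The exact formulation from the paper is not reproduced in this thread, so this assumes the simplest reading: each example is weighted by 1 / num_example_per_prompt. The function names, the made-up losses, and the normalization by the sum of weights are all assumptions for illustration.

```python
def inverse_prompt_weights(num_example_per_prompt):
    """Weight each example by 1 / (number of examples sharing its prompt)."""
    return [1.0 / n for n in num_example_per_prompt]


def weighted_mean(losses, weights):
    """Weighted average of per-example losses (normalization is an assumption)."""
    total = sum(loss * w for loss, w in zip(losses, weights))
    return total / sum(weights)


# Three rows: the first two share a prompt, the third has its own.
counts = [2, 2, 1]
losses = [0.4, 0.8, 0.2]  # made-up per-example losses

weights = inverse_prompt_weights(counts)
print(weights)  # [0.5, 0.5, 1.0]
print(weighted_mean(losses, weights))
```

With these weights, each prompt contributes the same total weight (1.0) no matter how many rankings it has, so heavily ranked prompts do not dominate the loss.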

kkjh0723 commented 11 months ago

Thanks for the answer!