yuvalkirstain / PickScore

MIT License

The meaning of each column in the dataset #7

Closed: kkjh0723 closed 1 year ago

kkjh0723 commented 1 year ago

Thanks for sharing the code and dataset!

I want to train the model on private data, but I don't understand some of the fields in the Pick-a-Pic dataset. Specifically, what do the following fields mean: ranking_id, user_id, num_example_per_prompt, and __index_level_0__?

Also, are all of these fields used to train the PickScore model? (It seems num_example_per_prompt is related to the inverse proportional weighting described in the paper.)

yuvalkirstain commented 1 year ago

Thanks for reaching out! The only field relevant for training the model is num_example_per_prompt; you can safely ignore the rest or fill in dummy values in their place.

For completeness:

  1. user_id is the id of the user who made the ranking.
  2. ranking_id is the id of the ranking, i.e. of the individual data point.
  3. num_example_per_prompt is the number of rankings that share the prompt of the given data point. It is indeed used when computing the inverse proportional weighting described in the paper.
  4. __index_level_0__ is an artifact of using HF datasets and can be removed.

We wanted to keep as many fields as possible to enable different use cases.
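
To make the role of num_example_per_prompt concrete, here is a minimal sketch of inverse proportional weighting in plain Python. It is not the code from this repository: the function names are hypothetical, and normalizing by the sum of the weights (rather than, say, the batch size) is an assumption made for illustration. The idea shown is only that each row is weighted by 1 / num_example_per_prompt, so a prompt with many rankings does not dominate the loss.

```python
def inverse_frequency_weights(num_example_per_prompt):
    """Weight each row by 1 / (number of rows that share its prompt)."""
    return [1.0 / n for n in num_example_per_prompt]


def weighted_mean_loss(losses, num_example_per_prompt):
    """Weighted average of per-row losses.

    Assumption: we normalize by the sum of the weights, so every prompt
    contributes equally regardless of how many rankings it has.
    """
    weights = inverse_frequency_weights(num_example_per_prompt)
    weighted_sum = sum(w * loss for w, loss in zip(weights, losses))
    return weighted_sum / sum(weights)


# Three rows: the first two share one prompt, the third has its own prompt.
losses = [0.2, 0.4, 0.9]
counts = [2, 2, 1]
print(weighted_mean_loss(losses, counts))  # approximately 0.6
```

With uniform weights the mean would be 0.5; with the weighting above, the two rows of the repeated prompt together count as much as the single row of the other prompt.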

kkjh0723 commented 1 year ago

@yuvalkirstain Thanks for the quick reply. I have two follow-up questions regarding your answers.

  1. ranking_id is still unclear to me. How does it differ from user_id?
  2. Does a "data point" in your third answer mean a row in the dataset? If so, should all data points that have the same caption (prompt) have the same num_example_per_prompt?

yuvalkirstain commented 1 year ago

  1. On how ranking_id differs from user_id: we maintained two tables. One is a table of users, where each new user is assigned a user id. The other is a table of rankings: each time a user chooses an image, we log the ranking with a new ranking id. You can also think of the ranking id as the id of the data point/example.
  2. On whether a "data point" is a row in the dataset, and whether all rows with the same caption (prompt) should have the same num_example_per_prompt: yes to both :)
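
For private data, the answers above suggest a simple recipe: count the rows per caption to get num_example_per_prompt, give each row its own ranking_id, and fill user_id with a dummy value. The sketch below illustrates this on a list of dictionaries. The helper name, the default "caption" key, and the dummy user id 0 are assumptions for illustration, not part of the PickScore code.

```python
from collections import Counter


def add_pickapic_style_fields(rows, prompt_key="caption"):
    """Return copies of `rows` with the bookkeeping fields filled in.

    Assumptions: each row is a dict, and `prompt_key` names the prompt column.
    """
    # Every row with the same prompt gets the same count.
    counts = Counter(row[prompt_key] for row in rows)
    prepared = []
    for ranking_id, row in enumerate(rows):
        # Drop the index artifact if it is present.
        new_row = {k: v for k, v in row.items() if k != "__index_level_0__"}
        new_row["num_example_per_prompt"] = counts[row[prompt_key]]
        new_row["ranking_id"] = ranking_id  # one unique id per row
        new_row.setdefault("user_id", 0)  # dummy value, not used for training
        prepared.append(new_row)
    return prepared


rows = [
    {"caption": "a red fox", "__index_level_0__": 17},
    {"caption": "a red fox"},
    {"caption": "a blue car"},
]
prepared = add_pickapic_style_fields(rows)
for row in prepared:
    print(row)
```

The input rows are left unmodified; only the returned copies carry the new fields.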

kkjh0723 commented 1 year ago

Thanks for the answer!