Trying to execute the data_selection script

baominhlt commented 2 years ago

Hi yaoxingcheng, your work is great. I ran your data_selection script with your example source and target data. When it finished, it produced a selected-data file with three columns: "text", "id", and "rank".

However, when I looked at your released dataset, I found that it has a different column: "label".

Does the "rank" column created by your original script correspond to the "label" column in your released datasets? Sincerely, baominhlt

yaoxingcheng commented 2 years ago

Hi~ Thanks for your interest. The "rank" column records the rank of an external data sample when it is retrieved from the source corpus by a task sample, with results ordered by BM25 score. For example, if query A retrieves A1, A2, and A3 with similarity scores of 0.9, 0.8, and 0.7 respectively, then A1 has rank 0, A2 has rank 1, and A3 has rank 2. If a source sample is retrieved by multiple target samples, we keep only the minimum rank.

Unlike id, which is unique for each retrieved sample, a rank value may be shared by multiple retrieved samples. We hope rank can roughly characterize the distance between a retrieved sample and the task corpus, so that we can make use of it in the future. Since we haven't used it yet, the rank column is not included in our released dataset.
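The rank assignment described above can be sketched roughly as follows. This is only an illustration, not the actual data_selection code: the function name `assign_min_ranks`, the input layout, and the second query "B" are hypothetical, and the scores are supplied by hand rather than computed with BM25.

```python
from typing import Dict, List, Tuple


def assign_min_ranks(
    retrievals: Dict[str, List[Tuple[str, float]]]
) -> Dict[str, int]:
    """Map each retrieved source sample to its minimum rank over all queries.

    `retrievals` maps a query (task sample) id to a list of
    (source_id, similarity_score) pairs. This layout is an assumption
    made for the sketch.
    """
    ranks: Dict[str, int] = {}
    for hits in retrievals.values():
        # A higher score means a closer match, so rank 0 is the best hit.
        ordered = sorted(hits, key=lambda hit: hit[1], reverse=True)
        for rank, (source_id, _score) in enumerate(ordered):
            # Keep the smallest rank seen so far for this source sample.
            if source_id not in ranks or rank < ranks[source_id]:
                ranks[source_id] = rank
    return ranks


retrievals = {
    # The example from the comment above: query A retrieves A1, A2, A3.
    "A": [("A1", 0.9), ("A2", 0.8), ("A3", 0.7)],
    # Hypothetical second query that also retrieves A3, as its best hit.
    "B": [("A3", 0.95), ("B1", 0.6)],
}

ranks = assign_min_ranks(retrievals)
print(ranks)  # {'A1': 0, 'A2': 1, 'A3': 0, 'B1': 1}
```

A3 has rank 2 for query A but rank 0 for query B, so it ends up with rank 0. A1 and A3 then share the same rank value even though their ids differ, which is why rank is not unique per retrieved sample.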

baominhlt commented 2 years ago

Thanks. I got it. Besides, how do you assign the label value for each row in your released datasets?

yaoxingcheng commented 2 years ago

What's shown in your screenshot from the released dataset is the task data instead of the external data retrieved by task data. The "label" column is just the annotated label in the task dataset.

baominhlt commented 2 years ago

Got it. Thanks yaoxingcheng. :3

yaoxingcheng / TLM

Trying to execute the data_selection script #12