Closed baominhlt closed 2 years ago
Hi~ Thanks for your attention. The "rank" column records the rank of an external data sample when a task sample retrieves it from the source corpus, with retrieved samples ordered by BM25 score. For example, if query A retrieves A1, A2, and A3 with similarity scores of 0.9, 0.8, and 0.7 respectively, then A1 has rank 0, A2 has rank 1, and A3 has rank 2. If a source sample is retrieved by multiple target samples, we just take the minimum rank.
Unlike id, which is unique for each retrieved sample, a rank value may be shared by multiple retrieved samples. We hope rank can roughly characterize the distance between a retrieved sample and the task corpus, so that we can make use of it in the future. Since we don't use it yet, the rank column is not included in our released dataset.
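The rank assignment described above can be sketched roughly as follows. This is only an illustration, not the actual data_selection script: the function name `assign_min_ranks`, the input layout (query id mapped to a list of `(source_id, score)` pairs), and the second query "B" are all assumptions made up for the example, and the scores are the illustrative values from the explanation rather than real BM25 output.

```python
from typing import Dict, List, Tuple


def assign_min_ranks(
    retrievals: Dict[str, List[Tuple[str, float]]]
) -> Dict[str, int]:
    """Map each retrieved source id to its minimum rank across all queries.

    Hypothetical helper: `retrievals` maps a query (task sample) id to the
    (source_id, score) pairs it retrieved.
    """
    min_rank: Dict[str, int] = {}
    for hits in retrievals.values():
        # Sort by descending score so the best match for this query gets rank 0.
        ordered = sorted(hits, key=lambda hit: hit[1], reverse=True)
        for rank, (source_id, _score) in enumerate(ordered):
            # Keep the smallest rank seen for this source sample so far.
            if source_id not in min_rank or rank < min_rank[source_id]:
                min_rank[source_id] = rank
    return min_rank


# Query A is the example from the explanation; query B is a made-up second
# query that also retrieves A3, to show the minimum-rank rule.
retrievals = {
    "A": [("A1", 0.9), ("A2", 0.8), ("A3", 0.7)],
    "B": [("A3", 0.95), ("B1", 0.6)],
}
ranks = assign_min_ranks(retrievals)
print(ranks)  # {'A1': 0, 'A2': 1, 'A3': 0, 'B1': 1}
```

Here A3 has rank 2 for query A but rank 0 for query B, so it ends up with rank 0. A1 and A3 then share the same rank value, which is why rank is not unique the way id is.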
Thanks. I got it. Besides, how do you assign the label value for each row in your released datasets?
What's shown in your screenshot from the released dataset is the task data, not the external data retrieved by the task data. The "label" column is just the annotated label in the task dataset.
Got it. Thanks yaoxingcheng. :3
Hi yaoxingcheng, your work is great. I ran your data_selection script with your example source and target data. When it finished, it had created a selected-data file with 3 columns: "text", "id", and "rank".
However, when I looked at your released dataset, I found that it has a different column, "label".
I wonder whether the "rank" column created by your original script corresponds to the "label" column in your released datasets. Sincerely, baominhlt