yh1105 / datasetforTBCCD


How to generate similarity.txt file for big clone bench data #2

Closed nikitamehrotra12 closed 4 years ago

nikitamehrotra12 commented 4 years ago

Hi,

I am retraining TBCCD on another subset of big clone bench. Could you please share the script you used for generating this similarity.txt file?

And also how did you convert the entire content of h2 db (for big clone bench) into text format.

Thanks in advance.

yh1105 commented 4 years ago

Hi, thanks for your question. similarity.txt was given to me by the CDLH authors. Since I couldn't obtain the 9,134 code fragments mentioned in the CDLH paper, I emailed the CDLH authors for similarity.txt so that I could compare with CDLH on the same data. I have no way to regenerate similarity.txt myself, but I have confirmed that the true/false clone labels in it are correct. If you need to retrain TBCCD, you can directly provide data in the format "codefragment1 \t codefragment2 \t label".
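A minimal sketch of reading data in that tab-separated format (the function name and file layout here are hypothetical, not taken from the TBCCD repo):

```python
def load_clone_pairs(path):
    """Read lines of the form: codefragment1 \t codefragment2 \t label."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            frag1, frag2, label = line.split("\t")
            pairs.append((frag1, frag2, int(label)))
    return pairs
```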

"And also how did you convert the entire content of h2 db (for big clone bench) into text format." I'm not sure I understand this question — what does H2 mean?

nikitamehrotra12 commented 4 years ago

Hi,

Many thanks for your reply and information.

H2 is the database viewer for the BigCloneBench data. All the code files in BigCloneBench are in .java format, and for TBCCD I need .txt files. So I was wondering if you have a script for converting all the BigCloneBench files to text format.

Also, do you have the code for the CDLH tool as well? I searched but unfortunately couldn't find it on the internet, so I would be grateful if you could share it with me.

Thanks for your help :).

yh1105 commented 4 years ago

I see your question. You can use the .java format for TBCCD directly; both .java and .txt work for TBCCD.

The authors of CDLH didn't publish their source code. If I could get the source code of CDLH, I would not use the dataset from CDLH, because I also don't know how they built the dataset (similarity.txt and function.txt).

If you have any other questions, let me know. Thanks for your interest in our work.

nikitamehrotra12 commented 4 years ago

Many thanks for your help.

nikitamehrotra12 commented 4 years ago

Hi,

How did you divide the 7,500 files from the POJ dataset into 6,500 training files and 500/500 dev and test files respectively?

yh1105 commented 4 years ago

The POJ dataset contains 15 problems, each with 500 files, so 15 × 500 = 7,500 files in total. I shuffle a list of the 7,500 files, then randomly select 500 files for dev, 500 files for test, and 6,500 files for training. Note that the CDLH authors told me they use 500 files for test, so I also ran the experiment with 500 test files, although I think this split is not reasonable: the test set is too small. If you want the same 6,500/500/500 split as TBCCD, you can enumerate the dataset to get the full file list and deduplicate it with set(list). If you need me to provide the file to you, just let me know and I will help you. In addition, apart from the comparison with CDLH, I use an 8:1:1 split for training, dev, and test.
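The shuffle-and-split described above can be sketched as follows (a generic sketch; the function name, file-naming scheme, and fixed seed are illustrative, not from the TBCCD repo):

```python
import random

def split_poj(files, n_dev=500, n_test=500, seed=0):
    """Shuffle the file list, then take dev/test slices; the rest is training."""
    files = list(files)
    random.Random(seed).shuffle(files)
    dev = files[:n_dev]
    test = files[n_dev:n_dev + n_test]
    train = files[n_dev + n_test:]
    return train, dev, test

# 15 problems x 500 files each = 7,500 files (hypothetical file names)
files = [f"{p}/{i}.txt" for p in range(1, 16) for i in range(1, 501)]
train, dev, test = split_poj(files)
print(len(train), len(dev), len(test))  # prints: 6500 500 500
```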

nikitamehrotra12 commented 4 years ago

Okay. For the CDLH data I was also splitting in an 8:1:1 ratio, hence my numbers were different from the ones reported in the repository. That's why I was curious.

Thanks for the explanation.

nikitamehrotra12 commented 4 years ago

Hi

I was trying to generate ASTs for the BigCloneBench data provided with the source code, but I am getting an error with the Java parser. Could you please help me with this?

The error that I am getting is in the attached screenshot (download_20200104_215018).

yh1105 commented 4 years ago

I assume the problem is due to a different javalang version. Are you using the same javalang parser that I pushed to the repo?

nikitamehrotra12 commented 4 years ago

Yes...

nikitamehrotra12 commented 4 years ago

It's the same.

navdhaagarwal commented 4 years ago

Hi,

I am using the same Java parser provided with the repo, and I am getting the error attached below. It is not able to recognize basic keywords like 'package' or 'import'. I tried printing the statement where the error arises, and it is on the first line (the package declaration) of the program.

(error screenshot attached: Annotation 2020-01-06 223754)

(screenshot of the program attached)

yh1105 commented 4 years ago

Sorry for not responding in time. I've been sick these days and only now have the strength to answer. First, many researchers have successfully run our code. Regarding your question, I think you have the javalang package installed in your Python environment: when you run the program, it imports the locally installed javalang, not the javalang in my repo. Please try `pip uninstall javalang` first, then run the program again. If there are still problems, let's solve them together.
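One quick way to check which copy of a package Python is actually importing is to print the module's origin path (a generic sketch; you would pass `"javalang"` here — `"json"` is used in the example only so it runs anywhere):

```python
import importlib.util

def module_origin(name):
    """Return the file path Python would load for the named module, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None

# For the issue above, call module_origin("javalang") and check whether the
# path points into site-packages (pip install) or into the TBCCD repo checkout.
print(module_origin("json"))
```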

nikitamehrotra12 commented 4 years ago

Thanks for your reply. Hope you are feeling better now.

I will try this and will get back to you if this doesn't work.

navdhaagarwal commented 4 years ago

Hi, the probable cause is that javalang is not able to recognize the keywords 'package' and 'import'. Hence any source file that declares a package or has import statements throws an error. Once we remove these statements, the Java parser works well.

Thank you

navdhaagarwal commented 4 years ago

Hi,

In your paper, you report using the first version of BigCloneBench, which has 9,134 code fragments and contains 6 million true clone pairs and 260,000 false clone pairs. Once I run getTrainDevTestDataPairForBCB.py to generate the clone and non-clone pairs, I get 3 files, namely recordtraindataBCB.txt, recordtestdataBCB.txt, and recorddevdataBCB.txt. The number of entries in each of the files is 33,068,778, 124,750, and 124,750 respectively.

Shouldn't the total number of entries (clones and non-clones) be around 6 million + 260,000?

Thanks

yh1105 commented 4 years ago

Hi, for comparison with CDLH, I use the same data as CDLH. The following words are from the CDLH paper:

"Specifically, BigCloneBench consists of projects from 25,000 systems, covers 10 functionalities including 6,000,000 true clone pairs and 260,000 false clone pairs. Note that all those clone types are given by domain experts. We discard code fragments without any tagged true and false clone pairs, and use the remaining 9,134 code fragments".

I manually confirmed that these 9,134 code snippets are correct (I was unable to obtain 1 of the snippets, so I have 9,133). The total of around 6 million + 260,000 entries (clones and non-clones) in BigCloneBench is correct.

In our paper, we use the same data as CDLH: we first selected the 9,134 code fragments, then split the fragments into training, dev, and test sets, and finally constructed the clone and non-clone pairs within each set. So we get 33,068,778, 124,750, and 124,750 pairs for training, dev, and testing respectively: 33,068,778 = 8,133 × 8,132 / 2 (the 9,133 available fragments minus 500 dev and 500 test leaves 8,133 training fragments) and 124,750 = 500 × 499 / 2.
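The counts follow from taking all unordered pairs within each split, i.e. n choose 2. A quick arithmetic check:

```python
def n_pairs(n):
    """Number of unordered pairs among n code fragments: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# 9,133 usable fragments minus 500 dev and 500 test leaves 8,133 for training
print(n_pairs(8133))  # 33068778 (training pairs)
print(n_pairs(500))   # 124750 (dev pairs; same for test)
```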