zhengdongzd / butterfly-core

multi-labeled dataset

butterfly-core

We construct a new research collaboration network dataset from "DBLP-Citation-network V12" on Aminer. The constructed dataset has 144,334 vertices, 1,821,930 edges, and a total of 7 vertex labels. Each vertex represents an author. For each vertex, we count the author's published papers per research field and take the field in which the author has published the most papers as the vertex label.

The label of a vertex represents the main research field of the author, which is one of "Database", "Machine Learning", "Systems and Networking", "Theory", "Data Mining", "Natural Language Processing" and "Computer Vision". We use the following conferences to define each research field, treat all other venues as "others", and filter out the authors who do not belong to any of the 7 research fields.

Database: SIGMOD, ICDE, VLDB, PODS, ICDT, EDBT;
Machine Learning: NeurIPS, ICML, COLT, UAI, AISTATS;
Systems and Networking: OSDI, SOSP, NSDI, ISCA, ASPLOS, SIGCOMM;
Theory: STOC, SODA, FOCS;
Data Mining: SIGKDD, CIKM, WSDM, ICDM, SDM, WWW;
Natural Language Processing: ACL, EMNLP, NAACL;
Computer Vision: CVPR, ECCV, ICCV.
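
The labeling rule described above can be sketched in a few lines of Python. This is only an illustration, not the script used to build the dataset: the names `FIELD_VENUES`, `VENUE_TO_FIELD`, and `label_author` are hypothetical, the input is assumed to be a list of venue abbreviations (one per paper), and ties are broken by whichever field is encountered first, since the tie-breaking rule is not specified here.

```python
from collections import Counter

# Venue lists copied from the field definitions above.
FIELD_VENUES = {
    "Database": ["SIGMOD", "ICDE", "VLDB", "PODS", "ICDT", "EDBT"],
    "Machine Learning": ["NeurIPS", "ICML", "COLT", "UAI", "AISTATS"],
    "Systems and Networking": ["OSDI", "SOSP", "NSDI", "ISCA", "ASPLOS", "SIGCOMM"],
    "Theory": ["STOC", "SODA", "FOCS"],
    "Data Mining": ["SIGKDD", "CIKM", "WSDM", "ICDM", "SDM", "WWW"],
    "Natural Language Processing": ["ACL", "EMNLP", "NAACL"],
    "Computer Vision": ["CVPR", "ECCV", "ICCV"],
}

# Reverse lookup: venue abbreviation -> research field.
VENUE_TO_FIELD = {
    venue: field for field, venues in FIELD_VENUES.items() for venue in venues
}


def label_author(paper_venues):
    """Return the field with the most papers, or None if the author has
    no paper in any of the 7 fields (such authors are filtered out).

    Assumption: ties go to the field seen first in `paper_venues`.
    """
    counts = Counter(
        VENUE_TO_FIELD[v] for v in paper_venues if v in VENUE_TO_FIELD
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

For example, an author with papers at SIGMOD, VLDB, and ICML would be labeled "Database", while an author who published only at venues outside the list would get `None` and be dropped.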

There are three files under the data folder: "edges.txt" is the graph edge list, "vertex_to_field.txt" maps each vertex to its label, and "vertex_to_name.txt" maps each vertex to the corresponding author name.
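
A minimal loading sketch follows. The exact file layout is not documented here, so the code makes assumptions that should be checked against the actual files: each line of "edges.txt" is assumed to hold two whitespace-separated vertex IDs, and each line of the two vertex files is assumed to hold a vertex ID followed by the value (field or name, which may contain spaces). The helper names `parse_edges` and `parse_vertex_map` are hypothetical.

```python
def parse_edges(lines):
    """Parse an edge list, assuming two whitespace-separated IDs per line."""
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        edges.append((parts[0], parts[1]))
    return edges


def parse_vertex_map(lines):
    """Parse 'vertex_id value' lines; the value may contain spaces
    (assumed layout, e.g. a field such as 'Machine Learning')."""
    mapping = {}
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        if len(parts) == 2:
            mapping[parts[0]] = parts[1]
    return mapping


# Usage, assuming the repository's data folder is in the working directory:
# with open("data/edges.txt", encoding="utf-8") as f:
#     edges = parse_edges(f)
# with open("data/vertex_to_field.txt", encoding="utf-8") as f:
#     vertex_to_field = parse_vertex_map(f)
```

Vertex IDs are kept as strings so that the sketch does not depend on whether the files use numeric or textual identifiers.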