wyl7 / DCI-pytorch

The pytorch implementation of DCI (SIGIR2021).

Request for the other 3 datasets #2

Closed tanxiaoyan12345 closed 2 years ago

tanxiaoyan12345 commented 2 years ago

Hello, I'm very interested in your work, and I want to run some experiments on the datasets. I noticed that there is only one dataset in this repository, so I was wondering if you could provide the other 3 datasets used in the paper?

Thank you very much!

wyl7 commented 2 years ago

Hello xiaoyan, thanks for your interest! I am not the publisher of these datasets; my paper provides access to all of them. If you have any questions about the data preprocessing, feel free to email me 😁

Thank you!

mpanpan commented 2 years ago

Hello, I'm very interested in your work. I downloaded the datasets from the corresponding papers and preprocessed them to match the format of wiki and wiki_label. But when I run init_feats.py to generate the node features, I get an error: ValueError: row index exceeds matrix dimensions. Have you ever run into this problem, and how did you process the datasets? Thank you!

mpanpan commented 2 years ago

Hello, the datasets I downloaded from the links you provided have a different number of nodes from what the paper reports. So I was wondering how you processed the datasets. Thank you!

wyl7 commented 2 years ago

> Hello, I'm very interested in your work. I downloaded the datasets from the corresponding papers and preprocessed them to match the format of wiki and wiki_label. But when I run init_feats.py to generate the node features, I get an error: ValueError: row index exceeds matrix dimensions. Have you ever run into this problem, and how did you process the datasets? Thank you!

Hello, thanks for your interest in our work. All of the code runs normally. You may need to understand the code before running it to avoid the bug you hit. Note that the first column in dataName.txt holds the user IDs, while the second column holds the item IDs.

wyl7 commented 2 years ago

> Hello, the datasets I downloaded from the links you provided have a different number of nodes from what the paper reports. So I was wondering how you processed the datasets. Thank you!

I described how we preprocessed the datasets in my paper. Because the Amazon dataset is large, we sampled a subgraph from it to conduct the experiments. For Reddit, Wiki, and Alpha, we have double-checked, and the number of nodes equals that reported in the original paper. Thanks!

wyl7 commented 2 years ago

Some tips for preprocessing your dataset.

dataName.txt stores the edges of the graph in the format: node_id, node_id. The first column is the user ID and the second column is the item ID. If the dataset contains N users, each user's node ID is between 0 and N-1.

dataName_label.txt stores the user labels in the format: node_id, label. The label is binary: 1 if the user is abnormal and 0 otherwise.
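As a rough illustration of the format above, here is a minimal sketch that maps arbitrary raw identifiers (for example, string IDs) to contiguous integer IDs, so that N users end up with IDs 0 to N-1. The function name `remap_edges` and the sample item IDs are hypothetical, and this is not the repository's actual preprocessing. The thread does not say how item IDs should be numbered; the sketch numbers them from 0 in their own range, which is an assumption.

```python
def remap_edges(raw_edges):
    """Map raw (user, item) identifiers to contiguous integer IDs.

    Hypothetical helper, not part of DCI-pytorch. Users are numbered
    0..N-1 in order of first appearance. Items are numbered 0..M-1 in a
    separate range (an assumption; the thread does not specify this).
    """
    user_ids = {}
    item_ids = {}
    edges = []
    for raw_user, raw_item in raw_edges:
        # setdefault assigns the next free integer on first sight of a key.
        u = user_ids.setdefault(raw_user, len(user_ids))
        i = item_ids.setdefault(raw_item, len(item_ids))
        edges.append((u, i))
    return edges, user_ids, item_ids


# Example with made-up item IDs; the user ID string is the one quoted in
# this thread.
raw = [
    ("A3SGXH7AUHU8GW", "item_x"),
    ("A1", "item_x"),
    ("A3SGXH7AUHU8GW", "item_y"),
]
edges, user_ids, item_ids = remap_edges(raw)
```

The returned `user_ids` mapping can also be used to translate the keys of a raw label table so that dataName_label.txt refers to the same integer user IDs as the edge file.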

mpanpan commented 2 years ago

Thank you for your prompt reply. I have the following two questions.

  1. I debugged the Alpha dataset (https://cs.stanford.edu/~srijan/rev2/) and found that max(row) and max(col) are both 7604, but the number of nodes is 7040. This triggers "ValueError('row index exceeds matrix dimensions')" because the check "self.row.max() >= self.shape[0]" is true. I read the code and tried many approaches but could not solve it. How should I deal with this?

  2. The node IDs in the Amazon dataset are strings. Converting them to int fails with "ValueError: could not convert string to float: 'A3SGXH7AUHU8GW'". I searched online for related problems and read the code, but could not resolve it.

So I wonder if I could discuss with you how you dealt with these problems, or whether you could send me these two datasets. My email is 1945775878@qq.com.

Thanks for reading!
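One possible cause of the first error, not confirmed anywhere in this thread, is that the raw IDs have gaps: the largest ID (7604) can then exceed the number of distinct nodes (7040), and any matrix sized by the node count rejects it. The sketch below is a small diagnostic for that situation. The name `check_ids` is hypothetical and the sample values are made up for illustration.

```python
def check_ids(ids, num_nodes):
    """Report whether integer IDs fit in a matrix with num_nodes rows.

    Hypothetical diagnostic, not part of DCI-pytorch. An ID fits only if
    it is strictly less than num_nodes (0-based indexing).
    """
    max_id = max(ids)
    return {
        "distinct": len(set(ids)),   # how many different IDs appear
        "max_id": max_id,            # largest ID value seen
        "fits": max_id < num_nodes,  # False means an index error is likely
    }


# Three distinct IDs with gaps: the count is 3 but the largest ID is 5.
report = check_ids([0, 2, 5, 2], num_nodes=3)
```

If `fits` is False while `distinct` equals the expected node count, renumbering the IDs to a contiguous range is one way to make them consistent with the matrix shape.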

wyl7 commented 2 years ago

I have emailed you. Thanks!

wyl7 commented 2 years ago

We have double-checked the Alpha dataset. It has 7040 nodes, including 3286 users and 3754 products, which matches the statistics provided in the original paper: https://cs.stanford.edu/~srijan/pubs/rev2-wsdm18.pdf. We do not know what causes your bug. You may need to check your code again. Thank you! 😄

Best regards