yueliu1999 / Dink-Net

[ICML 2023] An official source code for paper "Dink-Net: Neural Clustering on Large Graphs".
MIT License

Dink-Net: Neural Clustering on Large Graphs

[Yue Liu](https://yueliu1999.github.io/)¹,², [Ke Liang](https://liangke23.github.io/)¹, [Jun Xia](https://junxia97.github.io/)², [Sihang Zhou](https://scholar.google.com/citations?user=p9Se8kYAAAAJ&hl=zh-CN&oi=ao/)¹, [Xihong Yang](https://xihongyang1999.github.io/)¹, [Xinwang Liu](https://xinwangliu.github.io/)¹, [Stan Z. Li](https://scholar.google.com/citations?user=Y-nyLGIAAAAJ&hl=zh-CN&oi=ao)²

¹[National University of Defense Technology](https://english.nudt.edu.cn/), ²[Westlake University](https://westlake.edu.cn/)

Deep graph clustering, which aims to group the nodes of a graph into disjoint clusters with deep neural networks, has achieved promising progress in recent years. However, existing methods fail to scale to large graphs with millions of nodes. To solve this problem, a scalable deep graph clustering method (Dink-Net) is proposed, built on the idea of dilation and shrink. First, representations are learned in a self-supervised manner by discriminating whether nodes have been corrupted by augmentations. Meanwhile, the cluster centers are initialized as learnable neural parameters. Subsequently, the clustering distribution is optimized by minimizing the proposed cluster dilation loss and cluster shrink loss in an adversarial manner. With these settings, we unify the two-step clustering, i.e., representation learning and clustering optimization, into an end-to-end framework that guides the network to learn clustering-friendly features. In addition, Dink-Net scales well to large graphs because the designed loss functions optimize the clustering distribution on mini-batch data without performance drops. Both experimental results and theoretical analyses demonstrate the superiority of our method.
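The dilation and shrink idea can be illustrated with a small, dependency-free sketch. It assumes squared Euclidean distances, a dilation term that rewards cluster centers for moving apart, and a shrink term that pulls a mini-batch of embeddings toward the centers. The function names and toy data are hypothetical; the exact losses are defined in the paper and in the repository code, not here.

```python
# Illustrative sketch only: squared Euclidean distance is an assumption,
# and every name below is hypothetical (not taken from the repository).

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dilation_loss(centers):
    """Negative mean squared distance over ordered pairs of distinct centers.

    Minimizing this value pushes the cluster centers apart.
    """
    k = len(centers)
    total = sum(
        sq_dist(centers[i], centers[j])
        for i in range(k)
        for j in range(k)
        if i != j
    )
    return -total / (k * (k - 1))


def shrink_loss(batch, centers):
    """Mean squared distance from each embedding in the batch to every center.

    Only a mini-batch is needed, so the per-step cost depends on the batch
    size rather than on the number of nodes in the graph.
    """
    total = sum(sq_dist(h, c) for h in batch for c in centers)
    return total / (len(batch) * len(centers))


def total_loss(batch, centers):
    """Sum of the two opposing terms for one mini-batch."""
    return dilation_loss(centers) + shrink_loss(batch, centers)


centers = [[0.0, 0.0], [2.0, 0.0]]   # two toy cluster centers
batch = [[0.0, 1.0], [2.0, 1.0]]     # two toy node embeddings
print(dilation_loss(centers))        # -4.0
print(shrink_loss(batch, centers))   # 3.0
```

In a real training loop the centers and the encoder would be updated by gradient descent on such a combined objective; this sketch only evaluates the two terms.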


Table of Contents
  1. Usage
  2. Acknowledgements
  3. Citations

Usage

Datasets

| Dataset | Type | # Nodes | # Edges | # Feature Dimensions | # Classes |
| --- | --- | --- | --- | --- | --- |
| Cora | Attribute Graph | 2,708 | 5,278 | 1,433 | 7 |
| CiteSeer | Attribute Graph | 3,327 | 4,614 | 3,703 | 6 |
| Amazon-Photo | Attribute Graph | 7,650 | 119,081 | 745 | 8 |
| ogbn-arxiv | Attribute Graph | 169,343 | 1,166,243 | 128 | 40 |
| Reddit | Attribute Graph | 232,965 | 23,213,838 | 602 | 41 |
| ogbn-products | Attribute Graph | 2,449,029 | 61,859,140 | 100 | 47 |
| ogbn-papers100M | Attribute Graph | 111,059,956 | 1,615,685,872 | 128 | 172 |

Requirements

The code is tested on Python 3.7.

dgl-cu113==0.9.1.post1
munkres==1.1.4
networkx==2.8.3
numpy==1.23.2
scikit_learn==1.3.0
scipy==1.6.0
torch==2.0.1
torch-scatter==2.0.9
torch-sparse==0.6.12
torch-spline-conv==1.2.1
torch-geometric==2.1.0.post1
tqdm==4.65.0
wandb==0.15.4
ogb==1.3.6
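Exact pins like the ones above can be compared against the current environment before training. The following is a minimal sketch, assuming Python 3.8 or newer (where `importlib.metadata` is part of the standard library); `parse_pins` and `check_pins` are hypothetical helper names, and only three of the pins are copied in to keep the example short.

```python
# Hypothetical helpers: compare pinned "name==version" lines with what is installed.
# Assumes Python 3.8+ (importlib.metadata is in the standard library from 3.8).
from importlib import metadata


def parse_pins(text):
    """Return {package: version} from lines of the form 'name==version'."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, version = line.partition("==")
        if not sep:
            raise ValueError(f"not an exact pin: {line!r}")
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {package: (wanted, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


PINS = """
munkres==1.1.4
networkx==2.8.3
tqdm==4.65.0
"""

for name, (wanted, installed) in check_pins(parse_pins(PINS)).items():
    print(f"{name}: wanted {wanted}, found {installed}")
```

The comparison is a plain string match, so a build with a local suffix (for example a CUDA tag appended to the version) would be reported as a mismatch even when the base version agrees.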

Configurations

--device     |  running device
--dataset    |  dataset name
--hid_units  |  hidden units
--activate   |  activation function
--tradeoff   |  tradeoff parameter
--lr         |  learning rate
--epochs     |  training epochs
--eval_inter |  evaluation interval
--wandb      |  wandb logging
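For orientation, the flags listed above could be declared with `argparse` roughly as follows. This is a sketch, not the repository's parser: the types and default values are assumptions chosen for illustration (several mirror the Cora command in Quick Start), and `main.py` remains the authoritative definition.

```python
# Sketch of a parser for the flags listed above. Types and defaults are
# assumptions for illustration; main.py in the repository is authoritative.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Dink-Net options (illustrative)")
    parser.add_argument("--device", type=str, default="cuda:0", help="running device")
    parser.add_argument("--dataset", type=str, default="cora", help="dataset name")
    parser.add_argument("--hid_units", type=int, default=512, help="hidden units")
    parser.add_argument("--activate", type=str, default="prelu", help="activation function")
    parser.add_argument("--tradeoff", type=float, default=1.0, help="tradeoff parameter")
    parser.add_argument("--lr", type=float, default=1e-2, help="learning rate")
    parser.add_argument("--epochs", type=int, default=200, help="training epochs")
    parser.add_argument("--eval_inter", type=int, default=10, help="evaluation interval")
    # A boolean switch: present means wandb logging is on, absent means off.
    parser.add_argument("--wandb", action="store_true", help="wandb logging")
    return parser


args = build_parser().parse_args(
    ["--dataset", "citeseer", "--hid_units", "1536", "--lr", "5e-4", "--wandb"]
)
print(args.dataset, args.hid_units, args.lr, args.wandb)
```

Declaring `--wandb` with `action="store_true"` matches how the flag is passed in the commands (without a value).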

Quick Start

Clone this repository and change directory to Dink-Net:

git clone https://github.com/yueliu1999/Dink-Net.git
cd ./Dink-Net

Unzip the datasets and model parameters:

unzip -d ./data/ ./data/datasets.zip
unzip -d ./models/ ./models/models.zip
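On systems without the `unzip` tool, the same extraction can be done with Python's standard library. The sketch below assumes it is run from the repository root; `extract_archive` is a hypothetical helper name.

```python
# Standard-library equivalent of `unzip -d <dest> <archive>`.
# The paths in the usage comment mirror the commands above.
import zipfile
from pathlib import Path


def extract_archive(archive, dest):
    """Extract every member of `archive` into `dest` and return the member names."""
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_path)
        return zf.namelist()


# Usage, run from the repository root:
# extract_archive("./data/datasets.zip", "./data/")
# extract_archive("./models/models.zip", "./models/")
```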

Run the code with the provided scripts:

bash ./scripts/train_cora.sh

bash ./scripts/train_citeseer.sh

bash ./scripts/train_amazon_photo.sh

bash ./scripts/train_ogbn-arxiv.sh

Or run the code directly with commands:

python main.py --device cuda:0 --dataset cora --hid_units 512 --lr 1e-2 --epochs 200 --wandb

python main.py --device cuda:0 --dataset citeseer --hid_units 1536 --lr 5e-4 --epochs 200 --wandb

python main.py --device cuda:0 --dataset amazon_photo --hid_units 512 --lr 1e-2 --epochs 100  --eval_inter 1 --wandb

python main.py --device cuda:0 --dataset ogbn_arxiv --hid_units 1500 --encoder_layer 3 --lr 1e-4 --epochs 30 --batch_size 8192 --batch_train --eval_inter 1 --wandb

Tip: remove "--wandb" to disable wandb logging if a logging error occurs.
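When running several datasets in a row, the commands above can be generated from one table of settings. The sketch below copies the per-dataset arguments from this README; `RUNS` and `build_command` are hypothetical names, and the `use_wandb` switch simply omits `--wandb`.

```python
# Sketch: build the training commands shown above from one table of settings.
# The settings are copied from the README commands; the helper names are hypothetical.

RUNS = {
    "cora": ["--hid_units", "512", "--lr", "1e-2", "--epochs", "200"],
    "citeseer": ["--hid_units", "1536", "--lr", "5e-4", "--epochs", "200"],
    "amazon_photo": ["--hid_units", "512", "--lr", "1e-2", "--epochs", "100",
                     "--eval_inter", "1"],
    "ogbn_arxiv": ["--hid_units", "1500", "--encoder_layer", "3", "--lr", "1e-4",
                   "--epochs", "30", "--batch_size", "8192", "--batch_train",
                   "--eval_inter", "1"],
}


def build_command(dataset, device="cuda:0", use_wandb=True):
    """Return the argument list for one run; use_wandb=False drops --wandb."""
    cmd = ["python", "main.py", "--device", device, "--dataset", dataset]
    cmd += RUNS[dataset]
    if use_wandb:
        cmd.append("--wandb")
    return cmd


for name in RUNS:
    print(" ".join(build_command(name, use_wandb=False)))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the run.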

Results


Table 1. Clustering performance (%) of our method and fourteen state-of-the-art baselines. Bold and underlined values are the best and runner-up results, respectively. “OOM” indicates that the method ran out of memory. “-” denotes that the method did not converge.
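This README does not name the metrics behind Table 1. As an assumption, the sketch below shows one common clustering metric, accuracy under the best one-to-one mapping between predicted cluster ids and ground-truth classes. It brute-forces all permutations, which is practical only for a handful of clusters; a Hungarian solver (such as the `munkres` package in the requirements) is the usual choice for larger cluster counts. The function name and toy labels are hypothetical.

```python
# Illustrative only: brute-force best mapping between cluster ids and labels.
# Practical for small K; a Hungarian solver is the usual choice for larger K.
from itertools import permutations


def clustering_accuracy(y_true, y_pred):
    """Best accuracy (%) over one-to-one mappings of predicted ids to labels."""
    labels = sorted(set(y_true) | set(y_pred))
    best = 0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))  # predicted id -> candidate label
        hits = sum(1 for t, p in zip(y_true, y_pred) if mapping[p] == t)
        best = max(best, hits)
    return 100.0 * best / len(y_true)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 1]  # cluster ids are arbitrary, so 0 and 1 are swapped
print(round(clustering_accuracy(y_true, y_pred), 2))  # 83.33
```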


Figure 1. t-SNE visualization of seven methods on the Cora dataset.

Acknowledgements

Our code is partly based on the following GitHub repository. Thanks for their awesome work.

Pretraining

To pretrain Dink-Net on your own dataset, refer to here.

Citations

If you find this repository helpful, please cite our paper.

@inproceedings{Dink-Net,
  title={Dink-Net: Neural Clustering on Large Graphs},
  author={Liu, Yue and Liang, Ke and Xia, Jun and Zhou, Sihang and Yang, Xihong and Liu, Xinwang and Li, Stan Z.},
  booktitle={International Conference on Machine Learning},
  year={2023},
  organization={PMLR}
}
