Dink-Net: Neural Clustering on Large Graphs

[Yue Liu](https://yueliu1999.github.io/)¹,², [Ke Liang](https://liangke23.github.io/)¹, [Jun Xia](https://junxia97.github.io/)², [Sihang Zhou](https://scholar.google.com/citations?user=p9Se8kYAAAAJ&hl=zh-CN&oi=ao/)¹, [Xihong Yang](https://xihongyang1999.github.io/)¹, [Xinwang Liu](https://xinwangliu.github.io/)¹, [Stan Z. Li](https://scholar.google.com/citations?user=Y-nyLGIAAAAJ&hl=zh-CN&oi=ao)²

¹[National University of Defense Technology](https://english.nudt.edu.cn/), ²[Westlake University](https://westlake.edu.cn/)

Deep graph clustering, which aims to group the nodes of a graph into disjoint clusters with deep neural networks, has achieved promising progress in recent years. However, existing methods fail to scale to large graphs with millions of nodes. To solve this problem, a scalable deep graph clustering method (Dink-Net) is proposed, built on the idea of dilation and shrink. First, representations are learned in a self-supervised manner by discriminating whether nodes have been corrupted by augmentations. Meanwhile, the cluster centers are initialized as learnable neural parameters. Subsequently, the clustering distribution is optimized by minimizing the proposed cluster dilation loss and cluster shrink loss in an adversarial manner. With these settings, the two-step clustering pipeline, i.e., representation learning and clustering optimization, is unified into an end-to-end framework that guides the network to learn clustering-friendly features. In addition, Dink-Net scales well to large graphs because the designed loss functions optimize the clustering distribution on mini-batch data without performance drops. Both experimental results and theoretical analyses demonstrate the superiority of our method.
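To make the dilation and shrink idea more concrete, here is a minimal sketch of how the two loss terms could be computed on a mini-batch. It is not the repository's code: the squared Euclidean distance, the normalization, and the function names `dilation_loss` and `shrink_loss` are assumptions made for illustration, and plain Python lists stand in for tensors. Refer to the paper for the exact formulation.

```python
# Illustrative sketch only (assumed formulation, not the official implementation).
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dilation_loss(centers: List[Vector]) -> float:
    """Negative mean squared distance over all ordered pairs of distinct centers.

    Minimizing this value pushes the cluster centers apart.
    """
    k = len(centers)
    total = sum(
        sq_dist(centers[i], centers[j])
        for i in range(k)
        for j in range(k)
        if i != j
    )
    return -total / (k * (k - 1))


def shrink_loss(batch: List[Vector], centers: List[Vector]) -> float:
    """Mean squared distance from every embedding in the batch to every center.

    Minimizing this value pulls the batch embeddings toward the centers.
    """
    total = sum(sq_dist(h, c) for h in batch for c in centers)
    return total / (len(batch) * len(centers))


if __name__ == "__main__":
    centers = [[0.0, 0.0], [2.0, 0.0]]   # two hypothetical cluster centers
    batch = [[1.0, 0.0]]                 # one hypothetical node embedding
    print(dilation_loss(centers), shrink_loss(batch, centers))
```

Because `shrink_loss` only needs the embeddings in the current batch plus the cluster centers, its cost does not depend on the total number of nodes, which is the property the mini-batch argument above relies on.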

Table of Contents

Usage
Acknowledgements
Citations

Usage

Datasets

Dataset	Type	# Nodes	# Edges	# Feature Dimensions	# Classes
Cora	Attribute Graph	2,708	5,278	1,433	7
CiteSeer	Attribute Graph	3,327	4,614	3,703	6
Amazon-Photo	Attribute Graph	7,650	119,081	745	8
ogbn-arxiv	Attribute Graph	169,343	1,166,243	128	40
Reddit	Attribute Graph	232,965	23,213,838	602	41
ogbn-products	Attribute Graph	2,449,029	61,859,140	100	47
ogbn-papers100M	Attribute Graph	111,059,956	1,615,685,872	128	172
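As a rough feel for the scale in this table, the sketch below estimates how much memory a dense N x N float32 matrix over all nodes would need for each dataset. This is only back-of-the-envelope arithmetic: whether a particular clustering method materializes such a matrix is an assumption, and the helper name `dense_matrix_bytes` is hypothetical.

```python
# Node counts copied from the dataset table above.
NODES = {
    "Cora": 2_708,
    "CiteSeer": 3_327,
    "Amazon-Photo": 7_650,
    "ogbn-arxiv": 169_343,
    "Reddit": 232_965,
    "ogbn-products": 2_449_029,
    "ogbn-papers100M": 111_059_956,
}


def dense_matrix_bytes(num_nodes: int, bytes_per_entry: int = 4) -> int:
    """Bytes needed for a dense num_nodes x num_nodes matrix (float32 by default)."""
    return num_nodes * num_nodes * bytes_per_entry


if __name__ == "__main__":
    for name, n in NODES.items():
        gigabytes = dense_matrix_bytes(n) / 1e9
        print(f"{name}: {gigabytes:,.2f} GB")
```

The quadratic growth is the point: a structure that is small for Cora becomes impractical for the graphs in the last rows.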

Requirements

The code is tested on Python 3.7.

dgl-cu113==0.9.1.post1
munkres==1.1.4
networkx==2.8.3
numpy==1.23.2
scikit_learn==1.3.0
scipy==1.6.0
torch==2.0.1
torch-scatter==2.0.9
torch-sparse==0.6.12
torch-spline-conv==1.2.1
torch-geometric==2.1.0.post1
tqdm==4.65.0
wandb==0.15.4
ogb==1.3.6
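One way to compare an environment against these pins is sketched below. It is an illustrative helper, not part of the repository: the function names `parse_pins` and `check_installed` are hypothetical, only a few pins are shown, and `importlib.metadata` assumes Python 3.8 or newer.

```python
from importlib import metadata  # standard library since Python 3.8
from typing import Dict


def parse_pins(text: str) -> Dict[str, str]:
    """Parse 'name==version' lines into a {name: version} mapping."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "==" not in line:
            continue  # skip blank lines and anything that is not an exact pin
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def check_installed(pins: Dict[str, str]) -> Dict[str, str]:
    """Return {name: status}; status is 'ok', 'missing', or 'found <version>'."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if found == wanted else f"found {found}"
    return report


if __name__ == "__main__":
    pins = parse_pins("""
    numpy==1.23.2
    scipy==1.6.0
    torch==2.0.1
    tqdm==4.65.0
    """)
    for name, status in check_installed(pins).items():
        print(f"{name}: {status}")
```

The comparison is a plain string match, so a build with a local suffix (for example a CUDA-specific wheel) is reported as a different version even if it is compatible.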

Configurations

--device     |  running device
--dataset    |  dataset name
--hid_units  |  hidden units
--activate   |  activation function
--tradeoff   |  tradeoff parameter
--lr         |  learning rate
--epochs     |  training epochs
--eval_inter |  evaluation interval
--wandb      |  wandb logging
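The options above map naturally onto an `argparse` parser. The sketch below is an assumption about how such a parser could look, not the contents of `main.py`: the option types are inferred from the example commands in Quick Start, and the defaults are illustrative (taken from the Cora command where one exists, `None` otherwise).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(description="Sketch of the Dink-Net options")
    parser.add_argument("--device", type=str, default="cuda:0", help="running device")
    parser.add_argument("--dataset", type=str, default="cora", help="dataset name")
    parser.add_argument("--hid_units", type=int, default=512, help="hidden units")
    # No value for these two appears in the README, so no default is assumed.
    parser.add_argument("--activate", type=str, default=None, help="activation function")
    parser.add_argument("--tradeoff", type=float, default=None, help="tradeoff parameter")
    parser.add_argument("--lr", type=float, default=1e-2, help="learning rate")
    parser.add_argument("--epochs", type=int, default=200, help="training epochs")
    parser.add_argument("--eval_inter", type=int, default=None, help="evaluation interval")
    # A bare flag: present means logging is on, absent means it is off.
    parser.add_argument("--wandb", action="store_true", help="wandb logging")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

Modeling `--wandb` as a `store_true` flag matches how it is used on the command line: it takes no value and is simply omitted to turn logging off.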

Quick Start

clone this repository and change directory to Dink-Net

git clone https://github.com/yueliu1999/Dink-Net.git
cd ./Dink-Net

unzip the datasets and model parameters

unzip -d ./data/ ./data/datasets.zip
unzip -d ./models/ ./models/models.zip

run the code with scripts

bash ./scripts/train_cora.sh

bash ./scripts/train_citeseer.sh

bash ./scripts/train_amazon_photo.sh

bash ./scripts/train_ogbn-arxiv.sh

or run the code directly with commands

python main.py --device cuda:0 --dataset cora --hid_units 512 --lr 1e-2 --epochs 200 --wandb

python main.py --device cuda:0 --dataset citeseer --hid_units 1536 --lr 5e-4 --epochs 200 --wandb

python main.py --device cuda:0 --dataset amazon_photo --hid_units 512 --lr 1e-2 --epochs 100  --eval_inter 1 --wandb

python main.py --device cuda:0 --dataset ogbn_arxiv --hid_units 1500 --encoder_layer 3 --lr 1e-4 --epochs 30 --batch_size 8192 --batch_train --eval_inter 1 --wandb

Tip: remove "--wandb" to disable wandb logging if a logging error occurs.
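When running several datasets in a row, the commands above can be generated from a small table of settings. The sketch below is a convenience wrapper and an assumption, not part of the repository: the names `CONFIGS` and `build_command` are hypothetical, while the flag values are copied from the commands above.

```python
from typing import Dict, List, Union

# Per-dataset settings copied from the commands above.
# A value of True marks a bare flag that takes no argument.
CONFIGS: Dict[str, Dict[str, Union[int, str, bool]]] = {
    "cora": {"hid_units": 512, "lr": "1e-2", "epochs": 200},
    "citeseer": {"hid_units": 1536, "lr": "5e-4", "epochs": 200},
    "amazon_photo": {"hid_units": 512, "lr": "1e-2", "epochs": 100, "eval_inter": 1},
    "ogbn_arxiv": {
        "hid_units": 1500,
        "encoder_layer": 3,
        "lr": "1e-4",
        "epochs": 30,
        "batch_size": 8192,
        "batch_train": True,
        "eval_inter": 1,
    },
}


def build_command(dataset: str, device: str = "cuda:0", wandb: bool = False) -> List[str]:
    """Build the argument list for one training run."""
    cmd = ["python", "main.py", "--device", device, "--dataset", dataset]
    for flag, value in CONFIGS[dataset].items():
        if value is True:
            cmd.append(f"--{flag}")          # bare flag, e.g. --batch_train
        else:
            cmd += [f"--{flag}", str(value)]
    if wandb:
        cmd.append("--wandb")                # omit to disable wandb logging
    return cmd


if __name__ == "__main__":
    for name in CONFIGS:
        print(" ".join(build_command(name)))
```

To launch a run instead of printing it, pass the list to `subprocess.run(build_command("cora"), check=True)` from the repository root.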

Results

Table 1. Clustering performance (%) of our method and fourteen state-of-the-art baselines. Bold and underlined values are the best and runner-up results, respectively. “OOM” indicates that the method ran out of memory. “-” denotes that the method did not converge.

Figure 1. t-SNE visualization of seven methods on the Cora dataset.

Acknowledgements

Our code is partly based on the following GitHub repositories. Thanks for their awesome work.

Awesome Deep Graph Clustering: a collection of deep graph clustering (papers, codes, and datasets).
Graph-Group-Discrimination: the official implementation of the Graph Group Discrimination (GGD) model.
S3GC: the official implementation of the Scalable Self-Supervised Graph Clustering (S3GC) model.
HSAN: the official implementation of the Hard Sample Aware Network (HSAN) model.
SCGC: the official implementation of the Simple Contrastive Graph Clustering (SCGC) model.
DCRN: the official implementation of the Dual Correlation Reduction Network (DCRN) model.

Pretraining

pretrain Dink-Net on your own dataset. Refer to here.

Citations

If you find this repository helpful, please cite our paper.

@inproceedings{Dink-Net,
  title={Dink-Net: Neural Clustering on Large Graphs},
  author={Liu, Yue and Liang, Ke and Xia, Jun and Zhou, Sihang and Yang, Xihong and Liu, Xinwang and Li, Stan Z.},
  booktitle={International Conference on Machine Learning},
  year={2023},
  organization={PMLR}
}