yl-1993 / learn-to-cluster

Learning to Cluster Faces (CVPR 2019, CVPR 2020)
MIT License

issue on zero classes prediction #45

Closed completelyboofyblitzed closed 4 years ago

completelyboofyblitzed commented 4 years ago

Hello! First of all, thank you for sharing the code of your project! I was trying to implement it on a custom dataset and ran into the following error:

[Time] build super vertices consumes 2.7493 s
[warn] idx2lb is empty! skip write idx2lb to ./data/cluster_proposals/part0_train/hnsw_k_30_th_0.6_step_0.05_minsz_3_maxsz_300_iter_0/pred_labels.txt
[Time] dump clustering to ./data/cluster_proposals/part0_train/hnsw_k_30_th_0.6_step_0.05_minsz_3_maxsz_300_iter_0/pred_labels.txt consumes 0.0001 s
saving cluster proposals to ./data/cluster_proposals/part0_train/hnsw_k_30_th_0.6_step_0.05_minsz_3_maxsz_300_iter_0/proposals
0it [00:00, ?it/s]
k=2, th_knn=0.4, th_step=0.05, minsz=3, maxsz=500, sv_minsz=2, sv_maxsz=8, is_rebuild=False
[Time] read proposal list consumes 4.6750 s
Traceback (most recent call last):
  File "dsgcn/main.py", line 104, in <module>
    main()
  File "dsgcn/main.py", line 100, in main
    handler(model, cfg, logger)
  File "/home/username/learn-to-cluster/dsgcn/train_cluster_det.py", line 18, in train_cluster_det
    train_cluster(model, cfg, logger, batch_processor)
  File "/home/username/learn-to-cluster/dsgcn/train.py", line 16, in train_cluster
    dataset = build_dataset(cfg.train_data)
  File "/home/username/learn-to-cluster/dsgcn/datasets/__init__.py", line 13, in build_dataset
    return ClusterDataset(cfg)
  File "/home/username/learn-to-cluster/dsgcn/datasets/cluster_dataset.py", line 59, in __init__
    self._read(feat_path, label_path, proposal_folders)
  File "/home/username/learn-to-cluster/dsgcn/datasets/cluster_dataset.py", line 94, in _read
    proposal_folders = proposal_folders()
  File "/home/username/learn-to-cluster/proposals/generate_proposals.py", line 69, in generate_proposals
    **param_i1)
  File "/home/username/learn-to-cluster/proposals/generate_iter_proposals.py", line 112, in generate_iter_proposals
    raise FileNotFoundError('{} not found.'.format(sv_labels))
FileNotFoundError: ./data/cluster_proposals/part0_train/hnsw_k_30_th_0.6_step_0.05_minsz_3_maxsz_300_iter_0/pred_labels.txt not found.

Does it mean that no clusters were detected? What do you think should be done in this case?

yl-1993 commented 4 years ago

@kak-to-tak Thanks for checking out our project.

Yes, I think the error means that no valid super vertices exist. One possible cause is the threshold: if it is too high, each vertex becomes its own cluster and gets filtered out by minsz, which leads to an empty idx2lb. Would you mind lowering the threshold for super-vertex generation?
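A minimal toy sketch (not the repository's code) of this failure mode: edges below the similarity threshold are dropped, every vertex ends up in its own component, and components smaller than minsz are filtered away, leaving nothing to write to pred_labels.txt.

```python
# Illustrative sketch only: threshold-based clustering with a minsz filter.
# A too-high threshold drops every edge, so all components are singletons
# and the minsz filter removes them all (the "empty idx2lb" situation).

def cluster(edges, n, th, minsz):
    """edges: list of (u, v, similarity); returns clusters surviving minsz."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for u, v, sim in edges:
        if sim >= th:                      # keep only confident edges
            parent[find(u)] = find(v)

    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    # drop components smaller than minsz, mimicking the size filter
    return [c for c in comps.values() if len(c) >= minsz]

edges = [(0, 1, 0.55), (1, 2, 0.52), (3, 4, 0.58), (4, 5, 0.50)]
print(len(cluster(edges, 6, th=0.7, minsz=3)))  # 0 clusters -> empty idx2lb
print(len(cluster(edges, 6, th=0.5, minsz=3)))  # 2 clusters survive
```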

completelyboofyblitzed commented 4 years ago

@yl-1993 To what extent is it reasonable to lower the threshold? Do I understand correctly that the default threshold is 0.4? Isn't that already quite low? Could you please elaborate a bit on the hyperparameters of your algorithm? Excuse me if it's all in the paper, but I can't seem to find comprehensive information on this. Thank you.

yl-1993 commented 4 years ago

@kak-to-tak Thanks for the question. Since applying a threshold to a kNN graph is a widely used technique, we do not emphasize it much in our paper. The threshold 0.4 is for super-vertex generation at iter=1, which usually differs from the threshold at iter=0.

The value of the threshold mainly depends on the feature manifold. In practice, it can be determined by constructing a small validation set that contains both positive pairs and negative pairs: compute the similarity scores for all pairs, then find the threshold that maximizes accuracy on the set.
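The validation-set procedure above can be sketched as follows (a hypothetical helper, not part of the repository; the pair similarities are made-up illustrations):

```python
# Illustrative sketch: sweep candidate thresholds over labelled pairs and
# keep the one that maximizes pair-classification accuracy.

def best_threshold(pos_sims, neg_sims, step=0.05):
    """pos_sims: same-identity pair scores; neg_sims: different-identity."""
    best_th, best_acc = 0.0, 0.0
    n = len(pos_sims) + len(neg_sims)
    th = 0.0
    while th <= 1.0:
        # a pair is classified "same identity" iff similarity >= th
        correct = sum(s >= th for s in pos_sims) + sum(s < th for s in neg_sims)
        acc = correct / n
        if acc > best_acc:
            best_th, best_acc = th, acc
        th = round(th + step, 10)   # avoid float drift in the sweep
    return best_th, best_acc

pos = [0.82, 0.74, 0.66, 0.58]   # same-identity pairs (toy values)
neg = [0.41, 0.35, 0.52, 0.28]   # different-identity pairs (toy values)
th, acc = best_threshold(pos, neg)
print(th, acc)  # 0.55 1.0
```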

Another simple way to set the threshold on the training set is to start with a middle value, e.g., 0.5. It can then be adjusted according to some indicators, e.g., the number of proposals or the average proposal size. In this case, you may want to nudge the value a little to both sides, say 0.3 and 0.5, to get a quick feeling for what each leads to.

yl-1993 commented 4 years ago

@kak-to-tak To be more specific, I suspect this line is the one most related to this issue. A low threshold may produce very large proposals, while a high threshold may produce very small ones. The former are filtered out by maxsz and the latter by minsz.

The meaning of each hyper-parameter is as follows:

- k: number of nearest neighbors used to build the kNN graph
- th_knn: similarity threshold for keeping kNN edges
- th_step: step by which the threshold is adjusted across iterations
- minsz / maxsz: minimum / maximum allowed proposal size
- sv_minsz / sv_maxsz: minimum / maximum allowed super-vertex size

Thus, other possible solutions, depending on whether the proposals are too large or too small, are: (1) increase maxsz or decrease minsz; (2) decrease sv_maxsz or increase sv_minsz.
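A quick diagnostic sketch for deciding which knob to turn (a hypothetical helper with illustrative names, not the repository's API): inspect the proposal sizes and see which side of the size filter is eating them.

```python
# Illustrative sketch: count how many proposals fall below minsz or above
# maxsz, then suggest which hyper-parameter to adjust.

def suggest(sizes, minsz, maxsz):
    too_small = sum(s < minsz for s in sizes)
    too_large = sum(s > maxsz for s in sizes)
    kept = len(sizes) - too_small - too_large
    if kept == 0 and too_small >= too_large:
        return "all proposals too small: decrease minsz or lower the threshold"
    if kept == 0:
        return "all proposals too large: increase maxsz or raise the threshold"
    return f"{kept} proposals survive the size filter"

print(suggest([1, 2, 2, 1], minsz=3, maxsz=300))
print(suggest([5, 10, 20], minsz=3, maxsz=300))
```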

completelyboofyblitzed commented 4 years ago

@yl-1993 Thank you so much for the quick reply and the details! I have another question, about the number of iterations. After I lowered the threshold to 0.3, it managed to work for two iterations but failed with the same error on the third. What is the approach for choosing the optimal number of iterations?

yl-1993 commented 4 years ago

@kak-to-tak The high-level idea is similar to generating super pixels in 2D images. If the initial super pixels are large, the number of iterations is likely to be small; on the contrary, if the initial super pixels are small, we may need more iterations. Therefore, the number of iterations is related to the size of the initial super pixels.

Another perspective comes from the selection of desired proposals.

Overall, it is hard to give a criterion for the optimal number of iterations, but our empirical results show a performance gain when the number of iterations is 2 or 3. Adding more iterations increases the recall of the clustering results but may impair precision.

One possible future direction is to produce proposals via a learnable network, similar to RPN in object detection.

completelyboofyblitzed commented 4 years ago

@yl-1993 Thank you!