yl-1993 / hfsoftmax

Accelerated Training for Massive Classification via Dynamic Class Selection (AAAI 2018)
MIT License

When the number of classes is nearly 1 million, how to implement the hash forest? #5

Closed stgzr closed 5 years ago

stgzr commented 5 years ago

Hi, when the number of classes is around 100k (MS-Celeb-1M), I think annoy can build the hash forest in an acceptable time. But as the number of classes grows to a truly "massive classification" scale, nearly 1 million, the HF building time becomes unacceptable (I tried building an annoy index over 1 million features with n_trees = 100 and feat_dim = 2048, and it took ~22 minutes). Is there a solution, or do you have any advice? Thank you!
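For reference, the build I timed looks roughly like this (a minimal sketch with random features standing in for the real classifier weights; reduce n_items to test quickly):

```python
import numpy as np
from annoy import AnnoyIndex

feat_dim = 2048
n_items = 1_000_000  # the scale that took ~22 minutes; shrink for a quick test
n_trees = 100

index = AnnoyIndex(feat_dim, 'euclidean')  # the metric here is an assumption
for i in range(n_items):
    index.add_item(i, np.random.rand(feat_dim).astype(np.float32))
index.build(n_trees)  # this step dominates the build time
```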

yl-1993 commented 5 years ago

Hi @stgzr Here are some ideas that may help:

  1. Try the hnsw sampler. It is much faster and more accurate. Both algorithms provide a trade-off between computation and accuracy; e.g., hnsw can be accelerated further by setting its post-processing strategy to 0.
  2. Hide the building time. As stated in the paper, the cost of building the structure can be hidden in the background. In the implementation, you can launch another thread to build the hash forest, similar to background data processing (see the sketch after this list).
  3. Hand over the work to the parameter server. If you have a strong parameter server, it can build the structure itself. Clients then only need to send query features and receive the selected weight vectors. In this case, "parameter server" may not be the most accurate name, since it is also responsible for many other things.
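A minimal sketch of ideas 1 and 2 combined, using the hnswlib package and a plain Python thread (the index parameters and demo sizes are illustrative, not this repo's actual settings):

```python
import threading
import numpy as np
import hnswlib

feat_dim = 2048
n_classes = 100_000  # demo size; scale up as needed

def build_index(weights, out):
    # Build the hnsw graph over the current classifier weights in the background.
    index = hnswlib.Index(space='l2', dim=feat_dim)
    index.init_index(max_elements=weights.shape[0], ef_construction=100, M=16)
    index.add_items(weights, np.arange(weights.shape[0]))
    out['index'] = index

weights = np.random.rand(n_classes, feat_dim).astype(np.float32)
result = {}
builder = threading.Thread(target=build_index, args=(weights, result), daemon=True)
builder.start()
# ... training keeps running on the previously built structure here ...
builder.join()
labels, _ = result['index'].knn_query(weights[:4], k=10)  # sample active classes
```

Since hnswlib releases the GIL while adding items, a plain thread is enough to overlap the build with training.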
stgzr commented 5 years ago

Thank you for your reply! @yl-1993 It is very helpful; I will try these later. Is there any difference, e.g., in precision, between the hnsw and annoy implementations under the same HF settings?

Another question: I trained my neural network with a moderately large batch (e.g., 2k), and found that the forward "select active classes" step (the get_nns_by_vector() loop) also costs significant time (~500 ms with 16 processes). I use selective softmax to reduce the large cost of the FC-layer multiplication (e.g., 500k-way classification), but it seems to introduce another large cost that cancels out the savings on the original FC multiplication. Do you have any ideas for this situation?
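For context, the lookup I am timing looks roughly like this (a sketch; the index path 'hf.ann' and the pool size are placeholders for my actual setup):

```python
import numpy as np
from multiprocessing import Pool
from annoy import AnnoyIndex

feat_dim = 2048
_index = None

def init_worker(path):
    # Each worker mmaps the prebuilt forest once, so loading is cheap.
    global _index
    _index = AnnoyIndex(feat_dim, 'euclidean')
    _index.load(path)

def query_chunk(chunk):
    # The per-sample loop whose total cost is ~500 ms for a 2k batch.
    return [_index.get_nns_by_vector(v, 10) for v in chunk]

if __name__ == '__main__':
    batch = np.random.rand(2048, feat_dim).astype(np.float32)
    with Pool(16, initializer=init_worker, initargs=('hf.ann',)) as pool:
        active = pool.map(query_chunk, np.array_split(batch, 16))
```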

Thank you very much!

yl-1993 commented 5 years ago

@stgzr Thanks for trying.

  1. For the first question, both hnsw and annoy can be viewed as nearest-neighbor (NN) search algorithms. Better NN-search performance usually leads to better performance in selective softmax training.
  2. For the second question, the computation cost increases linearly with the batch size, which also holds for normal CNN training. A suggestion is to first profile the bottleneck, for example, the time spent transporting vectors between client and server, or updating the weight matrix in the parameter server; see the sketch after this list. (I have to say 2k counts as a large batch for ImageNet training.)
  3. Besides, this repo only provides Python code to demonstrate the idea and pipeline of the paper. If you want a high-performance implementation, it is recommended to rewrite the time-consuming parts in C/C++.
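For example, a minimal timing sketch to isolate the bottleneck (the stage names and calls in the comments are illustrative, not this repo's API):

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(name):
    # Print the wall-clock time of the wrapped stage in milliseconds.
    start = time.perf_counter()
    yield
    print(f'{name}: {(time.perf_counter() - start) * 1000:.1f} ms')

# Wrap each stage of one selective-softmax training step, e.g.:
# with timer('query active classes'):
#     active = sampler.query(features)      # hypothetical call
# with timer('fetch weights from server'):
#     weights = server.fetch(active)        # hypothetical call
# with timer('partial FC forward/backward'):
#     loss = partial_fc(features, weights)  # hypothetical call
```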
stgzr commented 5 years ago

@yl-1993 Thank you! So it takes more effort to implement a high-performance selective softmax on top of your open-source code; I will consider your advice carefully. My last question is about accuracy: the performance (HF-A) drops by 0.8% compared to full softmax according to Table 2 in the paper. I understand this small drop is expected because we only select a few active classes. I wonder whether there is a way to get closer to full-softmax accuracy, despite the HF/NN-search approximation, while keeping the computational cost low.

I think the key to selective softmax is how to find the active classes. Have you ever tried other methods, like unsupervised clustering? By the way, I think the idea of reducing the FC computational cost in massive classification tasks is very useful, but related papers seem scarce, so do you have any plans for future work?

yl-1993 commented 5 years ago

@stgzr Thanks for delving into the details of our paper.

  1. Basically, it is a trade-off between computational cost and performance. As shown in the Ms1M experiments in the paper, in some situations the performance can get close to, or even surpass, full softmax. This may be because the selective procedure is more robust to some of the noise within classes.
  2. It provides a way to run quick experiments on large-scale classification. If you don't want to lose any performance, you can first tune the other variables and find the optimal settings with selective softmax, then train your final model with those variables fixed, using full softmax.
  3. You're right, the key lies in finding the active classes. As shown in the experiments, we compare with K-means and other unsupervised clustering approaches.
  4. Currently, we have moved on to other topics. You can keep an eye on MMLab to check out more interesting work from our group.