yahoojapan / NGT

Nearest Neighbor Search with Neighborhood Graph and Tree for High-dimensional Data
Apache License 2.0

support for sparse binary data #66

Closed fkaiserbio closed 4 years ago

fkaiserbio commented 4 years ago

First of all, many thanks for this fantastic piece of software! I'm completely amazed by the speed of this ANN method.

The problem I have to deal with requires support for ultra-high-dimensional data and binary metrics.

As far as I understand it, I can provide binary data to NGT by representing my binary data as 1-byte unsigned integers.

Is this correct?

Now, here is my question:

Is there any chance of adding support for high-dimensional, sparse data by specifying the "on bit" positions instead of the 1-byte integer conversion?

So, for example:

data        :  0  1  1  0  1  1  0  1  1  1  1  0  1  1  0  1
--------------------------------------------------------------
bit position: 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15

would be represented as 1 2 4 5 7 8 9 10 12 13 15. Of course this would also require some adaptations in your metric functions, but it would allow encoding sparse but ultra-high-dimensional data with a low memory footprint.
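
For illustration only (this is not NGT code), here is a minimal C++ sketch of converting a dense binary fingerprint like the one above into the proposed list of on-bit positions:

#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  // Dense binary fingerprint from the example above.
  std::vector<int> dense = {0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1};

  // Keep only the positions of the "on" bits.
  std::vector<uint32_t> onBits;
  for (uint32_t position = 0; position < dense.size(); ++position) {
    if (dense[position] == 1) {
      onBits.push_back(position);
    }
  }

  // Prints: 1 2 4 5 7 8 9 10 12 13 15
  for (uint32_t position : onBits) {
    std::cout << position << " ";
  }
  std::cout << std::endl;
  return 0;
}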

Thank you in advance!

masajiro commented 4 years ago

Thank you for your interest and suggestions.

Your understanding of the Jaccard distance is correct.
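
For reference, the Jaccard distance over the proposed sparse representation can be computed directly from sorted lists of on-bit positions. A minimal sketch, independent of how NGT actually implements its distance functions:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <vector>

// Jaccard distance = 1 - |intersection| / |union|, where the two objects are
// given as sorted lists of on-bit positions. Illustrative only.
double jaccardDistance(const std::vector<uint32_t> &a, const std::vector<uint32_t> &b) {
  std::vector<uint32_t> common;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(common));
  double unionSize = a.size() + b.size() - common.size();
  // Two empty objects are treated as identical here.
  return unionSize == 0 ? 0.0 : 1.0 - common.size() / unionSize;
}

int main() {
  std::vector<uint32_t> a = {1, 2, 4, 5, 7, 8, 9, 10, 12, 13, 15};
  std::vector<uint32_t> b = {1, 2, 3, 5, 8, 13};
  std::cout << jaccardDistance(a, b) << std::endl;  // 1 - 5/12 = 0.583333
  return 0;
}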

As for ultra-high-dimensional data, I agree that your idea can reduce the memory footprint. However, I am not sure whether NGT is effective for such sparse and ultra-high-dimensional data. If NGT is effective for such data, I would like to consider implementing it. Do you have any good examples (datasets) for this case?

fkaiserbio commented 4 years ago

Dear masajiro,

thanks for your swift response and for clarifying my understanding of the binary representation.

Such data is common in chemoinformatics. Individual molecules are represented as undirected graphs, and from these graphs one can derive so-called chemical fingerprints [1]. Because there are essentially infinitely many possible chemical structures, a common approach is to canonically enumerate subgraphs of these molecular graphs. The presence or absence of individual subgraphs is then represented as "on bits" in ultra-high-dimensional, sparse fingerprints. There are methods where the number of dimensions goes up to billions (i.e. as many dimensions as a 4-byte unsigned integer can address). Nonetheless, one can easily compute distances between such data points using the representation described above and compatible metrics such as Jaccard, Dice, etc.

I have attached a dataset (sparse_binary.tar.gz), containing 10k entities, that was generated using the method in reference [1].

If you could take a look at this, it would be amazing :+1: I'm sure it would open up new applications of your library for the scientific community.

[1] https://pubs.acs.org/doi/10.1021/ci100050t

masajiro commented 4 years ago

Thank you for your information. It sounds interesting.

masajiro commented 4 years ago

The maximum number of elements per line in the attached dataset is 75. For actual datasets, do you know in advance what the maximum number is? Since NGT cannot handle variable-length objects, the maximum size must be set for the fixed-length object repository when it is initialized.
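
One generic way to fit variable-length on-bit lists into a fixed-length repository is to pad each list up to the chosen maximum with a sentinel value and have the distance function skip the padding. The sketch below only illustrates that idea under assumed conventions; it is not necessarily how NGT or the later jaccard-sparse sample handles it:

#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Pad a variable-length list of on-bit positions to a fixed length with a
// sentinel value. The sentinel and the whole scheme are assumptions made
// purely for illustration.
std::vector<uint32_t> padToFixedLength(std::vector<uint32_t> onBits,
                                       std::size_t fixedLength,
                                       uint32_t sentinel) {
  if (onBits.size() > fixedLength) {
    throw std::runtime_error("object has more on bits than the fixed length allows");
  }
  onBits.resize(fixedLength, sentinel);
  return onBits;
}

int main() {
  // For example, an object with 5 on bits stored in a repository of fixed length 10.
  std::vector<uint32_t> stored = padToFixedLength({1, 2, 4, 5, 7}, 10, 0);
  return stored.size() == 10 ? 0 : 1;
}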

fkaiserbio commented 4 years ago

Dear masajiro,

unfortunately, there is no way to know in advance how many "on bits" are set for a data point at most. In theory the upper bound is a 4-byte unsigned integer, at least that is how RDKit implements it. So the theoretical maximum number of elements per line in this example is 2^32, which will never occur in practice. Usually, the data points contain between several dozen and several hundred elements per line.

Is it possible to set the fixed length to the maximum of a 4-byte unsigned integer?

masajiro commented 4 years ago

It is not possible to set the fixed length to the maximum of a 4-byte unsigned integer, but I think the theoretical maximum is not needed. I guess that the databases (indexes) are not updated very often. When you first create a database with an initial dataset, you can set the fixed length to the maximum length of the objects in that dataset. For most datasets, the maximum length should not be very large (under 10,000?). Is that right? However, if longer objects have to be inserted later, the database must be rebuilt after changing the fixed length. If it makes sense, I will implement it. Although the best way would be to implement variable-length objects, I hesitate to do that because it would take so much time.

fkaiserbio commented 4 years ago

I guess that the databases (indexes) are not updated very often.

Exactly, it will be no problem to identify the maximum length for a defined data set for which an index should be built. However, there is a small risk that unseen query data points may exceed this length. But I think it's fairly safe to assume e.g. 10,000 as the upper limit, or to set the maximum length to that of the longest object in the data used for database creation.

However, if longer objects have to be inserted later, the database must be rebuilt after changing the fixed length.

If this happens at all. The large databases created initially should represent reality fairly well, and significantly longer elements are not to be expected. I have sampled 10 million data points, and it seems that the length does not exceed 80. However, this depends highly on the overall size of the chemical structures in the data set, so the number can increase.

If it makes sense, I will implement it.

It makes perfect sense :smiley:

masajiro commented 4 years ago

I have experimentally implemented handling of sparse objects for the Jaccard distance and pushed it as the branch "jaccard". Since I was not able to implement it in the ngt command because of the tricky implementation, I instead made a sample command jaccard-sparse for it. https://github.com/yahoojapan/NGT/blob/jaccard/samples/jaccard-sparse/jaccard-sparse.cpp You can use it as follows.

git clone -b jaccard https://github.com/yahoojapan/NGT
cd NGT
mkdir build
cd build
cmake ..
make -j
make install
ldconfig /usr/local/lib
curl -L https://github.com/yahoojapan/NGT/files/4598035/sparse_binary.tar.gz|tar zxvf -
# randomly extract objects as queries.
sort -R sparse_binary.tsv | head -10 > sparse_binary_query_10.tsv
# create an empty index.
./samples/jaccard-sparse/jaccard-sparse create -d 100 -D J sparse
# insert objects and build the index.
./samples/jaccard-sparse/jaccard-sparse append sparse sparse_binary.tsv
# search for the index with the queries.
./samples/jaccard-sparse/jaccard-sparse search sparse sparse_binary_query_10.tsv
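
For reference, each line of sparse_binary.tsv appears to list the on-bit positions of one object as whitespace-separated unsigned integers; this is inferred from the discussion above and the excerpt quoted later in this thread, not from a format specification. Hypothetical example lines:

3 17 4122 918345
25 212 90871 3500017 4123456789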

Could you test this with other big datasets and give me any feedback?

fkaiserbio commented 4 years ago

This sounds fantastic! I will have a look asap and will update you. Thank you!

fkaiserbio commented 4 years ago

Dear masajiro,

I have tested your implementation with 1.2 million data points. For this, I compiled NGT from the jaccard branch with:

cmake -DNGT_SHARED_MEMORY_ALLOCATOR=ON -DNGT_LARGE_DATASET=ON ..

And what can I say, it works like a charm! Thanks for all your efforts! The query time is fantastic and so is the build time (on a pretty decent compute server):

Processed 100000 objects. time= 36.2736 (sec)
[...]
Processed 1200000 objects. time= 36.8929 (sec)

real    7m39.673s
user    64m41.334s
sys     1m15.171s
# End of Search
Average Query Time= 0.00521241 (sec), 5.21241 (msec), (0.0521241/10)

As a next step I will evaluate the recall performance. After this successful first test, I was hoping to make use of NGT's ability to process data larger than memory via memory-mapped files or the ngtq command and to scale up a bit. However, if I go with, let's say, a hundred million data points, memory allocation fails:

terminate called after throwing an instance of 'std::bad_alloc'
what():  std::bad_alloc

I wonder whether there is any way to bring the sparse data support to ngtq, which you say is suitable for "billions of objects", or to make proper use of memory-mapped files?

masajiro commented 4 years ago

Did you get the memory error with the jaccard-sparse command built with the following options you mentioned?

cmake -DNGT_SHARED_MEMORY_ALLOCATOR=ON -DNGT_LARGE_DATASET=ON ..

If so, the error is a little strange. In most cases, NGT with the options above is limited not by the amount of memory but by storage. Could you execute the following commands and send the results?

free
ngt info (index-path)
ls -l (index-path)

Unfortunately, I think that it is impossible to support sparse objects in ngtq because of its underlying method. NGT with the shared memory option (memory-mapped files) can handle more objects than fit into memory. However, when you handle over 50M objects, you have to change the lines below. At the moment, there is no way to change these values with the ngt command.

https://github.com/yahoojapan/NGT/blob/master/lib/NGT/Index.h#L77-L79

About 512 is needed for every 50M objects.
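
For example, reading this rule of thumb literally, handling the hundred million objects mentioned above would mean raising these values from 512 to roughly 2 x 512 = 1024.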

Are there any public big chemical fingerprint datasets like the ones you are using?

fkaiserbio commented 4 years ago

Did you get the memory error with the jaccard-sparse command built with the following options you mentioned?

Yes, I've checked this again. I've definitely compiled with these flags using this script and the repository cloned from the jaccard branch:

mkdir build
cd build
cmake -DNGT_SHARED_MEMORY_ALLOCATOR=ON -DNGT_LARGE_DATASET=ON ..
make -j

If so, the error is a little strange. In most cases, NGT with the options above is limited not by the amount of memory but by storage. Could you execute the following commands and send the results?

$ free
total        used        free      shared  buff/cache   available
Mem:      526983672   472691780    34812580     1019200    19479312    49939664
Swap:     536870908    46815192   490055716

The output of ngt info on the index that aborted during creation is as follows:

$ ngt info index/
NGT version: 1.11.6
The size of the object repository (not the number of the objects):      999999
The number of the removed objects:      0/999999
The number of the nodes:        0
The number of the edges:        0
The mean of the edge lengths:   -nan
The mean of the number of the edges per node:   -nan
The number of the nodes without edges:  0
The maximum of the outdegrees:  0
The minimum of the outdegrees:  -NA-
The number of the nodes where indegree is 0:    0
The maximum of the indegrees:   0
The minimum of the indegrees:   -NA-
#-nodes,#-edges,#-no-indegree,avg-edges,avg-dist,max-out,min-out,v-out,max-in,min-in,v-in,med-out,med-in,mode-out,mode-in,c95,c5,o-distance(10),o-skip,i-distance(10),i-skip:0:0:0:-nan:-nan:0:18446744073709551615:-nan:0:9223372036854775807:-nan:-1:-1:0:0:-nan:-nan:-nan:-nan:nan:0:nan:0

This looks odd, especially compared to a properly built index. But I have to admit that the number of objects I tried to insert exceeds 50M, so maybe I should try again with the changes in https://github.com/yahoojapan/NGT/blob/master/lib/NGT/Index.h#L77-L79 you mentioned. When I start building the index, it slowly allocates the entire RAM, but not the swap, until it fails with the mentioned error. No console output is printed until then. If I understood you correctly, this should not be the case when built with the -DNGT_SHARED_MEMORY_ALLOCATOR=ON flag.

The output of ls for the index directory:

$ ls -l index/
-rw-r--r-- 1 [...] 513M May 19 11:36 grp
-rw-r--r-- 1 [...]  64K May 19 11:36 grpc
-rw-r--r-- 1 [...] 4.1G May 19 11:37 objpo
-rw-r--r-- 1 [...]  64K May 19 11:37 objpoc
-rw-r--r-- 1 [...]  629 May 19 11:36 prf
-rw-r--r-- 1 [...] 513M May 19 11:36 trei
-rw-r--r-- 1 [...]  64K May 19 11:36 treic
-rw-r--r-- 1 [...] 513M May 19 11:36 trel
-rw-r--r-- 1 [...]  64K May 19 11:36 trelc

Are there any public big chemical fingerprint datasets like the ones you are using?

There are huge databases of chemical molecules, e.g. the ZINC database (https://zinc.docking.org/), but I don't know of any public resource that serves pre-calculated fingerprints. For the sake of testing, I will share a 121M sample of my data set with you:

https://drive.google.com/open?id=1n9DhlZ92QtNiKI6Xe5TflmrOh5PQ4Vl3

The maximal object size here is <150. Hope this helps! Many thanks. Again, I really appreciate your efforts to solve this challenge.

masajiro commented 4 years ago

Thank you for the information. However, I cannot reproduce the problem yet. I made a Dockerfile that builds an index and searches it under the same conditions as yours. Could you run it as follows?

tar zxvf jaccard.tar.gz
head -2000000 [path]/121M_sparse_binary.tsv > jaccard/sparse_binary.tsv
cd jaccard
./run.sh # or sudo ./run.sh

If it doesn't work, the cause of the problem is probably the host computer running it. If it works, the cause is probably your installation of NGT. If possible, I would like you to make a Dockerfile that reproduces the problem.

fkaiserbio commented 4 years ago

Dear masajiro,

I've created a Docker setup, based on yours, where the bad_alloc error occurs for a dataset of 2M entries.

However, I had to use Ubuntu 18.04 LTS as the base system, because 18.10 is EOL and did not build for me. I've even increased the shared memory size (https://github.com/yahoojapan/NGT/blob/master/lib/NGT/Index.h#L77-L79) from 512 to 1024, but it didn't help.

To reproduce (tested on two independent systems):

$ tar -zxvf docker_bad_alloc.tar.gz
$ cd docker/
$ ./run.sh

Please download the tarball here: https://drive.google.com/open?id=1leDer_6zDRK949FetZnHyv8I8UZ0dRMz

masajiro commented 4 years ago

Thank you for your helpful cooperation! I have finally found the cause of this problem. The data file you provided has an invalid value on the 1,000,000th line.

... 3818546315 3994088662 4121755354 4256676941202696463 368531313 423168317 434630546 435232729 ...

Moreover, since my sample program does not adequately check for invalid values, it enters an infinite loop that keeps consuming memory.
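
A minimal sketch of the kind of range check that catches such a value (illustrative only; this is not the actual code added to the branch):

#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Parse one whitespace-separated line of on-bit positions and reject anything
// that is not a valid 4-byte unsigned integer. Illustrative only; not NGT code.
bool parseSparseLine(const std::string &line, std::vector<uint32_t> &positions) {
  std::istringstream stream(line);
  std::string token;
  while (stream >> token) {
    errno = 0;
    char *end = nullptr;
    unsigned long long value = std::strtoull(token.c_str(), &end, 10);
    if (errno != 0 || end == token.c_str() || *end != '\0' ||
        value > std::numeric_limits<uint32_t>::max()) {
      std::cerr << "invalid on-bit position: " << token << std::endl;
      return false;
    }
    positions.push_back(static_cast<uint32_t>(value));
  }
  return true;
}

int main() {
  std::vector<uint32_t> positions;
  // The middle value exceeds the 4-byte unsigned integer range and is rejected.
  bool ok = parseSparseLine("3818546315 4256676941202696463 368531313", positions);
  std::cout << (ok ? "valid" : "invalid") << std::endl;  // prints: invalid
  return 0;
}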

masajiro commented 4 years ago

I have updated the jaccard branch to check for invalid values.

fkaiserbio commented 4 years ago

Fantastic, I will test it asap! Thanks for digging out this nasty little error. It's a flaw in the routines I use to generate the data.

fkaiserbio commented 4 years ago

Dear masajiro,

I've tested everything, and index creation as well as querying now runs perfectly! :smiley: Occasionally, I'm getting the following warning:

LeafNode::splitObjects: Too many same distances. Reduce internal children size for the tree index or not use the tree index.
  internalChildrenSize=5
  # of the children=4
  Size=101
  pivot=13
  cluster id=1
  Show distances for debug.
  13250903:0
  2.49317e-23 4.82114e-20 4.26136e-19 5.65415e-18 3.65176e-14 9.08799e-13 3.68678e-09 6.4208e-08 6.6114e-08 2.83759e-07 2.14516e-05 0.00303098 21.7223 21798.1 2.09354e+10 9.41388e+15 1.12815e+35 1.99538e+36 7.6787e+36 7.04757e+37 -1.04987e-38 -7.57934e-38 -2.19976e-35 -2.20391e-35 -2.37837e-28 -1.63086e-25 -4.45485e-25 -5.28067e-22 -2.21122e-15 -3.73078e-15 -1.79613e-09 -3.47223e-09 -0.0162436 -0.822823 -1.54167 -1.69822 -31.4405 -1.64751e+06 -3.74825e+09 -5.561e+12 -2.3394e+27 -5.02691e+29 -3.64353e+31 0 0 0 [...] (all remaining distances are 0)

Index creation just continues, but are there any implications for performance? I'm now running the implementation on a larger data set (100M), and the insertion time for a batch of 100k data points went up from about 30 seconds at the beginning to several thousand seconds later on:

Processed 100000 objects. time= 27.5122 (sec)
[...]
Processed 74100000 objects. time= 4613.4 (sec)

Or is it just that this implementation may not be suitable for such large datasets? I guess it is also expected behavior that insertion time increases as the index grows.

masajiro commented 4 years ago

I have checked your problems using your data (121M).

If you use data like the data you provided, the first message (LeafNode::splitObjects: Too many same distances. ...) is not a problem, because it occurs with low frequency.

After the size of the mapped files exceeds the amount of physical memory, inserting an object takes more time as the files keep growing. When I built indexes using your data (121M) with dimensionality=500 on an SSD file system and 256GB of memory, the time at the end was 4 to 6 times longer than at the beginning.

Processed 100000 objects. time= 29.3392 (sec)
...
Processed 121200000 objects. time= 187.683 (sec)

However, the time in your case is too long compared to mine. If you are using an HDD file system, that might explain why the time becomes as long as it does in your case.

In addition, I think that you specified 1000 as the dimensionality. If you reduce the dimensionality, the insertion time will also be reduced.

fkaiserbio commented 4 years ago

Thank you for checking that! I think this now clears up all my questions. There is nothing left to say other than a big thank you for taking on this effort :+1: I really appreciate the outcome and will make use of the new features. Please feel free to close the issue.

masajiro commented 4 years ago

I have a few questions before merging this branch into the master branch.

fkaiserbio commented 4 years ago

I'm happy to answer your questions:

The insertion slowdown was most likely caused by a busy ZFS file system on SATA drives. Regarding accuracy, I still have to test it properly, which I have not found time to do yet. But from a quick investigation of the results, it looks very promising, and the method retrieves neighbors that are chemically similar to the query. So my feeling is: yes, the method is very suitable for this kind of data. One of the best features is the disk caching, which I find very helpful for large chemical libraries. If you would like to wait for proper benchmarking before merging, I will try to do it as soon as I find time, but it requires some implementation effort on my side.

masajiro commented 4 years ago

Thank you for your helpful answers regarding the merge. I will merge it into the master branch after checking the code. As for accuracy, there is no need to hurry. However, when you have spare time, I would like you to confirm the accuracies and distances to check my implementation of the Jaccard distance for sparse data.