yahoojapan / NGT

Nearest Neighbor Search with Neighborhood Graph and Tree for High-dimensional Data
Apache License 2.0

How to create a QBG with Capi ? #132

Closed lerouxrgd closed 1 year ago

lerouxrgd commented 1 year ago

Hello @masajiro ,

Currently I call qbg_create with a specific path, which returns true (not a pointer to a QBGIndex). Right after that, I call qbg_open_index with the same path to get a QBGIndex pointer, but it throws the following error:

Capi : qbg_open_index() : Error: /home/rgd/dev/projects/ngt-rs/ngt-sys/NGT/lib/NGT/NGTQ/QuantizedBlobGraph.h:Index:317: QBG::Index: No quantized blob graph. /home/rgd/dev/projects/ngt-rs/ngt-sys/NGT/lib/NGT/NGTQ/QuantizedBlobGraph.h:load:988: Not found the rearranged inverted index. [/tmp/.tmp9KnL6r]

Should I proceed differently?

masajiro commented 1 year ago

The following source code is a sample to use QG. Before running this, please run the commands below.

$ curl -L -O https://github.com/yahoojapan/NGT/raw/main/tests/datasets/ann-benchmarks/sift-128-euclidean.tsv
$ curl -L -O https://github.com/yahoojapan/NGT/raw/main/tests/datasets/ann-benchmarks/sift-128-euclidean_query.tsv
$ head -1 sift-128-euclidean_query.tsv > query.tsv

#include        <fstream>
#include        <iostream>
#include        <sstream>
#include        <string>
#include        <vector>
#include        "NGT/Index.h"
#include        "NGT/NGTQ/Capi.h"
int
main(int argc, char **argv)
{
  std::string indexPath  = "index";
  std::string objectFile = "sift-128-euclidean.tsv";
  std::string queryFile  = "query.tsv";

  std::cerr << "run the following commands to prepare data for this sample program." << std::endl;
  std::cerr << "  curl -L -O https://github.com/yahoojapan/NGT/raw/main/tests/datasets/ann-benchmarks/sift-128-euclidean.tsv" << std::endl;
  std::cerr << "  curl -L -O https://github.com/yahoojapan/NGT/raw/main/tests/datasets/ann-benchmarks/sift-128-euclidean_query.tsv" << std::endl;
  std::cerr << "  head -1 sift-128-euclidean_query.tsv > query.tsv" << std::endl;
  std::cerr << std::endl;
  std::cerr << "index path=" << indexPath << std::endl;
  std::cerr << "object file=" << objectFile << std::endl;
  std::cerr << "query file=" << queryFile << std::endl;
  std::cerr << std::endl;

  NGTError err = ngt_create_error_object();
  NGTProperty prop = ngt_create_property(err);
  if (prop == NULL) {
    std::cerr << ngt_get_error_string(err) << std::endl;
    return 1;
  }
  size_t dimension = 128;
  ngt_set_property_dimension(prop, dimension, err);

  std::cerr << "create an empty index..." << std::endl;
  NGTIndex index = ngt_create_graph_and_tree(indexPath.c_str(), prop, err);
  if (index == NULL) {
    std::cerr << ngt_get_error_string(err) << std::endl;
    return 1;
  }

  std::cerr << "insert objects..." << std::endl;
  try {
    std::ifstream is(objectFile);
    std::string line;
    while (getline(is, line)) {
      std::vector<double> obj;
      std::stringstream linestream(line);
      while (!linestream.eof()) {
        float value;
        linestream >> value;
        if (linestream.fail()) {
          obj.clear();
          break;
        }
        obj.push_back(value);
      }
      if (obj.empty()) {
        std::cerr << "An empty line or invalid value: " << line << std::endl;
        return 1;
      }
      if (ngt_insert_index(index, obj.data(), dimension, err) == 0) {
        std::cerr << ngt_get_error_string(err) << std::endl;
        return 1;
      }
    }
  } catch (NGT::Exception &err) {
    std::cerr << "Error " << err.what() << std::endl;
    return 1;
  } catch (...) {
    std::cerr << "Error" << std::endl;
    return 1;
  }

  std::cerr << "build the index..." << std::endl;
  if (ngt_create_index(index, 100, err) == false) {
    std::cerr << "Error:" << ngt_get_error_string(err) << std::endl;
    return 1;
  }

  std::cerr << "save the index..." << std::endl;
  if (ngt_save_index(index, indexPath.c_str(), err) == false) {
    std::cerr << ngt_get_error_string(err) << std::endl;
    return 1;
  }

  std::cerr << "close the index..." << std::endl;
  ngt_close_index(index);

  NGTQGQuantizationParameters quantizationParameters;
  ngtqg_initialize_quantization_parameters(&quantizationParameters);

  std::cerr << "quantize the index..." << std::endl;
  if (ngtqg_quantize(indexPath.c_str(), quantizationParameters, err) == false) {
    std::cerr << ngt_get_error_string(err) << std::endl;
    return 1;
  }

  std::cerr << "open the quantized index..." << std::endl;
  index = ngtqg_open_index(indexPath.c_str(), err);
  if (index == NULL) {
    std::cerr << ngt_get_error_string(err) << std::endl;
    return 1;
  }

  std::ifstream is(queryFile);
  if (!is) {
    std::cerr << "Cannot open the specified file. " << queryFile << std::endl;
    return 1;
  }

  std::string line;
  // Use a std::vector rather than a variable-length array, which is not standard C++.
  std::vector<float> queryVector(dimension);
  if (getline(is, line)) {
    {
      std::vector<std::string> tokens;
      NGT::Common::tokenize(line, tokens, " \t");
      // Check the token count before truncating; resizing first would hide a short query.
      if (tokens.size() < dimension) {
        std::cerr << "dimension of the query is invalid. dimension=" << tokens.size() << ":" << dimension << std::endl;
        return 1;
      }
      tokens.resize(dimension);
      for (size_t i = 0; i < dimension; i++) {
        queryVector[i] = NGT::Common::strtod(tokens[i]);
      }
    }
    NGTObjectDistances result = ngt_create_empty_results(err);
    NGTQGQuery query;
    ngtqg_initialize_query(&query);
    query.query = queryVector.data();
    query.size = 10;
    query.result_expansion = 100;
    query.epsilon = 0.1;
    std::cerr << "search the index for the specified query..." << std::endl;
    if (ngtqg_search_index(index, query, result, err) == false) {
      std::cerr << ngt_get_error_string(err) << std::endl;
      return 1;
    }
    auto rsize = ngt_get_result_size(result, err);
    std::cout << "Rank\tID\tDistance" << std::endl;
    for (size_t i = 0; i < rsize; i++) {
      NGTObjectDistance object = ngt_get_result(result, i, err);
      std::cout << i + 1 << "\t" << object.id << "\t" << object.distance << std::endl;
    }

    ngt_destroy_results(result);
  }

  std::cerr << "close the quantized index" << std::endl;
  ngtqg_close_index(index);
  ngt_destroy_error_object(err);

  return 0;
}

I hope that this will be helpful.
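The sample parses the object file and the query file by hand, and the two code paths check their input differently. As a minimal sketch of how both could share one check, the helper below turns a tab- or space-separated line into exactly `dimension` floats. The name `parseVectorLine` is hypothetical; it is not part of NGT and only uses the C++ standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of NGT): parse one whitespace- or
// tab-separated line into exactly `dimension` floats.
std::vector<float> parseVectorLine(const std::string &line, std::size_t dimension) {
  assert(dimension > 0);
  std::vector<float> values;
  values.reserve(dimension);
  std::istringstream stream(line);
  float value;
  while (stream >> value) {
    values.push_back(value);
  }
  // Extraction stopped before the end of the line, so some token was not a number.
  if (!stream.eof()) {
    throw std::invalid_argument("invalid value in line: " + line);
  }
  // Reject lines that are too short or too long instead of padding or truncating.
  if (values.size() != dimension) {
    throw std::invalid_argument("expected " + std::to_string(dimension) +
                                " values, got " + std::to_string(values.size()));
  }
  return values;
}
```

Under this assumption, the insertion loop would copy the parsed floats into the buffer type that the insert call expects, and the query path would pass `data()` of the returned vector as the query pointer.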

lerouxrgd commented 1 year ago

Thank you for the example. However, it is for a QG index. I am actually wondering how to build a QBG index (using the functions qbg_create and qbg_open_index, as I mentioned).

masajiro commented 1 year ago

I was able to reproduce the issue you mentioned, so I have released v2.0.10 to resolve it. The release also adds a usage example for the QBG C API.

lerouxrgd commented 1 year ago

Thank you for the fix and the example, it helps a lot !

I am able to run your qbg-capi example on my machine. However, when I try to run a very similar unit test in Rust, I cannot insert an object and get the following error:

Error: Error("Capi : qbg_append_object() : Error: /home/rgd/dev/projects/ngt-rs/ngt-sys/NGT/lib/NGT/NGTQ/ObjectFile.h:put:169: ObjectFile::Dimensions are inconsistency. 256:128")

Note that I set the dimension construction parameter to 128 and I only insert objects of dimension 128, so it is very strange to find 256 in the error message. If I try a different dimension, the value in the error message is always off by a factor of 2.

The QBG unit test is here, and I have a similar one that works for QG index here.

lerouxrgd commented 1 year ago

Actually this was an issue on my side: I was not using the correct value for QbgObject::Float.
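The thread does not say which value was wrong or how it produced exactly a factor of 2. Purely as an illustration, and not as a description of NGT internals, one general way such a factor can appear is when a buffer written as 4-byte floats is sized as if it held 2-byte elements. The type names and byte widths below are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative element types only; these names and the byte widths below are
// not taken from the NGT headers.
enum class ElementType { Half, Single };

// Width in bytes of one element of the given type.
std::size_t elementSize(ElementType type) {
  return type == ElementType::Half ? 2 : 4;
}

// Number of elements a buffer of `bytes` bytes appears to hold when it is
// interpreted as `type`.
std::size_t impliedDimension(std::size_t bytes, ElementType type) {
  assert(bytes % elementSize(type) == 0);
  return bytes / elementSize(type);
}
```

With these assumed widths, 128 four-byte values occupy 512 bytes, so `impliedDimension(512, ElementType::Single)` is 128 while `impliedDimension(512, ElementType::Half)` is 256. Whatever the actual cause was here, a wrapper that mirrors a C enum needs to use the same integer values as the header it binds to.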

I am now running into a deeper issue that leads to a SIGSEGV when I run the unit test (the same I linked above). Using gdb to debug it gives me:

Thread 2 "qbg::index::tes" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff69ff6c0 (LWP 638023)]
NGT::NeighborhoodGraph::BooleanVector::insert (i=1, this=0x7ffff69fd170) at /home/rleroux/dev/workspaces/rust/ngt-rs/ngt-sys/NGT/lib/NGT/Graph.h:815
815             inline void insert(size_t i) { std::vector<bool>::operator[](i) = true; }
(gdb) bt
#0  NGT::NeighborhoodGraph::BooleanVector::insert (i=1, this=0x7ffff69fd170) at /home/rleroux/dev/workspaces/rust/ngt-rs/ngt-sys/NGT/lib/NGT/Graph.h:815
#1  QBG::Index::searchBlobGraph (this=this@entry=0x7fff08001450, searchContainer=..., seeds=...) at /home/rleroux/dev/workspaces/rust/ngt-rs/ngt-sys/NGT/lib/NGT/NGTQ/QuantizedBlobGraph.h:770
#2  0x00007ffff7f1e9ec in QBG::Index::searchBlobGraph (this=this@entry=0x7fff08001450, searchContainer=...) at /home/rleroux/dev/workspaces/rust/ngt-rs/ngt-sys/NGT/lib/NGT/NGTQ/QuantizedBlobGraph.h:722
#3  0x00007ffff7f1f049 in qbg_search_index_ (results=<optimized out>, param=..., query=Python Exception <class 'gdb.error'>: value has been optimized out
<synthetic pointer>, pindex=0x7fff08001450)
    at /home/rleroux/dev/workspaces/rust/ngt-rs/ngt-sys/NGT/lib/NGT/NGTQ/Capi.cpp:389
#4  qbg_search_index (index=0x7fff08001450, query=..., results=<optimized out>, error=0x7ffff006f2a0) at /home/rleroux/dev/workspaces/rust/ngt-rs/ngt-sys/NGT/lib/NGT/NGTQ/Capi.cpp:407
#5  0x000055555558d03f in ngt::qbg::index::QbgIndex::search (self=0x7ffff69fe170, query=...) at src/qbg/index.rs:116

masajiro commented 1 year ago

To search, you must open the QBG index in read-only mode. It appears that ngt-rs always opens a QBG index with read_only set to false in this line. Searching an index that was opened with read_only set to false causes the SIGSEGV.
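A wrapper can turn this rule into an explicit error instead of a crash by remembering how each index was opened. The sketch below is a hypothetical guard, not part of NGT or ngt-rs: `OpenMode`, `IndexHandle`, and `requireReadOnlyForSearch` are names invented for illustration, and the handle only records the path and mode rather than holding a real index.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical wrapper types (not part of NGT or ngt-rs).
enum class OpenMode { ReadWrite, ReadOnly };

// Records the path and the mode an index was opened with.
class IndexHandle {
 public:
  IndexHandle(std::string path, OpenMode mode) : path_(std::move(path)), mode_(mode) {
    assert(!path_.empty());
  }
  OpenMode mode() const { return mode_; }
  const std::string &path() const { return path_; }

 private:
  std::string path_;
  OpenMode mode_;
};

// Call before forwarding a search to the native library; throws if the index
// was not opened read-only.
void requireReadOnlyForSearch(const IndexHandle &index) {
  if (index.mode() != OpenMode::ReadOnly) {
    throw std::logic_error("index '" + index.path() +
                           "' must be reopened read-only before searching");
  }
}
```

In a design like this, the append and save paths would use a `ReadWrite` handle, and the search path would close it and reopen the same path as `ReadOnly` before calling into the library.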

lerouxrgd commented 1 year ago

Thank you for narrowing it down! Everything works fine now, so I will be able to update ngt-rs to NGT 2.0.