zhihu / rucene

Rust port of Lucene
Apache License 2.0

Indexing too many documents in one commit fails. #2

Open fulmicoton opened 4 years ago

fulmicoton commented 4 years ago

Context: I am adding rucene to https://github.com/tantivy-search/search-benchmark-game.

It is a search benchmark comparing Lucene, Tantivy, Bleve and now Rucene. Indexing works, but I have to commit periodically to avoid a panic.

See the following two lines of code and comment. https://github.com/tantivy-search/search-benchmark-game/blob/master/engines/rucene-0.1/src/bin/build_index.rs#L103-L104

(I suspect a u32 overflow)
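For readers without the linked file at hand, the periodic-commit workaround can be sketched roughly as follows. The `Indexer` trait, the `Mock` type and `index_in_batches` are hypothetical names made up for illustration and are not rucene's API. The batch size is arbitrary.

```rust
/// Hypothetical minimal interface, for illustration only (not rucene's API).
trait Indexer {
    fn add(&mut self, doc: &str);
    fn commit(&mut self);
}

/// Adds every document, committing after each `batch` documents and once
/// more at the end if a partial batch remains. Returns the commit count.
fn index_in_batches<I: Indexer>(ix: &mut I, docs: &[&str], batch: usize) -> usize {
    assert!(batch > 0, "batch size must be positive");
    let mut commits = 0;
    for (i, doc) in docs.iter().enumerate() {
        ix.add(doc);
        if (i + 1) % batch == 0 {
            ix.commit();
            commits += 1;
        }
    }
    if docs.len() % batch != 0 {
        ix.commit();
        commits += 1;
    }
    commits
}

/// Stand-in writer that only counts documents.
#[derive(Default)]
struct Mock {
    pending: usize,
    committed: usize,
}

impl Indexer for Mock {
    fn add(&mut self, _doc: &str) {
        self.pending += 1;
    }
    fn commit(&mut self) {
        self.committed += self.pending;
        self.pending = 0;
    }
}

fn main() {
    let mut ix = Mock::default();
    let commits = index_in_batches(&mut ix, &["a", "b", "c", "d", "e"], 2);
    println!("{} commits, {} docs committed", commits, ix.committed);
}
```

Committing in batches keeps the amount of uncommitted in-memory data bounded, which is presumably why it sidesteps the panic.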

fulmicoton commented 4 years ago

FYI, here is the backtrace.

doc 2420000
doc 2430000
doc 2440000
doc 2450000
doc 2460000
doc 2470000
doc 2480000
doc 2490000
doc 2500000
doc 2510000
doc 2520000
doc 2530000
thread 'main' panicked at 'index out of bounds: the len is 65537 but the index is 562949953355776', /rustc/c8ea4ace9213ae045123fdfeb59d1ac887656d31/src/libcore/slice/mod.rs:2806:10
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:84
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:61
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1025
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1426
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:65
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:50
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:193
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:210
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:471
  11: rust_begin_unwind
             at src/libstd/panicking.rs:375
  12: core::panicking::panic_fmt
             at src/libcore/panicking.rs:84
  13: core::panicking::panic_bounds_check
             at src/libcore/panicking.rs:62
  14: rucene::core::codec::postings::terms_hash_per_field::TermsHashPerFieldBase<T>::write_byte
  15: rucene::core::codec::postings::terms_hash_per_field::TermsHashPerField::add
  16: rucene::core::index::writer::doc_consumer::DocConsumer<D,C,MS,MP>::process_document
  17: rucene::core::index::writer::doc_writer::DocumentsWriter<D,C,MS,MP>::update_document
  18: build_index::main
  19: std::rt::lang_start::{{closure}}
  20: main
  21: __libc_start_main
  22: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

sunxiaoguang commented 4 years ago

Can you reproduce the panic with RUST_BACKTRACE=full enabled? There are multiple array accesses in TermsHashPerFieldBase::write_byte, so a line number would make it easier to find the one that caused the overflow. Thanks.

fulmicoton commented 4 years ago

I don't have time for this, but you can reproduce it on your own by running

ENGINES=rucene-0.1 make index

in the search benchmark project... https://github.com/tantivy-search/search-benchmark-game

sunxiaoguang commented 4 years ago

Sure, let me try it out

jtong11 commented 4 years ago

@fulmicoton, it is a 2GB limit caused by using i32. We will fix it soon.
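
The numbers in the panic message are consistent with an i32 byte offset wrapping at 2 GiB. The sketch below is a reconstruction under assumptions, not rucene's actual code: it assumes 32 KiB byte blocks (a shift of 15, as in Lucene's `ByteBlockPool`) and that the wrapped offset is sign-extended to 64 bits before the shift. `block_index` is a hypothetical helper.

```rust
// Assumption: 32 KiB byte blocks (shift of 15), as in Lucene's ByteBlockPool.
const BYTE_BLOCK_SHIFT: u32 = 15;
const BYTE_BLOCK_SIZE: u64 = 1 << BYTE_BLOCK_SHIFT;

/// Block index derived from an i32 byte offset that is sign-extended to
/// 64 bits and then shifted as an unsigned value (hypothetical reconstruction).
fn block_index(offset: i32) -> u64 {
    (offset as i64 as u64) >> BYTE_BLOCK_SHIFT
}

fn main() {
    // 65536 blocks of 32 KiB hold exactly 2 GiB, one byte more than i32::MAX.
    let total_bytes = 65_536 * BYTE_BLOCK_SIZE;
    assert_eq!(total_bytes, 1 << 31);

    // Advancing an i32 offset past i32::MAX wraps around to i32::MIN.
    let wrapped = i32::MAX.wrapping_add(1);
    assert_eq!(wrapped, i32::MIN);

    // Prints 562949953355776, the out-of-bounds index in the panic message.
    println!("{}", block_index(wrapped));
}
```

Under these assumptions the result is 2^49 - 2^16, which equals the reported index, and the reported length of 65537 fits a pool that has just filled 65536 blocks. Widening the offset to a 64-bit type, or failing with an explicit error before the offset reaches 2 GiB, would avoid the wraparound.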