zilliztech / knowhere

Knowhere is an open-source vector search engine, integrating FAISS, HNSW, etc.
Apache License 2.0

Test failure for Ubuntu 20.04 ppc64le architecture #161

Closed mgiessing closed 7 months ago

mgiessing commented 1 year ago

Hi, I want to build knowhere (as part of milvus) for ppc64le, and with minimal changes I'm able to build it successfully. However, when I then run the tests, they fail with this SIGABRT message:

root@c0f59838709d:/knowhere/build# ./Release/tests/ut/knowhere_tests
[...]
I1021 10:38:49.231748 120158 knowhere_config.cc:93] [KNOWHERE][SetBlasThreshold][knowhere_tests] Set faiss::distance_compute_blas_threshold to 16384
I1021 10:38:49.231760 120158 knowhere_config.cc:104] [KNOWHERE][SetEarlyStopThreshold][knowhere_tests] Set faiss::early_stop_threshold to 0
I1021 10:38:49.231784 120158 knowhere_config.cc:115] [KNOWHERE][SetClusteringType][knowhere_tests] Set faiss::clustering_type to 1
I1021 10:38:49.231792 120158 knowhere_config.cc:115] [KNOWHERE][SetClusteringType][knowhere_tests] Set faiss::clustering_type to 0
I1021 10:38:49.231814 120158 knowhere_config.cc:87] [KNOWHERE][SetSimdType][knowhere_tests] FAISS hook

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
knowhere_tests is a Catch2 v3.3.1 host application.
Run with -? for options

-------------------------------------------------------------------------------
Knowhere SIMD config
-------------------------------------------------------------------------------
/knowhere/tests/ut/test_knowhere_init.cc:43
...............................................................................

/knowhere/tests/ut/test_knowhere_init.cc:48: FAILED:
  REQUIRE( s.find(res) != s.end() )
with expansion:
  {?} != {?}

terminate called after throwing an instance of 'Catch::TestFailureException'
/knowhere/tests/ut/test_knowhere_init.cc:48: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  SIGABRT - Abort (abnormal termination) signal

===============================================================================
test cases:       10 |        9 passed | 1 failed
assertions: 99121199 | 99121197 passed | 2 failed

Aborted (core dumped)

The only changes to the git are the following to enable ppc64 builds:

root@c0f59838709d:/knowhere/build# git diff
diff --git a/cmake/libs/libfaiss.cmake b/cmake/libs/libfaiss.cmake
index a78f268..fd1e568 100644
--- a/cmake/libs/libfaiss.cmake
+++ b/cmake/libs/libfaiss.cmake
@@ -30,7 +30,7 @@ if(__X86_64)
   target_link_libraries(knowhere_utils PUBLIC glog::glog)
 endif()

-if(__AARCH64)
+if(__AARCH64 OR __PPC64)
   set(UTILS_SRC src/simd/hook.cc src/simd/distances_ref.cc)
   add_library(knowhere_utils STATIC ${UTILS_SRC})
   target_link_libraries(knowhere_utils PUBLIC glog::glog)
@@ -85,7 +85,7 @@ if(__X86_64)
   target_compile_definitions(faiss PRIVATE FINTEGER=int)
 endif()

-if(__AARCH64)
+if(__AARCH64 OR __PPC64)
   knowhere_file_glob(GLOB FAISS_AVX_SRCS thirdparty/faiss/faiss/impl/*avx.cpp)

   list(REMOVE_ITEM FAISS_SRCS ${FAISS_AVX_SRCS})
diff --git a/cmake/utils/platform_check.cmake b/cmake/utils/platform_check.cmake
index d713a2d..953b3a3 100644
--- a/cmake/utils/platform_check.cmake
+++ b/cmake/utils/platform_check.cmake
@@ -3,8 +3,9 @@ include(CheckSymbolExists)
 macro(detect_target_arch)
   check_symbol_exists(__aarch64__ "" __AARCH64)
   check_symbol_exists(__x86_64__ "" __X86_64)
+  check_symbol_exists(__powerpc64__ "" __PPC64)

-  if(NOT __AARCH64 AND NOT __X86_64)
+  if(NOT __AARCH64 AND NOT __X86_64 AND NOT __PPC64)
     message(FATAL "knowhere only support amd64 and arm64.")
   endif()
 endmacro()
diff --git a/conanfile.py b/conanfile.py
index 029c372..5084737 100644
--- a/conanfile.py
+++ b/conanfile.py
@@ -81,7 +81,7 @@ class KnowhereConan(ConanFile):
             self.options.rm_safe("fPIC")

     def requirements(self):
-        self.requires("boost/1.78.0")
+        self.requires("boost/1.75.0")
         self.requires("glog/0.4.0")
         self.requires("nlohmann_json/3.11.2")
         self.requires("openssl/1.1.1t")

I'm aware that there will be no SIMD acceleration and only scalar computation is used. I've also tested the exact same code on ubuntu:20.04-aarch64 and there the tests finish successfully.

Anyone know what could be the issue or how to properly debug this?

Thanks!

Information on system & build

OS: ubuntu:20.04
arch: ppc64le
gcc: 9.4.0 (ubuntu 20.04 build-essential default)
knowhere version: v2.2.1

liliu-z commented 1 year ago

Looks like this UT is not runnable on any host other than x86. We will fix this in the release. We would also be grateful if you could fix it and contribute the fix to the Knowhere repo.

liliu-z commented 1 year ago

/assign @Presburger

mgiessing commented 1 year ago

Well I tested this on ubuntu:20.04 (aarch64) as well as almalinux:8 (aarch64) and the tests run successfully for that architecture. I would like to contribute the PR to make this work on ppc64le architecture, but currently I have no idea why this fails :/ That's why I'm asking if you guys have an idea :)

However I will create a PR to support ppc64le at all, since the build process does work.

Presburger commented 1 year ago

@mgiessing Hi, there is a large amount of architecture-specific intrinsic code inside 'knowhere'. Currently, supporting PPC is not cost-effective for us. Thank you.

mgiessing commented 1 year ago

This is totally understandable and I don't expect you to implement ppc64le specific vector code etc. :)

My initial question was more about if you know why this test might fail because I could not make any sense of it and the only intrinsic code I've seen/found was distances_{SSE/AVX/AVX2/AVX512/NEON}.cc (NEON doesn't even exist in v2.2.1 I think) & some in thirdparty lib FAISS.

As I said, I will create a PR to add basic ppc64le support, and if I get more time I will implement VSX for accelerated SIMD operations.

Thanks!

liliu-z commented 1 year ago

Thanks for your contribution! I will hold this until:

  1. The UT gets fixed. We are on this, and it will be very soon.
  2. We finish evaluating this PR.

chasingegg commented 1 year ago

This happens because when we init knowhere, we dynamically set the SIMD level based on the CPU instruction set; you can refer to hook.cc. Currently we only support x86 and ARM, so no supported SIMD type is found on your ppc64le machine, and thus the UT fails. BTW, the current UT needs to be fixed by simply adding NEON.
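The failure mode described here can be sketched roughly as follows. This is only an illustration of the shape of the check at test_knowhere_init.cc:48, not Knowhere's actual code: `detect_simd_type` and `is_known_simd_type` are hypothetical names, and the set of labels is an assumption apart from "GENERIC" and "NEON", which appear in this thread.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for the architecture dispatch in hook.cc:
// each supported architecture reports a SIMD label, and anything
// else falls through with an empty label.
inline std::string detect_simd_type() {
#if defined(__x86_64__)
    return "SSE4_2";  // assumption: real code would probe the CPU for AVX2/AVX512
#elif defined(__aarch64__)
    return "NEON";
#else
    return "";        // e.g. ppc64le with no dedicated branch
#endif
}

// Mirrors the shape of REQUIRE(s.find(res) != s.end()).
// The label set is an assumption made for this sketch.
inline bool is_known_simd_type(const std::string& res) {
    static const std::set<std::string> s = {
        "AVX512", "AVX2", "SSE4_2", "GENERIC", "NEON"};
    return s.find(res) != s.end();
}
```

With this shape, `is_known_simd_type(detect_simd_type())` holds on x86 and ARM but not on an architecture that never sets a label, which matches the failed assertion reported above.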

alexanderguzhva commented 1 year ago

@mgiessing Could you please rebase your #162 on top of #163 which got merged? Thanks

mgiessing commented 1 year ago

@alexanderguzhva I just rebased my PR :-)

@chasingegg Thanks a lot for your hint. After adding a ppc64le section (using _ref, I'll add intrinsics later) to src/simd/hook.cc yesterday, knowhere now gets initialized correctly :)

+
+#if defined(__powerpc64__)
+    fvec_inner_product = fvec_inner_product_ref;
+    fvec_L2sqr = fvec_L2sqr_ref;
+    fvec_L1 = fvec_L1_ref;
+    fvec_Linf = fvec_Linf_ref;
+
+    fvec_norm_L2sqr = fvec_norm_L2sqr_ref;
+    fvec_L2sqr_ny = fvec_L2sqr_ny_ref;
+    fvec_inner_products_ny = fvec_inner_products_ny_ref;
+    fvec_madd = fvec_madd_ref;
+    fvec_madd_and_argmin = fvec_madd_and_argmin_ref;
+
+    simd_type = "GENERIC";
+    support_pq_fast_scan = false;
+#endif
 }
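For readers unfamiliar with this pattern, the hook is a set of function pointers that are bound once at init time. Below is a minimal sketch of the idea with a single scalar kernel; the names `inner_product_ref`, `g_inner_product`, `g_simd_type`, and `init_hook` are hypothetical and simplified compared to the real hook.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Scalar reference kernel: a plain loop with no intrinsics,
// so it compiles on any architecture.
inline float inner_product_ref(const float* x, const float* y, std::size_t d) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < d; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

using InnerProductFn = float (*)(const float*, const float*, std::size_t);

// Hypothetical hook state: callers go through the pointer, never
// through a specific kernel, so init decides which one is used.
static InnerProductFn g_inner_product = nullptr;
static std::string g_simd_type;

// On an architecture without a dedicated SIMD branch, bind the
// reference kernel and report a generic label.
inline void init_hook() {
    g_inner_product = inner_product_ref;
    g_simd_type = "GENERIC";
}
```

After `init_hook()`, a call such as `g_inner_product(a, b, 3)` runs the scalar loop. Adding VSX later would only mean binding a different function to the same pointer.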

However, it now fails during the "Search binary mmap" test. I'll try to spend some time this evening/week to debug that further:

[...]
I1025 08:34:52.465216 195767 factory.cc:20] [KNOWHERE][Create][knowhere_tests] create knowhere index BIN_IVF_FLAT with version 1
{"dim":8,"enable_mmap":true,"k":5,"metric_type":"SUPERSTRUCTURE","nlist":16,"nprobe":8}
terminate called after throwing an instance of 'faiss::FaissException'
  what():  Error in virtual void faiss::IndexBinaryIVF::train(faiss::IndexBinary::idx_t, const uint8_t*) at /knowhere/thirdparty/faiss/faiss/IndexBinaryIVF.cpp:300: IVF not to support Substructure and Superstructure.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
knowhere_tests is a Catch2 v3.3.1 host application.
Run with -? for options

-------------------------------------------------------------------------------
Search binary mmap
  Test Search
-------------------------------------------------------------------------------
/knowhere/tests/ut/test_mmap.cc:372
...............................................................................

/knowhere/tests/ut/test_mmap.cc:373: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  name := "BIN_IVF_FLAT"
  cfg_json := "{"dim":8,"enable_mmap":true,"k":5,"metric_type":
  "SUPERSTRUCTURE","nlist":16,"nprobe":8}"
  SIGABRT - Abort (abnormal termination) signal

===============================================================================
test cases:        15 |        14 passed | 1 failed
assertions: 112368250 | 112368249 passed | 1 failed

Aborted (core dumped)

Thanks for all your help so far!

alexanderguzhva commented 11 months ago

Hi @mgiessing , sorry for being late. I've tried reproducing this issue on QEMU ppc64 and it seems that it can be reproduced! So, basically, there's something wrong with the exception handling O_o (to my BIG Surprise), something non-trivial. I'll take a further look. Meanwhile, please feel free to rebase your PR on top of the master branch, including changes for the hook and I'll accept your change. Thanks.

mgiessing commented 11 months ago

No problem - I appreciate your effort looking into this :)

I also tried to debug a bit further with gdb, however I wasn't entirely sure if this had something to do with steps involved before throwing the exception (e.g. at 11: IndexNode::Build) or as you said with the exception handling itself.

Btw. this backtrace was from RHEL8 with IBM advanced toolchain 15 (gcc 11.4.1), but the error is the same as on Ubuntu/gcc

(gdb) bt
#0  0x00007fff7ef94a7c in pthread_kill () from /opt/at15.0/lib64/glibc-hwcaps/power9/libc.so.6
#1  0x00007fff7ef2ecdc in raise () from /opt/at15.0/lib64/glibc-hwcaps/power9/libc.so.6
#2  0x00007fff7ef0c554 in abort () from /opt/at15.0/lib64/glibc-hwcaps/power9/libc.so.6
#3  0x00007fff7f2944a8 in __gnu_cxx::__verbose_terminate_handler() () from /opt/at15.0/lib64/glibc-hwcaps/power9/libstdc++.so.6.0.29
#4  0x00007fff7f28fb84 in ?? () from /opt/at15.0/lib64/glibc-hwcaps/power9/libstdc++.so.6.0.29
#5  0x00007fff7f28db78 in ?? () from /opt/at15.0/lib64/glibc-hwcaps/power9/libstdc++.so.6.0.29
#6  0x00007fff7f28eee8 in __gxx_personality_v0 () from /opt/at15.0/lib64/glibc-hwcaps/power9/libstdc++.so.6.0.29
#7  0x00007fff81b4e5e8 in _Unwind_Phase2 (context=0x7fffda4d5190, exception_object=0x30452ee0) at /root/.conan/data/libunwind/1.6.2/_/_/build/b68b207efa3d35074600f068c0f047030ce18960/src/src/unwind/unwind-internal.h:118
#8  _Unwind_Resume (exception_object=0x30452ee0) at /root/.conan/data/libunwind/1.6.2/_/_/build/b68b207efa3d35074600f068c0f047030ce18960/src/src/unwind/Resume.c:37
#9  0x00007fff8156e78c in faiss::IndexBinaryIVF::train (this=0x30344910, n=1000, x=0x30333db0 "%P`\022IN<<\017-\017\n\005.W!<\016GA\002\005aHT^\025") at /root/git/knowhere/thirdparty/faiss/faiss/IndexBinaryIVF.cpp:323
#10 0x00007fff81538744 in knowhere::IvfIndexNode<faiss::IndexBinaryIVF>::Train (this=0x3042ec50, dataset=..., cfg=...) at /root/git/knowhere/src/index/ivf/ivf.cc:330
#11 0x00007fff81480600 in knowhere::IndexNode::Build (this=0x3042ec50, dataset=..., cfg=...) at /root/git/knowhere/include/knowhere/index_node.h:41
#12 0x00007fff8146a8a8 in knowhere::Index<knowhere::IndexNode>::Build (this=0x7fffda4d6c20, dataset=..., json=...) at /root/git/knowhere/src/common/index.cc:44
#13 0x000000001009863c in CATCH2_INTERNAL_TEST_26 () at /root/git/knowhere/tests/ut/test_mmap.cc:382
#14 0x00000000101baea4 in Catch::TestInvokerAsFunction::invoke (this=0x30320370) at src/catch2/internal/catch_test_case_registry_impl.cpp:149
#15 0x00000000101a9570 in Catch::TestCaseHandle::invoke (this=0x3033c070) at src/catch2/../catch2/catch_test_case_info.hpp:115
#16 0x00000000101a81e0 in Catch::RunContext::invokeActiveTestCase (this=0x7fffda4d75e0) at src/catch2/internal/catch_run_context.cpp:541
#17 0x00000000101a7e94 in Catch::RunContext::runCurrentTest (this=0x7fffda4d75e0, redirectedCout=..., redirectedCerr=...) at src/catch2/internal/catch_run_context.cpp:504
#18 0x00000000101a5fc4 in Catch::RunContext::runTest (this=0x7fffda4d75e0, testCase=...) at src/catch2/internal/catch_run_context.cpp:235
#19 0x0000000010124ad0 in Catch::(anonymous namespace)::TestGroup::execute (this=0x7fffda4d75d0) at src/catch2/catch_session.cpp:110
#20 0x0000000010126590 in Catch::Session::runInternal (this=0x7fffda4d7940) at src/catch2/catch_session.cpp:332
#21 0x0000000010125ef0 in Catch::Session::run (this=0x7fffda4d7940) at src/catch2/catch_session.cpp:263
#22 0x000000001011e66c in Catch::Session::run<char> (this=0x7fffda4d7940, argc=1, argv=0x7fffda4d7f08) at src/catch2/../catch2/catch_session.hpp:41
#23 0x000000001011e48c in main (argc=1, argv=0x7fffda4d7f08) at src/catch2/internal/catch_main.cpp:36

When using gdb and setting breakpoint to FAISS_THROW_MSG() to go step-by-step on x86 & ppc64le the Intel system handled the exception correctly (going to https://github.com/zilliztech/knowhere/blob/v2.2.2/src/index/ivf/ivf.cc#L333) whereas Power threw the SIGABRT (via libunwind / unwind-internal.h)

Thanks!

alexanderguzhva commented 11 months ago

@mgiessing , yep, this is exactly what I see. Basically, the following code fails:

struct FaissException : public std::exception { ... };

...
try {
    throw FaissException("foo");
}
catch(std::exception& e) {
    // never reaches this point of execution
}
catch(...) {
    // and not even this one
}

alexanderguzhva commented 11 months ago

Nevertheless, it should not affect knowhere's correctness. I bet it is related to the wrong system libraries being used somewhere.

alexanderguzhva commented 11 months ago

@mgiessing would you be able to try to compile using the most recent gcc or even clang-17 ? I see certain things on the internet related to libunwind issues

mgiessing commented 11 months ago

@alexanderguzhva just rebased my PR

Yeah, let me try to use newer gcc/clang

alexanderguzhva commented 11 months ago

@mgiessing meanwhile, I'm trying to rebuild a newer version of libunwind and use it. They have specific instructions for this https://github.com/libunwind/libunwind#building-for-powerpc64--linux

mgiessing commented 11 months ago

@alexanderguzhva A few updates:

alexanderguzhva commented 11 months ago

@mgiessing clang is doable, let me do that in qemu also, compiling libunwind in -O0 mode for gcc 9 did not help

alexanderguzhva commented 11 months ago

@mgiessing it takes forever in qemu to build it, so meanwhile you could try the following:

mgiessing commented 11 months ago

@alexanderguzhva I tried to build using clang17 as you indicated but still get errors related to boost:

$ conan install .. --build=missing -o with_ut=True -s compiler.libcxx=libc++ -s build_type=Release

[...]
boost/1.83.0 package(): Packaged 1 '.txt' file: LICENSE_1_0.txt
boost/1.83.0: Package 'bc91fdd79a2ee1b53469e3e1522fa757c4cb5e5e' created
boost/1.83.0: Created package revision b19e2df1c529a5d1693a8096c7897504
boost/1.83.0: WARN: Boost component 'math_c99l' is missing libraries. Try building boost with '-o boost:without_math_c99l'. (Option is not guaranteed to exist)
boost/1.83.0: WARN: Boost component 'math_tr1l' is missing libraries. Try building boost with '-o boost:without_math_tr1l'. (Option is not guaranteed to exist)
boost/1.83.0: WARN: Boost component 'stacktrace_addr2line' is missing libraries. Try building boost with '-o boost:without_stacktrace_addr2line'. (Option is not guaranteed to exist)
boost/1.83.0: WARN: Boost component 'stacktrace_backtrace' is missing libraries. Try building boost with '-o boost:without_stacktrace_backtrace'. (Option is not guaranteed to exist)
ERROR: boost/1.83.0: Error in package_info() method, line 1714
    raise ConanException(f"These libraries were expected to be built, but were not built: {non_built}")
    ConanException: These libraries were expected to be built, but were not built: {'boost_math_c99l', 'boost_stacktrace_addr2line', 'boost_stacktrace_backtrace', 'boost_math_tr1l'}

I changed the conan profile to clang:

[settings]
os=Linux
os_build=Linux
arch=ppc64le
arch_build=ppc64le
compiler=clang
compiler.version=17
compiler.libcxx=libstdc++
build_type=Release

Also, I added the mentioned code parts to both sections (vi ~/.conan/data/boost/1.83.0/_/_/export/conanfile.py +290):

[...]
            version_cxx11_standard_json = self._min_compiler_version_default_cxx11
            if version_cxx11_standard_json:
                if Version(self.settings.compiler.version) < version_cxx11_standard_json:
                    self.options.without_fiber = True
                    self.options.without_json = True
                    self.options.without_nowide = True
                    self.options.without_url = True
                    self.options.without_wave = True
                    self.options.without_locale = True
                    self.options.without_math = True
                    self.options.without_graph = True
            else:
                self.options.without_fiber = True
                self.options.without_json = True
                self.options.without_nowide = True
                self.options.without_url = True
                self.options.without_wave = True
                self.options.without_locale = True
                self.options.without_math = True
                self.options.without_graph = True
[...]

Any idea what's going wrong here?

alexanderguzhva commented 11 months ago

@mgiessing Yes,

        if self.settings.compiler.get_safe("cppstd"):
            if not valid_min_cppstd(self, 11):
                self.options.without_fiber = True
                self.options.without_nowide = True
                self.options.without_json = True
                self.options.without_url = True
        else:
            version_cxx11_standard_json = self._min_compiler_version_default_cxx11
            if version_cxx11_standard_json:
                if Version(self.settings.compiler.version) < version_cxx11_standard_json:
                    self.options.without_fiber = True
                    self.options.without_json = True
                    self.options.without_nowide = True
                    self.options.without_url = True
            else:
                self.options.without_fiber = True
                self.options.without_json = True
                self.options.without_nowide = True
                self.options.without_url = True

        # <--- Put the code here :)

Also, it seems that you may need to add something like self.options.without_stacktrace_addr2line=True and self.options.without_stacktrace_backtrace=True and self.options.without_stacktrace=True as well. Alternatively, you may figure out the libraries that boost is missing

mgiessing commented 10 months ago

That worked, the conan install succeeded - thanks!

I'm not very experienced with clang, but the conan build .. command doesn't seem to pick up openmp although it should be there:

[...]
-- Found LAPACK: /usr/lib64/libopenblas.so;-lpthread;-lm;-ldl  
CMake Error at /root/micromamba/lib/python3.9/site-packages/cmake/data/share/cmake-3.27/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES)

Environment variables:

$ rpm -qa | grep libomp
libomp-devel-16.0.6-3.module_el8.9.0+3621+df7f7146.ppc64le
libomp-16.0.6-3.module_el8.9.0+3621+df7f7146.ppc64le

$ env | grep -i clang
LD_LIBRARY_PATH=/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8/lib:
CMAKE_C_COMPILER=/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8/bin/clang
CC=/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8/bin/clang
CMAKE_PREFIX_PATH=/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8
CXX=/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8/bin/clang++
CPPFLAGS=-I/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8/include
CMAKE_CXX_COMPILER=/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8/bin/clang++
LDFLAGS=-L/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8/lib
PATH=/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8/bin:/root/micromamba/bin:/root/micromamba/condabin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/go/bin:/root/bin

I tried to google for it, but this issue mostly seems to occur on macOS, not Linux. I also wonder whether mixing clang17 binaries with libomp-16 is a problem.

alexanderguzhva commented 10 months ago

@mgiessing for ubuntu the fix is sudo apt install libomp5-17 libomp-17-dev. So you need libomp, but version 17.

mgiessing commented 10 months ago

I was able to install via rpmfind, however same error :/

$ rpm -qa | grep libomp
libomp-17.0.2-1.module_el8+721+8e6a0389.ppc64le
libomp-devel-17.0.2-1.module_el8+721+8e6a0389.ppc64le

I might give ubuntu a try tomorrow, however I assume there must be a way to make this run on rpm distros :)

@alexanderguzhva You used ubuntu:20.04 or newer?

alexanderguzhva commented 10 months ago

@mgiessing both ubuntu 22.04 and 20.04

mgiessing commented 10 months ago

May I ask how you installed clang17 on Ubuntu on Power?

Option a) Github release

The github releases are just built for RPM-based distros (RHEL):

https://github.com/llvm/llvm-project/releases/tag/llvmorg-17.0.5

--> just powerpc64le RHEL8.8 (besides AIX)

Option b) Using llvm-toolchain

And going the official way using

wget https://apt.llvm.org/llvm.sh
chmod +x llvm.sh
./llvm.sh 17

doesn't work because there is no deb package for Power :/

See:

// Ubuntu20.04 https://apt.llvm.org/focal/dists/llvm-toolchain-focal-17/main/

// Ubuntu22.04 https://apt.llvm.org/jammy/dists/llvm-toolchain-jammy-17/main/

alexanderguzhva commented 10 months ago

@mgiessing as I'm running on qemu, which is very slow, I've decided to start from clang-15, which is available as a package. Otherwise, I would compile clang-17 from scratch, if needed. Please try clang-15 or clang-14 or earlier versions; let's check if the problem is the GCC compiler itself.

mgiessing commented 10 months ago

I've been able to build knowhere with (system) clang-10 on ubuntu:20.04 but faced the same error:

$ ./Release/tests/ut/knowhere_tests "Search binary mmap"
[...]
terminate called after throwing an instance of 'faiss::FaissException'
  what():  Error in virtual void faiss::IndexBinaryIVF::train(faiss::IndexBinary::idx_t, const uint8_t *) at /knowhere/thirdparty/faiss/faiss/IndexBinaryIVF.cpp:301: IVF not to support Substructure and Superstructure.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
knowhere_tests is a Catch2 v3.3.1 host application.
Run with -? for options

-------------------------------------------------------------------------------
Search binary mmap
  Test Search
-------------------------------------------------------------------------------
/knowhere/tests/ut/test_mmap.cc:372
...............................................................................

/knowhere/tests/ut/test_mmap.cc:376: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  name := "BIN_IVF_FLAT"
  cfg_json := "{"dim":8,"enable_mmap":true,"k":5,"metric_type":
  "SUPERSTRUCTURE","nlist":16,"nprobe":8}"
  SIGABRT - Abort (abnormal termination) signal

===============================================================================
test cases:   2 |   1 passed | 1 failed
assertions: 141 | 140 passed | 1 failed

Aborted (core dumped)

$ clang --version
clang version 10.0.0-4ubuntu1 
Target: powerpc64le-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ env | grep clang
CXX=/usr/bin/clang++
CC=/usr/bin/clang

alexanderguzhva commented 10 months ago

@mgiessing Well, then I'll need to check this case more carefully. At least, I think that you may use Milvus/Knowhere because it does not throw exceptions too often internally :)

mgiessing commented 10 months ago

@alexanderguzhva Yeah, I am able to run milvus (v2.3.1) successfully and have had no core dump so far :) I appreciate your effort looking into this, and I will also ask our internal Linux toolchain team whether they are aware of any libunwind issues on Power.

alexanderguzhva commented 10 months ago

@mgiessing I bet that it is not only libunwind. I've tried a simple standalone throw-catch program, which replicates what happens inside milvus, and I was unable to replicate the issue so far.
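A standalone check of this kind might look roughly like the sketch below. It is not the program that was actually run; `DemoException`, `train_like`, and `run_once` are hypothetical names, and a single-binary test like this does not cover shared-library boundaries or the Conan-provided libunwind seen in the backtrace, which may be why it does not reproduce the abort.

```cpp
#include <cassert>
#include <exception>
#include <string>
#include <utility>

// Hypothetical stand-in for faiss::FaissException.
struct DemoException : public std::exception {
    explicit DemoException(std::string m) : msg(std::move(m)) {}
    const char* what() const noexcept override { return msg.c_str(); }
    std::string msg;
};

// Throws while a local with a destructor is alive, so unwinding
// has a cleanup to run before it reaches the handler.
void train_like() {
    std::string scratch = "local with a destructor";
    throw DemoException("demo failure in " + scratch);
}

enum class Caught { AsStdException, AsEllipsis, None };

// Reports which handler, if any, received the exception.
Caught run_once() {
    try {
        train_like();
    } catch (const std::exception&) {
        return Caught::AsStdException;
    } catch (...) {
        return Caught::AsEllipsis;
    }
    return Caught::None;
}
```

On a healthy toolchain `run_once()` returns `Caught::AsStdException`. The behaviour reported in this thread corresponds to neither handler being reached and the process aborting instead.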

sumitd2 commented 9 months ago

@mgiessing I am from the IBM Power porting team. If your sole purpose is to build Milvus, we have a port available here: https://github.com/ppc64le/build-scripts/pull/3467

I built knowhere (v2.2.2) as a part of milvus v2.3.3 on 22.04 Power, and ran the tests. Got this:

...
...
-------------------------------------------------------------------------------
Knowhere SIMD config
-------------------------------------------------------------------------------
/sumit/milvus/cmake_build/thirdparty/knowhere/knowhere-src/tests/ut/test_knowhere_init.cc:44
...............................................................................
/sumit/milvus/cmake_build/thirdparty/knowhere/knowhere-src/tests/ut/test_knowhere_init.cc:48: FAILED:
  REQUIRE( s.find(res) != s.end() )
with expansion:
  {?} != {?}
...
...
I20240110 11:12:58.496469 382136 time_recorder.cc:49] [KNOWHERE][PrintTimeRecord][knowhere_tests] Build index: done (2452.732220 ms)
I20240110 11:13:00.087837 382136 time_recorder.cc:49] [KNOWHERE][PrintTimeRecord][knowhere_tests] test: done (0.000184 ms)
===============================================================================
test cases:        30 |        29 passed | 1 failed
assertions: 115116696 | 115116695 passed | 1 failed

mgiessing commented 9 months ago

Hey @sumitd2 , thanks for your comment - I appreciate your effort. Milvus itself is running fine, only the test case of knowhere is causing a core dump (SIGABRT) as stated in this thread above.

From your code snippet I cannot see whether you were able to recreate that core dump (although it looks like it, given the failed test case).

Do you see any of these in your test?

[...]
  SIGABRT - Abort (abnormal termination) signal
[...]
Aborted (core dumped)

Thank you!

sumitd2 commented 9 months ago

@mgiessing No, I did not see the core dump

sumitd2 commented 9 months ago

Also, can you please try libunwind/1.7.2 and "libunwind:shared": True in conanfile.py. I remember having seen the libunwind crash. You may also have to add gtest/1.14.0

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.