Closed: mgiessing closed this issue 7 months ago
Looks like this UT is not runnable on any host other than x86. We will fix this in the release. Also, we would be grateful if you could fix it and contribute the fix to the Knowhere repo.
/assign @Presburger
Well, I tested this on ubuntu:20.04 (aarch64) as well as almalinux:8 (aarch64), and the tests run successfully on that architecture. I would like to contribute a PR to make this work on the ppc64le architecture, but currently I have no idea why this fails :/ That's why I'm asking if you guys have an idea :)
However, I will create a PR to add basic ppc64le support, since the build process does work.
@mgiessing Hi There is a large amount of architecture-related intrinsic function code inside 'knowhere'. Currently, supporting PPC is not cost-effective for us. Thank you.
This is totally understandable and I don't expect you to implement ppc64le specific vector code etc. :)
My initial question was more about whether you know why this test might fail, because I could not make any sense of it, and the only intrinsic code I found was distances_{SSE/AVX/AVX2/AVX512/NEON}.cc (NEON doesn't even exist in v2.2.1, I think) and some in the third-party lib FAISS.
As I said, I will create a PR to add basic ppc64le support, and if I get more time I will implement VSX for accelerated SIMD operations.
Thanks!
Thanks for your contribution! I will hold this util:
This is because when knowhere is initialized, the SIMD level is set dynamically based on the CPU's instruction set; you can refer to hook.cc. Currently we only support x86 and ARM, so no supported SIMD level could be found on your ppc64le machine, and thus the UT fails. BTW, the current UT needs to be fixed by simply adding NEON.
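The init-time dispatch described above can be sketched roughly as follows. This is a minimal illustration and not the actual contents of hook.cc: the names `SimdLevel`, `detect_simd_level`, `setup_hooks`, `inner_product_hook` and `inner_product_ref` are hypothetical. It only shows the general idea that a global function pointer is assigned once at init, and that an architecture without a dedicated branch (such as ppc64le) needs a scalar fallback so the hook is never left unset.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Scalar reference kernel that works on any architecture (hypothetical name).
float inner_product_ref(const float* x, const float* y, std::size_t d) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < d; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// Hypothetical SIMD levels; a real library distinguishes SSE, AVX2, NEON, etc.
enum class SimdLevel { Generic, X86Simd, ArmNeon };

// Compile-time architecture probe. A real implementation would additionally
// query CPU features at runtime; that part is omitted in this sketch.
SimdLevel detect_simd_level() {
#if defined(__x86_64__) || defined(_M_X64)
    return SimdLevel::X86Simd;
#elif defined(__aarch64__)
    return SimdLevel::ArmNeon;
#else
    return SimdLevel::Generic;  // e.g. __powerpc64__
#endif
}

// Global hook: callers always go through this pointer.
using InnerProductFn = float (*)(const float*, const float*, std::size_t);
InnerProductFn inner_product_hook = nullptr;
std::string simd_type_name;

// Assign the hook once at init time.
void setup_hooks(SimdLevel level) {
    switch (level) {
        case SimdLevel::X86Simd:
            // A real build would point at an SSE/AVX kernel here.
            inner_product_hook = inner_product_ref;
            simd_type_name = "X86";
            break;
        case SimdLevel::ArmNeon:
            // A real build would point at a NEON kernel here.
            inner_product_hook = inner_product_ref;
            simd_type_name = "NEON";
            break;
        default:
            // Fallback for architectures without a dedicated branch.
            inner_product_hook = inner_product_ref;
            simd_type_name = "GENERIC";
            break;
    }
    assert(inner_product_hook != nullptr);
}
```

Calling `setup_hooks(detect_simd_level())` once at startup would then leave `inner_product_hook` pointing at a usable kernel on every architecture, with the reported type name falling back to "GENERIC" where nothing faster is available.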
@mgiessing Could you please rebase your #162 on top of #163 which got merged? Thanks
@alexanderguzhva I just rebased my PR :-)
@chasingegg Thanks a lot for your hint. After adding a ppc64le section (using the _ref implementations; I'll add intrinsics later) to src/simd/hook.cc yesterday, knowhere now gets initialized correctly :)
+
+#if defined(__powerpc64__)
+ fvec_inner_product = fvec_inner_product_ref;
+ fvec_L2sqr = fvec_L2sqr_ref;
+ fvec_L1 = fvec_L1_ref;
+ fvec_Linf = fvec_Linf_ref;
+
+ fvec_norm_L2sqr = fvec_norm_L2sqr_ref;
+ fvec_L2sqr_ny = fvec_L2sqr_ny_ref;
+ fvec_inner_products_ny = fvec_inner_products_ny_ref;
+ fvec_madd = fvec_madd_ref;
+ fvec_madd_and_argmin = fvec_madd_and_argmin_ref;
+
+ simd_type = "GENERIC";
+ support_pq_fast_scan = false;
+#endif
}
However, it now fails during the "Search binary mmap" test. I'll try to spend some time this evening/week to debug that further:
[...]
I1025 08:34:52.465216 195767 factory.cc:20] [KNOWHERE][Create][knowhere_tests] create knowhere index BIN_IVF_FLAT with version 1
{"dim":8,"enable_mmap":true,"k":5,"metric_type":"SUPERSTRUCTURE","nlist":16,"nprobe":8}
terminate called after throwing an instance of 'faiss::FaissException'
what(): Error in virtual void faiss::IndexBinaryIVF::train(faiss::IndexBinary::idx_t, const uint8_t*) at /knowhere/thirdparty/faiss/faiss/IndexBinaryIVF.cpp:300: IVF not to support Substructure and Superstructure.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
knowhere_tests is a Catch2 v3.3.1 host application.
Run with -? for options
-------------------------------------------------------------------------------
Search binary mmap
Test Search
-------------------------------------------------------------------------------
/knowhere/tests/ut/test_mmap.cc:372
...............................................................................
/knowhere/tests/ut/test_mmap.cc:373: FAILED:
{Unknown expression after the reported line}
due to a fatal error condition:
name := "BIN_IVF_FLAT"
cfg_json := "{"dim":8,"enable_mmap":true,"k":5,"metric_type":
"SUPERSTRUCTURE","nlist":16,"nprobe":8}"
SIGABRT - Abort (abnormal termination) signal
===============================================================================
test cases: 15 | 14 passed | 1 failed
assertions: 112368250 | 112368249 passed | 1 failed
Aborted (core dumped)
Thanks for all your help so far!
Hi @mgiessing , sorry for being late. I've tried reproducing this issue on QEMU ppc64 and it seems that it can be reproduced! So, basically, there's something wrong with the exception handling O_o (to my BIG Surprise), something non-trivial. I'll take a further look. Meanwhile, please feel free to rebase your PR on top of the master branch, including changes for the hook and I'll accept your change. Thanks.
No problem - I appreciate your effort looking into this :)
I also tried to debug a bit further with gdb; however, I wasn't entirely sure whether this had something to do with the steps before the exception is thrown (e.g. at #11: IndexNode::Build) or, as you said, with the exception handling itself.
Btw., this backtrace is from RHEL8 with IBM Advance Toolchain 15 (gcc 11.4.1), but the error is the same as on Ubuntu/gcc:
(gdb) bt
#0 0x00007fff7ef94a7c in pthread_kill () from /opt/at15.0/lib64/glibc-hwcaps/power9/libc.so.6
#1 0x00007fff7ef2ecdc in raise () from /opt/at15.0/lib64/glibc-hwcaps/power9/libc.so.6
#2 0x00007fff7ef0c554 in abort () from /opt/at15.0/lib64/glibc-hwcaps/power9/libc.so.6
#3 0x00007fff7f2944a8 in __gnu_cxx::__verbose_terminate_handler() () from /opt/at15.0/lib64/glibc-hwcaps/power9/libstdc++.so.6.0.29
#4 0x00007fff7f28fb84 in ?? () from /opt/at15.0/lib64/glibc-hwcaps/power9/libstdc++.so.6.0.29
#5 0x00007fff7f28db78 in ?? () from /opt/at15.0/lib64/glibc-hwcaps/power9/libstdc++.so.6.0.29
#6 0x00007fff7f28eee8 in __gxx_personality_v0 () from /opt/at15.0/lib64/glibc-hwcaps/power9/libstdc++.so.6.0.29
#7 0x00007fff81b4e5e8 in _Unwind_Phase2 (context=0x7fffda4d5190, exception_object=0x30452ee0) at /root/.conan/data/libunwind/1.6.2/_/_/build/b68b207efa3d35074600f068c0f047030ce18960/src/src/unwind/unwind-internal.h:118
#8 _Unwind_Resume (exception_object=0x30452ee0) at /root/.conan/data/libunwind/1.6.2/_/_/build/b68b207efa3d35074600f068c0f047030ce18960/src/src/unwind/Resume.c:37
#9 0x00007fff8156e78c in faiss::IndexBinaryIVF::train (this=0x30344910, n=1000, x=0x30333db0 "%P`\022IN<<\017-\017\n\005.W!<\016GA\002\005aHT^\025") at /root/git/knowhere/thirdparty/faiss/faiss/IndexBinaryIVF.cpp:323
#10 0x00007fff81538744 in knowhere::IvfIndexNode<faiss::IndexBinaryIVF>::Train (this=0x3042ec50, dataset=..., cfg=...) at /root/git/knowhere/src/index/ivf/ivf.cc:330
#11 0x00007fff81480600 in knowhere::IndexNode::Build (this=0x3042ec50, dataset=..., cfg=...) at /root/git/knowhere/include/knowhere/index_node.h:41
#12 0x00007fff8146a8a8 in knowhere::Index<knowhere::IndexNode>::Build (this=0x7fffda4d6c20, dataset=..., json=...) at /root/git/knowhere/src/common/index.cc:44
#13 0x000000001009863c in CATCH2_INTERNAL_TEST_26 () at /root/git/knowhere/tests/ut/test_mmap.cc:382
#14 0x00000000101baea4 in Catch::TestInvokerAsFunction::invoke (this=0x30320370) at src/catch2/internal/catch_test_case_registry_impl.cpp:149
#15 0x00000000101a9570 in Catch::TestCaseHandle::invoke (this=0x3033c070) at src/catch2/../catch2/catch_test_case_info.hpp:115
#16 0x00000000101a81e0 in Catch::RunContext::invokeActiveTestCase (this=0x7fffda4d75e0) at src/catch2/internal/catch_run_context.cpp:541
#17 0x00000000101a7e94 in Catch::RunContext::runCurrentTest (this=0x7fffda4d75e0, redirectedCout=..., redirectedCerr=...) at src/catch2/internal/catch_run_context.cpp:504
#18 0x00000000101a5fc4 in Catch::RunContext::runTest (this=0x7fffda4d75e0, testCase=...) at src/catch2/internal/catch_run_context.cpp:235
#19 0x0000000010124ad0 in Catch::(anonymous namespace)::TestGroup::execute (this=0x7fffda4d75d0) at src/catch2/catch_session.cpp:110
#20 0x0000000010126590 in Catch::Session::runInternal (this=0x7fffda4d7940) at src/catch2/catch_session.cpp:332
#21 0x0000000010125ef0 in Catch::Session::run (this=0x7fffda4d7940) at src/catch2/catch_session.cpp:263
#22 0x000000001011e66c in Catch::Session::run<char> (this=0x7fffda4d7940, argc=1, argv=0x7fffda4d7f08) at src/catch2/../catch2/catch_session.hpp:41
#23 0x000000001011e48c in main (argc=1, argv=0x7fffda4d7f08) at src/catch2/internal/catch_main.cpp:36
When using gdb and setting a breakpoint at FAISS_THROW_MSG() to step through on x86 and ppc64le, the Intel system handled the exception correctly (going to https://github.com/zilliztech/knowhere/blob/v2.2.2/src/index/ivf/ivf.cc#L333), whereas Power raised SIGABRT (via libunwind / unwind-internal.h).
Thanks!
@mgiessing , yep, this is exactly what I see. Basically, the following code fails:
struct FaissException : public std::exception { ... };
...
try {
    throw FaissException("foo");
}
catch(std::exception& e) {
    // never reaches this point of execution
}
catch(...) {
    // and not even this one
}
Nevertheless, it should not affect the knowhere correctness. I bet that it is related to using some wrong system libraries somewhere
@mgiessing would you be able to try to compile using the most recent gcc or even clang-17 ? I see certain things on the internet related to libunwind issues
@alexanderguzhva just rebased my PR
Yeah, let me try to use newer gcc/clang
@mgiessing meanwhile, I'm trying to rebuild a newer version of libunwind and use it. They have specific instructions for this https://github.com/libunwind/libunwind#building-for-powerpc64--linux
@alexanderguzhva A few updates:
conan install ...
@mgiessing clang is doable, let me do that in qemu also, compiling libunwind in -O0 mode for gcc 9 did not help
@mgiessing it takes forever in qemu to build it, so meanwhile you could try the following: edit ~/.conan/settings.yml and add "17" to the corresponding list of clang versions.
@alexanderguzhva I tried to build using clang17 as you indicated, but I still get errors related to boost:
$ conan install .. --build=missing -o with_ut=True -s compiler.libcxx=libc++ -s build_type=Release
[...]
boost/1.83.0 package(): Packaged 1 '.txt' file: LICENSE_1_0.txt
boost/1.83.0: Package 'bc91fdd79a2ee1b53469e3e1522fa757c4cb5e5e' created
boost/1.83.0: Created package revision b19e2df1c529a5d1693a8096c7897504
boost/1.83.0: WARN: Boost component 'math_c99l' is missing libraries. Try building boost with '-o boost:without_math_c99l'. (Option is not guaranteed to exist)
boost/1.83.0: WARN: Boost component 'math_tr1l' is missing libraries. Try building boost with '-o boost:without_math_tr1l'. (Option is not guaranteed to exist)
boost/1.83.0: WARN: Boost component 'stacktrace_addr2line' is missing libraries. Try building boost with '-o boost:without_stacktrace_addr2line'. (Option is not guaranteed to exist)
boost/1.83.0: WARN: Boost component 'stacktrace_backtrace' is missing libraries. Try building boost with '-o boost:without_stacktrace_backtrace'. (Option is not guaranteed to exist)
ERROR: boost/1.83.0: Error in package_info() method, line 1714
raise ConanException(f"These libraries were expected to be built, but were not built: {non_built}")
ConanException: These libraries were expected to be built, but were not built: {'boost_math_c99l', 'boost_stacktrace_addr2line', 'boost_stacktrace_backtrace', 'boost_math_tr1l'}
I changed the conan profile to clang:
[settings]
os=Linux
os_build=Linux
arch=ppc64le
arch_build=ppc64le
compiler=clang
compiler.version=17
compiler.libcxx=libstdc++
build_type=Release
Also, I added the mentioned code parts (vi ~/.conan/data/boost/1.83.0/_/_/export/conanfile.py +290) to both sections:
[...]
version_cxx11_standard_json = self._min_compiler_version_default_cxx11
if version_cxx11_standard_json:
    if Version(self.settings.compiler.version) < version_cxx11_standard_json:
        self.options.without_fiber = True
        self.options.without_json = True
        self.options.without_nowide = True
        self.options.without_url = True
        self.options.without_wave=True
        self.options.without_locale=True
        self.options.without_math=True
        self.options.without_graph=True
else:
    self.options.without_fiber = True
    self.options.without_json = True
    self.options.without_nowide = True
    self.options.without_url = True
    self.options.without_wave=True
    self.options.without_wave=True
    self.options.without_locale=True
    self.options.without_math=True
    self.options.without_graph=True
[...]
Any idea what's going wrong here?
@mgiessing Yes,
if self.settings.compiler.get_safe("cppstd"):
    if not valid_min_cppstd(self, 11):
        self.options.without_fiber = True
        self.options.without_nowide = True
        self.options.without_json = True
        self.options.without_url = True
else:
    version_cxx11_standard_json = self._min_compiler_version_default_cxx11
    if version_cxx11_standard_json:
        if Version(self.settings.compiler.version) < version_cxx11_standard_json:
            self.options.without_fiber = True
            self.options.without_json = True
            self.options.without_nowide = True
            self.options.without_url = True
    else:
        self.options.without_fiber = True
        self.options.without_json = True
        self.options.without_nowide = True
        self.options.without_url = True
// <---Put the code here :)
Also, it seems that you may need to add something like self.options.without_stacktrace_addr2line=True and self.options.without_stacktrace_backtrace=True and self.options.without_stacktrace=True as well. Alternatively, you may figure out the libraries that boost is missing.
That worked, the conan install succeeded - thanks!
I'm not very experienced with clang, but the conan build .. command doesn't seem to pick up OpenMP, although it should be there:
[...]
-- Found LAPACK: /usr/lib64/libopenblas.so;-lpthread;-lm;-ldl
CMake Error at /root/micromamba/lib/python3.9/site-packages/cmake/data/share/cmake-3.27/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES)
Environment variables:
$ rpm -qa | grep libomp
libomp-devel-16.0.6-3.module_el8.9.0+3621+df7f7146.ppc64le
libomp-16.0.6-3.module_el8.9.0+3621+df7f7146.ppc64le
$ env | grep -i clang
LD_LIBRARY_PATH=/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8/lib:
CMAKE_C_COMPILER=/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8/bin/clang
CC=/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8/bin/clang
CMAKE_PREFIX_PATH=/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8
CXX=/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8/bin/clang++
CPPFLAGS=-I/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8/include
CMAKE_CXX_COMPILER=/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8/bin/clang++
LDFLAGS=-L/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8/lib
PATH=/opt/clang+llvm-17.0.5-powerpc64le-linux-rhel-8.8/bin:/root/micromamba/bin:/root/micromamba/condabin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/go/bin:/root/bin
I tried to google it, but this issue mostly seems to occur on macOS rather than Linux. I also wonder whether combining the clang17 binaries with libomp-16 is a problem.
@mgiessing for ubuntu the fix is sudo apt install libomp5-17 libomp-17-dev. So, you need libomp, but version 17.
I was able to install via rpmfind, however same error :/
$ rpm -qa | grep libomp
libomp-17.0.2-1.module_el8+721+8e6a0389.ppc64le
libomp-devel-17.0.2-1.module_el8+721+8e6a0389.ppc64le
I might give ubuntu a try tomorrow, however I assume there must be a way to make this run on rpm distros :)
@alexanderguzhva You used ubuntu:20.04 or newer?
@mgiessing both ubuntu 22.04 and 20.04
May I ask how you installed clang17 on Ubuntu on Power?
The github releases are just built for RPM-based distros (RHEL):
https://github.com/llvm/llvm-project/releases/tag/llvmorg-17.0.5
--> just powerpc64le RHEL8.8 (besides AIX)
And going the official way using
wget https://apt.llvm.org/llvm.sh
chmod +x llvm.sh
./llvm.sh 17
doesn't work because there is no deb package for Power :/
See:
// Ubuntu20.04 https://apt.llvm.org/focal/dists/llvm-toolchain-focal-17/main/
// Ubuntu22.04 https://apt.llvm.org/jammy/dists/llvm-toolchain-jammy-17/main/
@mgiessing as I'm running on qemu, which is very slow, I've decided to start from clang-15, which is available as a package. Otherwise, I would compile clang-17 from scratch, if needed. Please try clang-15, clang-14, or earlier versions; let's check whether the problem is the GCC compiler itself.
I've been able to build knowhere with (system) clang-10 on ubuntu:20.04 but faced the same error:
$ ./Release/tests/ut/knowhere_tests "Search binary mmap"
[...]
terminate called after throwing an instance of 'faiss::FaissException'
what(): Error in virtual void faiss::IndexBinaryIVF::train(faiss::IndexBinary::idx_t, const uint8_t *) at /knowhere/thirdparty/faiss/faiss/IndexBinaryIVF.cpp:301: IVF not to support Substructure and Superstructure.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
knowhere_tests is a Catch2 v3.3.1 host application.
Run with -? for options
-------------------------------------------------------------------------------
Search binary mmap
Test Search
-------------------------------------------------------------------------------
/knowhere/tests/ut/test_mmap.cc:372
...............................................................................
/knowhere/tests/ut/test_mmap.cc:376: FAILED:
{Unknown expression after the reported line}
due to a fatal error condition:
name := "BIN_IVF_FLAT"
cfg_json := "{"dim":8,"enable_mmap":true,"k":5,"metric_type":
"SUPERSTRUCTURE","nlist":16,"nprobe":8}"
SIGABRT - Abort (abnormal termination) signal
===============================================================================
test cases: 2 | 1 passed | 1 failed
assertions: 141 | 140 passed | 1 failed
Aborted (core dumped)
$ clang --version
clang version 10.0.0-4ubuntu1
Target: powerpc64le-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ env | grep clang
CXX=/usr/bin/clang++
CC=/usr/bin/clang
@mgiessing Well, then I'll need to check this case more carefully. At least, I think that you may use Milvus/Knowhere because it does not throw exceptions too often internally :)
@alexanderguzhva Yeah, I am able to run milvus (v2.3.1) successfully and have had no core dump so far :) I appreciate your effort looking into this and will also ask our internal Linux toolchain team whether they are aware of any libunwind issues on Power.
@mgiessing I bet that it is not only libunwind. I've tried a simple standalone throw-catch program, which replicates what happens inside milvus, and I was unable to replicate the issue so far.
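For reference, a standalone throw-catch check of the kind mentioned above might look like the sketch below. It is not the actual program used in this thread: `DemoException` is a hypothetical stand-in for `faiss::FaissException`, and `throw_and_catch` is an illustrative helper. On a healthy toolchain the exception is expected to land in the first handler; the failure discussed in this issue would instead end in `std::terminate` before either handler runs.

```cpp
#include <exception>
#include <string>

// Hypothetical stand-in for faiss::FaissException.
struct DemoException : public std::exception {
    explicit DemoException(const std::string& m) : msg(m) {}
    const char* what() const noexcept override { return msg.c_str(); }
    std::string msg;
};

// Returns 1 if the exception was caught as std::exception&,
// 2 if it was only caught by catch (...), and 0 if nothing was thrown.
int throw_and_catch(std::string* message_out) {
    int result = 0;
    try {
        throw DemoException("foo");
    } catch (const std::exception& e) {
        if (message_out != nullptr) {
            *message_out = e.what();
        }
        result = 1;
    } catch (...) {
        result = 2;
    }
    return result;
}
```

Building such a check with the same compiler and flags, and linking it against the same libunwind as the knowhere test binary, may help narrow down whether the toolchain or the link setup is responsible; as noted above, a plain standalone program did not reproduce the problem.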
@mgiessing I am from the IBM Power porting team. If your sole purpose is to build Milvus, we have a port available here: https://github.com/ppc64le/build-scripts/pull/3467
I built knowhere (v2.2.2) as a part of milvus v2.3.3 on 22.04 Power, and ran the tests. Got this:
...
...
-------------------------------------------------------------------------------
Knowhere SIMD config
-------------------------------------------------------------------------------
/sumit/milvus/cmake_build/thirdparty/knowhere/knowhere-src/tests/ut/test_knowhere_init.cc:44
...............................................................................
/sumit/milvus/cmake_build/thirdparty/knowhere/knowhere-src/tests/ut/test_knowhere_init.cc:48: FAILED:
REQUIRE( s.find(res) != s.end() )
with expansion:
{?} != {?}
...
...
I20240110 11:12:58.496469 382136 time_recorder.cc:49] [KNOWHERE][PrintTimeRecord][knowhere_tests] Build index: done (2452.732220 ms)
I20240110 11:13:00.087837 382136 time_recorder.cc:49] [KNOWHERE][PrintTimeRecord][knowhere_tests] test: done (0.000184 ms)
===============================================================================
test cases: 30 | 29 passed | 1 failed
assertions: 115116696 | 115116695 passed | 1 failed
Hey @sumitd2, thanks for your comment - I appreciate your effort. Milvus itself is running fine; only the knowhere test case is causing a core dump (SIGABRT), as stated in this thread above.
From your code snippet I cannot see whether you were able to reproduce that core dump (although it looks like it, given the failed test case).
Do you see any of these in your test?
[...]
SIGABRT - Abort (abnormal termination) signal
[...]
Aborted (core dumped)
Thank you!
@mgiessing No, I did not see the core dump
Also, can you please try libunwind/1.7.2 and "libunwind:shared": True in conanfile.py. I remember having seen the libunwind crash. You may also have to add gtest/1.14.0
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
Hi, I want to build knowhere (as part of milvus) for ppc64le and with minimal changes I'm able to successfully build it. However, when I then run tests it'll fail with this SIGABRT message:
The only changes to the git are the following to enable ppc64 builds:
I'm aware that there will be no SIMD acceleration and only scalar computation is used. I've also tested the exact same code on ubuntu:20.04-aarch64 and there the tests finish successfully.
Anyone know what could be the issue or how to properly debug this?
Thanks!
Information on system & build
OS: ubuntu:20.04
arch: ppc64le
gcc: 9.4.0 (ubuntu 20.04 build-essential default)
knowhere version: v2.2.1