wlav / cppyy

Call segfaulting only on linux #169

Open sophiehourihane opened 1 year ago

sophiehourihane commented 1 year ago

I am not sure of the best way to frame this, but I am testing a build function here:

import unittest
import bayeswavecpp_bindings.autoload_cppyy
import cppyy.gbl as Cpp # here is where bayeswave functions are loaded
import numpy as np

class runBuilderTests(unittest.TestCase):
    def setUp(self):

        split_command_line = COMMAND_LINE.split()
        # RunBuilder's arguments are (int argc, char** argv),
        # i.e. the number of argument strings and a pointer to the array of C strings.
        # Shockingly, with cppyy this is as easy as passing the list of argument strings and its length.
        # Note: Cpp.std.make_unique is not called on dataBuilder.
        dataBuilder = Cpp.LalDataBuilder(len(split_command_line), split_command_line)
        self.runBuilder = Cpp.RunBuilder(len(split_command_line), split_command_line, Cpp.std.move(dataBuilder))
        self.runBuilder.__python_owns__ = False
        self.run = self.runBuilder.build()
        self.run.__python_owns__ = False

    def test_evolve_run(self):
        """
        Running bayeswave from python
        :return:
        """
        print("Testing running full MCMC")
        self.run.evolveStateAllCycles()

The same code in C++ looks like this:

int main(int argc, char** argv) {
  Version::printCodeVersion(std::cout);

  if (argc == 2) {
    RunBuilder::printHelpMessage();
    return 0;
  }
  auto dataBuilder = std::make_unique<LalDataBuilder>(argc, argv);
  RunBuilder runBuilder{argc, argv, std::move(dataBuilder)};
  auto run = runBuilder.build();
  run->evolveStateAllCycles();
}

When I run the test case on my laptop (Mac) it runs perfectly, but when I run it on a cluster (Linux) the test segfaults when runBuilder.build() is called.

Inside RunBuilder, the constructor and build() look like this:

RunBuilder::RunBuilder(int argc, char** argv, std::unique_ptr<Builder<Data>>&& dataBuilder) : commandLineInput_{argc, argv}, dataBuilder_{std::move(dataBuilder)} {
  // data_ and chainCollection_ are default constructed as null pointers
  if (dataBuilder_ == nullptr) {
    throw std::invalid_argument{
        "Attempted to construct a RunBuilder with a null dataBuilder; "
        "to construct a RunBuilder, pass in a non-null rvalue reference to a std::unique_ptr<Builder<Data>> containing the object with which the RunBuilder can build its Run's Data"};
  }

  // TODO: LALInferenceReadData does not make t-domain data when simulating data
}
std::unique_ptr<Run> RunBuilder::build() {
  if (hasAlreadyBuilt_) {
    throw std::logic_error("Called build() on a RunBuilder multiple times; build() may only be called once");
  }

  data_ = dataBuilder_->build();
...

RunBuilder::build() is supposed to call dataBuilder_->build(). I added print statements, and nothing is printed from within `dataBuilder->build()`, so it looks like runBuilder is unable to call dataBuilder in the first place. However, the actual traceback has little to do with dataBuilder.

The traceback is this:

(ame) [sophie.hourihane@ldas-pcdev1 python_binding_tests]$ python test_cppyy_RunBuilder.py
.Set trigtime to 1168989748.0000000000
Using 0.400000 seconds of padding for IFO H1
Using 0.400000 seconds of padding for IFO L1
 *** Break *** segmentation violation

Thread 8 (Thread 0x7f1d4a605700 (LWP 867381)):
#0  0x00007f1d85c5b45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1d7749406c in blas_thread_server () from /home/sophie.hourihane/.conda/envs/ame/lib/././libcblas.so.3
#2  0x00007f1d85c551ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1d84f2fe73 in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f1d52e06700 (LWP 867380)):
#0  0x00007f1d85c5b45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1d7749406c in blas_thread_server () from /home/sophie.hourihane/.conda/envs/ame/lib/././libcblas.so.3
#2  0x00007f1d85c551ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1d84f2fe73 in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f1d5b607700 (LWP 867379)):
#0  0x00007f1d85c5b45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1d7749406c in blas_thread_server () from /home/sophie.hourihane/.conda/envs/ame/lib/././libcblas.so.3
#2  0x00007f1d85c551ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1d84f2fe73 in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f1d63e08700 (LWP 867378)):
#0  0x00007f1d85c5b45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1d7749406c in blas_thread_server () from /home/sophie.hourihane/.conda/envs/ame/lib/././libcblas.so.3
#2  0x00007f1d85c551ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1d84f2fe73 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f1d6c609700 (LWP 867377)):
#0  0x00007f1d85c5b45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1d7749406c in blas_thread_server () from /home/sophie.hourihane/.conda/envs/ame/lib/././libcblas.so.3
#2  0x00007f1d85c551ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1d84f2fe73 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f1d74e0a700 (LWP 867376)):
#0  0x00007f1d85c5b45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1d7749406c in blas_thread_server () from /home/sophie.hourihane/.conda/envs/ame/lib/././libcblas.so.3
#2  0x00007f1d85c551ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1d84f2fe73 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f1d7560b700 (LWP 867375)):
#0  0x00007f1d85c5b45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1d7749406c in blas_thread_server () from /home/sophie.hourihane/.conda/envs/ame/lib/././libcblas.so.3
#2  0x00007f1d85c551ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1d84f2fe73 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f1d86081b80 (LWP 867351)):
#0  0x00007f1d84fef612 in waitpid () from /lib64/libc.so.6
#1  0x00007f1d84f51ce7 in do_system () from /lib64/libc.so.6
#2  0x00007f1d8484eb65 in CppyyLegacy::TUnixSystem::StackTrace() () from /home/sophie.hourihane/.conda/envs/ame/lib/python3.10/site-packages/cppyy_backend/lib/libCoreLegacy.so
#3  0x00007f1d7cb87e48 in (anonymous namespace)::TExceptionHandlerImp::HandleException(int) () from /home/sophie.hourihane/.conda/envs/ame/lib/python3.10/site-packages/cppyy_backend/lib/libcppyy_backend.so
#4  0x00007f1d8484d861 in CppyyLegacy::TUnixSystem::DispatchSignals(CppyyLegacy::ESignals) () from /home/sophie.hourihane/.conda/envs/ame/lib/python3.10/site-packages/cppyy_backend/lib/libCoreLegacy.so
#5  <signal handler called>
#6  0x00007f1d7a9877c5 in RunBuilder::build (this=0x557d26a22d30) at /home/sophie.hourihane/.conda/envs/ame/x86_64-conda-linux-gnu/include/c++/11.4.0/ext/unconditional_prior_distribution.ipp:421
#7  0x00007f1d38d82028 in ?? ()
#8  0x0000557d269dc1c0 in ?? ()
#9  0x00007fffb5b082c0 in ?? ()
#10 0x0000000000000000 in ?? ()
 *** Break *** segmentation violation

Thread 8 (Thread 0x7f1d4a605700 (LWP 867381)):
#0  0x00007f1d85c5b45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1d7749406c in blas_thread_server () from /home/sophie.hourihane/.conda/envs/ame/lib/././libcblas.so.3
#2  0x00007f1d85c551ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1d84f2fe73 in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f1d52e06700 (LWP 867380)):
#0  0x00007f1d85c5b45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1d7749406c in blas_thread_server () from /home/sophie.hourihane/.conda/envs/ame/lib/././libcblas.so.3
#2  0x00007f1d85c551ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1d84f2fe73 in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f1d5b607700 (LWP 867379)):
#0  0x00007f1d85c5b45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1d7749406c in blas_thread_server () from /home/sophie.hourihane/.conda/envs/ame/lib/././libcblas.so.3
#2  0x00007f1d85c551ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1d84f2fe73 in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f1d63e08700 (LWP 867378)):
#0  0x00007f1d85c5b45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1d7749406c in blas_thread_server () from /home/sophie.hourihane/.conda/envs/ame/lib/././libcblas.so.3
#2  0x00007f1d85c551ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1d84f2fe73 in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f1d6c609700 (LWP 867377)):
#0  0x00007f1d85c5b45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1d7749406c in blas_thread_server () from /home/sophie.hourihane/.conda/envs/ame/lib/././libcblas.so.3
#2  0x00007f1d85c551ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1d84f2fe73 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f1d74e0a700 (LWP 867376)):
#0  0x00007f1d85c5b45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1d7749406c in blas_thread_server () from /home/sophie.hourihane/.conda/envs/ame/lib/././libcblas.so.3
#2  0x00007f1d85c551ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1d84f2fe73 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f1d7560b700 (LWP 867375)):
#0  0x00007f1d85c5b45c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f1d7749406c in blas_thread_server () from /home/sophie.hourihane/.conda/envs/ame/lib/././libcblas.so.3
#2  0x00007f1d85c551ca in start_thread () from /lib64/libpthread.so.0
#3  0x00007f1d84f2fe73 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f1d86081b80 (LWP 867351)):
#0  0x00007f1d84fef612 in waitpid () from /lib64/libc.so.6
#1  0x00007f1d84f51ce7 in do_system () from /lib64/libc.so.6
#2  0x00007f1d8484eb65 in CppyyLegacy::TUnixSystem::StackTrace() () from /home/sophie.hourihane/.conda/envs/ame/lib/python3.10/site-packages/cppyy_backend/lib/libCoreLegacy.so
#3  0x00007f1d7cb87cc5 in (anonymous namespace)::TExceptionHandlerImp::HandleException(int) () from /home/sophie.hourihane/.conda/envs/ame/lib/python3.10/site-packages/cppyy_backend/lib/libcppyy_backend.so
#4  0x00007f1d8484d861 in CppyyLegacy::TUnixSystem::DispatchSignals(CppyyLegacy::ESignals) () from /home/sophie.hourihane/.conda/envs/ame/lib/python3.10/site-packages/cppyy_backend/lib/libCoreLegacy.so
#5  <signal handler called>
#6  0x00007f1d7a9877c5 in RunBuilder::build (this=0x557d26a22d30) at /home/sophie.hourihane/.conda/envs/ame/x86_64-conda-linux-gnu/include/c++/11.4.0/ext/unconditional_prior_distribution.ipp:421
#7  0x00007f1d38d82028 in ?? ()
#8  0x0000557d269dc1c0 in ?? ()
#9  0x00007fffb5b082c0 in ?? ()
#10 0x0000000000000000 in ?? ()

This is confusing for several reasons:

1) I am explicitly using a single thread, so why are there multiple threads in the traceback? (It fails with the same error when threading is turned on.)
2) unconditional_prior_distribution.ipp is not called by dataBuilder (it is called later by runBuilder).
3) This exact call works from C++ on the Linux machine (and from both Python and C++ on my Mac).

If you have any pointers on what I am doing wrong, that would be great. I am hoping I am just treating std::make_unique incorrectly?

Thank you!

wlav commented 1 year ago

The multiple threads are started by BLAS, probably because of OpenMP; there should be a simple way of either switching that off or setting the number of threads to 1.
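As a minimal sketch of the "set the number of threads to 1" route: many BLAS and OpenMP runtimes read a thread-count environment variable when the library is first loaded. Which variable applies depends on the backend behind libcblas in the conda environment, so the list below is an assumption, and `limit_blas_threads` is a hypothetical helper name, not part of cppyy or numpy. The variables have to be set before numpy (or anything else that pulls in BLAS) is imported.

```python
import os

# Environment variables read by common BLAS/OpenMP runtimes at load time.
# Which one actually matters depends on the BLAS backend (an assumption here).
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_blas_threads(n_threads=1):
    """Set the thread-count variables; call before importing numpy or the bindings."""
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n_threads)
    # Return what was set so the caller can log or check it.
    return {name: os.environ[name] for name in THREAD_ENV_VARS}


limit_blas_threads(1)
# import numpy as np  # import numpy (and the C++ bindings) only after this point
```

If the extra threads disappear from the traceback, that only tidies the output; the idle BLAS worker threads shown above are all parked in pthread_cond_wait and are unlikely to be the cause of the segfault.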

As for the crash, the C++ code is:

  auto dataBuilder = std::make_unique<LalDataBuilder>(argc, argv);
  RunBuilder runBuilder{argc, argv, std::move(dataBuilder)};
  auto run = runBuilder.build();

but the Python code is:

   dataBuilder = Cpp.LalDataBuilder(len(split_command_line), split_command_line)
   self.runBuilder = Cpp.RunBuilder(len(split_command_line), split_command_line, Cpp.std.move(dataBuilder))
   self.run = self.runBuilder.build()

which is missing that std.make_unique.

That means std.move is applied to the LalDataBuilder object itself in Python, whereas in C++ it is applied to the std::unique_ptr<LalDataBuilder> object. std::move is only a cast; it does not by itself call a move constructor, which runs only if an object has to be constructed from the result. Does LalDataBuilder have a move constructor and, if so, is it clearing state it shouldn't?