I do get an error message on my side using your "bad_seqs". Let me try to figure it out and get back to you.
@shenker I was able to see an ENOMEM error with 1.4.2, but not with 1.4.3. With 1.4.3, I tried both variants in your scripts (msa_aligner created inside or outside the function), and they both work.
How did you install pyabpoa? Via pip or from source? Can you try removing it and re-installing? Also, it might help to paste your installation output here.
Thanks for looking into this, I really appreciate it!
mamba env create -n abpoa_test2
mamba activate abpoa_test2
mamba install -c conda-forge python=3.11.7
pip install --no-cache-dir pyabpoa
which prints:
Collecting pyabpoa
Using cached pyabpoa-1.4.3.tar.gz (689 kB)
Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyabpoa
Building wheel for pyabpoa (setup.py) ... done
Created wheel for pyabpoa: filename=pyabpoa-1.4.3-cp311-cp311-linux_x86_64.whl size=187554 sha256=31f8017d08a554261d49a14151cd81ce89fb153076250764668ac78c66ff8995
Stored in directory: /home/jqs1/.cache/pip/wheels/08/17/6d/d349b4c15fb131cb555e4b9bdac82e797d71b52c9bcfa44e2e
Successfully built pyabpoa
Installing collected packages: pyabpoa
Successfully installed pyabpoa-1.4.3
Running the above script outputs:
*** Error in `python': double free or corruption (!prev): 0x0000562e67[819/1878]
======= Backtrace: =========
/lib64/libc.so.6(+0x81329)[0x7fb697987329]
/home/jqs1/micromamba/envs/abpoa_test2/lib/python3.11/site-packages/pyabpoa.cpython-311-x86_64-linux-gnu.so(simd_abpoa_realloc+0x7b)[0x7fb69865854b]
/home/jqs1/micromamba/envs/abpoa_test2/lib/python3.11/site-packages/pyabpoa.cpython-311-x86_64-linux-gnu.so(simd_abpoa_align_sequence_to_subgraph+0x25d)[0x7fb69865bd0d]
/home/jqs1/micromamba/envs/abpoa_test2/lib/python3.11/site-packages/pyabpoa.cpython-311-x86_64-linux-gnu.so(simd_abpoa_align_sequence_to_graph+0x18)[0x7fb69866f158]
/home/jqs1/micromamba/envs/abpoa_test2/lib/python3.11/site-packages/pyabpoa.cpython-311-x86_64-linux-gnu.so(abpoa_align_sequence_to_graph+0x1f)[0x7fb6986419cf]
/home/jqs1/micromamba/envs/abpoa_test2/lib/python3.11/site-packages/pyabpoa.cpython-311-x86_64-linux-gnu.so(+0x1d68b)[0x7fb69863a68b]
python(PyObject_Vectorcall+0x2c)[0x562e670a65ac]
python(_PyEval_EvalFrameDefault+0x716)[0x562e67099a36]
python(+0x2a48bd)[0x562e671508bd]
python(PyEval_EvalCode+0x9f)[0x562e6714ff4f]
python(+0x2c2eaa)[0x562e6716eeaa]
python(+0x2bea23)[0x562e6716aa23]
python(+0x2d3de0)[0x562e6717fde0]
python(_PyRun_SimpleFileObject+0x1ae)[0x562e6717f77e]
python(_PyRun_AnyFileObject+0x44)[0x562e6717f4a4]
python(Py_RunMain+0x374)[0x562e67179b94]
python(Py_BytesMain+0x37)[0x562e6713ff47]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb697928555]
python(+0x293ded)[0x562e6713fded]
======= Memory map: ========
...
It's worth mentioning that I'm running this on my institution's HPC cluster, which is running an ancient CentOS install. Specifically, the installed GCC is very old (9.2.0).
If I install GCC 13.2.0 via conda:
# same as above
mamba install gcc zlib
pip install --force-reinstall --no-cache-dir pyabpoa
which outputs:
Collecting pyabpoa
Downloading pyabpoa-1.4.3.tar.gz (689 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 689.1/689.1 kB 25.4 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyabpoa
Building wheel for pyabpoa (setup.py) ... done
Created wheel for pyabpoa: filename=pyabpoa-1.4.3-cp311-cp311-linux_x86_64.whl size=198827 sha256=3ae1b56742029ebd58cee08c334aaf7adf8a587a8ee553e48c5c8302d59$ca3a
Stored in directory: /tmp/pip-ephem-wheel-cache-yewfte0r/wheels/08/17/6d/d349b4c15fb131cb555e4b9bdac82e797d71b52c9bcfa44e2e
Successfully built pyabpoa
Installing collected packages: pyabpoa
Attempting uninstall: pyabpoa
Found existing installation: pyabpoa 1.4.3
Uninstalling pyabpoa-1.4.3:
Successfully uninstalled pyabpoa-1.4.3
Successfully installed pyabpoa-1.4.3
Now when I run python test.py, I randomly get one of three different outputs (seemingly with roughly equal probability):
(abpoa_test2) [jqs1@compute-a-16-160 scratch]$ python test.py
[simd_abpoa_align_sequence_to_subgraph] Error in cg_backtrack.
(abpoa_test2) [jqs1@compute-a-16-160 scratch]$ python test.py
Killed
(abpoa_test2) [jqs1@compute-a-16-160 scratch]$ python test.py
Segmentation fault
For a third attempt, I compile from the git repository (instead of using pip), still using GCC 13.2.0.
Full build log:
The test script crashes in the same three ways as in the previous attempt.
Let me know if there's any additional information/debugging I can do on my end that would help.
I see. Can you show me the output of cat /proc/cpuinfo | grep -E "sse|avx" | tail -n1?
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single ssbd rsb_ctxsw ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
Thanks!
Looks like this is an old CPU model.
Can you try re-installing it with SSE4=1 python setup.py install?
Same result. Some more clues: running the test script above results in widely varying runtimes (before it crashes):
(abpoa_test2) [jqs1@compute-e-16-230 scratch]$ time python test.py
Segmentation fault
real 0m0.725s
user 0m0.291s
sys 0m0.402s
(abpoa_test2) [jqs1@compute-e-16-230 scratch]$ time python test.py
Killed
real 0m9.223s
user 0m4.650s
sys 0m4.564s
(abpoa_test2) [jqs1@compute-e-16-230 scratch]$ time python test.py
Killed
real 0m10.231s
user 0m5.201s
sys 0m5.004s
(abpoa_test2) [jqs1@compute-e-16-230 scratch]$ time python test.py
Segmentation fault
real 0m0.741s
user 0m0.312s
sys 0m0.404s
When it says "Killed", the process is actually being killed by our SLURM cluster's memory-usage-quota enforcer. There's something nondeterministic happening: about 50% of the time it starts leaking memory (I've verified that memory usage grows steadily up to 64 GB, and I assume it would keep growing indefinitely if I let it), and the other 50% of the time it segfaults within 1 s.
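For reference, here is a minimal sketch of how that kind of growth can be watched from inside the process. This is not the actual test script from this thread: bad_seqs is a placeholder for the problematic sequence group, and aln_mode="l" is an assumption based on the later comment that the bug was in local mode.
import resource

import pyabpoa

bad_seqs = ["ACGT" * 1000, "ACGTACGA" * 500]  # placeholder sequences, not the real group

for i in range(20):
    # Same pattern as the crashing script: a fresh msa_aligner for every call.
    pyabpoa.msa_aligner(aln_mode="l").msa(bad_seqs, out_cons=True, out_msa=False)
    # ru_maxrss is the peak resident set size; on Linux it is reported in KiB.
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"iteration {i}: peak RSS ~ {peak_kib / 1024:.0f} MiB")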
Here's a stacktrace:
When I compile with SSE4=1, it either has an infinite memory leak or segfaults immediately; when I compile with AVX2=1, it does either of those two or prints [simd_abpoa_align_sequence_to_subgraph] Error in cg_backtrack. and exits.
Hi @shenker, I did find a critical memory-allocation bug related to the local mode; it should be fixed in the latest commit 04a7f7e. Let me know if this version works on your side. Also, maybe just install it in the normal way (without SSE4=1/AVX2=1).
It works! I know maintaining a software package like this is a lot of work (especially tracking down tricky bugs like this), so I really appreciate your work on this!
Using pyabpoa 1.4.3.
Full test script:
I get a crash:
Something isn't being cleaned up when the pyabpoa.msa_aligner object is garbage-collected, because if I share the same aligner instance across aligner.msa calls, I get no crash.
The above test scripts define two lists of sequences, bad_seqs and good_seqs. Aligning bad_seqs more than once during the lifetime of the Python process, even if those alignments are interspersed with aligning good_seqs (or any other list of sequences), is sufficient to trigger a crash.
Out of ~100k groups of 10-50 sequences (all ~4 kb) I've tried aligning (in the context of the sequencing pipeline I'm working on), I've found ~4 sequence groups that trigger a crash. I've only double-checked that I can get a reproducible crash for one of them (bad_seqs), but I can go try to find other examples if that'd help.
Another clue, in case it's useful: while I was coming up with this minimal reproducible example, I would occasionally get the output
[abpoa_graph_node_id_to_index] Wrong node id: 19464488
before Python crashed. I can't seem to get it to print that any more, not sure why.
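For concreteness, a hypothetical sketch of the reproduction pattern described above (the actual test script and sequences are not reproduced here; bad_seqs and good_seqs are placeholders, and aln_mode="l" is an assumption based on the maintainer's later comment about the local mode):
import pyabpoa

bad_seqs = ["ACGT" * 1000, "ACGTACGA" * 500]    # placeholder for a problematic group of ~4 kb reads
good_seqs = ["TTGCA" * 800, "TTGCATTGA" * 450]  # placeholder for a group that aligns fine

def align_with_fresh_aligner(seqs):
    # A new msa_aligner for every call: the pattern that crashes once bad_seqs
    # has been aligned more than once in the same process.
    return pyabpoa.msa_aligner(aln_mode="l").msa(seqs, out_cons=True, out_msa=False)

align_with_fresh_aligner(bad_seqs)
align_with_fresh_aligner(good_seqs)  # interleaving other groups doesn't prevent it
align_with_fresh_aligner(bad_seqs)   # the second alignment of bad_seqs triggers the crash

# Sharing a single msa_aligner instance across calls does not crash.
shared_aligner = pyabpoa.msa_aligner(aln_mode="l")
shared_aligner.msa(bad_seqs, out_cons=True, out_msa=False)
shared_aligner.msa(bad_seqs, out_cons=True, out_msa=False)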