(sorry for flooding you with exotic bugs)
It's fine, or anyway it's much better than keeping them to yourself :)
@olebole do you have access to a machine with any of the architectures where the problem arises? If so, it would be useful to be able to run the test through gdb to find where in particle_oct_container.pyx the segfault happens.
This may be an endianness problem, as we mostly test stuff on little-endian architectures.
@cphyc that's a good point -- and we might be particularly susceptible due to the bit twiddling we do in the coarse index initialization.
It is not endianness -- all failing architectures are little endian (like x86), and the only big endian arch we have (s390x) builds fine (all tests passing). I would bet it is an alignment problem. I have access to a mips64el machine, so I will try to get a proper stack trace from there. It may take a few days, however.
@olebole Not to sound too amateur, but do you think we could conceivably reproduce the error with an emulator or something, even if it's very slow?
I am not sure whether an emulator is as picky as the hardware when it comes to alignment errors. You could try it, and if you are lucky, it will reproduce the problem.
I could create a stack trace on mipsel64:
Program received signal SIGBUS, Bus error.
0x719e27a8 in __pyx_fuse_0__pyx_f_2yt_8geometry_22particle_oct_container_14ParticleBitmap___coarse_index_data_file (__pyx_v_self=0x70dcf8e8,
__pyx_v_pos=0x70be06b0, __pyx_v_hsml=0x91e05c <_Py_NoneStruct>, __pyx_v_file_id=0) at yt/geometry/particle_oct_container.cpp:12618
12618 *((__pyx_t_5numpy_uint8_t *) ( /* dim=0 */ (__pyx_v_mask.data + __pyx_t_13 * __pyx_v_mask.strides[0]) )) = 1;
(gdb) bt
#0 0x719e27a8 in __pyx_fuse_0__pyx_f_2yt_8geometry_22particle_oct_container_14ParticleBitmap___coarse_index_data_file (
__pyx_v_self=0x70dcf8e8, __pyx_v_pos=0x70be06b0, __pyx_v_hsml=0x91e05c <_Py_NoneStruct>, __pyx_v_file_id=0)
at yt/geometry/particle_oct_container.cpp:12618
#1 0x719e0d04 in __pyx_pf_2yt_8geometry_22particle_oct_container_14ParticleBitmap_80_coarse_index_data_file (__pyx_v_self=0x70dcf8e8,
__pyx_v_pos=0x70be06b0, __pyx_v_hsml=0x91e05c <_Py_NoneStruct>, __pyx_v_file_id=0) at yt/geometry/particle_oct_container.cpp:12078
#2 0x719e09dc in __pyx_fuse_0__pyx_pw_2yt_8geometry_22particle_oct_container_14ParticleBitmap_81_coarse_index_data_file (
__pyx_v_self=0x70dcf8e8, __pyx_args=0x7076a3e8, __pyx_kwds=0x0) at yt/geometry/particle_oct_container.cpp:12028
#3 0x73010f54 in __Pyx_CyFunction_CallMethod (func=0x71de4c68, self=0x70dcf8e8, arg=0x7076a3e8, kw=0x0)
at yt/geometry/selection_routines.c:83554
#4 0x7301134c in __Pyx_CyFunction_CallAsMethod (func=0x71de4c68, args=0x709b53c0, kw=0x0) at yt/geometry/selection_routines.c:83617
#5 0x73012308 in __pyx_FusedFunction_callfunction (func=0x71de4c68, args=0x709b53c0, kw=0x0) at yt/geometry/selection_routines.c:83898
#6 0x73012820 in __pyx_FusedFunction_call (func=0x71de4c68, args=0x709b53c0, kw=0x0) at yt/geometry/selection_routines.c:83986
#7 0x00432220 in _PyObject_MakeTpCall ()
#8 0x00421b34 in _PyEval_EvalFrameDefault ()
#9 0x004173d4 in _ftext ()
#10 0x00465aec in PyDict_GetItemWithError ()
Backtrace stopped: frame did not save the PC
These lines look like this:
/* "yt/geometry/particle_oct_container.pyx":597
* mi = bounded_morton_split_dds(ppos[0], ppos[1], ppos[2], LE,
* dds, mi_split)
* mask[mi] = 1 # <<<<<<<<<<<<<<
* particle_counts[mi] += 1
* # Expand mask by softening
*/
__pyx_t_13 = __pyx_v_mi;
*((__pyx_t_5numpy_uint8_t *) ( /* dim=0 */ (__pyx_v_mask.data + __pyx_t_13 * __pyx_v_mask.strides[0]) )) = 1;
Note that the particle_oct_container.cpp file was generated from the pyx file. It points to line 597 in particle_oct_container.pyx: https://github.com/yt-project/yt/blob/d0e12572233cf4a728dde642e9c67f6280f61740/yt/geometry/particle_oct_container.pyx#L595-L597.
One observation is that this does not seem to be 100% reproducible; I actually needed two attempts to get it (in the first one, the test passed).
To complete this, here is an info locals dump:
__pyx_v_i = 2
__pyx_v_p = 48
__pyx_v_mi = 2635249153387078802
__pyx_v_miex = 7
__pyx_v_mi_split = {0, 9223372034707292159, 0}
__pyx_v_ppos = {1.7913391018714546e-38, nan(0x7ffffffffffff), 2.2360368128959752e-21}
__pyx_v_s_ppos = {1.0609624037829015e-314, 3.5740411942693072e+265, 2.9908604455725968e-307}
__pyx_v_clip_pos_l = {0.40239225998491079, 0.40239225998491079, 0.40239214951068192}
__pyx_v_clip_pos_r = {2.7175623358972077e+234, 2.71756163009643e+234, 2.3547636263540591e+234}
__pyx_v_skip = 0
__pyx_v_bounds = {{9223068829353259392, 23001201026473008, 9712000}, {23586070613733760, 42027869908775296, 20100601573710868}}
__pyx_v_xex = 9712000
__pyx_v_yex = 1
__pyx_v_zex = 1
__pyx_v_LE = {0, 0, 0}
__pyx_v_RE = {1, 1, 1}
__pyx_v_DW = {1, 1, 1}
__pyx_v_PER = "\001\001\001"
__pyx_v_dds = {0.5, 0.5, 0.5}
__pyx_v_radius = 0.59760785102844238
__pyx_v_mask = {memview = 0x70e7e298, data = 0x1d40908 "\001\001\001\001\001\001\001\001", shape = {8, 0, 0, 0, 0, 0, 0, 0}, strides = {1,
0, 0, 0, 0, 0, 0, 0}, suboffsets = {-1, 0, 0, 0, 0, 0, 0, 0}}
__pyx_v_particle_counts = {memview = 0x70e7eb20, data = 0x1c43638 "\020", shape = {8, 0, 0, 0, 0, 0, 0, 0}, strides = {8, 0, 0, 0, 0, 0, 0,
0}, suboffsets = {-1, 0, 0, 0, 0, 0, 0, 0}}
__pyx_v_msize = 8
__pyx_v_axiter = {{0, 999}, {0, 999}, {0, 999}}
__pyx_v_axiterv = {{0, 8.096809699390939e+233}, {0, 2.7175616335247829e+234}, {0, 4.1519961721059194e+235}}
__pyx_v_xi = 1
__pyx_v_yi = 1
__pyx_v_zi = 1
__pyx_pybuffernd_hsml = {rcbuffer = 0x7ffee6c8, data = 0x0, diminfo = {{shape = 0, strides = 0, suboffsets = 2013024304}, {
shape = 1073741824, strides = 4, suboffsets = 5213996}, {shape = 9743480, strides = 2147483647, suboffsets = 2008105888}, {
shape = 4969628, strides = 9712000, suboffsets = 0}, {shape = 9743408, strides = 9781416, suboffsets = 9712000}, {shape = 2013189992,
strides = 2001061624, suboffsets = 1}, {shape = 9712000, strides = 2008105912, suboffsets = 9712000}, {shape = 5100,
strides = 9712000, suboffsets = 0}}}
__pyx_pybuffer_hsml = {refcount = 0, pybuffer = {buf = 0x0, obj = 0x0, len = 0, itemsize = 2013227512, readonly = 1891301036, ndim = 4,
format = 0x5 <error: Cannot access memory at address 0x5>, shape = 0x71ad4434 <__Pyx_zeros>, strides = 0x71ad4434 <__Pyx_zeros>,
suboffsets = 0x71ad0000 <__Pyx_minusones>, internal = 0x0}}
__pyx_pybuffernd_pos = {rcbuffer = 0x7ffee6f8, data = 0x0, diminfo = {{shape = 100, strides = 12, suboffsets = 1891501744}, {shape = 3,
strides = 4, suboffsets = 1902433016}, {shape = 2147412648, strides = 1887721480, suboffsets = 9712000}, {shape = 0,
strides = 1888950784, suboffsets = 4514316}, {shape = 9712000, strides = 4513340, suboffsets = 9712000}, {shape = 1891501744,
strides = 9712000, suboffsets = 0}, {shape = 33513416, strides = 1887721480, suboffsets = 1907209456}, {shape = 2013227512,
strides = 2147412504, suboffsets = 1906789492}}}
__pyx_pybuffer_pos = {refcount = 0, pybuffer = {buf = 0x1e08220, obj = 0x70be06b0, len = 1200, itemsize = 4, readonly = 0, ndim = 2,
format = 0x1ff5f40 "f", shape = 0x1ff5fe0, strides = 0x1ff5fe8, suboffsets = 0x71ad0000 <__Pyx_minusones>, internal = 0x0}}
__pyx_t_1 = 0x0
__pyx_t_2 = 0x0
__pyx_t_3 = {memview = 0x0, data = 0x0, shape = {8, 0, 0, 0, 0, 0, 0, 0}, strides = {1, 0, 0, 0, 0, 0, 0, 0}, suboffsets = {-1, 0, 0, 0, 0,
0, 0, 0}}
__pyx_t_4 = {memview = 0x0, data = 0x0, shape = {8, 0, 0, 0, 0, 0, 0, 0}, strides = {8, 0, 0, 0, 0, 0, 0, 0}, suboffsets = {-1, 0, 0, 0, 0,
0, 0, 0}}
__pyx_t_5 = 48
__pyx_t_6 = 100
__pyx_t_7 = 100
__pyx_t_8 = 3
__pyx_t_9 = 0
__pyx_t_10 = 48
__pyx_t_11 = 2
__pyx_t_12 = 0
__pyx_t_13 = 2635249153387078802
__pyx_t_14 = 0
__pyx_t_15 = 0
__pyx_t_16 = 0x0
__pyx_t_17 = 2
__pyx_t_18 = 2
__pyx_t_19 = 2
__pyx_t_20 = 2
__pyx_t_21 = 2
__pyx_t_22 = 2
__pyx_t_23 = 2
__pyx_t_24 = 2
__pyx_t_25 = 2
__pyx_t_26 = 2
__pyx_t_27 = 2
__pyx_t_28 = 7
__pyx_lineno = 0
__pyx_filename = 0x0
__pyx_clineno = 0
EDIT: I updated this and the previous post after running a build without optimization (-O0) to show all variables and to make sure it is not an optimization problem.
What may be the problem is this: __pyx_v_mi = 2635249153387078802, which may be (???) caused by a NaN in the ppos array, __pyx_v_ppos = {1.7913391018714546e-38, nan(0x7ffffffffffff), 2.2360368128959752e-21}.
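To illustrate why that value of mi is fatal, here is a minimal Python sketch. It is not yt's actual implementation: it assumes a plain bit-interleaved (Morton) index at a single refinement level, with LE, dds, and the mask size taken from the dump above. The helper name `coarse_morton_index` and the bit order are assumptions made for illustration only.

```python
import math

LE = (0.0, 0.0, 0.0)      # left edge, from the locals dump
DDS = (0.5, 0.5, 0.5)     # cell width, from the locals dump
MASK_SIZE = 8             # __pyx_v_msize in the dump
OBSERVED_MI = 2635249153387078802  # __pyx_v_mi in the dump


def coarse_morton_index(pos):
    """Hypothetical stand-in for a one-level Morton index (not yt's code)."""
    mi = 0
    for axis in range(3):
        # Cell number along this axis: 0 or 1 for positions inside [0, 1).
        cell = int(math.floor((pos[axis] - LE[axis]) / DDS[axis]))
        # Assumed bit order: x is the high bit, z the low bit.
        mi |= cell << (2 - axis)
    return mi


# One sample position per coarse cell of the unit domain.
samples = [(x, y, z) for x in (0.1, 0.9) for y in (0.1, 0.9) for z in (0.1, 0.9)]
indices = sorted(coarse_morton_index(p) for p in samples)

# Every finite in-domain position maps to an index in [0, MASK_SIZE).
in_bounds = all(0 <= i < MASK_SIZE for i in indices)

# The index seen in gdb is far outside the 8-element mask.
observed_in_bounds = 0 <= OBSERVED_MI < MASK_SIZE
```

Under these assumptions, any valid position yields an index between 0 and 7, whereas the observed mi is many orders of magnitude larger. With mask.strides[0] equal to 1, the write at mask.data + mi would land far outside the allocation, which is consistent with the SIGBUS reported at mask[mi] = 1.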
Thanks so much for the detailed log. I think I may have tracked the issue down to the fact that ppos should be normal. I realize this may be too much to ask, but if you happen to have a chance to try the following patch, that would be very handy:
--- a/yt/geometry/particle_oct_container.pyx
+++ b/yt/geometry/particle_oct_container.pyx
@@ -587,11 +587,11 @@ cdef class ParticleBitmap:
for i in range(3):
axiter[i][1] = 999
# Skip particles outside the domain
- if pos[p,i] >= RE[i] or pos[p,i] < LE[i]:
+ if not (LE[i] <= pos[p, i] < RE[i]):
skip = 1
break
ppos[i] = pos[p,i]
- if skip==1: continue
+ if skip == 1: continue
mi = bounded_morton_split_dds(ppos[0], ppos[1], ppos[2], LE,
dds, mi_split)
mask[mi] = 1
@@ -756,11 +756,11 @@ cdef class ParticleBitmap:
skip = 0
for i in range(3):
axiter[i][1] = 999
- if pos[p,i] >= RE[i] or pos[p,i] < LE[i]:
+ if not (LE[i] <= pos[p, i] < RE[i]):
skip = 1
break
ppos[i] = pos[p,i]
- if skip==1: continue
+ if skip == 1: continue
# Only look if collision at coarse index
mi1 = bounded_morton_split_dds(ppos[0], ppos[1], ppos[2], LE,
dds1, mi_split1)
Essentially the issue seems to be that bounded_morton_split_dds computes an integer index from spatial coordinates, which should be normal values. This is fenced by checking that the position is between the left edge (LE) and the right edge (RE). The logic of the test, however, fails with NaN, because every comparison involving NaN is false: neither NaN >= RE nor NaN < LE holds, so a NaN position is never skipped.
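The difference between the two forms of the check can be reproduced in plain Python, whose float comparisons follow the same IEEE 754 rules as C doubles. This is only a sketch: the function names `skip_old` and `skip_new` are made up for illustration, and the scalar edges stand in for one axis of LE and RE.

```python
LE, RE = 0.0, 1.0  # domain edges along one axis, as in the dump


def skip_old(x):
    """Check before the patch: flag positions outside [LE, RE)."""
    return x >= RE or x < LE


def skip_new(x):
    """Check after the patch: flag anything not provably inside [LE, RE)."""
    return not (LE <= x < RE)


nan = float("nan")
results = {
    "inside": (skip_old(0.4), skip_new(0.4)),
    "outside": (skip_old(1.5), skip_new(1.5)),
    "nan": (skip_old(nan), skip_new(nan)),
}
for name, (old, new) in results.items():
    print(f"{name:8s} old skip={old!s:5s} new skip={new}")
```

Both checks agree for ordinary values. Only for NaN do they differ: the old form lets the particle through to the index computation, while the new form skips it.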
[EDIT: easier fix!]
Small update @olebole: I have tested the proposed bugfix (#3688) locally and it seems to be passing. I have left it as a draft so that this issue isn't marked as fixed if the PR doesn't actually solve it. Would you be in a position to confirm it does solve the issue you raised on your side? Note that the PR contains exactly the patch in my previous comment, so no need to test both.
Bug report
Bug summary
When running the tests, a segmentation fault appears in test_gadget_binary on some platforms. The platforms are MIPS (32+64 bit; official Debian architectures), HP-PA, RiscV64, Sparc64 (unofficial architectures). (sorry for flooding you with exotic bugs)
Actual outcome
Full build log for RiscV64
This seems to happen in yt/geometry/particle_oct_container.pyx
Version Information