Closed gcv closed 8 years ago
This constitutes an ABI breakage, so IMHO we can't lightly push it to stable releases, as it will mess badly with users and package maintainers.
I'm actually contemplating a different solution, see this on the mailing list: http://lists.zeromq.org/pipermail/zeromq-dev/2016-August/030799.html
What about an #ifdef
for the definition of zmq_msg_t
that selectively fixes it on platforms where just instantiating the object causes a bus error? Should keep the ABI safe for most users, and will fix the exceedingly nasty and hard-to-find problem for everyone else.
I don't have access to SPARC64 machines, but can confirm that (at least some versions of 64-bit) ARM CPUs are affected, making 0MQ unusable on them without this patch.
Clarifying edit: In the case of AARCH64, 0MQ doesn't work, period, so ABI breakage should not be a concern.
Having a different ABI version per architecture would be a nightmare to maintain I think. Also it would depend a lot on the build system.
On what ARM CPUs does the problem happen? Is that arm64?
I edited my message above. AARCH64 is completely broken as things stand, so there's nothing to lose by changing the ABI.
Ok, I'll try to reproduce on an arm64 chroot + qemu.
I'm thinking of using pragma pack(8) as I explained in the email - as that doesn't trigger the ABI checkers. The change for x86/64 would make no difference anyway since that typedef is only used to allocate the space externally.
I'm working with a Jetson TX1. The following program causes a bus error (extracted from a project, so it uses zmqpp which instantiates a zmq_msg_t
internally, and Boost for saner error reporting):
#include <boost/test/prg_exec_monitor.hpp>
#include <zmqpp/zmqpp.hpp>
int cpp_main(int argc, char** argv) {
zmqpp::message msg_outbound;
msg_outbound << 1;
}
I don't have arm64 machines available unfortunately (time for a raspberry pi3 probably), so if you don't mind helping, could you please try with this diff instead of the union:
-typedef struct zmq_msg_t {unsigned char _ [64];} zmq_msg_t;
+#pragma pack(push)
+#pragma pack(8)
+typedef struct zmq_msg_t {unsigned char _ [64];} zmq_msg_t;
+#pragma pack(pop)
Unfortunately it can't be reproduced with qemu + arm64 chroot :-(
There are a number of reasons why that could happen.
Yes it is certainly a problem with qemu. Do you know if the raspberry pi 3, which has a cortex-a53, would have the alignment restriction? Any way to check from the spec sheet?
The #pragma pack
technique does not work. I tried it as follows:
#pragma pack(push)
#pragma pack(8)
typedef struct zmq_msg_t {unsigned char _ [64];} zmq_msg_t;
#pragma pack(pop)
Output:
**** exception(225): memory access violation at address: 0xff829079: invalid address alignment
Which, after looking at this a little further, makes sense. According to this SO answer, #pragma pack
packs things (as the name implies), it doesn't pad them.
I tried this instead:
typedef struct zmq_msg_t {
unsigned char _ [64] __attribute__((aligned(8)));
} zmq_msg_t;
That does work, and is more readable than the union
trick. I don't know if that patch will keep the ABI checkers happy, so the following should preserve compatibility where it matters:
typedef struct zmq_msg_t {
#if defined(__aarch64__) || defined(__ARM_ARCH_7A__)
unsigned char _ [64] __attribute__((aligned(8)));
#else
unsigned char _ [64];
#endif
} zmq_msg_t;
This may not cover all breaking cases, but it does fix the Jetson TX1. I don't know the correct preprocessor checks for the SPARC architectures which break. I would guess __sparc_v9__
and __sparc_v8__
, but have no way to test. Perhaps @baryonix, who reported the problem (well over a year ago), could help figure this out. (Invoking gcc -dM -E - < /dev/null
is an excellent way to look at all available constants, but realize that some are only available on more recent compilers than those typically shipped by Debian-based distributions.)
Thanks for testing it! The problem is that we support not only gcc but also windows compilers (and really, really old gcc like on solaris). I will check if the __attribute extension is supported by them.
#if
preprocessor directives solve this problem (the ones I suggested will not be available on old gcc on Solaris in the first place, and we can lock out Windows by adding a check for __linux__
). This work aims to expand the range of platforms on which ZeroMQ works without collateral damage, not solve all compatibility problems.
Any more thoughts on rolling this or similar patch into the release branch?
Hi, sorry for the delay! I've got a raspberry pi 3 arriving today, so I'll give it one more shot tonight on arm64.
If there is no other choice then we'll have to bump ABI, and we might take advantage of it by doing other incompatible changes we have been talking about (like msg_t size)
On Aug 23, 2016 01:00, "Constantine Vetoshev" notifications@github.com wrote:
Any more thoughts on rolling this or similar patch into the release branch?
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/zeromq/zeromq4-1/pull/137#issuecomment-241587625, or mute the thread https://github.com/notifications/unsubscribe-auth/AAvvcYdAYqd-a7JN4KTCkFNGd-FU2H_Rks5qijf_gaJpZM4Jl0hX .
Unfortunately on a Rpi3 running fedora 24 with aarch64 kernel and userspace I cannot repro. But on the other hand, I found a way to enable alignment check on x86:
#if defined(__GNUC__)
#if defined(__i386__)
__asm__("pushf\norl $0x40000,(%esp)\npopf");
# elif defined(__x86_64__)
__asm__("pushf\norl $0x40000,(%rsp)\npopf");
#endif
#endif
With this hack I can reproduce the issue. I'll see if I can create a test case that works reliably. And both this commit or the following snippet fix it:
typedef struct zmq_msg_t {
#if defined(__GNUC__) || defined( __INTEL_COMPILER) || (defined(__SUNPRO_C) && __SUNPRO_C >= 0x590)
unsigned char _ [64] __attribute__((aligned(sizeof(void*))));
#elif defined(_MSC_VER)
__declspec(align(sizeof(void*))) unsigned char _ [64];
#else
unsigned char _ [64];
#endif
} zmq_msg_t;
See my last message to the ML: http://lists.zeromq.org/pipermail/zeromq-dev/2016-August/030831.html
For now I'll merge, but depending on what decide to do for the next major release, and then whether to backport to 4.1 too, we might change it again.
Sorry for the back and forth, but bumping ABI has vast implications for all users of all distributions, so it cannot be done lightly.
@gcv - I got around to do more testing and I pushed an alternative that uses aligned as above: https://github.com/zeromq/libzmq/pull/2177
Could you please give it a try when you have a moment? I managed to verify with the asm hack on x86_64, but on all ARM platforms I try I don't get the sigbus to happen.
I also verified that change is ABI compatible: tests built from this repository work fine when run with the new library from libzmq master with that changes. Also it does not trigger ABI checkers, so it's preferable IMHO.
If it works for you then I'll backport it here.
Problem occurs on SPARC and ARM CPUs. This commit is a backport of https://github.com/zeromq/libzmq/commit/d9fb1d36ff2008966af538f722a1f4ab158dbf64 (from libzmq), first reported in https://github.com/zeromq/libzmq/issues/1325.