zeromq / zeromq4-1

ZeroMQ 4.1.x stable release branch - bug fixes only
GNU General Public License v3.0
125 stars 137 forks source link

Fix alignment problem with zmq_msg_t #137

Closed gcv closed 8 years ago

gcv commented 8 years ago

Problem occurs on SPARC and ARM CPUs. This commit is a backport of https://github.com/zeromq/libzmq/commit/d9fb1d36ff2008966af538f722a1f4ab158dbf64 (from libzmq), first reported in https://github.com/zeromq/libzmq/issues/1325.

bluca commented 8 years ago

This constitutes an ABI breakage, so IMHO we can't lightly push it to stable releases, as it will mess badly with users and package maintainers.

I'm actually contemplating a different solution, see this on the mailing list: http://lists.zeromq.org/pipermail/zeromq-dev/2016-August/030799.html

gcv commented 8 years ago

What about an #ifdef for the definition of zmq_msg_t that selectively fixes it on platforms where just instantiating the object causes a bus error? Should keep the ABI safe for most users, and will fix the exceedingly nasty and hard-to-find problem for everyone else.

I don't have access to SPARC64 machines, but can confirm that (at least some versions of 64-bit) ARM CPUs are affected, making 0MQ unusable on them without this patch.

Clarifying edit: In the case of AARCH64, 0MQ doesn't work, period, so ABI breakage should not be a concern.

bluca commented 8 years ago

Having a different ABI version per architecture would be a nightmare to maintain I think. Also it would depend a lot on the build system.

On what ARM CPUs does the problem happen? Is that arm64?

gcv commented 8 years ago

I edited my message above. AARCH64 is completely broken as things stand, so there's nothing to lose by changing the ABI.

bluca commented 8 years ago

Ok, I'll try to reproduce on an arm64 chroot + qemu.

I'm thinking of using pragma pack(8) as I explained in the email - as that doesn't trigger the ABI checkers. The change for x86/64 would make no difference anyway since that typedef is only used to allocate the space externally.

gcv commented 8 years ago

I'm working with a Jetson TX1. The following program causes a bus error (extracted from a project, so it uses zmqpp which instantiates a zmq_msg_t internally, and Boost for saner error reporting):

#include <boost/test/prg_exec_monitor.hpp>
#include <zmqpp/zmqpp.hpp>

int cpp_main(int argc, char** argv) {
  zmqpp::message msg_outbound;
  msg_outbound << 1;
}
bluca commented 8 years ago

I don't have arm64 machines available unfortunately (time for a raspberry pi3 probably), so if you don't mind helping, could you please try with this diff instead of the union:

-typedef struct zmq_msg_t {unsigned char _ [64];} zmq_msg_t;
+#pragma pack(push)
+#pragma pack(8)
+typedef struct zmq_msg_t {unsigned char _ [64];} zmq_msg_t;
+#pragma pack(pop)
bluca commented 8 years ago

Unfortunately it can't be reproduced with qemu + arm64 chroot :-(

yn commented 8 years ago

There are a number of reasons why that could happen.

  1. Does zmq_msg_t by accident align on an 8 byte boundary in the QEMU env?
  2. It may be the case that the alignment restriction on ARM64 is optional, and just happens to be turned on in the Jetson TX1, but not on QEMU
  3. Maybe QEMU does not implement CPU alignment restrictions at all.
bluca commented 8 years ago

Yes it is certainly a problem with qemu. Do you know if the raspberry pi 3, which has a cortex-a53, would have the alignment restriction? Any way to check from the spec sheet?

gcv commented 8 years ago

The #pragma pack technique does not work. I tried it as follows:

#pragma pack(push)
#pragma pack(8)
typedef struct zmq_msg_t {unsigned char _ [64];} zmq_msg_t;
#pragma pack(pop)

Output:

**** exception(225): memory access violation at address: 0xff829079: invalid address alignment

Which, after looking at this a little further, makes sense. According to this SO answer, #pragma pack packs things (as the name implies), it doesn't pad them.

I tried this instead:

typedef struct zmq_msg_t {
  unsigned char _ [64] __attribute__((aligned(8)));
} zmq_msg_t;

That does work, and is more readable than the union trick. I don't know if that patch will keep the ABI checkers happy, so the following should preserve compatibility where it matters:

typedef struct zmq_msg_t {
#if defined(__aarch64__) || defined(__ARM_ARCH_7A__)
  unsigned char _ [64] __attribute__((aligned(8)));
#else
  unsigned char _ [64];
#endif
} zmq_msg_t;

This may not cover all breaking cases, but it does fix the Jetson TX1. I don't know the correct preprocessor checks for the SPARC architectures which break. I would guess __sparc_v9__ and __sparc_v8__, but have no way to test. Perhaps @baryonix, who reported the problem (well over a year ago), could help figure this out. (Invoking gcc -dM -E - < /dev/null is an excellent way to look at all available constants, but realize that some are only available on more recent compilers than those typically shipped by Debian-based distributions.)

bluca commented 8 years ago

Thanks for testing it! The problem is that we support not only gcc but also windows compilers (and really, really old gcc like on solaris). I will check if the __attribute extension is supported by them.

gcv commented 8 years ago

#if preprocessor directives solve this problem (the ones I suggested will not be available on old gcc on Solaris in the first place, and we can lock out Windows by adding a check for __linux__). This work aims to expand the range of platforms on which ZeroMQ works without collateral damage, not solve all compatibility problems.

gcv commented 8 years ago

Any more thoughts on rolling this or similar patch into the release branch?

bluca commented 8 years ago

Hi, sorry for the delay! I've got a raspberry pi 3 arriving today, so I'll give it one more shot tonight on arm64.

If there is no other choice then we'll have to bump ABI, and we might take advantage of it by doing other incompatible changes we have been talking about (like msg_t size)

On Aug 23, 2016 01:00, "Constantine Vetoshev" notifications@github.com wrote:

Any more thoughts on rolling this or similar patch into the release branch?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/zeromq/zeromq4-1/pull/137#issuecomment-241587625, or mute the thread https://github.com/notifications/unsubscribe-auth/AAvvcYdAYqd-a7JN4KTCkFNGd-FU2H_Rks5qijf_gaJpZM4Jl0hX .

bluca commented 8 years ago

Unfortunately on a Rpi3 running fedora 24 with aarch64 kernel and userspace I cannot repro. But on the other hand, I found a way to enable alignment check on x86:

#if defined(__GNUC__)
#if defined(__i386__)
__asm__("pushf\norl $0x40000,(%esp)\npopf");
# elif defined(__x86_64__)
__asm__("pushf\norl $0x40000,(%rsp)\npopf");
#endif
#endif

With this hack I can reproduce the issue. I'll see if I can create a test case that works reliably. And both this commit or the following snippet fix it:

typedef struct zmq_msg_t {
#if defined(__GNUC__) || defined( __INTEL_COMPILER) || (defined(__SUNPRO_C) && __SUNPRO_C >= 0x590)
    unsigned char _ [64] __attribute__((aligned(sizeof(void*))));
#elif defined(_MSC_VER)
    __declspec(align(sizeof(void*))) unsigned char _ [64];
#else
    unsigned char _ [64];
#endif
} zmq_msg_t;

See my last message to the ML: http://lists.zeromq.org/pipermail/zeromq-dev/2016-August/030831.html

For now I'll merge, but depending on what decide to do for the next major release, and then whether to backport to 4.1 too, we might change it again.

Sorry for the back and forth, but bumping ABI has vast implications for all users of all distributions, so it cannot be done lightly.

bluca commented 7 years ago

@gcv - I got around to do more testing and I pushed an alternative that uses aligned as above: https://github.com/zeromq/libzmq/pull/2177

Could you please give it a try when you have a moment? I managed to verify with the asm hack on x86_64, but on all ARM platforms I try I don't get the sigbus to happen.

I also verified that change is ABI compatible: tests built from this repository work fine when run with the new library from libzmq master with that changes. Also it does not trigger ABI checkers, so it's preferable IMHO.

If it works for you then I'll backport it here.

bluca commented 7 years ago

https://github.com/zeromq/zeromq4-1/pull/155