zlib-ng / zlib-ng

zlib replacement with optimizations for "next generation" systems.
zlib License
1.53k stars 253 forks source link

build option defaults are inappropriate for a compression library #1100

Closed manxorist closed 1 year ago

manxorist commented 2 years ago

UNALIGNED_OK Allow unaligned reads ON (x86, arm): WITH_UNALIGNED Allow optimizations that use unaligned reads if safe on current arch ON

Any unaligned access is always undefined behaviour in C, the architecture does not matter. C code that uses unaligned access is broken by definition. Unaligned access, and also type punning through pointers, is not safe for any C code.

It is very obvious that nobody has ever fuzzed zlib-ng with ubsan, or even just run the tests with ubsan:

./configure --with-sanitizer=undefined
make
make test

throws a ton of errors.

Any system's most basic compression library is probably one of the last places where I would tolerate any undefined behavior code. This can easily result in data loss and exploitable security vulnerabilites.

The build option defaults give the impression that zlib-ng's goal is "performance at all cost", sacrificing stability, reliability, and hardened code at every possible point. With this attitude, zlib-ng has sadly no chance to ever replace original zlib.

Please change your build defaults to be correct instead of performance-at-all-cost.

mtl1979 commented 2 years ago

Unaligned access on little-endian processors (and multi-endian processors in little-endian mode) is well-defined, no matter what sanitizers say... We test our code thoroughly and avoid anything that reliably fails the included tests.

Most compilers in use nowadays don't follow any known standards word by word (ANSI, ISO, JIS, etc.), so it's irrelevant what those standards say.

If we would follow everything word by word, the code would get a lot bigger, slower, and a lot more unreadable.

Type punning and aliasing is safe as long as all intermediate pointers are stored in variables. That way the compiler will track all the variables that point to overlapping memory locations, and reload and spill when necessary, this is similar to what volatile keyword does.

manxorist commented 2 years ago

Well, that's really a phenomenal attitude for a library that exclusively deals with user data ...

Unaligned access on little-endian processors (and multi-endian processors in little-endian mode) is well-defined

Wrong at least on little-endian ARMv5. And it does not matter whatsoever at all when writing C code. Whether the architecture can do things is only relevant when implementing for that particular architecture, i.e. assembly.

We test our code thoroughly

You very much dont, as that would involve testing with undefined behavior sanitizer.

and avoid anything that reliably fails the included tests.

That's the wrong criterion. It does not need to fail reliably to be unacceptable. It only needs to fail in 1 single instance, and it's broken.

Most compilers ...

So you risk losing user data with compilers that are perfectly standard-conforming?

Type punning and aliasing is safe as long as all intermediate pointers are stored in variables. That way the compiler will track all the variables that point to overlapping memory locations, and reload and spill when necessary

That's wrong. Only pointers to character types are allowed to alias memory belonging to other types. The compiler is free to assume that a pointer to uint32_t will never alias memory belonging to uint8_t. That's called alias-analysis. For GCC and clang, this optimization can be disabled via -fno-strict-aliasing, but even still then, the compiler can and will assume that a pointer to uint32_t is aligned to the natural alignment of that type. If the code does not guarantee that, it's undefined behavior.

The standard exists for a reason. It defines the minimum set of features and semantics that should work with ALL compilers, instead of just "most". If you know of compilers that allow these kind of things, you may green-list them optionally at the users choice on a case by case basis.

If we would follow everything word by word, the code would get a lot bigger, slower, and a lot more unreadable.

That's just wrong.

What's your policy on fixing bugs reported by ubsan?

mtl1979 commented 2 years ago

I already know that there is some old processors that don't support unaligned access, we already disable support for unaligned access on older ARM processors.

gcc already knows what architectures can use unaligned access and will generate correct code as long as the pointer is known at compile time, for pointers that are known only at runtime, exception is raised even without sanitizers.

zlib-ng is not supposed to be compatible with every possible compiler that exist in the world, but we will test with more compilers that are open-sourced or free-to-use.

We only fix bugs that cause incorrect behavior with no sanitizers enabled... The sanitizers itself might cause issues that don't manifest on normal running environment. As such we also try to not enable two sanitizers that are known to behave badly when enabled together.

Our main policy on finding bugs is to use as many compiler versions as possible and enable all warnings. We compile on all architectures that can be emulated on qemu or wine inside Ubuntu Linux (using GitHub Actions). For some architectures we can (and will) also test on native hardware without emulation.

manxorist commented 2 years ago

We only fix bugs that cause incorrect behavior with no sanitizers enabled... The sanitizers itself might cause issues that don't manifest on normal running environment. As such we also try to not enable two sanitizers that are known to behave badly when enabled together.

Your understanding of why it is important to use ubsan is fundamentally flawed. Ubsan warns in situations where behavior occurs that is not allowed by the standard. A compiler, at any time (and any future version), can assume that the program is correct according to the given standard and therefore conclude that the precondition that would cause undefined behavior will not occur, and optimize code with that precondition in mind.

Even testing with ALL compilers is not enough here. You would have to test with all future compilers as well (which is impossible). Ubsan is useful to catch bugs which you absolutely cannot catch by testing with a finite set of known compilers. It does so by enforcing that only the behavior mandated by the standard are required to successfully compile and/or run the program.

You are risking data corruption. Willingly.

mtl1979 commented 2 years ago

Compilers don't change fundamental behavior suddenly. It's common practice to first issue warning and then change that to error in subsequent versions... This is why we try to test with most recent compiler versions in addition to oldest supported compiler version and major versions in-between.

In the past we have caught "errors" that even the compiler didn't report. We work closely with the development teams of compilers so any issues can be caught and resolved.

manxorist commented 2 years ago

Look, I was only asking to change your defaults. Any user who is of the same opinion as you would still be able to use the incorrect code paths at will.

However, it frankly looks like my initial intuition, that the defaults are a manifestation of zlib-ng's attitude and goals and non-goals, more and more manifests as a fact.

Correctness and not losing user data is THE most important aspect of any library dealing with data of any form. Even the slightest remote risk of causing data corruption is not acceptable, especially not if motivated by performance or ease-of-development reasons.

mtl1979 commented 2 years ago

Build of zlib-ng will fail if there is a change of data corruption... That should be obvious from looking at or running the build scripts.

manxorist commented 2 years ago

It's common practice to first issue warning and then change that to error in subsequent versions...

And again you are relying only on exemplatory evidence, which does not and cannot prove the general case at all.

Especially changes in assumptions and optimizations surrounding alias analysis are difficult or even provably impossible to warn about at compile time. Ubsan catches these at run time. They are at least as serious as a compiler warning. I do not understand why you insist on them being utterly irrelevant. It honestly makes no sense to me.

mtl1979 commented 2 years ago

Because sanitizers replace essential functions required by zlib-ng, the results can contain false positives. This is essentially same as using non-conformant memory allocation callbacks that don't honor alignment requests.

Same also applies for static code analysis tools that assume something can happen even though the preceding code explicitly checks that the condition can never be true.

manxorist commented 2 years ago

Can you link respective bug reports for the sanitizers?

Edit: Sorry, i meant to say sanitizers.

manxorist commented 2 years ago

I assume you cannot, because such problems do not exist. If a sanitizer replaces allocation functions, it does so honoring their respective standard's requirements regarding alignment. If the alignment provided by the regular function was stricter than required by the standard (and thus provided by the sanitizer), the sanitizer again has found a bug in your code, because you made assumptions that are not universally guaranteed. A sanitizer will not inject non-conforming behavior into standard functions.

Of course, there may also be bugs in analyzers, but they should get fixed timely and the resulting false-positives would be triggered by a lot of projects that actually run ubsan regularly with no errors. I have yet to see a single false-positive with any of ubsan, asan, or msan ever.

In this specific case, all ubsan needs to do here, is locally, at the point of every pointer deref (at the source code level), add a check that verifies that the pointer is properly aligned.

Again, you appear to be misinformed about how ubsan works.

mtl1979 commented 2 years ago

@manxorist Sanitizers are not my responsibility in zlib-ng, so I seriously don't bother reporting any bugs in them or remember links to them... All team members and long-time contributors have main area of focus in this project.

manxorist commented 2 years ago

Fine, let's wait what others have to say.

nmoinvaz commented 2 years ago

There are some differences in configure and cmake build scripts. In CMake, we do detection based on architecture to see if unaligned access is supported. If it is not supported then we disable it even when WITH_UNALIGNED is enabled. Please take a look at it and see if we are missing something for your architecture. I also do have a related PR #1086 for unaligned reads open.

As far as the sanitizers go, we run them for each commit in the repository using CMake. Fuzzers are also run the same way for each commit, and our project is fuzz tested nightly by Google. I recommend trying to get more familar with the level of testing that is done on this project.

With this attitude, zlib-ng has sadly no chance to ever replace original zlib.

Zlib-ng is not for everybody. You may be better off with zlib given your concerns, however you will have less luck getting any type of code changed as the project has seen little movement these past years. You are always free to open a pull request here if you think you have a better way of doing something.

manxorist commented 2 years ago

There are some differences in configure and cmake build scripts. In CMake, we do detection based on architecture to see if unaligned access is supported.

As far as I can see, the configure script does the same or at least something similar. This is not enough though. Not using any unaligned access or pointer type punning must be the default for all situations.

I also do have a related PR #1086 open.

If similar things are done for all cases (which I did not check), this is progress in the right direction. https://github.com/zlib-ng/zlib-ng/pull/1086/commits/36ee2bddb2d8f12607e4aa1413c7c606a8550dbb#diff-ac7490066d9edf8ce5bc500cc55e7326b2564a93f2707418d1bd5f9ea3896a64R264 however is still wrong. The special case for old GCC should go away. The whole of WITH_UNALIGNED, UNALIGNED_OK and UNALIGNED64_OK can also completely go away, and even all of zmemfoo may go away. This will make your source code correct, smaller, easier to read and understand, easier to maintain, simplify your build systems, shorten your build documentation, and this whole issue could be void.

As far as the sanitizers go, we run them for each commit in the repository. Fuzzers are also run the same way for each commit, and our project is fuzz tested nightly by Google.

I have already shown in my initial post that absolutely nothing got tested with ubsan. This is insufficient testing, no matter how much you supposedly test otherwise. You need to also run tests and fuzzers with ubsan active.

With this attitude, zlib-ng has sadly no chance to ever replace original zlib.

Zlib-ng is not for everybody.

Can you then please

manxorist commented 2 years ago

You are always free to open a pull request here if you think you have a better way of doing something.

I would rather first ensure that zlib-ng's goals are actually somewhat aligned with my own goals before I contribute to this project. I would rather not see my contributions in a compression library project where code correctness and safety of user data is not without any doubt whatsoever the paramount number 1 priority.

nmoinvaz commented 2 years ago

If you want to wait for a declaration of priorities before you contribute then it is up to you, but I would be interested in seeing your proposals as PRs. You could start with something small if you don't want to waste much time. We are not as organized or as large as may be you think. We all do this in our spare time and like most open source nobody is compensated for it.

KungFuJesus commented 2 years ago

For architectures where unaligned access is not allowed you will see a SIGBUS abort of the process well before you see any data corruption. The decision as to whether or not unaligned access is allowed is very much baked into the architecture's ABI and is plenty appropriate as a compile-time decision.

manxorist commented 2 years ago

For architectures where unaligned access is not allowed you will see a SIGBUS abort of the process well before you see any data corruption.

And wrong again. The first thing you experience is the compiler making assumptions about alignment that it is allowed to make. The data corruption is already baked in before running anything at all.

That's exactly what happens on e.g. ARMv5 (https://developer.arm.com/documentation/ddi0100/i/ A2.8):

Prior to ARMv6, doubleword (LDRD/STRD) accesses to memory, where the address is not doubleword-aligned, are UNPREDICTABLE. Also, data accesses to non-aligned word and halfword data are treated as aligned from the memory interface perspective. That is: • the address is treated as truncated, with address bits[1:0] treated as zero for word accesses, and address bit[0] treated as zero for halfword accesses. • load single word ARM instructions are architecturally defined to rotate right the word-aligned data transferred by a non word-aligned address one, two or three bytes depending on the value of the two least significant address bits. • alignment checking is defined for implementations supporting a System Control coprocessor using the A bit in CP15 register 1. When this bit is set,a Data Abort indicating an alignment fault is reported for unaligned accesses.

The compiler generates a load or store, and the CPU just completely ignores the lower bits that cause incorrect alignment. The result is accessing just the wrong memory location. This is a direct consequence of code that is incorrect according to the C standard. The fact that on other architectures a compiler might choose to generate an instruction or sequence of instructions that can do unaligned access is PURE LUCK.

The C standard says: C11 6.3.2.3 Pointers p.7:

A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined.

and regarding type-punning in general, it says: C11 6.5 Expressions p.7:

An object shall have its stored value accessed only by an lvalue expression that has one of the following types: — a type compatible with the effective type of the object, — a qualified version of a type compatible with the effective type of the object, — a type that is the signed or unsigned type corresponding to the effective type of the object, — a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object, — an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or — a character type.

The decision as to whether or not unaligned access is allowed is very much baked into the architecture's ABI and is plenty appropriate as a compile-time decision.

Wrong. The ABI does not talk about unaligned access. The C standard does. And the ISA does, but that's not relevant when writing C code.

I honestly do not get all your stubborn obsession with relying on undefined behavior. You are all defending a risk of user data corruption for what reason exactly?!? The fix is rather easy, and it has already been mostly done by @nmoinvaz . It does not even lead to worse code generation: https://godbolt.org/z/e8oMz64ja, but that actually does not even matter, because the efficiency of incorrect code is completely irrelevant.

Dead2 commented 2 years ago

@manxorist I am very much open to hearing arguments for improving our code.

However I am not liking your hot and heavy approach to the discussion, it makes you look more like a random internet troll. Please lets have a reasonable discussion without accusations and blaming. The zlib-ng team is entirely made up of volunteers that do this because it is fun and interesting without any personal benefit beyond that, lets not make things discouraging and difficult.

You are unknown to me, and your word carries no more weight than those that I do know. If you want something to change, then please provide a calm and reasonable explanation of why, without attacking those that disagree or are questioning your points.

As far as I know (I am no expert in that area, and as far as I know you are not either), unaligned access is safe in the cases where we enable it. Can you deliver proof that it is in fact not safe, perhaps an example? Or is this all based on what the C standard says is undefined behavior?

manxorist commented 2 years ago

I have explained my arguments very thoroughly and in great detail. I honestly cannot think of what you might want to hear as any additional proof. The C standard is 100% clear, and I have already explained why this is important, even if it appears to work for you.

If you disagree, feel free to close this issue as wontfix or similar. I have invested enough time here.

manxorist commented 2 years ago

I am sorry that you feel attacked personally by my arguments. I certainly did not intend that.

However, if you repeatedly claim facts that are simply not true, I am basically forced to disagree unambiguously. And now at the end, you appear to be ignoring all my arguments again. This is not constructive, and I do not know how I could provide any further constructive feedback.

The first response I got here literally reads "it's irrelevant what those standards say", which frankly set the tone also from your side.

If you do not trust my knowledge, fine. But maybe you could try to trust what compiler implementers (who also implement the sanitizers) tell you.

Dead2 commented 2 years ago

@manxorist I do not feel personally attacked, since I did not join this discussion until today, I happen to have a life beside zlib-ng so zlib-ng cannot always be my main focus. I have not ignored any comments of either side of this discussion, but I am as yet uncertain of what weight to attribute to this alleged problem. I do not deny this problem exists, but I also cannot confirm it.

I unfortunately can not take responsibility for what others say during a debate, only what I do with the combined information in the end. We might want to implement a code of conduct for zlib-ng, that can be used as a guideline during discussions in the future.

So, I have several people telling me this is fine, and one person telling me this is bad. I am sure you can see that this is not an ideal situation. I will attempt to verify either side, but that might take some time, as I will need to find someone whom I know I can trust to have expert knowledge in this very specific field.

mtl1979 commented 2 years ago

If our code would be as shit as what was claimed in this issue, it wouldn't have worked on 32-bit PowerPC as it has quite high restrictions on unaligned access, to similar level like was mentioned here about older ARM processors.

I've worked personally over 5 years with multiple compiler teams, so I have quite a lot of knowledge about how the internals of compilers work. Just asserting that we would have to file a bug about every issue we find in every compiler version we use is pointless as the maintainers only support the very latest version. Latest versions often have issues that need to be fixed before we can use them in continuous integration as otherwise we would need to reject all incoming pull requests as none of them would pass required amount of tests.

mtl1979 commented 2 years ago

As was mentioned already in the description, zlib-ng is not about supporting every possible processor that is either 32-bit or 64-bit or every possible operating system running on them, it's all about having better support for recent processors that implement vector instructions or other useful extensions, like hardware CRC or hardware-assisted compression/decompression.

manxorist commented 2 years ago

I unfortunately can not take responsibility for what others say during a debate, only what I do with the combined information in the end.

Agreed, fair enough.

So, I have several people telling me this is fine, and one person telling me this is bad. I am sure you can see that this is not an ideal situation. I will attempt to verify either side, but that might take some time, as I will need to find someone whom I know I can trust to have expert knowledge in this very specific field.

You could maybe trust the tools instead of the persons? ubsan reports errors when zlib-ng uses the mentioned build options.

manxorist commented 2 years ago

If our code would be as shit as what was claimed in this issue, it wouldn't have worked on 32-bit PowerPC as it has quite high restrictions on unaligned access, to similar level like was mentioned here about older ARM processors.

Well, I am claiming that UNALIGNED_OK will not work on ppc32. ./configure does not set that on ppc32 for me, which is good. I am additionally claiming, that it is a problem also for all other architectures for the reasons already explained.

Now let's look if zlib-ng works on ppc32, even without UNALIGNED_OK:

bleifus$ uname -a
OpenBSD bleifus 7.0 GENERIC#940 macppc
bleifus$ dmesg | grep cpu
cpu0 at mainbus0: 7447A (Revision 0x101): 1416 MHz: 512KB L2 cache
bleifus$ git clone https://github.com/zlib-ng/zlib-ng
Cloning into 'zlib-ng'...
remote: Enumerating objects: 12982, done.
remote: Counting objects: 100% (1295/1295), done.
remote: Compressing objects: 100% (593/593), done.
remote: Total 12982 (delta 855), reused 921 (delta 669), pack-reused 11687
Receiving objects: 100% (12982/12982), 7.07 MiB | 2.95 MiB/s, done.
Resolving deltas: 100% (8972/8972), done.
bleifus$ cd zlib-ng/
bleifus$ ./configure
Checking for compiler... cc
Checking for shared library support... Building shared library libz-ng.so.2.1.0.devel with cc.
Checking for off64_t... No.
Checking for _off64_t... No.
Checking for fseeko... Yes.
Checking for posix_memalign... Yes.
Checking for strerror... Yes.
Checking for unistd.h... Yes.
Checking for ptrdiff_t... Yes.
Checking for ANSI C compliant compiler...  Yes.
Checking for -fno-semantic-interposition... No.
Checking for __builtin_ctz ... Yes.
Checking for __builtin_ctzll ... Yes.
ARCH: macppc
Using arch directory: arch/generic
Architecture-specific static object files:
Architecture-specific shared object files:
grep: /home/manx/tmp/zlib-ng/arch/generic/*.c: No such file or directory
grep: /home/manx/tmp/zlib-ng/arch/generic/*.c: No such file or directory
bleifus$ gmake
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o adler32_test.o /home/manx/tmp/zlib-ng/test/adler32_test.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DWITH_GZFILEOP -I/home/manx/tmp/zlib-ng -c -o gzlib.o /home/manx/tmp/zlib-ng/gzlib.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DWITH_GZFILEOP -I/home/manx/tmp/zlib-ng -c -o gzread.o /home/manx/tmp/zlib-ng/gzread.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DWITH_GZFILEOP -I/home/manx/tmp/zlib-ng -c -o gzwrite.o /home/manx/tmp/zlib-ng/gzwrite.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o adler32.o /home/manx/tmp/zlib-ng/adler32.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o chunkset.o /home/manx/tmp/zlib-ng/chunkset.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o compare258.o /home/manx/tmp/zlib-ng/compare258.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o compress.o /home/manx/tmp/zlib-ng/compress.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o crc32.o /home/manx/tmp/zlib-ng/crc32.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o crc32_comb.o /home/manx/tmp/zlib-ng/crc32_comb.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o crc32_fold.o /home/manx/tmp/zlib-ng/crc32_fold.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o deflate.o /home/manx/tmp/zlib-ng/deflate.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o deflate_fast.o /home/manx/tmp/zlib-ng/deflate_fast.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o deflate_huff.o /home/manx/tmp/zlib-ng/deflate_huff.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o deflate_medium.o /home/manx/tmp/zlib-ng/deflate_medium.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o deflate_quick.o /home/manx/tmp/zlib-ng/deflate_quick.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o deflate_rle.o /home/manx/tmp/zlib-ng/deflate_rle.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o deflate_slow.o /home/manx/tmp/zlib-ng/deflate_slow.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o deflate_stored.o /home/manx/tmp/zlib-ng/deflate_stored.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o functable.o /home/manx/tmp/zlib-ng/functable.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o infback.o /home/manx/tmp/zlib-ng/infback.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o inffast.o /home/manx/tmp/zlib-ng/inffast.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o inflate.o /home/manx/tmp/zlib-ng/inflate.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o inftrees.o /home/manx/tmp/zlib-ng/inftrees.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o insert_string.o /home/manx/tmp/zlib-ng/insert_string.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o insert_string_roll.o /home/manx/tmp/zlib-ng/insert_string_roll.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o slide_hash.o /home/manx/tmp/zlib-ng/slide_hash.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o trees.o /home/manx/tmp/zlib-ng/trees.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o uncompr.o /home/manx/tmp/zlib-ng/uncompr.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o zutil.o /home/manx/tmp/zlib-ng/zutil.c
ar rc libz-ng.a adler32.o chunkset.o compare258.o compress.o crc32.o crc32_comb.o crc32_fold.o deflate.o deflate_fast.o deflate_huff.o deflate_medium.o deflate_quick.o deflate_rle.o deflate_slow.o deflate_stored.o functable.o infback.o inffast.o inflate.o inftrees.o insert_string.o insert_string_roll.o slide_hash.o trees.o uncompr.o zutil.o  gzlib.o gzread.o gzwrite.o
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL  -o adler32_test adler32_test.o gzlib.o gzread.o gzwrite.o libz-ng.a
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o crc32_test.o /home/manx/tmp/zlib-ng/test/crc32_test.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL  -o crc32_test crc32_test.o libz-ng.a
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DWITH_GZFILEOP -I/home/manx/tmp/zlib-ng -c -o example.o /home/manx/tmp/zlib-ng/test/example.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL  -o example example.o gzlib.o gzread.o gzwrite.o libz-ng.a
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DWITH_GZFILEOP -I/home/manx/tmp/zlib-ng -c -o minigzip.o /home/manx/tmp/zlib-ng/test/minigzip.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL  -o minigzip minigzip.o gzlib.o gzread.o gzwrite.o libz-ng.a
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o makefixed.o /home/manx/tmp/zlib-ng/tools/makefixed.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL  -o makefixed makefixed.o gzlib.o gzread.o gzwrite.o libz-ng.a
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o maketrees.o /home/manx/tmp/zlib-ng/tools/maketrees.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL  -o maketrees maketrees.o gzlib.o gzread.o gzwrite.o libz-ng.a
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -I/home/manx/tmp/zlib-ng -c -o makecrct.o /home/manx/tmp/zlib-ng/tools/makecrct.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL  -o makecrct makecrct.o gzlib.o gzread.o gzwrite.o libz-ng.a
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o adler32.lo /home/manx/tmp/zlib-ng/adler32.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o chunkset.lo /home/manx/tmp/zlib-ng/chunkset.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o compare258.lo /home/manx/tmp/zlib-ng/compare258.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o compress.lo /home/manx/tmp/zlib-ng/compress.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o crc32.lo /home/manx/tmp/zlib-ng/crc32.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o crc32_comb.lo /home/manx/tmp/zlib-ng/crc32_comb.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o crc32_fold.lo /home/manx/tmp/zlib-ng/crc32_fold.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o deflate.lo /home/manx/tmp/zlib-ng/deflate.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o deflate_fast.lo /home/manx/tmp/zlib-ng/deflate_fast.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o deflate_huff.lo /home/manx/tmp/zlib-ng/deflate_huff.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o deflate_medium.lo /home/manx/tmp/zlib-ng/deflate_medium.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o deflate_quick.lo /home/manx/tmp/zlib-ng/deflate_quick.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o deflate_rle.lo /home/manx/tmp/zlib-ng/deflate_rle.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o deflate_slow.lo /home/manx/tmp/zlib-ng/deflate_slow.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o deflate_stored.lo /home/manx/tmp/zlib-ng/deflate_stored.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o functable.lo /home/manx/tmp/zlib-ng/functable.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o infback.lo /home/manx/tmp/zlib-ng/infback.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o inffast.lo /home/manx/tmp/zlib-ng/inffast.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o inflate.lo /home/manx/tmp/zlib-ng/inflate.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o inftrees.lo /home/manx/tmp/zlib-ng/inftrees.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o insert_string.lo /home/manx/tmp/zlib-ng/insert_string.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o insert_string_roll.lo /home/manx/tmp/zlib-ng/insert_string_roll.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o slide_hash.lo /home/manx/tmp/zlib-ng/slide_hash.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o trees.lo /home/manx/tmp/zlib-ng/trees.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o uncompr.lo /home/manx/tmp/zlib-ng/uncompr.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -I/home/manx/tmp/zlib-ng -c -o zutil.lo /home/manx/tmp/zlib-ng/zutil.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -DWITH_GZFILEOP -I/home/manx/tmp/zlib-ng -c -o gzlib.lo /home/manx/tmp/zlib-ng/gzlib.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -DWITH_GZFILEOP -I/home/manx/tmp/zlib-ng -c -o gzread.lo /home/manx/tmp/zlib-ng/gzread.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DPIC -DWITH_GZFILEOP -I/home/manx/tmp/zlib-ng -c -o gzwrite.lo /home/manx/tmp/zlib-ng/gzwrite.c
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -shared  -o libz-ng.so.2.1.0.devel  adler32.lo chunkset.lo compare258.lo compress.lo crc32.lo crc32_comb.lo crc32_fold.lo deflate.lo deflate_fast.lo deflate_huff.lo deflate_medium.lo deflate_quick.lo deflate_rle.lo deflate_slow.lo deflate_stored.lo functable.lo infback.lo inffast.lo inflate.lo inftrees.lo insert_string.lo insert_string_roll.lo slide_hash.lo trees.lo uncompr.lo zutil.lo  gzlib.lo gzread.lo gzwrite.lo
rm -f libz-ng.so libz-ng.so.2
ln -s libz-ng.so.2.1.0.devel libz-ng.so
ln -s libz-ng.so.2.1.0.devel libz-ng.so.2
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL  -o adler32_testsh adler32_test.o gzlib.o gzread.o gzwrite.o libz-ng.so.2.1.0.devel
cc  -o crc32_testsh crc32_test.o libz-ng.so.2.1.0.devel
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL  -o examplesh example.o gzlib.o gzread.o gzwrite.o libz-ng.so.2.1.0.devel
cc -O -DHAVE_POSIX_MEMALIGN -DWITH_GZFILEOP -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL  -o minigzipsh minigzip.o gzlib.o gzread.o gzwrite.o libz-ng.so.2.1.0.devel
bleifus$ gmake test
gmake -C test
gmake[1]: Entering directory '/home/manx/tmp/zlib-ng/test'
hello world
zlib-ng version 2.1.0.devel = 0x02010000, compile flags = 0x95
uncompress(): hello, hello!
gzread(): hello, hello!
gzgets() after gzseek:  hello!
gzgets(): hello, hello!
inflate(): hello, hello!
large_inflate(): OK
large_inflate(): OK
after inflateSync(): hello, hello!
inflate with dictionary: hello, hello!
deflateBound(): OK
deflate_copy(): OK
deflateGetDictionary(): hello, hello!
deflateSetHeader(): OK
deflateTune(): OK
deflatePending(): OK
deflatePrime(): OK
                *** zlib test OK ***
Segmentation fault (core dumped)
                *** zlib shared test FAILED ***
gmake[1]: *** [Makefile:59: testshared] Error 1
gmake[1]: Leaving directory '/home/manx/tmp/zlib-ng/test'
gmake: *** [Makefile:175: test] Error 2
bleifus$
manxorist commented 2 years ago

We compile on all architectures that can be emulated on qemu or wine inside Ubuntu Linux (using GitHub Actions).

Unfortunately, QEMU, at least for ARM, up until very recently did not accurately emulate behavior for unaligned access: See for example https://bugs.launchpad.net/qemu/+bug/1905356. For ARM, this appears to be fixed for recent QEMU, but I am not aware how this is currently handled for other architectures (or how it has been in the past).

KungFuJesus commented 2 years ago

Ok, so the ABI doesn't define behavior for what to do for unaligned access, but for most architectures I've dealt with the ISA defines some sort of behavior to handle unaligned access, even if in software. I was sort of conflating the ISA's calling conventions and contract with the OS as how it interacts with unaligned load faults. Typically the operating system can step up and handle unaligned load in software, sometimes at a performance penalty when it happens.

Most of the time I've dealt with unaligned access it's been with either SPARC or PowerPC. SPARC with Solaris is far less forgiving when it comes to that (you can, with the compiler, allow the binary to handle unaligned access by the OS, very very slowly). I have a ppc64 machine here on Linux I can test this on as well to get to the bottom of it later tonight. I do find it interesting that your segfault is happening on OpenBSD. It's also interesting that you're seeing this without any sort of assumed unaligned loads to GPRs. Perhaps you can provide a stack trace for where it's happening?

KungFuJesus commented 2 years ago

Also the "arch" string being pulled out of OpenBSD's uname is likely why your configure script isn't using any of the altivec code, despite it clearly being a G4. Perhaps the configure scripts need to be modified to include 'macppc' in there. In any case the generic C oughtn't be giving you segfaults, which is likely what it's using for an unrecognized arch.

mtl1979 commented 2 years ago

For me it seems the shared tests are mixing PIC and non-PIC files... That causes duplicate symbols that the linker doesn't properly handle... This can be fixed easily.

KungFuJesus commented 2 years ago

Cool. Also sounds like CMakeList macros and configure scripts need to be modified to accept OpenBSD's somewhat strange uname -m output?

KungFuJesus commented 2 years ago

@manxorist Your points about the ub sanitizer are valid, but zlib-ng is pretty careful to only allow such behavior on ISAs where it's handled gracefully. On x86 I don't think it's ever been an issue, and it's certainly faster to do unaligned type punning there. Armv6 and newer definitely handle these things fine, and those are the two ISAs where it's allowed.

manxorist commented 2 years ago

@KungFuJesus

and it's certainly faster to do unaligned type punning there.

It's still irrelevant though whether the architecture can do it, when the compiler is allowed to assume that it will not happen. I have already explained that I consider a performance argument not convincing here.

I have already shown that correct code compiles to the exact same instructions as the undefined behavior code at least on amd64 with all 3 major compilers. What kind of speed improvement are you talking about here?

KungFuJesus commented 2 years ago

For where the unaligned access is happening, I can see an argument being made where there's little benefit to dereferencing these arbitrary pointers, especially if on appropriate ISAs memcmp gets subbed in with a "builtin" memcmp that inlines the expression if it's small enough (something GCC likes to do). However, the semantics of memcmp force byteswaps for both inputs in order to make the return argument reflect >, <, or =, when all we actually care about is =. That's a lot of byte swapping for a swath of 258 bytes, the largest run length that's compared.

Note: your load8() + compare chain does do more what we'd want, assuming that GCC appropriately subs in the builtin.

manxorist commented 2 years ago

@KungFuJesus

For where the unaligned access is happening, I can see an argument being made where there's little benefit to dereferencing these arbitrary pointers, especially if on appropriate ISAs memcmp gets subbed in with a "builtin" memcmp that inlines the expression if it's small enough (something GCC likes to do). However, the semantics of memcmp force byteswaps for both inputs in order to make the return argument reflect >, <, or =, when all we actually care about is =. That's a lot of byte swapping for a swath of 258 bytes, the largest run length that's compared.

If you do not actually need -1/0/+1, then just don't use memcmp, but something like my eq8. If you think you can do better for longer sequences of bytes just do so as needed. However, if the argument is, that type-punning consecutive 64bit values instead of using my load8 would be faster, I fail to see what's the difference here.

KungFuJesus commented 2 years ago

Not my argument, see my note added later. If the compiler is doing the right thing, then it's perhaps not a good argument at all. It's worth noting that almost all of the code in question is in the generic C code (though, compare258 does do this behavior for the first 2 bytes as well before moving onto the SIMD routines). I do have an x86 machine that would be affected by the proposed change, as it doesn't have a SIMD accelerated 258 byte comparison (though it could, I plan to write it soon).

I do worry that crappier compilers may not make the same decisions. It does look like MSVC is making the same determination, though... @mtl1979 @nmoinvaz @Dead2 thoughts? It seems that the worst case here is that there's marginally more C code to tell the compiler to do the extraneous (byte level) load that it should be inlining away. Perhaps we should test this in a loop to make sure that everything still gets appropriately inlined?

Presumably the aligned comparisons could be faster for a larger run, but you'd need code that loaded the first 1 or 2 bytes to hit an aligned offset (also this becomes basically impossible when dealing with two arbitrary pointers that are short strings, as you have to do this up to the least common multiple of the two unaligned offsets). And unless you are unlucky enough to have the load span a cache line boundary, it's more code for no gain on x86 (yes, I realize that's not what you're proposing, I'm just pondering the possibilities here while we're thinking about it).

nmoinvaz commented 2 years ago

@KungFuJesus I have tried to optimize the compare258 for unaligned and aligned in the past, but I do have doubts about it. I do have a compare258 benchmark that might be useful in this instance for testing.

KungFuJesus commented 2 years ago

Regarding memcmp: an explicit comparison with 0 following it makes it eliminate the bswap entirely, so that's not even a real concern.

@KungFuJesus I have tried to optimize the compare258 for unaligned and aligned in the past, but I do have doubts about it. I do have a compare258 benchmark that might be useful in this instance for testing.

I think part of the issue is that 258 bytes is a pretty short string to begin with. Cache lines are 64 bytes, so you won't have that many cache line crossing unaligned loads to begin with. Aligning both reads is also a chore.

I do wonder if SIMD aligned reads would be worthwhile in that function for things in the pre-nehalem era, though. Aligned loads certainly made a huge difference in the adler checksum, of which I don't have a great way to force alignment because I can't figure out yet a way to do generic i-1's sum carryovers without constantly applying a mask and doing a secondary sum of absolute differences each loop iteration. I'm also pretty perplexed at SSE4's string comparison instructions - they seem to have absurd latency and eat up a lot of execution ports compared to the alternative (straight 128 bit comparisons followed by movemask and a tzcnt or bsf to determine first non matching byte). I'm not sure they make much sense at all unless we need the characteristics of a null terminated string (including things like fast case insensitivity).

Dead2 commented 2 years ago

If we can do "the right thing™️" without it having a significant affect on performance or maintainability, then there should be little resistance to doing so.

So, something that has been mentioned before (but not in this thread) is that we should go through the places where we do unaligned access today, and test/figure out which ones we can improve our standards-compliance without sacrificing performance. If that leaves us with a lower number of places than today, then that is good and can also make improving the remaining areas a smaller burden. Currently, at least I have little to no real idea what the performance boost is for each instance(I am thinking function level) of unaligned access, I have so far only looked at completely on/completely off.

Doing this testing is not very difficult, but is a bit time consuming. To make it easier, I think doing just x86-64 with gcc 4.8 should be sufficient, unless it is also valid for newer gcc versions after #1086. These are just me thoughts at the moment though, my head is throbbing today so I am not getting any good work/concentration done.

nmoinvaz commented 2 years ago

The whole of WITH_UNALIGNED, UNALIGNED_OK and UNALIGNED64_OK can also completely go away..

I'm not sure that is the case because we use UNALIGNED_OK to determine which algorithms are available in functable, chunkset, and match_tpl. Do you know of another way of handling these scenarios?

nmoinvaz commented 2 years ago

The only other area where unaligned reads occur are in crc32_little and crc32_big. Both are also in zlib.

KungFuJesus commented 2 years ago

The whole of WITH_UNALIGNED, UNALIGNED_OK and UNALIGNED64_OK can also completely go away..

I'm not sure that is the case because we use UNALIGNED_OK to determine which algorithms are available in functable, chunkset, and match_tpl. Do you know of another way of handling these scenarios?

The reason for that is mainly because of the unaligned implicit loads that happen for the first two bytes on the x86 SIMD variants. Afterward, everything else is explicitly using unaligned 128/256-bit load intrinsics. For those, we definitely don't need to protect with UNALIGNED_OK. We could roll the first 2 byte comparison into the loads and just do a relatively wasteful load at the end of len 2 bytes. The vanilla C on the other hand is doing unaligned loads into GPRs when unaligned loads are ok. That is the use case we're talking about unifying with memcmp to load up 4/8 bytes at a time.

The UNALIGNED64_OK is mostly there for architectures which support 64 bit wide loads. I think on some architectures it's safe to do up to 32 bit loads unaligned and that's why those are separated. In any case this mostly goes away with a loop of memcmp's of the respective size, as I think if the len is known at compile time, it uses __builtin_memcmp which can be inlined and on supported architectures, just does the unaligned load right there instead of copying all 4 or 8 bytes into an aligned location + compare or shift'ing and OR'ing the byte values into a given register or whatever inefficient but compatible sequence the compiler might derive.

nmoinvaz commented 2 years ago

We could roll the first 2 byte comparison into the loads and just do a relatively wasteful load at the end of len 2 bytes.

The reason why I initially did the 2 byte comparison was because it was faster in some cases (I don't remember which). I would welcome somebody else checking to see if there are some optimizations to be had there.

KungFuJesus commented 2 years ago

We could roll the first 2 byte comparison into the loads and just do a relatively wasteful load at the end of len 2 bytes.

The reason why I initially did the 2 byte comparison was because it was faster in some cases (I don't remember which).

I could see that being the case - a direct comparison even if going into flag registers could be faster than relying on 16 or 32 bytes worth of data being shoved into a vector register - especially if it's often the case that the match fails in the first 2 bytes. In any case, memcmp can accomplish nearly the same thing and should compile to the same code (we have to do another comparison if false for the first 2 bytes to determine which of the 2 failed the match).

I do highly suspect the SSE4.2 string compares would be slower than SSE2 pcmpeqb + movemask + bsf/tzcnt. Chaining all of those together still ends up being lower latency than a single one of those string comparison instructions. It probably has much less crazy decode complexity, too.

nmoinvaz commented 2 years ago

Another thing we could do is move the 2 byte check into match_tpl.h and use compare256 instead of compare258. In this case we would not incur the cost of the jmp if the first two bytes don't match.

nmoinvaz commented 2 years ago

Actually in match_tpl.h compare258 is compiled inline. But I think I had noticed that sometimes the compiler chooses not to inline these functions (MSVC?). Anyways, this is off-topic.

nmoinvaz commented 2 years ago

It looks like crc32_little and crc32_big shouldn't be affected because it aligns the reads.

KungFuJesus commented 2 years ago

So it's just the memcmp changes that need to occur to be able to close out this bug? :)

turol commented 2 years ago

An example where unaligned loads caused incorrect code generation: http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html