lz4 r124 breaks ABI without bumping soname

GoogleCodeExporter commented 9 years ago

You can't make changes like this without increasing lz4.so's soname version:

https://code.google.com/p/lz4/source/diff?spec=svn124&r=124&format=side&path=/tr
unk/lz4.h

ABI changes this like cause programs which link to lz4.so to behave strangely, 
or more likely, crash:

https://bugs.archlinux.org/task/42944

Please be considerate of your downstream consumers.

Original issue reported on code.google.com by d...@falconindy.com on 28 Nov 2014 at 6:34

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

To be clear, the breakage is on line 237 where you change 
LZ4_STREAMDECODESIZE_U32 from 4 to 8 bytes, changing the size of 
LZ4_streamDecode_t.

Original comment by d...@falconindy.com on 28 Nov 2014 at 7:14

GoogleCodeExporter commented 9 years ago

Indeed,
the internal content of state LZ4_streamDecode_t has changed,
in order to support some new scenario
(namely, short distance ring buffers).

As a consequence, it requires to track more values, 
hence to increase the size of state.

Sorry about inconvenience.
Note that the library version has been upgraded from 1.3.1 to 1.4.0.
Consequently, the new soname is liblz4.so.1.4.0.

Systems can still host both versions concurrently liblz4.so.1.3.1 and 
liblz4.so.1.4.0. But I suspect your program uses instead the symbolic link 
"liblz4.so" or "liblz4.so.1" which probably tracks latest version.

Original comment by yann.col...@gmail.com on 28 Nov 2014 at 10:11

GoogleCodeExporter commented 9 years ago

No, the soname is not the filename.

$ readelf -d /usr/lib/liblz4.so | grep SONAME
 0x000000000000000e (SONAME)             Library soname: [liblz4.so.1]

If you change the ABI or API (not simply extend), you *must* bump the soname 
version. r124 should have produced liblz4.so.2.

Original comment by d...@falconindy.com on 28 Nov 2014 at 10:53

GoogleCodeExporter commented 9 years ago

OK, I guess it comes from this line in the Makefile :

SONAME_FLAGS = -Wl,-soname=liblz4.$(SHARED_EXT).$(LIBVER_MAJOR)

So only LIBVER_MAJOR is part of the soname.

I don't know much the logic behind soname. So I'm not sure if it's normal that 
only LIBVER_MAJOR is part of the name.

But it looks strange if every minor change requires a major numbering update.
What the point of having a 3-parts numbering system then ? Is that just to 
track comment changes ?

Original comment by yann.col...@gmail.com on 28 Nov 2014 at 11:14

GoogleCodeExporter commented 9 years ago

I don't have a better answer than simply: this is the way things are. It would 
be atypical that the soname contains more than a single number to version it. 
The tag really is quite arbitrary, and so long as you never reuse a tag to 
refer to 2 different ABIs, you can name it whatever you want.

https://autotools.io/libtool/version.html

This site is specific to libtool/autotools, but the versioning recommendations 
(especially section 4.1) apply to any shared library released into the wild.

Original comment by d...@falconindy.com on 28 Nov 2014 at 11:31

GoogleCodeExporter commented 9 years ago

Thanks for the link. It's pretty clearly described.

Original comment by yann.col...@gmail.com on 28 Nov 2014 at 11:41

GoogleCodeExporter commented 9 years ago

Please note that the "raw" sonames and the libtool version-info are not a 
match, the soname is built out of the version-info, so you should not apply the 
rules from my guide to non-libtool usage.

The three-parts version system is a bit mental, to be honest — I would 
suggest you keep it simpler as a two-components version, and bump the major 
(soname) version if you change ABI, and the minor (non-soname) if you extend it 
(add functions).

Note that you *can* change the ABI without changing the API (e.g. by changing 
an int32_t parameter to int64_t), as well as you can change the API without 
changing the ABI (more involved, requires the use of symbol versioning which is 
mostly GNU-specific.)

https://blog.flameeyes.eu/tag/abi for reference.

Original comment by flameeyes@flameeyes.eu on 3 Dec 2014 at 7:43

GoogleCodeExporter commented 9 years ago

Yeah, I agree and definitely use the 3 parts version-info the way you describe.

To be fair, I wasn't expecting the small change introduced into r124 to result 
into an ABI break. That's a lesson.

But to be more complete, there are more points at stake :

- The affected part of the API is within a section labelled "experimental", 
which basically poses the question of how to introduce new features which are 
not yet "completely settled". It's unreasonable to expect the same stability 
level from new features compared to other ones supported since many years.
If tagging a section of the API "experimental" is not good enough, then how to 
do it ?

- The issue comes from using types which are statically defined into the .h. 
Hence they can become out of synch with the library.

But to fair, I wasn't expecting these type definitions to be used in tandem 
with a shared library strategy. I was explained not so long ago that everything 
within a shared library should be accessible without relying on a *.h file.
Hence, there are 2 functions, which create and free memory for the state 
object. These functions are "by design" synchronized with the library, and 
therefore safer to use.

The statically defined type was intended to be used with static library in mind.
For dynamic library, the constructor/destructor functions are supposed to be 
used instead, because this method is more stable in relation with version 
management.

How to achieve that ? Is that a reasonable objective ?
I suspect that adding comments within the header file will not be enough.

Original comment by yann.col...@gmail.com on 3 Dec 2014 at 11:37

GoogleCodeExporter commented 9 years ago

Ah, and I should add :

I'm heavily using Continuous Integration tests with each minor release of LZ4 :
https://travis-ci.org/Cyan4973/lz4#

It's an invaluable tool and works very well.
The number of tests run is now pretty large and cover a large palette of 
situations.

However, there is no test related to Dynamic Library usage.
Hence there is no test able to spot an ABI breaking change.
It would be great too if some new tests could be added to the CI to keep an 
automated eye on such issues.

Original comment by yann.col...@gmail.com on 3 Dec 2014 at 11:45

GoogleCodeExporter commented 9 years ago

What you can do is make sure that the type is opaque, by not defining it at all 
in the header file; as long as your API calls are only accepting a pointer, you 
don't have to define the structure publicly at all (just say "struct foobar;", 
then you can use "struct foobar *" as you need).

If you define the structure in the header file, there is no hope to just say 
"don't allocate it statically", somebody will.

Original comment by flameeyes@flameeyes.eu on 3 Dec 2014 at 8:42

GoogleCodeExporter commented 9 years ago

Well, to be clear, the intention is :

allocating the structure is a valid operation
(actually, some embedded systems do require such a behavior)
*when* the source code is compiled directly within the program.
In such case, there is no risk of mismatching versions.

However when the library is compiled as a shared DLL,
it is no longer recommended, due to risks of different versions.
Better use the constructor functions.

Basically, they look like 2 incompatible objectives.

The only solution that came to my mind is
having 2 separate *.h header files,
one for direct compilation/integration,
and one for DLL.

But it looks hackish. 
Furthermore, I'm not aware of any other project having done that before.

Original comment by yann.col...@gmail.com on 3 Dec 2014 at 8:50

GoogleCodeExporter commented 9 years ago

Many projects use the idea of an opaque context object. It's a reliable method 
of creating an API which can be evolved while maintaining backwards 
compatibility. It doesn't matter what the true size of the context struct is 
because it's self-contained within the lib. You're right to want to avoid 
having multiple header files -- this is not a solution you will be happy with 
in the long term.

Could you explain what you mean about allocating the structure being a valid 
operation? Why would allocating the struct on the heap be a problem? Other code 
inside lz4 allocates memory on the heap. What's the difference?

Original comment by d...@falconindy.com on 3 Dec 2014 at 9:05

GoogleCodeExporter commented 9 years ago

For me, and for PC environment in general, I feel there is no problem with that.

Such requests come from embedded environment, which can be very different.
I remember having met in the past several programmers which were suspicious of 
standard C libs, let alone malloc(). So I wasn't surprised when I received 
several requests to allow a way for "custom allocation methods" and even 
"statically allocated state". Of course within such environments, the concept 
of DLL doesn't even exist.

The first solution proposed was to provide the size of structure to allocate.
Hence function such as :
int LZ4_sizeofState(void);

It works, but it's still not completely fine, because it doesn't help static 
allocation scenarios, and there is also a need to ensure right alignment. Sure, 
malloc() is safe regarding alignment, but a "custom function" may be not.

"unalignment" costs little penalty on x86 environment, but for embedded 
systems, it can be catastrophic (CPU exception, program crash).
Putting an advise in the comment is better than nothing, but still not really 
automated enough.

Hence the second idea.
The size to allocate is expressed as a table of int32 or int64.
It will force the compiler to ensure alignment.

Note that current structures are already partially opaque : only the size is 
accessible, for allocation purpose. The content remains hidden, and can change. 
This trade-off was looking good so far.

Well, that is, before this new issue appeared : different versions, one used to 
compile the program, and another one (DLL) present on the target system, with 
difference table sizes as a consequence.

As I said, this "table" methodology was primarily intended for static 
allocation within embedded environment. More powerful systems, those using DLL, 
should really better use the constructor methods.
Except I can't force them to do so if a table structure is also present in the 
header.
Plus, to be fair, static allocation has some merits, it's simply more 
convenient to use in many circumstances (no malloc() also means no free(), less 
complexity and less risks of memory leaks).

Another tentative idea to reduce occurrence of such situation is to "cheat" a 
little on the real size of the tables, by declaring them larger than they 
really need to be. This way, it would provide a bit of room for potential 
future evolution, limiting the risk to introduce breaking changes.

This is all very prospective for me.
It looks difficult to please both worlds (embedded, and DLL) simultaneously.
Hence the suggestion of 2 different API.
But I agree, such solution doesn't look good from a maintenance perspective...

last minute idea : maybe a kind of "header extension for embedded environment" ?

Original comment by yann.col...@gmail.com on 3 Dec 2014 at 9:45

GoogleCodeExporter commented 9 years ago

The more I'm thinking about it, the more I'm suspecting that having 2 separate 
headers, for Dynamic and Static linking, might be the better way to go.

Basically, the "normal" header would remain :
lz4.h
and be targeted at "DLL" API.

A more complete 
lz4_static.h
would then be created, which includes "lz4.h", and also adds definitions only 
valid for static linking, such as macros and structure sizes.

The last point which I'm worried about is that
advantages of strongly typed arguments might be lost in the process.
Structures with undefined size will have to be casted as void*.
Even a typedef won't help the compiler making a difference between 2 different 
undefined structures, since they all must be :
typedef void* structPtrX;

So an error such as structPtr1 instead of structPtr2 won't be noticed.

Finally, something I'm wondering : differences between API & ABI.
Suppose that I remove macros from API.
It obviously impact API, but I suspect it doesn't change ABI.
Is that correct ?

I'm asking because soname seems tied to ABI changes.

Original comment by yann.col...@gmail.com on 8 Dec 2014 at 1:06

GoogleCodeExporter commented 9 years ago

You don't need to typedef everything to void. Your .c file can define a struct, 
say 'struct foo_t', include the public header, and your public header can 
define 'typedef struct foo_t foo_t'. Consumers of the API won't have the 
private definition of the struct's members, so they'll only be able to define 
pointers to foo_t, relying on library calls to perform allocation. The library 
still knows the real size of foo_t.

You're correct that soname is tied to the ABI. Any time the ABI changes, so 
must the soname. And as you've discovered, it's possible to change API without 
affecting ABI. It's also possible to change ABI without affecting API, for 
example, by reordering members within a struct.

I notice r125 was pushed and the build still produces liblz4.so.1 =(

Original comment by d...@falconindy.com on 13 Dec 2014 at 11:15

GoogleCodeExporter commented 9 years ago

The current issue is caused by a minor change in an experimental section of the 
API which affected only one user so far. The said user is no longer affected, 
both because it updated its reference lib (*.h) and moved to lz4frame API. So I 
believe changing soname at this stage is more likely to generate confusion than 
help.

I prefer keeping soname bump for a more drastic API change, that may happen 
sometimes in 2016 (or not).

Original comment by yann.col...@gmail.com on 24 Dec 2014 at 2:24

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

That's a truly disappointing and disheartening response. I can't take lz4 
seriously as a project if this is going to be your approach to maintaining it. 
This bug caused broken binaries in Arch Linux for thousands of users. To claim 
that only "one" user was affected is gravely understating the effect.

Original comment by d...@falconindy.com on 24 Dec 2014 at 3:19

GoogleCodeExporter commented 9 years ago

Indeed, this is disappointing.

There are three reasons why WONTFIX resolution is wrong:
- there's no way to know how many programs were compiled with liblz4.so.1. Many 
Linux distributions provide liblz4 as a shared library, and certainly there are 
users out there linking their own programs with it. (I am, for one).

- tagging something as "experimental" in the documentation simply does not 
work. For example the discussions about CONFIG_EXPERIMENTAL in Kernel config, 
which is enabled by all distributions.

- there's no need to "conserve" soname numbers. They are not visible like 
project version numbers. The right thing is to bump the number, even if the 
incompatible change was detected afterwards. This will cause rebuilds to 
happen, and the potential crashes will be avoided, even if people are not aware 
of the issue yet, or haven't diagnosed it as caused by the liblz4 change. This 
*is* the cleanest solution.

Original comment by zbys...@in.waw.pl on 24 Dec 2014 at 3:48

GoogleCodeExporter commented 9 years ago

Although we have not yet packaged r124, this will also cause problems on Gentoo 
Linux, where all of our users build from source.

Please update the soname for ABI changes.

Original comment by floppymaster@gmail.com on 24 Dec 2014 at 6:54

GoogleCodeExporter commented 9 years ago

this bug with inconsistent .so as caused to systemd [1] to questioned the usage 
of lz4 as a valid option as show in the subsequent message "o the question is 
whether to replace lz4 with something more stable or to ignore the issue and 
hope it doesn't happen again." also that question the stability of lz4.

[1] 
http://lists.freedesktop.org/archives/systemd-devel/2015-February/028803.html

Original comment by prflr88@gmail.com on 26 Feb 2015 at 5:25

GoogleCodeExporter commented 9 years ago

It's probably a good idea to open this discussion again.
There's a need to keep a constructive level, seeking ways to improve situation 
for the future. Blame game doesn't move us anywhere.

Zbigniew, contributed to this thread, and provided some interesting comments. 
They can be a good reference for discussion.

The second comment (tagging something as "experimental" in the documentation 
simply does not work) is indeed very relevant. 
As a consequence, I decided to modify the way library is shipped.

There are now 2 different types of header files. 
For an example, look at lz4frame.
There is an "lz4frame.h", which is the normal header, including prototypes 
considered stable and supported. This header is installed within "include" 
directory as part of the library installation process.

Another file, "lz4frame_static.h" is also proposed. It contains definitions 
which can only be used **in the context of static linking**. As a consequence, 
it's not installed within "include" directory. The file also clearly states "do 
not use with a dynamic library". 
Its objective is to publish definitions which can still evolve in a potentially 
breaking way in future versions. They are therefore not stable enough to use 
with multiple versions of the dynamic library. However, for a static linking 
strategy, it's not an important consideration, and the advanced possibilities 
it offers could be useful for some specific targets.

A difficult line must be drawn between "future proof API", which are best 
supported when remaining "high level" and simple enough.
On the other hand, "very fine-grained API for advanced user" can be necessary 
typically to support embedded environments. They have much stricter and 
specific requirements.
These 2 set of requirements are almost opposite.

For example, embedded developper is typically directly in charge of memory 
allocation, using static methodology. Also noteworthy, even standard C library 
support might be very limited (such as malloc).
As a consequence, the number of "gears" (prototype) required for the developper 
to take in charge all these parameters and produce an environment-compatible 
program is much larger.
And obviously, more numerous and finer prototypes translates into increasing 
chances of breaking changes.
It's not necessarily a bad thing, refusing changes is like refusing learning 
and improvements.
But there should be a way to allow such kind of flexibility on new, 
experimental and highly targeted prototypes,
without jeopardizing the whole numbering scheme, which objective is more in 
line with the vast majority of users using stable prototypes, and which do not 
care about such advanced functions.

With the new separation between "*.h" and "*-static.h", I hope to offer this 
kind of flexibility (now located into *-static.h, targeting new usages or 
embedded environments), while offering stability (now located into *.h) for 
common environments using dynamic linking, and potentially facing multiple 
library versions.

Note, by the way, that the part of API which used to be tagged "experimental" 
and caused unfortunate troubles for systemd
is no longer tagged so. It is also included into lz4.h, meaning it is 
considered stable from now on. A major number up will be granted in case of any 
interface-breaking changes (none of which is planned for the time being).

Regards

Original comment by yann.col...@gmail.com on 26 Feb 2015 at 2:42

xiaolu / lz4

lz4 r124 breaks ABI without bumping soname #147