zherczeg / sljit

Platform independent low-level JIT compiler

Port PCRE2 JIT to Linux on IBMz (s390x) #89

Open edelsohn opened 4 years ago

edelsohn commented 4 years ago

Enable PCRE2 JIT on Linux on IBMz (s390x-linux) and optimize to achieve equivalent speedup over non-JIT code as x86_64. Goal is full functionality and passing testsuite with JIT enabled.

An unfinished port of PCRE2 JIT to s390x exists in the Linux-on-IBM-z Github account and can be used as a starting point. IBM will contribute the code as necessary.

https://github.com/linux-on-ibm-z/pcre2/tree/s390x-experimental

A $5000 bounty from IBM is posted on Bountysource: https://www.bountysource.com/issues/92353837-port-pcre2-jit-to-linux-on-ibmz-s390x

Systems access is available through the LinuxONE Community Cloud at Marist https://linuxone.cloud.marist.edu/#/register?flag=VM

This issue is tied to a feature request to the Exim community for PCRE2 support. https://bugs.exim.org/show_bug.cgi?id=2635

edelsohn commented 4 years ago

@zherczeg Do you know of someone in the PCRE2 JIT community who would like to work on this issue?

zherczeg commented 4 years ago

Perhaps @carenas might be interested. As for me, I work at a university that has many industrial collaborations; that could be an option, although it would be pricier.

carenas commented 4 years ago

the development branch s390x shows (only in the cloud though, not qemu) some encouraging progress :

$ bin/regex_test
Pass -v to enable verbose, -s to disable this hint.

REGEX tests: all tests are PASSED on s390x 64bit (big endian + unaligned)

pcre2 itself fails (even without JIT) and will need work functionality-wise (pcre2test segfaults with anything that touches invalid UTF); for example, not including a commented-out test that would segfault in the JIT part:

Successful test ratio: 50% (330 failed)
Invalid UTF8 successful test ratio: 0% (129 failed)

and as zherczeg pointed out will likely need a significant amount of work to get close to that performance objective

edelsohn commented 4 years ago

I can't access the branch created by carenas. Is that the same as the branch from linux-on-ibm-z?

How much effort do you estimate to make PCRE2 functional?

How much effort to optimize it for the platform?

What amount of bounty would make it interesting, probably split into sub-milestones?

carenas commented 4 years ago

I can't access the branch created by carenas. Is that the same as the branch from linux-on-ibm-z?

my bad, fixed the link. it is mostly the linux-on-ibm-z code but with a few fixes on top to fight the bitrot so it will build on top of the current tree for further development.

How much effort

it is too early to know but the original code didn't support an FPU and will need to have put_label support added before it can work with current PCRE2 (including bad UTF8), so the sooner we can get it in a shape good enough for merging (even if it doesn't perform well) the better IMHO to make sure it will not bitrot further.

edelsohn commented 4 years ago

IBM can redirect the current bounty towards the basic enablement. We would appreciate guidance from the PCRE2 community on a plan and a sizing for the various steps to achieve basic enablement suitable for merging and then further optimization to achieve optimal performance.

carenas commented 4 years ago

@edelsohn: is instruction cache coherency an issue with Z? The original code went to great lengths to avoid clearing the instruction cache when code must be updated (ex: when jump addresses or constants were updated, and now when put_labels are updated), and even though it uses a somewhat independent pool to keep those values, the pool is still in the same memory segment as the rest of the code and is therefore likely to be considered for caching with the instructions, AFAIK.

on the positive side, while it doesn't yet fully work (because of put_labels), at least it doesn't segfault.

edelsohn commented 4 years ago

Thanks for the progress.

IBMz has a fairly strong memory consistency model, like x86. Are you experiencing unusual behavior or just checking? I'm not aware of other self-modifying IBMz code explicitly flushing / clearing caches.
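A minimal sketch of what the cache question is about, in C. The helper name `patch_code` is hypothetical and is not taken from sljit; `__builtin___clear_cache` is the GCC/Clang builtin commonly used after rewriting generated code. On targets where instruction fetch is coherent with data stores it typically expands to nothing, so whether the s390x port needs such a call is exactly what is being asked here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of sljit: overwrite `len` bytes of
   already-generated code at `dst`, then ask the compiler runtime to make
   the new bytes visible to instruction fetch. */
void patch_code(unsigned char *dst, const unsigned char *src, size_t len)
{
    memcpy(dst, src, len);

    /* GCC/Clang builtin taking the begin and end of the modified range.
       On targets with coherent instruction fetch it usually costs nothing. */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```

A JIT that patches a jump target or a constant in place would call something like this once per patched range instead of flushing the whole code buffer.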

carenas commented 4 years ago

just checking, and hoping to have a (likely incomplete) version that could pass all tests by end of the week.

carenas commented 4 years ago

building and running all tests successfully in :

https://travis-ci.com/github/carenas/sljit/builds/182588796

as well as when applied to a TRUNK version of PCRE as shown by (running in the IBM cloud) :

make  check-TESTS
make[2]: Entering directory '/home/linux1/src/pcre'
make[3]: Entering directory '/home/linux1/src/pcre'
PASS: pcre2_jit_test
PASS: RunTest
PASS: RunGrepTest
============================================================================
Testsuite summary for PCRE2 10.36-RC1
============================================================================
# TOTAL: 3
# PASS:  3

note that it is not enabled by default and it is mainly available to coordinate further development, as it is missing functionality which would trigger abort(), has no FPU or SIMD support, and has not been optimized or even profiled, so it might be even slower than the interpreter.

edelsohn commented 4 years ago

Awesome progress, @carenas ! Let us know when you and the community have analyzed the situation and have sizings for the next steps.

carenas commented 4 years ago

Let us know when you and the community have analyzed the situation and have sizings for the next steps.

@zherczeg: would the following make sense for sizing and next steps?

  1. get s390x branch developer ready so it is maintained in-tree to avoid further bitrot (ETA end of the week, will need review and while I agree it has some rough edges should be safe as it is disabled from autodetection and I am hoping to keep further development rebased and under CI), to hopefully aid testing and even contributions from the wider community (still hoping those IBM insiders bring back some of their very useful architecture knowledge, this way).
  2. get a newer version that completes the implementation needed to fully support s390x (FPU and vector operations, mainly what I expect will be needed, with a target CPU of Z13), this I am expecting to be about 40 man hours, and will also include a repository with patches to PCRE to be kept under CI
  3. tie up loose ends (most of the abort() calls should be gone by this point, but there are likely still some important TODOs and other work needed for performance, as well as support for anything that was not on the critical path, like a fallback for non-FPU or older Z, with a target of hopefully making it run in QEMU). This is likely where most of the optimization and performance work will be done, with a suggested option to include this code (still disabled for auto) in the next PCRE release to broaden the user base, recruit contributors and gather information to help guide future development, and to enable it in auto hopefully in PCRE + 1.

note that the original fork was maintained for almost 2 years but it might still be only 60% done, and there is also the need to learn the architecture (which, by being proprietary, makes things slightly more difficult), but I am optimistic it can be done thanks to the cloud availability and a good starting base, which is why I focused last week on making sure the previous investment is not wasted.

zherczeg commented 4 years ago

I would like to avoid what happened with the TileGX port. It was never completed and has no maintainers, so it would be best to remove it. As for a new port, I would prefer to have a cross compiler, qemu, and gdb tools which I can use to build and test the new port, and also some CPU / ABI documentation, since I need to learn it at some level. I am curious what "proprietary" exactly means here. No public documentation available? No free tools available?

edelsohn commented 4 years ago

Yes, s390x is proprietary, compared with, say, RISC-V, but not compared with Intel x86_64. Proprietary doesn't mean secret. GCC, LLVM, PyPy (including NumPyPy), node.js, OpenJDK, and Mono all are ported to s390x. There is plenty of documentation.

z/Architecture ISA

Linux on z ABI

QEMU for s390x

You can request access to a system (VM/VPS) through the LinuxONE Community cloud that I mentioned in the first message.

The LinuxONE Community Cloud also hosts Travis-CI instances, as Carlo used.

IBM s390x experts are available to answer questions.

40 hours seems a reasonable estimate. We can continue to discuss how to set up incremental milestones and associated bounties.

carenas commented 4 years ago

I would prefer to have a cross compiler, qemu, gdb tools which I can use to build and test the new port and also some cpu / abi documentation since I need to learn it at some level.

both gcc and clang can cross-compile code for s390x (and indeed I have both set up in the CI in 2 versions of Ubuntu), and I was originally using the standard gcc cross-compiler package from Ubuntu 20.04 to build the code and develop it further, so unlike TileGX this is considerably more open (with CI / cloud VMs available and several Linux distributions to use on real hardware)

qemu-user targets s390x and does work, but it fails tests either because the instructions we are using are not correctly emulated in the default CPU it uses, or because the original code targets z12 and the implementation differs enough. running with an emulated z12 or z13 might help but using a real VM is easier.

which is why I mention CI and was hoping in the third step to broaden cpu support, which will also help real users with older hardware, even if initially the target will be z13 (which includes vector support).

in that context, my "proprietary" label meant we will need to rely for now on the LinuxONE cloud for development, but unlike TileGX that is likely to get better later.

zherczeg commented 4 years ago

I have checked the documentation. At first glance the system design matches sljit well: it uses two's-complement number format, the ABI is similar to PowerPC, and it has condition codes, wide multiply, and IEEE single / double precision floating point. It also has an incredible number of instruction forms, even more than ARM-T2, probably because it is an old architecture. I have one more question: do we need to support EBCDIC? PCRE2 has many ASCII / UTF related optimizations which will not work with it.

edelsohn commented 4 years ago

Linux on IBMz is ASCII and we only are asking to port PCRE2 JIT for Linux, not to z/OS in EBCDIC.

z14 added some additional SIMD instructions, so you might want to target that instead of z13, if it's beneficial.

carenas commented 4 years ago

z14 added some additional SIMD instructions, so you might want to target that instead of z13, if it's beneficial.

is z14 what is being used by the Linux ONE cloud provided vm? the travis containers are running older than usual Ubuntu versions as well so I was suspecting they might be constrained by the CPU version there as well.

note the main concern here is being able to maintain the code base moving forward, which is why QEMU was ideal (since it can run locally on each developer's workstation, and we have a lot of different OSes to support there, including one person with z/OS who is likely to benefit from this port if done in a compatible way). Of course CI and remote access to a VM on native hardware is a good enough substitute, but it is less scalable, which is why I was hoping it would only be needed through the initial bootstrap.

edelsohn commented 4 years ago

I believe that the LinuxONE systems (hosted at Marist College) now are z15. Mainframe processors are not available in laptops, sorry. I understand the appeal, but I would recommend relying on the remote access over QEMU.

I am not certain what OS levels are available through Travis CI on s390x.

carenas commented 4 years ago

looked at the qemu side, and definitely we are unlikely to get that going regardless of how much we tweak the code generator CPU support; FWIW even qemu system (running fedora rawhide for s390x) will segfault with "interruption code 0010 ilc:3" :

[Screenshot taken 2020-09-04 at 7:32 PM]

zherczeg commented 4 years ago

I think this issue should not be closed

edelsohn commented 4 years ago

How is the enablement work proceeding?

carenas commented 4 years ago

How is the enablement work proceeding?

slower than planned, mainly due to my poor planning, but on the right track after the first phase was completed (with a lot more changes and still a few more reservations than were originally expected).

zherczeg commented 4 years ago

I have started to do some fixes but I need to learn a lot more before I can do them in a way I want and my time on voluntary work is quite limited.

edelsohn commented 4 years ago

@carenas Any updates about this project?

carenas commented 4 years ago

phase 1 is included in the RC1 for PCRE 10.36, which means that with the right setup we could now get more people working in parallel to cleanup and complete the implementation to be ready for end users.

after merging phase 1 it was clear that my original plan was going to incur too much tech debt, so it is reasonable to expect that phase 2 (getting vector operation support) and phase 3 (cleaning up old tech debt) will benefit from wider distribution and therefore more hands than I was originally expecting to have (only me)

this obviously doesn't qualify for the bounty terms (which, in all fairness, I have to admit I was uncomfortable with after pulling in so much volunteer work reviewing phase 1), but I am still committed to getting this out (hopefully with a little more help, and even if that means I will have to "subcontract" that work).

apologies for the delays, but I am hoping at the end the PCRE release that will be enabled for user support with JIT in s390x will be then of better quality.

it is important to note that most of the work I have been doing (in PCRE, not in sljit) was to add FPU support for phase 2 in the same way it is done for the other architectures, but my concern is that it might be too hacky and therefore increase tech debt unnecessarily, even if it is possible (I was hoping to get mostly vector instructions in, while leaving everything else untouched, as was planned originally for phase 1). I am also concerned that, with a third implementation being added, it might make sense to refactor the other 2 by moving code from PCRE into sljit as well, which will obviously be a bigger undertaking than planned.

edelsohn commented 4 years ago

@carenas Let's figure out how to solve this. There are multiple options:

  1. IBM can increase the bounty, within reason, if the project is larger than originally estimated.
  2. Bounties can be split among multiple people in any proportion.
  3. IBM can re-arrange the bounties into multiple parts associated with incremental milestones.

I'm a little concerned with the architecture redesign because IBM would like to have some PCRE JIT available on IBMz sooner rather than later. We can collaborate on a redesign as a second phase.

carenas commented 4 years ago

@zherczeg for fairness sake could we figure out what would be required for you to implement phase2 (without doing refactoring, and since it will be easier to follow your current pattern) and how big do we need phase3 to be to make sure everything that is currently in TODO/FIXME gets fixed within reason?

it is obviously too late for 10.36, but I could maintain a semi-stable experimental patchset on top of it to aid anyone interested in backporting/testing this feature in that release, until we can hopefully get everything cleaned up and released for user consumption with 10.37.

zherczeg commented 4 years ago

If I understand correctly, phase 2 means getting vector operation support. In PCRE2, there are SIMD functions which can search for characters or character pairs. Each of them has a corresponding HAS macro, so you don't need to implement all of them, only those you want. It is true that SIMD registers are often the "same" registers as the FPU ones, but these two things have nothing to do with each other. On 32 bit ARM, for example, it is bad practice for FPU and SIMD instructions to affect the same registers, because internally they are different registers and the CPU has to copy things. The reason I never attempted to do a vector instruction set in sljit is that these instruction sets have many specialized instructions, and their approaches (concepts) for handling things are surprisingly different. Hence each SIMD accelerated function, while doing the same thing from the perspective of PCRE2, may work quite differently.
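The opt-in pattern described above could look roughly like this in plain C. All names here (`EXAMPLE_HAS_FAST_FIND_CHAR`, `fast_find_char`, `scalar_find_char`, `find_char`) are made up for illustration. The real PCRE2 JIT emits machine code through sljit instead of calling C helpers, so this only shows the shape of a per-function feature macro, assuming a port defines the macro only for the routines it actually implements.

```c
#include <stddef.h>

/* Hypothetical feature macro: a port would define it only when it
   provides the accelerated routine. It is left undefined here, so the
   portable fallback is selected. */
/* #define EXAMPLE_HAS_FAST_FIND_CHAR 1 */

#ifdef EXAMPLE_HAS_FAST_FIND_CHAR
/* Would be supplied by the architecture-specific part of the port. */
size_t fast_find_char(const unsigned char *s, size_t len, unsigned char c);
#endif

/* Portable fallback: index of the first occurrence of c, or len if absent. */
size_t scalar_find_char(const unsigned char *s, size_t len, unsigned char c)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (s[i] == c)
            return i;
    }
    return len;
}

/* Callers use one entry point; the macro decides which body is compiled. */
size_t find_char(const unsigned char *s, size_t len, unsigned char c)
{
#ifdef EXAMPLE_HAS_FAST_FIND_CHAR
    return fast_find_char(s, len, c);
#else
    return scalar_find_char(s, len, c);
#endif
}
```

With this shape, a new port can start with no accelerated routines at all and add them one at a time, which matches the "only those you want" point.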

I would like to know something: how much work from my side is needed for this project? If the effort is bigger, I would like to have a formal contract, preferably with the university where I am working.

edelsohn commented 4 years ago

IBM would like PCRE on Linux on Z to achieve feature and optimization parity with other architectures, such as x86, ARM and Power. I'm confused if the FPU and SIMD vector support are enabled on other architectures or this is new functionality. Or you want to engineer it in a different manner on Z. Or you want to use this as an opportunity to redesign the support.

Also, I would prefer to avoid a university contract because that introduces a huge amount of bureaucracy and delays. I have been able to flexibly work with many other Open Source projects through bounties, from LLVM to PyPy to OpenBLAS to OpenCV to VLC to Sleef with developers in a variety of continents. I hope that we can make progress without undue complexity.

zherczeg commented 4 years ago

SIMD is currently used on x86 and aarch64. FPU is supported as an sljit feature, but PCRE2 does not use it.

edelsohn commented 4 years ago

IBM would like sljit for PCRE2 to be enabled and optimized on Linux on Z, equivalent to x86 and AArch64. And, ideally, SIMD optimization, equivalent to x86 and AArch64.

I am unclear if @carenas is suggesting a re-engineering of sljit as part of the implementation. Why can't sljit for Linux on Z be implemented in a manner equivalent to x86 and AArch64? I thought that was the proposal.

carenas commented 4 years ago

could we just jump into a meeting to get all our ideas clarified more effectively? I am available anytime you need me, and could set up a Google meeting if given the appropriate IDs

edelsohn commented 4 years ago

@carenas Who are you proposing for the meeting? You and I? Or @zherczeg as well? I'm available Friday after 11:00 ET, but that may be too late for you and Zoltan.

carenas commented 4 years ago

perfect for me (I am PST), but might be too late for Zoltan who is AFAIK somewhere in Europe. @zherczeg do you have a suggestion, should we include Philip for the PCRE part?

@edelsohn I know you are very busy but if that time doesn't work for Zoltan I am also available for a 1-on-1 which I think it is long overdue anyway at the time of your convenience and based on the availability below.

hopefully it wouldn't take more than 30min; to ease coordination of the time I have set up the following: http://whenisgood.net/bka3b5y

zherczeg commented 3 years ago

I don't think Philip is needed unless you want to touch code outside of jit. I am in CET time zone, and a call starting after 9.30pm on Friday (which seems 3.30pm in ET and 12.30pm in PST) could work for me. I would prefer a service which works in a browser under Linux, and no registration is needed to join. But you can have a call without me of course.

carenas commented 3 years ago

I am in CET time zone, and a call starting after 9.30pm on Friday (which seems 3.30pm in ET and 12.30pm in PST) could work for me.

sadly I already have a conflict around that time that I won't be able to reschedule (which is why my proposed times in the "whenisgood" link above started at 3PM PST on Friday, which I now realize was pushing you into Saturday because of the time differences).

my earlier available hours don't work with @edelsohn's constraint of "after 11AM his time", but I could do anytime before 11AM PST in case that could be resolved (even though I think we are too close for comfort and might be better served by a later schedule)

I have plenty of time during the weekend though, or we could push it a few more days until the beginning of next week, which would also allow us some more time to collaboratively come up with some minutes to make this more effective.

I would prefer a service which works in a browser under Linux, and no registration is needed to join.

@zherczeg: could you host such a service? All the ones I can think of might require some sort of user account or a proprietary solution (ex: Slack). If using Slack with Google accounts is good enough, I can provide a Slack channel, which could also be used long term to make sure there are no more misunderstandings: https://join.slack.com/share/zt-jfs6uef4-Ojeu02hll4EL0dXFhULgmA

But you can have a call without me of course

my one-on-one with David would be mostly to make sure that all the misunderstandings between the two of us are resolved earlier and in preparation with talking with you, the same also applies if you would like to have a one-on-one at a time when David might not be available.

I am afraid though that without your participation there is no way to solve the current impasse I might have gotten us into, and for that I apologize.

edelsohn commented 3 years ago

@carenas

You wrote:

I am available anytime you need me to,

I wrote:

I'm available Friday after 11:00 ET, but that may be too late for you and Zoltan.

I didn't write every day after 11:00 ET. You said any time and I proposed the first time available. Apparently you are not available any time. Please be precise in what you write.

I am available other days after 8:00 AM ET, but you did not provide those times in whenisgood.

I also do not understand what is so complicated about the proposed project that we need to talk in person. IBM wants sljit to function in PCRE2 on Linux on Z with equivalent functionality to x86 and AArch64 - at least integer and, ideally, SIMD. Presumably the Z support can use the same design and infrastructure as existing architectures.

zherczeg commented 3 years ago

It seems the main question is quality. For example:

These are all quality questions. The code can work without them, and you may consider that "equivalent functionality". Probably these can be discussed without a call, but it would be good to know IBM's quality targets.

I read the terms of use of the bounty provider, and it seems it does not handle taxes. This looks like a big difficulty for me.

carenas commented 3 years ago

I think I can set up a video meeting with the required constraints using Jitsi Meet (video encouraged but not required)

https://meet.jit.si/pcre2-linux-s390x

could you both be available for 1h (hopefully it will take less) around Mon Nov 16th noon ET (AKA EST/GMT-5/UTC-5)? The whenisgood UX might be a little confusing so I will avoid it this time, but I am available for 2h around that time, and hopefully that fits everyone's constraints or can be adjusted easily.

From my own experience (and I understand the frustrations) getting "together" to meet and understand each other's motivations could go a long way towards resolving difficult issues and finding a common ground that benefit us all with compassion.

Agenda would be (open to further adjustments and not to be followed too strictly) :

Discussion will be done in English and I will share the minutes afterwards for revision (among participants, which might require a Google account to allow for collaborative editing) and, once they are agreed to be accurate, publish them here (in text) for the rest of the community.

edelsohn commented 3 years ago

I can meet on Monday, Nov 16, at 12n EST.

zherczeg commented 3 years ago

That is 6pm for me, I can join for half an hour.

carenas commented 3 years ago

That is 6pm for me, I can join for half an hour.

if we move it earlier or later, could we get a full 1h? (My hope is we won't really need a full hour.) Otherwise I will adapt the agenda for a shorter timeframe, but I am hard to understand when I speak too fast

thanks both for your help and understanding, below an alternative bluejeans which might be more robust as a fallback:

Meeting URL https://bluejeans.com/313712622?src=join_info

Meeting ID 313 712 622

Want to dial in from a phone?

Dial one of the following numbers: +1.408.419.1715 (United States(San Jose)) +1.408.915.6290 (United States(San Jose)) (see all numbers - https://www.bluejeans.com/numbers)

Enter the meeting ID and passcode followed by #

Connecting from a room system? Dial: bjn.vc or 199.48.152.152 and enter your meeting ID & passcode

and that we could use as a backup if we have logistical problems with Jitsi and to avoid wasting precious time

edelsohn commented 3 years ago

I never saw any Jitsi invitation. I can join via Bluejeans.

carenas commented 3 years ago

I never saw any Jitsi invitation. I can join via Bluejeans.

Jitsi doesn't do invitations AFAIK, but I was going to create the meeting and share the link at that time; agree Bluejeans is a nicer solution though and can manage the calendar but I am not sure if it will work for Zoltan (hence why I would like to keep it as a backup until we confirm otherwise)

sent you an invite to the email associated with your github so you can manage your calendar, anyway

zherczeg commented 3 years ago

Maybe I can be there a bit longer. Jitsi seems like a good solution.

edelsohn commented 3 years ago

And share the link in Github comments? This really isn't the right medium for an interactive conversation to schedule a meeting, nor to share links to a meeting.

zherczeg commented 3 years ago

I have tried to make some minor improvements in the code, so I tried to log in to the virtual machine I created. However I got an "An unknown error has occurred,Please try again later." error. After a few days I still got the same error, so I decided to delete the VM and create a new one (the VM quota is 1). However, when I try to create or upload an ssh key for the new VM I get the same error. Since this is the only service where we can test the code, we probably need to wait until they fix it.

edelsohn commented 3 years ago

There is no known, system-wide problem with the VMs at Marist. Have you reported the problem through the support system?

zherczeg commented 3 years ago

Is there an easy way to insert a breakpoint instruction on s390? I tried svc and trap but no success so far. I could probably call a function as the worst case, but that is not the nicest solution.
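One option that may be worth trying, offered as an assumption rather than a verified answer: as far as I know, GDB plants the two-byte opcode 0x0001 as its software breakpoint on s390/s390x, so emitting that halfword into the generated code should stop under a debugger (and most likely raise SIGILL without one). The sketch below only shows writing the halfword into a code buffer; `emit_breakpoint`, `S390_BREAK_OPCODE` and the parameters are hypothetical names, not sljit API.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 0x0001 is the two-byte sequence GDB uses as a software
   breakpoint on s390/s390x. Check against your debugger before relying
   on it. */
#define S390_BREAK_OPCODE 0x0001u

/* Hypothetical emitter: appends the halfword in big-endian byte order
   (s390x is big endian) at `offset` and returns the new offset. If fewer
   than two bytes remain, nothing is written and `offset` is returned
   unchanged. */
size_t emit_breakpoint(uint8_t *code, size_t offset, size_t capacity)
{
    if (capacity < 2 || offset > capacity - 2)
        return offset;

    code[offset] = (uint8_t)(S390_BREAK_OPCODE >> 8);
    code[offset + 1] = (uint8_t)(S390_BREAK_OPCODE & 0xffu);
    return offset + 2;
}
```

Because the opcode is only two bytes long, it can be placed at any instruction boundary in the generated stream without disturbing the alignment of what follows.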