zrlio / disni

DiSNI: Direct Storage and Networking Interface
Apache License 2.0
186 stars 66 forks source link

Removed shared maps. #10

Closed asqasq closed 7 years ago

asqasq commented 7 years ago

Removing the maps, which are shared by all processing cores in the fast path, increases multicore performance significantly.

Please have a look at the proposed change.

A simple test with 8 processing server cores in the DaRPC benchmark show this improvement:

Original (8 server cores) Statistics: 1149839.26 IOPS, diff: 5751496, time diff(ms): 5002 Statistics: 1148448.51 IOPS, diff: 5743391, time diff(ms): 5001 Statistics: 1151963.01 IOPS, diff: 5760967, time diff(ms): 5001 Statistics: 1126356.40 IOPS, diff: 5631782, time diff(ms): 5000 Statistics: 1126461.31 IOPS, diff: 5633433, time diff(ms): 5001 Statistics: 1128071.59 IOPS, diff: 5641486, time diff(ms): 5001 Statistics: 1126013.60 IOPS, diff: 5630068, time diff(ms): 5000 Statistics: 1126457.00 IOPS, diff: 5632285, time diff(ms): 5000 Statistics: 1126983.20 IOPS, diff: 5636043, time diff(ms): 5001

Modified (8 server cores) Statistics: 2390920.62 IOPS, diff: 11956994, time diff(ms): 5001 Statistics: 2392582.08 IOPS, diff: 11965303, time diff(ms): 5001 Statistics: 2387272.00 IOPS, diff: 11936360, time diff(ms): 5000 Statistics: 2389885.22 IOPS, diff: 11951816, time diff(ms): 5001 Statistics: 2392866.00 IOPS, diff: 11964330, time diff(ms): 5000 Statistics: 2386649.67 IOPS, diff: 11935635, time diff(ms): 5001 Statistics: 2387749.60 IOPS, diff: 11938748, time diff(ms): 5000 Statistics: 2387559.69 IOPS, diff: 11940186, time diff(ms): 5001 Statistics: 2383972.20 IOPS, diff: 11919861, time diff(ms): 5000 Statistics: 2385133.37 IOPS, diff: 11928052, time diff(ms): 5001

yuvaldeg commented 7 years ago

Hi Adrian, First of all, thank you for sharing your code!

I have been thinking on a solution for the issue at hand for a while now, and it is now evident that it is needed and has significant performance implications.

However, these maps are used to protect libdisni from wrong calls that may originate in the Java disni library. E.g., the Java disni library may try to access a stale resource and can call the JNI call with an invalid pointer. In this case, with your proposed change, the pointer will not be verified against the map, and will be accessed as is. This may result in an access violation, or if less fortunate, with a memory corruption.

I think that instead of removing these maps altogether, we have to think on another solution that will allow the necessary protection, but will still provide high performance fine-grain locking. I can suggest using some C++ implementation of Java's ConcurrentHashMap, that will allow mapping from obj_id to objects, but will not require any locking.

I'll of course let my colleagues comment as well.

Best regards, Yuval.

PepperJo commented 7 years ago

Hi Yuval,

I don't agree. We completely own the JNI interface and there is no way for a user of the library to inject anything that would result in an invalid pointer to be passed.

Regards, Jonas

patrickstuedi commented 7 years ago

Hi Yuval, Is one specific concern you have about the case where the QP has been closed but the DiSNI Java part by accident still issues an operation on that QP which causes the access violation? But that cause, a post send or recv call will just fail with error code, no?

What about trying to destroy a QP multiple times? Does libibverbs fail gracefully there? I think I do remember this case and it was problematic.

In principle, as Jonas pointed out -- we should be able to avoid those cases because we entirely own the layer on top. If, however, additional protection can be added at the JNI level without taking performance penalties, we should discsuss. Using lock free map like Java's ConcurrentHashMap I'm not sure that will work, even those maps use locks just more fine grained. Already we are using read/write locks, so I'm not sure there is anything we can do to keep the protection but avoid the locking overhead. @yuvaldeg can you be more specific about the maps you had in mind?

Cheers, Patrick

asqasq commented 7 years ago

Hi Yuval

Actually it is not only about locking. As Patrick pointed out, we are using read locks for read accesses, which, as such, should be ok.

The main point is that these maps are shared among cores and this causes interconnect transactions, which reduces performance a lot in a multicore environment, especially, because some of them are accessed frequently on the fast path.

I would also argue that DiSNI is an overall library containing of Java and C code. I would not see libdisni as a separate thing, which can be accessed from outside. libdisni should only be used by the Java part of DiSNI and if it passes invalid pointers, I would consider this as an internal bug, which we have to fix.

Cheers Adrian

yuvaldeg commented 7 years ago

Hi everyone,

Adrian, I fully agree with everything you said. We can and should assume that the Java DiSNI library will not do anything funny with libdisni, and that they should always be in sync. Also, I don't question the performance benefits that can be gained by avoiding these maps and locking - your results are evidence of how impactful these maps are.

However, Patrick just gave a few examples of how this may go wrong (with the current code). To be more specific: When ibv_destroy_qp/cq/() are called, the ibv_qp/cq/ structures are freed and are no longer valid. Any access to those will result in a hard error (e.g. access violation). These errors can easily happen if the Java application holds any references to that resource, e.g. a postSend call on a QP - calling it will then crash the process.

I think that this sort of behavior should be mitigated in some way or another. Today we have an inefficient mechanism with these maps, but we can definitely think of something better.

A few options on top of my head:

  1. Find some better container than those maps (e.g. lock-free non-blocking concurrent struct) - will probably be the easiest to implement and will have minimum code changes, but will definitely not yield the best performance, as we will still have cross-NUMA access to structures in the IO path.
  2. Add some proper access protection in the Java objects. E.g. Track state of each QP/CQ/* in its object, and block any operations when destroyed. That will require changing multiple objects such as the PostSend command so that it will pass through the QP object so the state can be checked. This state will have to be saved in some thread-safe variable as well, but it will be mostly local accesses and not across NUMA nodes like the locking on maps.

I'm inclined to go with option 2 obviously, as it will have the best performance, but it will require some major work in the Java code.

Thanks, Yuval.

PepperJo commented 7 years ago

Hi Yuval,

I agree option 2 is the option we should go with. We will have a look at the Java code and figure out what needs to be changed and create another pull request with the changes.

Regards, Jonas

yuvaldeg commented 7 years ago

Jonas, sounds good. Please let me know if I can help

Yuval

patrickstuedi commented 7 years ago

All,

I don't think the current code provides protection against stale QPs, even with the maps. Thread 1 may access the QP from the map, thread 2 may erase the QP from the map, thread 1 may then post a send or a recv.

If we want that level of protection we have to take the lock on the QP for calls like post/destroy. I'm happy to make a preliminary implementation of this and test the performance if that's what we want. But this will be a level of protection the C verbs API doesn't have. Correct me if I'm wrong..

-Patrick

asqasq commented 7 years ago

Hi all

Yes, that's correct, once the data structures of the QP or CQ are destroyed, the pointer must not be accessed anymore. I think, DiSNI (the Java part) should be implemented in a way that it does not access destroyed QPs or CQs.

I agree with Patrick, in the current code it is well possible that a race condition happens, where one thread is just about to post a send and the other one is erasing it/destroying the QP data structure. So this means that the upper layers have to be implemented in a way that they manage the resources, like QP or CQ allocation, and deallocation and also using the data structure, correctly.

In my opinion, option 1 won't work performance-wise. With the current code, I tried an experiment with the maps, but without any locks (of course this is not thread-safe, but for my experiments it worked fine). The performance was very poor (same as with locks), therefore we can conclude that the poor performance is indeed due the sharing of the maps across cores and not due to the locks/ This means that a concurrent map won't help.

Option 2 might lead to some additional security, but not to a 100%. The race condition, Patrick mentioned, will still be possible. Also, this additional checks in Java could always be circumvented, in worst case by using the lowest-level calls or even the native calls. But again, I think users of DiSNI should use the higher-level APIs not necessarily the lowest-level ones.

Let's see how the measurement of Patrick's preliminary implementation look like.

Cheers Adrian

PepperJo commented 7 years ago

All,

Patrick is right. I did not think about the race condition. This makes it harder to implement it in Java without performance penalty. One idea I had is to not let the user explicitly destroy QP/CQs etc. but e.g. use a finalizer to clean up the resources when they are collect by the GC. I think it is one of the few use-cases where a finalizer is sensible.

Cheers, Jonas

yuvaldeg commented 7 years ago

Hi all,

Re PostSend protection, the current code indeed does not protect from such issues, and protection from these kind of issues is always under debate whether they should be handled, or should the consumer be trusted.

One option for handling such issues, is to have a Semaphore on the QP/CQ/XX, that all operations will acquire and release on each usage (this is not a reference count, but rather a "use" count). When QP/CQ/XX is being destroyed, the Semaphore will be drained and further operations will be blocked and fail immediately. Only when the Semaphore is fully drained, then the resource will be destroyed. I think this kind of protection can be utilized for all of the protection needs we are planning to fulfill. Alternatively, this can be done with a lock as well, but performance will probably have a greater penalty, as a Semaphore is lock-free, but it may be worth trying as it will be significantly easier to implement.

Re finalizer, I'm a little concerned about such solution. The reason is that GC is very unpredictable, and we may end up with a large amount of resources staying open when they are expected not to, and in systems which does not reuse resources, this may become an issue.

Thanks, Yuval.

patrickstuedi commented 7 years ago

Hi Yuval,

Thanks for bringing the semaphore solution up, I think that's certainly one option and we've discussed similar options yesterday also with Bernard.

Right now I just wanted to see if we can move forward as follows: would you agree that question of whether to eliminate the maps as Adrian proposed -- which is what this pull request is about -- or not to do so, has basically nothing to do with the question of whether we want protection for stale QPs, CQs etc? Or in other word, eliminating the maps would not lower the existing protection level in any form, but only improve the performance.

In that case, wouldn't it make sense to accept the pull request at hand, and open a separate issue on the protection stuff? Personally I'm in favor of eliminating the maps, while evaluating options for better protection. But I consider those two things as orthogonal to each other.

Thanks, Patrick

patrickstuedi commented 7 years ago

I think the formulation of mine above is unfortunate and also incorrect. Maps of course provide an expensive best effort protection against stale QPs and probably in practice cover most of the scenarios, but they do not provide guarantees as such. I guess if we want to retain that same best effort property we could just add a boolean to the Java QP and CQ objects and set that to false once the resource is destroyed and prevent further send/recv calls. That would be a preliminary work around until we have a proper solution that provides guarantees... What you guys think?

yuvaldeg commented 7 years ago

Hey Patrick, Yes, I agree that we can start with a basic preliminary replacement for the maps in the form of a lock+boolean per QP/CQ/XX. That will give us an immediate performance value by removing the maps, and will still provide a comparable amount of protection.

Thanks, Yuval.

asqasq commented 7 years ago

Dear all

I implemented the boolean-based protection in the Java part of DiSNI (see the second commit).

The measurement look like this with the same setup:

Statistics: 2368929.23 IOPS, diff: 11849384, time diff(ms): 5002 Statistics: 2366066.59 IOPS, diff: 11832699, time diff(ms): 5001 Statistics: 2359708.06 IOPS, diff: 11800900, time diff(ms): 5001 Statistics: 2350056.60 IOPS, diff: 11750283, time diff(ms): 5000 Statistics: 2353003.40 IOPS, diff: 11767370, time diff(ms): 5001 Statistics: 2357717.40 IOPS, diff: 11788587, time diff(ms): 5000 Statistics: 2353972.21 IOPS, diff: 11772215, time diff(ms): 5001 Statistics: 2352053.79 IOPS, diff: 11762621, time diff(ms): 5001 Statistics: 2344629.40 IOPS, diff: 11723147, time diff(ms): 5000 Statistics: 2344026.59 IOPS, diff: 11722477, time diff(ms): 5001 Statistics: 2342004.80 IOPS, diff: 11710024, time diff(ms): 5000

The protection is comparable with the maps in the sense that we can catch most cases of access of destroyed objects, but race conditions may still occur.

Please have a look at the second commit.

Cheers Adrian

patrickstuedi commented 7 years ago

Hi Yuval,

What do you think? Are you ok with merging this?

-Patrick

yuvaldeg commented 7 years ago

Hi all, The code looks good to me. Adrian, thanks again for contributing!

I reviewed some of the code already and was planning to do some further testing with it, but I will need a few more days to get it done.

Thanks, Yuval.

patrickstuedi commented 7 years ago

Hi Yuval,

That's good, let us know if once you're done with the testing, if successful we can merge, if not let us know what issues you encountered.

-Patrick

yuvaldeg commented 7 years ago

Hi guys, I apologize for the delay. We ran some rigourus testing for many hours and everything appears to be running fine.

Adrian, will you be kind enough to resolve the conflicts so we can merge it?

Cheers, Yuval.

asqasq commented 7 years ago

Thanks, Yuval!

I resolved the conflicts and merged the pull request.

Cheers Adrian