yahoo / Oak

A Scalable Concurrent Key-Value Map for Big Data Analytics
Apache License 2.0

buffer overlap and fragmentation due to race condition in Block.allocate #147

Closed: thiyanesh closed this 3 years ago

thiyanesh commented 3 years ago

Fail eagerly on allocated size comparison to prevent overlap and fragmentation due to race condition.

I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.

Eshcar commented 3 years ago

Thanks @thiyanesh for this PR

I recommend breaking it into several PRs. It can easily be broken into 3 PRs:

  1. allocate race condition
  2. changes to blocks pool - better practice in allocating a singleton
  3. adding final to potentially improve performance and avoid changes to parameters

The main issue here is the race condition, which is the result of non-atomic (non-transactional) increment and decrement operations on allocated, the counter that captures the next location to allocate in a buffer.

If the allocate method is indeed called in non-exclusive mode, there is the potential for the race you describe. But I am not sure a (costly) synchronized block, or another complicated solution, is necessary. This race exists only because of the non-monotonic behavior of allocated, which is decreased only after exceeding the threshold. In a normal workload we can expect this to happen only once, when the block is almost full. So a possible solution would be to have two monotonic counters. The allocate method would be as follows
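The code from the original comment is not preserved in this thread. As a hedged illustration only (class and field names are hypothetical, not Oak's actual code), the two-monotonic-counters idea could look roughly like this:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the two-monotonic-counters idea. `requested` reserves offsets
// and is never decremented, so it may grow past capacity; `allocated`
// counts only successful allocations, so the value exposed to callers
// never exceeds the block capacity.
class TwoCounterBlock {
    private final int capacity;
    private final AtomicInteger requested = new AtomicInteger(0);
    private final AtomicInteger allocated = new AtomicInteger(0);

    TwoCounterBlock(int capacity) { this.capacity = capacity; }

    int allocate(int size) {
        int offset = requested.getAndAdd(size); // monotonic: never rolled back
        if (offset + size > capacity) {
            // Reject without decrementing, so there is no window in which
            // two threads can be handed overlapping offsets.
            throw new OutOfMemoryError("block is full");
        }
        allocated.getAndAdd(size); // only successful requests are counted
        return offset;             // start offset of the allocated slice
    }

    int allocated() { return allocated.get(); }
}
```

Because neither counter ever decreases, the overlap described in this issue (a decrement racing with a concurrent increment) cannot occur.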

thiyanesh commented 3 years ago

Thank you for the detailed feedback. I will update the PR with the Block.allocate change alone. Also, I will try to solve it using only the one existing atomic variable.

thiyanesh commented 3 years ago

@Eshcar , if this PR is fine, please squash-merge it to avoid unnecessary commit history from the other changes. Or do you recommend closing this PR and creating a new PR with this change?

sanastas commented 3 years ago

Hi @thiyanesh !

I am so happy you opened this PR and found this real problem! Thank you! I am Anastasia, the lead of the Oak project, nice to e-meet you!

The scenario you present is real and should definitely be fixed. You have a great eye for concurrency problems, and it would be a pleasure to continue collaborating!

I am not keen on any locks or synchronized blocks being added. I am also not keen on the compare-and-swap loop, because fetch-and-add still has better performance. Previously AtomicInteger.getAndAdd() (which delegates to unsafe.getAndAddInt()) was implemented with fetch-and-add, and I now see that it indeed turns into a compare-and-swap loop in Java 8. However, implementations can differ, and I wouldn't straightforwardly use compareAndSet().

What I would suggest as a solution is, first of all, to remove the decrement of allocated once the request goes above capacity, and to throw the exception. Looking at the usage of Block.allocate() (in NativeMemoryAllocator), we can see that the exception is simply caught and, if the request is not inherently problematic (i.e. larger than the block size in general), the allocation request is redirected to the next block.

In order not to increase allocated too much, I would add a check if (allocated.get() + size > this.capacity) just before the allocated.getAndAdd(size) to decrease the probability of an unnecessary increase.
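As a hedged sketch of the two suggestions above (no rollback on overflow, plus the cheap non-atomic pre-check), with illustrative names rather than Oak's actual fields:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: a single fetch-and-add with no rollback on failure, plus a
// deliberately non-atomic pre-check that filters out most requests that
// clearly do not fit, before the counter is bumped.
class NoRollbackBlock {
    private final int capacity;
    private final AtomicInteger allocated = new AtomicInteger(0);

    NoRollbackBlock(int capacity) { this.capacity = capacity; }

    int allocate(int size) {
        // Non-atomic pre-check: only lowers the probability that
        // `allocated` is pushed needlessly far above capacity.
        if (allocated.get() + size > capacity) {
            throw new OutOfMemoryError("block is full");
        }
        int offset = allocated.getAndAdd(size); // single fetch-and-add
        if (offset + size > capacity) {
            // No decrement here: the counter may end up above capacity,
            // but handed-out offsets can never overlap. The caller is
            // expected to catch this and move on to the next block.
            throw new OutOfMemoryError("block is full");
        }
        return offset;
    }
}
```

The good path stays a single fetch-and-add; the price is that the counter's raw value can drift above capacity, which is the accounting concern discussed below.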

What do you think @thiyanesh ? I would very much like to hear your thoughts! If you really want to proceed with the compareAndSet loop, that is also a possibility. Also, please keep this PR for dealing with this problem only, and go ahead and open other PRs with all the other improvements you suggested before. Looking forward!

Thanks, Anastasia

thiyanesh commented 3 years ago

@sanastas, Thank you for the detailed feedback.

What I would suggest as a solution is first of all to remove the decreasing of allocated once the request is above capacity and to throw the exception. Looking on Block.allocate() usage (in NativeMemoryAllocator) we can see that the exception is just caught and if the request is not problematic (i.e. above the block size in general), allocation request is redirected to next block.

True, this can be done if the following scenario is acceptable:

  1. T1 requests a bigger size
  2. T2 and T3 request valid sizes (their combined size will also be valid)
  3. Depending on the order of context switches in the concurrently executing T1, T2, T3, the complete block offset (corresponding ByteBuffer) might be left unused unless the whole block is released later.

If there will be more contention and it is OK to trade off space for better time, then we can use the approach you suggested (throw an exception on the first overflow, using 2 counters to simulate completion and throw exceptions). If there will be less contention, compareAndSet might be fine, as there will be minimal looping.
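For illustration, a compareAndSet-loop version of allocate (hypothetical names, not the actual PR code) could look like this. The loop never moves the counter past capacity, at the cost of possible retries under contention:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the compareAndSet alternative: re-read and retry until either
// the reservation succeeds or the request clearly does not fit. Under low
// contention the loop rarely iterates more than once.
class CasLoopBlock {
    private final int capacity;
    private final AtomicInteger allocated = new AtomicInteger(0);

    CasLoopBlock(int capacity) { this.capacity = capacity; }

    int allocate(int size) {
        while (true) {
            int current = allocated.get();
            if (current + size > capacity) {
                // `allocated` is never advanced past capacity, so the
                // exposed value always stays in a valid range.
                throw new OutOfMemoryError("block is full");
            }
            if (allocated.compareAndSet(current, current + size)) {
                return current; // reservation succeeded at this offset
            }
            // Another thread won the race; retry with the fresh value.
        }
    }
}
```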

In order not to increase allocated too much, I would add a check if (allocated.get() + size > this.capacity) just before the allocated.getAndAdd(size) to decrease the probability of unnecessary increase.

As per my understanding, this would not guarantee the operation is atomic. Further, since allocated is maintained as an AtomicInteger and not as an AtomicLong, there can be different side effects of overflow (negative values) which can result in inconsistencies in the long allocated() { return allocated.get(); } response.

Please do share your comments about this (or correct me if I misunderstood your comment).

sanastas commented 3 years ago

@thiyanesh , good to hear from you again!

True, this can be done if the following scenario is acceptable

T1 requests a bigger size. T2 and T3 request valid sizes (their combined size will also be valid). Depending on the order of context switches in the concurrently executing T1, T2, T3, the complete block offset (corresponding ByteBuffer) might be left unused unless the whole block is released later.

Do you mean T1 requests a size (a single slice) bigger than a block? The way I see it, T1 will fail the pre-check (whether T2 and T3 proceed or not) and then fall into the exception at the NativeMemoryAllocator level, while T2 and T3 can continue to allocate from the same block. So no unused block in this scenario.

If you meant that T1's size is too big when summed with T2's and/or T3's sizes, then yes, there might be a big portion of the block left unused. I believe this should be a very rare scenario, as our blocks are quite big and the concurrency would need to be very specific. However, truly, we planned to create a slice from this unused space and add it to the free list (of NativeMemoryAllocator) so it can be used by others. But I didn't want to request a complicated solution from you, unless you want to write it. The issue of transferring such unused space to the free list can be fixed later in another PR. Or here, if you like.

If there will be more contention and it is OK to trade off space for better time, then we can use the approach you suggested.

I do not fully understand this comment. I think the solution I explained above is good both in space and in time.

(throw an exception on the first overflow, using 2 counters to simulate completion and throw exceptions)

I didn't suggest using two counters, though that is a possibility of course. I do not see a big reason to use two (expensive) fetch-and-add instructions instead of one. Why do you prefer 2 counters? Probably I am missing something.

If there will be less contention, compareAndSet might be fine, as there will be minimal looping.

You are right that a compareAndSet loop is almost the same as fetch-and-add when there is little to no contention. But to put it the other way around: fetch-and-add is almost the same as compareAndSet under no contention, and faster under contention. So doesn't that make fetch-and-add preferable?

As per my understanding, this would not guarantee the operation to be atomic.

You are right. The check I suggested is for the cases with no concurrency, just to decrease the possibility of an unnecessary increase of allocated above the capacity. The check is not atomic.

Further, since allocated is maintained as an AtomicInteger and not as an AtomicLong, there can be different side effects of overflow (negative values) which can result in inconsistencies in the long allocated() { return allocated.get(); } response.

Very good point! As of now, a block's capacity cannot be above 2GB (a ByteBuffer limitation, inherited from the past). Thus we can hardly get to this problem. But you are right: it can happen if we allocate a 2GB block and then go a little above capacity. If you would like to fix it, go ahead and define allocated to be an AtomicLong.

Looking forward to hearing from you! Please let me know what you think about my comments. Thanks!

thiyanesh commented 3 years ago

Do you mean T1 requests a size (single slice) bigger than a block?

As you have mentioned, this will result in failure of the pre-check. No, I didn't intend this.

If you meant that T1's size is bigger when summed with T2 and/or T3 sizes.

Yes, I meant this. But even ignoring concurrently executing threads, this will be an issue even for a sequential execution. So, as you have mentioned, this has to be handled independently for all scenarios.

I do not see a big reason to use two Fetch-and-Add (expensive) instructions instead of one. Why do you prefer 2 counters?

The primary reason for suggesting 2 counters with fetch-and-add was long allocated() { return allocated.get(); }: it might not be good to return an invalid value (either a negative/too-large value on overflow, or a value outside the integer range for 2GiB even if it is a long). I am not sure this can be achieved with a single fetch-and-add. With 2 counters, as mentioned by @Eshcar, the flow will be

  1. a supporting internal counter with getAndAdd: if its result is invalid (negative or above capacity), throw an exception and don't modify the allocated counter, so it stays in a valid state.
  2. the 2 counters simulate the possibility of crossing the limit without actually crossing the limit in the exposed state (allocated).

As for now, block capacity can not be above 2GB (ByteBuffer limitation, inherited from the past). Thus we can hardly get to this problem. But you are right if we allocate 2GB block and then go a little above capacity.

Yes, as you have mentioned, the following can simulate the behavior. With a 2GiB (Integer.MAX_VALUE) limit, using fetch-and-add can result in a negative value (as an integer) or an invalid bigger value (as a long):

  1. allocated = Integer.MAX_VALUE - 3
  2. 2 new requests arrive, each with size = 2
  3. any check (allocated.get() + size <= capacity) on these independently will succeed
  4. both of them will proceed to update allocated.getAndAdd(2)
  5. this will result in an invalid state of long allocated() { return allocated.get(); }
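The wrap-around in the steps above is easy to reproduce directly with an AtomicInteger:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Demonstration of the overflow in the steps above: starting at
// Integer.MAX_VALUE - 3, two unchecked getAndAdd(2) calls wrap the int
// counter around, so a later read reports a negative "allocated" value.
class OverflowDemo {
    static long simulate() {
        AtomicInteger allocated = new AtomicInteger(Integer.MAX_VALUE - 3);
        allocated.getAndAdd(2); // Integer.MAX_VALUE - 1, still positive
        allocated.getAndAdd(2); // wraps around to Integer.MIN_VALUE
        return allocated.get(); // widening to long happens too late
    }
}
```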

I am not sure whether the above descriptions convey the intent. The actual intent was

  1. always increment allocated only to valid states within the range of the capacity.
  2. ensure that long allocated() stays in line with a valid state.
  3. (in the existing master branch code with the subtraction) avoid an overlapping buffer state.

Personally, I believe that to satisfy all 3 of these conditions in a concurrent execution environment, we might need synchronized, a CAS loop, or 2 counters with fetch-and-add.

Please do share your feedback and decide whether this PR is relevant or whether this should be solved with a different approach.

sanastas commented 3 years ago

Hi @thiyanesh ,

Thank you for clarifying! I think I now understand your reason for using two counters and your concern about Block's method long allocated() { return allocated.get(); }.

In what I previously suggested there is always a possibility of a Block's allocated counter being above the capacity. So you see it as wrong output, which is very reasonable.

Please pay attention that Block's allocated() method is used only once, in the NativeMemoryAllocator class, just to check whether the block is allocated above its capacity. It is not used for true accounting of Oak's allocated memory. Inside the NativeMemoryAllocator class we have another AtomicLong allocated counter, which is indeed used for the true accounting of Oak's allocated memory.

Therefore, the fact that Block's allocated() method can return a value above capacity actually does no harm. However, your concern is absolutely legitimate, as the code looks like it is not working right. What do you think about just changing the name of Block's allocated() method (to something indicating an upper limit) and/or adding good comments around it?

(in the existing master branch code with the subtraction) avoid an overlapping buffer state.

I am not sure I understand the requirement above.

So I would still prefer no synchronized, no CAS loop, and no double F&A, because I do not want the good path to run slower in order to resolve corner cases that may rarely happen.

In my eyes the following solution is optimal:

  1. Rename Block's allocated() method and/or add a comment
  2. Make Block's allocated counter an AtomicLong
  3. Remove the decrease of the allocated counter when going above capacity
  4. [Optional] In order not to increase allocated in sequential execution, add a check if (allocated.get() + size > this.capacity), although it is not atomic.
  5. [Very optional] Add the remaining unused part of the block to the free list of NativeMemoryAllocator

Or any other solution that doesn't impact the normal/good path execution that you want to suggest. What do you think? Does it make sense to you?
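Putting points 2-4 of this proposal together (with point 1 reflected in the renamed accessor), a hedged sketch with illustrative names, not the actual Oak code, might look like:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the proposed solution: an AtomicLong counter that is never
// decremented, a non-atomic fast-path check, and an accessor renamed to
// make clear it is only an upper bound on the allocated space.
class BlockSketch {
    private final long capacity;
    // AtomicLong: even a ~2GB block pushed slightly above capacity
    // cannot overflow into a negative value.
    private final AtomicLong allocated = new AtomicLong(0);

    BlockSketch(long capacity) { this.capacity = capacity; }

    int allocate(int size) {
        // Optional, non-atomic pre-check: just reduces the chance of an
        // unnecessary increase in sequential execution.
        if (allocated.get() + size > capacity) {
            throw new OutOfMemoryError("block is full");
        }
        long offset = allocated.getAndAdd(size); // single fetch-and-add
        if (offset + size > capacity) {
            // No decrement: concurrent allocations can never be handed
            // overlapping offsets; the counter may exceed capacity.
            throw new OutOfMemoryError("block is full");
        }
        return (int) offset;
    }

    // Renamed from allocated(): the value can exceed capacity, so it is
    // an upper bound, not exact accounting.
    long allocatedUpperBound() { return allocated.get(); }
}
```

The good path stays a single fetch-and-add, so normal allocations pay no extra synchronization cost.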

Thank you for taking care of Oak!!!

sanastas commented 3 years ago

@thiyanesh , anything new? It would be great to hear from you! I hope your ideas will land in Oak's main code. Looking forward!

thiyanesh commented 3 years ago

@sanastas , please find the updated PR as per the discussions.

sanastas commented 3 years ago

@thiyanesh ,

Thank you! It all looks good. I ran mvn clean package and it passes. I will merge your PR.