Closed by tianyin 2 years ago
Could you inject the faults during the workload run? It seems to me that you inject the faults before the workload starts? That's the reason your leader crash does not even affect performance. A leader election is a window of unavailability, which should show up in performance.
Yeah, currently the code injects the failure before the workload is run, which is exactly what the depfast scripts we run do. But yes, ideally the failure should be injected while the workload is running. It can be implemented by having a parallel thread inject the failure after a set delay. Will add it to the to-do list.
But I suspect the results will still be the same, just like when the node crashes with a memory limit below 50MB. There will be a few failed requests due to the availability gap, of course, but the results will be almost the same.
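The parallel-thread approach mentioned above could be sketched roughly as follows (a sketch only; `workload` and `inject_fault` are hypothetical placeholders, not the actual slooo harness API):

```python
import threading

def run_with_fault(workload, inject_fault, delay_s):
    """Run `workload` while a background timer injects the fault
    after `delay_s` seconds, so the fault lands mid-run rather
    than before the run starts."""
    injector = threading.Timer(delay_s, inject_fault)
    injector.start()
    try:
        result = workload()
    finally:
        injector.cancel()  # no-op if the fault already fired
    return result
```

With this shape, the fault (e.g. a leader kill) fires while requests are in flight, so the election window is actually measured by the workload.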
For Q5 and Q6, the choice of 50MB does not look right. You are supposed to use a value that leads to a fail-slow behavior. The fact that 50MB >> the memory needed by the RethinkDB node means a bad choice.
The reason I used 50MB is to show that the node either crashes or works normally; there is nothing in between. Under any memory limit the node never slows down: it either crashes or runs normally. Not exactly sure what the issue is, but I'll try tighter limits like 5MB to see if there's a difference.
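One thing worth checking here (an assumption about the setup, since the thread does not say how the limit is applied): if the limit goes through cgroup v2, `memory.max` is a hard cap that invokes the OOM killer (fail-stop), while `memory.high` throttles the process under reclaim pressure (closer to fail-slow). Setting only the hard cap would explain the "crashes or works normally, nothing in between" behavior. A minimal sketch, parameterized on the cgroup directory so it can be pointed at a real mount:

```python
import os

def set_memory_limits(cgroup_dir, high_bytes, max_bytes):
    """Write cgroup v2 memory limits for the processes in `cgroup_dir`.
    memory.high throttles the workload (fail-slow behaviour);
    memory.max triggers the OOM killer (fail-stop).
    Requires root when `cgroup_dir` is a real cgroup mount."""
    with open(os.path.join(cgroup_dir, "memory.high"), "w") as f:
        f.write(str(high_bytes))
    with open(os.path.join(cgroup_dir, "memory.max"), "w") as f:
        f.write(str(max_bytes))
```

Note that cgroup v1 (`memory.limit_in_bytes`) only has the hard-kill semantics, so if the scripts use v1, a fail-slow memory fault would need a different mechanism entirely.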
You said that it is expected that the performance goes down when a follower node fails slow (no matter whether it is CPU or memory). Why is it expected? For a quorum write, the write only needs to persist on two nodes (a leader and a follower). So one slow follower is not supposed to cause problems.
Yes, I missed the point that a majority of acknowledgments is enough for a write operation in the Raft protocol. Not exactly sure what the reason might be; I have to investigate more.
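The quorum argument can be made concrete with a toy latency model (a sketch, not RethinkDB's actual commit path): in a group of n nodes, a write commits once ⌈(n+1)/2⌉ acks arrive, so commit latency is set by the median-fast acker, and a single slow follower in a 3-node group should not matter.

```python
def quorum_commit_latency(ack_latencies_ms):
    """Toy model: each entry is one node's ack latency (leader
    included). The write commits when a majority has acked, i.e.
    at the (n//2 + 1)-th fastest ack."""
    n = len(ack_latencies_ms)
    needed = n // 2 + 1
    return sorted(ack_latencies_ms)[needed - 1]

# 3-node group, one fail-slow follower at 1000 ms:
# commit latency is governed by the fast follower, not the slow one.
```

Under this model, if performance still drops with one slow follower, the slowdown has to come from somewhere outside the quorum rule, e.g. backpressure from the replication pipeline to the straggler, which would be worth ruling in or out.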
Ritesh previously observed that a slow follower could crash the leader (https://tianyin.github.io/pub/depfast.pdf). Do you observe that?
No, the slow follower did not crash the leader. Even with less slowness than used in depfast (2.5%), the node did not crash.
In Q10, you mentioned that there is a leader crash? Why does it crash?
The crash log shows that the reason is OOM: the cgroup OOM killer kills the process.
@varshith15
https://github.com/xlab-uiuc/slooo_internal/issues/38#issuecomment-1037411617
No. We need to inject during the workload run. The leader re-election is not an easy task and you will observe the difference.
> But I suspect the results will still be the same just like when the node crashes when the memory limit is less than 50MB. There will be a few failure requests of course due to availability issues but the results will almost be the same.
I do not agree.
> Not exactly sure what the issue is

> Not exactly sure what the reason might be
Can you investigate them?
It is in general pretty nicely done! Good job!
Some information needs to be clarified. Please add it.
There are a few bigger problems. Let me list them here:
[ ] Could you inject the faults during the workload run? It seems to me that you inject the faults before the workload starts? That's the reason your leader crash does not even affect performance. A leader election is a window of unavailability, which should show up in performance.
[ ] For Q5 and Q6, the choice of 50MB does not look right. You are supposed to use a value that leads to fail-slow behavior. The fact that 50MB >> the memory needed by the RethinkDB node means a bad choice.
[ ] You said that it is expected that the performance goes down when a follower node fails slow (no matter whether it is CPU or memory). Why is it expected?
[ ] For a quorum write, the write only needs to persist on two nodes (a leader and a follower). So one slow follower is not supposed to cause problems.
[ ] In Q10, you mentioned that there is a leader crash? Why does it crash?
[ ] Ritesh previously observed that a slow follower could crash the leader (https://tianyin.github.io/pub/depfast.pdf). Do you observe that?