Closed by tianyin 2 years ago
Could you inject the faults during the workload run? It seems to me that you inject the faults before the workload starts? That's the reason your leader crash does not even affect performance. A leader election is a window of unavailability, which should show up in performance.
Yeah, currently the code injects the failure before the workload is run, which is exactly what the depfast scripts we run do. But yes, ideally the failure should be injected while the workload is running. It can be implemented by having a parallel thread inject the failure after a set delay. Will add it to the to-do list.
But I suspect the results will still be the same, just like when the node crashes with a memory limit below 50MB. There will be a few failed requests due to the availability gap, of course, but the results will be almost the same.
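The parallel-thread approach mentioned above could be sketched roughly as follows (a sketch only; `workload` and `inject_fault` are hypothetical placeholders, not the actual slooo harness API):

```python
import threading

def run_with_fault(workload, inject_fault, delay_s):
    """Run `workload` while a background timer injects the fault
    after `delay_s` seconds, so the fault lands mid-run rather
    than before the run starts."""
    injector = threading.Timer(delay_s, inject_fault)
    injector.start()
    try:
        result = workload()
    finally:
        injector.cancel()  # no-op if the fault already fired
    return result
```

With this shape, the fault (e.g. a leader kill) fires while requests are in flight, so the election window is actually measured by the workload.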
For Q5 and Q6, the choice of 50MB does not look right. You are supposed to use a value that leads to a fail-slow behavior. The fact that 50MB >> the memory needed by the RethinkDB node means a bad choice.
The reason I used 50MB is to show that the node either crashes or works normally; there is nothing in between. Under any memory limit the node never slows down: it either crashes or runs normally. Not exactly sure what the issue is, but I'll try tighter limits like 5MB to see if there's a difference.
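One thing worth checking here (an assumption about the setup, since the thread does not say how the limit is applied): if the limit goes through cgroup v2, `memory.max` is a hard cap that invokes the OOM killer (fail-stop), while `memory.high` throttles the process under reclaim pressure (closer to fail-slow). Setting only the hard cap would explain the "crashes or works normally, nothing in between" behavior. A minimal sketch, parameterized on the cgroup directory so it can be pointed at a real mount:

```python
import os

def set_memory_limits(cgroup_dir, high_bytes, max_bytes):
    """Write cgroup v2 memory limits for the processes in `cgroup_dir`.
    memory.high throttles the workload (fail-slow behaviour);
    memory.max triggers the OOM killer (fail-stop).
    Requires root when `cgroup_dir` is a real cgroup mount."""
    with open(os.path.join(cgroup_dir, "memory.high"), "w") as f:
        f.write(str(high_bytes))
    with open(os.path.join(cgroup_dir, "memory.max"), "w") as f:
        f.write(str(max_bytes))
```

Note that cgroup v1 (`memory.limit_in_bytes`) only has the hard-kill semantics, so if the scripts use v1, a fail-slow memory fault would need a different mechanism entirely.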
You said that it is expected that the performance goes down when a follower node fails slow (no matter whether it is CPU or memory). Why is it expected? For a quorum write, the write only needs to persist on two nodes (a leader and a follower). So one slow follower is not supposed to cause problems.
Yes, I missed the point that a majority of acknowledgments is enough for a write operation in the Raft protocol. Not exactly sure what the reason might be; I have to investigate more.
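The quorum argument can be made concrete with a toy latency model (a sketch, not RethinkDB's actual commit path): in a group of n nodes, a write commits once ⌈(n+1)/2⌉ acks arrive, so commit latency is set by the median-fast acker, and a single slow follower in a 3-node group should not matter.

```python
def quorum_commit_latency(ack_latencies_ms):
    """Toy model: each entry is one node's ack latency (leader
    included). The write commits when a majority has acked, i.e.
    at the (n//2 + 1)-th fastest ack."""
    n = len(ack_latencies_ms)
    needed = n // 2 + 1
    return sorted(ack_latencies_ms)[needed - 1]

# 3-node group, one fail-slow follower at 1000 ms:
# commit latency is governed by the fast follower, not the slow one.
```

Under this model, if performance still drops with one slow follower, the slowdown has to come from somewhere outside the quorum rule, e.g. backpressure from the replication pipeline to the straggler, which would be worth ruling in or out.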
Ritesh previously observed that a slow follower could crash the leader (https://tianyin.github.io/pub/depfast.pdf). Do you observe that?
No, the slow follower did not crash the leader. Even with less slowness than used in depfast (2.5%), the node did not crash.
In Q10, you mentioned that there is a leader crash? Why does it crash?
The crash log shows that the reason is OOM: the cgroup OOM killer kills the process.
@varshith15
https://github.com/xlab-uiuc/slooo_internal/issues/38#issuecomment-1037411617
No. We need to inject during the workload run. The leader re-election is not an easy task and you will observe the difference.
> But I suspect the results will still be the same just like when the node crashes when the memory limit is less than 50MB. There will be a few failure requests of course due to availability issues but the results will almost be the same.
I do not agree.
> Not exactly sure what the issue is

> Not exactly sure what the reason might be
Can you investigate them?
It is in general pretty nicely done! Good job!
Some information needs to be clarified. Please add it.
There are a few bigger problems. Let me list them here:
[ ] Could you inject the faults during the workload run? It seems to me that you inject the faults before the workload starts? That's the reason your leader crash does not even affect performance. A leader election is a window of unavailability, which should show up in performance.
[ ] For Q5 and Q6, the choice of 50MB does not look right. You are supposed to use a value that leads to fail-slow behavior. The fact that 50MB >> the memory needed by the RethinkDB node means a bad choice.
[ ] You said that it is expected that the performance goes down when a follower node fails slow (no matter whether it is CPU or memory). Why is it expected?
[ ] For a quorum write, the write only needs to persist on two nodes (a leader and a follower). So one slow follower is not supposed to cause problems.
[ ] In Q10, you mentioned that there is a leader crash? Why does it crash?
[ ] Ritesh previously observed that a slow follower could crash the leader (https://tianyin.github.io/pub/depfast.pdf). Do you observe that?