xlab-uiuc / slooo

Slooo: A Fail-slow Fault Injection Testing Framework

Paper Submission Plan #45

Closed varshith15 closed 2 years ago

varshith15 commented 2 years ago

Evaluation Section

Point Break

Repo Updates

Paper Submission

varshith15 commented 2 years ago

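Differences compared to Sachin's report

  • Sachin uses an open-loop, asynchronous benchmark tool, whereas YCSB is closed-loop and synchronous.

    It seems Sachin used a fixed number of clients (2) and used QPS as the constraint to plot throughput vs. latency.

    I, on the other hand, did not fix the number of clients; I plotted the throughputs and latencies I got with a different number of clients in each run.

    These two approaches are quite different.

    Sachin's approach makes sense in his case: he uses an open-loop benchmark, which can generate as much load as we want (subject to the bottlenecks, of course). That is not the case with a closed-loop benchmark.

  • Sachin uses stress-ng to apply stress on the CPU; Slooo uses cgroups. This shouldn't cause a big difference in the results.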
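The open-loop versus closed-loop difference can be illustrated with a toy simulation. This is only a sketch under stated assumptions: a single FIFO server with a fixed service time, no think time, and no network delay. The function names `closed_loop` and `open_loop` are hypothetical and do not come from YCSB, Sachin's tool, or Slooo.

```python
def closed_loop(num_clients, service_time, num_requests):
    """Closed loop: a client sends its next request only after its previous reply.

    Returns (throughput in req/s, mean latency in s) for a toy FIFO server.
    """
    server_free_at = 0.0
    ready = [0.0] * num_clients  # time at which each client may send again
    latencies = []
    for _ in range(num_requests):
        # The client that became ready earliest sends next (FIFO arrival order).
        c = min(range(num_clients), key=lambda k: ready[k])
        sent = ready[c]
        done = max(sent, server_free_at) + service_time
        server_free_at = done
        latencies.append(done - sent)
        ready[c] = done  # the client is blocked until its reply arrives
    return num_requests / server_free_at, sum(latencies) / len(latencies)


def open_loop(target_qps, service_time, num_requests):
    """Open loop: requests are sent at a fixed rate, regardless of replies.

    Returns (throughput in req/s, mean latency in s) for the same toy server.
    """
    server_free_at = 0.0
    latencies = []
    for i in range(num_requests):
        sent = i / target_qps  # the send schedule ignores outstanding requests
        done = max(sent, server_free_at) + service_time
        server_free_at = done
        latencies.append(done - sent)
    return num_requests / server_free_at, sum(latencies) / len(latencies)
```

In this model the closed-loop load is capped by the number of clients, so latency stays bounded by `num_clients * service_time`. The open-loop generator keeps sending at `target_qps` even past saturation, so queueing delay keeps growing. That is why sweeping QPS with a fixed client count and sweeping the client count give different throughput vs. latency curves.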

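Takeaways from Sachin's report

  • Sachin points out that we should look at RSS rather than overall memory usage to see how much memory the process is actually using, which I completely missed before. I will revisit the experiments with this new information.

  • Sachin also makes a brilliant observation: PySyncObj violates the Raft algorithm by not creating a persistent log journal, which causes the throughput under memory contention to be lower than the baseline result. I will check the log replication logic of RethinkDB.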
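To look at RSS rather than overall memory usage, as suggested in Sachin's report, one option on Linux is to read the `VmRSS` field from `/proc/<pid>/status`. The snippet below is a minimal sketch assuming a Linux procfs; the helper names `parse_vmrss_kb` and `rss_kb` are hypothetical and are not part of Slooo.

```python
from pathlib import Path
from typing import Optional


def parse_vmrss_kb(status_text: str) -> Optional[int]:
    """Extract the resident set size (in kB) from the text of /proc/<pid>/status.

    Returns None when there is no VmRSS line (e.g. for kernel threads).
    """
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:\t    5120 kB"; the second token is the value.
            return int(line.split()[1])
    return None


def rss_kb(pid: int) -> Optional[int]:
    """Read the current RSS of a running process (Linux only)."""
    return parse_vmrss_kb(Path(f"/proc/{pid}/status").read_text())
```

Sampling `rss_kb(pid)` for the database process during an experiment gives the memory that is actually resident, which can then be compared against the limit applied for memory contention.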

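Questions

  • For a slow CPU on the follower, Sachin mentions: "When a follower is subjected to CPU contention, throughput drops only very slightly which probably can be attributed to some CPU contention at cache level for the leader in this pseudo-distributed setup."

    I don't completely understand the reasoning. Will revisit this later.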

tianyin commented 2 years ago

Just a note: if the goal is to write a tool paper, you need to justify that your tool is useful. In this context, that means your tool can support what Sachin needs in his experiments. If it cannot (which is likely the reason he didn't use Slooo), then the question is how to make Slooo able to support it.

On Sun, May 22, 2022 at 2:07 PM Varshith Bathini @.***> wrote:

Differences compared to Sachin's report

  • Sachin uses an open-loop, asynchronous benchmark tool, whereas YCSB is closed-loop and synchronous.

    It seems Sachin used a fixed number of clients (2) and used QPS as the constraint to plot throughput vs. latency.

    I, on the other hand, did not fix the number of clients; I plotted the throughputs and latencies I got with a different number of clients in each run.

    These two approaches are quite different.

    Sachin's approach makes sense in his case: he uses an open-loop benchmark, which can generate as much load as we want (subject to the bottlenecks, of course). That is not the case with a closed-loop benchmark.

  • Sachin uses stress-ng to apply stress on the CPU; Slooo uses cgroups. This shouldn't cause a big difference in the results.

Takeaways from Sachin's report

  • Sachin points out that we should look at RSS rather than overall memory usage to see how much memory the process is actually using, which I completely missed before. I will revisit the experiments with this new information.

  • Sachin also makes a brilliant observation: PySyncObj violates the Raft algorithm by not creating a persistent log journal, which causes the throughput under memory contention to be lower than the baseline result. I will check the log replication logic of RethinkDB.

Questions

  • For a slow CPU on the follower, Sachin mentions: "When a follower is subjected to CPU contention, throughput drops only very slightly which probably can be attributed to some CPU contention at cache level for the leader in this pseudo-distributed setup."

    I don't completely understand the reasoning. Will revisit this later.


-- Tianyin Xu, https://tianyin.github.io/

varshith15 commented 2 years ago

Just a note: if the goal is to write a tool paper, you need to justify that your tool is useful. In this context, that means your tool can support what Sachin needs in his experiments. If it cannot (which is likely the reason he didn't use Slooo), then the question is how to make Slooo able to support it.

@tianyin I believe the reason Sachin did not use Slooo is that PySyncObj is not a quorum system; it's a Python library, and he built a simple KV store on top of it. Either way, I was going to talk to him about his reasons. I was also looking at his assignment to understand the holes you had pointed out in mine, which I plan to fix as well.

Is Slack the best way to contact him?

varshith15 commented 2 years ago

Point Break Review

It seems the solution Wenshan and Shichen have implemented is a fairly straightforward approach: iterate over a set of predetermined fault values (given by the user) and check whether each one causes a crash. If it does, the fault value just before it is the point break.

This is more or less already implemented in the code revamp we did.

IMO it isn't a smart enough solution. It still needs the user to specify which fault values Slooo should try when looking for the point break, and, depending on how accurately the point break is needed, it may take a long time to find because the search is iterative.

This seems like the only solution for now, though (ironically, it is the first thing @Essoz and I came up with). I don't know if it's a good enough solution to put in the paper.
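For reference, the iterative search described above can be sketched roughly as follows. This is an illustration only, not the code Wenshan and Shichen wrote or the code in the revamp; `find_point_break` and the `causes_crash` callback are hypothetical names, and the fault values are assumed to be ordered from mildest to most severe.

```python
from typing import Callable, Optional, Sequence, Tuple


def find_point_break(
    fault_values: Sequence[float],
    causes_crash: Callable[[float], bool],
) -> Tuple[Optional[float], Optional[float]]:
    """Scan user-supplied fault values in order until one crashes the system.

    Returns (point_break, first_crashing_value):
      - point_break is the last value that did not cause a crash
        (None if the very first value already crashes);
      - first_crashing_value is None if no value in the list caused a crash.
    """
    last_ok = None
    for value in fault_values:
        # In a real run, this step would inject the fault and observe the system.
        if causes_crash(value):
            return last_ok, value
        last_ok = value
    return last_ok, None
```

The sketch makes both concerns visible: the result can only be as precise as the spacing of `fault_values`, and every candidate value costs one full experiment run.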

@tianyin thoughts?