tianyin closed 2 years ago
Note that the assignments will cover not only fail-slow faults but also fail-stop faults, such as killing a follower or a leader.
I think that is a great idea!
I think the answer would be yes even if fail-stop faults are to be considered.
Just in case your expectation went beyond the tool's capability, I want to clarify that the tool cannot test a system on its own. A user has to use a benchmark and adapt the tool to the target system by implementing the interface provided by the RSM class. And currently, the tool does not support killing a node (but of course we can add support for that with little effort).
@Essoz I totally understand that!
The goal is to ask students to read some systems code and measure their fault tolerance.
> of course we can add support for that with little effort
Are you able to add the support, say in two or three days?
Also, the assignment will be done on their local machines, rather than Azure.
Therefore, we need to make sure the local mode can be used.
I can add that local support in two days, say tomorrow.
The local mode is usable, but it comes with limitations. To perform disk experiments, each instance has to be assigned a different partition as its datapath. Memory & Network experiments require extra work on the user side: they have to start instances with resource isolation by using tools like docker so that the experiments such as memory contention do not affect other instances.
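To illustrate the kind of isolation described above, here is a minimal sketch that builds a `docker run` command per instance with a memory cap, a CPU cap, and its own data directory. It is only an assumption about how a user might set this up, not something the tool does for you: the image name `example/rsm-node:latest`, the `/data` mount point, and the `/mnt/partN` paths are hypothetical placeholders.

```python
import shlex


def docker_run_command(name, image, memory="2g", cpus=1.0, data_dir=None):
    """Build a `docker run` argv that caps one instance's memory and CPU."""
    argv = [
        "docker", "run", "--detach",
        "--name", name,
        "--memory", memory,    # hard memory limit for this container
        "--cpus", str(cpus),   # CPU quota, in number of cores
    ]
    if data_dir is not None:
        # Bind-mount a per-instance directory (ideally on its own partition)
        # to the hypothetical /data path inside the container.
        argv += ["--volume", f"{data_dir}:/data"]
    argv.append(image)
    return argv


if __name__ == "__main__":
    # Print the commands for a three-node local cluster; nothing is executed.
    for i in range(3):
        cmd = docker_run_command(
            f"node{i}",
            "example/rsm-node:latest",  # hypothetical image name
            data_dir=f"/mnt/part{i}",   # hypothetical per-instance partition
        )
        print(shlex.join(cmd))
```

The script only prints the commands, so they can be reviewed before anything is started.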
I don't think I understand.
> Memory & Network experiments require extra work on the user side: they have to start instances with resource isolation by using tools like docker
Why can't memory be done with cgroups?
I apologize. I just checked the paper and the code, and the limitations of local mode are listed below:
Memory contention is available.
Great, @Essoz! CPU and memory are all I need!
What I hope you can support is as follows:
The node can be either a leader or a follower, or both (e.g., in Copilot).
@tianyin The slooo framework provides the code to inject various kinds of slowness (CPU, memory, network) into the desired node (leader/follower), but the user has to add the code to figure out the desired node (leader/follower) for the given system. The logic for figuring out the leader or follower differs from system to system, as we see when comparing MongoDB, TiDB, and RethinkDB, and Copilot has no leader/follower distinction at all. So the user has to implement that part of the logic.
In copilot, we just slow down any one of the nodes as there isn't a leader/follower distinction.
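As a rough sketch of the part the user would write, the snippet below picks a target node by role and falls back to any node when the system has no leader/follower distinction. Everything here is hypothetical and for illustration only: `TargetSystem`, `node_roles`, `pick_target`, and the `"peer"` role are made-up names, not slooo's actual RSM interface, and a real adapter would query the system (for example, its status command) instead of using a static map.

```python
import random
from abc import ABC, abstractmethod
from typing import Dict


class TargetSystem(ABC):
    """Hypothetical adapter; NOT the framework's actual RSM interface."""

    @abstractmethod
    def node_roles(self) -> Dict[str, str]:
        """Return node name -> 'leader', 'follower', or 'peer' (no distinction)."""


def pick_target(system, role, rng=random):
    """Choose the node to slow down for the requested role."""
    roles = system.node_roles()
    if not roles:
        raise ValueError("cluster has no nodes")
    # Systems without a leader/follower distinction: any node will do.
    if all(r == "peer" for r in roles.values()):
        return rng.choice(sorted(roles))
    candidates = sorted(n for n, r in roles.items() if r == role)
    if not candidates:
        raise LookupError(f"no node with role {role!r}")
    return rng.choice(candidates)


class StaticCluster(TargetSystem):
    """Toy system with a fixed role map, used only to exercise pick_target."""

    def __init__(self, roles):
        self._roles = dict(roles)

    def node_roles(self):
        return dict(self._roles)


if __name__ == "__main__":
    raft_like = StaticCluster({"n1": "leader", "n2": "follower", "n3": "follower"})
    print(pick_target(raft_like, "leader"))
    leaderless = StaticCluster({"a": "peer", "b": "peer", "c": "peer"})
    print(pick_target(leaderless, "follower"))
```

For the assignment, students would replace `StaticCluster` with a class whose `node_roles` inspects the real system under test.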
> but the user has to add the code to figure out the desired node
This exactly is the purpose of the course assignment -- students need to write some code and understand the system to be measured!
@varshith15 @Essoz
I'm teaching a grad-level course on reliable software systems. I want to design an assignment on testing quorum systems.
Can I ask the students to use your framework?