Feature Request: Simulating slow or unstable connections

Qqwy commented 5 years ago

I'd like to use this library not in unit tests but in integration tests that ensure the core functionality of our distributed application (https://planga.io) keeps working within expected bounds, even when the connection between nodes is slow or requests are occasionally dropped.

What do you think? :slightly_smiling_face:

whitfin commented 5 years ago

@Qqwy this seems like an interesting idea; how would you go about this?

keathley commented 5 years ago

I've been working on something similar for testing my Raft implementation and other distributed systems things I've got going on. Currently we only simulate interleavings of messages (no slowness or other transient failures) by directing all of the messages through a module that wraps the underlying RPCs. In test mode we can control the interleavings deterministically so that we can test in conjunction with PropEr. The downside to this approach is that it only works if everyone uses the same interface.
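
To make the routing idea above concrete, here is a minimal sketch, written in Python rather than Elixir purely for illustration. Every name in it (`ScheduledTransport`, `RecordingNode`, `run`) is hypothetical, and it is not the actual implementation described above. It assumes all sends go through one transport object, which buffers them in test mode and delivers them in an order chosen by a seeded random generator, so a given interleaving can be replayed.

```python
import random


class RecordingNode:
    """Toy receiver (hypothetical) that records messages in arrival order."""

    def __init__(self, name):
        self.name = name
        self.inbox = []

    def handle(self, sender, message):
        self.inbox.append((sender, message))


class ScheduledTransport:
    """Hypothetical test-mode transport: buffers sends, delivers in a seeded order."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self._nodes = {}
        self._pending = []

    def register(self, node):
        self._nodes[node.name] = node

    def send(self, sender, recipient, message):
        # Nothing is delivered yet; the test decides when and in what order.
        self._pending.append((sender, recipient, message))

    def deliver_all(self):
        # Choose the next message with the seeded RNG so a run can be replayed.
        while self._pending:
            index = self._rng.randrange(len(self._pending))
            sender, recipient, message = self._pending.pop(index)
            self._nodes[recipient].handle(sender, message)


def run(seed):
    """Send five messages from node "a" to node "b" and return b's inbox."""
    transport = ScheduledTransport(seed)
    a, b = RecordingNode("a"), RecordingNode("b")
    transport.register(a)
    transport.register(b)
    for i in range(5):
        transport.send("a", "b", i)
    transport.deliver_all()
    return b.inbox
```

Running `run` twice with the same seed yields the same delivery order, which is what lets a property-based test shrink and replay a failing interleaving. The sketch also shows the limitation mentioned above: any code that sends messages without going through the transport is invisible to the scheduler.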

The alternative approach I've used is just to run everything in Docker containers and simulate network failures with iptables or similar. This is more realistic and doesn't require changing the underlying code, but it takes a lot longer to explore the search space for faults.

An idea that I had a while ago (which might not even be viable) would be to provide a custom distribution module to replace the standard Erlang distribution, similar to what's done for distribution over TLS. This module would only need to run in test mode but could be deterministically controlled. The benefit would be that the code under test doesn't need to change. There's the obvious drawback that in test mode you're using a different protocol than what's running in production, and it may fail in different ways under real-world failure conditions. But for finding low-hanging fruit it might be worth it.

keathley commented 5 years ago

I'm still working on a few ideas but I recently built a new package for testing node disconnects: https://github.com/keathley/schism. Currently it only does disconnects but I want to play around with some ideas for adding message latency, corruption, etc.

I considered opening a PR to add this to local-cluster, but in the end I decided to create a separate repo. I really like how small the surface area of local-cluster's API is and find it much more useful and powerful because of it. I also wanted to keep experimenting with some ideas around fault injection, and it just seemed to make sense to me to do that as a standalone effort. If people feel differently about that then I'm happy to discuss it :smile:.

whitfin commented 5 years ago

@keathley neat. The API you have is small enough that I'm probably not against including it in here, but if you'd rather keep it separate that makes sense too!

It might be a neat idea to write a blog post (if you're into that) giving demonstrations of using both schism and local-cluster together, so we have a resource we can point people to!

keathley commented 5 years ago

I'm happy to write that blog post :+1:

Qqwy commented 5 years ago

@keathley Please do! I think it would be very interesting and educational :smile: .

whitfin commented 5 years ago

@keathley did you ever publish anything around this? If so, feel free to link it and I can probably drop it in the README if it's helpful!

whitfin / local-cluster