zehome / MLVPN

Multi-link VPN (ADSL/SDSL/xDSL/Network aggregation / bonding)
http://www.mlvpn.fr/
BSD 2-Clause "Simplified" License

About loss ratio computation #61

Open wolfgar opened 8 years ago

wolfgar commented 8 years ago

Hi, I have just set up mlvpn to aggregate 2 ADSL links and, as the performance is not what I expected (I get half the bandwidth of a single link when I use mlvpn), I had a closer look at the configuration and at the verbose messages...

I have not fixed my bandwidth issue, but a first remark: as soon as I begin to download, the loss ratio of each tunnel reaches 50%.

Looking at the code, this is inherent to the way loss is computed: for each tunnel, it counts (using the sequence number) how many of the last 64 packets were received. As the sequence number is global, in a configuration with 2 tunnels used evenly, one packet out of 2 is sent through each tunnel, so the loss ratio is quickly computed as 50% even though there is not a single "real" packet loss.

I guess that with 3 similar tunnels it would stabilize around 66%, and so on...
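To make the arithmetic concrete, here is a minimal Python sketch of the effect described above. It is not the mlvpn code: the function name `apparent_loss`, the round-robin scheduling, and the way the window is anchored at the highest sequence number seen on each tunnel are assumptions made for illustration. Only the 64-packet window and the single global sequence come from the report.

```python
WINDOW = 64  # window size mentioned in the report


def apparent_loss(num_tunnels, packets=1000, window=WINDOW):
    """Spread `packets` round-robin over the tunnels using ONE global
    sequence, drop nothing, and return the loss ratio each tunnel would
    compute over the last `window` global sequence numbers."""
    seen = [set() for _ in range(num_tunnels)]
    highest = [-1] * num_tunnels
    for seq in range(packets):
        tunnel = seq % num_tunnels      # hypothetical even scheduling
        seen[tunnel].add(seq)
        highest[tunnel] = seq
    ratios = []
    for tunnel in range(num_tunnels):
        # The tunnel looks at the last `window` global sequence numbers,
        # but only ever received its own share of them.
        recent = range(highest[tunnel] - window + 1, highest[tunnel] + 1)
        received = sum(1 for s in recent if s in seen[tunnel])
        ratios.append(1 - received / window)
    return ratios


for n in (1, 2, 3):
    print(n, apparent_loss(n))
```

With no packet dropped at all, each of 2 tunnels reports 0.5, and each of 3 tunnels reports 42/64 (about 66%), in line with the 1 - 1/N pattern guessed above.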

Is this the expected behavior?

Regards, Stéphan

zehome commented 8 years ago

No, this is not the expected behavior at all. You are absolutely right. As we use a global sequence, it's not possible to get this loss detection system working.

Regarding the performance issue, if you are using the current "master", poor performance is expected when the underlying link is dropping packets (slow connections). The only way around that is to maintain a sequence per connection and to do the reordering on a per-connection basis instead of globally (I mean connections INSIDE the tunnel).

wolfgar commented 8 years ago

Thanks a lot for your feedback regarding loss computation. If it does not add too much protocol overhead, using an additional sequence number at tunnel scope would fix this ratio...
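A rough Python sketch of that idea, under stated assumptions: the class `TunnelLossTracker`, the helper `simulate`, and its `drop` parameter are hypothetical names, and this says nothing about how mlvpn would actually carry the extra field on the wire. Each tunnel numbers its own packets, and the receiver measures loss over that per-tunnel sequence only.

```python
class TunnelLossTracker:
    """Receiver-side loss estimate over a tunnel's OWN sequence numbers."""

    def __init__(self, window=64):
        self.window = window
        self.seen = set()
        self.highest = -1

    def on_receive(self, tunnel_seq):
        self.seen.add(tunnel_seq)
        self.highest = max(self.highest, tunnel_seq)
        # Forget sequence numbers that have fallen out of the window.
        floor = self.highest - self.window + 1
        self.seen = {s for s in self.seen if s >= floor}

    def loss_ratio(self):
        if self.highest < 0:
            return 0.0
        span = min(self.window, self.highest + 1)
        return 1 - len(self.seen) / span


def simulate(num_tunnels=2, packets=1000, drop=frozenset()):
    """Round-robin sender. `drop` holds (tunnel, tunnel_seq) pairs that
    are lost in transit (hypothetical test input)."""
    next_seq = [0] * num_tunnels
    trackers = [TunnelLossTracker() for _ in range(num_tunnels)]
    for global_seq in range(packets):
        tunnel = global_seq % num_tunnels
        tunnel_seq = next_seq[tunnel]   # per-tunnel counter, not the global one
        next_seq[tunnel] += 1
        if (tunnel, tunnel_seq) not in drop:
            trackers[tunnel].on_receive(tunnel_seq)
    return [t.loss_ratio() for t in trackers]
```

In this sketch, `simulate(2)` reports no loss on either tunnel, while dropping 8 recent packets on tunnel 1 only, e.g. `simulate(2, drop={(1, s) for s in range(490, 498)})`, raises the ratio of that tunnel alone.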

Regarding performance, I have to say that so far I have been unable to understand precisely what goes wrong... My links are not very lossy: 2 similar ADSL lines, and everything is wired between the ISP gateways and my host running mlvpn. Also, when I analyze the logs I cannot see missing sequence numbers. I should write a script to parse them to be 100% sure, but if it happens it is very sparse... When I capture the exchanges on the TUN device, I can see that at some stage the TCP connection behaves abnormally (ACKs are resent), but it rather seems that for a short period the server simply stops sending new packets. I am not so sure; deeper investigation is required...
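The check mentioned above could look roughly like the following Python sketch. It assumes the sequence numbers have already been extracted from the logs as integers; the log format is deliberately not assumed here, and the function name `missing_sequences` is hypothetical. It also ignores sequence number wraparound.

```python
def missing_sequences(received):
    """Return the sorted sequence numbers that never appear between the
    lowest and highest values in `received` (order and duplicates are
    ignored)."""
    seen = set(received)
    if not seen:
        return []
    return [s for s in range(min(seen), max(seen) + 1) if s not in seen]
```

An empty result over a long capture would support the observation that real losses are absent or very sparse.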

You are right that I currently use a master build. Is there a specific tag you would advise for testing that could/should behave better than the work-in-progress master?

zehome commented 8 years ago

I use it in production with aggregation enabled but reordering disabled, on the latest release version.

zehome commented 8 years ago

I've experienced the corruptions / send pauses you are talking about, but I still haven't found out why.

wolfgar commented 8 years ago

Thanks a lot. I tried with V2.3.1, no reorder (in this version reordering was disabled automatically anyway because of a3f78e6525450181158c27f9a64444fef23832e0). Upload works almost as expected, but download behaves the same. I have set up a very simple test using the netperf benchmark to better understand the data flow, but there is nothing obvious (what I analyze is just fine; I suspect the TCP flow control is somehow fooled, but I have no proof).

For my immediate operational needs I think I will try to deploy MultiPath TCP, but I will come back and investigate mlvpn further if I can, as it has fewer low-level prerequisites and is not limited to the TCP protocol. I will tell you if I find something that deserves sharing...

markfoodyburton commented 7 years ago

To return to the issue of loss calculation, I now have a (partial) solution to this. I'll open a PR once I've cleaned up the patch. It relies on some guesswork. Basically, we assume that a tunnel that advances the sequence number is OK, and that any 'holes' it leaves in its wake in the sequence list should be filled by the other tunnel. If the other tunnel fails to fill those holes, then it is "blamed" for the loss. This is clearly no more than a heuristic, but it gives reasonable results in my tests so far: it at least seems to reliably "point the finger" at a lossy tunnel. It means that we can now set loss tolerances on tunnels and have them treated sensibly, which helps stability a lot...
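One possible reading of this heuristic, as a Python sketch for two tunnels. This is not the code from the patch: the function `blame_losses` and the `grace` parameter (how far the sequence may advance before an unfilled hole counts as lost) are assumptions introduced for illustration.

```python
def blame_losses(arrivals, grace=8):
    """`arrivals` is a list of (global_seq, tunnel) pairs in arrival
    order, with tunnel 0 or 1. Returns [losses blamed on tunnel 0,
    losses blamed on tunnel 1]."""
    blamed = [0, 0]
    holes = {}      # missing global seq -> tunnel that skipped past it
    highest = -1
    for seq, tunnel in arrivals:
        if seq > highest:
            # This tunnel advanced the sequence: it is assumed healthy,
            # and every number it jumped over becomes a hole.
            for missing in range(highest + 1, seq):
                holes[missing] = tunnel
            highest = seq
        else:
            holes.pop(seq, None)    # a late packet fills its hole
        # Holes left open for more than `grace` sequence numbers are
        # declared lost and blamed on the OTHER tunnel.
        for missing in [m for m in holes if highest - m > grace]:
            advancer = holes.pop(missing)
            blamed[1 - advancer] += 1
    return blamed


# Tunnel 1 carries the odd numbers and loses every one ending in 1.
lossy = [(s, s % 2) for s in range(100) if s % 10 != 1]
print(blame_losses(lossy))
```

In this example only tunnel 1 accumulates blame, while mild reordering such as `[(0, 0), (2, 0), (1, 1), (3, 1)]` blames nobody because the hole is filled before it expires. Holes still open at the end of the input are not counted.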

markfoodyburton commented 7 years ago

see https://github.com/zehome/MLVPN/pull/69