zehome / MLVPN

Multi-link VPN (ADSL/SDSL/xDSL/Network aggregation / bonding)
http://www.mlvpn.fr/
BSD 2-Clause "Simplified" License

Issues with MLVPN #94

Open markfoodyburton opened 7 years ago

markfoodyburton commented 7 years ago

After all my efforts, I still have problems :-(

I am able to 'stress' my ADSL lines by decreasing the requested SNR. Hence I get an increased bandwidth (theoretically), but the line will be less stable, and will drop packets (I assume).

With one line stressed, I want MLVPN to start to use the other line. BUT

With my lines set up like this:
1/ My bandwidth reduces A LOT ... (on BOTH lines). I lose almost all my bandwidth (not just half). In the end, both tunnels 'freeze' (but they do not drop, or get marked as 'lossy'!)
2/ The measured SRTT values are both 'low' - and reflect the line bandwidth (in other words, the stressed line has lower SRTT than the normal line, as you would expect if it was working properly!)
3/ I see no lost packets, or any other errors.

I just don't understand. The line looks perfect, but MLVPN refuses to send data to the line(s), and one line collapsing seems to stop the other line working....

@zehome can you help? What could be causing this?

(The problem is that with both lines set normally, I see this behaviour every few days, meaning that the whole thing is very unstable over a couple of 'poor' ADSL lines)

If you can point me in the right direction to investigate, I'd appreciate it !

Cheers Mark.

markfoodyburton commented 7 years ago

A little bit more info. I just had a case (with the lines supposedly in a reasonable working condition) where they seemed to have stopped. As I was in a rush, I simply rebooted the modems... when they came back, the throughput was the same. I restarted both MLVPN instances (giving them a few seconds between shutting down and starting again), and now we are back to normal. From what I can see this is really a problem with MLVPN, but I can't see what it could be... (and I've been hunting this, as you know, for weeks :-( )

zehome commented 7 years ago

In situations like these, I usually run mlvpn inside gdb, or attach gdb when the problem occurs and place some strategic breakpoints/tracepoints.

You can also strace or perf trace the thing to try to understand what it's doing.

zehome commented 7 years ago

Things I've seen were like the mlvpn state machine being fucked up: receiving "try to connect" while it thinks the tunnel is up (remote and local state machines not in sync).

markfoodyburton commented 7 years ago

So this is just a little screenshot of what I'm seeing. Throughout this experiment I am streaming the same high bandwidth data through the tunnel. The 'red' ADSL connection is "good". The 'blue' ADSL connection is "bad".

[screenshot: untitled 6]

What you see to start with is a single tunnel up. The second tunnel joins it, and the bandwidth drops markedly... - that's what I am trying to solve. Even if the second line adds nothing, we should still be using the first line's bandwidth.

The second tunnel is of 'poor' quality (and this ONLY happens when the line quality is poor). In all other cases, MLVPN does a reasonable job sharing.... If the line is good, we get the addition of the bandwidth.

What I WANT to happen is that the 'SRTT' values for the 'poor' quality line go up, so that I can 'favour' the good line... But - here is what happens:

[screenshot: untitled 7]

What you see here is that initially the graph shows some (loaded, but normal) SRTT results from the 'red' tunnel. As soon as both tunnels come up, BOTH tunnels seem to have really good SRTT values.

I expect the second line to be poor, but it's not. All that seems to have happened is that the overall MLVPN tunnel system transports much less data, and as such, both tunnels are working fine...and the SRTT values are 'normal' for un-loaded tunnels.

Measuring the SRTT using ping gives the same results (in all cases, loaded, unloaded, good/bad ADSL lines, etc etc) - so I "believe" the values of SRTT.
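To make the goal of 'favouring' the good line concrete, here is a minimal sketch that turns each tunnel's SRTT into a share of traffic. It assumes a plain inverse-SRTT weighting, which is not necessarily how MLVPN computes its weights; `tunnel_weights` and the sample numbers are hypothetical.

```python
def tunnel_weights(srtt_ms):
    """Return each tunnel's share of traffic, proportional to 1/SRTT.

    srtt_ms maps a tunnel name to its smoothed RTT in milliseconds.
    Hypothetical helper for illustration; not MLVPN's actual algorithm.
    """
    if not srtt_ms:
        return {}
    inverse = {}
    for name, srtt in srtt_ms.items():
        if srtt <= 0:
            raise ValueError(f"SRTT for {name!r} must be positive")
        inverse[name] = 1.0 / srtt
    total = sum(inverse.values())
    # Normalise so that the shares add up to 1.
    return {name: value / total for name, value in inverse.items()}


# Made-up numbers: a 'good' line at 40 ms and a 'poor' line at 120 ms.
weights = tunnel_weights({"red": 40.0, "blue": 120.0})
```

A scheme like this only helps if the poor line actually reports a higher SRTT. When both tunnels report similar, low values, as described above, the split stays close to even.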

So - what I have looked at so far:
1/ I have not yet run perf or strace. perf MAY help, if I can capture the right times (the issue is understanding the culprit - it could be the system calls...).
2/ It seems to me we have not lost "synchronisation" of the state machine - I'm not sure, but things SEEM to continue as you would expect. A 'debug' trace shows nothing out of the normal, just packets being sent to the two interfaces...
3/ I have tried to measure the time it takes to send packets to the interface (the write - write time) - which seems to be minimal.
4/ I have tried to force the kernel buffer size to be exactly 1 MTU, to force the kernel not to buffer much, and then count how many packets there are in the MLVPN buffers - result - ALWAYS 1. That seems VERY strange (but I have reorder_buffer_size set to 0; setting it to anything other than zero gives me terrible performance!). This result seems 'odd', but indicates that the data isn't coming into the tunnel. Which - in turn - I assume (given the 'load' is likely to be a TCP connection) means that the TCP connection isn't acknowledging packets very fast (which seems likely)... but that doesn't get me any closer to understanding why - I just go round in circles :-)
5/ I added a CRC check to packets (on the theory that packets were being corrupted), but - no - they are not corrupted.
6/ I thought maybe the SRTT calculation was wrong, or that it was so large that it was being ignored - but - no - the SRTT calculation is correct, and I can verify it with ping. The SRTT is actually very "low" under these conditions!
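For reference, the kind of CRC check described in point 5/ could look something like the sketch below. It is only an illustration of the idea: the function names and the 4-byte trailer layout are assumptions, not the code that was added to MLVPN.

```python
import struct
import zlib


def add_crc(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC32 of the payload (assumed layout)."""
    return payload + struct.pack("!I", zlib.crc32(payload) & 0xFFFFFFFF)


def check_crc(frame: bytes) -> bytes:
    """Return the payload if the trailing CRC32 matches, else raise ValueError."""
    if len(frame) < 4:
        raise ValueError("frame too short to contain a CRC")
    payload, trailer = frame[:-4], frame[-4:]
    (expected,) = struct.unpack("!I", trailer)
    if zlib.crc32(payload) & 0xFFFFFFFF != expected:
        raise ValueError("CRC mismatch: payload corrupted in transit")
    return payload


# Sender side wraps the packet; receiver side verifies and unwraps it.
frame = add_crc(b"tunnel payload")
payload = check_crc(frame)
```

If corruption were the cause, `check_crc` would raise on the receiving side. Seeing no mismatches is what rules that theory out.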

I'm running out of ideas - I can't see how come the data coming into the pipe is 'slowing down', and I can't see any evidence that the pipe itself is slowing down (even though I am sure it is - I just can't measure it !!!)

My only theory is that (somehow) the packets of data are being corrupted, but then I think the modems would catch that and then request a re-send, which would mean a longer SRTT - which we don't see? And anyway - I tried adding a CRC, and that seems to suggest everything is working just perfectly....

I'm not sure where to go next - have you ANY ideas?

markfoodyburton commented 7 years ago

I've found one small suspicious bug, which would upset things, but it's not the cause of the problem. I've added a small patch to the PR (https://github.com/zehome/MLVPN/pull/69). In rtun_send, the reply timestamp can be -1 in the case that the timestamp has already been consumed. This wasn't being checked, which meant that sometimes the resulting calculation found a seemingly valid value, causing brief errors in the SRTT calculation.
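The general shape of that kind of fix can be sketched as follows. This is not the actual patch: `update_srtt`, `NO_TIMESTAMP` and the smoothing factor of 1/8 (the usual value for TCP-style smoothed RTT) are assumptions for illustration.

```python
NO_TIMESTAMP = -1  # assumed sentinel: the reply timestamp was already consumed


def update_srtt(srtt, now_ms, reply_ts_ms, alpha=0.125):
    """Fold one RTT sample into the smoothed RTT.

    Samples without a valid reply timestamp are skipped, so the sentinel
    never takes part in the arithmetic. Hypothetical helper, not MLVPN code.
    """
    if reply_ts_ms == NO_TIMESTAMP:
        return srtt  # nothing to measure, keep the previous estimate
    sample = now_ms - reply_ts_ms
    if sample < 0:
        return srtt  # clock went backwards or value is bogus, ignore it
    if srtt is None:
        return float(sample)  # first sample initialises the estimate
    return (1 - alpha) * srtt + alpha * sample


# Without the sentinel check, now_ms - (-1) would look like a valid sample.
srtt = update_srtt(100.0, now_ms=1000, reply_ts_ms=NO_TIMESTAMP)
```

The point of the guard is that `now_ms - (-1)` is a perfectly ordinary-looking number, so nothing downstream would flag it as wrong.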