yggdrasil-network / yggdrasil-go

An experiment in scalable routing as an encrypted IPv6 overlay network
https://yggdrasil-network.github.io
Other
3.49k stars 239 forks source link

Windows: "More data is available" crash from Water #472

Closed xaozai closed 5 years ago

xaozai commented 5 years ago
2019/07/25 15:02:05 Build name: yggdrasil-fixwindows
2019/07/25 15:02:05 Build version: 0.3.5-0191
2019/07/25 15:02:05 Starting up...
2019/07/25 15:02:05 Starting switch
2019/07/25 15:02:05 Starting router
2019/07/25 15:02:05 Listening for TCP on: [::]:30473
2019/07/25 15:02:05 Startup complete
2019/07/25 15:02:05 TCP admin socket listening on 127.0.0.1:9001
2019/07/25 15:02:05 Connected TCP: 200:529f:150c:eafe:4405:93fe:8d76:39cb@95.27.
175.20, source 10.11.28.111
2019/07/25 15:02:06 Interface name: Ygg
2019/07/25 15:02:06 Interface IPv6: 200:b886:73cd:75d8:abd:cb8e:baff:fe4c/7
2019/07/25 15:02:06 Interface MTU: 65535
2019/07/25 15:02:06 Your IPv6 address is 200:b886:73cd:75d8:abd:cb8e:baff:fe4c
2019/07/25 15:02:06 Your IPv6 subnet is 300:b886:73cd:75d8::/64
panic: More data is available.

goroutine 60 [running]:
github.com/yggdrasil-network/yggdrasil-go/src/tuntap.(*TunAdapter).reader(0xc000
097c50, 0x1, 0xc0000e8ba0)
        /home/circleci/project/src/tuntap/iface.go:122 +0xe03
created by github.com/yggdrasil-network/yggdrasil-go/src/tuntap.(*TunAdapter).St
art
        /home/circleci/project/src/tuntap/tun.go:160 +0x5ee

The error is reproduced when trying to upload a video file to http://[301:fd1:86e3:521c::13]/

neilalexander commented 5 years ago

Curious, I thought this was fixed. This is a problem in Water rather than Yggdrasil—I'll try and take another look later.

neilalexander commented 5 years ago

Can you give this a try?

64-bit: https://1297-115685026-gh.circle-artifacts.com/0/yggdrasil-fixwindows-0.3.5-0192-windows-amd64.exe 32-bit: https://1297-115685026-gh.circle-artifacts.com/0/yggdrasil-fixwindows-0.3.5-0192-windows-i386.exe

xaozai commented 5 years ago

Unfortunately, the error is still there. And in the log yggdrasil-fixwindows-0.3.5-0192-windows-amd64.exe displays: Build version: 0.3.5-0191

panic: More data is available.

goroutine 15 [running]:
github.com/yggdrasil-network/yggdrasil-go/src/tuntap.(*TunAdapter).reader(0xc000
0adc50, 0x1, 0xc0001b2030)
        /home/circleci/project/src/tuntap/iface.go:122 +0xe03
created by github.com/yggdrasil-network/yggdrasil-go/src/tuntap.(*TunAdapter).St
art
        /home/circleci/project/src/tuntap/tun.go:160 +0x5ee
Arceliar commented 5 years ago

Is it possible that we call Read with a buffer that's too small, relative to the tap MTU? If wonder if the issue would go away if we, for example, double the size of the read buffer.

Vort commented 5 years ago

I can reproduce the problem with tap-windows-9.9.2_3.exe. And can not reproduce with tap-windows-9.21.2.exe.

Vort commented 5 years ago

Digged further into this problem. And it looks like ERROR_MORE_DATA processing is still incorrect. From what I can see, ReadFile retries may be requested infinite times. My tests shows that while using of 9.9.2_3 driver, packets size equals to 16398 bytes. Buffer size of 65549 can hold 3 of them. But if I resize buffer to 65592, it still not helps. 4 packets fits into it, but 5-th fails. So if your buffer is not infinite, another approach should be used.

Vort commented 5 years ago

By the way, getOverlappedResult also can return ERROR_MORE_DATA. Most likely, resetEvent can't produce it and checking of its return code for ERROR_MORE_DATA is not needed.

Vort commented 5 years ago

Reverting to water master fixes this particular problem. But as I remember, reverting will mean crash in different conditions.

Vort commented 5 years ago

This is how crash looks like from packet level: peertube_ygg_crash

I will upload this dump if no one will be able to make minimal test program.

(Note lots of retransmissions and other glitches. It's a different topic, but they can be reproduced with https://ygg.thingylabs.io/ and iPerf3)

Vort commented 5 years ago

Reproduction for this bugreport turned out to be very simple. Just send bunch of UDP packets from problematic Windows Yggdrasil instance:

using System.Net;
using System.Net.Sockets;

namespace ConsoleApplication81
{
    class Program
    {
        static void Main(string[] args)
        {
            Socket sock = new Socket(AddressFamily.InterNetworkV6,
                SocketType.Dgram, ProtocolType.Udp);

            IPAddress serverAddr = IPAddress.Parse(
                "201:6779:c497:82f:68fd:188d:4338:1234");

            IPEndPoint endPoint = new IPEndPoint(serverAddr, 10101);

            var data = new byte[16336];
            for (int i = 0; i < data.Length; i++)
                data[i] = (byte)(i ^ (i >> 8));

            for (int i = 0; i < 16; i++)
                sock.SendTo(data, endPoint);
        }
    }
}

Test binary: ConsoleApplication81.zip

neilalexander commented 5 years ago

I'm afraid I really don't know Windows or the Win32 API well enough to understand what the correct API behaviour here is and I haven't been able to find any Microsoft documentation that describes what to do.

Is ERROR_MORE_DATA a condition that can be returned infinite times across multiple frames? I would have expected that it should only happen as many times as needed to read a single frame. In the case of Yggdrasil, a single frame shouldn't ever exceed 65549 bytes so I don't know why we would ever want/need an infinite buffer.

Vort commented 5 years ago

I don't know it too. But that information can be obtained by debugging and constructing of test programs.

Look at last screenshot. Firefox sent five 16398-byte packets in a row. Three of them was stuffed in your 65549-bytes buffer. On fourth ReadFile or getOverlappedResult call (does not matter) error popped out. By the way, error is not always "more data", "device not functioning" also shows. But it is the same condition: 16398 bytes can not be fit into 16355 free buffer space.

It may be 9.9.2_3 driver problem. But 9.21.2 have even more problems! It fails to transmit large packets. You can try ping -l 2000 for example.

Vort commented 5 years ago

I have reverted syscalls_windows.go and can't see any crashes now. Sadly, I can't reproduce #456 condition anymore. Most likely, because DC++ hub was permanently shut down and I don't have continuous connection anymore. But reason may be that #456 bug was fixed not from water side, but from Yggdrasil side.

neilalexander commented 5 years ago

I'll try to look at the buffer sizes later, but it might be that we are better moving to Wintun anyway in the long run. The OpenVPN TAP driver is full of bugs and I am never sure how many of them we are affected by.

I don't know if anyone else is already looking at that for Water - we haven't had the time to investigate it so far. There's only two developers on the project and neither of us use Windows so we pretty much rely on other people to feed back to us.

Vort commented 5 years ago

I'll try to look at the buffer sizes later

Changing of buffer sizes can only hide the problem. I suggest you to debug iperf3 problem, it may help to understand what is going on. It may be that driver is not related to problem. As I see packet drops and reordering. How Yggdrasil TCP connection can have drops at all?

neilalexander commented 5 years ago

There are a couple of places where Yggdrasil can drop packets if absolutely necessary—there is actually work on-going in develop now to reduce that to only the switch since that is where congestion is handled so it may be worth testing those latest develop builds too.

Yggdrasil never reorders packets within a flow but the switch does try to balance traffic across flows through the switch queues when congestion happens, but this doesn't happen on a non-congested link.

Vort commented 5 years ago

Most likely, this problem was fixed with 0.3.6-0043 version. Upload to http://[301:fd1:86e3:521c::13]/videos/upload works fine for me with tap-windows-9.9.2_3.exe driver.

yggdrasil-develop-0.3.6-0043-windows-amd64.exe yggdrasil-develop-0.3.6-0043-windows-i386.exe

neilalexander commented 5 years ago

Glad to hear it. If you're happy after a few hours of use then please feel free to close the issue.