webmeshproj / webmesh

A simple, distributed, zero-configuration WireGuard mesh solution
https://webmeshproj.github.io
Apache License 2.0

err: sendto: destination address required #5

Closed by bbigras 1 year ago

bbigras commented 1 year ago

I'm following the https://webmeshproj.github.io/guides/personal-vpn/ guide. I'm able to connect and some pings work, but I get this on the server side:

{"time":"2023-08-07T02:02:13.95968374-04:00","level":"INFO","msg":"starting webmesh node","version":"unknown","commit":"unknown","buildDate":"unknown"}
{"time":"2023-08-07T02:02:13.976051984-04:00","level":"INFO","msg":"using CN as node ID","component":"mesh","node-id":"server"}
{"time":"2023-08-07T02:02:13.976094557-04:00","level":"INFO","msg":"loading plugin","name":"mtls"}
{"time":"2023-08-07T02:02:14.010115941-04:00","level":"INFO","msg":"All 0 tables opened in 0s","component":"raft","storage":"/var/lib/webmesh/store","component":"raftbadger"}
{"time":"2023-08-07T02:02:14.017333494-04:00","level":"INFO","msg":"Discard stats nextEmptySlot: 0","component":"raft","storage":"/var/lib/webmesh/store","component":"raftbadger"}
{"time":"2023-08-07T02:02:14.017356434-04:00","level":"INFO","msg":"Set nextTxnTs to 0","component":"raft","storage":"/var/lib/webmesh/store","component":"raftbadger"}
{"time":"2023-08-07T02:02:14.076359285-04:00","level":"INFO","msg":"All 0 tables opened in 0s","component":"badger"}
{"time":"2023-08-07T02:02:14.090821071-04:00","level":"INFO","msg":"Discard stats nextEmptySlot: 0","component":"badger"}
{"time":"2023-08-07T02:02:14.090842809-04:00","level":"INFO","msg":"Set nextTxnTs to 0","component":"badger"}
{"time":"2023-08-07T02:02:14.091883921-04:00","level":"INFO","msg":"starting raft instance","component":"raft","storage":"/var/lib/webmesh/store","listen-addr":"[::]:9443"}
{"time":"2023-08-07T02:02:14.091997004-04:00","level":"INFO","msg":"initial configuration","component":"raft","index":0,"servers":["%+v",null]}
{"time":"2023-08-07T02:02:14.092069274-04:00","level":"INFO","msg":"bootstrapping cluster","component":"mesh","node-id":"server"}
{"time":"2023-08-07T02:02:14.092073906-04:00","level":"INFO","msg":"entering follower state","component":"raft","follower":{},"leader-address":"","leader-id":""}
{"time":"2023-08-07T02:02:18.538521544-04:00","level":"WARN","msg":"heartbeat timeout reached, starting election","component":"raft","last-leader-addr":"","last-leader-id":""}
{"time":"2023-08-07T02:02:18.53858546-04:00","level":"INFO","msg":"entering candidate state","component":"raft","node":{},"term":2}
{"time":"2023-08-07T02:02:18.538669733-04:00","level":"INFO","msg":"election won","component":"raft","term":2,"tally":1}
{"time":"2023-08-07T02:02:18.538681113-04:00","level":"INFO","msg":"entering leader state","component":"raft","leader":{}}
{"time":"2023-08-07T02:02:18.597710471-04:00","level":"INFO","msg":"newly bootstrapped cluster, setting IPv4/IPv6 networks","component":"mesh","node-id":"server","ipv4-network":"172.16.0.0/12","ipv6-network":"fda1:420:b42e::/48"}
{"time":"2023-08-07T02:02:18.598660537-04:00","level":"INFO","msg":"registering ourselves as a node in the cluster","component":"mesh","node-id":"server","server-id":"server"}
{"time":"2023-08-07T02:02:18.598683097-04:00","level":"INFO","msg":"generating wireguard key for ourselves","component":"mesh","node-id":"server"}
{"time":"2023-08-07T02:02:18.598689159-04:00","level":"INFO","msg":"generating new wireguard key","component":"mesh","node-id":"server"}
{"time":"2023-08-07T02:02:18.598998931-04:00","level":"INFO","msg":"starting network manager","component":"mesh","node-id":"server"}
{"time":"2023-08-07T02:02:18.599013156-04:00","level":"INFO","msg":"Configuring firewall","component":"net-manager","opts":{"ID":"server","DefaultPolicy":"accept","WireguardPort":51821,"RaftPort":9443,"GRPCPort":8443}}
{"time":"2023-08-07T02:02:18.643151554-04:00","level":"INFO","msg":"Configuring wireguard","component":"net-manager","opts":{"NodeID":"server","ListenPort":51821,"Name":"","ForceName":false,"ForceTUN":false,"PersistentKeepAlive":0,"MTU":1350,"AddressV4":"172.16.0.1/32","AddressV6":"fda1:420:b42e:442::/64","Metrics":false,"MetricsInterval":15000000000,"DisableIPv4":false,"DisableIPv6":false}}
{"time":"2023-08-07T02:02:18.643335776-04:00","level":"INFO","msg":"creating wireguard interface","component":"wireguard","name":"webmesh0"}
{"time":"2023-08-07T02:02:18.657321514-04:00","level":"INFO","msg":"re-adding ourselves to the cluster with the acquired wireguard address","component":"mesh","node-id":"server"}
{"time":"2023-08-07T02:02:18.657365768-04:00","level":"INFO","msg":"updating configuration","component":"raft","command":0,"server-id":"server","server-addr":"172.16.0.1:9443","servers":["%+v",[{"Suffrage":0,"ID":"server","Address":"172.16.0.1:9443"}]]}
{"time":"2023-08-07T02:02:18.657447005-04:00","level":"INFO","msg":"initial bootstrap complete","component":"mesh","node-id":"server"}
{"time":"2023-08-07T02:02:18.657459953-04:00","level":"INFO","msg":"mesh connection is ready, starting services"}
{"time":"2023-08-07T02:02:18.657884285-04:00","level":"INFO","msg":"Starting gRPC server on [::]:8443","component":"server"}
{"time":"2023-08-07T02:02:38.015101241-04:00","level":"INFO","msg":"started call","component":"server","protocol":"grpc","grpc.component":"server","grpc.service":"v1.Node","grpc.method":"Join","grpc.method_type":"unary","peer.address":"100.127.110.11:36454","grpc.start_time":"2023-08-07T02:02:38-04:00","grpc.time_ms":"0.01"}
{"time":"2023-08-07T02:02:38.015353545-04:00","level":"INFO","msg":"join request received","component":"node-server","op":"join","id":"admin","request":{"id":"admin","public_key":"sDNLid5iqiuIrFJ4sCYoUgaE4Xlbdy2taZazqYdZBF0=","raft_port":9443,"grpc_port":8443,"assign_ipv4":true}}
{"time":"2023-08-07T02:02:38.016100312-04:00","level":"INFO","msg":"adding non-voter to cluster","component":"node-server","op":"join","id":"admin","raft_address":"172.16.0.2:9443"}
{"time":"2023-08-07T02:02:38.01612437-04:00","level":"INFO","msg":"updating configuration","component":"raft","command":1,"server-id":"admin","server-addr":"172.16.0.2:9443","servers":["%+v",[{"Suffrage":0,"ID":"server","Address":"172.16.0.1:9443"},{"Suffrage":1,"ID":"admin","Address":"172.16.0.2:9443"}]]}
{"time":"2023-08-07T02:02:38.016176729-04:00","level":"INFO","msg":"added peer, starting replication","component":"raft","peer":"admin"}
{"time":"2023-08-07T02:02:38.016396001-04:00","level":"ERROR","msg":"failed to appendEntries to","component":"raft","peer":{"Suffrage":1,"ID":"admin","Address":"172.16.0.2:9443"},"error":"dial tcp 172.16.0.2:9443: connect: no route to host"}
{"time":"2023-08-07T02:02:38.016547253-04:00","level":"INFO","msg":"sending join response","component":"node-server","op":"join","id":"admin","response":{"address_ipv4":"172.16.0.2/32","address_ipv6":"fda1:420:b42e:ceea::/64","network_ipv4":"172.16.0.0/12","network_ipv6":"fda1:420:b42e::/48","peers":[{"id":"server","public_key":"5sdQXSMQGUXWzj7Ri8ZyYPoom2++4ISdWk+9sWSZDjg=","primary_endpoint":"100.85.215.110:51821","wireguard_endpoints":["100.85.215.110:51821"],"address_ipv4":"172.16.0.1/32","address_ipv6":"fda1:420:b42e:442::/64","allowed_ips":["172.16.0.1/32","fda1:420:b42e:442::/64"]}],"mesh_domain":"webmesh.internal."}}
{"time":"2023-08-07T02:02:38.016636105-04:00","level":"INFO","msg":"finished call","component":"server","protocol":"grpc","grpc.component":"server","grpc.service":"v1.Node","grpc.method":"Join","grpc.method_type":"unary","peer.address":"100.127.110.11:36454","grpc.start_time":"2023-08-07T02:02:38-04:00","grpc.code":"OK","grpc.time_ms":"1.548"}
{"time":"2023-08-07T02:02:38.017910703-04:00","level":"WARN","msg":"could not ping descendant","component":"net-manager","descendant":"admin","error":"run pinger: write ip4 0.0.0.0->172.16.0.2: sendto: destination address required"}
{"time":"2023-08-07T02:02:38.018289245-04:00","level":"WARN","msg":"could not ping descendant","component":"net-manager","descendant":"admin","error":"run pinger: write ip4 0.0.0.0->172.16.0.2: sendto: destination address required"}
{"time":"2023-08-07T02:02:41.339334625-04:00","level":"ERROR","msg":"failed to heartbeat to","component":"raft","peer":"172.16.0.2:9443","backoff time":10000000,"error":"dial tcp 172.16.0.2:9443: i/o timeout"}
{"time":"2023-08-07T02:02:44.949781283-04:00","level":"ERROR","msg":"failed to heartbeat to","component":"raft","peer":"172.16.0.2:9443","backoff time":10000000,"error":"dial tcp 172.16.0.2:9443: i/o timeout"}
{"time":"2023-08-07T02:02:48.366660528-04:00","level":"ERROR","msg":"failed to heartbeat to","component":"raft","peer":"172.16.0.2:9443","backoff time":10000000,"error":"dial tcp 172.16.0.2:9443: i/o timeout"}
{"time":"2023-08-07T02:02:51.944757019-04:00","level":"ERROR","msg":"failed to heartbeat to","component":"raft","peer":"172.16.0.2:9443","backoff time":20000000,"error":"dial tcp 172.16.0.2:9443: i/o timeout"}
{"time":"2023-08-07T02:02:55.425564585-04:00","level":"ERROR","msg":"failed to heartbeat to","component":"raft","peer":"172.16.0.2:9443","backoff time":40000000,"error":"dial tcp 172.16.0.2:9443: i/o timeout"}
{"time":"2023-08-07T02:02:55.741667558-04:00","level":"INFO","msg":"shutting down gRPC server"}
{"time":"2023-08-07T02:02:55.741742112-04:00","level":"INFO","msg":"Shutting down gRPC server","component":"server"}
{"time":"2023-08-07T02:02:55.741822634-04:00","level":"INFO","msg":"shutting down mesh connection"}
{"time":"2023-08-07T02:02:55.741864178-04:00","level":"INFO","msg":"creating new db snapshot","component":"snapshots"}
{"time":"2023-08-07T02:02:55.741992909-04:00","level":"INFO","msg":"Number of ranges found: 2","component":"badger"}
{"time":"2023-08-07T02:02:55.742081389-04:00","level":"INFO","msg":"DB.Backup Streaming about 0 B of uncompressed data (0 B on disk)","component":"badger"}
{"time":"2023-08-07T02:02:55.742121461-04:00","level":"INFO","msg":"Sent range 0 for iteration: [, 2f72656769737472792f65646765732f61646d696e2f736572766572fffffffffffffff1) of size: 0 B","component":"badger"}
{"time":"2023-08-07T02:02:55.742132076-04:00","level":"INFO","msg":"Sent range 1 for iteration: [2f72656769737472792f65646765732f61646d696e2f736572766572fffffffffffffff1, ) of size: 0 B","component":"badger"}
{"time":"2023-08-07T02:02:55.754551205-04:00","level":"INFO","msg":"DB.Backup Sent data of size 2.2 KiB","component":"badger"}
{"time":"2023-08-07T02:02:55.754931487-04:00","level":"INFO","msg":"db snapshot complete","component":"snapshots","duration":"13.017689ms","size":"982 B"}
{"time":"2023-08-07T02:02:55.754985262-04:00","level":"INFO","msg":"starting snapshot up to","component":"raft","index":20}
{"time":"2023-08-07T02:02:55.755028353-04:00","level":"INFO","msg":"creating new snapshot","component":"raft","storage":"/var/lib/webmesh/store","component":"snapshotstore","path":"/var/lib/webmesh/store/snapshots/2-20-1691388175755.tmp"}
{"time":"2023-08-07T02:02:55.781345175-04:00","level":"INFO","msg":"snapshot complete up to","component":"raft","index":20}
{"time":"2023-08-07T02:02:55.781571056-04:00","level":"ERROR","msg":"failed to transfer leadership","component":"raft","storage":"/var/lib/webmesh/store","error":"cannot find peer"}
{"time":"2023-08-07T02:02:56.721167203-04:00","level":"ERROR","msg":"failed to appendEntries to","component":"raft","peer":{"Suffrage":1,"ID":"admin","Address":"172.16.0.2:9443"},"error":"dial tcp 172.16.0.2:9443: i/o timeout"}
{"time":"2023-08-07T02:02:58.988560091-04:00","level":"ERROR","msg":"failed to heartbeat to","component":"raft","peer":"172.16.0.2:9443","backoff time":80000000,"error":"dial tcp 172.16.0.2:9443: i/o timeout"}
{"time":"2023-08-07T02:02:58.988724645-04:00","level":"INFO","msg":"Lifetime L0 stalled for: 0s","component":"badger"}
{"time":"2023-08-07T02:02:59.016826222-04:00","level":"INFO","msg":"Level 0 [ ]: NumTables: 01. Size: 1.3 KiB of 0 B. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 64 MiB\nLevel 1 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB\nLevel 2 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB\nLevel 3 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB\nLevel 4 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB\nLevel 5 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB\nLevel 6 [B]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB\nLevel Done","component":"badger"}
{"time":"2023-08-07T02:02:59.031477119-04:00","level":"INFO","msg":"Lifetime L0 stalled for: 0s","component":"raft","storage":"/var/lib/webmesh/store","component":"raftbadger"}
{"time":"2023-08-07T02:02:59.099681848-04:00","level":"INFO","msg":"Level 0 [ ]: NumTables: 01. Size: 2.3 KiB of 0 B. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 64 MiB\nLevel 1 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB\nLevel 2 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB\nLevel 3 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB\nLevel 4 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB\nLevel 5 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB\nLevel 6 [B]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB\nLevel Done","component":"raft","storage":"/var/lib/webmesh/store","component":"raftbadger"}
[pid 219825] sendto(22, "\10\0L\322\270\356\0\0\27y\6rKN\376\214\356nZ\225\234\267O~\235K^Y\302B\227V", 32, 0, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("172.16.0.2")}, 16 <unfinished ...>
[pid 219825] <... sendto resumed>)      = -1 EDESTADDRREQ (Destination address required)

I think this causes disconnections.

Version: 0.1.2

tinyzimmer commented 1 year ago

So I'm seeing shutdown logs in there too - not sure if that was you or from an error. Some of the ping failures you're seeing are usually benign. When a new connection is established, both sides do a short ping to initiate the connection. It doesn't always work - but the heartbeats usually end up starting it anyway.

Those multiple failed heartbeats are a little more concerning - unless you had shut down the other side of the connection.
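
If you want to sanity-check whether the tunnel itself comes up despite those ping warnings, something along these lines on the server should tell you (interface name and peer address taken from your logs; assumes the wg tool is installed):

sudo wg show webmesh0    # look for a recent handshake with the admin peer
ping -c1 172.16.0.2      # the mesh address assigned to the admin node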

bbigras commented 1 year ago

Gotcha.

Here's an asciinema recording (the top panel is my desktop (server); the two other panels are my laptop (client)): https://asciinema.org/a/uuKQNDtPIdaguQ1WSAdTMmse3

Note that I'm already running both Tailscale and WireGuard on both computers.

tinyzimmer commented 1 year ago

I can take a look in a little bit and see what's going on. One other thing to check - are there overlapping address assignments between Tailscale, your devices, and webmesh? On the webmesh side you can configure the internal IPv4 prefix at bootstrap - the IPv6 one is randomly generated.

One way to rule that out would be to try running with IPv4 disabled - see the sketch below.
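
Roughly (a sketch - I believe the relevant flags are --global.no-ipv4 on the node and --no-ipv4 for wmctl, and it's worth checking for overlap first):

ip -4 addr show    # look for ranges on tailscale0/wg interfaces that collide with 172.16.0.0/12
webmesh-node <your existing flags> --global.no-ipv4
wmctl connect <your existing flags> --no-ipv4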

bbigras commented 1 year ago

Here's another recording with --global.no-ipv4 and --no-ipv4.

https://asciinema.org/a/Q26LZhOZCCHVIxUb0CgJoWvjT

Note that I'm running webmesh over Tailscale (100.85.215.110 is a Tailscale IP).

bbigras commented 1 year ago

Is there something like --global.primary-endpoint but for wmctl connect?

bbigras commented 1 year ago

It works if I connect using LAN IPs instead of Tailscale.

EDIT: but it doesn't work if I try to connect to my VPS from my desktop. Only my VPS's ports are open.

tinyzimmer commented 1 year ago

To the wmctl connect question: there isn't. That utility was originally just for testing, but I decided to keep it around. It acts like a NAT'd VPN client by default - you just poke out, as in the sketch below.
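
A minimal invocation looks something like this (an illustrative sketch - the server address is a placeholder, and --insecure matches the test setups in this thread rather than an mTLS deployment):

wmctl connect --insecure --join-server=<server-ip>:8443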

At least one node currently needs to be accessible for peerings to work. But I am working on other methods of discovery.

bbigras commented 1 year ago

> At least one node currently needs to be accessible for peerings to work. But I am working on other methods of discovery.

In my tests, at least one node was accessible.

I'll do more tests, though.

tinyzimmer commented 1 year ago

I hope to be able to look more closely tonight. One more thing you can try (and a really easy way to shoot yourself in the foot that I need to document better): given this is a sort of "zero-trust" solution, the default behavior only lets nodes peer up if there is a Network ACL allowing it. ACLs are managed via the admin API, but you can set a default allow-all rule at bootstrap with --bootstrap.default-network-policy=accept, as sketched below. You'll see something similar in most of the examples.
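
Concretely, something like this at bootstrap (a sketch - other flags elided):

webmesh-node \
  --bootstrap.enabled \
  --bootstrap.default-network-policy=accept \
  <your other flags>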

bbigras commented 1 year ago

Here's a nixpkgs test. It only works with the --global.primary-endpoint 192.168.2.101 line; if I remove it I get https://asciinema.org/a/H4bHoph8bTN0NlUd0FmR73OfG . Note that it's possible the VMs have more than one network interface each, since setting an IP for eth1 might have created an additional interface if the default one is not eth1.

I'll test again with my vps.

Don't hesitate to ask if you want to know how to run this test with nixpkgs.

import ./make-test-python.nix ({ pkgs, ... }: {
  name = "webmesh";
  meta.maintainers = with pkgs.lib.maintainers; [ bbigras ];

  nodes = {
    server = {
      networking = {
        interfaces.eth1 = {
          ipv4.addresses = [
            { address = "192.168.2.101"; prefixLength = 24; }
          ];
        };

        firewall = {
          trustedInterfaces = [ "webmesh0" ];

          allowedTCPPorts = [
            8443
            9443
          ];
          allowedUDPPorts = [
            51820
          ];
        };
      };

      systemd.services.webmesh = {
        wants = [ "network-online.target" ];  # "after" alone doesn't pull the target in
        after = [ "network-online.target" ];
        wantedBy = [ "multi-user.target" ];
        script = ''
          ${pkgs.webmesh}/bin/webmesh-node \
            --global.insecure \
            --global.no-ipv6 \
            --global.detect-endpoints \
            --global.detect-private-endpoints \
            --bootstrap.enabled \
            --bootstrap.default-network-policy=accept \
            --global.primary-endpoint 192.168.2.101
        '';
      };
    };

    client = {
      networking = {
        interfaces.eth1 = {
          ipv4.addresses = [
            { address = "192.168.2.102"; prefixLength = 24; }
          ];
        };
        firewall = {
          trustedInterfaces = [ "webmesh0" ];
        };
      };
      systemd.services.wmctl = {
        wants = [ "network-online.target" ];  # "after" alone doesn't pull the target in
        after = [ "network-online.target" ];
        wantedBy = [ "multi-user.target" ];
        script = ''
          ${pkgs.webmesh}/bin/wmctl \
            connect --insecure --no-ipv6 --join-server=192.168.2.101:8443
        '';
      };
    };
  };

  testScript =
      ''
        server.start()
        server.wait_for_unit("webmesh.service")
        server.wait_for_open_port(8443)

        client.start()
        client.wait_for_unit("wmctl.service")
        client.wait_for_open_port(9443)

        client.succeed("ping -c1 172.16.0.1")
        client.sleep(120)
        client.succeed("ping -c1 172.16.0.1")
      '';
})
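
In case it helps: assuming the file sits next to nixpkgs's make-test-python.nix helper (i.e. under nixos/tests/) and pkgs.webmesh is provided by an overlay, it should run with something like:

nix-build nixos/tests/webmesh.nix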

tinyzimmer commented 1 year ago

I'm gonna try to replicate it locally - and if I fail I may reach out for more info.

bbigras commented 1 year ago

Also note that I don't have to use --global.primary-endpoint 192.168.2.101 if I disable eth0 in my test.

tinyzimmer commented 1 year ago

I think I've found an issue - not sure if it is related to yours. But in a similar setup the second client keeps dropping the peer. I'll let you know what I figure out.

tinyzimmer commented 1 year ago

Just to give you a quick update. I'm trying to hammer out one last bug and then I'll tag a new release.

I think the fix most related to your issue was in the storage update trigger on each node. It was doing some hacky logic to make sure it was fully up to date, and a recent fix had made that no longer necessary. I think that old hack was causing peer refreshes not to happen at the right times.

I'm having another issue where peer refreshes sometimes return the wrong internal IPs to be set as the allowed IPs. I'm not sure if this is something you are experiencing - but I hope to figure it out before I push this other fix out.

tinyzimmer commented 1 year ago

I'm still working on the second issue - but if you are able to build from main you can see if the first one was your problem. If you can't build yourself, the CI will have an image in a little bit.
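
If you do build yourself, a standard Go module build should be all you need (the exact package layout is an assumption):

git clone https://github.com/webmeshproj/webmesh
cd webmesh
go install ./...    # should drop webmesh-node and wmctl into $GOBIN if the cmd packages are named that way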

bbigras commented 1 year ago

I still need --global.primary-endpoint with 56a9f6b671e82d862e34187e6956a1de2af16371. I mean in my nix test, not my real test with my VPS.

tinyzimmer commented 1 year ago

That is likely unrelated. Endpoint detection is non-deterministic, so if you want a specific one tagged as the primary you have to specify it. That being said - I can try to look into it more.

tinyzimmer commented 1 year ago

Worth noting there are also the --mesh.*-endpoint options for more granular control.
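
For example (a hypothetical instantiation of that pattern - check webmesh-node --help for the exact flag names):

webmesh-node <other flags> --mesh.primary-endpoint=192.168.2.101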

bbigras commented 1 year ago

I'm testing from my desktop to my VPS and it seems to stay connected now. :tada:

I don't mind using --global.primary-endpoint :)

tinyzimmer commented 1 year ago

Sweet - I fixed my other bug, so I'm about to tag a new release. I'll close this, but feel free to open a new issue if anything arises.