zerotier / libzt

Encrypted P2P sockets over ZeroTier
https://zerotier.com

libzt fails to connect over IPv6 to ad-hoc and public networks on Linux amd64 #31

Closed pts closed 4 years ago

pts commented 6 years ago

I'm trying to use libzt (at this commit: https://github.com/zerotier/libzt/commit/5ec7d5befceafa7d51152eff2094265fb8ff249a) to connect to a TCP port on a peer over both IPv4 and IPv6, but the connect call never succeeds.

To rule out a network connectivity issue, I joined the ad-hoc SSH network with zerotier-cli:

$ zerotier-cli join ff00160016000000
200 join OK
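
For readers unfamiliar with ad-hoc network IDs: in ZeroTier's scheme, as far as I understand it, an ID of the form ffSSSSEEEE000000 encodes the first (SSSS) and last (EEEE) allowed port, which is why ff00160016000000 is "the SSH network" (0x0016 is port 22). A minimal sketch of decoding that layout follows. The helper name adhoc_port_range is hypothetical and not part of libzt, and the layout itself is an assumption stated in the comment.

```c
#include <stdint.h>

/* Hypothetical helper, not a libzt API. Assumes the ad-hoc network ID layout
 * ffSSSSEEEE000000 (SSSS = first allowed port, EEEE = last allowed port).
 * Returns 0 on success, -1 if nwid does not look like an ad-hoc network ID. */
int adhoc_port_range(uint64_t nwid, uint16_t *first, uint16_t *last) {
  if ((nwid >> 56) != 0xff || (nwid & 0xffffffULL) != 0) {
    return -1;
  }
  *first = (uint16_t)((nwid >> 40) & 0xffff);
  *last = (uint16_t)((nwid >> 24) & 0xffff);
  return 0;
}
```

Passing 0xff00160016000000ULL should give a range of 22 to 22, while a controller-managed ID such as Earth's 8056c2e21c000001 is rejected.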

Then connecting works:

$ gcc -s -O2 -W -Wall -Wextra t1nativex.c
$ ./a.out
SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u3

However, the same connection made through libzt fails:

$ g++ -s -O2 -W -Wall -Wextra t1x.c -l... -lpthread
$ ./a.out
...
STACK[32465]:        sys_arch.c:  270:            sys_mbox_post: sys_mbox_post: mbox 0x7f80a4000a80 msg 0x7ffc588b0010
STACK[32469]:        sys_arch.c:  361:      sys_arch_mbox_fetch: sys_mbox_fetch: mbox 0x7f80a4000a80 msg 0x7f80a4000a20
error connecting to remote host: -1 No route to host

Am I doing something wrong? How can I fix it in t1x.c?

Source file t1nativex.c (finishes successfully and quickly after zerotier-one -q join ff00160016000000) with part of the IP address redacted:

#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main()  {
  const char *ip6 = "fce9:0016:00??:????:????:0000:0000:0001";
  int port = 22;
  char buf[256];
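  struct sockaddr_in6 addr;
  int fd, err = 0;

  /* Zero the whole struct so sin6_flowinfo and sin6_scope_id are defined. */
  memset(&addr, 0, sizeof(addr));
  addr.sin6_family = AF_INET6;
  /* inet_pton returns 1 on success, 0 for a malformed address, -1 on error. */
  if (inet_pton(AF_INET6, ip6, &(addr.sin6_addr)) != 1) {
    fprintf(stderr, "error converting address\n");
    return 1;
  }
  addr.sin6_port = htons(port);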

  if ((fd = socket(AF_INET6, SOCK_STREAM, 0)) < 0) {
    fprintf(stderr, "error creating socket\n");
    return 1;
  }
  if ((err = connect(fd, (const struct sockaddr *)&addr, sizeof(addr))) < 0) {
    fprintf(stderr, "error connecting to remote host: %d (%s)\n", err, strerror(errno));
    return 1;
  }
  if ((err = read(fd, buf, sizeof(buf))) < 0) {
    fprintf(stderr, "error reading from socket\n");
    return 1;
  }
  write(1, buf, err);
  if ((err = close(fd)) < 0) {
    fprintf(stderr, "error closing socket\n");
    return 1;
  }
  return 0;
}

Source file t1x.c (IPv6 over ZeroTier, zts_connect fails) with part of the IP address redacted:

#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

#include "libzt.h"

int main()  {
  const char *ip6 = "fce9:0016:00??:????:????:0000:0000:0001";
  int port = 22;
  char buf[256];
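  struct sockaddr_in6 addr;
  int fd, err = 0;

  addr.sin6_family = AF_INET6;
  addr.sin6_flowinfo = 0;  /* would otherwise be uninitialized */
  addr.sin6_scope_id = 0;  /* would otherwise be uninitialized */
  /* inet_pton returns 1 on success, 0 for a malformed address, -1 on error. */
  if (inet_pton(AF_INET6, ip6, &(addr.sin6_addr)) != 1) {
    printf("error converting address\n");
    return 1;
  }
  addr.sin6_port = htons(port);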
  if (zts_startjoin("path", 0xff00160016000000ULL)) {
    printf("error joining the network\n");
    return 1;
  }
  if ((fd = zts_socket(AF_INET6, SOCK_STREAM, 0)) < 0) { 
    printf("error creating socket\n");
    return 1;
  }
  if ((err = zts_connect(fd, (const struct sockaddr *)&addr, sizeof(addr))) < 0) {
    /* No route to host -- is it a real error message? */
    printf("error connecting to remote host: %d %s\n", err, strerror(errno));
    return 1;
  }
  if ((err = zts_read(fd, buf, sizeof(buf))) < 0) {
    fprintf(stderr, "error reading from socket\n");
    return 1;
  }
  write(1, buf, err);
  if ((err = zts_close(fd)) < 0) {
    printf("error closing socket\n");
    return 1;
  }
  zts_stop();
  return 0;
}

Source file t4x.c (IPv4 over ZeroTier, zts_connect doesn't return for several minutes) with part of the IP address redacted:

#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

#include "libzt.h"

int main()  {
  const char *ip = "172.25.???.???";
  int port = 22;
  char buf[256];
  struct sockaddr_in addr;
  int fd, err = 0;

  addr.sin_family = AF_INET;
  /* inet_pton returns 1 on success, 0 for a malformed address, -1 on error. */
  if (inet_pton(AF_INET, ip, &(addr.sin_addr)) != 1) {
    printf("error converting address\n");
    return 1;
  }
  addr.sin_port = htons(port);  
  if (zts_startjoin("path", 0x??????ULL)) {
    printf("error joining the network\n");
    return 1;
  }
  if ((fd = zts_socket(AF_INET, SOCK_STREAM, 0)) < 0) { 
    printf("error creating socket\n");
    return 1;
  }
  if ((err = zts_connect(fd, (const struct sockaddr *)&addr, sizeof(addr))) < 0) {
    /* No route to host -- is it a real error message? */
    printf("error connecting to remote host: %d %s\n", err, strerror(errno));
    return 1;
  }
  if ((err = zts_read(fd, buf, sizeof(buf))) < 0) {
    fprintf(stderr, "error reading from socket\n");
    return 1;
  }
  write(1, buf, err);
  if ((err = zts_close(fd)) < 0) {
    printf("error closing socket\n");
    return 1;
  }
  zts_stop();
  return 0;
}

This command works, with part of the IP address redacted:

$ telnet 172.25.???.??? 22
Trying 172.25.???.???...
Connected to 172.25.???.???.
Escape character is '^]'.
SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u3
^C
Connection closed by foreign host.

joseph-henry commented 6 years ago

Thank you. It appears you are doing everything correctly and I've been able to replicate the error. I'll try to have a solution shortly.

pts commented 6 years ago

Thank you for looking into this!

joseph-henry commented 6 years ago

@pts, Let me know if you're still having issues with this. I haven't tested it since I last looked at it but I suspect some of the most recent updates may have fixed it. I'm going to close this ticket for now but feel free to re-open it if you need.

pts commented 6 years ago

This bug is still not fixed for me; libzt still doesn't work. Could you please take another look?

I'm using https://github.com/zerotier/libzt/commit/744277fb69c16dde57d15a21d8403837b1187ef4 and the t1x.c above fails with:

$ git clone https://github.com/zerotier/libzt
$ (cd libzt && git checkout 744277fb69c16dde57d15a21d8403837b1187ef4)
$ (cd libzt && git submodule init)
$ (cd libzt && git submodule update)
$ (cd libzt && make patch)
$ (cd libzt && cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Release)
$ (cd libzt && cmake --build build)
(takes a few minutes to compile, no errors)
$ g++ -W -Wall -Werror t1x.c -Ilibzt/include libzt/bin/lib/libzt-static.a -lpthread
(no errors or warnings)
$ ./a.out
(busy for a few seconds)
error connecting to remote host: -1 No route to host
$ sudo ./a.out
(busy for a few seconds)
error connecting to remote host: -1 No route to host

This succeeds on the client:

$ zerotier-one join ff00160016000000
200 join OK
$ telnet fce9:0016:00??:????:????:0000:0000:0001 22
Trying ...
Connected to fce9:0016:00??:????:????:0000:0000:0001.
Escape character is '^]'.
SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u3

pts commented 6 years ago

Any update on this issue? libzt still doesn't work for me: it is unable to connect over IPv4 and IPv6.

joseph-henry commented 6 years ago

No update, but I'll re-open this issue and take another look.

joseph-henry commented 6 years ago

I have a theory. I'm thinking that when joining ad-hoc networks the route (normally pushed from a controller) isn't being treated in the same way by the libzt service. I'm going to see what I can do about this.

In the meantime, do you have success with networks that aren't ad-hoc?

Note: I am currently able to reproduce your problem, so this is definitely a bug you've found.
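
For context on why ad-hoc networks have no controller-pushed route: as far as I understand ZeroTier's 6PLANE addressing, the IPv6 address of each member is derived purely from the network ID and the 40-bit node ID, which matches the fce9:0016:00??:... pattern used in t1x.c. The sketch below assumes that layout. sixplane_address is a hypothetical helper, not a libzt function, and the node ID in the usage note is made up.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not a libzt API. Assumes the 6PLANE layout:
 * byte 0xfc, then the 32-bit XOR of the upper and lower halves of the
 * network ID, then the 40-bit node ID, then 0000:0000:0001.
 * Writes the textual address into out (40 bytes are enough).
 * Returns 0 on success, -1 if out is too small. */
int sixplane_address(uint64_t nwid, uint64_t node_id, char *out, size_t out_len) {
  uint32_t folded = (uint32_t)(nwid >> 32) ^ (uint32_t)(nwid & 0xffffffffULL);
  unsigned char a[16] = {0};
  int i, n;

  a[0] = 0xfc;
  for (i = 0; i < 4; i++) a[1 + i] = (unsigned char)(folded >> (24 - 8 * i));
  for (i = 0; i < 5; i++) a[5 + i] = (unsigned char)(node_id >> (32 - 8 * i));
  a[15] = 0x01;

  n = snprintf(out, out_len,
               "%02x%02x:%02x%02x:%02x%02x:%02x%02x:"
               "%02x%02x:%02x%02x:%02x%02x:%02x%02x",
               a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7],
               a[8], a[9], a[10], a[11], a[12], a[13], a[14], a[15]);
  return (n > 0 && (size_t)n < out_len) ? 0 : -1;
}
```

With network ID 0xff00160016000000 and an invented node ID 0x1122334455, this produces fce9:0016:0011:2233:4455:0000:0000:0001, so the fce9:0016:00 prefix is the same for every member of that ad-hoc network.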

pts commented 6 years ago

Thank you for working on diagnosing and fixing this!

> do you have success with networks that aren't ad-hoc?

I haven't tried non-ad-hoc networks with libzt, because I don't need them in my use case. I've just changed the GitHub issue title to make it more focused: it now mentions ad-hoc networks.

Non-ad-hoc networks (as well as ad-hoc networks) work for me with zerotier-one, telnet and nc -l -p ..., using the Linux kernel on both computers to create the TCP packets.

joseph-henry commented 6 years ago

I think I have a solution. It's available on all branches now. I've added a patch to the ext\lwip submodule (will require re-patching after pull) and made some minor tweaks to libzt. Peers are once again reachable on ad-hoc networks but I haven't done any more testing than that.

Let me know if this doesn't seem to fix your issue.

pts commented 6 years ago

Thank you for your work on fixing this!

It still doesn't work for me, I'm still getting the same error. I'm using 71ea71e33a480d665db7f49e6a36dc37abb5d9aa with t1x.c on the client:

$ git clone https://github.com/zerotier/libzt
$ (cd libzt && git checkout 71ea71e33a480d665db7f49e6a36dc37abb5d9aa)
$ (cd libzt && git submodule init)
$ (cd libzt && git submodule update)
$ (cd libzt && make patch)
$ (cd libzt && cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Release)
$ (cd libzt && cmake --build build)
(takes a few minutes to compile, no errors)
$ $EDITOR t1x.c  # Change the IPv6 address in line 10.
$ g++ -W -Wall -Werror t1x.c -Ilibzt/include libzt/bin/lib/libzt.a -lpthread
(no errors or warnings)
$ ./a.out
(busy for a few seconds)
error connecting to remote host: -1 No route to host
$ sudo ./a.out
(busy for a few seconds)
error connecting to remote host: -1 No route to host

This still succeeds on the client (with the correct IPv6 address):

$ zerotier-one join ff00160016000000
200 join OK
$ telnet fce9:0016:00??:????:????:0000:0000:0001 22
Trying ...
Connected to fce9:0016:00??:????:????:0000:0000:0001.
Escape character is '^]'.
SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u3

joseph-henry commented 6 years ago

No problem! Thanks for continuing to test this. I was able to replicate this again and have added a block in the route selection code (another lwip patch) which seems to pick the route correctly now. Running your code verbatim with your exact steps yields:

SSH-2.0-OpenSSH_6.9

Please don't hesitate to point out any other bugs, or broken features.

pts commented 6 years ago

TL;DR: libzt now works for ad-hoc networks, but it doesn't work reliably for the public network Earth over IPv6.

Thank you, now the ad-hoc network works for me!

$ git clone https://github.com/zerotier/libzt
$ (cd libzt && git checkout 71e37354a1347674b32389e30f35142a27be1a47)
$ (cd libzt && git submodule init)
$ (cd libzt && git submodule update)
$ (cd libzt && make patch)
$ (cd libzt && cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Release)
$ (cd libzt && cmake --build build)
(takes a few minutes to compile, no errors)
$ $EDITOR t1x.c  # Change the IPv6 address in line 10.
$ g++ -W -Wall -Werror t1x.c -Ilibzt/include libzt/bin/lib/libzt.a -lpthread
(no errors or warnings)
$ ./a.out
SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u3
(takes 5.182s on average)

However, the public network Earth doesn't always work for me over IPv6; I'm getting various errors:

$ g++ -W -Wall -Werror t1xp6.c -Ilibzt/include libzt/bin/lib/libzt.a -lpthread
$ ./a.out
no data from remote host
$ ./a.out
error connecting to remote host: -1 No route to host
$ ./a.out
Segmentation fault
$ ./a.out  # Now it works.
SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u3
$ ./a.out
no data from remote host

It works though when I join the network with zerotier-one on the client:

$ zerotier-one join 8056c2e21c000001
200 join OK
$ telnet f???:????:????:????:????:????:????:???? 22
Trying f???:????:????:????:????:????:????:????...
Connected to f???:????:????:????:????:????:????:????.
Escape character is '^]'.
SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u3

Source file t1xp6.c (with the IP address redacted):

#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

#include "libzt.h"

int main()  {
  const char *ip6 = "f???:????:????:????:????:????:????:????";
  int port = 22;
  char buf[256];
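  struct sockaddr_in6 addr;
  int fd, err = 0;

  addr.sin6_family = AF_INET6;
  addr.sin6_flowinfo = 0;  /* would otherwise be uninitialized */
  addr.sin6_scope_id = 0;  /* would otherwise be uninitialized */
  /* inet_pton returns 1 on success, 0 for a malformed address, -1 on error. */
  if (inet_pton(AF_INET6, ip6, &(addr.sin6_addr)) != 1) {
    fprintf(stderr, "error converting address\n");
    return 1;
  }
  addr.sin6_port = htons(port);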
  if (zts_startjoin("path", 0x8056c2e21c000001ULL)) {  /* Earth */
    fprintf(stderr, "error joining the network\n");
    return 1;
  }
  if ((fd = zts_socket(AF_INET6, SOCK_STREAM, 0)) < 0) { 
    fprintf(stderr, "error creating socket\n");
    return 1;
  }
  if ((err = zts_connect(fd, (const struct sockaddr *)&addr, sizeof(addr))) < 0) {
    /* No route to host -- is it a real error message? */
    fprintf(stderr, "error connecting to remote host: %d %s\n", err, strerror(errno));
    return 1;
  }
  if ((err = zts_read(fd, buf, sizeof(buf))) < 0) {
    fprintf(stderr, "error reading from socket\n");
    return 1;
  }
  if (err == 0) {
    fprintf(stderr, "no data from remote host\n");
    return 1;
  }
  write(1, buf, err);
  if ((err = zts_close(fd)) < 0) {
    fprintf(stderr, "error closing socket\n");
    return 1;
  }
  zts_stop();
  return 0;
}

The public network Earth seems to work over IPv4 though:

$ g++ -W -Wall -Werror t1xp4.c -Ilibzt/include libzt/bin/lib/libzt.a -lpthread
$ ./a.out
SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u3
(takes 5.182s on average)

Source file t1xp4.c (with the IP address redacted):

#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

#include "libzt.h"

int main()  {
  const char *ip = "28.???.???.???";
  int port = 22;
  char buf[256];
  struct sockaddr_in addr;
  int fd, err = 0;

  addr.sin_family = AF_INET;
  /* inet_pton returns 1 on success, 0 for a malformed address, -1 on error. */
  if (inet_pton(AF_INET, ip, &(addr.sin_addr)) != 1) {
    fprintf(stderr, "error converting address\n");
    return 1;
  }
  addr.sin_port = htons(port);  
  if (zts_startjoin("path", 0x8056c2e21c000001ULL)) {  /* Earth */
    fprintf(stderr, "error joining the network\n");
    return 1;
  }
  if ((fd = zts_socket(AF_INET, SOCK_STREAM, 0)) < 0) { 
    fprintf(stderr, "error creating socket\n");
    return 1;
  }
  if ((err = zts_connect(fd, (const struct sockaddr *)&addr, sizeof(addr))) < 0) {
    /* No route to host -- is it a real error message? */
    fprintf(stderr, "error connecting to remote host: %d %s\n", err, strerror(errno));
    return 1;
  }
  if ((err = zts_read(fd, buf, sizeof(buf))) < 0) {
    fprintf(stderr, "error reading from socket\n");
    return 1;
  }
  if (err == 0) {
    fprintf(stderr, "no data from remote host\n");
    return 1;
  }
  write(1, buf, err);
  if ((err = zts_close(fd)) < 0) {
    fprintf(stderr, "error closing socket\n");
    return 1;
  }
  zts_stop();
  return 0;
}

joseph-henry commented 6 years ago

Interesting. After testing this about 20 times I finally got "no data from remote host", so something is afoot. I'll have to think about this one...

joseph-henry commented 6 years ago

Looks like we were observing multiple related issues in my lwIP driver code. Issue (1): pointers to incoming frames were being counted improperly, leading to a buffer overflow. Issue (2): incoming data stored in lwIP pbufs was being deallocated before your app had a chance to read it. Now the reference counter is properly incremented and the pbufs are deallocated automatically. Both issues seem fixed. I've tested on macOS and Linux amd64 machines.
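
To illustrate the second issue for other readers: the following is a toy model of a reference-counted buffer, not lwIP's actual pbuf code (lwIP's own primitives for this are pbuf_ref() and pbuf_free()). All names here (rc_buf, rc_buf_new, rc_buf_ref, rc_buf_release, live_buffers) are invented for the sketch. The point is that the consumer must take its own reference before the producer drops its reference, otherwise the data is freed before it can be read.

```c
#include <stdlib.h>
#include <string.h>

/* Counts buffers that have not been freed yet (illustration only). */
static int live_buffers = 0;

/* Toy reference-counted buffer; not lwIP's struct pbuf. */
struct rc_buf {
  int refs;
  size_t len;
  unsigned char data[256];
};

/* Creates a buffer holding a copy of src, with one reference (the producer's). */
struct rc_buf *rc_buf_new(const void *src, size_t len) {
  struct rc_buf *b;
  if (len > sizeof(b->data)) return NULL;
  b = (struct rc_buf *)malloc(sizeof(*b));
  if (b == NULL) return NULL;
  b->refs = 1;
  b->len = len;
  memcpy(b->data, src, len);
  live_buffers++;
  return b;
}

/* Takes an additional reference, e.g. on behalf of the reading application. */
void rc_buf_ref(struct rc_buf *b) { b->refs++; }

/* Drops one reference and frees the buffer when none are left.
 * Returns 1 if the buffer was freed, 0 if it is still referenced. */
int rc_buf_release(struct rc_buf *b) {
  if (--b->refs > 0) return 0;
  free(b);
  live_buffers--;
  return 1;
}
```

In this model, if the application calls rc_buf_ref() before the stack calls rc_buf_release(), the data stays valid until the application releases it too. Without that extra reference, the stack's release frees the buffer immediately.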

pts commented 6 years ago

Thank you for fixing it, Earth works now:

$ git clone https://github.com/zerotier/libzt
$ (cd libzt && git checkout 2f904ccdc6ce6a4418058e5cf0f386eff60125ad)
$ (cd libzt && git submodule init)
$ (cd libzt && git submodule update)
$ (cd libzt && make patch)
$ (cd libzt && cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Release)
$ (cd libzt && cmake --build build)
(takes a few minutes to compile, no errors)
$ $EDITOR t1xp4.c  # Change the IPv4 address in line 10.
$ $EDITOR t1xp6.c  # Change the IPv6 address in line 10.
$ g++ -W -Wall -Werror t1xp4.c -Ilibzt/include libzt/bin/lib/libzt.a -lpthread
(no errors or warnings)
$ ./a.out
SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u3
(takes 6.759s on average)
$ g++ -W -Wall -Werror t1xp6.c -Ilibzt/include libzt/bin/lib/libzt.a -lpthread
(no errors or warnings)
$ ./a.out
SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u3
(takes 7.327s on average)

However, in the first 2 minutes after the server has joined Earth, the libzt client is not able to connect (over IPv6, I haven't tried IPv4); it reports (several times):

$ ./a.out
error connecting to remote host: -1 Software caused connection abort
$ ./a.out
error connecting to remote host: -1 Software caused connection abort

Ideally, I need a solution that works immediately (or within 1-2 seconds) after the server has joined the network.
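
As a possible client-side stopgap (it does not address the underlying delay), the connect step could be retried a bounded number of times. This is only a sketch: connect_with_retry, connect_fn and fail_n_times are hypothetical names, and the callback is assumed to create a fresh socket for each attempt (for example by wrapping zts_socket() and zts_connect()) and to return 0 on success.

```c
#include <stddef.h>
#include <time.h>

/* Callback type: returns 0 on success, a negative value on failure. */
typedef int (*connect_fn)(void *ctx);

/* Sleeps for roughly ms milliseconds. */
static void sleep_ms(unsigned ms) {
  struct timespec ts;
  ts.tv_sec = ms / 1000;
  ts.tv_nsec = (long)(ms % 1000) * 1000000L;
  nanosleep(&ts, NULL);
}

/* Hypothetical helper. Calls try_connect up to max_attempts times, waiting
 * delay_ms between failed attempts. Returns the number of attempts used on
 * success, or -1 if every attempt failed. */
int connect_with_retry(connect_fn try_connect, void *ctx,
                       int max_attempts, unsigned delay_ms) {
  int attempt;
  for (attempt = 1; attempt <= max_attempts; attempt++) {
    if (try_connect(ctx) == 0) return attempt;
    if (attempt < max_attempts && delay_ms > 0) sleep_ms(delay_ms);
  }
  return -1;
}

/* Stand-in callback for demonstration: fails *ctx times, then succeeds. */
int fail_n_times(void *ctx) {
  int *left = (int *)ctx;
  if (*left > 0) {
    (*left)--;
    return -1;
  }
  return 0;
}
```

With a counter of 2, connect_with_retry(fail_n_times, &counter, 5, 0) succeeds on the third attempt. A real callback would replace fail_n_times, and the attempt count and delay would need tuning against how long the network takes to become reachable.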

joseph-henry commented 6 years ago

I did replicate this once, but I'm not sure this is strictly a libzt issue. Some of the time has to be attributed to the SSH server startup or to a regular ZT instance (running on the server) joining the network. In some cases it can take ~30 seconds for something to become reachable. Do you notice this same delay using a non-libzt connection request?

pts commented 6 years ago

> some of the time has to be attributed to the SSH server startup

The SSH server has been running for months. It takes less than 0.3 seconds to get the SSH-2.0- response when I connect over TCP directly from the same client. So SSH server startup is not an issue here.

> a regular ZT instance (running on the server) joining the network

This could be the reason. Can it take up to 2 minutes for a node to fully join the ZT network after it has received a ZT IP address? How can the node itself query whether it has fully joined? (In my use case both the client and the server will join the ZT network, do their SSH communication, then leave the network. It's not OK that the server thinks it has already joined, but the client is unable to connect.)

> Do you notice this same delay using a non-libzt connection request?

I will update this issue as soon as I have my findings about this.

pts commented 6 years ago

I was able to reproduce the connection issue without using libzt; I've filed https://github.com/zerotier/libzt/issues/38 about that.

We can keep this issue open for TCP connection failures with libzt on the client.

joseph-henry commented 4 years ago

Since this ticket is so old and we have a separate discussion in #38 I'll close it. We can reopen this if it's still an issue.