zyclonite / zerotier-docker

ZeroTier One as Docker Image

substituting iptables-nft for iptables seems to fix broken NAT (at least on Raspberry Pi) #10

Closed. Paraphraser closed this issue 1 year ago.

Paraphraser commented 1 year ago

TL;DR - I think the iptables calls in entrypoint-bridge.sh should probably become iptables-nft.

But please take that with a few tons of salt - I'm not an iptables groupie and I don't keep track of developments in this area. I'm also a ZeroTier newbie.

Scenario:

  1. Raspberry Pi 4 Model B Rev 1.1 running Debian GNU/Linux 11 (bullseye) as full 64-bit OS.
  2. I want the Pi to forward traffic between ZeroTier and my home LAN.
  3. Ideally, ZeroTier runs in a Docker container.

I mucked about with zerotier/zerotier without really getting anywhere so I backed off to a native install.

The instructions I was following included three magic lines:

$ sudo iptables -t nat -A POSTROUTING -o $PHY_IFACE -j MASQUERADE
$ sudo iptables -A FORWARD -i $PHY_IFACE -o $ZT_IFACE -m state --state RELATED,ESTABLISHED -j ACCEPT
$ sudo iptables -A FORWARD -i $ZT_IFACE -o $PHY_IFACE -j ACCEPT

Assume all that is in place. From a remote client, pings to the Pi itself succeeded, but pings onward to other hosts on the home LAN went unanswered.

Sniffing showed that packets from the remote client were appearing on the home LAN with a source address in 10/8. The logical conclusion was that the last-hop NAT was failing. Somewhere along the way I stumbled across an error about iptables -t nat but I didn't record it and I'm dang'd if I can find it again. However, it led me on a journey which suggested that Bullseye on the Pi had switched to the nftables backend for iptables (iptables-nft). So, I replaced the magic three lines above with:

$ sudo iptables-nft -t nat -A POSTROUTING -o $PHY_IFACE -j MASQUERADE
$ sudo iptables-nft -A FORWARD -i $PHY_IFACE -o $ZT_IFACE -m state --state RELATED,ESTABLISHED -j ACCEPT
$ sudo iptables-nft -A FORWARD -i $ZT_IFACE -o $PHY_IFACE -j ACCEPT

and, bingo, everything worked.
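For anyone else chasing this, you can check which backend a given system's iptables is using. These are standard Debian commands; the output below is illustrative:

$ iptables -V
iptables v1.8.7 (nf_tables)

$ sudo update-alternatives --display iptables
iptables - auto mode
  link best version is /usr/sbin/iptables-nft
  link currently points to /usr/sbin/iptables-nft
  ...

If the version string says (legacy) rather than (nf_tables), the old backend is in play.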

Meanwhile, I was still considering alternatives for the container part of the problem. Finding this repo plus the Dockerfile.bridge and entrypoint-bridge.sh containing the three magic lines seemed highly promising.

Spin up zyclonite/zerotier:bridge. Same result as my first attempt with the native install: pings reached the Pi but nothing beyond it answered.

So, I clone this repo and mod entrypoint-bridge.sh:

$ git diff scripts/entrypoint-bridge.sh
diff --git a/scripts/entrypoint-bridge.sh b/scripts/entrypoint-bridge.sh
index 1d89214..c5628b6 100755
--- a/scripts/entrypoint-bridge.sh
+++ b/scripts/entrypoint-bridge.sh
@@ -7,8 +7,8 @@ fi

 PHY_IFACE=eth0
 ZT_IFACE="zt+"
-iptables -t nat -A POSTROUTING -o $PHY_IFACE -j MASQUERADE
-iptables -A FORWARD -i $PHY_IFACE -o $ZT_IFACE -m state --state RELATED,ESTABLISHED -j ACCEPT
-iptables -A FORWARD -i $ZT_IFACE -o $PHY_IFACE -j ACCEPT
+iptables-nft -t nat -A POSTROUTING -o $PHY_IFACE -j MASQUERADE
+iptables-nft -A FORWARD -i $PHY_IFACE -o $ZT_IFACE -m state --state RELATED,ESTABLISHED -j ACCEPT
+iptables-nft -A FORWARD -i $ZT_IFACE -o $PHY_IFACE -j ACCEPT

 exec "$@"

Then I build the container with those mods and, bingo, everything works.
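For completeness, the build itself is unremarkable. Something like this, run from the top of the clone, does it (the tag is just my choice):

$ docker build -f Dockerfile.bridge -t zerotier:bridge-nft .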

I'd normally just submit a PR but, as I said, I don't know the ins-and-outs of iptables vs iptables-nft, and I don't have anything other than Raspberry Pis to test on, so I really can't evaluate whether an unconditional switch to the -nft form would fail on other systems.

This might be the same problem reported by @outofsight in #9.

As the original designer of the "bridge" functionality, perhaps @red-avtovo might be able to comment too.

outofsight commented 1 year ago

I'm an absolute beginner but my understanding (after the discussion in #9) was that the intent of the -bridge variant was not to make a literal bridge (i.e. L2): iptables, after all, deals with IP at L3, I suppose...

I was able to make a real L2 bridge between ZT and the LAN but I had to give up on a Docker container and build it as an LXC container instead. There are still some aspects to perfect but it's working as expected. I'm not sure it could be done in Docker because of Docker's networking models; the most similar approach would be macvlan, I think.

Paraphraser commented 1 year ago

I think this is a really excellent repo+container which I intend to use so I definitely don't want to risk accidentally offending anyone by seeming to get too hung-up on terminology.

However, I do agree with your first sentence. I think it's routing, not bridging. The way I was taught, bridging moves frames within a single broadcast domain, while routing moves packets between broadcast domains. Applying that here:

With zyclonite/zerotier:bridge running as a Docker container, a packet sent from a remote client aimed at a host on the home LAN has a source IP address in the ZeroTier cloud and a destination IP address in the range for the home LAN. Those are two different broadcast domains. If packets are getting between those two end points then, according to what I was taught, routing (not bridging) must be occurring.


I am having no difficulty at all running zyclonite/zerotier:bridge as a Docker container. It's perfect for the problem I'm trying to solve.

Most of my Docker work is inside IOTstack, which is best thought of as a collection of docker-compose service definitions that adhere to some conventions.

Pretty much it boils down to a common top-level folder for persistent storage plus a better-than-even chance that potential port conflicts have been sorted out ahead of time for non-host-mode containers.

What I'm doing right now is:

  1. Clone this repo into my IOTstack templates folder and mod the relevant lines:

    $ cd ~/IOTstack/.templates
    $ git clone https://github.com/zyclonite/zerotier-docker.git zerotier
    $ cd zerotier/scripts
    $ sed -i 's/^iptables /iptables-nft /g' entrypoint-bridge.sh

    Cloning into .templates is just adhering to an IOTstack convention. It could be anywhere.

  2. The service definition in my ~/IOTstack/docker-compose.yml:

      zerotier:
        container_name: zerotier
        x-image: "zyclonite/zerotier:bridge"
        build:
          context: "./.templates/zerotier/."
          dockerfile: Dockerfile.bridge
        restart: unless-stopped
        environment:
          - TZ=Australia/Sydney
        network_mode: host
        x-ports:
          - "9993:9993"
        volumes:
          - ./volumes/zerotier:/var/lib/zerotier-one
        user: "0"
        devices:
          - "/dev/net/tun:/dev/net/tun"
        cap_add:
          - NET_ADMIN
          - SYS_ADMIN

    If you haven't seen the x- prefix before, it just causes docker-compose to ignore the entire clause. I could comment-out all the individual lines of each clause but I find x- is easier, particularly while I'm experimenting.

  3. With that container running, plus a static route configured in the ZeroTier web UI pointing 192.168.132.0/23 (slightly less-specific than the actual home LAN network 192.168.132.0/24) to the ZeroTier-assigned IP address of the Pi running this container, a remote client sees:

    $ traceroute -I 192.168.132.60
     1  10.244.210.253 (10.244.210.253)  89.626 ms  14.627 ms  2.994 ms
     2  192.168.132.60 (192.168.132.60)  4.089 ms  3.216 ms  3.289 ms

    In this case, .60 is another host on my home LAN (the Pi running the container is .102 on that LAN).

    In fact, the way I got that traceroute output into this reply was:

    $ traceroute -I 192.168.132.60 >trace.txt
    $ scp trace.txt marble:Desktop/trace.txt

    A lot of things had to go right for that scp to work. "marble" is a name in the ~/.ssh/config on the remote client. It points to a fully-qualified domain name in my home DNS. So, in addition to basic connectivity, resolving that fully-qualified domain name to its IP address on the home LAN involved (a) telling the ZeroTier web UI about the DNS server on my home LAN, and (b) telling the remote client to "Allow DNS configuration". Also, the lack of trust warnings and password prompts means that scp was happy that nothing fishy was going on with respect to my ssh keys and certificates.
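
    For context, the ~/.ssh/config entry on the remote client is the usual sort of thing; the domain name and user here are made up:

    Host marble
        HostName marble.internal.example.com
        User pi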

I can VNC from the remote client using the FQDN of machines on the home LAN.

I can remote-mount volumes.

Basically, with the iptables-nft patch in place, anything I try works.

The only nagging doubt I have at the moment is about the Pi running the container having both eth0 and wlan0 active into the same broadcast domain on the home LAN. As far as I can tell, eth0 is always first in the routing table so the NAT-forwarding arrangement will work so long as Ethernet is a viable path. I'm about to start experimenting with failover conditions.
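One way to check the ordering is to look at the default-route metrics. The output here is illustrative of a Pi running dhcpcd, where eth0 normally gets the lower (better) metric:

$ ip route show default
default via 192.168.132.1 dev eth0 src 192.168.132.102 metric 202
default via 192.168.132.1 dev wlan0 src 192.168.132.101 metric 303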

Bottom line: this is a brilliant container!

red-avtovo commented 1 year ago

Hey @Paraphraser! Nice ticket you've created. Going by my experience with a fully working setup (an rpi4 in the network + zerotier in docker, bridge configured), I assume you're missing one key point, which will solve the issue. The reason why you can't ping any IP on your LAN is the absence of a static route.

On the remote client, you should either forward all traffic via the tunnel in order to reach LAN clients, or add a particular route in the Managed Routes section of the ZeroTier portal.

[screenshot: Managed Routes section of ZeroTier Central]

For a site-to-site setup, you need to add static routes on your routers to pass the traffic from LAN#1 to LAN#2 and back, so your LAN clients can talk to each other transparently without even knowing that the traffic goes through a tunnel. For that, please add two routes.

router#1:

route add <LAN2 network> via <rpi4 LAN#1 eth address>

router#2:

route add <LAN1 network> via <rpi4 LAN#2 eth address>

assuming that you also have a Raspberry Pi 4 within the LAN#2 network.
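
To make that concrete with made-up numbers, if LAN#2 were 192.168.2.0/24 and the LAN#1 Pi's Ethernet address were 192.168.1.102, the route on a Linux-based router#1 would look like:

$ sudo ip route add 192.168.2.0/24 via 192.168.1.102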

Paraphraser commented 1 year ago

Hi @red-avtovo - no, that isn't it. The static route is already there and, from the remote client, I can ping the RPi itself.

But I could not ping beyond the RPi to other devices on the home LAN (the same network range as the static route).

tcpdump showed the ping requests were getting onto the Ethernet. The ping replies were not coming back because the iptables NAT rule wasn't working.
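
For reference, this is roughly the capture I was running on the Pi (the filter and target address are reconstructed for illustration):

$ sudo tcpdump -ni eth0 icmp and host 192.168.132.60

The echo requests appeared with a 10.x source address; no echo replies came back.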

Doing nothing except changing iptables to iptables-nft fixed the problem. NAT started working and ping replies started coming back. Not just pings of course but everything else like ssh.

So, the problem seems to be that plain iptables is broken - at least on my RPi running Bullseye, where the legacy backend is deprecated in favour of the nftables one (iptables-nft). What I don't understand is whether it is safe to switch unconditionally to iptables-nft - will that break on non-Bullseye systems?

Paraphraser commented 1 year ago

Hi @red-avtovo

I have also realised that the basic mechanism of:

  1. Set up iptables rules
  2. exec "$@"

means that if the daemon exits for any reason, the rules aren't removed.

Common situations where this happens are:

  1. Restart the container.
  2. Terminate and re-launch the container.
  3. The container runs as part of a compose stack, the machine reboots, the container resumes, but the daemon can exit (sometimes several times) if networking is not ready.

Basically, you can easily wind up with a whole pile of duplicate rules.
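
A minimal sketch of the kind of fix I have in mind - the rules mirror the ones the script already adds, but this exact code is mine, not the repo's:

#!/bin/sh

PHY_IFACE=eth0
ZT_IFACE="zt+"

# add the forwarding rules, as entrypoint-bridge.sh already does
iptables -t nat -A POSTROUTING -o $PHY_IFACE -j MASQUERADE
iptables -A FORWARD -i $PHY_IFACE -o $ZT_IFACE -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -i $ZT_IFACE -o $PHY_IFACE -j ACCEPT

# delete the same rules on the way out so container restarts don't
# accumulate duplicates (-D failing because a rule is already gone
# is harmless)
cleanup() {
   iptables -t nat -D POSTROUTING -o $PHY_IFACE -j MASQUERADE
   iptables -D FORWARD -i $PHY_IFACE -o $ZT_IFACE -m state --state RELATED,ESTABLISHED -j ACCEPT
   iptables -D FORWARD -i $ZT_IFACE -o $PHY_IFACE -j ACCEPT
}
trap cleanup EXIT INT TERM

# run the daemon in the background and wait on it, so the trap can
# fire when either the daemon dies or the container is stopped
"$@" &
wait $!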


I've been working on a Pull Request which I'm about to submit and I'd really appreciate it if you could take a look to see whether I've done anything dumb.

The setup works for me on a Raspberry Pi but that's all I've got to test with so I have no idea whether I've fallen into any traps that might've been avoided by someone with more experience.

There will be a full writeup accompanying the PR but, to summarise:

  1. Rename "bridge" to "router" throughout. I've done my best to follow through into the workflows too.

  2. Add tzdata to the container so TZ works and log entries can have local time.

  3. Support a bunch of environment variables including:

    • the ability to specify interfaces (defaults to eth0)
    • the ability to indicate whether iptables-nft should be used (defaults to iptables; see the sketch below)
    • the ability to pass one or more default ZeroTier network IDs so any "first launch" condition will auto-join (new hosts still need to be approved in ZeroTier Central, of course). The default is "do nothing" (same as now).
  4. Alter the router entry-point script so it starts the ZeroTier daemon as a detached process. This is similar to what happens with exec "$@", but it means the entry-point script can suspend on the daemon and then clean up when either the daemon exits or the script is terminated (eg the container is told to stop by Docker). I've tested this combo in as many ways as I can think of and it seems very reliable.
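
To illustrate the iptables-nft environment variable, the selection logic only needs to be something like this. The variable names are mine and the PR may spell them differently:

# choose the iptables implementation once, then use it everywhere
IPTABLES_CMD=iptables
if [ "${USE_IPTABLES_NFT:-false}" = "true" ] ; then
   IPTABLES_CMD=iptables-nft
fi

# every rule then goes through the chosen implementation, eg:
$IPTABLES_CMD -t nat -A POSTROUTING -o $PHY_IFACE -j MASQUERADE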

I added some documentation.

Paraphraser commented 1 year ago

Issues addressed by #12.