weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

Weave DNS is broken after updating Ubuntu 20.04 #3868

Open emfrias opened 3 years ago

emfrias commented 3 years ago

What you expected to happen?

Weave shouldn't crash when a container tries to resolve a hostname.

What happened?

Containers connected by weave are unable to communicate after applying Ubuntu OS updates. When a container using weave tries to access the network (probably only DNS), the weave container crashes.

How to reproduce it?

And I'm kicked out of my container. This worked fine before the apt-get upgrade, and it still works ok if weave isn't involved (if I omit eval $(weave env)).

Anything else we need to know?

The problem seems to be tied to an update to the Ubuntu systemd package 245.4-4ubuntu3.3 that was published on 2020-11-04. I've experienced it on Ubuntu 20.04 and 20.10. This version of systemd generates a line in /etc/resolv.conf which reads options edns0 trust-ad. The previous version (245.4-4ubuntu3) only generated the line options edns0, without the trust-ad. The new option triggers a bug in miekg/dns that was fixed a few years back: https://github.com/miekg/dns/commit/906238edc6eb0ddface4a1923f6d41ef2a5ca59b

I've tried removing trust-ad from resolv.conf and it does fix the crash on a simple test vm. On my "real" vms where I was using weave, containers were still unable to ping each other after getting rid of the crash, but that may be an unrelated problem.

Versions:

$ weave version
weave script 2.7.0
weave 2.7.0
$ docker version
Client:
 Version:           19.03.8
 API version:       1.40
 Go version:        go1.13.8
 Git commit:        afacb8b7f0
 Built:             Wed Oct 14 19:43:43 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.8
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.8
  Git commit:       afacb8b7f0
  Built:            Wed Oct 14 16:41:21 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.3.3-0ubuntu2
  GitCommit:        
 runc:
  Version:          spec: 1.0.1-dev
  GitCommit:        
 docker-init:
  Version:          0.18.0
  GitCommit:        
$ uname -a
Linux ubuntu-weave-test 5.4.0-53-generic #59-Ubuntu SMP Wed Oct 21 09:38:44 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Logs:

$ docker logs weave
[......]
panic: runtime error: slice bounds out of range [:9] with length 8

goroutine 604 [running]:
github.com/miekg/dns.ClientConfigFromReader(0x1fae140, 0xc0004bc228, 0x0, 0x0, 0xc0004bc228)
    /go/src/github.com/weaveworks/weave/vendor/github.com/miekg/dns/clientconfig.go:94 +0x823
github.com/miekg/dns.ClientConfigFromFile(0x7ffd25fa7efe, 0x23, 0x0, 0x0, 0x0)
    /go/src/github.com/weaveworks/weave/vendor/github.com/miekg/dns/clientconfig.go:29 +0xc8
github.com/weaveworks/weave/nameserver.(*upstream).Config(0xc000556780, 0x0, 0x0, 0x0)
    /go/src/github.com/weaveworks/weave/nameserver/dns.go:66 +0x197
github.com/weaveworks/weave/nameserver.(*handler).handleRecursive(0xc0005c3100, 0x20041e0, 0xc000790680, 0xc0004c5710)
    /go/src/github.com/weaveworks/weave/nameserver/dns.go:268 +0xf5
github.com/miekg/dns.HandlerFunc.ServeDNS(0xc000118e00, 0x20041e0, 0xc000790680, 0xc0004c5710)
    /go/src/github.com/weaveworks/weave/vendor/github.com/miekg/dns/server.go:84 +0x44
github.com/miekg/dns.(*ServeMux).ServeDNS(0xc000118d90, 0x20041e0, 0xc000790680, 0xc0004c5710)
    /go/src/github.com/weaveworks/weave/vendor/github.com/miekg/dns/server.go:210 +0x62
github.com/miekg/dns.(*Server).serve(0xc0004340c0, 0x1fc3d60, 0xc00210a090, 0x1fad080, 0xc000118d90, 0xc0024e6e00, 0x1c, 0x200, 0xc0000108d0, 0xc00047bba0, ...)
    /go/src/github.com/weaveworks/weave/vendor/github.com/miekg/dns/server.go:567 +0x271
created by github.com/miekg/dns.(*Server).serveUDP
    /go/src/github.com/weaveworks/weave/vendor/github.com/miekg/dns/server.go:523 +0x2a0

Cybernisk commented 3 years ago

@emfrias I ran into the same bug today. Thank you for the report, it gave me a shortcut to fixing the problem. The underlying issue is that weave still vendors an old version of miekg/dns, 1.0.4. I updated to v1.0.5 of miekg/dns (you can edit it in the go.mod file), which already has that bug fixed, built new weave images, and then used the newly built images on my hosts. Hope that helps you.

emfrias commented 3 years ago

Thanks @Cybernisk, that helped. I just built new images as you described and they've been working well so far. I'm new to this build system so I did it wrong a few times before getting a version that actually had the new version of the dns module. I wound up with:

git clone https://github.com/weaveworks/weave.git 
cd weave
go get github.com/miekg/dns@v1.0.5
go mod vendor
make

almereyda commented 3 years ago

Is it likely that this simple dependency update will find its way into the next release of the binary?

Unfortunately the described workaround does not work here as expected:

Step 9/15 : RUN go get     github.com/weaveworks/build-tools/cover     github.com/mattn/goveralls      golang.org/x/lint/golint        github.com/fzipp/gocyclo        github.com/fatih/hclfmt        github.com/client9/misspell/cmd/misspell
 ---> Running in 8700989776c1
cannot find package "github.com/hashicorp/hcl/hcl/printer" in any of:
        /usr/local/go/src/github.com/hashicorp/hcl/hcl/printer (from $GOROOT)
        /go/src/github.com/hashicorp/hcl/hcl/printer (from $GOPATH)
The command '/bin/sh -c go get     github.com/weaveworks/build-tools/cover     github.com/mattn/goveralls      golang.org/x/lint/golint        github.com/fzipp/gocyclo      github.com/fatih/hclfmt  github.com/client9/misspell/cmd/misspell' returned a non-zero code: 1
make: *** [Makefile:255: .build.uptodate] Error 1

Building succeeded with the following patch:

diff --git a/build/Dockerfile b/build/Dockerfile
index ae6a677..e47913e 100644
--- a/build/Dockerfile
+++ b/build/Dockerfile
@@ -49,6 +49,9 @@ RUN curl -fsSLo shfmt https://github.com/mvdan/sh/releases/download/v1.3.0/shfmt
        mv shfmt /usr/bin

 # Install common Go tools
+RUN GO111MODULE=on go get github.com/hashicorp/hcl@v1.0.0; \
+       mkdir -p /go/src/github.com/hashicorp; \
+       ln -s $PWD/pkg/mod/github.com/hashicorp/hcl@v1.0.0 $PWD/src/github.com/hashicorp/hcl
 RUN go get \
     github.com/weaveworks/build-tools/cover \
     github.com/mattn/goveralls \

After that, the Go version packaged with Ubuntu 20.04 turns out not to be up to date, and the build reports another issue about modules not being in sync. My workaround to get make to run to completion was to update Go to the latest version:

apt remove golang --purge --autoremove
curl -LO https://get.golang.org/$(uname)/go_installer && chmod +x go_installer && ./go_installer && rm go_installer

Unfortunately, this still didn't produce a weave executable that survives weave status without crashing.

It's a pity to see this break on a very common platform.

emfrias commented 3 years ago

You're right. I can't explain why, but my steps no longer work. The changes @almereyda mentions get it building again for me, though.

From a clean ubuntu:20.04 machine:

sudo apt -y install build-essential git docker.io
curl -LO https://get.golang.org/$(uname)/go_installer && chmod +x go_installer && ./go_installer && rm go_installer
. ~/.bash_profile
git clone https://github.com/weaveworks/weave.git 
cd weave

# patch build/Dockerfile using almereyda's patch above

go get github.com/miekg/dns@v1.0.5
go mod vendor
make

I didn't mention these steps earlier because I figured they'd be a bit different depending on your setup. We've just built images for weaveworks/weave:latest and its helpers on the local system. I run weave using the script you get from sudo curl -L git.io/weave -o /usr/local/bin/weave. That script will try to run weaveworks/weave:2.8.1 by default, and since we didn't build that, it will download it from docker hub and ignore the custom version we built. The simplest change is to edit the weave script:

--- weave.orig  2021-02-11 17:40:59.835349520 +0000
+++ /usr/local/bin/weave    2021-02-11 17:43:42.022209305 +0000
@@ -3,7 +3,7 @@

 [ -n "$WEAVE_DEBUG" ] && set -x

-SCRIPT_VERSION="2.8.1"
+SCRIPT_VERSION="unreleased"
 IMAGE_VERSION=latest
 [ "$SCRIPT_VERSION" = "unreleased" ] || IMAGE_VERSION=$SCRIPT_VERSION
 IMAGE_VERSION=${WEAVE_VERSION:-$IMAGE_VERSION}

and it will stick to using the latest tag we built. This should give you a version that works on this one machine.

I went a step further and pushed the new images to my private docker registry:

MY_DOCKER_REGISTRY=docker-registry.me.com
for image in weave weaveexec weave-kube weave-npc weavedb network-tester; do 
    sudo docker tag weaveworks/$image:latest $MY_DOCKER_REGISTRY/weaveworks/$image:latest
    sudo docker push $MY_DOCKER_REGISTRY/weaveworks/$image:latest
done

and then made one more edit to /usr/local/bin/weave:

--- weave.new   2021-02-11 18:08:12.349300226 +0000
+++ /usr/local/bin/weave    2021-02-11 18:09:46.034718622 +0000
@@ -12,7 +12,7 @@
 MIN_DOCKER_VERSION=1.10.0

 # These are needed for remote execs, hence we introduce them here
-DOCKERHUB_USER=${DOCKERHUB_USER:-weaveworks}
+DOCKERHUB_USER=${DOCKERHUB_USER:-docker-registry.me.com/weaveworks}
 BASE_EXEC_IMAGE=$DOCKERHUB_USER/weaveexec
 EXEC_IMAGE=$BASE_EXEC_IMAGE:$IMAGE_VERSION
 WEAVEDB_IMAGE=$DOCKERHUB_USER/weavedb:latest

Now I can distribute this patched version of /usr/local/bin/weave to all my servers and they'll get the patched version of weave. If you don't have a private registry set up, you could manually load your patched binaries on each of your other systems (I guess using something like docker load < weave.tar.gz), and also copy over the patched /usr/local/bin/weave.
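For hosts without a private registry, the docker save / docker load route mentioned above could look roughly like the sketch below. The image list is copied from the push loop earlier in this comment; the helper name weave_images and the archive name weave.tar.gz are only illustrative, and the docker commands are left as comments because they were not run as part of this thread.

```shell
# Hypothetical helper: prints the locally built image references,
# using the same image names as the push loop above.
weave_images() {
    for image in weave weaveexec weave-kube weave-npc weavedb network-tester; do
        echo "weaveworks/$image:latest"
    done
}

# On the build host (assumes the images were built as described above):
#   sudo docker save $(weave_images) | gzip > weave.tar.gz
# On each other host, after copying the archive over:
#   sudo docker load < weave.tar.gz
```

The patched /usr/local/bin/weave with SCRIPT_VERSION="unreleased" would still need to be copied to each host so the script picks the latest tag instead of pulling 2.8.1.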

It looks like you could just set environment variables rather than patching the weave binary if that's easier for you.
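The two diffs above suggest why this might work: the script only falls back to its built-in values when WEAVE_VERSION and DOCKERHUB_USER are unset. The sketch below copies just those default-handling lines into a hypothetical helper, resolve_exec_image, to show which image name results. It is not the full weave script, and the registry name is the placeholder used earlier in this comment.

```shell
# Hypothetical helper that mirrors only the variable defaults shown in the
# diffs above. The body runs in a subshell so it does not change the caller's
# variables.
resolve_exec_image() (
    SCRIPT_VERSION="2.8.1"
    IMAGE_VERSION=latest
    [ "$SCRIPT_VERSION" = "unreleased" ] || IMAGE_VERSION=$SCRIPT_VERSION
    IMAGE_VERSION=${WEAVE_VERSION:-$IMAGE_VERSION}
    DOCKERHUB_USER=${DOCKERHUB_USER:-weaveworks}
    echo "$DOCKERHUB_USER/weaveexec:$IMAGE_VERSION"
)

# With neither variable set, the stock defaults apply.
resolve_exec_image

# With both set, the custom registry and the "latest" tag are selected.
( DOCKERHUB_USER=docker-registry.me.com/weaveworks; WEAVE_VERSION=latest; resolve_exec_image )
```

If that holds for the real script, exporting DOCKERHUB_USER and WEAVE_VERSION before running weave would avoid both edits to /usr/local/bin/weave.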

lubars commented 3 years ago

This product is becoming increasingly difficult to justify when it doesn't run on Ubuntu 20.04. This issue has been open for six months - does Weaveworks actively monitor this forum?

monadic commented 3 years ago

Hi, I'm the Weaveworks CEO. We do keep an eye on these forums. At present we work on Weave Net for paying customers or as part of other commercial work.

Santinell commented 3 years ago

Any progress on this issue? It's a really disappointing situation. Almost a year has passed since the issue was opened.

withinboredom commented 3 years ago

At present we work on Weave Net for paying customers or as part of other commercial work.

Why would someone pay for something that is broken?

tomcs7be commented 2 years ago

It looks like the actual bug is in the vendored miekg/dns code. I'm not too experienced in Go, but it seems to check for a string length of 8 characters and then tries to slice the first 9, which would match the panic in the logs above (slice bounds out of range [:9] with length 8).

The default Ubuntu resolv.conf now includes the string "trust-ad", which is 8 characters long, and (I guess) line 94 breaks on it: https://github.com/weaveworks/weave/blob/master/vendor/github.com/miekg/dns/clientconfig.go

The upstream code seems to be fixed, so I think upgrading the vendored copy would be enough to solve this problem: https://github.com/miekg/dns/blob/master/clientconfig.go

I checked my hosts: weave works on the ones that have no 8-character entry in the options line of resolv.conf, but breaks on the ones that do. Weave can be launched with the --no-dns option, which gives me a working "weave status", but that way it wouldn't really be usable.
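To check a host against that observation, something like the following sketch could list the suspicious entries. It assumes, based only on the description above and not on the library source, that any option token of exactly 8 characters is a trigger; the function name is made up for illustration.

```shell
# Hypothetical check: print every token on an "options" line that is exactly
# 8 characters long (for example "trust-ad"). Reads /etc/resolv.conf unless
# another file is given as the first argument.
risky_resolv_options() {
    awk '$1 == "options" {
        for (i = 2; i <= NF; i++)
            if (length($i) == 8) print $i
    }' "${1:-/etc/resolv.conf}"
}

# Usage on a host:
#   risky_resolv_options            # checks /etc/resolv.conf
#   risky_resolv_options /some/file # checks another file
```

No output would mean no 8-character option was found in the file.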

Any idea for a workaround without messing up the automatic resolv.conf?

tomcs7be commented 2 years ago

I figured out a workaround that only needs an edit to the script: remove from resolv.conf the options that weave already ignores, and mount the resulting file instead.

+++ b/weave
@@ -136,6 +136,7 @@ exec_remote() {
         $(docker_run_options) \
         --pid host \
         $(exec_options "$@") \
+        -v /usr/local/bin/weave:/home/weave/weave \
         -e DOCKERHUB_USER="$DOCKERHUB_USER" \
         -e WEAVE_VERSION \
         -e WEAVE_DEBUG \
@@ -1167,14 +1168,7 @@ launch() {

     # Figure out the location of the actual resolv.conf file because
     # we want to bind mount its directory into the container.
-    if [ -L ${HOST_ROOT:-/}/etc/resolv.conf ]; then # symlink
-        # This assumes a host with readlink in FHS directories...
-        # Ideally, this would resolve the symlink manually, without
-        # using host commands.
-        RESOLV_CONF=$(chroot ${HOST_ROOT:-/} readlink -f /etc/resolv.conf)
-    else
-        RESOLV_CONF=/etc/resolv.conf
-    fi
+    RESOLV_CONF=/etc/resolv.weave.conf
     RESOLV_CONF_DIR=$(dirname "$RESOLV_CONF")
     RESOLV_CONF_BASE=$(basename "$RESOLV_CONF")

It uses the file /etc/resolv.weave.conf. In my case I generate it at boot time with systemd, simply by running sed over the original resolv.conf to remove the trust-ad option.
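One way to produce such a file is sketched below. The comment above used sed; this sketch uses awk instead so that whole tokens are compared, and the function name strip_trust_ad is invented for the example. How it is hooked into boot (a systemd unit or otherwise) is left open.

```shell
# Hypothetical filter: copy resolv.conf from stdin to stdout, dropping the
# trust-ad token from any "options" line. An options line that ends up with
# no tokens is omitted entirely.
strip_trust_ad() {
    awk '
        $1 == "options" {
            kept = ""
            for (i = 2; i <= NF; i++)
                if ($i != "trust-ad") kept = kept " " $i
            if (kept != "") print "options" kept
            next
        }
        { print }
    '
}

# Usage (as root), writing the file the patched script expects:
#   strip_trust_ad < /etc/resolv.conf > /etc/resolv.weave.conf
```

The original /etc/resolv.conf is left untouched, so the systemd-managed file keeps working for the rest of the host.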

karser commented 2 years ago

Encountered this issue on ubuntu 22.04 because of options edns0 trust-ad in my /etc/resolv.conf. What's weird is that on ubuntu 20.04 it was options edns0 and weave was working well.

jashwanthsj commented 1 year ago

We're facing the same issue with weave on Ubuntu Server 20.04. Is there any workaround that doesn't require modifying resolv.conf?

withinboredom commented 1 year ago

Switch to Calico? You get all the same features and more.

jashwanthsj commented 1 year ago

@withinboredom We need this for a Nomad cluster. We're currently using weave for our jobs, and we couldn't find any documentation on using Calico with a Nomad cluster.

tyteen4a03 commented 1 year ago

Hi, are there any updates to this?