JensVD opened this issue 2 years ago
Hello
Has anybody been able to look at this issue? This week it has happened more than three times in each environment, forcing us to uninstall and reinstall the plugin, after which it works again. It's very strange that traffic between containers gets blocked when using this plugin.
All help is greatly appreciated!
Thank you!
Hi @JensVD,
First off, I am not part of the Weaveworks team, ok? Just a fellow dev trying to help out.
So... I've experienced a similar issue recently. At first, services and containers deployed fine and could communicate, but after a while they lost communication with each other. It turned out the problem happened when we scaled or updated a service, or when one of its containers crashed and was brought back up by Swarm. The container would come back up with an out-of-sync IP address. E.g.: "docker network inspect" stated that container A had 10.32.0.5, but Swarm's DNS resolved it to something else, usually one number higher, i.e. 10.32.0.6.
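One way to spot this kind of mismatch is to compare the addresses `docker network inspect` reports with what Docker's embedded DNS returns for the same container names. The sketch below assumes Python 3 is available inside a container attached to the network (the embedded DNS server is only reachable from inside such a container) and that container names resolve through that DNS; `check_ips.py` and all function names are hypothetical. Note that on a Swarm-scoped network, `docker network inspect` generally lists only the containers running on the node where you run it, so you may need to repeat the check per node.

```python
# check_ips.py: hypothetical helper, run inside a container on the Weave network.
from __future__ import annotations

import json
import socket


def inspect_ips(inspect_output: str) -> dict[str, str]:
    """Map container name -> IPv4 address from `docker network inspect` JSON."""
    ips: dict[str, str] = {}
    for net in json.loads(inspect_output):  # the CLI prints a JSON array
        for key, endpoint in (net.get("Containers") or {}).items():
            if key.startswith("lb-"):
                continue  # assumption: Swarm lists its load-balancer endpoint as "lb-<network>"
            addr = endpoint.get("IPv4Address", "")
            if addr:
                ips[endpoint["Name"]] = addr.split("/")[0]  # drop the prefix length, e.g. /12
    return ips


def resolve_all(names) -> dict[str, str]:
    """Ask the resolver (Docker's embedded DNS, inside a container) for each name."""
    resolved: dict[str, str] = {}
    for name in names:
        try:
            resolved[name] = socket.gethostbyname(name)
        except socket.gaierror:
            pass  # unresolved names are reported by find_mismatches
    return resolved


def find_mismatches(expected: dict[str, str], resolved: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {name: (inspect_ip, dns_ip)} for every name whose addresses differ."""
    return {
        name: (ip, resolved.get(name, "<unresolved>"))
        for name, ip in expected.items()
        if resolved.get(name) != ip
    }


def report(inspect_output: str) -> dict[str, tuple[str, str]]:
    """Compare inspect output with live DNS answers; an empty dict means they agree."""
    expected = inspect_ips(inspect_output)
    return find_mismatches(expected, resolve_all(expected))
```

For example, after copying `check_ips.py` into the working directory of a container on the network (`<container>` is a placeholder): `docker network inspect myNetwork | docker exec -i <container> python3 -c "import sys, check_ips; print(check_ips.report(sys.stdin.read()))"`. An address that is off by one, as described above, would show up as a `(inspect_ip, dns_ip)` pair.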
I figured out that the problem was caused by the --attachable flag when creating the network with the weaveworks/net-plugin driver. Just don't use --attachable and it should be fine. I don't know exactly why that happens; perhaps it's a bug, or perhaps it's by design. In most cases there's no good reason to use that flag anyway, because once the network is integrated into Swarm, Swarm can always schedule service containers onto it, regardless of whether it is attachable or not. The exception is if you want to attach standalone containers that aren't managed as Swarm services, but then you'll probably be better off with more specialized tools for that task, like Consul, which I believe you can integrate into your Weave network.
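To check whether an existing network was created with `--attachable`, the `Attachable` field in `docker network inspect` output records it. A minimal sketch, assuming Python 3 on a manager node; the helper names are hypothetical, and matching on a driver name that starts with `weaveworks/net-plugin` is an assumption about how the plugin is tagged on your nodes.

```python
# Hypothetical helper: list Weave plugin networks that were created with --attachable.
from __future__ import annotations

import json
import subprocess


def attachable_weave_networks(inspect_output: str) -> list[str]:
    """Names of networks whose driver is the Weave plugin and whose Attachable flag is true."""
    return [
        net["Name"]
        for net in json.loads(inspect_output)
        if net.get("Driver", "").startswith("weaveworks/net-plugin")
        and net.get("Attachable", False)
    ]


def inspect_all_networks() -> str:
    """Run `docker network inspect` on every network ID from `docker network ls -q`."""
    ids = subprocess.run(
        ["docker", "network", "ls", "-q"], capture_output=True, text=True, check=True
    ).stdout.split()
    if not ids:
        return "[]"  # no networks: return an empty JSON array
    return subprocess.run(
        ["docker", "network", "inspect", *ids], capture_output=True, text=True, check=True
    ).stdout
```

Usage: `print(attachable_weave_networks(inspect_all_networks()))`. Docker has no command to change this flag on an existing network, so dropping it means removing the network and creating it again without `--attachable`.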
In addition to that, depending on your Docker Engine version, you may or may not need a template network. When I started my project, I was on Docker Engine 19. Back then, for Weave Net to work properly, I had to declare a template network on each node, like this:
docker network create --config-only --subnet 10.32.0.0/12 --driver weaveworks/net-plugin:2.8.1 --gateway 10.32.0.1 myTemplateNetwork
And only then create the actual network on the manager node:
docker network create --config-from myTemplateNetwork --scope swarm myNetwork
If I didn't do that, I would face communication problems with my services, too.
Ever since I upgraded to Docker Engine 20+, not only do I no longer need the template, it doesn't even work if I try to use it. I can simply skip that part and define everything in the actual network, and it works like a charm. In other words, I now need just a single command:
docker network create --subnet 10.32.0.0/12 --gateway 10.32.0.1 --scope swarm --driver weaveworks/net-plugin:2.8.1 myNetwork
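The two procedures in this comment can be summarized as a small command builder that picks one based on the Docker Engine major version. This sketch encodes only this commenter's experience (template network on Engine 19, single command on 20+), not documented plugin behavior; the function name and the gateway-in-subnet check are additions for illustration. The engine version can be read with `docker version --format '{{.Server.Version}}'`.

```python
# Hypothetical helper: build the `docker network create` argv lists described above.
from __future__ import annotations

import ipaddress

DRIVER = "weaveworks/net-plugin:2.8.1"


def network_create_commands(
    engine_version: str,
    name: str = "myNetwork",
    subnet: str = "10.32.0.0/12",
    gateway: str = "10.32.0.1",
    template: str = "myTemplateNetwork",
) -> list[list[str]]:
    # Catch typos early: the gateway must lie inside the subnet.
    if ipaddress.ip_address(gateway) not in ipaddress.ip_network(subnet):
        raise ValueError(f"gateway {gateway} is outside subnet {subnet}")

    major = int(engine_version.split(".")[0])  # "19.03.12" -> 19
    if major < 20:
        # Engine 19: config-only template on every node, then the real network on a manager.
        return [
            ["docker", "network", "create", "--config-only", "--subnet", subnet,
             "--driver", DRIVER, "--gateway", gateway, template],
            ["docker", "network", "create", "--config-from", template,
             "--scope", "swarm", name],
        ]
    # Engine 20+: a single command on a manager node, no template.
    return [
        ["docker", "network", "create", "--subnet", subnet, "--gateway", gateway,
         "--scope", "swarm", "--driver", DRIVER, name],
    ]
```

For example, `for argv in network_create_commands("20.10.7"): print(" ".join(argv))` prints the single command shown above. For Engine 19 the builder only returns the commands; running the first one on every node is still up to you.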
Hope I've been able to help.
PS: It's also a good idea to check the output of "weave report"; that's the best way to find out what might be going wrong with your Weave service. Also, be sure to allow TCP port 6783 through your firewall (Weave Net also uses UDP ports 6783 and 6784).
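To test the firewall advice from one node, a plain TCP connect to port 6783 on each peer is a quick check. A minimal sketch, assuming Python 3 on the node; the helper name is hypothetical. It only covers TCP: UDP traffic (Weave Net also uses UDP 6783 and 6784) can't be verified by a connect and still has to be checked in the firewall rules.

```python
# Hypothetical reachability check for Weave Net's TCP control port.
import socket

WEAVE_TCP_PORT = 6783


def tcp_port_open(host: str, port: int = WEAVE_TCP_PORT, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, name resolution failure...
        return False
```

Usage, with placeholder peer addresses: `for peer in ["192.0.2.10", "192.0.2.11"]: print(peer, tcp_port_open(peer))`.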
Cheers!
What did you expect to happen?
We expected the 'weaveworks/net-plugin' Docker plugin to keep working after it was installed and configured. It always works at first, but it should keep working.
What happened?
We installed the 'weaveworks/net-plugin' Docker plugin on our Docker Swarm cluster. Initially everything works, but after a while it starts breaking. So far we have identified two scenarios in which all traffic between the two interfaces on multiple hosts fails when using weaveworks/net-plugin as the Docker overlay network:
It appears that several iptables rules are added or deleted, causing the traffic to break.
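To pin down which rules change, one option is to save the ruleset with `iptables-save` while traffic still works and compare it with a copy taken after it breaks. A minimal sketch, assuming Python 3 and root access for `iptables-save`; the helper names are hypothetical. It compares only rule lines (`-A ...`) per table and skips chain headers, whose packet counters change between saves. Weave Net's own rules normally live in chains whose names start with `WEAVE`, so filtering the result on that prefix may narrow things down.

```python
# Hypothetical helper: diff two `iptables-save` snapshots rule by rule.
from __future__ import annotations


def parse_rules(saved: str) -> set[tuple[str, str]]:
    """Return {(table, rule)} for every '-A ...' line, tracking the current '*table'."""
    rules: set[tuple[str, str]] = set()
    table = ""
    for line in saved.splitlines():
        line = line.strip()
        if line.startswith("*"):
            table = line[1:]          # e.g. "*filter" -> "filter"
        elif line.startswith("-A "):
            rules.add((table, line))  # chain headers (":CHAIN ..."), comments and COMMIT are skipped
    return rules


def diff_rules(before: str, after: str) -> tuple[set[tuple[str, str]], set[tuple[str, str]]]:
    """Return (added, removed) rule sets between two snapshots."""
    old, new = parse_rules(before), parse_rules(after)
    return new - old, old - new
```

For example, run `sudo iptables-save > before.txt` while things work and `sudo iptables-save > after.txt` once they don't, then call `diff_rules(open("before.txt").read(), open("after.txt").read())`. Because it uses sets, a rule that merely moved within its chain won't show up; for ordering changes, a plain line diff of the two files is the better tool.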
How to reproduce it?
The first scenario is fairly easy to reproduce:
The second one, on the other hand, we are not sure how to reproduce, as it happens at random intervals. We have noticed that it happens much more often in our test environment, where the services are updated more frequently. A way to reproduce it might be:
Anything else we need to know?
The infrastructure is as follows:
Versions:
Logs:
Network:
If you require any more information, just let us know and we'll see what we can do.
Thank you!