Closed bergtwvd closed 8 years ago
Do you get the same problem when fast datapath is disabled (set WEAVE_NO_FASTDP=1
when launching weave)?
I have seen this problem before, in earlier versions of Weave (without the latest fast datapath addition). To get a grip on this issue, what other information should I provide, or what data can I collect, and how?
When the problem occurs, can you access the problematic containers and manually check connectivity (e.g. with ping)?
I ran the test until the issue popped up again:
[Shipsim] Connection successful. hla.rti1516e.exceptions.RTIinternalError: Joining federate 'Shipsim1' cannot reach existing federate 'Shipsim<6>' at tcp:10.40.0.1;172.17.0.6/6000,udp:10.40.0.1/5000,host:10.40.0.1,address:10.40.0.1;172.17.0.6/6000 (909056008)
From another shell I looked up the IP addresses of the other containers in the simulation:
ubuntu@docker-1B:~$ weave dns-lookup crc
10.32.0.1
ubuntu@docker-1B:~$ weave dns-lookup web
10.44.0.0
ubuntu@docker-1B:~$ weave dns-lookup ge
10.40.0.0
ubuntu@docker-1B:~$ weave dns-lookup ship1
10.40.0.1
ubuntu@docker-1B:~$ weave dns-lookup ubuntu_trafficsim_1
10.43.255.255
ubuntu@docker-1B:~$ weave dns-lookup ubuntu_shipsim_1
10.32.0.2
ubuntu@docker-1B:~$ weave dns-lookup ship2
10.44.0.1
And for each one I did:
ubuntu@docker-1B:~$ docker exec b0cf69b63ca7 ping < address >
And all pings succeeded! I suspect that the network had settled by the time I did the ping.
I wonder whether the Weave network needs some time to settle after the addition of new containers and IP addresses? I do not like to build an arbitrary sleep into each container start script. Please advise what I should do next.
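Rather than an arbitrary sleep, one option is to poll until the peer is actually reachable and only then proceed. A minimal sketch in Python, assuming a TCP-reachable peer; the host, port, and timings are placeholders, not anything from the RTI:

```python
import socket
import time

def wait_for_peer(host, port, timeout_s=30.0, interval_s=0.5):
    """Poll until a TCP connect to (host, port) succeeds, instead of
    sleeping a fixed amount of time in the container start script.

    Returns True as soon as the peer accepts a connection, False once
    timeout_s has elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(interval_s)
    return False
```

The same idea works as a shell loop around ping or nc in an entrypoint script; the point is to wait for an observable condition instead of a fixed duration.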
Thanks @bergtwvd. I have some questions about these logs:
[ShipSim3] Connection successful.
hla.rti1516e.exceptions.RTIinternalError: Joining federate 'ShipSim31' cannot reach existing federate 'ShipSim<4>' at tcp:10.40.0.1;172.17.0.6/6000,udp:10.40.0.1/5000,host:10.40.0.1,address:10.40.0.1;172.17.0.6/6000 (909056008)
... followed by stacktrace
What is the delay (timestamps would help) between ShipSim3 successfully connecting to the CRC (and hence being told the addresses of the other federates) and the exception being reported by ShipSim3 that it cannot connect to them?
It's also interesting that federates appear to be registering both their weave (e.g. 10.40.0.1) and docker (e.g. 172.17.0.6) addresses with the CRC. Presuming that you're not running the Docker overlay network, those 172.17.0.0/16 addresses will not be reachable cross-host; in fact, unless you've manually specified different subnets for Docker on each host, ShipSim3 might be trying to talk to 172.17.0.6 on its own host, where that address may be unallocated or allocated to some random container (potentially including itself!). Is the software open source by any chance? I would be interested to see how it manages connecting to federates when it is given more than one IP address to try; it's possible that it is attempting to use the docker IP sometimes and failing.
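To illustrate the mix-up risk: weave allocates from 10.32.0.0/12 (per the `weave status` output elsewhere in this thread) while the default docker bridge uses 172.17.0.0/16, so a client handed both addresses could filter out the host-local ones up front. A hypothetical sketch; `reachable_candidates` is not part of any real RTI API:

```python
import ipaddress

# Assumed ranges: weave's default allocation subnet (from "weave status").
WEAVE_SUBNET = ipaddress.ip_network("10.32.0.0/12")

def reachable_candidates(addresses):
    """Drop advertised addresses outside the weave subnet, since docker
    bridge addresses (172.17.0.0/16) are only valid on their own host."""
    return [a for a in addresses if ipaddress.ip_address(a) in WEAVE_SUBNET]

print(reachable_candidates(["10.40.0.1", "172.17.0.6"]))  # ['10.40.0.1']
```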
W.r.t. the timestamp and stacktrace, I have forwarded the request to the developer (located elsewhere in the world), so this may take some time.
I will check if I can restrict the network interface that each federate uses to the Weave interface. Each federate embeds a "Local RTI Component" that can be configured to some extent.
The software is currently not open source, although the RTI interface specification is an IEEE standard (1516-2010).
Logging has been added to the federates in question. I tried to reproduce the problem with logging enabled; no luck so far (quite inherent to race conditions). Also, I noticed that the WAN is much more responsive at the moment, so I probably need to wait until there is more traffic and things slow down. (The test involves two VMs in IRL and one in GER.)
W.r.t. the network adapter, the federate has an eth0 interface and an ethwe interface. I am not sure whether forcing the federate to use a specific interface matters. The sequence of steps is roughly:
In previous runs the step with the connect succeeds, but the step with the join fails.
I managed to reproduce it. Here is the log file from the federate in question:
ubuntu@docker-2A:~$ docker exec 8757cd9aed6c more /root/prti1516e/trace/LRC-151119_18.35.29-0.log
::::::::::::::
/root/prti1516e/trace/LRC-151119_18.35.29-0.log
::::::::::::::
T: 18:35:29.276 fed0 << connect(HLA_IMMEDIATE, "crcAddress=crc:8989")
T: 18:35:29.484 fed0 << connect
T: 18:35:29.485 fed0 << createFederationExecution("TheWorld", file:/tmp/shipsim/RPRFOM.xml)
T: 18:35:29.489 fed0 << createFederationExecution => hla.rti1516e.exceptions.FederationExecutionAlreadyExists: Federation TheWorld already exists (909000000)
T: 18:35:29.489 fed0 << joinFederationExecution("ShipSim", "TheWorld")
T: 18:35:29.606 fed6 << joinFederationExecution => hla.rti1516e.exceptions.RTIinternalError: Joining federate 'ShipSim2' cannot reach existing federate 'ShipSim1<5>' at tcp:10.44.0.1;172.17.0.6/6000,udp:10.44.0.1/5000,host:10.44.0.1,address:10.44.0.1;172.17.0.6/6000 (909056008)
T: 18:35:29.643 fed6 << getObjectClassHandle("BaseEntity.PhysicalEntity.Platform.SurfaceVessel")
T: 18:35:29.643 fed6 << getObjectClassHandle => hla.rti1516e.exceptions.FederateNotExecutionMember: getObjectClassHandle (909001001)
T: 18:35:29.647 fed6 << registerObjectInstance(null)
T: 18:35:29.647 fed6 << registerObjectInstance => hla.rti1516e.exceptions.FederateNotExecutionMember: registerObjectInstance (909001001)
The log entries after the "cannot reach existing federate" entry are a bug in the code, as the application should not progress when it has not joined the federation execution. In joinFederationExecution, the connection to all federates in the federation execution is set up.
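The intended behaviour described above could be sketched as: connect to every existing federate, and if any single connection fails, tear down the partial state and abort the join instead of carrying on. A hypothetical illustration, not the RTI's actual code; `connect` is an injected callable:

```python
def join_federation(existing_federates, connect):
    """Connect to every existing federate; abort the whole join on the
    first failure (the log above shows the buggy alternative: continuing
    after joinFederationExecution has already failed)."""
    established = []
    for name, address in existing_federates:
        try:
            established.append(connect(address))
        except OSError as exc:
            for conn in established:  # tear down partially built state
                conn.close()
            raise RuntimeError(
                "cannot reach existing federate %r at %s" % (name, address)
            ) from exc
    return established
```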
The container entrypoint is set to start this Java application (java -cp .... code.jar). The elapsed time from entrypoint to the connect is, I suspect, very small.
After I got the log file, I did a ping from the container to ShipSim1<5> with:
ubuntu@docker-2A:~$ docker exec 8757cd9aed6c ping 10.44.0.1
PING 10.44.0.1 (10.44.0.1) 56(84) bytes of data.
64 bytes from 10.44.0.1: icmp_seq=1 ttl=64 time=19.4 ms
64 bytes from 10.44.0.1: icmp_seq=2 ttl=64 time=19.3 ms
64 bytes from 10.44.0.1: icmp_seq=3 ttl=64 time=19.4 ms
...
Network interface information:
ubuntu@docker-2A:~$ docker exec 8757cd9aed6c ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:ac:11:00:06
inet addr:172.17.0.6 Bcast:0.0.0.0 Mask:255.255.0.0
inet6 addr: fe80::42:acff:fe11:6/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9001 Metric:1
RX packets:12 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:913 (913.0 B) TX bytes:882 (882.0 B)
ethwe Link encap:Ethernet HWaddr fe:fb:5d:8e:17:c0
inet addr:10.32.0.3 Bcast:0.0.0.0 Mask:255.240.0.0
inet6 addr: fe80::fcfb:5dff:fe8e:17c0/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1410 Metric:1
RX packets:5751 errors:0 dropped:0 overruns:0 frame:0
TX packets:5753 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:566081 (566.0 KB) TX bytes:584954 (584.9 KB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:554 errors:0 dropped:0 overruns:0 frame:0
TX packets:554 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:27700 (27.7 KB) TX bytes:27700 (27.7 KB)
And routing information:
ubuntu@docker-2A:~$ docker exec 8757cd9aed6c route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default ip-172-17-0-1.e 0.0.0.0 UG 0 0 0 eth0
10.32.0.0 * 255.240.0.0 U 0 0 0 ethwe
172.17.0.0 * 255.255.0.0 U 0 0 0 eth0
224.0.0.0 * 240.0.0.0 U 0 0 0 ethwe
Thanks @bergtwvd. This is interesting - the delay between
T: 18:35:29.489 fed0 << joinFederationExecution("ShipSim", "TheWorld")
(which is presumably after the federate has received addresses from the CRC and before it is about to connect to them) and the failure
T: 18:35:29.606 fed6 << joinFederationExecution => hla.rti1516e.exceptions.RTIinternalError: Joining federate 'ShipSim2' cannot reach existing federate 'ShipSim1<5>' at tcp:10.44.0.1;172.17.0.6/6000,udp:10.44.0.1/5000,host:10.44.0.1,address:10.44.0.1;172.17.0.6/6000 (909056008)
is very short - about 100ms. This suggests to me that something is providing an immediate negative response - otherwise you would expect the delay to be longer due to retry attempts. Can you tell me what is involved in the connection process - is it TCP? Does it retry? Are you setting any socket options or a connection timeout? It would be really helpful to see the underlying cause Throwable of the RTIinternalError, if possible.
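The ~100 ms failure can also be probed directly. A sketch of timing a single TCP connect (host and port are placeholders): a closed port is actively refused almost instantly via RST, while an address that silently drops packets only fails after the full timeout, so the elapsed time distinguishes an active refusal from exhausted retries.

```python
import socket
import time

def timed_connect(host, port, timeout_s=5.0):
    """Attempt one TCP connect and report (succeeded, elapsed_seconds).

    Near-instant failure points at an active refusal (RST, or a local
    routing error); failure after ~timeout_s points at dropped packets.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start
```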
I have consulted with the supplier of the RTI to understand what is happening (network-wise) in the connect and joinFederationExecution methods.
In summary, for a two-federate federation with LRC1 and LRC2:
connect by any LRC:
joinFederationExecution by LRC1:
joinFederationExecution by LRC2:
The key is that an LRC advertises multiple addresses by default, which allows the kind of IP mix-up you mentioned in an earlier post. I did a test forcing the LRC to use only one host interface, and so far that seems to work.
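For forcing a single interface, one way to discover the right local address to bind is the UDP "connect trick": no packets are actually sent, but the kernel performs a route lookup, and getsockname() then reveals the source address it would use for a given peer. A sketch under that assumption (the peer IP is just an example):

```python
import socket

def local_address_for(peer_ip):
    """Return the local source address the kernel would use to reach
    peer_ip. Connecting a UDP socket transmits nothing; it only makes
    the kernel pick a route, which getsockname() exposes. Binding the
    LRC to this one address (e.g. the ethwe address for a weave peer)
    avoids advertising the unreachable docker bridge address."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((peer_ip, 9))  # port 9 (discard) is arbitrary here
        return s.getsockname()[0]
```

For example, `local_address_for("10.44.0.1")` run inside the container above should return its ethwe address, per the routing table shown earlier.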
I'll close this issue because I think the problem has been identified and solved.
Thanks for your help.
I have the following environment set up in AWS:
Three Docker Hosts, located at:
docker-1A@Dublin, Public IP X.X.X.X
docker-2A@Dublin, Public IP Y.Y.Y.Y
docker-1B@Frankfurt, Public IP Z.Z.Z.Z
OS is Ubuntu 15.04, Docker 1.9, latest versions of Weave, Compose, Swarm.
The initial set of containers per Docker Host is (see attached files on the details how they are started):
The three Weave scope containers are interconnected. The three weave router containers are interconnected. The three consul containers are interconnected.
The user application consists of a set of containers that together make up a distributed simulation. There is a central component called the CRC that registers newcomers. When a container joins the simulation, it first registers itself at the CRC. The CRC passes back information about the other containers that have already joined, so that the newly joined container can connect to all the others. In the end the containers all interconnect in a mesh network that they build up, where each container has a connection to every other container.
The problem is that I intermittently see an error that a new container successfully connected to the CRC, but that it cannot connect to another already running container somewhere else in the Weave network.
The error log looks like:
[ShipSim3] Connection successful.
hla.rti1516e.exceptions.RTIinternalError: Joining federate 'ShipSim31' cannot reach existing federate 'ShipSim<4>' at tcp:10.40.0.1;172.17.0.6/6000,udp:10.40.0.1/5000,host:10.40.0.1,address:10.40.0.1;172.17.0.6/6000 (909056008)
... followed by stacktrace
Looking at Scope, I see that federate 'ShipSim<4>' runs on docker-1A. The CRC runs on docker-1B. And the new federate 'ShipSim31' runs on docker-2A.
The container called federate 'ShipSim<4>' is already running on IP 10.40.0.1. It seems that the network route from the new federate 'ShipSim31' to the existing federate 'ShipSim<4>' has not been set up yet. I.e. the CRC told the new federate about an existing federate, but the network route seems not yet established.
Since I run on a WAN, it may be more likely that race conditions (if any) show up. Please advise if you need more information, or even access to the test environment in AWS.
Here is a dump of more information:
ubuntu@docker-1A:~$ weave status
PeerDiscovery: enabled
Targets: 0
Connections: 2 (2 established)
Peers: 3 (with 6 established connections)
DefaultSubnet: 10.32.0.0/12
ubuntu@docker-1A:~$ weave status connections
<- Y.Y.Y.Y:39418 established fastdp 42:9d:fe:58:9a:2a(ip-172-31-9-127)
<- Z.Z.Z.Z:44967 established fastdp 26:40:3a:3b:c7:16(ip-172-31-27-45)
ubuntu@docker-2A:~$ weave status dns
crc                 10.44.0.0      d661cf60ba74 26:40:3a:3b:c7:16
ge                  10.40.0.2      555ddbb82c45 b6:e5:b8:81:59:27
ubuntu_shipsim2_1   10.32.0.2      a9c8b4d37de8 42:9d:fe:58:9a:2a
ubuntu_shipsim3_1   10.40.0.0      942c2826364c b6:e5:b8:81:59:27
ubuntu_shipsim_1    10.40.0.1      9273d45827d7 b6:e5:b8:81:59:27
ubuntu_trafficsim_1 10.43.255.255  05606f5f6d89 26:40:3a:3b:c7:16
web                 10.32.0.1      794e4328dd40 42:9d:fe:58:9a:2a
docker-1A.txt docker-1B.txt docker-2A.txt