Closed tmo-rshea3 closed 2 years ago
My short answer is:
Both Parodus expects Pings from Talaria & Talaria expects Pongs from Parodus, regardless of other traffic at the moment. The 180s disconnect is suspiciously around where they will generally give up on each other. Normally the pings/pongs happen every 45s & up to 2 missed ones are allowed (it might be 1). That gets close to what you're seeing. I don't see any pings or pongs in the logs for clues.
I'll try to dig a bit deeper into this later this week.
That was my initial thought too, but I dug into the Parodus dependencies, and it looks like the noPoll WebSocket Toolkit automatically sends pongs after receiving a ping frame. Also, I wouldn't think the ping/pong would matter when making actual requests like this.
You can see one ping received in parodus.log
:
parodus | [1660607047][PARODUS][Info]: Ping received with payload mac:11aa22bb33cc, opcode 9
I was also able to see the count of pings and pongs incrementing correctly on the Talaria /metrics
API:
xmidt_talaria_ping_count 1
xmidt_talaria_pong_count 1
I even modified the Parodus source code to explicitly send a pong response in src/nopoll_handlers.c:listenerOnMessage_queue()
, and I saw the pong count incrementing by two instead of one.
We are evaluating using WebPA across a cellular network, so ideally we would have a much longer ping period to allow the modem to idle more often. I have played around with a lot of different values, and these are the shortest periods that I have found to break. I've been searching around for another config parameter or hardcoded constant that's causing this, but no luck so far.
Ok I think I figured it out. There is another WriteTimeout
parameter in webpa-common/device
that corresponds to the write deadline on the WebSocket connection in Talaria. This parameter is not specified in talaria.yaml
, but it is exposed through the Viper config. This value must be longer than pingPeriod
, or it will close the connection on the next write.
The default values are here: https://github.com/xmidt-org/webpa-common/blob/main/device/options.go
DefaultIdlePeriod time.Duration = 135 * time.Second
DefaultWriteTimeout time.Duration = 60 * time.Second
DefaultPingPeriod time.Duration = 45 * time.Second
So ultimately, pingPeriod
< writeTimeout
and pingPeriod
< idleTimeout
to maintain a persistent WebSocket connection over time.
When I set pingPeriod: 2m
, the default write timeout was too short, so the connection was being closed the next time it was written to. I'm not sure why the pings still appeared to be getting sent to the device successfully though.
If this makes sense to you, then I'll submit a PR that adds this parameter to the sample talaria.yaml
with a comment explaining the constraints.
The WebSocket connection is closed after
idlePeriod
when increasingidlePeriod
andpingPeriod
. I'm not sure if this is an issue with Talaria or Parodus, but I'm starting here.I started with the XMiDT docker-compose example:
https://github.com/xmidt-org/xmidt/tree/master/deploy/docker-compose
Then, I simplified the cluster down to one instance of Consul and Talaria and removed the simulator. I modified
talaria.yaml
to increaseidlePeriod
from 2m to 3m andpingPeriod
from 1m to 2m:I added a Parodus instance built from v1.1.5. It is being run with:
I started the cluster with
docker-compose up -d
and ran the following test loop looking for failures:The Parodus "device" never actually goes idle for longer than 10 seconds in this test loop, but the WebSocket connection is still closed and reconnected after idlePeriod = 3m.
This gist contains the docker-compose.yml, talaria.yaml, and test.sh script for the cluster and the Dockerfile and docker-entrypoint.sh for the Parodus container:
https://gist.github.com/tmo-rshea3/0db63adfd1bca1cd45a59d1e00e5d3f2
Log output attached: talaria.log parodus.log test.log