xmidt-org / talaria

The Xmidt routing agent.
Apache License 2.0
11 stars 21 forks source link

WebSocket connection closed when increasing pingPeriod #245

Closed tmo-rshea3 closed 2 years ago

tmo-rshea3 commented 2 years ago

The WebSocket connection is closed after idlePeriod when increasing idlePeriod and pingPeriod. I'm not sure if this is an issue with Talaria or Parodus, but I'm starting here.

I started with the XMiDT docker-compose example:

https://github.com/xmidt-org/xmidt/tree/master/deploy/docker-compose

Then, I simplified the cluster down to one instance of Consul and Talaria and removed the simulator. I modified talaria.yaml to increase idlePeriod from 2m to 3m and pingPeriod from 1m to 2m:

device:
  manager:
    idlePeriod: 3m
    pingPeriod: 2m

I added a Parodus instance built from v1.1.5. It is being run with:

parodus \
    --hw-model=TESTABC \
    --hw-serial-number=TEST1234 \
    --hw-manufacturer=TESTCO \
    --hw-mac="11aa22bb33cc" \
    --hw-last-reboot-reason=unknown \
    --fw-name=TESTABC_1.2.3 \
    --boot-time="$(date +"%s")" \
    --partner-id="PARTNER" \
    --webpa-interface-used=eth0 \
    --webpa-url="http://petasos:6400" \
    --webpa-ping-timeout=1800 \
    --webpa-backoff-max=9 \
    --parodus-local-url=tcp://127.0.0.1:6666

I started the cluster with docker-compose up -d and ran the following test loop looking for failures:

#!/bin/bash

start_time=$(date +"%s")
echo "$(date): Started"

while :; do
    response=$(curl -s -H "Authorization: Basic dXNlcjpwYXNz" -H "X-Xmidt-Partner-ID: PARTNER" http://localhost:6100/api/v2/device/mac:11aa22bb33cc/config?names=Foo)
    status_code=$(echo "$response" | jq .statusCode)

    now=$(date +"%s")
    dur=$((now - start_time))

    if [ "$status_code" != "531" ]; then
        echo "$(date): Failed (Duration: $dur)"
        echo "$response"
        sleep 1
        start_time=$(date +"%s")
        continue
    fi

    echo "$(date): Got response (Duration: $dur)"
    sleep 10
done

The Parodus "device" never actually goes idle for longer than 10 seconds in this test loop, but the WebSocket connection is still closed and reconnected after idlePeriod = 3m.

This gist contains the docker-compose.yml, talaria.yaml, and test.sh script for the cluster and the Dockerfile and docker-entrypoint.sh for the Parodus container:

https://gist.github.com/tmo-rshea3/0db63adfd1bca1cd45a59d1e00e5d3f2

Log output attached: talaria.log parodus.log test.log

schmidtw commented 2 years ago

My short answer is:

Both Parodus expects Pings from Talaria & Talaria expects Pongs from Parodus, regardless of other traffic at the moment. The 180s disconnect is suspiciously around where they will generally give up on each other. Normally the pings/pongs happen every 45s & up to 2 missed ones are allowed (it might be 1). That gets close to what you're seeing. I don't see any pings or pongs in the logs for clues.

I'll try to dig a bit deeper into this later this week.

tmo-rshea3 commented 2 years ago

That was my initial thought too, but I dug into the Parodus dependencies, and it looks like the noPoll WebSocket Toolkit automatically sends pongs after receiving a ping frame. Also, I wouldn't think the ping/pong would matter when making actual requests like this.

You can see one ping received in parodus.log:

parodus  | [1660607047][PARODUS][Info]: Ping received with payload mac:11aa22bb33cc, opcode 9

I was also able to see the count of pings and pongs incrementing correctly on the Talaria /metrics API:

xmidt_talaria_ping_count 1
xmidt_talaria_pong_count 1

I even modified the Parodus source code to explicitly send a pong response in src/nopoll_handlers.c:listenerOnMessage_queue(), and I saw the pong count incrementing by two instead of one.

We are evaluating using WebPA across a cellular network, so ideally we would have a much longer ping period to allow the modem to idle more often. I have played around with a lot of different values, and these are the shortest periods that I have found to break. I've been searching around for another config parameter or hardcoded constant that's causing this, but no luck so far.

tmo-rshea3 commented 2 years ago

Ok I think I figured it out. There is another WriteTimeout parameter in webpa-common/device that corresponds to the write deadline on the WebSocket connection in Talaria. This parameter is not specified in talaria.yaml, but it is exposed through the Viper config. This value must be longer than pingPeriod, or it will close the connection on the next write.

The default values are here: https://github.com/xmidt-org/webpa-common/blob/main/device/options.go

DefaultIdlePeriod     time.Duration = 135 * time.Second
DefaultWriteTimeout   time.Duration = 60 * time.Second
DefaultPingPeriod     time.Duration = 45 * time.Second

So ultimately, pingPeriod < writeTimeout and pingPeriod < idleTimeout to maintain a persistent WebSocket connection over time.

When I set pingPeriod: 2m, the default write timeout was too short, so the connection was being closed the next time it was written to. I'm not sure why the pings still appeared to be getting sent to the device successfully though.

If this makes sense to you, then I'll submit a PR that adds this parameter to the sample talaria.yaml with a comment explaining the constraints.