xoseperez / espurna

Home automation firmware for ESP8266-based devices
http://tinkerman.cat
GNU General Public License v3.0
2.99k stars 637 forks source link

MQTT sometimes does not notice broker shutdown #1759

Open mcspr opened 5 years ago

mcspr commented 5 years ago

Bug description

When MQTT broker server is physically disconnected from the network (for example, network cable is removed), device still thinks that connection is OK. Neither onDisconnect or onTimeout firing for MQTT module. mqtt.info shows state as connected, mqtt.reset finally resets it

While this seems like a bug in async mqtt client / asynctcp, still recording it here as it may be important "feature" to know. And it may be fixed by manual keep-alive, tracking last successful mqttSend() call

Steps to reproduce

  1. mqttQoS=1, mqttKeep=300, hbInterval 5 reset, wait until mqtt is connected on the device
  2. remove ethernet cable from the server
  3. debug logs still show that mqtt client is sending data, incrementing qos. after some time qos becomes 0 (ref: #1370), meaning that network layer does not accept any more data.

Additional context

Using async-mqtt-client master (or v0.8.1, same thing), Core 2.3.0 & 2.5.2 lwip2 Network is Linux server, Router that is directly wired to the server and ESP connected to the Router's AP. Both on the same LAN network bridged together.

I waited for about 10 min and reconnected the cable, after which MQTT reconnection finally happened.

tcpdump on the router shows that device is still sending data, but it does look like it just tries to retransmit not acked data:

08:18:33.534970 IP ESPURNA-35A259.lan.51606 > orion.lan.1883: Flags [P.], seq 1749:1785, ack 181, win 1924, length 36
08:18:36.559839 IP ESPURNA-35A259.lan.51606 > orion.lan.1883: Flags [P.], seq 1749:1785, ack 181, win 1924, length 36
08:18:42.584820 IP ESPURNA-35A259.lan.51606 > orion.lan.1883: Flags [P.], seq 1749:1785, ack 181, win 1924, length 36
08:18:54.784220 IP ESPURNA-35A259.lan.51606 > orion.lan.1883: Flags [P.], seq 1749:1785, ack 181, win 1924, length 36
08:19:19.333148 IP ESPURNA-35A259.lan.51606 > orion.lan.1883: Flags [P.], seq 1749:1785, ack 181, win 1924, length 36
08:20:08.005749 IP ESPURNA-35A259.lan.51606 > orion.lan.1883: Flags [P.], seq 1749:1785, ack 181, win 1924, length 36
...
08:21:45.550983 ARP, Request who-has orion.lan tell ESPURNA-35A259.lan, length 28

Small patch to the dev tree enables lwip debugging and shows that pcb is still ESTABLISHED https://github.com/xoseperez/espurna/compare/dev...mcspr:lwip/pcb-debug

[050328] [MQTT] Sending home/ESPURNA-35A259/status => 1 (PID 0)
[050333] [MQTT] Sending home/ESPURNA-35A259/loadavg => 0 (PID 0)
bs
lwip.pcbs
[050807] Active PCB states:
[050807] state=ESTABLISHED local=ESPURNA-35A259:63823 remote=orion.lan:1883 snd_nxt=11326 rcv_nxt=2086424742
[050808] TIME-WAIT PCB states:
[050811] BOUND PCB states:

One would expect that keep-alive would sort this out, but it does not. (and as quick google search answers suggest, regarding lwip tcp /app keepalive)

Device information

mcspr commented 5 years ago

I believe this is related to the marvinroger/async-mqtt-client#105 onPoll tracks last ping: https://github.com/marvinroger/async-mqtt-client/blob/8bd918bfeb2f197afa8b0cb5d278ab2942c62ccb/src/AsyncMqttClient.cpp#L447 But sendPing function will never update last ping time if it can't send ping (as established above): https://github.com/marvinroger/async-mqtt-client/blob/8bd918bfeb2f197afa8b0cb5d278ab2942c62ccb/src/AsyncMqttClient.cpp#L614