xively / xively-client-c

The official and highly portable Xively C Client designed for embedded platforms.
Other
18 stars 22 forks source link

Fix on KeepAlive Ping Write Errors #197

Closed DellaBitta closed 6 years ago

DellaBitta commented 6 years ago

[ Description ] Fix for handling socket disconnection after having write errors when attempting to send the keep alive ping request. At disconnection the keepalive timeout guard task if forced to finish its job. Forced not from event scheduler but from the mqtt logic destructor. Although the task expected to be called only from event scheduler. This contract is guarded by the firing assert. Now timed event is cancelled before forcing its execution.

[ JIRA ] n/a

[ Reviewer ] @atigyi

[ QA ] Tested clients with backoff mode disabled. Clients used the same credentials and were disconnecting one another. Set the MQTT keepalive time to the same time as backoff so that there were cases when the broker's duplicate credential disconnection event ran into the MQTT PingReq event (logged to ensure that this sometimes occurred). Ran with the memory limiter enabled to ensure that memory wasn't leaking when these events occurred. Ran a longevity test to see if repeated disconnections, and repeated MQTT PingReq teardowns, caused any instability. Added 3 new keepalive integration test cases covering disconnecting socket after PINGREQ write error.

[ Release Notes ] Bugfix - only in DEBUG mode - when socket disconnection happened before keepalive time interval passed after failing to write PINGREQ on the socket. Fixes for when the MQTT Keep alive PingReq occurred at the same tick as socket was closed.

atigyi commented 6 years ago

The case this change targets is when the library failed to send the MQTT KEEPALIVE message. The previous handling of this situation was to wait the keepalive timeout and then do a libxively internal reconnect. Although this reconnect probably did not happen because broker closed the connection before in .5 * keepalive secs. Right?! What was the problem with this previous "error handling" mechanism? The assert: assert( NULL == task->timeout.ptr_to_position );?

What this version has as a difference is that the on_keepalive_timeout_expiry event is not scheduled. So it won't hit back when connection closes. Why does the connection close here? Because server closes it. Why? Keepalive timeout or same credential reuse?

"This would put the library in a confusing state." I would rather explore what this "confusing state" is and try to fix it than dodge it. Why? Because this confusing state might be reached on a different path too. Knowing this state would help if there are other paths to reach this state.

We should reproduce this issue with integration test case(s) and fix it that way.