zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.88k stars 6.63k forks source link

net: coap_client: freaks out when socket poll errors out #80282

Closed tautologyclub closed 2 weeks ago

tautologyclub commented 3 weeks ago

Running 3.5.99-ncs1 (Nordic fork). We're getting ZSOCK_POLLERR according to logs. Snippet from handle_poll():

            for (int i = 0; i < nfds; i++) {
                if (fds[i].revents & ZSOCK_POLLERR) {
                    LOG_ERR("Error in poll for socket %d", fds[i].fd);
                }
                if (fds[i].revents & ZSOCK_POLLHUP) {
                    LOG_ERR("Error in poll: POLLHUP for socket %d", fds[i].fd);
                }
                if (fds[i].revents & ZSOCK_POLLNVAL) {
                    LOG_ERR("Error in poll: POLLNVAL - fd %d not open",
                        fds[i].fd);
                }
                if (fds[i].revents & ZSOCK_POLLIN) {
                    clients[i]->response_ready = true;
                }
            }

            return 0;

so it simply returns 0. Then in coap_client_recv():

    while (true) {
        atomic_set(&coap_client_recv_active, 1);
        ret = handle_poll();
        if (ret < 0) {
            /* Error in polling */
            LOG_ERR("Error in poll");
            goto idle;
        }

        for (int i = 0; i < num_clients; i++) {
            if (clients[i]->response_ready) {
                struct coap_packet response;

                ret = recv_response(clients[i], &response);
                if (ret < 0) {
                    LOG_ERR("Error receiving response");
                    clients[i]->response_ready = false;
                    continue;
                }

                ret = handle_response(clients[i], &response);
                if (ret < 0) {
                    LOG_ERR("Error handling response");
                }

                clients[i]->response_ready = false;
            }
        }

        /* There are more messages coming */
        if (has_ongoing_requests()) {
            continue;

So handle_poll() returns 0, but clients[i]->response_ready is not true. Supposedly has_ongoing_requests() is still true though. This leads to a tight loop which it never escapes from.

Seems like handle_poll() should simply return negative here, right? If I get some input from @juhaylinen or some CODEOWNER (@rlubos ? @jukkar ?) I can submit a patch.

For the record, this has been observed empirically, when our LTE connection died at some inopportune moment.

rlubos commented 3 weeks ago

Seems like handle_poll() should simply return negative here, right?

Not sure if we should do that specifically, theoretically we can have multiple clients communicating over different sockets, doing this would stall the communication for every other client. Plus this will not report any error to the application.

I think that if a particular socket starts to report errors during poll, we should just close any active requests for the client that owns the socket. We currently have coap_client_cancel_requests() that could be used here, but perhaps we should generalize the function to allow to report other error codes, like ECONNABORTED or so.

tautologyclub commented 3 weeks ago

https://github.com/zephyrproject-rtos/zephyr/pull/80303

mmahadevan108 commented 2 weeks ago

@tautologyclub Has this been fixed? Can this issue be closed?