zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.93k stars 6.65k forks source link

Socket Service: Restart Flag Not Reset and Accidentally Cleared by Offloaded Polling Events #81813

Open illysky opened 3 hours ago

illysky commented 3 hours ago

I’m working with Zephyr v4.0.0 and trying to get to grips with the socket service. The process has been somewhat challenging, as I’m also using a combination of offloaded sockets alongside native sockets, with plans to eventually incorporate native DNS and TLS. There’s a good chance that some of the issues I’m experiencing are due to a misunderstanding on my part, so apologies in advance.

The DNS Resolve module registers its sockets using net_socket_service_register, as does one of my own modules for HTTP. Each time a new socket is registered, the socket service thread is instructed to restart to reload the updated events. This behavior is triggered by the following code in socket_service.c:

/* Tell the thread to re-read the variables */
zvfs_eventfd_write(ctx.events[0].fd, 1);
ret = 0;

And handled as:

if (ret > 0 && ctx.events[0].revents) {
    zvfs_eventfd_read(ctx.events[0].fd, &value);
    NET_DBG("Received restart event.");
    goto restart;
}

However, I can’t see where or how the restart flag (i.e., ctx.events[0].fd = 0) is cleared once the restart is complete. It doesn’t appear that zvfs_eventfd_read(ctx.events[0].fd, &value) clears it, and frankly, I’m not sure what the purpose of that call is anyway.

After digging into this further, it seems the restart flag might be cleared indirectly—or even randomly—by modules such as offloaded socket drivers. For example, I’ve implemented a variant of the in-tree SimpleLink Wi-Fi driver to support the NWP-only CC3120. The original driver doesn’t appear to be designed with the socket service in mind, as it modifies file descriptors that don’t belong to it. This creates problems, as the socket service reserves the first element of ctx.events for its internal control.

Specifically, the SimpleLink driver clears all revents fields before checking ownership, which can inadvertently prevent the socket service from restarting. Additionally, when the driver detects it doesn’t own a file descriptor, it returns an error instead of gracefully ignoring it and leaving other interfaces unaffected. Here’s the relevant snippet from simplelink_sockets.c:

for (i = 0; i < nfds; i++) {
    fds[i].revents = 0;  // <--- Clears all revents before determining ownership
    if (fds[i].fd < 0) {
        continue;
    } else {
        obj = zvfs_get_fd_obj(fds[i].fd,
                              (const struct fd_op_vtable *)&simplelink_socket_fd_op_vtable,
                              ENOTSUP);
        if (obj != NULL) {
            /* Offloaded socket found */
            // <--- Should clear here only
            sd = OBJ_TO_SD(obj);
        } else {
            /* Non-offloaded socket, return an error */
            retval = slcb_SetErrno(EINVAL);  // <--- Should ignore, not return, as this exits socket_service thread
            goto exit;
        }
    }
    if (fds[i].events & ZSOCK_POLLIN) {
        SL_SOCKET_FD_SET(sd, &rfds);
    }
    if (fds[i].events & ZSOCK_POLLOUT) {
        SL_SOCKET_FD_SET(sd, &wfds);
    }
    if (sd > max_sd) {
        max_sd = sd;
    }
}

Since my implementation of the driver no longer interacts with file descriptors that don’t belong to it, the restart flag isn’t always cleared as expected. This results in the socket service repeatedly restarting when returning from a blocked state in zsock_poll, causing socket data processing to be skipped.

Have I understood this issue correctly? If so, who is responsible for clearing the restart flag in socket_service.c (i.e., ctx.events[0].fd = 0)? I assume the service itself should handle this. Additionally, I’m wondering if the restart condition should be checked and addressed after processing the results of zsock_poll.

To address this, my proposed update to socket_service.c is as follows:

while (true) {
    ret = zsock_poll(ctx.events, count + 1, -1);
    if (ret < 0) {
        ret = -errno;
        NET_ERR("poll failed (%d)", ret);
        goto out;
    }

    if (ret == 0) {
        /* should not happen because timeout is -1 */
        break;
    }

    for (i = 1; i < (count + 1); i++) {
        if (ctx.events[i].fd < 0) {
            continue;
        }

        if (ctx.events[i].revents > 0) {
            ret = trigger_work(&ctx.events[i]);
            if (ret < 0) {
                NET_DBG("Triggering work failed (%d)", ret);
            }
        }
    }

    // Relocate after trigger work so the work gets done before restarting
    if (ret > 0 && ctx.events[0].revents) {
        // zvfs_eventfd_read(ctx.events[0].fd, &value);  // Remove
        ctx.events[0].revents = 0;                      // Clear the event as soon as possible
        NET_DBG("Received restart event.");
        goto restart;
    }
}

Apologies if I’ve misunderstood any part of this—I've only been working on it for a week. Thanks in advance for your insights!

github-actions[bot] commented 3 hours ago

Hi @illysky! We appreciate you submitting your first issue for our open-source project. 🌟

Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙