zephyriot / zep-jira14

0 stars 0 forks source link

extra unref happening on net_context #2427

Open nashif opened 7 years ago

nashif commented 7 years ago

Reported by Geoff Gustafson:

I'm seeing a problem where I get a net_context in a tcp_accepted_cb and I call net_context_ref on it so that I can keep it around and use it later. But if I instrument Zephyr code and monitor alls refs and unrefs, I see that it gets unref'd down to 0 even though I've never unref'd it. If I don't ref it I see it get unref'd down to -1. Then I changed things to print the caller function for every ref/unref. I see that tcp_established is calling unref twice in a row in these lines near the bottom of the function:

    if (sys_slist_is_empty(&context->tcp->sent_list)
        && context->tcp->fin_rcvd
        && context->tcp->fin_sent) {
        net_context_unref(context);
    }

This code looks to me like it might be at issue because it drops the reference and yet doesn't change any other state to indicate to the stack that it's been closed. It may be that it's expecting to be the last one to hold a reference, but in my case that's not true.

When I run samples/net/echo_server I don't see this problem. I suspect that's because the sample handles socket closure immediately in tcp_received. But in my more complex case with ZJS - JavaScript running in the main thread that I need to sync with, I have to hold on to the context and defer the work to the main thread.

Talking to [~ajross71] in IRC, he suggested that maybe elsewhere the socket should be getting moved to a new state to prevent re-entry into tcp_established, and that's where the bug is.

Note: I'm really working with 1.8 at the moment but I don't see anything that would have changed this in master. The bug I'm working on is the last thing holding up my 0.4 release, and after that we'll try to update to the latest Zephyr.

(Imported from Jira ZEP-2598)

nashif commented 7 years ago

by Geoff Gustafson:

Sorry, I should have explained that my code is basically an echo server test too, just more complex because JS is involved.

If I do this from Linux: $ nc 192.168.1.99 4242 it works fine, and I can go back and forth echoing data.

It's when I do this that it fails: $ echo "foo" | nc 192.168.1.99 4242

That is, something about the socket getting closed so quickly after the data is sent. When I look at the packet trace, I see: Client SYN (Linux) Server SYN/ACK (Zephyr) Client ACK Client PSH/ACK w/ the data "foo" Client FIN/ACK Server ACK to data Server FIN/ACK Client ACK

I tried doing an extra unpaired net_context_ref in my code. It gets farther, but then after the above packet trace I see ack_timeout() get called after a few seconds and unref the context again itself. So it still goes to zero and gets marked as not "in use". I guess it doesn't see that last Client ACK for some reason?