net_buf reference count not protected

chyu313 commented 3 years ago

Describe the bug I tested heavy TCP traffic load by 32 connections sending data to 32 tcp echo server threads on MIMXRT1060_EVK board, and at the same time using 'ping -f' to make zephyr busy handling ICMP packet. After one day, system crashed as following:

I tested many times, and there was a time the system halted due to assertion in k_spin_unlock() saying "Not my spinlock %p" in which the spin lock pointer pointed to _net_buf_pool_area. I didn't capture the error message at that time, but I tried to use atomic operation to protect reference count of net_buf by referencing https://github.com/zephyrproject-rtos/zephyr/commit/5ef825fdf6e760db0fcd9107bc3fbdbba15d547a committed by @jukkar . I changed 'uint8_t ref' in 'struct net_buf' to 'atomic_t atomic_ref', and simply used atomic_inc(&buf->atomic_ref) to replace original 'buf->ref++', atomic_dec(&buf->atomic_ref) to replace original 'buf->ref--', atomic_get(&frag->atomic_ref) to replace original code reading reference count.

Though not knowing why this modification can solve my problem, but after protecting net_buf reference count using atomic operation, my test has run for 3 days and keeps running.

Environment (please complete the following information):

OS: Linux
Toolchain: Zephyr SDK
Commit: Zephyr v2.5.0

jukkar commented 3 years ago

I changed 'uint8_t ref' in 'struct net_buf' to 'atomic_t atomic_ref', and simply used atomic_inc(&buf->atomic_ref) to replace original 'buf->ref++', atomic_dec(&buf->atomic_ref) to replace original 'buf->ref--', atomic_get(&frag->atomic_ref) to replace original code reading reference count.

This change would be welcomed and it most probably fixes your issue. Could you create a PR for this change?

chyu313 commented 3 years ago

#

I changed 'uint8_t ref' in 'struct net_buf' to 'atomic_t atomic_ref', and simply used atomic_inc(&buf->atomic_ref) to replace original 'buf->ref++', atomic_dec(&buf->atomic_ref) to replace original 'buf->ref--', atomic_get(&frag->atomic_ref) to replace original code reading reference count.

This change would be welcomed and it most probably fixes your issue. Could you create a PR for this change?

OK, but as a newbie, I am reading Contribution Guidelines and then will create a PR.

jhedberg commented 3 years ago

I think the main reason it's a uint8_t is that it creates a smaller struct size. We should consider solving the same by adding irq_lock/unlock or similar to the appropriate places in the implementation.

mniestroj commented 3 years ago

@jhedberg we would need spinlock in order to support SMP. Either global for all net_bufs or one per net_buf, which would consume more space than atomic.

Just to confirm, was using net_buf supposed to be thread-safe? There are not so many places where net_buf_ref() is actually used...

I am uncertain that using atomic reference counting is actually solving described bug. TCP application should operate on net_pkts underneath. net_pkts have their own reference counting and net_buf is only unrefed once after net_pkt reference count reaches zero. In other words, I don't see who is calling net_buf_unref second time, that results in race condition.

chyu313 commented 3 years ago

Unfortunately my system crashed on the 4th day of test with messy text dumped on debug console:

I wrote a modified net_pkt_print() to monitor the usage of net_buf

void _net_pkt_print(void)
{
    printk("[%u] TX %4u(%p) RX %4u(%p) TXBUF %4u RXBUF %4u\r",
            k_uptime_get_32(),
        k_mem_slab_num_free_get(&tx_pkts), tx_pkts.free_list,
        k_mem_slab_num_free_get(&rx_pkts), rx_pkts.free_list,
        atomic_get(&tx_bufs.avail_count),
        atomic_get(&rx_bufs.avail_count));
}

In my test described above, net_pkt count was set to 1024 and net_buf count was set to 4096 which is a large value. To verify whether there is something wrong with net_buf allocation/free, I reduced net_buf count on purpose:

CONFIG_NET_PKT_RX_COUNT=1024
CONFIG_NET_PKT_TX_COUNT=1024
CONFIG_NET_BUF_RX_COUNT=1024
CONFIG_NET_BUF_TX_COUNT=1024

The net_buf count was 1024 when system started: [17909] TX 978(0x80107b30) RX 1024(0x800f7318) TXBUF 1024 RXBUF 1024

Then with 32 TCP connections sending data to Zephyr, Data buffer (%zd) allocation failed was frequently dumped. After several times of TCP connecting/disconnecting, transmit net_pkt/net_buf count cannot restore to its original value: [333486] TX 933(0x80104f08) RX 1024(0x800f28d8) TXBUF 831 RXBUF 1024

I think I should first solve the problem of lost net_pkt/net_buf.

jhedberg commented 3 years ago

@jhedberg we would need spinlock in order to support SMP. Either global for all net_bufs or one per net_buf, which would consume more space than atomic.

@mniestroj ok, in that case we're probably better off with moving to atomic_t. I think we should at the same time consider changing the behaviour of net_buf_ref() so that it returns NULL if you give it a 0 reference count, similar to what we did with bt_conn: https://github.com/zephyrproject-rtos/zephyr/blob/b182ec78a16670f21c87ee7a33c2910f413f67db/subsys/bluetooth/host/conn.c#L1864-L1899

We could even consider creating a k_ref abstraction around atomic_t to make refcount implementation simpler (and harder to do wrong). Linux has something like this in the form of struct kref and refcount_t

mniestroj commented 3 years ago

k_ref abstraction seems a good idea. Recently I needed to add refcounting to drivers/wifi/esp/ to manage sockets lifetime, which could benefit from k_ref as well.

I think we should at the same time consider changing the behaviour of net_buf_ref() so that it returns NULL if you give it a 0 reference count

I am not sure this brings much benefit to net_buf_ref(). If you had a net_buf and somebody already decreased refcount to 0, then this net_buf landed in net_buf_pool. Assuming that it is still in the pool, adding check on 0 refcount in net_buf_ref() makes sense. But it is possible that such net_buf was already allocated once again, so effectively net_buf_ref() caller references new net_buf object (which is placed in the same memory, but still it is something different than desired). This means that we should not call net_buf_ref() on something that might be already dereferenced. In other words, caller is responsible for making sure net_buf_ref() is called only on buffers with refcount > 0.

galak commented 3 years ago

Have you tried just make this a uint32_t and testing that. That might convey if its a memory or atomic issue.

chyu313 commented 3 years ago

Have you tried just make this a uint32_t and testing that. That might convey if its a memory or atomic issue.

Changing type of net_buf ref form uint8_t to atomic_t finally did not prevent my system from crashing, so I did not try uint32_t.

The zephyr master branch solves some race condition problems after v2.5 release, so I changed to use master branch code. The system crashed with ARM hard fault as following:

E: ***** USAGE FAULT *****
E:   Illegal use of the EPSR
E: r0/a1:  0x1f9650b4  r1/a2:  0x00000000  r2/a3:  0x00000000
E: r3/a4:  0x00000000 r12/ip:  0xa0000000 r14/lr:  0x80006917
E:  xpsr:  0x00000000
E: r4/v1:  0x8011105c  r5/v2:  0x00000000  r6/v3:  0x00002710
E: r7/v4:  0x80110728  r8/v5:  0x00000500  r9/v6:  0x00000000
E: r10/v7: 0x00000006  r11/v8: 0x80021798    psp:  0x800c8740
E: EXC_RETURN: 0xfffffffd
E: Faulting instruction address (r15/pc): 0x8002164c
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x800233a0 (tcps_25)
E: Halting system

The faulting instruction address locates to a non-text area, it seems to be a stack problem, but I have set all threads' stack size to 8192, also with CONFIG_HW_STACK_PROTECTION=y. It is hard to see what happened before this ARM hard fault. What I am doing now is to use ARM MPU (Memory Protection Unit) to set non-text area to non-executable, so I can know at what point the PC register is changing to an illegal address.

kevin0402 commented 3 years ago

@andrewboie @jukkar @chyu313

Looks like this issue relate to #27032 and might be solved in https://github.com/zephyrproject-rtos/zephyr/commit/dde03c67709a0ed4b84d4892ca05361716b75c80

jukkar commented 3 years ago

Looks like this issue relate to #27032 and might be solved in dde03c6

Then that is just a coincidence/side effect as that commit does not protect any access to net_buf.

chyu313 commented 3 years ago

@andrewboie @jukkar @chyu313

Looks like this issue relate to #27032 and might be solved in dde03c6

It seems there are also more protections needed in kernel.

I tested dde03c6 (without modification to protect net_buf) and MPU fault occurred:

E: ***** MPU FAULT *****
E:   Data Access Violation
E:   MMFAR Address: 0x0
E: r0/a1:  0x80026260  r1/a2:  0x00000000  r2/a3:  0x00000000
E: r3/a4:  0x00000000 r12/ip:  0x80020850 r14/lr:  0x8001b2cf
E:  xpsr:  0x210000a8
E: r4/v1:  0x80026260  r5/v2:  0x00000000  r6/v3:  0x00000000
E: r7/v4:  0x00000010  r8/v5:  0x00000010  r9/v6:  0x80020850
E: r10/v7: 0x00000000  r11/v8: 0x00000000    psp:  0x8028df80
E: EXC_RETURN: 0xfffffff1
E: Faulting instruction address (r15/pc): 0x8001b2bc
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Fault during interrupt handling

E: Current thread: 0x80026260 (rx_q[0])
E: Halting system

To avoid NULL pointer access, I use MPU to set ARM address 0x0 (located in ITCM which is not used in my program) as no-access attribute, so that NULL pointer access to 0 will cause MPU FAULT.

The faulting instruction address 0x8001b2bc is located in sys_dlist_remove() which is called by unpend_thread_no_timeout(). The NULL pointer in sys_dlist_remove() is 'prev' which is compiled into r2 (value is 0 when MPU fault).

The comment of sys_dlist_remove() says: "This and other sysdlist*() functions are not thread safe."

8001b2b8 <sys_dlist_remove>:
    sys_dnode_t *const next = node->next;
8001b2b8:   e9d0 3200   ldrd    r3, r2, [r0]
    prev->next = next;
8001b2bc:   6013        str r3, [r2, #0]
    next->prev = prev;
8001b2be:   605a        str r2, [r3, #4]
    node->next = NULL;
8001b2c0:   2300        movs    r3, #0
    node->prev = NULL;
8001b2c2:   e9c0 3300   strd    r3, r3, [r0]
    sys_dnode_init(node);
}
8001b2c6:   4770        bx  lr

8001b2c8 <unpend_thread_no_timeout>:
{
8001b2c8:   b508        push    {r3, lr}
    sys_dlist_remove(&thread->base.qnode_dlist);
8001b2ca:   f7ff fff5   bl  8001b2b8 <sys_dlist_remove>
    thread->base.thread_state &= ~_THREAD_PENDING;
8001b2ce:   7b43        ldrb    r3, [r0, #13]
8001b2d0:   f023 0302   bic.w   r3, r3, #2
8001b2d4:   7343        strb    r3, [r0, #13]
    thread->base.pended_on = NULL;
8001b2d6:   2300        movs    r3, #0
8001b2d8:   6083        str r3, [r0, #8]
}
8001b2da:   bd08        pop {r3, pc}

carlescufi commented 3 years ago

@chyu313 can you confirm this is still an issue in current main?

chyu313 commented 3 years ago

@chyu313 can you confirm this is still an issue in current main?

Yes, it still an issue. I keep pulling Zephyr code and testing it. Fault still occurs.

github-actions[bot] commented 3 years ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

M1cha commented 1 year ago

I created a PR that implements k_ref: https://github.com/zephyrproject-rtos/zephyr/pull/64798

zephyrproject-rtos / zephyr

net_buf reference count not protected #32564