Closed chyu313 closed 3 years ago
I changed 'uint8_t ref' in 'struct net_buf' to 'atomic_t atomic_ref', and simply used atomic_inc(&buf->atomic_ref) to replace original 'buf->ref++', atomic_dec(&buf->atomic_ref) to replace original 'buf->ref--', atomic_get(&frag->atomic_ref) to replace original code reading reference count.
This change would be welcomed and it most probably fixes your issue. Could you create a PR for this change?
#
I changed 'uint8_t ref' in 'struct net_buf' to 'atomic_t atomic_ref', and simply used atomic_inc(&buf->atomic_ref) to replace original 'buf->ref++', atomic_dec(&buf->atomic_ref) to replace original 'buf->ref--', atomic_get(&frag->atomic_ref) to replace original code reading reference count.
This change would be welcomed and it most probably fixes your issue. Could you create a PR for this change?
OK, but as a newbie, I am reading Contribution Guidelines and then will create a PR.
I think the main reason it's a uint8_t is that it creates a smaller struct size. We should consider solving the same by adding irq_lock/unlock or similar to the appropriate places in the implementation.
@jhedberg we would need spinlock in order to support SMP. Either global for all net_bufs or one per net_buf, which would consume more space than atomic.
Just to confirm, was using net_buf supposed to be thread-safe? There are not so many places where net_buf_ref()
is actually used...
I am uncertain that using atomic reference counting is actually solving described bug. TCP application should operate on net_pkt
s underneath. net_pkt
s have their own reference counting and net_buf
is only unrefed once after net_pkt
reference count reaches zero. In other words, I don't see who is calling net_buf_unref
second time, that results in race condition.
Unfortunately my system crashed on the 4th day of test with messy text dumped on debug console:
I wrote a modified net_pkt_print() to monitor the usage of net_buf
void _net_pkt_print(void)
{
printk("[%u] TX %4u(%p) RX %4u(%p) TXBUF %4u RXBUF %4u\r",
k_uptime_get_32(),
k_mem_slab_num_free_get(&tx_pkts), tx_pkts.free_list,
k_mem_slab_num_free_get(&rx_pkts), rx_pkts.free_list,
atomic_get(&tx_bufs.avail_count),
atomic_get(&rx_bufs.avail_count));
}
In my test described above, net_pkt count was set to 1024 and net_buf count was set to 4096 which is a large value. To verify whether there is something wrong with net_buf allocation/free, I reduced net_buf count on purpose:
CONFIG_NET_PKT_RX_COUNT=1024
CONFIG_NET_PKT_TX_COUNT=1024
CONFIG_NET_BUF_RX_COUNT=1024
CONFIG_NET_BUF_TX_COUNT=1024
The net_buf count was 1024 when system started:
[17909] TX 978(0x80107b30) RX 1024(0x800f7318) TXBUF 1024 RXBUF 1024
Then with 32 TCP connections sending data to Zephyr, Data buffer (%zd) allocation failed
was frequently dumped. After several times of TCP connecting/disconnecting, transmit net_pkt/net_buf count cannot restore to its original value:
[333486] TX 933(0x80104f08) RX 1024(0x800f28d8) TXBUF 831 RXBUF 1024
I think I should first solve the problem of lost net_pkt/net_buf.
@jhedberg we would need spinlock in order to support SMP. Either global for all net_bufs or one per net_buf, which would consume more space than atomic.
@mniestroj ok, in that case we're probably better off with moving to atomic_t. I think we should at the same time consider changing the behaviour of net_buf_ref()
so that it returns NULL if you give it a 0 reference count, similar to what we did with bt_conn:
https://github.com/zephyrproject-rtos/zephyr/blob/b182ec78a16670f21c87ee7a33c2910f413f67db/subsys/bluetooth/host/conn.c#L1864-L1899
We could even consider creating a k_ref
abstraction around atomic_t to make refcount implementation simpler (and harder to do wrong). Linux has something like this in the form of struct kref
and refcount_t
k_ref
abstraction seems a good idea. Recently I needed to add refcounting to drivers/wifi/esp/
to manage sockets lifetime, which could benefit from k_ref
as well.
I think we should at the same time consider changing the behaviour of net_buf_ref() so that it returns NULL if you give it a 0 reference count
I am not sure this brings much benefit to net_buf_ref()
. If you had a net_buf
and somebody already decreased refcount to 0, then this net_buf
landed in net_buf_pool
. Assuming that it is still in the pool, adding check on 0 refcount in net_buf_ref()
makes sense. But it is possible that such net_buf
was already allocated once again, so effectively net_buf_ref()
caller references new net_buf
object (which is placed in the same memory, but still it is something different than desired). This means that we should not call net_buf_ref()
on something that might be already dereferenced. In other words, caller is responsible for making sure net_buf_ref()
is called only on buffers with refcount > 0.
Have you tried just make this a uint32_t and testing that. That might convey if its a memory or atomic issue.
Have you tried just make this a uint32_t and testing that. That might convey if its a memory or atomic issue.
Changing type of net_buf ref form uint8_t to atomic_t finally did not prevent my system from crashing, so I did not try uint32_t.
The zephyr master branch solves some race condition problems after v2.5 release, so I changed to use master branch code. The system crashed with ARM hard fault as following:
E: ***** USAGE FAULT *****
E: Illegal use of the EPSR
E: r0/a1: 0x1f9650b4 r1/a2: 0x00000000 r2/a3: 0x00000000
E: r3/a4: 0x00000000 r12/ip: 0xa0000000 r14/lr: 0x80006917
E: xpsr: 0x00000000
E: r4/v1: 0x8011105c r5/v2: 0x00000000 r6/v3: 0x00002710
E: r7/v4: 0x80110728 r8/v5: 0x00000500 r9/v6: 0x00000000
E: r10/v7: 0x00000006 r11/v8: 0x80021798 psp: 0x800c8740
E: EXC_RETURN: 0xfffffffd
E: Faulting instruction address (r15/pc): 0x8002164c
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x800233a0 (tcps_25)
E: Halting system
The faulting instruction address locates to a non-text area, it seems to be a stack problem, but I have set all threads' stack size to 8192, also with CONFIG_HW_STACK_PROTECTION=y. It is hard to see what happened before this ARM hard fault. What I am doing now is to use ARM MPU (Memory Protection Unit) to set non-text area to non-executable, so I can know at what point the PC register is changing to an illegal address.
@andrewboie @jukkar @chyu313
Looks like this issue relate to #27032 and might be solved in https://github.com/zephyrproject-rtos/zephyr/commit/dde03c67709a0ed4b84d4892ca05361716b75c80
Looks like this issue relate to #27032 and might be solved in dde03c6
Then that is just a coincidence/side effect as that commit does not protect any access to net_buf.
@andrewboie @jukkar @chyu313
Looks like this issue relate to #27032 and might be solved in dde03c6
It seems there are also more protections needed in kernel.
I tested dde03c6 (without modification to protect net_buf) and MPU fault occurred:
E: ***** MPU FAULT *****
E: Data Access Violation
E: MMFAR Address: 0x0
E: r0/a1: 0x80026260 r1/a2: 0x00000000 r2/a3: 0x00000000
E: r3/a4: 0x00000000 r12/ip: 0x80020850 r14/lr: 0x8001b2cf
E: xpsr: 0x210000a8
E: r4/v1: 0x80026260 r5/v2: 0x00000000 r6/v3: 0x00000000
E: r7/v4: 0x00000010 r8/v5: 0x00000010 r9/v6: 0x80020850
E: r10/v7: 0x00000000 r11/v8: 0x00000000 psp: 0x8028df80
E: EXC_RETURN: 0xfffffff1
E: Faulting instruction address (r15/pc): 0x8001b2bc
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Fault during interrupt handling
E: Current thread: 0x80026260 (rx_q[0])
E: Halting system
To avoid NULL pointer access, I use MPU to set ARM address 0x0 (located in ITCM which is not used in my program) as no-access attribute, so that NULL pointer access to 0 will cause MPU FAULT.
The faulting instruction address 0x8001b2bc is located in sys_dlist_remove() which is called by unpend_thread_no_timeout(). The NULL pointer in sys_dlist_remove() is 'prev' which is compiled into r2 (value is 0 when MPU fault).
The comment of sys_dlist_remove() says: "This and other sysdlist*() functions are not thread safe."
8001b2b8 <sys_dlist_remove>:
sys_dnode_t *const next = node->next;
8001b2b8: e9d0 3200 ldrd r3, r2, [r0]
prev->next = next;
8001b2bc: 6013 str r3, [r2, #0]
next->prev = prev;
8001b2be: 605a str r2, [r3, #4]
node->next = NULL;
8001b2c0: 2300 movs r3, #0
node->prev = NULL;
8001b2c2: e9c0 3300 strd r3, r3, [r0]
sys_dnode_init(node);
}
8001b2c6: 4770 bx lr
8001b2c8 <unpend_thread_no_timeout>:
{
8001b2c8: b508 push {r3, lr}
sys_dlist_remove(&thread->base.qnode_dlist);
8001b2ca: f7ff fff5 bl 8001b2b8 <sys_dlist_remove>
thread->base.thread_state &= ~_THREAD_PENDING;
8001b2ce: 7b43 ldrb r3, [r0, #13]
8001b2d0: f023 0302 bic.w r3, r3, #2
8001b2d4: 7343 strb r3, [r0, #13]
thread->base.pended_on = NULL;
8001b2d6: 2300 movs r3, #0
8001b2d8: 6083 str r3, [r0, #8]
}
8001b2da: bd08 pop {r3, pc}
@chyu313 can you confirm this is still an issue in current main?
@chyu313 can you confirm this is still an issue in current main?
Yes, it still an issue. I keep pulling Zephyr code and testing it. Fault still occurs.
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.
I created a PR that implements k_ref: https://github.com/zephyrproject-rtos/zephyr/pull/64798
Describe the bug I tested heavy TCP traffic load by 32 connections sending data to 32 tcp echo server threads on MIMXRT1060_EVK board, and at the same time using 'ping -f' to make zephyr busy handling ICMP packet. After one day, system crashed as following:
I tested many times, and there was a time the system halted due to assertion in k_spin_unlock() saying "Not my spinlock %p" in which the spin lock pointer pointed to _net_buf_pool_area. I didn't capture the error message at that time, but I tried to use atomic operation to protect reference count of net_buf by referencing https://github.com/zephyrproject-rtos/zephyr/commit/5ef825fdf6e760db0fcd9107bc3fbdbba15d547a committed by @jukkar . I changed 'uint8_t ref' in 'struct net_buf' to 'atomic_t atomic_ref', and simply used atomic_inc(&buf->atomic_ref) to replace original 'buf->ref++', atomic_dec(&buf->atomic_ref) to replace original 'buf->ref--', atomic_get(&frag->atomic_ref) to replace original code reading reference count.
Though not knowing why this modification can solve my problem, but after protecting net_buf reference count using atomic operation, my test has run for 3 days and keeps running.
Environment (please complete the following information):