yhqiu16 / NVMeCHA

NVMe Controller featuring Hardware Acceleration
GNU General Public License v3.0

I/O timeout occurs after adding ddr module into NVMeCHA #6

Closed pefar closed 1 year ago

pefar commented 1 year ago

Hello, first I want to say that NVMeCHA is very useful and valuable!

Recently we have been working on a project where we are trying to add a DDR4 module to NVMeCHA and run it on our Zynq UltraScale+ card. After combining them, errors occur while Linux is booting:

blk_update_request: I/O error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Buffer I/O error on dev nvme0n1, logical block 0, async page read

First, we checked the UART output and found that the Admin commands were aborted after Identify Namespace completed, and that the Admin Abort command had been processed. The host then seemed to try restarting the controller, but the controller was dead in the end, and the errors above were printed to the screen.

......
NVME_STATE_RUNNING
Admin Command(Create IO SQ), cid: 4, nsid: 0, sq_id: 7, cq_id: 7, qsize: 1023

NVME_STATE_RUNNING
Admin Command(Create IO CQ), cid: 5, nsid: 0, cq_id: 8, qsize: 1023

Succeed! g_iocq_irq_enable: 0xFF, irq_en: 0x1, irq_vector: 0x8

NVME_STATE_RUNNING
Admin Command(Create IO SQ), cid: 6, nsid: 0, sq_id: 8, cq_id: 8, qsize: 1023

NVME_STATE_RUNNING
Admin Command(Identify), cid: 7, nsid: 0, cns: 0x2

Succeed!

NVME_STATE_RUNNING
Admin Command(Identify), cid: 4, nsid: 1, cns: 0x0

Succeed!

NVME_STATE_RUNNING
......
Admin Command(Abort)

Then we checked the kernel log recorded during system startup and found this information:

[   33.853997] nvme nvme0: I/O 128 QID 2 timeout, aborting
[   33.874030] nvme nvme0: Abort status: 0x0
[   64.574016] nvme nvme0: I/O 128 QID 2 timeout, reset controller
[   67.621849] nvme nvme0: 8/0/0 default/read/poll queues
[   99.390003] nvme nvme0: I/O 128 QID 2 timeout, disable controller
[  100.610219] blk_update_request: I/O error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  100.610249] fbcon: Taking over console
[  100.610257] Buffer I/O error on dev nvme0n1, logical block 0, async page read
[  100.610312] nvme nvme0: failed to mark controller live state
 [  100.610317] nvme nvme0: Removing after probe failure status: -19
[  100.610416] Console: switching to colour frame buffer device 240x67
[  100.629953] nvme0n1: detected capacity change from 33554432 to 0
[  100.630206] blk_update_request: I/O error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  100.630295] Buffer I/O error on dev nvme0n1, logical block 0, async page read
[  100.630402] ldm_validate_partition_table(): Disk read failed.
[  100.630412] Dev nvme0n1: unable to read RDB block 0
[  100.630459]  nvme0n1: unable to read partition table
[  100.630470] nvme0n1: partition table beyond EOD, truncated

It seems that a timeout caused the abort. Perhaps the Identify Namespace command was processed incorrectly? Maybe we have to reconfigure some features in the firmware code where Identify Namespace is defined. We are now reading the Linux kernel NVMe driver source to understand the order of calls when probing the NVMe device, hoping to get a hint.
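
For reference while we read the driver, here is a minimal sketch of the Identify Namespace (CNS 0h) fields the host uses to size the block device. The layout follows the NVMe spec; it is not code from NVMeCHA or the kernel. If nsze or the selected LBA format is reported incorrectly, the capacity the kernel sees will be wrong.

#include <stdint.h>

/* Illustrative subset of the Identify Namespace data structure (NVMe spec).
 * Field names and byte offsets follow the spec; this is not NVMeCHA code. */
struct nvme_id_ns_min {
    uint64_t nsze;   /* byte 0:  namespace size in logical blocks     */
    uint64_t ncap;   /* byte 8:  namespace capacity                   */
    uint64_t nuse;   /* byte 16: namespace utilization                */
    uint8_t  nsfeat; /* byte 24: namespace features                   */
    uint8_t  nlbaf;  /* byte 25: number of LBA formats (0-based)      */
    uint8_t  flbas;  /* byte 26: bits 3:0 index the active LBA format */
    /* bytes 27..127 elided; the 16-entry lbaf[] array starts at byte 128,
     * the block size is 2^lbads from the selected entry, and the capacity
     * the kernel derives is roughly nsze << lbads.                       */
};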

Has anyone met the same problem? Thanks for any help.

yhqiu16 commented 1 year ago

I guess the timeout may stem from the host NVMe driver not being able to find the correct CQ entry. I suggest checking the DMA address used to submit the CQ entry, either in simulation or with an ILA.
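
For illustration, here is a minimal sketch of the address and phase-bit relationship the host driver relies on when it polls a CQ. It follows the NVMe spec; names such as cq_base_dma and cq_tail are assumptions for this example, not NVMeCHA source. If the entry is written to a different address, or the phase bit is not toggled correctly, the host never sees the completion and the command times out.

#include <stdint.h>

/* 16-byte completion queue entry, per the NVMe spec. */
struct nvme_cq_entry {
    uint32_t dw0;       /* command-specific result                     */
    uint32_t dw1;       /* reserved                                    */
    uint16_t sq_head;   /* current SQ head pointer                     */
    uint16_t sq_id;     /* SQ the completed command came from          */
    uint16_t cid;       /* command identifier                          */
    uint16_t status;    /* bit 0 = phase tag, bits 15:1 = status field */
};

/* The controller must DMA each new entry to exactly this host address
 * (cq_base_dma is the queue base given by Create IO CQ, cq_tail is the
 * slot index); verifying this value in simulation or on an ILA is the
 * point of the suggestion above. */
static inline uint64_t cq_entry_dma_addr(uint64_t cq_base_dma, uint16_t cq_tail)
{
    return cq_base_dma + (uint64_t)cq_tail * sizeof(struct nvme_cq_entry);
}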

pefar commented 1 year ago

Thanks for the reply! Do you mean the DDR module caused a timeout or a CQ error directly? Certainly, the DDR module does cause timeouts in I/O command processing. But in NVMeCHA, admin commands are processed by the MicroBlaze, without access to the I/O controller, which is where we added the DDR module. Also, the timeout occurred during Linux initialization, where, according to the NVMe spec, no I/O commands are processed. So I don't think the DDR module caused a timeout in admin command processing. Strangely, however, the timed-out command's SQID, read from dw10 of the Abort command, is not 0 (the admin queues own ID 0), which indicates it is an I/O command. For now we suppose that the kernel NVMe driver probably triggered an unexpected DMA transfer that uses the I/O pipe, or something similar.
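
For reference, this is how we are reading dw10 of the Abort command. The layout is from the NVMe spec; decode_abort_dw10 is just an illustrative helper, not NVMeCHA firmware.

#include <stdint.h>

/* Abort admin command, Dword 10 (NVMe spec): bits 15:0 hold the SQ
 * identifier of the command to abort, bits 31:16 its command identifier.
 * SQID 0 is the admin queue, so a non-zero SQID means the host is
 * aborting an I/O command. */
static inline void decode_abort_dw10(uint32_t dw10, uint16_t *sqid, uint16_t *cid)
{
    *sqid = (uint16_t)(dw10 & 0xFFFF);
    *cid  = (uint16_t)(dw10 >> 16);
}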

yhqiu16 commented 1 year ago

I mean the CQ entry may be submitted to the wrong address, so the NVMe driver cannot find it. You had better print some information in the firmware and use an ILA to check the waveforms for debugging.

preta21 commented 1 year ago

Along with the lane width modification, I adjusted the AXI Data Width from 128 to 256 in the XDMA IP, and now the controller works fine (apart from some I/O timeout messages from the kernel after I replaced the user_app with the DDR UI). Thanks for your reply!

yhqiu16 commented 1 year ago

You are welcome!