xmos / lib_ethernet

Ethernet MAC library
Other
3 stars 12 forks source link

RX not receiving anymore on fast, large transmissions #13

Closed isengard412 closed 6 years ago

isengard412 commented 7 years ago

The ethernet library stops receiving packets if there is too much time spent in rx.packet_ready() case. This can easily be reproduced by using the 100MBit example and modifying the rx.packet_ready() case. Just add a timer and wait some time (XS1_TIMER_HZ or XS1_TIMER_KHZ is definately long enough) (attatched). The longer you wait, the earlier the error will occur. If you just wait short (for example 100u) everything works fine.

Now send big packages (1500bit) with full speed towards the xmos. If you just wait for 100u, then data is being received with about 98.2Mbit/s. If you spend much more time in the rx.packet_ready() case, then it crashes or becomes unresponsive

If it crashes, then it throws a ET_LOAD_STORE exception. The buffer pointers seem to be wrong in this case. The rx buffer is circular. Normally if the 32 slots are full, then the read_bank_wr_ptr is right behind the read_bank_rd_ptr and new packets are being dropped. If packets are being read, the read_bank_rd_ptr moves on and the read_bank_wr_ptr writes the next packages. The problem is that somehow when the error occours the read_bank_rd_ptr is equal to read_bank_wr_ptr and the buffer is full (next_buffer=-1). In the next read event, mii_lite_get_in_buffer just returns zero as data address. After forwarding the wrong data towards the upper layers, the paket is being released. This is done by mii_lite_free_in_buffer. As the data address is zero it now crashes as this function substracts 8 from this address and tries then to write at this position (0xFFFFFFF8) which is obviously not possible.

A short python3 script is attatched. It just sends random UDP packages towards device. The mii_lite_data_t data is also attatched. mii_lite_data_t

AN00120_100Mbit_ethernet_demo.zip test_eth.zip

pthedinger commented 7 years ago

Could you try the code in the pull request (https://github.com/xmos/lib_ethernet/pull/11) as I have fixed this issue in one case definitely. It might not be the same as this case but it would be good to know whether it is still an issue after applying the pull request.

isengard412 commented 7 years ago

I already tried your pull request, but did sadly not solve my problem. I have spend much time in trying to solve the problem on my own, but did not succeed so far.

pthedinger commented 7 years ago

Ok, thanks we will investigate further.

Redeye92 commented 6 years ago

Has any progress ever been made on this? I seem to be having the same problem - certainly the ethernet mii crashes in exactly the same way. This is turning into a pretty big problem for me as I now have product randomly crashing which has to be power cycled to bring it back online.

The code supplied above reliably reproduces the crash for me. In my products the problem is normally triggered by a PC connecting to multiple XMOS devices at the same time which causes a mini "ARP storm" as they all try to get the IP address of the PC and send a flood of ARP broadcast messages. So it looks to me like the problem is not so much the volume of data but more the number of packets being received while the code is stuck in the rx.packet_ready() case.

pthedinger commented 6 years ago

I'm able to reproduce the issue, but have found that the script you are using to test is not doing what you expect. It seems to always send 36-byte packets because the bytes(iter(payload)) always produces a string representing an iterator. I've changed it to the following:

#!/usr/bin/env python
import socket
import numpy as np

ipaddress = "192.168.0.223"
outport = 15533

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.connect((ipaddress, outport))

payloadsize=1458
bytes_to_send = np.random.bytes(payloadsize)

while(1):
    sock.send(bytes_to_send)

which allows me to control the size being transmitted. I can see reproduce the error regardless of whether the payload is 1 or 1458 bytes.

pthedinger commented 6 years ago

I have added a test case for this failure to the test suite and made a change that fixes the crash for my test case. However, I am not convinced it has fixed all the issues. You can see the commit here and the pull request with all my changes here.

If you could test the fix for your case and let me know whether it works for you.

pthedinger commented 6 years ago

I've tested the fix against your test case and it no longer crashes, but also doesn't receive data. I will have to investigate further.

Redeye92 commented 6 years ago

In my investigations the source of the problem is that there is a path through the mii_commit_buffer function which doesn't alter this.read_bank_wr_ptr, therefore this.read_bank_rd_ptr=this.read_bank_wr_ptr so when mii_lite_get_in_buffer is next called it returns {0,0,0} which cannot be handled in the next call to i_mii.release_packet which then causes the crash. I think the bug is in the mii_commit_buffer function - the run path only hits line 256 in mii_lite_driver.xc immediately before it crashes (ie. this line of code never runs successfully). So I suspect either the logic around this and/or other variables not being modified are the source of the fault, although I don't really understand this code sufficiently to know how to fix it.

pthedinger commented 6 years ago

I don't believe the issue is with the read_bank_wr_ptr not being advanced. The buffer must not advance the pointer as otherwise it would wrap and look empty. It is more that the release_packet doesn't doesn't cope with a null pointer - which is what my fix addresses at the moment. I am now trying to figure out why the buffer then seems to lock up with the low-level driver believing that the buffer is full and the buffer manager believing that the buffer is empty.

pthedinger commented 6 years ago

I believe this is now fixed, please could you test and confirm whether the pull request fixes the issues you are seeing. The problem was that packets were being inserted into the buffer without always recording the fact for the receiver to collect the packet. This ultimately caused the whole system to lock up.

Note: the test case provided will no longer crash but may still receive 0 kbit/s because it has so much back pressure. I've done a reasonable amount of testing of different levels of back pressure and none of them crashed or locked up for me.

Redeye92 commented 6 years ago

Initial testing looks good. I've not got it to crash in my application code yet by creating lots of ARP messages (which was crashing it frequently if not predictably before). I've passed the code on to my customer who's been having the most problems to see if this has indeed fixed them.

pthedinger commented 6 years ago

Thanks for you feedback. I've merged the pull request and hope to make a 3.4.0 release soon. Let us know if there are any other issues encountered.