xapi-project / xcp-networkd

The XCP networking daemon
Other
14 stars 42 forks source link

pif-scan causes a crash when infiniband interface is present. #184

Closed Oleszkiewicz closed 3 years ago

Oleszkiewicz commented 3 years ago

On old (pre 8.0) versions it actually worked, now it crashes with internal error when infiniband interface is present in the system, probably caused by a different size of MAC address. The interface actually works, just crashes pif-scan

edwintorok commented 3 years ago

Can you open a ticket on bugs.xenserver.org and attach the output from xen-bugtool? (attaching logs on github is probably not recommended, because then everyone can see details of your system) How does pif-scan crash, does it exit with error, or do you see a crash in dmesg?

Oleszkiewicz commented 3 years ago

I will, it crashes with internal error, as a result even eth interfaces are not scanned properly when ib is present. Also I can post some logs from a test machine that do not include any private data:

These logs are taken from the same machine :

IB driver unloaded -scan works: https://gist.github.com/Oleszkiewicz/93db24d5ecf48e45c0620fcffac39dc0

IB Driver loaded - scan fails: https://gist.github.com/Oleszkiewicz/f3408bead32282e3f6214c0c434545d6

This happens on xcp-ng 8.0-8.2 on some older version it actually detected IB interfaces properly.

edwintorok commented 3 years ago

Thanks, unfortunately the stacktrace from xcp_networkd is missing (says backtrace table not initialized). There is a backtrace from xapi, but that just stops at the IDL level. We probably need to take a look at the full logs to understand what went wrong here.

Oleszkiewicz commented 3 years ago

hmm i dont get anything more in logs, how can I generate backtrace from xcp_networkd ? any pointers? I will be happy to assist in debugging this

Oleszkiewicz commented 3 years ago

btw. bugs.xenserver.org is down :(

edwintorok commented 3 years ago

Did you use https? https://bugs.xenserver.org/ The http redirect might be down.

robhoes commented 3 years ago

When you say "crash", it seems that you mean that the pif-scan called failed, but not that any daemons exited? The logs on your link suggest that some List.hd call in xcp-networkd failed due to a list that was unexpectedly empty. We should eliminate all List.hd calls and properly handle the empty-list case.

robhoes commented 3 years ago

For example, I don't like the look of the error handling around this one (in spite of what the code comment suggests): https://github.com/xapi-project/xcp-networkd/blob/master/lib/network_utils.ml#L164.

Oleszkiewicz commented 3 years ago

yes, pif-scan failed, daemons live AFAIK. I guess that empty-list case could be one of the problems, however another thing is properly handling the IB different MAC format actually the strange thing is that it was supported properly already and then was broken... Actually it should be pretty straightforward as there is no reason/need to assign IB interfaces to guests / bridging etc, so no "infiniband magic" here. What is important is: 1) be able to see ETH interfaces while IB is present in the system (and handling errors properly should solve this), also this is a heavyweight bug as it causes emergency network reset to come back with no network interfaces if IB is present in the system. 2) be able to see IB interface just to allow it to be added to guest as SR-IOV VF interface. on the OS side - this works, on the xen side - is should also work already (there is nothing different in assigning ib VP port vs eth VF port)... I guess it could be trivial to the person who knows this code well, as mentioned - it was already working well, so I guess some change caused IB present to cause an empty list and -> internal error.

robhoes commented 3 years ago

We've added SR-IOV support some time ago (perhaps 8.0), which caused some changes in how interfaces are discovered. We didn't test this with Infiniband as far as I know, so that may have broken it. I bet it's just a matter of catching some corner cases (especially if it used to work before).

Oleszkiewicz commented 3 years ago

I guess so. I think that if the interface discovery is corrected the SR-IOV should work with IB right out of the box (especially that we actually just pass the PCIe VF device to the guest and the rest is the job of the driver inside the guest.?

lindig commented 3 years ago
utop # String.split_on_char '\n' "";;
- : string list = [""]
robhoes commented 3 years ago

Perhaps not that one then :) Some other candidates are: https://github.com/xapi-project/xcp-networkd/blob/master/lib/network_utils.ml#L485-L487

lindig commented 3 years ago

I agree with the sentiment - we should always use match for these cases and log an error including a code position when the thought-to-be impossible happens.

lindig commented 3 years ago

This is a candidate:

utop # Astring.String.cuts ~empty:false ~sep:"/" "";;
- : string list = []
─( 13:37:25 )─< command 3 >──────────────────────────────────────{ counter: 0 }─
utop # Astring.String.cuts ~empty:true ~sep:"/" "";;
- : string list = [""]
─( 13:37:35 )─< command 4 >──────────────────────────────────────{ counter: 0 }─
utop # 

However, that exception would be handled and "N/A" returned. So false alarm.

lindig commented 3 years ago
utop # find "" "";;
- : string list = []
utop # find "x" "";;
- : string list = []
utop # find "/" "foo";;
- : string list = []
lindig commented 3 years ago

The command is parsing the output of /sbin/ip link show dev /some/device. Are there any surprises in that output for the IB device?

Oleszkiewicz commented 3 years ago

let me check

Oleszkiewicz commented 3 years ago

[15:38 Fenrir ~]# ip link show dev ib0 18: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 256 link/infiniband a0:00:02:20:fe:80:00:00:00:00:00:00:24:be:05:ff:ff:cb:db:b1 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff

lindig commented 3 years ago

My suspicion: "link/ether" is not found because it is "link/infiniband"

Oleszkiewicz commented 3 years ago

1) this should not kill the whole scan process, at best it should cause the device to be undetected. 2) if this is the culprit it should be pretty simple to repair this I guess and allow xcp-networkd to see ib interfaces again ?

lindig commented 3 years ago

I agree. The problem is the uncaught exception which kills the scan process. The code relies on always finding link/ether, which is wrong. This is unconfirmed at this point but I have created an internal ticket:

lindig commented 3 years ago

A potential fix - not yet too happy about it: https://github.com/xapi-project/xcp-networkd/pull/185; it also lacks testing. I have to find out why we haven't seen this problem in our tests.

Oleszkiewicz commented 3 years ago

how could I test this on the server in the fastest way :)? Hopefully without rebuilding everything from source :)?

lindig commented 3 years ago

I am building this on top of our code base but it would be backported to CH 8.2 at least. I can build a binary for CH 8.2 for you and share it - but that would be on Monday.

Oleszkiewicz commented 3 years ago

This would be awesome :) Thank you and waiting for Monday then.

Oleszkiewicz commented 3 years ago

any possibility to try it out on 8.2 and see if any problems arise?

lindig commented 3 years ago

Closing this. Infiniband hardware is unsupported and I understand that this is not fully resolved but we simply don't have the capacity to work on this and also lack the hardware to test it.