Open Oleszkiewicz opened 5 years ago
Is there anything in the output of dmesg and in other logs such as daemon.log?
dmesg is clear of errors; it just shows the driver properly initializing the Physical Function (which works, by the way), and the VFs are initialized properly too.
(The VFs are hidden by xen-pciback, so the driver does not touch them.) Without xen-pciback the same problem (xapi restarting) persists.
The daemon log shows a repeated pattern:
Mar 26 11:30:00 Alpha xapi-init[18247]: Starting xapi:
Mar 26 11:30:00 Alpha systemd[1]: Started Firstboot actions.
Mar 26 11:30:01 Alpha systemd[1]: Started Session 181 of user root.
Mar 26 11:30:01 Alpha systemd[1]: Starting Session 181 of user root.
Mar 26 11:30:01 Alpha systemd[1]: Started Session 182 of user root.
Mar 26 11:30:01 Alpha systemd[1]: Starting Session 182 of user root.
Mar 26 11:30:02 Alpha message-switch[887]: main: [ info|message-switch] Session xapi:16440 cleaning up
Mar 26 11:30:05 Alpha systemd[1]: xapi.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 26 11:30:05 Alpha xapi-init[18546]: Stopping xapi: [ OK ]
Mar 26 11:30:05 Alpha systemd[1]: Unit xapi.service entered failed state.
Mar 26 11:30:05 Alpha systemd[1]: xapi.service failed.
Mar 26 11:30:05 Alpha systemd[1]: xapi.service holdoff time over, scheduling restart.
Mar 26 11:30:05 Alpha systemd[1]: Cannot add dependency job for unit qemuback.service, ignoring: Unit qemuback.service failed to load: No such file or directory.
Mar 26 11:30:05 Alpha systemd[1]: Cannot add dependency job for unit qemuback.service, ignoring: Unit qemuback.service failed to load: No such file or directory.
Mar 26 11:30:05 Alpha systemd[1]: Stopping Firstboot actions...
Mar 26 11:30:05 Alpha systemd[1]: Starting XenAPI server (XAPI)...
Mar 26 11:30:05 Alpha systemd[1]: Started XenAPI server (XAPI).
Mar 26 11:30:05 Alpha systemd[1]: Starting Firstboot actions...
Mar 26 11:30:05 Alpha xapi-init[18562]: Starting xapi:
Mar 26 11:30:06 Alpha systemd[1]: Started Firstboot actions.
Mar 26 11:30:07 Alpha message-switch[887]: main: [ info|message-switch] Session xapi:16740 cleaning up
Mar 26 11:30:10 Alpha systemd[1]: xapi.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 26 11:30:11 Alpha xapi-init[18846]: Stopping xapi: [ OK ]
Mar 26 11:30:11 Alpha systemd[1]: Unit xapi.service entered failed state.
Mar 26 11:30:11 Alpha systemd[1]: xapi.service failed.
Mar 26 11:30:11 Alpha systemd[1]: xapi.service holdoff time over, scheduling restart.
Mar 26 11:30:11 Alpha systemd[1]: Cannot add dependency job for unit qemuback.service, ignoring: Unit qemuback.service failed to load: No such file or directory.
Mar 26 11:30:11 Alpha systemd[1]: Cannot add dependency job for unit qemuback.service, ignoring: Unit qemuback.service failed to load: No such file or directory.
Mar 26 11:30:11 Alpha systemd[1]: Stopping Firstboot actions...
Mar 26 11:30:11 Alpha systemd[1]: Starting XenAPI server (XAPI)...
Mar 26 11:30:11 Alpha systemd[1]: Started XenAPI server (XAPI).
Mar 26 11:30:11 Alpha systemd[1]: Starting Firstboot actions...
Mar 26 11:30:11 Alpha xapi-init[18862]: Starting xapi:
Mar 26 11:30:11 Alpha systemd[1]: Started Firstboot actions.
Mar 26 11:30:13 Alpha message-switch[887]: main: [ info|message-switch] Session xapi:17040 cleaning up
Any ideas on how to debug this deeper? "INTERNAL_ERROR: [ Not_found ]" is not very helpful.
The logs give the backtrace, so you may want to explore the code of xen-api at https://github.com/xapi-project/xen-api/, choosing the tag that corresponds to the version of the xapi RPM in XCP-ng.
Apart from this the logs are not very useful indeed, except that it looks like the XAPI service fails to connect to its database.
I am not actually a developer, but this seems very strange, and the bug itself must be rather unusual. Why would the database fail with SR-IOV Virtual Functions enabled, but work correctly when they are disabled? Is there any way to check where this actually fails, debug the database, or verify why it dies without giving a meaningful error? Isn't the "Not_found" error a sign that this failure was not foreseen, and thus there is no proper error message for it?
I did some more digging into the issue and enabled database write logging. The detailed xensource.log looks like this:
Mar 30 21:00:24 Alpha xapi: [debug|Alpha|0 |dbsync (update_env) R:9072bd1e6610|xapi] PCI 0000:03:01.4, Mellanox Technologies, MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function] created
Mar 30 21:00:24 Alpha xapi: [debug|Alpha|0 |dbsync (update_env) R:9072bd1e6610|db_write] create_row PCI (OpaqueRef:2558881f-9933-4f4e-81ef-af3cc2ecdbcc) [(_ref,v),(uuid,v),(class_id,v),(class_name,v),(vendor_id,v),(vendor_name,v),(device_id,v),(device_name,v),(host,v),(pci_id,v),(functions,v),(physical_function,v),(dependencies,v),(other_config,v),(subsystem_vendor_id,v),(subsystem_vendor_name,v),(subsystem_device_id,v),(subsystem_device_name,v),(scheduled_to_be_attached_to,v),(driver_name,v)]
Mar 30 21:00:24 Alpha xapi: [debug|Alpha|0 |dbsync (update_env) R:9072bd1e6610|xapi] PCI 0000:03:01.5, Mellanox Technologies, MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function] created
Mar 30 21:00:24 Alpha xapi: [debug|Alpha|0 |dbsync (update_env) R:9072bd1e6610|db_write] create_row PCI (OpaqueRef:2cc30358-ed61-4030-93fb-1dcbbf4c5919) [(_ref,v),(uuid,v),(class_id,v),(class_name,v),(vendor_id,v),(vendor_name,v),(device_id,v),(device_name,v),(host,v),(pci_id,v),(functions,v),(physical_function,v),(dependencies,v),(other_config,v),(subsystem_vendor_id,v),(subsystem_vendor_name,v),(subsystem_device_id,v),(subsystem_device_name,v),(scheduled_to_be_attached_to,v),(driver_name,v)]
Mar 30 21:00:24 Alpha xapi: [debug|Alpha|0 |dbsync (update_env) R:9072bd1e6610|xapi] PCI 0000:03:01.6, Mellanox Technologies, MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function] created
Mar 30 21:00:24 Alpha xapi: [debug|Alpha|0 |dbsync (update_env) R:9072bd1e6610|redo_log] WriteField(task, OpaqueRef:9072bd1e-6610-4181-8ac5-c7387792280b, progress, 0, 1)
Mar 30 21:00:24 Alpha xapi: [debug|Alpha|0 |dbsync (update_env) R:9072bd1e6610|redo_log] WriteField(task, OpaqueRef:9072bd1e-6610-4181-8ac5-c7387792280b, error_info, (), ('INTERNAL_ERROR' 'Not_found'))
Mar 30 21:00:24 Alpha xapi: [debug|Alpha|0 |dbsync (update_env) R:9072bd1e6610|redo_log] WriteField(task, OpaqueRef:9072bd1e-6610-4181-8ac5-c7387792280b, backtrace, (), (((process"xapi @ Alpha")(filename list.ml)(line 214))((process"xapi @ Alpha")(filename ocaml/xapi/xapi_pci.ml)(line 218))((process"xapi @ Alpha")(filename list.ml)(line 82))((process"xapi @ Alpha")(filename ocaml/xapi/xapi_pci.ml)(line 216))((process"xapi @ Alpha")(filename ocaml/xapi/xapi_pci.ml)(line 227))((process"xapi @ Alpha")(filename ocaml/xapi/dbsync_slave.ml)(line 239))((process"xapi @ Alpha")(filename ocaml/xapi/dbsync_slave.ml)(line 305))((process"xapi @ Alpha")(filename ocaml/xapi/dbsync.ml)(line 63))((process"xapi @ Alpha")(filename ocaml/xapi/server_helpers.ml)(line 80))))
Mar 30 21:00:24 Alpha xapi: [debug|Alpha|0 |dbsync (update_env) R:9072bd1e6610|redo_log] WriteField(task, OpaqueRef:9072bd1e-6610-4181-8ac5-c7387792280b, finished, 19700101T00:00:00Z, 20190330T20:00:24Z)
Mar 30 21:00:24 Alpha xapi: [debug|Alpha|0 |dbsync (update_env) R:9072bd1e6610|redo_log] WriteField(task, OpaqueRef:9072bd1e-6610-4181-8ac5-c7387792280b, status, pending, failure)
Mar 30 21:00:24 Alpha xapi: [debug|Alpha|0 |dbsync (update_env) R:9072bd1e6610|db_write] delete_row task (OpaqueRef:9072bd1e-6610-4181-8ac5-c7387792280b)
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] dbsync (update_env) R:9072bd1e6610 failed with exception Not_found
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] Raised Not_found
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] 1/15 xapi @ Alpha Raised at file list.ml, line 214
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] 2/15 xapi @ Alpha Called from file ocaml/xapi/xapi_pci.ml, line 218
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] 3/15 xapi @ Alpha Called from file list.ml, line 82
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] 4/15 xapi @ Alpha Called from file ocaml/xapi/xapi_pci.ml, line 216
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] 5/15 xapi @ Alpha Called from file ocaml/xapi/xapi_pci.ml, line 227
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] 6/15 xapi @ Alpha Called from file ocaml/xapi/dbsync_slave.ml, line 239
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] 7/15 xapi @ Alpha Called from file ocaml/xapi/dbsync_slave.ml, line 305
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] 8/15 xapi @ Alpha Called from file ocaml/xapi/dbsync.ml, line 63
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] 9/15 xapi @ Alpha Called from file ocaml/xapi/server_helpers.ml, line 80
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] 10/15 xapi @ Alpha Called from file hashtbl.ml, line 194
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] 11/15 xapi @ Alpha Called from file lib/debug.ml, line 92
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] 12/15 xapi @ Alpha Called from file ocaml/xapi/server_helpers.ml, line 99
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] 13/15 xapi @ Alpha Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] 14/15 xapi @ Alpha Called from file hashtbl.ml, line 194
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace] 15/15 xapi @ Alpha Called from file lib/debug.ml, line 92
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |starting up database engine D:7d8677d911f0|backtrace]
Mar 30 21:00:24 Alpha xapi: [debug|Alpha|0 |starting up database engine D:7d8677d911f0|dbsync] dbsync caught an exception: INTERNAL_ERROR: [ Not_found ]
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] starting up database engine D:7d8677d911f0 failed with exception Not_found
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] Raised Not_found
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 1/19 xapi @ Alpha Raised at file lib/debug.ml, line 240
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 2/19 xapi @ Alpha Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 3/19 xapi @ Alpha Called from file lib/backtrace.ml, line 114
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 4/19 xapi @ Alpha Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 5/19 xapi @ Alpha Called from file ocaml/xapi/dbsync.ml, line 75
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 6/19 xapi @ Alpha Called from file hashtbl.ml, line 194
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 7/19 xapi @ Alpha Called from file lib/debug.ml, line 92
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 8/19 xapi @ Alpha Called from file ocaml/xapi/dbsync.ml, line 80
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 9/19 xapi @ Alpha Called from file ocaml/xapi/xapi.ml, line 102
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 10/19 xapi @ Alpha Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 11/19 xapi @ Alpha Called from file lib/backtrace.ml, line 114
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 12/19 xapi @ Alpha Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 13/19 xapi @ Alpha Called from file ocaml/xapi/server_helpers.ml, line 80
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 14/19 xapi @ Alpha Called from file string.ml, line 118
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 15/19 xapi @ Alpha Called from file sexp.ml, line 112
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 16/19 xapi @ Alpha Called from file ocaml/xapi/server_helpers.ml, line 99
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 17/19 xapi @ Alpha Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 18/19 xapi @ Alpha Called from file hashtbl.ml, line 194
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace] 19/19 xapi @ Alpha Called from file lib/debug.ml, line 92
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 |server_init D:319630a43329|backtrace]
Mar 30 21:00:24 Alpha xapi: [ warn|Alpha|0 |server_init D:319630a43329|startup] task [starting up database engine] exception: Not_found
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 ||backtrace] server_init D:319630a43329 failed with exception Not_found
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 ||backtrace] Raised Not_found
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 ||backtrace] 1/1 xapi @ Alpha Raised at file (Thread 0 has no backtrace table. Was with_backtraces called?, line 0
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 ||backtrace]
Mar 30 21:00:24 Alpha xapi: [debug|Alpha|0 ||xapi] xapi top-level caught exception: INTERNAL_ERROR: [ Not_found ]
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 ||backtrace] Raised Not_found
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 ||backtrace] 1/1 xapi @ Alpha Raised at file (Thread 0 has no backtrace table. Was with_backtraces called?, line 0
Mar 30 21:00:24 Alpha xapi: [error|Alpha|0 ||backtrace]
Now it looks like the problem is in xapi_pci.ml, in the code that updates the dependencies:
    let update_dependencies pfs =
      let rec update = function
        | [] -> ()
        | (pref, prec, pci, _) :: remaining ->
          let dependencies = List.map
              (fun address ->
                 (* List.find raises Not_found when no entry of pfs matches *)
                 let r, _, _, _ = List.find (fun (_, rc, _, _) -> rc.Db_actions.pCI_pci_id = address) pfs
                 in r)
              pci.related
          in
          Db.PCI.set_dependencies ~__context ~self:pref ~value:dependencies;
          update remaining
      in
      update pfs
    in
    update_dependencies pfs;
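The bare Not_found in the log is consistent with the List.find call in this excerpt: the OCaml standard library's List.find raises Not_found, with no further context, when no element matches. Below is a minimal sketch of that behaviour, together with a List.find_opt variant that could report which address was missing. The addresses, the pfs list and the lookup helpers are made up for illustration and are not xapi code.

```ocaml
(* Hypothetical stand-in for the list of physical functions known to xapi:
   (reference, PCI address) pairs. The values are illustrative only. *)
let pfs = [ ("ref-pf0", "0000:03:00.0") ]

(* Same lookup shape as in the excerpt: List.find raises Not_found
   when no element satisfies the predicate. *)
let lookup address =
  let r, _ = List.find (fun (_, pci_id) -> pci_id = address) pfs in
  r

(* Variant using List.find_opt (OCaml >= 4.05) that reports which
   address was missing instead of raising a bare Not_found. *)
let lookup_result address =
  match List.find_opt (fun (_, pci_id) -> pci_id = address) pfs with
  | Some (r, _) -> Ok r
  | None -> Error (Printf.sprintf "no PCI record for address %s" address)

let () =
  (* An address that is not in the list of physical functions. *)
  let vf = "0000:03:01.4" in
  (match lookup vf with
   | r -> print_endline ("found " ^ r)
   | exception Not_found -> print_endline "Not_found (no further detail)");
  match lookup_result vf with
  | Ok r -> print_endline ("found " ^ r)
  | Error msg -> print_endline msg
```

Whether xapi should report the missing address this way is a design question for its maintainers; the sketch only shows why the logged error carries so little information.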
Is anyone here able to help solve this problem?
@stormi as for the database, it is a plain XML file that just sits there; there is nothing to really "connect" to. If I remove state.db it is recreated, but not fully: the process dies before actually writing the PCI device info to the file.
If you have the opportunity to test it with XenServer 7.6, that would allow us to know if the problem is specific to XCP-ng or not, and if it isn't we could open a bug so that XenServer devs may have a look if they are interested in the issue.
Also, I haven't asked, but that Mellanox card is not used as the management interface of the host, is it?
Unfortunately I don't have XenServer 7.6, and the free version does not allow SR-IOV, AFAIK. I might try with a trial, though. The problem could be that SR-IOV functionality for Mellanox is not on the HCL.
It is, actually: the management interface runs on the physical function. It's a dual-port 40/56 GbE card, so the bandwidth is more than enough. Is this a problem for XAPI?
Disclaimer: I have never used SR-IOV myself.
I don't know about the state in recent releases of XenServer (and thus XCP-ng), however this article says "The setup procedure requires that the 10 GigE NIC not be used as the management interface for the host. A second physical NIC must be installed on the system for that purpose." https://support.citrix.com/article/CTX126624
About Mellanox not being in the HCL: this means that we probably won't be able to raise a bug on bugs.xenserver.org. It could work, or it could not. What's certain is that it is not tested by Citrix teams.
You may want to try the forum to see if anyone ever made it work or has useful tips.
https://support.citrix.com/article/CTX126624 relates to ancient XenServer versions. Anyway, IMHO this issue is actually a bug: XAPI shouldn't just die with a meaningless error message.
Looking at the code I guess there is a problem in the functions that create dependencies?
I might try to set this up on another machine with a similar (but not identical) Mellanox NIC that has separate management interfaces. That second machine is a Dell, so if it fails there too we could rule out a dependency on a particular server vendor or NIC model, and possibly a dependency on the existence of a separate management interface.
I will also verify whether this kind of setup works with oVirt / KVM, as I was planning to do some performance comparisons and testing on that too.
Mellanox is on the HCL, but only the Ethernet function is mentioned, not SR-IOV.
The xapi death cycle should be fixed with this commit: xapi-project/xen-api@0ec5f94 (https://github.com/xapi-project/xen-api/commit/0ec5f94)
The problem with the Mellanox ConnectX-3 cards is as follows:
Previously, related PCI devices were defined as different devices on the same bus that share the same address except for their device number (00:00:X). Virtual functions on Mellanox ConnectX-3 cards are enumerated on the same bus as the physical function; thus, all virtual functions are identified as being related to the physical device.
When updating the PCI DB of physical functions and setting their related devices, the list of devices is first split into two lists, one of only physical and one of only virtual devices (https://github.com/xapi-project/xen-api/blob/83d4680940fbcb0e4f3302a9e64531e1467e2266/ocaml/xapi/xapi_pci.ml#L216-L219). The related devices are then looked up in the list of physical PCI devices (https://github.com/xapi-project/xen-api/blob/83d4680940fbcb0e4f3302a9e64531e1467e2266/ocaml/xapi/xapi_pci.ml#L230). That lookup fails, because the related devices determined earlier are virtual and have been filtered out of the list being searched, and this crashes xapi.
The referenced pull request narrows the definition of related devices so that virtual functions are excluded, which prevents the behavior described above.
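To make the failure mode concrete, here is a simplified sketch of the split-then-lookup logic described above, and of one way a narrowed definition avoids the crash. It is a toy model under stated assumptions: the record type, its field names, the addresses and both helper functions are hypothetical and do not reproduce the real xapi types or the actual fix.

```ocaml
(* Simplified, hypothetical model of a PCI record; field names are
   illustrative and do not match xapi's real types. *)
type pci = { address : string; is_virtual : bool; related : string list }

(* One physical function whose related devices are all virtual functions. *)
let devices =
  [ { address = "0000:03:00.0"; is_virtual = false;
      related = [ "0000:03:00.1"; "0000:03:00.2" ] };
    { address = "0000:03:00.1"; is_virtual = true; related = [] };
    { address = "0000:03:00.2"; is_virtual = true; related = [] } ]

(* Split into physical and virtual functions, as described above. *)
let pfs, vfs = List.partition (fun d -> not d.is_virtual) devices

(* Old behaviour: every related address is looked up among the physical
   functions only, so a related VF makes List.find raise Not_found. *)
let dependencies_old pf =
  List.map (fun addr -> List.find (fun d -> d.address = addr) pfs) pf.related

(* Narrowed behaviour: drop related addresses that belong to virtual
   functions before the lookup, so a VF can no longer cause a miss. *)
let dependencies_narrowed pf =
  let is_vf addr = List.exists (fun d -> d.address = addr) vfs in
  pf.related
  |> List.filter (fun addr -> not (is_vf addr))
  |> List.map (fun addr -> List.find (fun d -> d.address = addr) pfs)

let () =
  let pf = List.hd pfs in
  (match dependencies_old pf with
   | deps -> Printf.printf "old: %d dependencies\n" (List.length deps)
   | exception Not_found -> print_endline "old: Not_found");
  Printf.printf "narrowed: %d dependencies\n"
    (List.length (dependencies_narrowed pf))
```

In this model the old lookup raises Not_found for the physical function, while the narrowed variant returns an empty dependency list.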
Is it already included in the hypervisor? If yes, from what version? If not, when is it expected?
XCP-ng 8.1 should include the relevant code, as it is based on xen-api v1.214.1 (https://github.com/xapi-project/xen-api/tree/v1.214.1). XCP-ng 8.0 does not include the updated definition (and crashes), as it is based on xen-api v1.160.1 (https://github.com/xapi-project/xen-api/tree/v1.160.1).
Creating, enabling and managing an SR-IOV network through xapi will still not be possible in 8.1. xcp-networkd has to be patched for this; see xapi-project/xcp-networkd/pull/173 (https://github.com/xapi-project/xcp-networkd/pull/173). Manually configuring VFs via modprobe or firmware, or supplying a templated mlx4_core modprobe config, would be a further step to take, but it can be done. (It's working here with patched versions and a templated config. How to integrate a templated config into the release should be discussed soon; I am in the process of filing an issue.)
Is it possible for you to show the steps required/example of how to make it work before xcp-networkd is patched?
As you experienced the xapi death cycle, the virtual functions should be present - so nothing to do here.
As for xcp-networkd, use the code from the linked pull request (repo and branch) and build it using the xcp-ng-build-environment-docker-container. Off the top of my head (no guarantee), I think you have to build and install xapi and xapi-client in the container as well, to satisfy the build dependencies of xcp-networkd. Afterwards, replace the relevant binaries on your xcp-ng installation with the build artifacts; these should be (again off the top of my head) networkd.exe as xcp-networkd and networkd_db.exe as networkd_db. If you are on xcp-ng <= 8.0, you would have to replace the xapi binary as well, to get rid of the death cycle.
This is without guarantee and I am no xcp-ng/xapi dev! These were the minimal steps I took to evaluate and try to fix this specific problem, with network adapters that do not fully implement the sysfs interface for configuring virtual functions! It would probably be advisable to build the whole xapi stack from source, including the fixes, for something near production use!
Now that XCP-ng 8.1 has been released, could you test it?
I will; however, due to the current situation I am out of the office.
Best, Piotr
Hi! I'm still interested in test results using XCP-ng 8.1 or above, since it's supposedly fixed there.
In 8.2 this works properly: SR-IOV can be enabled, and an SR-IOV network can be created and assigned to a VM. However, when I create an SR-IOV network on one port of a dual-port card and assign it, both ports are made available with SR-IOV (this is an acceptable outcome). If I create two SR-IOV networks, one on each port, and assign just one of them, the number of available VFs is lowered on both NICs. If I create a bond on two SR-IOV capable interfaces, the bond is marked as not SR-IOV capable (yet I can still create SR-IOV networks on both ports individually).
FYI I just purchased this card to make various tests. I'll be able to use it to make also XCP-ng tests with it :+1:
That's great.
I am not sure how experienced you are with InfiniBand; to have InfiniBand running you will need an active SM (subnet manager) on the network. The good thing is that there is a software implementation (opensm), which is part of upstream CentOS (and works well enough). It is also possible to connect IB port to IB port directly without a switch 😊
If you have any questions regarding this, I will be happy to help.
I have zero InfiniBand experience, and I planned to test and flash the cards in Ethernet mode only.
See this excellent how to by @Fohdeesha https://forums.servethehome.com/index.php?threads/flashing-stock-mellanox-firmware-to-oem-emc-connectx-3-ib-ethernet-dual-port-qsfp-adapter.20525/#post-198015
It would be great if xcp-ng could at least keep its tools from crashing or generating errors when IB interfaces are present; with SR-IOV working, the rest can be done on the guest side. IB would work a bit better for multiple SR-IOV enabled guests in redundancy situations, as LACP interface bonding is SR-IOV unaware, which makes it impossible to bond interfaces connected to more than one switch for more than one guest, even if the switches support multi-switch LACP (they don't expect more than one peer on an LACP link). In InfiniBand, however, an SM can route packets based on LIDs, and if the SM is SR-IOV aware (like recent Mellanox SMs) it opens the door wide to redundant high-performance datacenter infrastructure (like hyperconverged RDMA-based SDN plus a container/VM host on one node, with all functions benefiting from SR-IOV/RDMA low latency) :)
You can imagine we don't want to make anything crash. We'll test in ethernet mode to see if we can reproduce or not, at first. Then, we'll see what can be done next.
In Ethernet mode 8.2 works OK :) In InfiniBand or mixed mode xapi stays alive, but rescanning interfaces fails.
Environment:
XCP-ng 7.6, fresh install, updated
HP ProLiant DL380p Gen8
Mellanox ConnectX-3 Pro card
When I enable SR-IOV on the Mellanox ConnectX-3 Pro card (HP ProLiant DL380p Gen8), the XAPI process restarts constantly.
I checked with the stock drivers and the newest Mellanox drivers, with the standard and the experimental kernel, and with xapi-core and xapi-xe from the updates_testing repository. No change.
When I disable the creation of virtual functions in the driver, everything works correctly again.
the xensource.log shows: