xen-troops / zephyr-xenlib

Set of Zephyr libs and applications that provides basic Dom0 services
Apache License 2.0
0 stars 10 forks source link

xen-dom-mgmt: Fix potential concurrency issues #72

Closed Deedone closed 1 month ago

Deedone commented 5 months ago

Add refcounting to the domain structure to make it thread-safe. Fix possible deadlock in console deinit.

GrygiriiS commented 4 months ago

Huh. lock on lock on lock + ref count. It looks very over-complicated and over-designed. In my strong opinion, the locking is the problem of Zephyr app and not a library. More over, domain management/control process should be simple and straight forward in Zephyr app which definitely should not require to have ref_count at minimum. And I think It all should just work with one global lock.

lorc commented 4 months ago

domain_destroy routine gets a reference and waits until it's reference is the only one before destroying the domain.

A am not sure I got this logic. domain_destroy should put down one reference. Real domain destruction should be performed in put_domain when reference counter reaches zero. Take a look at linux kref as example.

Deedone commented 4 months ago

Moved freeing to put_domain and removed the per-domain lock.

GrygiriiS commented 4 months ago

@Deedone Could you send first patch alone as it's definitely fix "xen_console: Fix deadlock in xen_stop_domain_console()".

Regarding protection - My point stands, per-doms locks and ref_count is over complications and just domctrl_global_lock and global_console_lock should be enough. Domain operation are heavy weight and time consuming and definitely it's not required to perform any domctrl operations even a dozen times per sec. More over, any domctrl ops (even more any call into Xen) are not Real-time compatible as can't be estimated). Any "performance" optimizations without strict use-case or open issue is, unfortunately, usually just waste of time.

lorc commented 4 months ago

Moved freeing to put_domain and removed the per-domain lock.

What I meant, is that whole domain_destroy should be triggered by put_domain(). Because current implementation has no sense: someone may be holding domain reference and trying to do something with it, while you are stopping console, destroying xenstore entries and so on. There is no sense in having reference for a domain, when it basically is destroyed already. At least I see no use case for this. Maybe I am wrong.

I believe @GrygiriiS is right and we have no use for reference counting for now. Maybe global domain_lock will be sufficient. On other hand, we had global pci_lock in Xen and it made lots and lots headache when we tried implement more complex logic in PCI subsystem... So I am a little hesitant about introducing huge global locks.

Anyways, @firscity is maintaining this project and he knows it better, so last word is his.

In the meantime, you can have my:

Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> for the first two patches. Could you please create a separate PR with at least first one, so we can merge it faster?

firscity commented 4 months ago

There is no sense in having reference for a domain, when it basically is destroyed already.

Yeah, this part should be changed as you suggested before.

So I am a little hesitant about introducing huge global locks.

We already have at least 3 ways to create and manage domains - directly from application via zephyr-xenlib, via xrun json configurations and via Zephyr shell command. I think that global locking (altogether with all other locks in console, xenstore etc.) will definitely bring more problems and will be less flexible than refcounting. I may be wrong, but I think suggestion with decrementing refcount inside destroy_domain() and moving its actual logic to put_domain() should work fine.

Deedone commented 1 month ago

Added assert, fixed checkpatch