xapi-project / xen-api

The Xapi Project's XenAPI Server
http://xenproject.org/developers/teams/xapi.html
Other
346 stars 283 forks source link

CP-51574: Add explicit reentrant lock to Db_lock #6011

Closed contificate closed 1 week ago

contificate commented 1 week ago

Update Db_lock to use an explicit reentrant (recursive) mutex pattern.

This is a pretty standard implementation. The way it works is that threads compete to become the "holder" of the lock. The current holder of the lock is identified using thread identifiers (integers) and read/written from/to atomically (by way of Atomic.t). If attempting to acquire the lock (i.e. atomically compare_and_set the holder, with the expected current value as physically equal to None) fails, then the thread begins to wait. Waiting threads wait using a condition variable to avoid busy waiting. The lock can be acquired several times after it is acquired by the calling thread, but releasing it must ensure that every lock is matched by a corresponding unlock (the thread must relinquish every "hold" it has on the lock). Upon successfully releasing its holds on the lock, the thread that is relinquishing control resets the holder (to None) and uses the condition variable to signal a waiting thread to wakeup (and attempt to become the holder).

To make this pattern safe and not expose too many details, an interface file (db_lock.mli) is introduced. This interface does not expose the details of the underlying lock, but rather the single function:

val with_lock : (unit -> 'a) -> 'a

which ensures the invocation of its argument is properly sandwiched between code that correctly acquires and releases the lock (taking care to avoid holding onto the lock during exceptional circumstances, e.g. by way of an exception that is unhandled). Code written to use the database lock currently only use this interface to acquire the Db_lock, so no other changes are required.

A follow-up change may be to enforce the same usage pattern for the global database flush lock, which is seemingly only used in the same way.


Ring3: BST+BVT (205351) all looks fine (so far) except an SDK-related test, which appears to be a known issue. I have ran the same suites on the same code a few times, with everything green except known issues.

This is really a code cleanup effort, as opposed to new functionality. The current code is an ad-hoc implementation of this idea.

The hope is that the style moving forward would be to use atomics and locks, so that we can begin to tackle things that would not be domain-safe in OCaml 5. For example, the current implementation queries allow_thread_through_dbcache_mutex (a ref) non-atomically.

Some code could probably be moved around but I'm unwilling to do any serious functional changes as it would be rather easy to introduce careless errors. This is the kind of simple implementation that you will find in textbooks.

last-genius commented 1 week ago

Any particular reason why this is a draft?

contificate commented 1 week ago

Any particular reason why this is a draft?

Made the PR yesterday before the suite had finished executing.

Looking at the results, there's a few spurious failures:

One of the machines is now marked as broken and the rest have extant tickets.


I'm fairly confident that the code does the right thing, as it's a simple implementation. I am waiting to see if a unit test should be written that properly exercises this implementation. I suspect there's a lot of code in xapi that only does what's intended because the threads are unlikely to ever interleave in a way to expose races.

contificate commented 1 week ago

Replaced Unix.gettimeofday with Mtime_counter.counter counting - the elapsed span is still reified as a float (as the DB statistics interface expects that).