xrootd / xrootd-python

Python bindings for XRootD are now part of the main repository.
https://github.com/xrootd/xrootd
6 stars 8 forks source link

Asynchronous callbacks ridiculously hard to use #13

Open bbockelm opened 8 years ago

bbockelm commented 8 years ago

It appears that it's incredibly difficult to use the asynchronous callback mechanism due to lock ordering issues:

Hence, it's quite straightforward to deadlock.

In order to actually use the asynchronous callbacks, you must be able to guarantee there is no call to the Xrootd module from any python threads as long as there are pending callbacks. It's a pretty high barrier.

Would it be possible to invoke callbacks from a separate thread pool which takes the python GIL first?

bbockelm commented 8 years ago

@ljanyst @jlsalmon - what are your thoughts on a PR which removes the asynchronous capabilities from the python client?

I haven't found any way to actually use these safely. Take the async example shipped with the code:

def callback( status, response, hostlist ):
  print "Called:", status, response, hostlist

with client.File() as f:
  status, response = f.open('root://localhost//tmp/eggs', OpenFlags.DELETE)
  status = f.fcntl( 'asdf', callback = callback )
  sleep(20)

This will deadlock if fcntl takes exactly 20 seconds:

  1. The main python thread will hold the GIL and close the XrdFile object, which takes an XrdCl mutex.
  2. The Xrootd asynchronous callback pool will take a XrdCl mutex and grab the GIL.

Hence, even this trivial example can lead to a deadlock! (Of course, as I mentioned, any usage of the asynchronous callbacks can lead to a deadlock, so this is maybe not so surprising)

bbockelm commented 8 years ago

Note the AsyncResponseHandler suffers from the same issue when used like this:

  f = client.File()
  handler = AsyncResponseHandler()
  status = f.open(smallfile, OpenFlags.READ, callback=handler)
  status, response, hostlist = handler.wait()

This is because the callback is defined as:

  def __call__(self, status, response, hostlist):
    self.status = status
    self.response = response
    self.hostlist = hostlist
    self.mutex.release()

The file above may be closed in the main process any time after mutex.release() is called; however, the callback will still need to hold the GIL to exit the function. It would be quite hard to trigger in this case, however.

ljanyst commented 8 years ago

It's up to @simonmichal these days, although he may not be subscribed here because this package has been merged with the main one some time ago.

I'll try to give it some thought in the background as well.

bbockelm commented 8 years ago

Ah, ok, I didn't realize it was merged in.

Thinking more on the subject, I don't see any reason why the GIL needs to be held when making the calls into the XrdCl library. I think if you can wrap all the calls into the library begin / end allow threads. I.e.,

      Py_BEGIN_ALLOW_THREADS
      status = self->file->Open( url, flags, mode, timeout );
      Py_END_ALLOW_THREADS

Since only one mutex is held at a time, we don't get a lock inversion.

ljanyst commented 8 years ago

Sounds right. There is no need to hold GIL during XrdCl calls. But your other post sounds suspicious. There should be no XrdCl locks held during callbacks.

bbockelm commented 8 years ago

@ljanyst - Yeah, we've hit this a few times in CMS, so this part was unremarkable. Here's the issues I know of:

ljanyst commented 8 years ago

The handling of error callbacks is a known issue. I am sure Michał will fix all these :)