xapi-project / vhd-tool

Command-line tools for streaming and manipulating vhd format data
Other
8 stars 26 forks source link

Wait for FD availability instead of time #68

Closed nraynaud closed 6 years ago

nraynaud commented 6 years ago

We have customers whose data can't be copied in only 20 tries ( https://bugs.xenserver.org/browse/XSO-873 ) Moreover I see no better response to this error code than trying again, I would suggest removing the limit altogether (if it doesn't work eventually the OS will come with a new error like a closed socket or timeout of some sort). You might want to slightly randomize the sleep duration and maybe increase it over time.

edwintorok commented 6 years ago

Thanks for your contribution. There has been a discussion around EAGAIN at the time the original code change was merged, see https://github.com/xapi-project/vhd-tool/pull/54/files/1ec2644866294bae907e53094796b678e8c6306d#diff-050ec8f31ffe4696b6d57c5f510862d5 According to the manpage sendfile would return EAGAIN if "Nonblocking I/O has been selected using O_NONBLOCK and the write would block.", but in our code we do set the file descriptor to blocking mode in https://github.com/xapi-project/vhd-tool/blob/master/src/channels.ml#L49-L63, so this seems odd, should probably confirm with strace what the actual mode of the file descriptor is.

There are two other approaches that could be investigated instead of just bumping the number of retries:

olivierlambert commented 6 years ago

I completely agree that removing the max attempts would be far more elegant. However, we had to integrate this change ASAP because it broke every VDI export for XS and XCP-ng 7.5 users. Which is rather critical.

Note that we observed no export performance impact of this "quick and dirty" patch.

What would be a good compromise for you to get something that doesn't break exports for everyone and that could be merged relatively fast? (for us, raising the number of retry proved to be enough at short term solution)

edwintorok commented 6 years ago

You could try my second suggestion and in addition to the commit increasing max_attempts also replace Lwt_unix.sleep 0.1 here:

         | Unix.(Unix_error (EAGAIN, _, _)) when remaining_attempts > 0 ->
              Lwt_unix.sleep 0.1 >>= fun () ->

with something like:

Lwt_unix.wait_read from_fd >>= fun () ->
Lwt_unix.wait_write to_fd >>= fun () ->

If you got an environment where you can reproduce the original problem easily can you check if a change like this fixes the export problems? And then we can look at dropping max_attempts in a separate PR.

olivierlambert commented 6 years ago

Yes, we can easily check that. @nraynaud is on US timezone, but as soon he's back, we'll do some tests :+1:

olivierlambert commented 6 years ago

Just to be sure @edwintorok : should we remove or keep the max_attempts with our high value if we replace sleep by the extra 2 lines you gave have us?

edwintorok commented 6 years ago

I think it should work either way, once you use wait_read/wait_write then sendfile should not return EAGAIN more than once, but I haven't tested this. It would be useful to test just the sleep -> wait_read/wait_write replacement in isolation, and if it doesn't work then test it in combination with the high max_attempts value.

olivierlambert commented 6 years ago

Copy that, we'll do!

nraynaud commented 6 years ago

I have tried you suggestion @edwintorok , it seems to work in our tests, thanks. I had to fiddle a bit with the types to make it work.

nraynaud commented 6 years ago

We'll have this PR undergo another round of tests tomorrow Thursday European time.

olivierlambert commented 6 years ago
  1. I confirm that tests with the current PR content are working
  2. We have still tests to do with a version entirely removing the loop, if it works, that would be better than counting max_attempts :+1:
edwintorok commented 6 years ago

Could you please add a Signed-off-by line to your commits? See https://elinux.org/Developer_Certificate_Of_Origin

jonludlam commented 6 years ago

And rename it before merging :-)

olivierlambert commented 6 years ago

Sure, but first we'll try to give finish testing without the max_attempts loop, then if it works, we'll update the PR. Otherwise, indeed, we'll sign off and rename anyway!

nraynaud commented 6 years ago

This form of the function works in our tests, but 1) it defeats the whole point of non-blocking, 2) if for some reason sendfile wanted to block anyways there would be no backup.

let _sendfile from_fd to_fd len =
  let unix_from_fd = Lwt_unix.unix_file_descr from_fd in
  let unix_to_fd = Lwt_unix.unix_file_descr to_fd in
  Lwt_unix.wait_read from_fd >>= fun () ->
    Lwt_unix.wait_write to_fd >>= fun () ->
      detach (_sendfile unix_from_fd unix_to_fd) len
nraynaud commented 6 years ago

Thanks a lot.