nraynaud closed this pull request 6 years ago.
Thanks for your contribution. There has been a discussion around EAGAIN at the time the original code change was merged, see https://github.com/xapi-project/vhd-tool/pull/54/files/1ec2644866294bae907e53094796b678e8c6306d#diff-050ec8f31ffe4696b6d57c5f510862d5

According to the manpage, sendfile would return EAGAIN if "Nonblocking I/O has been selected using O_NONBLOCK and the write would block.", but in our code we do set the file descriptor to blocking mode in https://github.com/xapi-project/vhd-tool/blob/master/src/channels.ml#L49-L63, so this seems odd; we should probably confirm with strace what the actual mode of the file descriptor is.
There are two other approaches that could be investigated instead of just bumping the number of retries: wrap the call in wrap_syscall, or use Lwt_unix.wait_read and Lwt_unix.wait_write whenever we receive an EAGAIN, to wait until the file descriptor is available again. This would remove the need for a max attempts and Lwt_unix.sleep, and it would also put less stress on the system.

I completely agree that removing the max attempts would be far more elegant. However, we had to integrate this change ASAP because it broke every VDI export for XS and XCP-ng 7.5 users, which is rather critical.
Note that we observed no export performance impact of this "quick and dirty" patch.
What would be a good compromise for you to get something that doesn't break exports for everyone and that could be merged relatively fast? (For us, raising the number of retries proved to be enough as a short-term solution.)
You could try my second suggestion: in addition to the commit increasing max_attempts, also replace the Lwt_unix.sleep 0.1 here:

    | Unix.(Unix_error (EAGAIN, _, _)) when remaining_attempts > 0 ->
      Lwt_unix.sleep 0.1 >>= fun () ->

with something like:

    Lwt_unix.wait_read from_fd >>= fun () ->
    Lwt_unix.wait_write to_fd >>= fun () ->
If you have an environment where you can reproduce the original problem easily, can you check whether a change like this fixes the export problems? Then we can look at dropping max_attempts in a separate PR.
Yes, we can easily check that. @nraynaud is on a US timezone, but as soon as he's back, we'll do some tests :+1:
Just to be sure @edwintorok: should we remove or keep max_attempts with our high value if we replace the sleep with the extra 2 lines you gave us?
I think it should work either way, once you use wait_read/wait_write then sendfile should not return EAGAIN more than once, but I haven't tested this. It would be useful to test just the sleep -> wait_read/wait_write replacement in isolation, and if it doesn't work then test it in combination with the high max_attempts value.
Copy that, we'll do!
I have tried your suggestion @edwintorok, and it seems to work in our tests, thanks. I had to fiddle a bit with the types to make it work.
We'll have this PR undergo another round of tests tomorrow Thursday European time.
Could you please add a Signed-off-by line to your commits? See https://elinux.org/Developer_Certificate_Of_Origin
And rename it before merging :-)
Sure, but first we'll try to finish testing without the max_attempts loop; then, if it works, we'll update the PR. Otherwise we'll indeed sign off and rename anyway!
This form of the function works in our tests, but 1) it defeats the whole point of non-blocking, and 2) if for some reason sendfile wanted to block anyway, there would be no backup.
    let _sendfile from_fd to_fd len =
      let unix_from_fd = Lwt_unix.unix_file_descr from_fd in
      let unix_to_fd = Lwt_unix.unix_file_descr to_fd in
      Lwt_unix.wait_read from_fd >>= fun () ->
      Lwt_unix.wait_write to_fd >>= fun () ->
      detach (_sendfile unix_from_fd unix_to_fd) len
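For reference, one way to address concern 2) would be to keep an EAGAIN catch as a backup path around the readiness waits. The following is only a sketch, not the actual vhd-tool code: `raw_sendfile` is a hypothetical name standing in for the detached blocking sendfile wrapper.

```ocaml
(* Sketch only, assuming [raw_sendfile : Unix.file_descr -> Unix.file_descr ->
   int -> int Lwt.t] is the detached blocking sendfile wrapper (illustrative
   name). Wait for readiness first, but still catch a stray EAGAIN and retry
   after a short sleep, so there is a fallback if sendfile blocks anyway. *)
let rec sendfile_with_fallback from_fd to_fd len =
  Lwt_unix.wait_read from_fd >>= fun () ->
  Lwt_unix.wait_write to_fd >>= fun () ->
  Lwt.catch
    (fun () ->
      let unix_from_fd = Lwt_unix.unix_file_descr from_fd in
      let unix_to_fd = Lwt_unix.unix_file_descr to_fd in
      raw_sendfile unix_from_fd unix_to_fd len)
    (function
      | Unix.Unix_error (Unix.EAGAIN, _, _) ->
        (* backup path: back off briefly, then retry *)
        Lwt_unix.sleep 0.1 >>= fun () ->
        sendfile_with_fallback from_fd to_fd len
      | e -> Lwt.fail e)
```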
Thanks a lot.
We have customers whose data can't be copied in only 20 tries (https://bugs.xenserver.org/browse/XSO-873). Moreover, I see no better response to this error code than trying again, so I would suggest removing the limit altogether (if it doesn't work, eventually the OS will come back with a different error, like a closed socket or a timeout of some sort). You might want to slightly randomize the sleep duration and maybe increase it over time.
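A randomized, increasing sleep between retries could look something like the minimal sketch below; the function name and constants are illustrative, not taken from the PR.

```ocaml
(* Sketch: capped exponential backoff with jitter between EAGAIN retries.
   The base delay doubles on each attempt up to a cap (here 2^6 * 0.1s),
   and the result is scaled by a random factor in [0.5, 1.0) so that
   concurrent retries don't stay synchronized. *)
let backoff_delay attempt =
  let base = 0.1 *. (2.0 ** float_of_int (min attempt 6)) in
  base *. (0.5 +. Random.float 0.5)

(* usage inside the retry branch, instead of a fixed Lwt_unix.sleep 0.1:
   Lwt_unix.sleep (backoff_delay attempt) >>= fun () -> ... *)
```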