wtsi-hgi / irods-to-s3

iRODS to S3 copy tool

Support multipart uploads to S3 #2

Open Xophmeister opened 4 years ago

Xophmeister commented 4 years ago

The S3 transfer routine is a wrapper around boto3's S3.Client.put_object, which is a straight serial transfer. The benefit of this, as opposed to using, say, upload_file, is that we can specify the MD5 checksum from the outset and have S3 verify the transfer. The downside is that it's much slower.
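
For reference, a minimal sketch of this serial approach (not the tool's actual code; the bucket and key names are hypothetical). S3's Content-MD5 header expects the base64 encoding of the raw digest, whereas iRODS reports a hex digest, hence the conversion:

```python
import base64
import binascii

import boto3

s3 = boto3.client("s3")

def put_with_checksum(data: bytes, irods_md5_hex: str) -> None:
    # iRODS reports MD5 as a hex digest; S3's Content-MD5 header wants
    # the base64 encoding of the raw 16-byte digest
    content_md5 = base64.b64encode(binascii.unhexlify(irods_md5_hex)).decode()

    # put_object is a single, serial PUT; S3 verifies the body against
    # Content-MD5 and rejects the upload if they don't match
    s3.put_object(
        Bucket="example-bucket",   # hypothetical bucket
        Key="example/object",      # hypothetical key
        Body=data,
        ContentMD5=content_md5,
    )
```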

With a multipart upload, we would have to calculate the checksums ourselves as we streamed the data off iRODS (i.e., rather than using the checksum provided by iRODS). By virtue of the interface S3 expects, each part's checksum has to be supplied upfront, which would mean reading through the file on iRODS twice.

It's not clear from the code of the Python iRODS client how much concurrency it supports. This may also be a limiting factor.

Xophmeister commented 4 years ago

See also https://github.com/irods/python-irodsclient/issues/198

Xophmeister commented 4 years ago

Because the S3 part upload needs the MD5 sum upfront, it is true that you need to read each chunk twice. However, it's not true that you have to read the chunk twice from iRODS. The default chunk size is only 8 MiB: that will easily fit in memory, or could be flushed to local disk once memory residency exceeds some threshold.
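
Something along these lines, assuming a readable, file-like handle from the Python iRODS client (the bucket, key and chunk size here are placeholders): each chunk is read from iRODS once, its MD5 computed over the in-memory copy, and the part uploaded with Content-MD5 so S3 verifies it.

```python
import base64
import hashlib

import boto3

s3 = boto3.client("s3")

BUCKET = "example-bucket"      # hypothetical
KEY = "example/object"         # hypothetical
CHUNK_SIZE = 8 * 1024 * 1024   # 8 MiB parts

def multipart_copy(irods_file) -> None:
    # irods_file is assumed to be a readable, file-like handle opened
    # via the Python iRODS client
    upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
    upload_id = upload["UploadId"]
    parts = []

    try:
        part_number = 1
        while True:
            chunk = irods_file.read(CHUNK_SIZE)
            if not chunk:
                break

            # The chunk is read from iRODS exactly once; the MD5 is
            # computed over the in-memory copy before the part upload
            md5 = hashlib.md5(chunk).digest()
            response = s3.upload_part(
                Bucket=BUCKET,
                Key=KEY,
                PartNumber=part_number,
                UploadId=upload_id,
                Body=chunk,
                ContentMD5=base64.b64encode(md5).decode(),
            )
            parts.append({"PartNumber": part_number, "ETag": response["ETag"]})
            part_number += 1

        s3.complete_multipart_upload(
            Bucket=BUCKET,
            Key=KEY,
            UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        s3.abort_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload_id)
        raise
```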

The above investigation has shown that the iRODS client doesn't currently support multithreading natively. However, it wouldn't be difficult to implement this ourselves. (The only concern is the potential to overload iRODS. Anecdotally, each client shows up in ips and, with threading on top of that, the sysadmins may have some choice words...)

Regardless, what I propose is separate, tuneable reader and writer pools, with a buffering queue sitting between them. The reader would only ever read a chunk from iRODS once, straight into the buffer queue, which the S3 writers would consume from as required.
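
A rough sketch of that architecture, using standard-library threads and a bounded queue for backpressure. For simplicity this shows a single reader rather than a pool; read() on the iRODS handle and the upload_part callable are stand-ins for the iRODS and S3 sides (the latter wrapping the multipart upload shown above):

```python
import queue
import threading

CHUNK_QUEUE: "queue.Queue" = queue.Queue(maxsize=16)  # bounds memory residency
_SENTINEL = object()

def reader(irods_handle, chunk_size: int = 8 * 1024 * 1024) -> None:
    # Read each chunk from iRODS exactly once, straight into the queue;
    # the bounded queue applies backpressure if the writers fall behind
    part_number = 1
    while True:
        chunk = irods_handle.read(chunk_size)
        if not chunk:
            break
        CHUNK_QUEUE.put((part_number, chunk))
        part_number += 1
    CHUNK_QUEUE.put(_SENTINEL)

def writer(upload_part) -> None:
    # Consume chunks as they become available and push them to S3;
    # upload_part(part_number, chunk) is a hypothetical callable
    while True:
        item = CHUNK_QUEUE.get()
        if item is _SENTINEL:
            CHUNK_QUEUE.put(_SENTINEL)  # let sibling writers drain too
            break
        part_number, chunk = item
        upload_part(part_number, chunk)

def run(irods_handle, upload_part, n_writers: int = 4) -> None:
    threads = [threading.Thread(target=reader, args=(irods_handle,))]
    threads += [
        threading.Thread(target=writer, args=(upload_part,))
        for _ in range(n_writers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Both the queue depth and the number of writers would be the tuneable knobs: the queue bounds how much data sits in memory, and the writer count bounds concurrent connections to S3 (and, indirectly, how hard we hit iRODS).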