xtracthub / xtract-service

Globus Labs Xtract: Extract metadata from distributed data sets.
Parallelize parser execution on single group entity #14

Open tskluzac opened 4 years ago

tskluzac commented 4 years ago

If we have one 'family' of files (all files that must be transferred together such that each file in the group is only filed once during processing), then we want to run all parsers on those files in parallel. Right now the files are transferred in parallel, but the parsers are applied serially. The parsers are lightweight, so the overhead per group is small, but it could easily add hours to total processing time when we have tens of millions of groups (e.g., MDF).
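A rough sketch of what this could look like, assuming each parser is a plain Python callable that takes the family's file paths and returns a metadata dict. The names `run_parsers_on_family`, `count_files`, and `longest_name` are hypothetical and are not part of Xtract:

```python
from concurrent.futures import ThreadPoolExecutor


def run_parsers_on_family(family_files, parsers, max_workers=None):
    """Apply every parser to one family of files concurrently.

    `parsers` maps a parser name to a callable that accepts the list of
    file paths and returns a metadata dict (hypothetical interface).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all parsers up front so they run side by side.
        futures = {name: pool.submit(fn, family_files)
                   for name, fn in parsers.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failing parser should not discard the others' output.
                results[name] = {"error": repr(exc)}
    return results


# Toy stand-ins for real parsers, for illustration only.
def count_files(paths):
    return {"n_files": len(paths)}


def longest_name(paths):
    return {"longest": max(paths, key=len)}


if __name__ == "__main__":
    family = ["a.csv", "metadata.json"]
    print(run_parsers_on_family(
        family, {"count": count_files, "longest": longest_name}))
```

Threads are enough if the parsers mostly wait on file I/O; CPU-bound parsers would need `ProcessPoolExecutor` instead, since Python threads do not run bytecode in parallel.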

tskluzac commented 3 years ago

Note -- this only makes sense if Xtract gets a local mode. Currently, running via funcX, there is no point in doing this, as each core/worker is already processing a file.