xtracthub / xtract-service

Globus Labs Xtract: Extract metadata from distributed data sets.

Crawl Limits #22

Open blaiszik opened 4 years ago

blaiszik commented 4 years ago

It might be helpful to add arguments that allow a user to specify a `max_crawl_depth` (maximum folder depth) or `max_crawl_total` (maximum total number of files). This is not something we need currently, but it is a potentially useful addition. (A rough sketch of how these limits could work is below.)
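
A minimal sketch of how such limits might be enforced, assuming a simple breadth-first crawl over a local filesystem (the real crawler walks Globus endpoints, and the function and argument names here are hypothetical):

```python
import os
from collections import deque

def crawl(root_path, max_crawl_depth=None, max_crawl_total=None):
    """Breadth-first directory crawl with an optional folder-depth cap
    and an optional cap on the total number of files visited."""
    files_seen = []
    queue = deque([(root_path, 0)])  # (directory, depth)

    while queue:
        directory, depth = queue.popleft()
        if max_crawl_depth is not None and depth > max_crawl_depth:
            continue  # skip anything deeper than the allowed folder depth
        for entry in os.scandir(directory):
            if entry.is_dir():
                queue.append((entry.path, depth + 1))
            else:
                files_seen.append(entry.path)
                if max_crawl_total is not None and len(files_seen) >= max_crawl_total:
                    # Return the unfinished queue so the crawl could resume later.
                    return files_seen, list(queue)
    return files_seen, []
```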

tskluzac commented 4 years ago

Currently thinking this:

Add optional `max_crawl_depth` and `max_crawl_total` args to the crawler. The state of the crawl (all local queues and in-flight tasks) will be pickled and stored in S3 as a checkpoint of sorts. Once the service stops, the user can access a 'crawlNext' token that picks up where the previous 'max' was hit. The 'crawlNext' token will be deleted after 24 hours to save space, since these queues can get pretty hefty and the state of a repo could change drastically in that time. A rough sketch of the checkpoint/resume flow is below.
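
Something like the following, assuming boto3 for S3 access; the bucket name, key layout, and function names are placeholders, and the 24-hour expiry would come from a bucket lifecycle rule rather than application code:

```python
import pickle
import uuid
import boto3

S3_BUCKET = "xtract-crawl-checkpoints"  # hypothetical bucket name

def save_checkpoint(pending_queue, in_flight_tasks):
    """Pickle the remaining crawl state to S3 and return a 'crawlNext' token.
    A lifecycle rule on the bucket (not shown) expires objects after 24 hours."""
    crawl_next_token = str(uuid.uuid4())
    state = {"queue": pending_queue, "in_flight": in_flight_tasks}
    boto3.client("s3").put_object(
        Bucket=S3_BUCKET,
        Key=f"crawl-next/{crawl_next_token}",
        Body=pickle.dumps(state),
    )
    return crawl_next_token

def load_checkpoint(crawl_next_token):
    """Fetch and unpickle the saved state so a new crawl can pick up
    where the previous one hit its max."""
    obj = boto3.client("s3").get_object(
        Bucket=S3_BUCKET, Key=f"crawl-next/{crawl_next_token}"
    )
    return pickle.loads(obj["Body"].read())
```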