Open blaiszik opened 4 years ago
Currently thinking this:
Have optional `max_crawl_depth` and `max_crawl_total` args on the crawler. The state of the crawl (all local queues and in-flight tasks) would be pickled and stored in S3 as a checkpoint. Once the service stops, the user can retrieve a `crawlNext` token that picks up where the previous max was hit. The `crawlNext` token would be deleted after 24 hours to save space, since these queues can get quite large and the state of a repo can change substantially in that time.
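
A rough sketch of what that checkpoint step could look like, assuming boto3 and S3; the function names, bucket, and key layout here are placeholders, not the actual crawler API:

```python
import pickle
import uuid

import boto3

# Hypothetical bucket; the real name would come from the service config.
CHECKPOINT_BUCKET = "crawler-checkpoints"


def checkpoint_crawl(local_queues, in_flight_tasks):
    """Pickle the remaining crawl state and store it in S3.

    Returns a crawlNext token the user can pass back to resume where the
    previous max_crawl_depth / max_crawl_total cutoff was hit.
    """
    crawl_next_token = str(uuid.uuid4())
    state = {
        "queues": local_queues,        # unvisited paths, per worker
        "in_flight": in_flight_tasks,  # tasks started but not finished
    }
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=CHECKPOINT_BUCKET,
        Key=f"crawl-next/{crawl_next_token}.pkl",
        Body=pickle.dumps(state),
    )
    # A bucket lifecycle rule on the crawl-next/ prefix would handle the
    # 24-hour expiration of these checkpoints.
    return crawl_next_token


def resume_crawl(crawl_next_token):
    """Load a previously checkpointed crawl state from S3."""
    s3 = boto3.client("s3")
    obj = s3.get_object(
        Bucket=CHECKPOINT_BUCKET,
        Key=f"crawl-next/{crawl_next_token}.pkl",
    )
    return pickle.loads(obj["Body"].read())
```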
It might be helpful to add arguments that allow a user to specify a `max_crawl_depth` (maximum folder depth) or `max_crawl_total` (maximum total number of files). This is not something we need currently, but it's a potentially useful addition.
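
As a minimal local-filesystem sketch of how the two limits could interact with a crawl loop (the real crawler targets remote repositories, so this is only illustrative and the parameter handling is an assumption):

```python
import os
from collections import deque


def crawl(root, max_crawl_depth=None, max_crawl_total=None):
    """Breadth-first directory crawl that stops at either optional limit."""
    queue = deque([(root, 0)])  # (path, folder depth)
    files_seen = 0

    while queue:
        path, depth = queue.popleft()
        for entry in os.scandir(path):
            if entry.is_dir():
                # Only descend if still under the depth limit.
                if max_crawl_depth is None or depth + 1 <= max_crawl_depth:
                    queue.append((entry.path, depth + 1))
            else:
                files_seen += 1
                yield entry.path
                # Stop once the total-file limit is reached; the remaining
                # queue is what would get checkpointed to S3 above.
                if max_crawl_total is not None and files_seen >= max_crawl_total:
                    return
```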