webrecorder / browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Could there be a way to create warcs with certain size after one RUN (combinewarc / rolloversize...) #617

Open ssairanen opened 1 week ago

ssairanen commented 1 week ago

CombineWARC seems to create WARCs from all the WARCs in the collection folder after a run, but there is no way to create limited-size WARCs from a single run only?

For example: if one crawl runs daily and the total size of all its WARCs is 20 TB, then turning on CombineWARC: true suddenly makes Browsertrix create another 20 TB of WARCs next to the original crawl folder. The WARCs are read here: https://github.com/webrecorder/browsertrix-crawler/blob/6329b19a20c4995b6a8835cf2e8dfe37146ddb80/src/crawler.ts#L2329

Could there be a way to combine all of the worker outputs into fixed-size WARCs, but only for one run? rolloversize:100000000 does not work, as the individual WARCs (the worker outputs) may be anything from 1 MB to 100 MB; I want the WARCs to always be 100 MB, except for the last one, which obviously can't be.

ikreymer commented 1 week ago

Not quite sure what you mean - using --rolloverSize + --combineWARC together should work as you describe. The combineWARC operation combines all the WARCs in a collection folder after each crawl, up to the rollover size. The rollover size is also applied to individual WARCs.
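For reference, a minimal sketch of an invocation using both flags together (the URL and mount path are placeholders, not from this thread):

```sh
# Sketch: roll over individual WARCs at ~100 MB during the crawl, then
# combine them into WARCs of up to the same size after the crawl finishes.
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --combineWARC \
  --rolloverSize 100000000
```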

steph-nb commented 1 week ago

Same question for the use of several crawlers in Browsertrix: how could an overall max size be configured, which generates x slices of the max size and only one smaller WARC? (To my understanding, --rolloverSize + --combineWARC are applied per crawler individually.)

ikreymer commented 6 days ago

> Same question for the use of several crawlers in Browsertrix: how could an overall max size be configured, which generates x slices of the max size and only one smaller WARC? (To my understanding, --rolloverSize + --combineWARC are applied per crawler individually.)

I believe that this is how it should work if you use both of those flags. The --rolloverSize applies to the individual WARCs; --combineWARC then combines them so they are all up to the rollover size, with only one smaller WARC. Is this not working correctly?

ssairanen commented 6 days ago

What I basically meant was: the crawler writes WARCs of whatever size to the /archive/ folder, and then combineWARC goes one level up (/..) and creates fixed-size WARCs in that folder. Now we have the archive/ folder with the original WARCs, plus the combineWARC output one level up, which means 2x the space.

If, for example, one turns on the combineWARC option on a daily crawl which has been creating WARCs for a while, the combineWARC step takes all of the past WARCs into account when combining (which is fun when you have 10 TB of WARCs in the archive/ folder...). There is no option to get neatly sized WARCs from one run only, in the same folder next to the output of another run of the same crawl.
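To illustrate the layout being described (file names here are illustrative, not the crawler's exact naming):

```
collections/my-crawl/
├── archive/                # per-worker WARCs written during crawling
│   ├── rec-....warc.gz
│   └── ...
├── my-crawl_0.warc.gz      # combined WARCs written by --combineWARC,
└── my-crawl_1.warc.gz      # duplicating the contents of archive/
```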

ikreymer commented 5 days ago

> What I basically meant was: the crawler writes WARCs of whatever size to the /archive/ folder, and then combineWARC goes one level up (/..) and creates fixed-size WARCs in that folder. Now we have the archive/ folder with the original WARCs, plus the combineWARC output one level up, which means 2x the space.
>
> If, for example, one turns on the combineWARC option on a daily crawl which has been creating WARCs for a while, the combineWARC step takes all of the past WARCs into account when combining (which is fun when you have 10 TB of WARCs in the archive/ folder...). There is no option to get neatly sized WARCs from one run only, in the same folder next to the output of another run of the same crawl.

There is no concept of distinct crawl 'runs' in Browsertrix Crawler - it is assumed that repeated crawls may be part of the same crawl, e.g. if a crawl is interrupted/restarted. If you want to separate crawls by day, my suggestion would be to use --collection my-crawl-YYYY-MM-DD to crawl into a new directory for each day, and use --combineWARC and --rolloverSize with these crawls. Or, to put it another way, all WARCs in ./collections/<name>/archive are assumed to be part of the same crawl - having different directories allows you to isolate and group the WARCs from that crawl only.
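A minimal sketch of that suggestion as a daily wrapper script (the URL, collection prefix, and size are placeholders):

```sh
# Sketch: crawl into a fresh, dated collection each day so that
# --combineWARC only ever sees that day's WARCs.
TODAY=$(date +%Y-%m-%d)
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --collection "my-crawl-$TODAY" \
  --combineWARC \
  --rolloverSize 100000000
```

Each run then produces its own collections/my-crawl-YYYY-MM-DD/ directory, and the combined WARCs in it cover only that run.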