webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

[docs] recrawl and excludes #394

Open · wsdookadr opened this issue 1 year ago

wsdookadr commented 1 year ago

I have a request regarding the documentation.

There are three topics that are under-documented. It would be useful for people (like me) if docs were available for them:

  1. recrawls: how to do them (this was asked before here)
  2. excludes: how they actually work in combination with includes, how to check the logs to see whether something was actually excluded, and usage examples
  3. in the Crawler Statistics, what does the "failed" count mean? More specifically, are pages that exceed pageLoadTimeout still stored in the WARC in partial form, or are they discarded altogether? Is a "failed" page defined as one that was still loading external resources when pageLoadTimeout expired?
pato-pan commented 1 year ago

I don't think the logs currently indicate whether something was included or excluded because of a rule. When I look at the logs with --logging debug and the --logLevel/--context options, the websites that were excluded don't show up at all.

To find out whether something was excluded (or included), I open the logs and search for the site there. Sometimes a site can still get captured even if it doesn't show up in the logs, so I use replayweb.page to check there too.
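Roughly what that looks like for me (the collection paths and URLs below are just examples):

    # Keep a copy of the crawler's output so it can be searched afterwards.
    docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
      --url https://example.com/ --logging debug > crawl.log 2>&1

    # If a URL never shows up in the log, it was most likely never queued at all
    # (out of scope or excluded); there is no explicit "excluded" log message.
    grep -F 'https://example.com/some/page' crawl.log | head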

tw4l commented 1 year ago
  • recrawls: how to do them (this was asked before here)

Currently there's no way to partially re-crawl with browsertrix-crawler. In our Browsertrix Cloud system you can use the archiveweb.page Chrome extension to manually capture content that wasn't crawled and then combine it with the crawl in a Collection, which replays together and can be downloaded as a single (nested) WACZ file.

  • excludes: how they actually work in combination with includes, how to check the logs to see whether something was actually excluded, and usage examples

Currently exclusions are not logged. We could possibly log these as debug messages so that they're optionally available, but that's not yet implemented.
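As a rough usage example (the URLs and regexes here are illustrative, not taken from the docs): both options take regexes, and a URL matching an exclusion is skipped even if it also matches an include, something like:

    docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
      --url https://example.com/ \
      --include "https://example\.com/blog/.*" \
      --exclude ".*\?replytocom=.*" \
      --generateWACZ --collection example-blog

With debug logging you would still only see URLs that were queued; excluded URLs simply never appear in the log.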

  • in the Crawler Statistics, what does the "failed" count mean? More specifically, are pages that exceed pageLoadTimeout still stored in the WARC in partial form, or are they discarded altogether? Is a "failed" page defined as one that was still loading external resources when pageLoadTimeout expired?

Failed pages are pages that return a 4xx or 5xx status code, or that hit the page load timeout. If anything is captured, it will be included in the WACZ, and each page should also show up in the pages.jsonl file within the WACZ with a load state indicator showing the last successful step for the page (e.g. content loaded, full page loaded, behaviors run).
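If you want to check this for a finished crawl, one option is to read pages.jsonl straight out of the WACZ (a WACZ is a zip file). The path and field names below (status, loadState) are assumptions based on recent crawler output and may differ between versions:

    # List each page's URL, HTTP status, and load state from the WACZ's page index.
    # (Field names are assumptions; inspect a line of pages.jsonl to confirm.)
    unzip -p crawls/collections/my-crawl/my-crawl.wacz pages/pages.jsonl \
      | jq -r 'select(.url) | [.url, (.status // ""), (.loadState // "")] | @tsv'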