website-scraper / node-website-scraper

Download website to local directory (including all css, images, js, etc.)
https://www.npmjs.org/package/website-scraper
MIT License

Crawling rate limit or requeue? #485

Closed: abale closed this issue 2 years ago

abale commented 2 years ago

Is there a simple way to re-queue a page for crawling? Many sites employ request rate limiting (HTTP status code 429), and typically it's just a matter of putting the page back in the queue for a retry.

An alternative would be a way to rate-limit the crawler beyond max concurrency, perhaps a global maximum in requests per second (with values below 1 allowed for even slower crawling).

Setting maxConcurrency to 1 still crawls too quickly.

s0ph1e commented 2 years ago

Hi @abale 👋

Sorry for the late response.

To achieve retries, I suggest checking the request option. website-scraper uses the got module internally to make HTTP requests, and I suppose it's possible to configure got to retry when a request fails.
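For example, a minimal sketch of that approach could look like the following. The URL, output directory, and retry limit are placeholders, and the exact shape of the retry option depends on the got version bundled with your website-scraper release (recent got versions accept a retry object and already count 429 among the default retryable status codes):

```js
import scrape from 'website-scraper';

await scrape({
  urls: ['https://example.com'],      // placeholder URL
  directory: './downloaded-site',     // placeholder output directory
  request: {
    // Forwarded to got; retry each failed request up to 3 times
    retry: { limit: 3 },
  },
});
```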

You can also try adding delays between requests; please check the example of beforeRequest action usage (a sketch is included below).
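A minimal sketch of such a plugin, assuming a website-scraper release with the registerAction plugin API; the 1-second delay, URL, and directory are illustrative values:

```js
import scrape from 'website-scraper';

const DELAY_MS = 1000; // assumed pause between requests
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

class DelayBetweenRequestsPlugin {
  apply(registerAction) {
    // beforeRequest runs before every outgoing request; here we only delay it
    registerAction('beforeRequest', async ({ resource, requestOptions }) => {
      await sleep(DELAY_MS);
      return { requestOptions };
    });
  }
}

await scrape({
  urls: ['https://example.com'],      // placeholder URL
  directory: './downloaded-site',     // placeholder output directory
  plugins: [new DelayBetweenRequestsPlugin()],
});
```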

Hope it helps

no-response[bot] commented 2 years ago

This issue has been automatically closed because there has been no response from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.