wspr-ncsu / visiblev8-crawler

Framework which makes large scale crawling of URLs with VisibleV8 easy.
BSD 3-Clause "New" or "Revised" License

CPU Overload while crawling websites #4

Open aswad546 opened 1 month ago

aswad546 commented 1 month ago

Hello again,

I am trying to crawl multiple websites using this crawler, but many of them are hitting navigation timeouts and the CPU seems to be overloaded. I am running Ubuntu 22.04 with 32 cores and 256 GB of RAM, I have set the number of parallel instances to 32 instead of the suggested 128, and my navigation timeout is quite high. If I run these websites individually they are crawled successfully, but in the parallel run they fail. I examined the network load and that does not seem to be the issue. Is this usual? It will be hard to scale this way. Any suggestions or recommendations?

Thanks, Aswad

sohomdatta1 commented 1 month ago

I personally use (cpu_cores * 3) parallel instances on most large-scale crawls (> 200/300 websites). However, you don't need to use the suggested values; feel free to run smaller 1k crawls and figure out where the sweet spot is on your hardware.
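As a rough starting point, the rule of thumb above can be computed from the core count (a sketch; `nproc` is Linux-specific, and the `* 3` multiplier is just this heuristic, not a guaranteed optimum):

```shell
# Sketch: derive a starting parallelism value from the core count.
# Tune up or down based on timeout rates observed in small test crawls.
cores=$(nproc)
parallel=$((cores * 3))
echo "cores=$cores suggested_parallel=$parallel"
```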

I will say, though, that if you have high CPU usage, it could be indicative of deeper problems with your infra. Make sure your open-file limits are set to really high numbers and that you don't have significant I/O delay when writing files (this is especially important since a lot of the crawler's speed depends on fast I/O). Generally speaking, if you have fast writes, the first bottleneck you hit should be RAM (because of Chromium's significant memory usage), not CPU.
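A quick way to check the limits and I/O delay mentioned above (a sketch assuming Linux; `iostat` comes from the sysstat package and may need installing separately):

```shell
# Current per-process open-file soft limit; raise via `ulimit -n`
# or /etc/security/limits.conf if it is low (e.g. the default 1024).
soft_limit=$(ulimit -n)
echo "open-file soft limit: $soft_limit"

# System-wide file handle ceiling on Linux.
[ -r /proc/sys/fs/file-max ] && cat /proc/sys/fs/file-max

# %iowait in the CPU summary shows time spent waiting on disk;
# a consistently high value means storage, not CPU, is the real bottleneck.
command -v iostat >/dev/null && iostat -c 1 3
```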