xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

[Enhancement] Sitemaps should be supported in an enhanced way #319

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
As of now, sitemaps aren't fully supported, and several tweaks to the code are needed just to make them work in a very primitive way.

What I'd recommend is creating separate methods to parse sitemaps, as they're becoming a common pattern on most websites.

The sitemap protocol is pretty basic and would make the crawling process way faster.
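
For reference, a minimal sitemap under the sitemaps.org protocol is just a flat XML list of URLs, which a crawler can fetch in a single request instead of discovering links page by page (the URL and values below are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2014-11-16</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```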

In a basic test with my own crawler (it's not a website copier; it only needs some specific pages crawled/parsed), crawling got 10 times faster.

I grabbed over 1,000,000 URLs from several websites in less than 4 hours of execution with a single crawler running 10 threads.

Original issue reported on code.google.com by panthro....@gmail.com on 16 Nov 2014 at 5:29

GoogleCodeExporter commented 9 years ago
We should consider using https://code.google.com/p/crawler-commons/ for sitemap parsing.

Please note that we should also consider applying the parser's "strict" flag, controlled by a flag in our config.
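
To make the idea concrete, here's a rough sketch of how crawler-commons' SiteMapParser could expand a sitemap (or a sitemap index) into a list of crawlable URLs. The SitemapSeeder class and the recursion strategy are illustrative, not crawler4j code; the SiteMapParser API (the boolean strict constructor, parseSiteMap, and the SiteMapIndex/SiteMap types) comes from crawler-commons:

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMap;
import crawlercommons.sitemaps.SiteMapIndex;
import crawlercommons.sitemaps.SiteMapParser;
import crawlercommons.sitemaps.SiteMapURL;

public class SitemapSeeder {

    // strict = false tolerates sitemaps that bend the spec (e.g. URLs outside
    // the sitemap's own location); this boolean is what the comment above
    // suggests exposing through a flag in the crawler config.
    private final SiteMapParser parser = new SiteMapParser(false);

    /** Fetches a sitemap (or sitemap index) and collects all page URLs. */
    public List<URL> collectUrls(URL sitemapUrl) throws Exception {
        List<URL> result = new ArrayList<>();
        AbstractSiteMap sm = parser.parseSiteMap(sitemapUrl);
        if (sm.isIndex()) {
            // A sitemap index points at further sitemaps; recurse into each.
            for (AbstractSiteMap child : ((SiteMapIndex) sm).getSitemaps()) {
                result.addAll(collectUrls(child.getUrl()));
            }
        } else {
            for (SiteMapURL u : ((SiteMap) sm).getSiteMapUrls()) {
                result.add(u.getUrl());
            }
        }
        return result;
    }
}
```

The returned URLs could then be scheduled as seeds, which is where the speedup reported above would come from: the frontier is filled directly from the sitemap rather than by following links.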

Original comment by avrah...@gmail.com on 16 Nov 2014 at 6:01