yujiosaka / headless-chrome-crawler

Distributed crawler powered by Headless Chrome

[Feature Request] Add support for multiple sitemaps #148

Open · NickStees opened this issue 6 years ago

NickStees commented 6 years ago

What is the current behavior? I don't believe the crawler handles sitemaps that are broken out into multiple files via a sitemap index (see the example index below). This is common on large sites, since a single sitemap is limited to 50,000 URLs. See Simplify multiple sitemap management

A good example is NASA: https://www.nasa.gov/sitemap.xml

What is the expected behavior? Successfully crawl large sites via sitemap(s)

What is the motivation / use case for changing the behavior? Large enterprise sites not being crawled via sitemap
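For context, a sitemap index (the multi-sitemap case from the sitemaps.org protocol) looks roughly like this; each sitemap entry points to a child sitemap that also needs to be fetched. The URLs here are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-pages-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-pages-2.xml</loc>
  </sitemap>
</sitemapindex>
```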

NickStees commented 6 years ago

Also, I am having trouble getting the crawler to pick up my site's sitemap.xml. After digging around the code, I realized that this crawler requires the sitemap to be declared in robots.txt (https://www.sitemaps.org/protocol.html#submit_robots). I don't think most sites do this, so the crawler probably has limited success in actually finding sitemaps. Using the NASA example again, https://www.nasa.gov/robots.txt does not declare the sitemap.xml, and neither did mine.
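For anyone else hitting this, the line the parser looks for is the Sitemap directive from the sitemaps.org protocol, for example (placeholder URL):

```
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml
```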

NickStees commented 6 years ago

OK... after a lot of time debugging, it appears that robots-parser will in fact handle multiple sitemaps just fine. My problem was simply that the Sitemap: directive was not in my robots.txt.

So I guess the feature request is now about somehow accounting for robots.txt files that don't list any sitemaps, and just manually checking for /sitemap.xml on the server, since that is a standardized location?

yujiosaka commented 6 years ago

@NickStees

Sorry to keep you waiting for so long. Thanks for a good issue and for the thorough investigation into it.

After reading up on the sitemap protocol, I can now safely say that there is no standard for the file location. No official documentation, including sitemaps.org, defines it, so this Stack Overflow question helped me the most.

People conventionally place sitemap.xml in the root folder, but that is not a standard. There are two ways search engines find sitemaps:

  1. Locations written in robots.txt
  2. Locations submitted to each search engine's webmaster submission form

There is no way for us to know the sitemap locations submitted to search engines, so the only option we have is the current one: find the locations written in robots.txt.

It's true that most users place sitemap.xml in the root folder, but there is no guarantee that the file found there is the right one. That's why Scrapy, for example, only trusts sitemaps declared in robots.txt.

I'd like to keep this issue open until this feature is supported.

BubuAnabelas commented 6 years ago

Maybe it would be useful to force a check for sitemap.xml in the root folder when the followSitemapXml option is true. If the file doesn't exist, the crawler discards the result; if it does exist, the crawler parses the file and continues crawling the links that were found.
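Something along these lines, as a minimal sketch outside the library itself (the helper name and the use of Node's built-in https module are my own assumptions, not existing crawler code):

```js
const https = require('https');

// Hypothetical helper (not part of headless-chrome-crawler): resolve with the
// root sitemap URL if it appears to exist, otherwise resolve with null so the
// caller can discard the result.
function probeRootSitemap(origin) {
  return new Promise(resolve => {
    const url = new URL('/sitemap.xml', origin).href;
    const req = https.request(url, { method: 'HEAD' }, res => {
      const type = res.headers['content-type'] || '';
      resolve(res.statusCode === 200 && /xml/i.test(type) ? url : null);
      res.resume(); // discard any body and free the socket
    });
    req.on('error', () => resolve(null));
    req.end();
  });
}

// Example: probeRootSitemap('https://www.nasa.gov').then(console.log);
```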

yujiosaka commented 6 years ago

@BubuAnabelas Yes, it would be useful, but I believe it should not be the default behavior. Since nothing states that it is the right sitemap, it may be wrong, outdated, or simply a copy from somewhere else. There is no rule that a sitemap must be named sitemap.xml anyway.

NickStees commented 6 years ago

@yujiosaka Thanks for digging into this so much. I always assumed it was a standard to name it sitemap.xml, but it looks like that's not actually required. I think most CMSs go with this convention by default, so I assume it's quite popular.

Maybe in the crawler configuration we could manually specify an array of sitemaps?
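In the meantime, here is a rough sketch of that idea as a workaround with the current API. Only HCCrawler.launch, queue, onIdle, and close are taken from this repository's README; the sitemap list, the fetch helper, and the regex-based parsing are my own illustration and would need hardening (gzip, sitemap indexes, XML edge cases):

```js
const HCCrawler = require('headless-chrome-crawler');
const https = require('https');

// Sitemap URLs specified by hand; these are placeholders.
const SITEMAPS = [
  'https://www.example.com/sitemap-pages-1.xml',
  'https://www.example.com/sitemap-pages-2.xml',
];

// Naive sitemap download (no sitemap-index recursion, no gzip handling).
function fetchXml(url) {
  return new Promise((resolve, reject) => {
    https.get(url, res => {
      let body = '';
      res.on('data', chunk => (body += chunk));
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

// Pull every <loc> entry out of the sitemap with a simple regex.
function extractLocs(xml) {
  return [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)].map(m => m[1]);
}

(async () => {
  const crawler = await HCCrawler.launch({
    onSuccess: result => console.log('crawled:', result.options.url),
  });
  for (const sitemap of SITEMAPS) {
    const urls = extractLocs(await fetchXml(sitemap));
    await crawler.queue(urls); // queue() accepts an array of URLs
  }
  await crawler.onIdle();
  await crawler.close();
})();
```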

One other thing I encountered: the sitemap URL in robots.txt has to be fully qualified, so I could not run the crawler against a dev/test server, since the dev robots.txt (a static, non-dynamic file) always pointed to the production sitemap.xml URL. Not a biggie, just thought it might be interesting to know.

Thanks for working on such a handy tool!