Ignore pages with noindex

davwheat commented 4 years ago

Thank you so much for this package! I really love using it and it saves me a lot of pain.

I wanted to ask if you could add an option to ignore pages which have been set to have indexing turned off...

e.g. pages with...

<meta name="robots" content="noindex" />

zerodevx commented 4 years ago

Hi, thanks for your note - I'm glad this helps!

Regarding your use-case, unfortunately there's no in-built mechanism at the moment that does that... =/ So for now you may need to use the -m flag to specifically ignore those pages.

We may consider adding this feature in a future release, but implementation comes with its own set of challenges:

The quick way may be to string-detect, but that's definitely not robust, because

<meta name="robots" content="noindex" />
<meta content="noindex" name="robots">
<meta
    name="robots"
    content="noindex">

are all valid HTML.
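
To illustrate the fragility, here is a minimal sketch of plain string detection. `naiveHasNoindex` is a hypothetical helper, not something the tool provides; it looks for one exact spelling of the tag and so misses the reordered and multi-line variants.

```javascript
// Hypothetical helper: matches only one exact spelling of the tag.
function naiveHasNoindex(html) {
  return html.includes('<meta name="robots" content="noindex"');
}

// The three equivalent spellings from the comment above.
const variants = [
  '<meta name="robots" content="noindex" />',
  '<meta content="noindex" name="robots">',
  '<meta\n    name="robots"\n    content="noindex">',
];

const results = variants.map(naiveHasNoindex);
console.log(results); // [ true, false, false ]
```

Only the first variant is caught, even though all three mean the same thing to a search engine.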

The other way will be to parse the HTML (using JSDOM or such), but it's a non-trivial task and a resource-intensive operation that will significantly impact speed.

Alternatively, you can continue to include noindex pages in your sitemap - search engines still give the noindex meta the highest priority - though this generates a bunch of errors in Search Console. =/

davwheat commented 4 years ago

It could be possible to only parse the <head> tag.

node-html-parser claims it can parse an HTML file in under 2ms, which wouldn't be too much of a speed hit, bearing in mind most people would likely only use this tool before deploying their changes to a webserver. I use this directly after prettier which ends up spending up to 750ms per HTML file.
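
As a rough, dependency-free sketch of the head-only idea: the hypothetical `headHasNoindex` below slices out the `<head>` section and inspects each `<meta>` tag's attributes regardless of order or line breaks. The regexes are a stand-in I'm assuming for illustration; a real implementation would hand the `<head>` to a proper parser, since this version ignores unquoted attribute values and other edge cases.

```javascript
// Hypothetical helper: true if <head> has <meta name="robots"> with noindex.
function headHasNoindex(html) {
  // Restrict the search to the <head> section only.
  const head = /<head\b[^>]*>([\s\S]*?)<\/head>/i.exec(html);
  if (!head) return false;

  // Visit every <meta ...> tag found inside the head.
  for (const [tag] of head[1].matchAll(/<meta\b[^>]*>/gi)) {
    const attrs = {};
    // Collect quoted attribute pairs in whatever order they appear.
    const pairs = tag.matchAll(/([a-z-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/gi);
    for (const [, key, dq, sq] of pairs) {
      attrs[key.toLowerCase()] = (dq ?? sq).toLowerCase();
    }
    // content may hold several directives, e.g. "noindex, nofollow".
    const directives = (attrs.content || '').split(',').map((s) => s.trim());
    if (attrs.name === 'robots' && directives.includes('noindex')) return true;
  }
  return false;
}

const page = '<html><head><meta\n  content="noindex, nofollow"\n  name="robots"></head><body></body></html>';
console.log(headHasNoindex(page)); // true
```

Stopping at `</head>` keeps the work small even for large pages, which is the point of the suggestion.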

I'll make some changes and see how an implementation of this could affect runtime.

davwheat commented 4 years ago

@zerodevx So I've made a version of the tool which follows noindex meta tags using htmlparser2.

It's slower than the normal version by roughly 4x...

Benchmarking with 529 HTML files (totalling 50 MB), I found that by following the noindex tags, it took about 1400-1500ms. By ignoring them, it took about 350-380ms.
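
For reference, a quick back-of-the-envelope calculation on those figures. It only uses the numbers quoted above; pairing the extremes of each range to get bounds is my own assumption, not part of the benchmark.

```javascript
const files = 529;
const withNoindex = [1400, 1500]; // ms, following noindex tags
const without = [350, 380];       // ms, ignoring them

// Slowdown bounds: best case vs worst case pairing of the two ranges.
const slowdownMin = withNoindex[0] / without[1];
const slowdownMax = withNoindex[1] / without[0];

// Extra time spent per file on parsing.
const perFileMin = (withNoindex[0] - without[1]) / files;
const perFileMax = (withNoindex[1] - without[0]) / files;

console.log(`${slowdownMin.toFixed(2)}x to ${slowdownMax.toFixed(2)}x`); // 3.68x to 4.29x
console.log(`${perFileMin.toFixed(2)} to ${perFileMax.toFixed(2)} ms per file`); // 1.93 to 2.17 ms per file
```

So the "roughly 4x" figure holds, and the added cost works out to about 2ms per file.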

At the moment I've implemented it as an argument which needs to be manually enabled. I'll PR and see what you think.

zerodevx commented 4 years ago

That's great work! Looking through it right now.

zerodevx commented 4 years ago

Looks really good to me. I'll merge #10 and release a new minor.

Thanks for your contribution! 🎉

zerodevx / static-sitemap-cli

Ignore pages with noindex #9