Closed: davwheat closed this issue 4 years ago
Hi, thanks for your note - I'm glad this helps!
Regarding your use case, unfortunately there's no built-in mechanism at the moment that does that... =/ So for now you may need to use the -m flag to explicitly ignore those pages.
We may consider adding this feature in a future release, but implementation comes with its own set of challenges:
The quick way would be to string-detect, but it's definitely not robust, because
<meta name="robots" content="noindex" />
<meta content="noindex" name="robots">
<meta
  name="robots"
  content="noindex">
are all valid HTML.
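To illustrate the brittleness, here's a minimal sketch of the naive substring approach (hypothetical helper, not code from the tool) run against the three equivalent forms above:

```javascript
// Naive string detection: look for the exact tag text.
function naiveNoindex(html) {
  return html.includes('<meta name="robots" content="noindex"');
}

const variants = [
  '<meta name="robots" content="noindex" />',
  '<meta content="noindex" name="robots">',
  '<meta\n  name="robots"\n  content="noindex">',
];

// Only the first variant matches, even though all three mean the same thing.
console.log(variants.map(naiveNoindex)); // [ true, false, false ]
```

Attribute order, whitespace, quoting style, and self-closing slashes all vary in real-world HTML, so any fixed-string check will miss valid noindex pages.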
The other way would be to parse the HTML (using JSDOM or similar), but that's a non-trivial task and a resource-intensive operation that would significantly impact speed.
Alternatively, you can continue to include noindex pages in your sitemap - search engines still give the noindex meta the highest priority - though this generates a bunch of errors in Search Console. =/
It could be possible to only parse the <head> tag.
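Since a <meta name="robots"> tag is only meaningful inside <head>, one way to cut the parsing cost is to truncate each document at </head> before handing it to a parser. A dependency-free sketch (headOnly is a hypothetical helper, not part of the tool):

```javascript
// Keep only the portion of the document up to </head>; a robots meta
// tag can't legally appear after that, so the rest can be ignored.
function headOnly(html) {
  const end = html.search(/<\/head\s*>/i);
  return end === -1 ? html : html.slice(0, end);
}

const page =
  '<html><head><meta name="robots" content="noindex"></head>' +
  '<body>' + 'x'.repeat(100000) + '</body></html>';

// The parser now sees a few dozen bytes instead of ~100 kB.
console.log(headOnly(page).length);
```

Falling back to the full document when no </head> is found keeps the check safe on fragments or malformed pages.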
node-html-parser claims it can parse an HTML file in under 2ms, which wouldn't be too much of a speed hit, bearing in mind most people would likely only run this tool before deploying their changes to a webserver. I use it directly after prettier, which ends up spending up to 750ms per HTML file.
I'll make some changes and see how an implementation of this could affect runtime.
@zerodevx So I've made a version of the tool which follows noindex meta tags using htmlparser2.
It's roughly 4x slower than the normal version...
Benchmarking with 529 HTML files (totalling 50 MB), I found that by following the noindex tags, it took about 1400-1500ms. By ignoring them, it took about 350-380ms.
At the moment I've implemented it as an argument which needs to be manually enabled. I'll PR and see what you think.
That's great work! Looking through it right now.
Looks really good to me. I'll merge #10 and release a new minor.
Thanks for your contribution! 🎉
Thank you so much for this package! I really love using it and it saves me a lot of pain.
I wanted to ask if you could add an option to ignore pages that have indexing turned off...
e.g. pages with <meta name="robots" content="noindex" /> in their <head>.