whatwg / meta

Discussions and issues without a logical home

add rel="canonical" to snapshots so search engines know what they *should* index. #33

Open phonedude opened 6 years ago

phonedude commented 6 years ago

forbidding indexing by robots.txt is 1/2 a solution. using rel="canonical" to provide a suggestion to SEs is the other 1/2: it will prevent indexing of the snapshot and inform the SE what they should index. it also aligns with industry standard practice, see the wikipedia example at: http://ws-dl.blogspot.com/2017/08/2017-08-07-relcanonical-does-not-mean.html

annevk commented 6 years ago

I don't think Wikipedia is doing the right thing.

hvdsomp commented 6 years ago

Your comment suggests that your thinking is authoritative without justification. Can't wait for actual justification.

phonedude commented 6 years ago

While not definitive, there is some pretty strong evidence that Google worked with wikia.com (and thus transitively mediawiki/Wikipedia) on early rel="canonical" implementations. I guess it's possible wikia.com coordinated on one aspect of rel="canonical" and then went rogue on another aspect, but that seems unlikely.

https://github.com/whatwg/html/issues/2899#issuecomment-321624207

annevk commented 6 years ago

There's nothing in the definition of canonical that suggests that the canonical version of a dated resource is its maintained variant.

domenic commented 6 years ago

The issue may be that the definition of canonical is wrong then. The authors of the dated resources we are examining would rather have search engines index the maintained variant. canonical accomplishes this. If the definition of canonical does not support that usage, we should fix its definition. Can you help suggest new text?

annevk commented 6 years ago

Is that the dominant pattern though? What about folks using it per the definition? If you want to change anything here you'd first have to do some kind of analysis of the landscape.

phonedude commented 6 years ago

here's another example:

inside: https://www.w3.org/TR/2017/REC-shacl-20170720/ and https://www.w3.org/TR/2017/PR-shacl-20170608/ etc.

there's:

<link rel="canonical" href="https://www.w3.org/TR/shacl/">
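For anyone who wants to check what a given dated page declares, here is a minimal sketch using only Python's standard library. The class name `CanonicalFinder` and the helper `find_canonical` are made up for illustration, and the sample markup is a trimmed-down stand-in for a real snapshot page, not a copy of one.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link> whose rel contains "canonical"."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Self-closing <link .../> tags are also routed here by default.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel is a space-separated, case-insensitive token list.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the markup, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical, trimmed-down snapshot markup.
snapshot_html = (
    '<html><head>'
    '<link rel="canonical" href="https://www.w3.org/TR/shacl/">'
    '</head><body></body></html>'
)
print(find_canonical(snapshot_html))  # https://www.w3.org/TR/shacl/
```

This only reports what the page declares; whether a search engine honors the hint is up to the search engine.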

domenic commented 6 years ago

> Is that the dominant pattern though? What about folks using it per the definition? If you want to change anything here you'd first have to do some kind of analysis of the landscape.

I would be extraordinarily surprised if it wasn't the dominant pattern, given the many many articles on SEO explaining how to use it in that fashion, and its widely-known benefits for popular search engines.

That said, we can probably do some HTTP archive analysis if you think that's necessary...

domenic commented 6 years ago

@annevk any further thoughts on this? Especially given upcoming review drafts, I'd really like to direct search engines to the Living Standard, if they encounter any incoming links to snapshots or review drafts.

annevk commented 6 years ago

@domenic I still think the Living Standard is not the canonical representation of a snapshot. And I also think that if we adjust robots.txt to include review-drafts/ (as I'm planning to at least) it won't be a problem.
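A rough sketch of what the robots.txt approach would mean for a well-behaved crawler, using Python's `urllib.robotparser`. The rule text, the `example.org` host, and the paths are assumptions for illustration only; the actual robots.txt may look different.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: block everything under review-drafts/ for all crawlers.
robots_txt = """\
User-agent: *
Disallow: /review-drafts/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def may_crawl(url):
    """True if a crawler obeying the rules above may fetch the URL."""
    return parser.can_fetch("*", url)


# Hypothetical URLs: one dated review draft, one maintained page.
print(may_crawl("https://example.org/review-drafts/2018-02/"))  # False
print(may_crawl("https://example.org/multipage/"))              # True
```

Note that a blocked crawler never fetches the review draft, so it also never sees any link pointing from the draft to the maintained version.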

phonedude commented 6 years ago

just a reminder, the options are between:

  1. follow ~10 years of established practice by Google, mediawiki/Wikipedia/wikia, W3C, etc.

  2. create a new method

phonedude commented 6 years ago

I'll try another pass.

Perhaps it's the overloaded word "canonical" that is the problem. Let's replace all instances of "canonical" with "9f3fda2fef6dda85970e12ce9a9b8cbe", the md5 hash of "canonical":

$ echo -n "canonical" | md5
9f3fda2fef6dda85970e12ce9a9b8cbe

there are browser extensions to replace strings with other strings so you never have to see them, so for us all the W3C, Wikipedia, etc. pages now say things like:

<link rel="9f3fda2fef6dda85970e12ce9a9b8cbe" href="https://www.w3.org/TR/shacl/">
<link rel="9f3fda2fef6dda85970e12ce9a9b8cbe" href="https://en.wikipedia.org/wiki/DJ_Shadow"/>

etc.

Now decide if the interactions between Google and these pages produce the desired semantics (i.e., dated variants hinting "don't index me, index my undated friend here"). If they do, then rel="9f3fda2fef6dda85970e12ce9a9b8cbe" is the rel type you should use.

domenic commented 6 years ago

@annevk As noted previously, I don't think "canonical representation" is a useful definition for rel=canonical. The useful definition (i.e. the one used by implementers) is "what should I put in my search engine index when I see this page."

In the short term, I'd like to implement rel=canonical in our review drafts, without you blocking me. In the longer term, I'd welcome your help in changing the definition of rel=canonical to match implementations.

As for rel=canonical vs. robots.txt, I think it's better to have a crawler be able to follow incoming links and go to the right place, than to block crawlers entirely.
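As a sketch of how small the change to a generated draft could be, the following inserts a canonical link just before `</head>`. The function name `add_canonical`, the sample markup, and the `example.org` URL are hypothetical; this is not the actual build tooling.

```python
import re
from html import escape


def add_canonical(snapshot_html, living_url):
    """Return the markup with a rel=canonical link inserted before </head>."""
    # Escape the URL so characters like & and " are safe in the attribute.
    link = '<link rel="canonical" href="{}">'.format(escape(living_url, quote=True))
    match = re.search(r"</head\s*>", snapshot_html, re.IGNORECASE)
    if match is None:
        raise ValueError("no </head> found in the snapshot markup")
    pos = match.start()
    return snapshot_html[:pos] + link + "\n" + snapshot_html[pos:]


# Hypothetical draft markup and maintained URL.
draft = "<html><head><title>Review Draft</title></head><body></body></html>"
print(add_canonical(draft, "https://example.org/spec/"))
```

A crawler that follows an incoming link to the draft would then find the hint pointing at the maintained version, which is the behavior a robots.txt block cannot provide.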

annevk commented 6 years ago

Per https://webmasters.googleblog.com/2013/04/5-common-mistakes-with-relcanonical.html the pages have to be more or less the same. So whenever we do a major refactoring we'd be abusing it, no? Is there some URL that backs up your point of view?

domenic commented 6 years ago

Sure, the first link in that blog post takes us to https://support.google.com/webmasters/answer/139066?visit_id=1-636608746222517583-624958929&rd=1 which has more discussion.

annevk commented 6 years ago

Yeah, and all that talks about is duplicate content, not content under version control.

domenic commented 6 years ago

It makes the effects pretty clear:

> Google uses the canonical pages on your site as the gold standard of your site's content, as far as evaluating content and quality, and the Google Search result usually points to the canonical page, unless one of the duplicates is explicitly better suited to a user's query

> Why should I choose a canonical URL? [...] To specify which URL that you want people to see in search results. To consolidate link signals for similar (emphasis mine) or duplicate pages

annevk commented 6 years ago

I still think it would be better to avoid indexing it at all. "Similar" is not defined and if it turns out to be false at some point in the future we might end up with a weird alternate URL for a standard if it had gotten linked a ton for some reason.