whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

rel=bookmark with <link> #2899

Open edsu opened 7 years ago

edsu commented 7 years ago

Unlike many of the other link relations, rel=bookmark cannot be used with the <link> element. The others with the same restriction are external, nofollow, noopener, noreferrer, and tag.

As discussed over on the WHATWG discussion list, there is a group at the IETF that is proposing the addition of a new link relation, identifier. The purpose of this new relation is to assert a persistent or durable URL for the current document.

They took a close look and (I think rightly) decided that canonical isn't really the right fit because it isn't used to convey a persistent link. However this is typically the meaning of a permalink, which can be found in the definition of bookmark.

In the interests of not introducing yet another link relation for the concept of a permalink, which would cause confusion for web authors, would it be possible to update the definition of rel=bookmark so that it can be used with <link>? It seems like the existing language would allow for it:

The bookmark keyword gives a permalink for the nearest ancestor article element of the linking element in question, or of the section the linking element is most closely associated with, if there are no ancestor article elements.

I guess I'm volunteering to put together a PR to that effect. But I'm also guessing that <link> was left off for a particular reason?

phonedude commented 7 years ago

This probably deserves its own blog post, but here is our twitter thread on this topic: https://twitter.com/i/moments/895081563653902336

annevk commented 7 years ago

How is canonical not the right fit?

The canonical keyword indicates that URL given by the href attribute is the preferred URL for the current document.

I don't see how that would not be persistent.

I suspect bookmark is not allowed on link elements to not duplicate canonical, but I can't find evidence.

hvdsomp commented 7 years ago

Regarding canonical not being applicable, see http://ws-dl.blogspot.nl/2017/08/2017-08-07-relcanonical-does-not-mean.html.

annevk commented 7 years ago

How Wikipedia uses canonical seems rather broken. E.g., https://dom.spec.whatwg.org/commit-snapshots/aaf90dcbedf22f856ff8dcf00952926127fc5d99/ certainly shouldn't have rel=canonical pointing to https://dom.spec.whatwg.org/. That doesn't make much sense.

hvdsomp commented 7 years ago

Definitely correct in your example. But Wikipedia wants the most recent version of a page (not any/all oldid version) indexed by search engines. Hence its use of canonical to point from oldid URI to generic URI.

annevk commented 7 years ago

If you don't want it indexed you should just ban the robot. Don't tell it non-identical pages are identical.

hvdsomp commented 7 years ago

Good point re the robot. Anyhow, the intent was not to discuss canonical because we know it does not apply to our target use cases, as per http://ws-dl.blogspot.nl/2017/08/2017-08-07-relcanonical-does-not-mean.html.

annevk commented 7 years ago

My point was that I don't think that article uses solid rationale.

edsu commented 7 years ago

Thanks for the responses. @annevk one reason why I thought @hvdsomp & @phonedude made a good case for not using canonical is found in the sentence just following the one you quoted from the spec:

The canonical keyword indicates that URL given by the href attribute is the preferred URL for the current document. That helps search engines reduce duplicate content, as described in more detail in The Canonical Link Relation specification.

RFC 6596 makes it even more clear:

The canonical link relation specifies the preferred IRI from resources with duplicative content. Common implementations of the canonical link relation are to specify the preferred version of an IRI from duplicate pages created with the addition of IRI parameters (e.g., session IDs) or to specify the single-page version as preferred over the same content separated on multiple component pages.

The question is: preferred for what? canonical is preferred for search engines that are indexing the page, so that duplicates can be collapsed in search results. In an ideal world this would also be the preferred URL for persistence. In the academic publishing world that the I-D authors are speaking to it is often useful to refer to a URL at another domain (doi.org).

As a short thought experiment, consider what would happen if you told Google that the preferred URL for your page at http://example.com was http://dx.doi.org/1234. Would this drive your traffic to dx.doi.org? Would Google interpret this as some attempt at spamming or subverting their index? It seems like a gray area that canonical does not really address.

I think a rel=bookmark that could be used with <link> would be a useful way of indicating a URL for a document, potentially at another domain, that is deemed to be more persistent.

edsu commented 7 years ago

@annevk also, thanks for the insight that rel=bookmark isn't currently used with <link> probably to avoid confusion with rel=canonical. That does seem to make sense, and perhaps is a good reason to close this issue.

annevk commented 7 years ago

In the academic publishing world that the I-D authors are speaking to it is often useful to refer to a URL at another domain (doi.org).

I don't really understand. Is that other location the canonical location for the content or not? Another domain doesn't really change how the relation works I think.

edsu commented 7 years ago

I think the argument that is being made here is that it is not the canonical location, at least in terms of how canonical is defined in the context of HTML.

I think it's useful to think through what a search engine would do if they ran across a canonical link at another domain, and what impact this would have on publishers who chose to use it.

phonedude commented 7 years ago

  1. there is significant evidence that rel="bookmark" predates rel="canonical", by at least as many as 7 years, and thus the limitation on bookmark is almost surely not about canonical.

one of the first mentions I can find for rel="canonical" is this 2009 blog post: https://webmasters.googleblog.com/2009/02/specify-your-canonical.html

and it's conspicuously absent from this 2008 blog post that surely would have mentioned it if it existed: https://webmasters.googleblog.com/2008/09/demystifying-duplicate-content-penalty.html

one of the first mentions that I can find for rel="bookmark" is from 2002: http://tantek.com/log/2002/11.html#L20021128t1352 and there is this one from 2003: https://annevankesteren.nl/2003/08/putting-relbookmark-to-work . I believe rel="permalink" predates "bookmark", but I'm not entirely clear on why "permalink" later became "bookmark".

Regardless, it's about solving a problem that used to exist with blogs, forums, etc. E.g., this blog post from 2000: http://www.kottke.org/00/03/finally-did-you-notice-the is noting that there is now a permanent link for the content, which was a big deal at the time. This is the genesis of where we get rel="permalink|bookmark".

Please see this series of tweets: https://twitter.com/i/moments/895081563653902336 which shows that it simply doesn't make sense to use bookmark at a <link> level.

  2. I would argue that Wikipedia is using rel="canonical" the correct way; "canonical" effectively does block indexing of this particular page and also recommends a replacement page instead.

Putting aside the issue of early article stubs, splits, mergers, etc., the general case is that the n+1 version of an article is going to be duplicative of version n, at least according to text analysis methods. The reader may attribute significant semantics to the edits in the page, but in the general case Jaccard, cosine, etc. will report version n and version n+1 to be nearly identical. See also the ebay and amazon examples in: http://ws-dl.blogspot.com/2017/08/2017-08-07-relcanonical-does-not-mean.html -- the HTML is not exactly the same, but it is duplicative, and a SE user would not want to see both versions returned in a SERP.

Furthermore, I think you should have a rel="canonical" on https://dom.spec.whatwg.org/commit-snapshots/aaf90dcbedf22f856ff8dcf00952926127fc5d99/ You already have a human readable statement saying "...Do not reference this version as authoritative in any way. Instead, see https://dom.spec.whatwg.org/ for the living standard", but you don't have machine readable guidance. Instead of simply blocking Google from indexing it (which it doesn't do now, btw), it should refer Google to the version it should index, just like it refers the human reader to the version they should be reading (i.e., the canonical version).
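
The near-duplicate claim about version n and version n+1 above can be illustrated with a minimal sketch of Jaccard similarity over word sets. The two revision strings are made up for illustration, and the tokenizer is an assumption; real duplicate detection in a search engine is certainly more involved than this.

```python
import re


def word_set(text: str) -> set[str]:
    """Lowercase the text and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A & B| / |A | B| over the two word sets."""
    set_a, set_b = word_set(a), word_set(b)
    if not set_a and not set_b:
        return 1.0  # two empty documents are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical revisions n and n+1 of an article: a single word was edited.
revision_n = "The quick brown fox jumps over the lazy dog near the river bank"
revision_n_plus_1 = "The quick brown fox leaps over the lazy dog near the river bank"

similarity = jaccard(revision_n, revision_n_plus_1)
print(f"{similarity:.2f}")
```

Even though a reader may care a great deal about the edited word, 10 of the 12 distinct words are shared, so a set-based measure reports the two revisions as close to identical.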

phonedude commented 7 years ago

here's a 1999 spec that mentions rel="bookmark"; it doesn't explicitly state that it is not legal in <link>, but it does mention that you can have several bookmarks per page (which implies the limitation):

https://www.w3.org/TR/REC-html40/types.html#h-6.12

Bookmark Refers to a bookmark. A bookmark is a link to a key entry point within an extended document. The title attribute may be used, for example, to label the bookmark. Note that several bookmarks may be defined in each document.

annevk commented 7 years ago

Euhm, the canonical version of a commit-snapshot is that commit-snapshot. It's not some other document that's continuously updated. We should probably add a robots.txt though. Filed https://github.com/whatwg/meta/issues/32 on that.

domenic commented 7 years ago

Anne, I think what people are trying to point out is that rel=canonical shouldn't necessarily be interpreted according to the English meaning of the word canonical, but instead based on the ecosystem (largely search-engine related) that uses and interprets it. I am pretty sure the spec is meant to align with the latter, not the former.

annevk commented 7 years ago

I don't see the distinction.

edsu commented 7 years ago

@domenic thanks, that's exactly what I was going for. The WHATWG approach of looking at desired behavior rather than nailing down the semantics is helpful. I guess it's a bit alien because we're not talking about the behavior of the vanilla browsers, but automated browsers (bots).

In an ideal world @annevk would be correct: the link a web publisher would like to be presented in search results would also be persistent. But, alas, here we are with a web where (academic) publishers find it difficult to commit to persistence of their web resources and would rather defer that responsibility to another authority, like DOI or some other archive.

I can definitely see @annevk's point that appeasing this community by providing both canonical and bookmark could just be a recipe for weakening the ecosystem. But I think this discussion has been very helpful (at least for me) to understand why rel=bookmark is the way it is.

phonedude commented 7 years ago

@domenic is correct; it's not about what the word means to you but how the rel type is defined in RFC 6596, the short version of which is "don't index me, index this other thing". https://tools.ietf.org/html/rfc6596

domenic commented 7 years ago

A few points.

@annevk, I think it may help to understand why people (including Wikipedia) are using canonical as they are if you mentally replace rel=canonical with rel=if-you-find-this-while-crawling-the-web-see-other-url-instead-and-store-that-in-your-database-for-future-lookups. From that perspective using it on commit snapshots or historical revisions makes sense to me. And I think it's important to have a definition that matches prevailing usage, instead of saying sites like Wikipedia are doing things wrong.
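
The "see other URL instead and store that in your database" reading can be sketched as a toy crawler index. This is only an illustration of that mental model: CanonicalFinder, index_key, and the wiki.example URLs are hypothetical, and real search engines treat rel=canonical as a hint rather than following it unconditionally as this sketch does.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # rel is a space-separated, case-insensitive list of keywords
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_tokens and self.canonical is None:
            self.canonical = attributes.get("href")


def index_key(fetched_url: str, html: str) -> str:
    """Return the URL a hypothetical crawler would file this page under."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or fetched_url


# Hypothetical crawl results: (fetched URL, page source).
crawled = [
    ("https://wiki.example/w/index.php?title=Foo&oldid=123",
     '<head><link rel="canonical" href="https://wiki.example/wiki/Foo"></head>'),
    ("https://wiki.example/wiki/Foo",
     '<head><link rel="canonical" href="https://wiki.example/wiki/Foo"></head>'),
    ("https://wiki.example/wiki/Bar", "<head><title>Bar</title></head>"),
]

index = sorted({index_key(url, html) for url, html in crawled})
print(index)
```

Under this model the oldid URL never gets its own entry; both Foo fetches collapse onto the one URL the site asked for, which matches how the historical-revision case is described above.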

(It still seems like even with the prevailing definition canonical might make sense for the DOI use case, if sites don't want people linking to them but instead to DOI archives.)

As for bookmark and using it for DOI-type lookups, I think one interesting distinction is that bookmark is meant to be used when the user wants a "permalink" they can come back to later. However I think it might be surprising if they saved such a permalink and ended up on an entirely new site. Certainly I would be extremely surprised if I used my browser's bookmark feature and that happened. So maybe bookmark is not a good fit for these cases either.

phonedude commented 7 years ago

Google's 2009 blog post about this lists wikia.com (which uses mediawiki, just like wikipedia) as a "trusted tester": https://webmasters.googleblog.com/2009/02/specify-your-canonical.html

"Sounds great—can I see a live example? Yes, wikia.com helped us as a trusted tester. For example, you'll notice that the source code on the URL http://starwars.wikia.com/wiki/Nelvana_Limited specifies its rel="canonical" as: http://starwars.wikia.com/wiki/Nelvana."

Here's a 2009 memento for that page: http://web.archive.org/web/20090402163356/http://starwars.wikia.com/wiki/Nelvan

I can't find an archived "oldid" version for that page, but in the page that is archived you see:

<link rel="canonical" href="http://web.archive.org/web/20090402163356/http://starwars.wikia.com/wiki/Nelvan"/>

Which IA has aggressively rewritten, but nonetheless the point remains that the page is issuing a rel=canonical to itself. That's because in April 2009 when http://web.archive.org/web/20090402163356/http://starwars.wikia.com/wiki/Nelvan was observed, http://starwars.wikia.com/wiki/Nelvan was also accessible as a static version as: http://starwars.wikia.com/wiki/Nelvana?direction=prev&oldid=2525429, which of course is not the URI that we want Google to index.

My point here is that Google called out wikia.com as a "trusted tester" for rel=canonical, and how Google and wikia.com (and by extension mediawiki and Wikipedia) agreed to do it in 2009 is probably the intended semantics.

(edit to fix angle brackets)

phonedude commented 7 years ago

@domenic rel="bookmark" is really a relic from the bad old days; I'm not sure browsers ever actually "did" anything with it, but it was a bit of semantic sugar to make it extra-explicit that the content you're viewing "here" has a permalink of ___. it has a built-in assumption of N bookmarks per page, and thus you can't use rel="bookmark" in a page-level header. see: https://twitter.com/i/moments/895081563653902336

we hoped the semantics would be "use this URI when you hit ctrl-D", but unfortunately that's not what it was for. and since it addresses a problem that doesn't really exist anymore (anonymous content in blogs), while it hasn't gone away, collectively we have sort of forgotten what it was meant to do. and if "bookmark" wasn't already taken (albeit somewhat abandoned now), we probably would have used that for our proposed rel type.

edsu commented 7 years ago

@phonedude I still disagree with your assertion that rel=bookmark could never be used at the document level. It is simple to imagine wanting to bookmark the content of the current page at a particular URL, even if there are multiple bookmarks in the page for subsections of content. In fact the blog post of Anne's you linked to above has an example of creating a <link> element with rel=bookmark. I understand that you want to position identifier as the only logical solution here, but I don't think bookmark can be ruled out as easily as you suggest ... other than the fact that it's currently not allowed to be used with <link> per the HTML spec.

phonedude commented 7 years ago

@edsu that 2003 blog post is a proposal / js trick for promoting the values from "a" to "link"; presumably it was never adopted and then explicitly prohibited in subsequent specs for a good reason. I don't have the history here, but I can't believe this prohibition is because of an oversight; my reading is that the current prohibition enables certain semantics that would be otherwise lost with page-level link elements.

edsu commented 7 years ago

@phonedude the semantics of "this content here has this persistent URL" would be lost? I guess I'm lost :-)

phonedude commented 7 years ago

rel="bookmark" is meant to bind to the parent enveloping element (div, h, etc.). There are meant to be multiple bookmarks per page. Below is the intended use. Again, I wasn't there, but I suspect "they" decided that the more general solution is to always bind to the parent element, and if you want to specify a page level rel="bookmark", then you place it directly inside the body instead of in the head.

If you allow multiple rel="bookmark" in <link> elements, you lose the ability to map which bookmark binds to which div, h1, h2, etc. elements. If you restrict it to a single <link> element, then you have a weird situation where it can appear in multiple <A> elements, but only a single <link> element. I don't see a precedent for "single <link> and multiple <A> " in https://html.spec.whatwg.org/multipage/links.html#linkTypes. But again, that's my guess as to the reason.

+----------------------------+
|                            |
|  <A href="blog.html"       |
|     rel=bookmark>          |
|  Super awesome alphabet    |
|  blog! </a>                |
|  Each day is a diff letter!|
|                            |
|  +---------------------+   |
|  | A is awesome!!!!    |   |
|  | <a href="a.html"    |   |
|  |    rel=bookmark>    |   |
|  | permalink for A </a>|   |
|  +---------------------+   |
|                            |
|  +---------------------+   |
|  | B is better than A! |   |
|  | <a href="b.html"    |   |
|  |    rel=bookmark>    |   |
|  | permalink for B </a>|   |
|  +---------------------+   |
|                            |
|  +---------------------+   |
|  | C is not so great.  |   |
|  | <a href="c.html"    |   |
|  |    rel=bookmark>    |   |
|  | permalink for C </a>|   |
|  +---------------------+   |
|                            |
+----------------------------+

$ curl blog.html
Super awesome alphabet blog!
Each day is a diff letter!
A is awesome!!!!
permalink for A
B is better than A!
permalink for B 
C is not so great.
permalink for C
$ curl a.html
A is awesome!!!!
permalink for A
$ curl b.html
B is better than A!
permalink for B 
$ curl c.html
C is not so great.
permalink for C
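
The binding rule in the diagram can be sketched roughly as follows. This is a minimal illustration, not anything a browser is known to do: BookmarkBinder is a hypothetical name, the boxes are assumed to be article elements, and the id values are made up so each scope can be named.

```python
from html.parser import HTMLParser
from typing import Optional


class BookmarkBinder(HTMLParser):
    """Pair each <a rel="bookmark"> with its nearest enclosing <article>."""

    def __init__(self) -> None:
        super().__init__()
        self.open_articles: list[str] = []  # ids of currently open <article> elements
        self.bindings: list[tuple[str, Optional[str]]] = []  # (href, article id or None)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "article":
            self.open_articles.append(attributes.get("id") or "(no id)")
        elif tag == "a":
            # rel is a space-separated, case-insensitive list of keywords
            rel_tokens = (attributes.get("rel") or "").lower().split()
            if "bookmark" in rel_tokens and attributes.get("href"):
                # Innermost open article wins; None means "the page as a whole".
                scope = self.open_articles[-1] if self.open_articles else None
                self.bindings.append((attributes["href"], scope))

    def handle_endtag(self, tag):
        if tag == "article" and self.open_articles:
            self.open_articles.pop()


PAGE = """
<body>
  <A href="blog.html" rel=bookmark>Super awesome alphabet blog!</a>
  <article id="a"><p>A is awesome!!!!</p>
    <a href="a.html" rel=bookmark>permalink for A</a></article>
  <article id="b"><p>B is better than A!</p>
    <a href="b.html" rel=bookmark>permalink for B</a></article>
</body>
"""

binder = BookmarkBinder()
binder.feed(PAGE)
for href, scope in binder.bindings:
    print(href, "->", scope or "whole page")
```

Because each link sits inside the content it names, the scope falls out of the markup itself; a <link> in the head has no such position to read the scope from.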

edsu commented 7 years ago

ASCII Art always wins!

But seriously, if there was a <link rel="bookmark" href="..."> in the <head> and the <head> is an element that contains metadata for the document wouldn't it be reasonable for clients to consider the target URL a bookmark for the current document? If there are more than one (hey the web is a wild & crazy place) then there are multiple persistent links for the document. If there are bookmarks for sections within the document so be it. I really don't see what the problem is.
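
What a client might do under this proposal can be sketched as below. Note that the HTML spec does not currently allow rel=bookmark on <link>, so this shows hypothetical behavior only; HeadBookmarkCollector and the example URLs are made up for illustration.

```python
from html.parser import HTMLParser


class HeadBookmarkCollector(HTMLParser):
    """Collect href values of <link rel="bookmark"> elements found inside <head>."""

    def __init__(self) -> None:
        super().__init__()
        self.in_head = False
        self.document_bookmarks: list[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            # rel is a space-separated, case-insensitive list of keywords
            rel_tokens = (attributes.get("rel") or "").lower().split()
            if "bookmark" in rel_tokens and attributes.get("href"):
                self.document_bookmarks.append(attributes["href"])

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


PAGE = """
<html><head>
  <title>Example article</title>
  <link rel="bookmark" href="https://doi.example/10.1234/abcd">
  <link rel="bookmark" href="https://archive.example/article/42">
  <link rel="stylesheet" href="style.css">
</head><body>
  <article><a rel="bookmark" href="/posts/1">permalink</a></article>
</body></html>
"""

collector = HeadBookmarkCollector()
collector.feed(PAGE)
print(collector.document_bookmarks)
```

The section-level permalink in the body is left alone, and the two head-level links are both reported as persistent links for the document as a whole.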

But, did I say I love ASCII art? I should close this ticket just because of the ASCII art.

phonedude commented 7 years ago

I'm not necessarily arguing for how it should have been done, just trying to infer the reasoning behind the restriction (bc I don't think the restriction was an accident/oversight). My best guess is "they" decided to come up with 1 rule that works for the general case: N URIs for N different pages, and bind each to the nearest parent element to determine scope. That covers the case for a "permalink" for the top level page (which is kind of a special case in this model I think), and the more likely intended case of enveloped content.

I would guess a few more things that might have shaped the thinking at the time:

(A corollary to the above is that an agent already knows what the URI for "this" page is, so restating it probably didn't seem interesting in 1999; selecting/asserting your preferred URI from a range of canonical URIs, DOIs, etc. might have been a luxury problem that hadn't really arrived yet.)

Anyway, in summary, I don't think the restriction is an accident and I strongly suspect the restriction predates the arrival of rel="canonical". And even if both of those guesses are wrong, it feels pretty sketchy to reverse a restriction that's been in place for this long, even if we can't quite uncover why it was put there in the first place.

We should attempt no landing at Europa https://www.youtube.com/watch?v=38EDhpxzn2g#t=1m58s

P.S. I'm glad you like ascii art as much as me ;-)

hvdsomp commented 7 years ago

On Aug 11, 2017, at 20:46, Ed Summers notifications@github.com wrote:

ASCII Art always wins!

But seriously, if there was a <link rel="bookmark" href="..."> in the <head> and the <head> contains metadata for the document wouldn't it be reasonable for clients to consider the target URL a bookmark for the current document? If there are more than one (hey the web is a wild & crazy place)

It really is. That's why we want to make sure relation types are not called "identifier".

Cheers

Herbert

then there are multiple persistent links for the document. If there are bookmarks for sections within the document so be it. I really don't see what the problem is.
