webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives
https://pypi.python.org/pypi/pywb
GNU General Public License v3.0

The collection attribute is illegal in Link #546

Open ibnesayeed opened 4 years ago

ibnesayeed commented 4 years ago

As per RFC 5988 arbitrary attributes are not allowed in Link, hence the collection attribute in the Link header and in the TimeMap entity MUST be removed or incorporated as per RFC 6573.

See: https://ws-dl.blogspot.com/2020/03/2020-03-26-memento-compliance-audit-of.html#1-2-timemap
And: https://ws-dl.blogspot.com/2020/03/2020-03-26-memento-compliance-audit-of.html#1-3-main-page-memento

ato commented 4 years ago

As per RFC 5988 arbitrary attributes are not allowed in Link

I'm having trouble finding this statement in RFC 5988.

The grammar also has a link-extension term which seemingly allows arbitrary data and says this about it:

... any link-extension link-params are considered to be target attributes for the link.

and then later:

Target attributes are a set of key/value pairs that describe the link or its target; for example, a media type hint. This specification does not attempt to coordinate their names or use, but does provide common target attributes for use in the Link HTTP header.
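To make the grammar point concrete, here is a minimal Python sketch (not pywb code, and not a complete RFC 5988 parser) that splits one link-value into the parameters the grammar names explicitly and everything that falls under link-extension. The entry string and the function name `split_link_params` are made up for illustration. Note that it only shows what the syntax accepts; it says nothing about whether a given extension is standardized.

```python
import re

# Parameter names spelled out in the RFC 5988 link-param grammar (section 5).
RFC5988_PARAMS = {"rel", "anchor", "rev", "hreflang", "media", "title", "title*", "type"}

# One ";name=value" pair; the value is either a quoted string or a bare token.
PARAM_RE = re.compile(r';\s*([^=;\s]+)\s*=\s*(?:"([^"]*)"|([^;,\s]*))')


def split_link_params(link_value):
    """Return (target, named_params, extension_params) for a single link-value."""
    match = re.match(r'\s*<([^>]*)>', link_value)
    if match is None:
        raise ValueError("a link-value must start with <URI-Reference>")
    named, extension = {}, {}
    for name, quoted, token in PARAM_RE.findall(link_value[match.end():]):
        bucket = named if name.lower() in RFC5988_PARAMS else extension
        bucket[name.lower()] = quoted or token
    return match.group(1), named, extension


# Hypothetical TimeMap entry carrying the attribute under discussion.
entry = ('<https://pywbtest.ws-dl.cs.odu.edu/example/20200323133704mp_/https://example.com/>; '
         'rel="memento"; datetime="Mon, 23 Mar 2020 13:37:04 GMT"; collection="example"')
target, named, extension = split_link_params(entry)
print(named)
print(extension)
```

In this sketch both datetime and collection end up in the extension bucket, since neither name appears in the RFC 5988 grammar itself.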

ibnesayeed commented 4 years ago

RFC 5988 does not prohibit extensions, but extensions need to be defined in an RFC. For example, the Memento RFC introduces the datetime attribute. What you are quoting lays the foundation for extension through the standardization process; it is not a means to throw in arbitrary attributes. In the specific case of collections, RFC 6573 can be consulted to rethink the approach.

As a counterexample, HTML5 explicitly allows arbitrary attributes on most elements as long as they carry the data- prefix, and Custom Elements allow any HTML element name as long as it contains a hyphen and is not one of a finite set of reserved hyphenated element names.

ato commented 4 years ago

Pywb's documentation explains the reasoning for the collection attribute as:

When using the auto-all collection, it is possible to determine the original collection of each resource by looking at the Link header metadata if Memento API is enabled. The header will include the extra collection field, specifying the collection

A hypothetical example use case might be an institution that renders a calendar page based on the timemap and wants to visually distinguish snapshots that belong to different collections, perhaps to show public access vs. onsite-only or to acknowledge the source of the data.

The collection relation defined by RFC 6573 is not directly applicable to this use case, as it is not the timemap that belongs to a collection but rather the individual snapshots. While you could return a 'collection' relation in a Link header on the URI-M itself, that would mean the client would have to issue a request for each snapshot. This is impractical in the not uncommon case where there are hundreds or thousands of snapshots.

A compromise for compatibility with clients that are confused by extra attributes might be to include it only on timemaps for aggregate collections as it doesn't seem to add any value when querying an individual collection.
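As a rough illustration of the calendar-page use case, the following Python sketch groups the mementos of a link-format timemap by their collection attribute in a single pass, without requesting each URI-M. The timemap body, the archive.example host, and the collection names are invented for the example and are not actual pywb output; the parser is deliberately minimal.

```python
import re
from collections import defaultdict

# One link-value: <target> followed by any number of ;name=value pairs.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[^=;,\s]+\s*=\s*(?:"[^"]*"|[^;,\s]*))*)')
PARAM_RE = re.compile(r';\s*([^=;\s]+)\s*=\s*(?:"([^"]*)"|([^;,\s]*))')

# Hypothetical timemap for an aggregate collection (not real pywb output).
TIMEMAP = '''<https://example.com/>; rel="original",
<https://archive.example/all/20200101000000mp_/https://example.com/>; rel="first memento"; datetime="Wed, 01 Jan 2020 00:00:00 GMT"; collection="public",
<https://archive.example/all/20200323133704mp_/https://example.com/>; rel="last memento"; datetime="Mon, 23 Mar 2020 13:37:04 GMT"; collection="onsite"
'''


def mementos_by_collection(timemap_body):
    """Map each collection value to a list of (datetime, URI-M) pairs."""
    groups = defaultdict(list)
    for target, raw_params in LINK_RE.findall(timemap_body):
        params = {n.lower(): (q or t) for n, q, t in PARAM_RE.findall(raw_params)}
        # rel may hold several space-separated relation types, e.g. "first memento".
        if "memento" in params.get("rel", "").split():
            groups[params.get("collection")].append((params.get("datetime"), target))
    return dict(groups)


groups = mementos_by_collection(TIMEMAP)
for collection, snapshots in groups.items():
    print(collection, len(snapshots))
```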

ibnesayeed commented 4 years ago

I don't think we are contesting the purpose or denying the usefulness of the attribute; the focus here is on compliance and on the gradual drift away from specifications that accumulates over time. In fact, @machawk1 and I have been working, at a slow pace, on something called "Extended TimeMap", in which we are exploring the possibility of exposing many on-demand attributes of each memento, such as content hash (useful for cross-archive deduplication), status code (to isolate resources restricted by law, or to filter nearby redirects from/to http vs. https schemes or www vs. naked domains for more accurate memento counting), type (to identify raw, rewritten, banner, or screenshot), ACL annotations, or quality-related attributes such as the Memento Damage score. This clearly goes into CDX API++ territory.

The biggest hurdle we had was the limitation posed by the Link format. It was perhaps a good choice at the time of the Memento draft because of the built-in support for the Link header in various tools, and the same format was promoted for use as the entity in the case of a TimeMap. Past the draft phase, once it became an established standard with a proper RFC number, it was realized that JSON payloads are more widely supported in various tools, so a non-standard variation was introduced in the TimeTravel Service API. That variation does not replace the Link format, which is essential to comply with the RFC and to interoperate.

I hope that, if and when the Memento specification is extended or revised, some of these issues are reconsidered. In that case I would vote for UKVS being the format of choice for TimeMaps.

it doesn't seem to add any value when querying an individual collection

This is true. From a practical standpoint such per-record attributes will only be useful when items are collected from multiple sources/collections. I have another concern about the value of the attribute as it is an opaque string, which means nothing if it were to be aggregated from many different sources (especially when the names collide). If these collections had a URI where they were described, using that URI as a value would be more useful.

phonedude commented 4 years ago

RFC 6573 is the right way to do collection/item.

RFC 5988 allows you to override the target URI

https://tools.ietf.org/html/rfc5988#section-5.2

By default, the context of a link conveyed in the Link header field is the IRI of the requested resource.

When present, the anchor parameter overrides this with another URI, such as a fragment of this resource, or a third resource (i.e., when the anchor value is an absolute URI).

if you want to say: "this collection has-item URI-M":

Link: <URI-M>; rel="item"; anchor="collection-URI"

Note that if you consider the collection to be made on (or include) the URI-R, then this is already covered in RFC 7089's definition of the context URI for TimeMaps:

https://tools.ietf.org/html/rfc7089#section-5

The Link header field of [RFC5988], and the media type of the entity-body MUST be "application/link-format" as introduced in [RFC6690]. Links contained in the entity-body MUST be interpreted as follows:

o The Context IRI is set to the anchor parameter, when specified;

o The Context IRI of links with the "self" Relation Types is the URI-T of the TimeMap, i.e., the URI of the resource from which the TimeMap was requested;

o The Context IRI of all other links is the URI-R of the Original Resource, which is provided as the Target IRI of the link with an "original" Relation Type.
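The three rules quoted above can be sketched as a small Python function. This is only an illustration: the links are assumed to be already parsed into (target, params) pairs, the URI-T shown is a guess at what a timemap URL could look like, and `resolve_contexts` is a hypothetical helper, not part of any Memento library.

```python
def resolve_contexts(links, uri_t):
    """Apply the RFC 7089 section 5 context rules to parsed TimeMap links.

    links is a list of (target, params) pairs; returns (context, rels, target) triples.
    """
    # The URI-R is the target of the link whose relation type is "original".
    uri_r = next((t for t, p in links if "original" in p.get("rel", "").split()), None)
    if uri_r is None:
        raise ValueError('TimeMap has no link with rel="original"')
    resolved = []
    for target, params in links:
        rels = params.get("rel", "").split()
        if "anchor" in params:      # rule 1: an explicit anchor wins
            context = params["anchor"]
        elif "self" in rels:        # rule 2: "self" links are about the TimeMap
            context = uri_t
        else:                       # rule 3: everything else is about the URI-R
            context = uri_r
        resolved.append((context, rels, target))
    return resolved


# Assumed URI-T, for illustration only.
URI_T = "https://pywbtest.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/"
URI_M = "https://pywbtest.ws-dl.cs.odu.edu/example/20200323133704mp_/https://example.com/"
links = [
    ("https://example.com/", {"rel": "original"}),
    (URI_T, {"rel": "self", "type": "application/link-format"}),
    (URI_M, {"rel": "memento", "datetime": "Mon, 23 Mar 2020 13:37:04 GMT"}),
    (URI_M, {"rel": "item", "anchor": "https://pywbtest.ws-dl.cs.odu.edu/example/"}),
]
contexts = [context for context, _, _ in resolve_contexts(links, URI_T)]
print(contexts)
```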

ibnesayeed commented 4 years ago

Link: <URI-M>; rel="item"; anchor="collection-URI"

Yes, that's doable, but I would note here that it will require duplicate URI-M entries in the TimeMap because the context of memento relation will be the URI-R and of the item relation it will be the collection-URI, which will not be possible in a single entry.

phonedude commented 4 years ago

yes, you're correct.

The mementos can point back to the collection but for the TimeMap this will be problematic.

ato commented 4 years ago

Ah, I see. I had missed the anchor attribute. So what you're suggesting is this?

<https://pywbtest.ws-dl.cs.odu.edu/example/20200323133704mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:37:04 GMT",
<https://pywbtest.ws-dl.cs.odu.edu/example/>; rel="collection"; anchor="https://pywbtest.ws-dl.cs.odu.edu/example/20200323133704mp_/https://example.com/",

ibnesayeed commented 4 years ago

Or

<https://pywbtest.ws-dl.cs.odu.edu/example/20200323133704mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:37:04 GMT",
<https://pywbtest.ws-dl.cs.odu.edu/example/20200323133704mp_/https://example.com/>; rel="item"; anchor="https://pywbtest.ws-dl.cs.odu.edu/example/",
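For illustration, a TimeMap generator following the second variant would have to emit two entries per snapshot, which is the duplication noted earlier in the thread. A minimal Python sketch, assuming each collection has its own URI; the helper name `timemap_entries` is hypothetical and this is not how pywb currently builds timemaps:

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def timemap_entries(uri_m, captured_at, collection_uri):
    """Return the two link-values needed for one snapshot.

    The memento link keeps the default context (the URI-R), while the item
    link sets the collection as its context through the anchor parameter.
    """
    # usegmt=True requires a timezone-aware datetime in UTC.
    http_date = format_datetime(captured_at, usegmt=True)
    return [
        f'<{uri_m}>; rel="memento"; datetime="{http_date}"',
        f'<{uri_m}>; rel="item"; anchor="{collection_uri}"',
    ]


entries = timemap_entries(
    "https://pywbtest.ws-dl.cs.odu.edu/example/20200323133704mp_/https://example.com/",
    datetime(2020, 3, 23, 13, 37, 4, tzinfo=timezone.utc),
    "https://pywbtest.ws-dl.cs.odu.edu/example/",
)
print(",\n".join(entries))
```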