weserv / images

Source code of wsrv.nl (formerly images.weserv.nl), to be used on your own server(s).
https://wsrv.nl/
BSD 3-Clause "New" or "Revised" License

Add link rel="canonical" response header? #309

Closed (GitBoudewijn closed this issue 2 years ago)

GitBoudewijn commented 2 years ago

I came across the following article about WordPress Photon:

https://michaelkummer.com/tech/jetpack-photon-seo/

Apparently they use a link response header with rel="canonical" so that search engines will index the original image rather than the proxied image.
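For illustration, such a header could be built as in the sketch below. The function name and the example.com URL are hypothetical and are not taken from Photon or weserv; only the general `Link: <target>; rel="canonical"` shape is assumed.

```python
def canonical_link_header(original_url: str) -> tuple[str, str]:
    """Build a (name, value) pair for a Link header with rel="canonical".

    Hypothetical helper for illustration only.
    """
    # The target URL goes in angle brackets, followed by the rel parameter.
    return ("Link", f'<{original_url}>; rel="canonical"')


name, value = canonical_link_header("https://example.com/photo.jpg")
print(f"{name}: {value}")
# prints: Link: <https://example.com/photo.jpg>; rel="canonical"
```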

Might be a good idea for you to have this as well?

andrieslouw commented 2 years ago

Seems logical, will investigate!

kleisauke commented 2 years ago

This has been implemented with commit https://github.com/weserv/images/commit/522e8f2d64660524e8a67d68eabef7c669b8c6bb, which has just been rolled out to production. Thanks for reporting this!

GitBoudewijn commented 2 years ago

Nice.

Just a question about the implementation: it seems you're using the final (redirected) URL rather than just echoing the initially supplied URL, which is what Photon does. Not sure what the upsides/downsides of this are. Maybe there could be cases where the redirection changes?

kleisauke commented 2 years ago

The purpose of a canonical is to explicitly and unambiguously indicate a preferred URL, so I think it's better to include the final URL rather than the initial URL in the rel="canonical" response header.

When an upstream server changes the redirection to another image, search engines may index the wrong URL if we set the initial URL as canonical (due to caching), whereas the current approach would still point to the final URL fetched at that time.

Note that the blog post you linked is a bit misleading, as Google doesn't honor this response header for images, see for example: https://webmasters.stackexchange.com/a/118622 https://twitter.com/JohnMu/status/768024243544133633

GitBoudewijn commented 2 years ago

Ok that makes sense. I just looked at how Photon does it but this is probably better. This way any url that redirects to https also gets the https url as canonical.

It seems the only information about Google is from 5 years ago so who knows. Thanks!

GitBoudewijn commented 2 years ago

@kleisauke Found this post from 3 months ago from a WordPress engineer and they still claim Google follows the canonical link: http://wordpress.org/support/topic/too-many-oversights-make-this-plugin-easily-exploitable-for-image-thieves/ (Ignore the other guy who talks about 'stealing' images as if that's possible.)

But I wanted to come back on the redirect issue:

Say you get an image from Instagram like this: https://images.weserv.nl/?url=instagr.am/p/CXMsNwIM_Zf/media/%3Fsize%3Dl

This will redirect to a URL on cdninstagram.com with a time-based signature that expires after a couple of days. So if you still have the image in the cache, the canonical link will point to a non-existent location!

In this case the request goes like this:

  • 301 http://instagr.am/...
  • 301 https://instagr.am/...
  • 302 https://www.instagram.com/...
  • 200 https://xxx.cdninstagram.com/...

So I think in this case the 3rd URL should be the canonical one. So I would suggest when determining the canonical URL to only follow permanent redirects (301 and 308). I guess this should be pretty easy to implement. During the fetching process you just have to keep track of the final permanent URL (before the first non-permanent redirect).
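The suggestion above can be sketched roughly as follows. This is a minimal illustration, not weserv's implementation: the function name, the `hops` structure, and the idea of recording each hop as a (status, URL) pair are assumptions made for the example.

```python
# Status codes treated as permanent redirects, as proposed above.
PERMANENT_REDIRECTS = {301, 308}


def canonical_candidate(hops: list[tuple[int, str]]) -> str:
    """Return the URL reached by following only permanent redirects.

    `hops` is the ordered list of (status_code, requested_url) pairs
    observed while fetching (a hypothetical structure for this sketch).
    """
    if not hops:
        raise ValueError("no hops recorded")
    for status, url in hops:
        # The first response that is not a permanent redirect marks the
        # last URL that was reached "permanently".
        if status not in PERMANENT_REDIRECTS:
            return url
    # Every recorded hop was a permanent redirect: fall back to the last URL seen.
    return hops[-1][1]


# The Instagram chain from the comment above (URLs elided as in the original).
chain = [
    (301, "http://instagr.am/..."),
    (301, "https://instagr.am/..."),
    (302, "https://www.instagram.com/..."),
    (200, "https://xxx.cdninstagram.com/..."),
]
print(canonical_candidate(chain))
# prints: https://www.instagram.com/...
```

With this chain the sketch picks the third URL, since the 302 response is the first non-permanent hop.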

Thanks!

kleisauke commented 2 years ago

I think the root cause of this is caching. Without that, the canonical link will always indicate the correct preferred URL. While determining the canonical link on permanent redirects might solve the issue for Instagram, it is more or less a workaround; if Instagram switches to permanent redirects for these URLs then it won't work anymore.

You're always free to use our source code to host your own solution without any caching. Images.weserv.nl is for caching and manipulating images, not for bypassing time-limited URLs or circumventing hotlink protection.

GitBoudewijn commented 2 years ago

I disagree that this is about caching. Instagram gives a temporary redirect because it's a temporary URL. If Instagram switches to permanent redirects, then that would mean they consider that final URL to be permanent. But this isn't about Instagram; it's about all canonical URLs. I was just using this as an example.

Don't you agree that a canonical URL is always a permanent one?

(That being said, in the case that the final URL has a canonical Link header itself, would it make sense to take over the value of that one?)

Not sure why you mention self-hosting since this is only about search engines.

kleisauke commented 2 years ago

According to RFC 6596 section 3 ("The Canonical Link Relation"), the target (canonical) URL may be the source of a temporary redirect:

  • Be the source IRI of a temporary redirect. For HTTP, this refers to status codes 302, 303, or 307 (Sections 10.3.3, 10.3.4, and 10.3.8, respectively, of [RFC2616]).

The same section emphasizes that the target (canonical) URL should not designate the source of a permanent redirect:

  • The source IRI of a permanent redirect (for HTTP, this refers to 300 and 301 response codes, defined in Sections 10.3.1 and 10.3.2 of [RFC2616]).

So I think you're right. I just fixed this with commit 34b08b15849600babd2217d0e957faf93c938bed, which has just been rolled out to production. Thanks for reporting this!

Soetens commented 2 years ago

But how do we disable it? It's very undesirable to announce the original location. In our case it also causes unwanted behavior with caching, and it could lead to unwanted indexing by search engines.

I can't find in the documentation where to disable it.

kleisauke commented 2 years ago

But how do we disable it? It's very undesirable to announce the original location. In our case it also causes unwanted behavior with caching, and it could lead to unwanted indexing by search engines.

There are no plans to disable the rel="canonical" HTTP header on the public API, or to make it configurable there. Most search engine crawlers will respect robots.txt, in which we defined that all proxied images should not be indexed, see: https://github.com/weserv/images/issues/165#issuecomment-438777292 https://images.weserv.nl/robots.txt

If you need to "mask" the original image location, then our public service is probably the wrong solution. You're always free to host your own solution. Note that the weserv filter directive doesn't set this header, see e.g. https://github.com/weserv/images/issues/314#issuecomment-952012562 for a way to process images from only one domain.

Soetens commented 2 years ago

If you need to "mask" the original image location, then our public service is probably the wrong solution. You're always free to host your own solution. Note that the weserv filter directive doesn't set this header, see e.g. #314 (comment) for a way to process images from only one domain.

We do not use the public service, but our own Docker setup. We want to hide the source location of images for obvious reasons, and only noticed by accident that a canonical header had been added. We removed it via the nginx proxy hide headers option, so that works fine. But this could be a minor security issue if you don't know about this change and start publishing your image sources.

For example, if you use the blur option to hide images but also announce the source location, that defeats the purpose of blurring them.
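One way to check whether a response still exposes the source location is to inspect its Link header for a rel="canonical" entry. The sketch below is a simplified parser written for illustration; the function name is hypothetical and it does not handle every edge case of the Link header grammar.

```python
import re
from typing import Optional


def canonical_target(link_header: str) -> Optional[str]:
    """Return the target of the first rel="canonical" entry in a Link header value."""
    # Split on commas that are followed by "<", so commas inside a URL
    # are less likely to break an entry apart.
    for entry in re.split(r",\s*(?=<)", link_header):
        match = re.match(r"\s*<([^>]*)>(.*)", entry, re.DOTALL)
        if not match:
            continue
        target, params = match.groups()
        # rel may be quoted or unquoted and may list several relation types.
        rel = re.search(r';\s*rel\s*=\s*"?([^";]*)"?', params, re.IGNORECASE)
        if rel and "canonical" in rel.group(1).lower().split():
            return target
    return None


print(canonical_target('<https://example.com/a.jpg>; rel="canonical"'))
# prints: https://example.com/a.jpg
```

If the function returns None for responses from your own deployment, the header has been removed or was never set.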

kleisauke commented 2 years ago

Ah, I thought you were referring to our public service. However, I don't think this is a security issue, since the original image source would also be available in the ?url= query parameter (unless there's another service in front of it that masks this parameter).
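To illustrate the point about the ?url= query parameter: the source location can be read straight from a proxied URL with the standard library. The function name below is hypothetical; the example URL is the Instagram one quoted earlier in this thread.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def source_from_proxy_url(proxy_url: str) -> Optional[str]:
    """Return the decoded value of the `url` query parameter, if present."""
    query = parse_qs(urlsplit(proxy_url).query)
    values = query.get("url")
    return values[0] if values else None


print(source_from_proxy_url(
    "https://images.weserv.nl/?url=instagr.am/p/CXMsNwIM_Zf/media/%3Fsize%3Dl"
))
# prints: instagr.am/p/CXMsNwIM_Zf/media/?size=l
```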

Anyways, I made this configurable with the weserv_canonical_header nginx directive added in commit 7f102c555ffe94a0c5b39c136540495c7b7fdc08. Hope this helps.