camallen closed 4 years ago
Have you got the URL for that subject on the old and new sites? I'm interested to see if the browser encodes the URL by default. If not, we can explicitly encode it when we parse subject locations, assuming that doesn't break subject URLs for any other projects.
So this subject is one of them https://talk.galaxyzoo.org/subjects/AGZ000atp8/
From this collection https://talk.galaxyzoo.org/collections/CGZS0003tq/ the URL is encoded correctly
<img loading="lazy" alt="Subject AGZ000atp8" src="https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg">
However, if we use the rewritten non-s3 URL www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg we get a redirect via the nginx proxy to an s3 URL whose path has been decoded in the rewritten location: https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08+005811.6_thumbnail.jpeg
That's not great - it seems the issue here is the nginx static proxy and the rewrite rule. We may have to proxy pass these URLs (serve them directly) via NGINX instead of redirecting them to avoid this issue.
This is getting more interesting... After testing a local version of the static nginx proxy, our static proxy seems to be preserving the encoded URLs correctly. Note the Location response header below.
$ curl -v -H "Host: www.galaxyzoo.org" localhost:8080/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 8080 (#0)
> GET /subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg HTTP/1.1
> Host: www.galaxyzoo.org
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Server: nginx/1.4.6 (Ubuntu)
< Date: Thu, 20 Aug 2020 21:46:19 GMT
< Content-Type: text/html
< Content-Length: 193
< Connection: keep-alive
< Location: https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg
< X-debug-message: /subjects/decals/thumbnail/J211326.08+005811.6_thumbnail.jpeg
< X-debug-message: /subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg
< X-debug-message: https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg
<
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.4.6 (Ubuntu)</center>
</body>
</html>
* Connection #0 to host localhost left intact
$ curl -v www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg
* Trying 52.186.94.16...
* TCP_NODELAY set
* Connected to www.galaxyzoo.org (52.186.94.16) port 80 (#0)
> GET /subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg HTTP/1.1
> Host: www.galaxyzoo.org
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 308 Permanent Redirect
< Server: nginx/1.17.10
< Date: Thu, 20 Aug 2020 20:49:44 GMT
< Content-Type: text/html
< Content-Length: 172
< Connection: keep-alive
< Location: https://www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg
<
<html>
<head><title>308 Permanent Redirect</title></head>
<body>
<center><h1>308 Permanent Redirect</h1></center>
<hr><center>nginx/1.17.10</center>
</body>
</html>
* Connection #0 to host www.galaxyzoo.org left intact
$ curl -v https://www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg
* Trying 52.186.94.16...
* TCP_NODELAY set
* Connected to www.galaxyzoo.org (52.186.94.16) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
...TLS stuff removed
* SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fe5ee806600)
> GET /subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg HTTP/2
> Host: www.galaxyzoo.org
> User-Agent: curl/7.54.0
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 301
< server: nginx/1.17.10
< date: Thu, 20 Aug 2020 20:57:36 GMT
< content-type: text/html
< content-length: 193
< location: https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08+005811.6_thumbnail.jpeg
< strict-transport-security: max-age=15724800; includeSubDomains
<
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.4.6 (Ubuntu)</center>
</body>
</html>
* Connection #0 to host www.galaxyzoo.org left intact
Note that the response Location header above is where we lose the encoding; the request from the client is still encoded.
The nginx logs in k8s show the decoded URL, so it appears that the nginx ingress is rewriting the URL before it hits the static proxy pod:
10.244.14.97 www.galaxyzoo.org - [20/Aug/2020:20:57:36 +0000] "GET /subjects/decals/thumbnail/J211326.08+005811.6_thumbnail.jpeg HTTP/1.1" 301 193 "-" "curl/7.54.0" -
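The comparison made above between the request path and the Location header can be sketched as a small check. This is a minimal illustration only: the function name `redirect_preserves_encoding` is hypothetical, and it assumes the redirect target simply ends with the original request path, as it does in the curl output in this thread.

```python
from urllib.parse import urlsplit


def redirect_preserves_encoding(request_path: str, location: str) -> bool:
    """Return True if the Location header's path ends with the raw request path."""
    # urlsplit does not decode percent-escapes, so this compares the raw
    # strings: '%2B' and '+' are treated as different.
    return urlsplit(location).path.endswith(request_path)


request_path = "/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg"

# Location header returned by the local static proxy
local_location = (
    "https://s3.amazonaws.com/www.galaxyzoo.org"
    "/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg"
)
# Location header returned via the k8s nginx ingress
ingress_location = (
    "https://s3.amazonaws.com/www.galaxyzoo.org"
    "/subjects/decals/thumbnail/J211326.08+005811.6_thumbnail.jpeg"
)

print(redirect_preserves_encoding(request_path, local_location))
print(redirect_preserves_encoding(request_path, ingress_location))
```

With the two Location values copied from the curl runs, the local proxy passes the check and the ingress path does not.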
This looks relevant: https://github.com/kubernetes/ingress-nginx/issues/1615#issuecomment-343968872. Our nginx ingress config looks like this:
## start server *.galaxyzoo.org
server {
server_name *.galaxyzoo.org ;
listen 80 ;
listen 443 ssl http2 ;
set $proxy_upstream_name "-";
ssl_certificate_by_lua_block {
certificate.call()
}
--
proxy_next_upstream_tries 3;
rewrite "(?i)/" /$1 break;
proxy_pass http://upstream_balancer;
proxy_redirect off;
}
}
## end server *.galaxyzoo.org
Looking at it on my phone, the image is broken in this discussion about that subject. I rebuilt the discussion pages this morning. https://talk.galaxyzoo.org/boards/BGZ0000004/discussions/DGZ0001krf/ So that would be the redirect breaking the location? The subject and collections pages are built from master, but the discussion page is built from #81.
resolved by https://github.com/zooniverse/static/pull/176
related to #81 and #64
A GZ subject thumbnail URL like www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08+005811.6_thumbnail.jpeg will redirect to an s3 URL via the nginx static rewrite at https://github.com/zooniverse/static/blob/fe42d006be275b5e59e6e584e67fbeff500f426a/sites/www.galaxyzoo.org.conf#L10
E.g. the above subject URL redirects to an s3 URL with the literal '+', which doesn't work: https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08+005811.6_thumbnail.jpeg whereas the encoded URL works: https://s3.amazonaws.com/www.galaxyzoo.org/subjects/decals/thumbnail/J211326.08%2B005811.6_thumbnail.jpeg
I believe this will be the same in azure land (needs testing) https://docs.microsoft.com/en-us/rest/api/storageservices/naming-and-referencing-containers--blobs--and-metadata#blob-names
I haven't found a decent way to encode the URL in nginx (which strikes me as very strange), and I need to test how these '+' symbols in URLs work in Azure as well.
We may need to encode these URLs explicitly before publishing them to ensure they work as we expect. TBD
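Explicitly encoding the URL before publishing could look something like the sketch below. It is not taken from any of the linked repos: the helper name `encode_subject_url` is hypothetical, and it assumes that subject file names never contain a literal '%' that is meant to be kept as-is, since the path is decoded once before being re-encoded to avoid double-encoding.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def encode_subject_url(url: str) -> str:
    """Percent-encode the path of a subject URL so a literal '+' becomes '%2B'."""
    parts = urlsplit(url)
    # Decode first so an already-encoded path is not double-encoded, then
    # re-encode everything except the '/' path separator. unquote() leaves
    # '+' alone (only unquote_plus() turns it into a space).
    path = quote(unquote(parts.path), safe="/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


literal = (
    "https://s3.amazonaws.com/www.galaxyzoo.org"
    "/subjects/decals/thumbnail/J211326.08+005811.6_thumbnail.jpeg"
)
encoded = encode_subject_url(literal)
print(encoded)
# Running it again on the encoded URL leaves it unchanged
print(encode_subject_url(encoded) == encoded)
```

Whether this is safe for subject URLs from other projects would still need checking, as raised earlier in the thread.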