weserv / images

Source code of wsrv.nl (formerly images.weserv.nl), to be used on your own server(s).
https://wsrv.nl/
BSD 3-Clause "New" or "Revised" License

The hostname of the origin is unresolvable (DNS) or blocked by policy. #365

Closed KIKOmanasijev closed 9 months ago

KIKOmanasijev commented 1 year ago

Hey,

I know that this issue has been reported multiple times before, but we have tried every recommended solution possible and we are still not able to resolve it.

The problem appears constantly: random images won't be shown, and after a hard refresh they appear, but other ones break. If I open the transformed URL (e.g. www.domain.com/?q=100&output=webp&url=www.random.com/image.png) it shows the error The hostname of the origin is unresolvable (DNS) or blocked by policy., but if I hard refresh or just change a single parameter, the image comes back.

What we have tried so far:

The site where we implemented this is: www.brandsgateway.com

kleisauke commented 1 year ago

Sounds like the IPv6 interface is not properly configured, see #206 for more information.
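
A minimal sketch of how one might verify that IPv6 is actually working on the host (or inside the container); the target hostname here is only an example:

# Does the host have a global IPv6 address at all?
$ ip -6 addr show scope global
# Can an IPv6-enabled host be reached over ICMPv6?
$ ping -6 -c 3 google.com
# Does an HTTPS request forced over IPv6 succeed?
$ curl -6 -sI https://google.com | head -n 1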

vnukhr commented 1 year ago

I have to +1 this issue, having the same problem - images randomly get 404-ed with The hostname of the origin is unresolvable (DNS) or blocked by policy.

My resolver config has had ipv6=off for a while now and it's not making a difference. Currently I'm mitigating this problem with a daily reboot via crontab (it used to be weekly, but that turned out not to be enough).

Running a build from source on CentOS, without Docker.
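
For reference, a crontab-based mitigation like the one described above could look roughly like this; the schedule and the choice between a full reboot and an nginx-only restart are illustrative:

# Edit the root crontab
$ crontab -e

# Reboot the whole server every day at 04:00 ...
0 4 * * * /sbin/shutdown -r now
# ... or, less drastically, only restart nginx
# 0 4 * * * /usr/bin/systemctl restart nginx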

andrieslouw commented 1 year ago

The error The hostname of the origin is unresolvable (DNS) or blocked by policy. is always because of DNS.

DNS Haiku

What kind of reboot are you doing? Whole server? And when the issues are happening, what does nslookup show? Do you use a local(host) resolver, or an external one? How reliable is the connection to the resolver/of the server? And is it with just one domain (?url=domain1.com) or multiple domains?

kleisauke commented 1 year ago

but we have tried every recommended solution possible and we are still not able to resolve it.

I don't think this is true, given that you only changed the DNS server to Cloudflare.

If this is still an issue, are you able to provide the output of the following commands within the container?

# Check the DNS settings (ought to be inherited from the host when Docker's default bridge network is used)
$ cat /etc/resolv.conf | grep nameserver
# Install bind-utils for nslookup
$ dnf install -y bind-utils
# Try to resolve using the default DNS server(s) (specified in /etc/resolv.conf)
$ nslookup google.com
# ... and with Google's open DNS server
$ nslookup google.com 8.8.8.8
# ... and with Docker's embedded DNS server (when using a user-defined bridge)
$ nslookup google.com 127.0.0.11

I'm relabeling this as question.

vnukhr commented 1 year ago

@andrieslouw yeah, I reboot the whole server; the issue is so random and so low on my priority list that I haven't had time to investigate whether restarting only nginx would suffice. Once it starts happening, it's random: while it's happening, some images get the error, some get fetched and resized fine, and nslookup/dig work as expected with no errors. My images are fetched exclusively from two hosts on the same domain. Regarding the connection: the resizers (two of them) and the host that serves ~95% of the images are on Hetzner; the remaining ~5% of images come from a more remote host.

I've used Google DNS and Cloudflare DNS resolvers (no difference; I also haven't tried Hetzner's local resolvers). I haven't tried setting up a local resolver yet, which is what I should probably do.
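
If going the local-resolver route, one option (among several) is a small caching resolver such as unbound listening on localhost, which nginx can then be pointed at. A sketch, with the file path and the IPv6 choice as assumptions:

# /etc/unbound/unbound.conf.d/local-cache.conf (path may differ per distro)
server:
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow
    # do not use IPv6 transport for outgoing queries (matches the existing ipv6=off setup)
    do-ip6: no

nginx would then use resolver 127.0.0.1; so every lookup hits the local cache first.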

My biggest pain in debugging this is that nothing gets written to the weserv logs when it happens, so I have to wait for users to start complaining about broken images. I have error_log enabled for the host that's running the weserv proxy and for the public-facing host. Any ideas on how to monitor when this happens? I can rebuild from source easily; is there a compile-time flag that could get me some debug logs?

kleisauke commented 1 year ago

My biggest pain in debugging this is that nothing gets written to the weserv logs when it happens

weserv module errors are usually written to /var/log/nginx/weserv-local-error.log if you use our recommended nginx configuration. https://github.com/weserv/images/blob/faaaed8bc3ef99e5735b7990cd7d0c3c39ed5d67/ngx_conf/imagesweserv.conf#L43
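
With that path, a simple way to catch these failures as they happen is to follow the log and filter for the error. The exact string to grep for is an assumption; adjust it to whatever the module actually writes:

# Follow the weserv error log and flag DNS failures as they occur
$ tail -F /var/log/nginx/weserv-local-error.log | grep -i --line-buffered "unresolvable"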

I have error_log enabled for the host that's running the weserv proxy and for the public-facing host. Any ideas on how to monitor when this happens? I can rebuild from source easily; is there a compile-time flag that could get me some debug logs?

See #289. Please open a new issue if there's still a problem (since hijacking issues raised by others is generally not considered a good thing on GitHub).

kleisauke commented 1 year ago

@KIKOmanasijev Were you able to make any progress with this?

skynet2 commented 1 year ago

Hi all, basically I had the same issue during a migration from the Cloudflare image resizer to the weserv solution. Once I put real traffic on a k8s cluster with 8 pods (2.5Gi RAM, 1 CPU each), I constantly saw the same behavior: the error log showed "Cannot assign requested address to upstream" together with an IPv6 address of Cloudflare's servers.

I was able to solve it by switching from Google DNS to Cloudflare DNS and, maybe most importantly, disabling IPv6. (Also, I used the Alpine image instead of the Enterprise Linux one.)

Here is the patch that I applied: https://github.com/skynet2/images/commit/ab11937ba5eabba919fb02c099f5c834c03b835a. For now, it works perfectly.
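
If the same change were made at the nginx level rather than in the image, the gist would be a resolver directive that pins the DNS servers and skips AAAA lookups; the exact servers and timings below are illustrative:

resolver 1.1.1.1 1.0.0.1 valid=300s ipv6=off;
resolver_timeout 5s;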

andrieslouw commented 1 year ago

One thing we do in production is increase the number of outgoing ports, by using something like net.ipv4.ip_local_port_range = 15000 65000 in /etc/sysctl.conf. While this setting is called net.ipv4, the IPv6 stack in Linux also uses it.
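
Applied on a running system, that looks roughly like this (the range mirrors the example above):

# Apply immediately
$ sysctl -w net.ipv4.ip_local_port_range="15000 65000"
# Persist across reboots
$ echo "net.ipv4.ip_local_port_range = 15000 65000" >> /etc/sysctl.conf
$ sysctl -p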

If issues persist with Cannot assign requested address to upstream, you may want to check why connections are not properly closed or re-used; maybe they are unstable, or nginx is unable to set those connections up properly in the first place, falling back to IPv4 until it runs out of outgoing ports to try on the IPv6 stack.

Please always confirm if your IPv6 is working properly, make some effort to get it working, and don't switch it off without any good reason, as large parts of the internet are having real issues with IPv4 these days.

For internal usage: Consider switching to unix domain sockets to get rid of TCP/IP overhead. Or make sure that you use internal IP-addresses where possible, to preserve ports on your outgoing addresses.
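
A rough sketch of the unix-domain-socket idea in nginx, with a made-up socket path; the weserv vhost listens on the socket and the public-facing vhost proxies to it instead of over TCP:

# weserv vhost: listen on a unix socket instead of a TCP port
upstream weserv_backend {
    server unix:/var/run/weserv.sock;
}

server {
    listen unix:/var/run/weserv.sock;
    # ... weserv module configuration as before ...
}

# public-facing vhost: proxy to the socket instead of to 127.0.0.1:<port>
server {
    listen 80;
    location / {
        proxy_pass http://weserv_backend;
    }
}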

skynet2 commented 1 year ago

hi @andrieslouw ,

I had this sysctl on my hosts:

net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1
net.netfilter.nf_conntrack_max = 786432
net.ipv4.ip_forward = 1
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_tw_reuse = 1

When I had that issue, there were fewer than ~3k concurrent connections per host (checked with ss). I agree that disabling IPv6 is not a good approach, but I wanted to give some feedback that it worked :)

I will try to dig into the issue with ipv6. Thanks.

kleisauke commented 9 months ago

Closing due to inactivity. Please feel free to re-open if there's still a problem.