rade commented 10 years ago

Following on from issue #7, it would be desirable to be able to run weave in a non-privileged container. The sole reason we are doing so is because selinux (et al?) prevents binding of raw ip sockets otherwise. So the burning question is whether we could somehow avoid doing that...

Why do we need raw IP sockets?

A peer that connects to us may sit behind a firewall that does not permit "unsolicited" inbound udp packets. So in order to communicate with that peer we must send udp packets with a source ip address & port that matches the destination the peer connected to. For ordinary udp packets we accomplish that by simply sending them on the listening socket. But we also need to be able to send udp packets with 'DF' ("do not fragment") set, which requires setting of the IPPROTO_IP.IP_MTU_DISCOVER=IP_PMTUDISC_DO socket option. So we need a separate socket for that. Furthermore, we need to be able to catch the EMSGSIZE errors that may be encountered when sending on such a socket, and retrieve the pmtu from the IPPROTO_IP.IP_MTU socket option. So we really need one socket per peer, so we can associate the error & pmtu with the correct peer.
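As a rough illustration of the socket option described above, here is a minimal Go sketch that enables DF on a per-peer UDP socket and classifies EMSGSIZE errors. It assumes Linux and a Go release that has `UDPConn.SyscallConn` and `errors.Is`, both of which post-date this thread. The names `setDF`, `tooBig` and `dfDemo` and the loopback peer address are hypothetical. The sketch lets the kernel pick the source port, so it does not address the source-port requirement.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

// setDF asks the kernel to set DF on every datagram sent through conn
// (IP_MTU_DISCOVER = IP_PMTUDISC_DO) and reads the option back.
func setDF(conn *net.UDPConn) (int, error) {
	raw, err := conn.SyscallConn()
	if err != nil {
		return 0, err
	}
	var mode int
	var sockErr error
	err = raw.Control(func(fd uintptr) {
		sockErr = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_IP,
			syscall.IP_MTU_DISCOVER, syscall.IP_PMTUDISC_DO)
		if sockErr == nil {
			mode, sockErr = syscall.GetsockoptInt(int(fd), syscall.IPPROTO_IP,
				syscall.IP_MTU_DISCOVER)
		}
	})
	if err != nil {
		return 0, err
	}
	return mode, sockErr
}

// tooBig reports whether a send failed because the datagram was larger
// than the path MTU known to the kernel.
func tooBig(err error) bool {
	return errors.Is(err, syscall.EMSGSIZE)
}

// dfDemo opens one socket for a hypothetical peer on loopback and
// enables DF on it. The source port is chosen by the kernel here.
func dfDemo() (bool, error) {
	peer := &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 6783}
	conn, err := net.DialUDP("udp4", nil, peer)
	if err != nil {
		return false, err
	}
	defer conn.Close()
	mode, err := setDF(conn)
	return mode == syscall.IP_PMTUDISC_DO, err
}

func main() {
	ok, err := dfDemo()
	fmt.Println("DF enabled:", ok, "err:", err)
}
```

A caller would check `tooBig(err)` after each write on such a socket and then look up the path MTU for that peer.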

We could create those per-peer DF udp sockets with net.DialUDP. However, we need to specify the same source ip address & port as our main udp listener. DialUDP attempts to bind to the source address & port, which fails since it is already bound by the listener.

That's why we create a raw IP socket instead. Ports don't feature in IP, so there is no issue with binding.

What else could we do?

1. Use the listener socket to send all udp packets, setting the appropriate socket options per packet. That requires a global lock in order to make the set_option, send_packet, check_error, perhaps_retrieve_pmtu sequence atomic, thus effectively serialising all outbound peer communication and also introducing one or more possibly expensive syscalls into the critical path. However, note that a) inbound peer communication is already serialised, b) outbound non-df communication gets serialised (since writing on a socket takes a lock). So we should measure the performance impact before discarding this solution.
2. Somehow set SO_REUSEADDR when creating the extra udp sockets. Unfortunately there appears to be no way to do this in the go networking API, so we'd have to roll our own. It is also not clear whether this is safe. Would it interfere with the listening socket? Would the extra sockets somehow interfere with each other, especially when it comes to handling the EMSGSIZE errors?
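The first option above (sending everything on the listener socket under a global lock) might look roughly like the following Go sketch. It is only an illustration under assumptions: Linux, a Go release with `UDPConn.SyscallConn`, and hypothetical names (`lockedSender`, `lockedDemo`). It covers the set_option and send_packet steps only. Mapping an EMSGSIZE error back to a peer and retrieving the pmtu are left out.

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"syscall"
	"time"
)

// lockedSender serialises sends on one shared UDP socket so that the
// option set for one packet cannot leak into another goroutine's send.
type lockedSender struct {
	mu   sync.Mutex
	conn *net.UDPConn
}

// send writes b to addr, with DF set when df is true.
func (s *lockedSender) send(b []byte, addr *net.UDPAddr, df bool) error {
	mode := syscall.IP_PMTUDISC_DONT
	if df {
		mode = syscall.IP_PMTUDISC_DO
	}
	raw, err := s.conn.SyscallConn()
	if err != nil {
		return err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	var sockErr error
	if err := raw.Control(func(fd uintptr) {
		sockErr = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_IP,
			syscall.IP_MTU_DISCOVER, mode)
	}); err != nil {
		return err
	}
	if sockErr != nil {
		return sockErr
	}
	_, err = s.conn.WriteToUDP(b, addr)
	return err
}

// lockedDemo sends one DF and one non-DF datagram over loopback and
// returns how many datagrams the receiver saw.
func lockedDemo() (int, error) {
	recv, err := net.ListenUDP("udp4", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		return 0, err
	}
	defer recv.Close()
	conn, err := net.ListenUDP("udp4", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		return 0, err
	}
	defer conn.Close()

	s := &lockedSender{conn: conn}
	dst := recv.LocalAddr().(*net.UDPAddr)
	for _, df := range []bool{true, false} {
		if err := s.send([]byte("ping"), dst, df); err != nil {
			return 0, err
		}
	}

	recv.SetReadDeadline(time.Now().Add(2 * time.Second))
	buf := make([]byte, 16)
	n := 0
	for n < 2 {
		if _, _, err := recv.ReadFromUDP(buf); err != nil {
			return n, err
		}
		n++
	}
	return n, nil
}

func main() {
	n, err := lockedDemo()
	fmt.Println("received:", n, "err:", err)
}
```

The extra setsockopt call per packet sits inside the lock, which is the cost the option description says should be measured.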

msackman commented 10 years ago

The EMSGSIZE error is async. If you send to dst A, that may return immediately. At some point in the future, the error will arrive and be buffered. If you next use the same socket to send to dst B, then that one may error until you retrieve the error. Consequently you have to manage the mapping of errors retrieved to destination yourself. It is doable - indeed we tried that before, but it's an utter PITA.

Pretty sure REUSEADDR is on by default on listening sockets: http://golang.org/src/pkg/net/sockopt_linux.go

rade commented 10 years ago

The EMSGSIZE error is async. If you send to dst A, that may return immediately. At some point in the future, the error will arrive and be buffered. If you next use the same socket to send to dst B, then that one may error until you retrieve the error.

Agree about the async bit, but I don't think your conclusion necessarily holds. I suspect the incoming icmp 3.4 packets simply update the kernel's knowledge of the pmtu for a particular route. When you subsequently attempt to send something to that destination which is larger, then you get the error.

Pretty sure REUSEADDR is on by default on listening sockets: http://golang.org/src/pkg/net/sockopt_linux.go

If you dig deeper you'll find it's only on for tcp listening sockets and udp sockets listening on multicast addresses.

msackman commented 10 years ago

Fairly sure you're wrong on that re errors. You'd need to turn on IP_RECVERR and the pain runs from there on. Using a socket with a buffered error will error, and then you have all the CMSG fun to retrieve and parse the errors. Until the error queue is empty, you can't use the socket at all. Indeed such a buffered error will even interrupt receives on the socket.

rade commented 10 years ago

mind you, man 7 ip says

When [the socket] is connected to a specific peer with connect(2), the currently known path MTU can be retrieved conveniently using the IP_MTU socket option (e.g., after an EMSGSIZE error occurred).

which suggests the pmtu is only available as a socket option on connected sockets.

you need to turn on IP_RECVERR and the pain runs from there on

Yes, I know that's a pain. Was hoping to avoid that.

So option 1 is out then. Which leaves option 2.

msackman commented 10 years ago

PMTU is only available via the IP_MTU API on connected sockets. If you want to do it on unconnected sockets (i.e. listening sockets) you can do, via the IP_RECVERR mess.

No idea re path 2. Can't imagine that's going to be fun. Just run them privileged, or avoid the stupidity of running inside docker ;)

dpw commented 10 years ago

I think there are more options than just 1 or 2, but anyway:

Re option 1: Rather than trying to process IP_RECVERR, it might be simpler to set up a raw socket for ICMP (calling net.dialIP with the local and remote address as nil so no bind occurs). Then weaver will receive all ICMP packets, from which it can pick out the relevant 3,4 ones. There will be a bit of work to map them back to the associated UDP sockets, but it might be simpler overall than trying to use IP_RECVERR.

Re option 2: I'm otherwise occupied today, but at some point I'll have a look at the kernel code to understand the relevant behaviour of UDP sockets. If it works, then it should be a fairly simple change, right?

rade commented 10 years ago

I think there are more options than just 1 or 2

Such as? I am curious now :)

Re option 2: [...] If it works, then it should be a fairly simple change, right?

It would certainly be a simple change to make to the go networking code. But that's part of core go and we really don't want to fork that. Alas I cannot see any way to add this functionality w/o forking. That's why I said "roll our own", i.e. implement our own version of UDPConn, including only the bits we need. Far from pleasant.

dpw commented 10 years ago

I think there are more options than just 1 or 2

Such as? I am curious now :)

One other option would be to set IP_HDRINCL on the raw socket and construct the IP header yourself. Then there is no need to bind that socket.
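To give a sense of what "construct the IP header yourself" involves, here is a small Go sketch that builds a 20-byte IPv4 header with DF set and computes its checksum. It only assembles bytes and does not open a raw socket, since that needs CAP_NET_RAW. The function names and the example addresses are hypothetical. As far as I understand raw(7), Linux fills in the total length and checksum itself when IP_HDRINCL is set, so computing them here is for completeness.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// ipChecksum returns the Internet checksum (RFC 1071) of b.
// b is expected to have an even length, as an IPv4 header always does.
func ipChecksum(b []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(b); i += 2 {
		sum += uint32(binary.BigEndian.Uint16(b[i:]))
	}
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return ^uint16(sum)
}

// ipv4Header builds an option-less IPv4 header with the DF flag set.
func ipv4Header(src, dst [4]byte, proto byte, payloadLen int) []byte {
	h := make([]byte, 20)
	h[0] = 0x45                                              // version 4, header length 5 words
	binary.BigEndian.PutUint16(h[2:], uint16(20+payloadLen)) // total length
	binary.BigEndian.PutUint16(h[6:], 0x4000)                // flags: DF, fragment offset 0
	h[8] = 64                                                // TTL
	h[9] = proto                                             // e.g. 17 for UDP
	copy(h[12:16], src[:])
	copy(h[16:20], dst[:])
	// The checksum field (bytes 10-11) is still zero at this point.
	binary.BigEndian.PutUint16(h[10:], ipChecksum(h))
	return h
}

func main() {
	// Documentation addresses (RFC 5737), purely illustrative.
	h := ipv4Header([4]byte{192, 0, 2, 1}, [4]byte{192, 0, 2, 2}, 17, 100)
	fmt.Printf("% x\n", h)
}
```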

Re option 2: [...] If it works, then it should be a fairly simple change, right?

It would certainly be a simple change to make to the go networking code. But that's part of core go and we really don't want to fork that. Alas I cannot see any way to add this functionality w/o forking. That's why I said "roll our own", i.e. implement our own version of UDPConn, including only the bits we need. Far from pleasant.

SO_REUSEADDR just needs a setsockopt (if it is not already set within the net pkg as Matthew suggests). You already have a setsockopt in forwarder.go:dialIP. So I don't see why 'rolling your own' would be needed for option 2.

I have just checked the EMSGSIZE logic on UDP sends. It is done by checking the PMTU for the destination as stored in the route cache (falling back to the MTU on the outgoing device in the case of a cache miss). So extra sockets should not interact with each other in this respect.

rade commented 10 years ago

One other option would be to set IP_HDRINCL on the raw socket and construct the IP header yourself. Then there is no need to bind that socket.

That sounds reasonably straightforward. But, as we discovered above, retrieving the PMTU via getsockopt requires the socket to be bound.

SO_REUSEADDR just needs a setsockopt (if it is not already set within the net pkg as Matthew suggests). You already have a setsockopt in forwarder.go:dialIP. So I don't see why 'rolling your own' would be needed for option 2.

We need to set SO_REUSEADDR before the bind. But the socket gets both created and bound in net.DialIP; there is no way for us to intersperse the SO_REUSEADDR.
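For reference, later Go releases (1.11 onwards) added a `Control` hook to `net.ListenConfig` and `net.Dialer` that runs after the socket is created and before it is bound, which is the ordering needed here. The sketch below assumes such a Go version and Linux. The helper names and loopback addresses are hypothetical. It only shows that a second UDP socket can be bound to the listener's address and port. It does not settle which socket then receives inbound datagrams.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"syscall"
)

// reuseAddr sets SO_REUSEADDR on a freshly created socket, before bind.
func reuseAddr(network, address string, c syscall.RawConn) error {
	var sockErr error
	if err := c.Control(func(fd uintptr) {
		sockErr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET,
			syscall.SO_REUSEADDR, 1)
	}); err != nil {
		return err
	}
	return sockErr
}

// sharedPortDemo binds a listener and then a connected socket to the
// same local address and port, returning both local port numbers.
func sharedPortDemo() (int, int, error) {
	lc := net.ListenConfig{Control: reuseAddr}
	pc, err := lc.ListenPacket(context.Background(), "udp4", "127.0.0.1:0")
	if err != nil {
		return 0, 0, err
	}
	defer pc.Close()
	laddr := pc.LocalAddr().(*net.UDPAddr)

	// Both sockets need SO_REUSEADDR for the second bind to succeed.
	d := net.Dialer{LocalAddr: laddr, Control: reuseAddr}
	conn, err := d.Dial("udp4", "127.0.0.1:9") // hypothetical peer address
	if err != nil {
		return 0, 0, err
	}
	defer conn.Close()

	return laddr.Port, conn.LocalAddr().(*net.UDPAddr).Port, nil
}

func main() {
	lp, dp, err := sharedPortDemo()
	fmt.Println("listener port:", lp, "dialed socket port:", dp, "err:", err)
}
```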

dpw commented 10 years ago

We need to set SO_REUSEADDR before the bind. But the socket gets both created and bound in net.DialIP; there is no way for us to intersperse the SO_REUSEADDR.

Ok, so is the socket already set to SO_REUSEADDR, as Matthew suggested it might be?

Otherwise, and it's a hack that might have other consequences, pass the local address as nil when calling dial, and then do an explicit bind system call afterwards?

rade commented 10 years ago

Ok, so is the socket already set to SO_REUSEADDR, as Matthew suggested it might be?

It is not, as I pointed out.

Otherwise, and it's a hack that might have other consequences, pass the local address as nil when calling dial, and then do an explicit bind system call afterwards?

That is worth a try.

rade commented 10 years ago

One other option would be to set IP_HDRINCL on the raw socket and construct the IP header yourself. Then there is no need to bind that socket.

That sounds reasonably straightforward. But, as we discovered above, retrieving the PMTU via getsockopt requires the socket to be bound.

Actually, what we discovered is that the socket needs to be connected. Whether it's bound is irrelevant. So this could still work.

dpw commented 10 years ago

Otherwise, and it's a hack that might have other consequences, pass the local address as nil when calling dial, and then do an explicit bind system call afterwards?

Unfortunately this doesn't work. The Linux kernel requires that bind comes before connect, and the net package dial functions always do a connect. Conversely, the net package listen functions always do a bind (if you pass nil for the listening address, they bind to INADDR_ANY).

rade commented 10 years ago

The Linux kernel requires that bind comes before connect

But do we need to bind at all? As per https://github.com/zettio/weave/issues/9#issuecomment-53538941, a 'connect' may be all we need. And, as you noted, we are operating on a raw socket and hence can control the IP headers.

dpw commented 10 years ago

But do we need to bind at all? As per #9 (comment), a 'connect' may be all we need. And, as you noted, we are operating on a raw socket and hence can control the IP headers.

I'm not quite convinced that skipping the bind is safe. Binding fixes the source IP. Without that, if a destination IP is routable from multiple local addresses, the kernel will pick one of those local addresses as the source. Is it certain that it cannot pick the wrong one, and thus report the wrong MTU?

But yes, this is by no means the end of the story. It's just an obstacle to the relatively simple change of converting the raw socket to a UDP socket while keeping everything else more or less as-is.
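The kernel's choice of source address for an unbound socket can be observed by connecting a UDP socket and reading back its local address. A minimal Go sketch, with a hypothetical helper name and a loopback destination chosen only so that it runs anywhere:

```go
package main

import (
	"fmt"
	"net"
)

// sourceFor connects an unbound UDP socket to dst and returns the local
// IP the kernel selected. No packet is sent by connecting a UDP socket.
func sourceFor(dst string) (string, error) {
	conn, err := net.Dial("udp4", dst)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	return conn.LocalAddr().(*net.UDPAddr).IP.String(), nil
}

func main() {
	src, err := sourceFor("127.0.0.1:6783")
	fmt.Println("source:", src, "err:", err)
}
```

On a multi-homed host, running this against a real peer address shows which local address an unbound socket would use, which is the value the concern above is about.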

dpw commented 10 years ago

CentOS 7 (and thus RHEL 7) has the same selinux bind issue as Fedora 20 did.

bboreham commented 10 years ago

Does this help? http://stackoverflow.com/questions/3062205/setting-the-source-ip-for-a-udp-socket

dpw commented 10 years ago

Does this help? http://stackoverflow.com/questions/3062205/setting-the-source-ip-for-a-udp-socket

The question is whether go lets us get at those C level APIs (if it wasn't for the obstacles imposed by go's net pkg, this issue would be simple).

WriteMsgIP on an IPConn allows us to pass an oob slice (net/iprawsock_posix.go). That parameter is passed on to netFD's writeMsg (net/fd_unix.go), which passes it to SendmsgN (syscall/syscall_linux.go). And SendmsgN uses it as the msg_control field of the struct msghdr passed to the sendmsg syscall.

So maybe we can construct an Inet4Pktinfo (syscall/types_linux.go), wrap it into a Cmsghdr, make a byte slice to cover it (not sure how you do this in go, some usage of unsafe?), then pass that to writeMsg as the oob parameter.

So I can't rule it out without writing code to try it...
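One way the oob slice described above might be built without cgo is with `unsafe` and the `syscall` package's `Cmsghdr` and `Inet4Pktinfo` types. This is a sketch under assumptions (Linux, IPv4 only, hypothetical function names, loopback addresses for the demo). The thread does not establish whether this works with the Go version in use at the time.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
	"time"
	"unsafe"
)

// pktinfoOOB builds a control message carrying an IP_PKTINFO whose
// Spec_dst field asks the kernel to use src as the source address.
func pktinfoOOB(src [4]byte) []byte {
	oob := make([]byte, syscall.CmsgSpace(syscall.SizeofInet4Pktinfo))
	hdr := (*syscall.Cmsghdr)(unsafe.Pointer(&oob[0]))
	hdr.Level = syscall.IPPROTO_IP
	hdr.Type = syscall.IP_PKTINFO
	hdr.SetLen(syscall.CmsgLen(syscall.SizeofInet4Pktinfo))
	// The payload starts after the (aligned) header.
	info := (*syscall.Inet4Pktinfo)(unsafe.Pointer(&oob[syscall.CmsgLen(0)]))
	info.Spec_dst = src
	return oob
}

// sendFrom sends one datagram over loopback with the given source
// address and returns the source IP observed by the receiver.
func sendFrom(src [4]byte) (string, error) {
	recv, err := net.ListenUDP("udp4", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		return "", err
	}
	defer recv.Close()

	// Unconnected socket bound to the wildcard address.
	send, err := net.ListenUDP("udp4", nil)
	if err != nil {
		return "", err
	}
	defer send.Close()

	dst := recv.LocalAddr().(*net.UDPAddr)
	if _, _, err := send.WriteMsgUDP([]byte("ping"), pktinfoOOB(src), dst); err != nil {
		return "", err
	}

	recv.SetReadDeadline(time.Now().Add(2 * time.Second))
	buf := make([]byte, 16)
	_, from, err := recv.ReadFromUDP(buf)
	if err != nil {
		return "", err
	}
	return from.IP.String(), nil
}

func main() {
	from, err := sendFrom([4]byte{127, 0, 0, 1})
	fmt.Println("receiver saw source:", from, "err:", err)
}
```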

bboreham commented 10 years ago

Actually I found that reference in some Go code - https://github.com/miekg/dns/blob/master/udp.go

dpw commented 10 years ago

Interesting. But that code is receiving the magical oob value from ReadMsgUDP and passing it back to WriteMsgUDP. We would have to construct it ourselves.

(I met the author of that code a couple of weeks ago. It is indeed a small world.)

dpw commented 10 years ago

I'm trying the oob approach. It's going to require the use of cgo.

dpw commented 10 years ago

I'm trying the oob approach. It's going to require the use of cgo.

I've tried the IP_PKTINFO approach, and once again go's net pkg seems designed to frustrate.

When sending on the DF socket fails with EMSGSIZE, we use IP_MTU to ask the kernel what it thinks the MTU is. "man 7 ip" says under IP_MTU "Valid only when the socket has been connected", and this corresponds to what the kernel code does: If the socket is not connected, it doesn't record the routing information on sends necessary to yield the MTU. So the socket has to be connected.
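The connected-versus-unconnected behaviour of IP_MTU can be checked with a short Go sketch. It assumes Linux and a Go release with `UDPConn.SyscallConn`. The helper names and the loopback peer address are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

// pathMTU reads the IP_MTU socket option from conn.
func pathMTU(conn *net.UDPConn) (int, error) {
	raw, err := conn.SyscallConn()
	if err != nil {
		return 0, err
	}
	var mtu int
	var sockErr error
	err = raw.Control(func(fd uintptr) {
		mtu, sockErr = syscall.GetsockoptInt(int(fd), syscall.IPPROTO_IP,
			syscall.IP_MTU)
	})
	if err != nil {
		return 0, err
	}
	return mtu, sockErr
}

// mtuDemo returns the MTU reported for a connected socket, and whether
// the same query on an unconnected socket failed with ENOTCONN.
func mtuDemo() (int, bool, error) {
	peer := &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 6783}
	connected, err := net.DialUDP("udp4", nil, peer)
	if err != nil {
		return 0, false, err
	}
	defer connected.Close()

	unconnected, err := net.ListenUDP("udp4", nil)
	if err != nil {
		return 0, false, err
	}
	defer unconnected.Close()

	mtu, err := pathMTU(connected)
	if err != nil {
		return 0, false, err
	}
	_, uerr := pathMTU(unconnected)
	return mtu, errors.Is(uerr, syscall.ENOTCONN), nil
}

func main() {
	mtu, notConn, err := mtuDemo()
	fmt.Println("connected MTU:", mtu, "unconnected gives ENOTCONN:", notConn, "err:", err)
}
```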

That's ok, we can use net.DialUDP, passing a nil local address (because the whole point is that we don't bind) and the remote UDP address (so that we connect). This would be a bit tricky to get right, as the remote address port number can change, but I ignored that for now and passed the initial conn.RemoteUDPAddr().

And here's the catch: WriteMsgUDP, the function that allows us to pass the oob parameter with the IP_PKTINFO setting the source address, checks whether the socket is connected, and if so fails! This is the only function that would allow us to invoke the sendmsg system call on a UDP socket.

So, this approach requires the socket to be connected and not connected. Reductio ad absurdum.

dpw commented 10 years ago

So, this approach requires the socket to be connected and not connected.

The flaw in that argument is that we don't have to use a single socket for both sending out the DF UDP packets and for querying the MTU. So we can use an unconnected socket for the former, and a connected socket for the latter (e.g. the TCP socket between the pair of weavers). So I now have an implementation of IP_PKTINFO that works, though it still needs some polish.

But, in the process of implementing that, I have found a much simpler way!

SO_REUSEADDR is only needed if the two sockets share a port number as well as an address. But the DF socket is only used to send datagrams, so there is no fundamental reason for it to be bound to port 6783, so no need for SO_REUSEADDR. We can let the kernel pick an ephemeral port number for that socket.

Not sure why I didn't think of that earlier.

dpw commented 10 years ago

SO_REUSEADDR is only needed if the two sockets share a port number as well as an address. But the DF socket is only used to send datagrams, so there is no fundamental reason for it to be bound to port 6783, so no need for SO_REUSEADDR. We can let the kernel pick an ephemeral port number for that socket.

There's a proof of concept of this at https://github.com/dpw/weave/tree/no-raw-socket . It needs some tidying up, but seems to work.

rade commented 10 years ago

there is no fundamental reason for it to be bound to port 6783

There is. It's to do with firewall traversal. When a peer "connects" over UDP to us, it will do so on 6783 and in the process their firewall will allow packets to flow the other way.

dpw commented 9 years ago

Perhaps we can WONTFIX this. I'd guess that the weavetools container requires --privileged anyway, so I'm not sure that the weave container's need for --privileged is worth worrying about.

rade commented 9 years ago

IMO there's a difference between running a privileged container briefly, as happens with weavetools, vs continuously, as happens with weave.

weaveworks / weave

run non-privileged container #9
