[RFC] net: Handling dynamic updates of interface addresses

zephyrproject-rtos / zephyr

[RFC] net: Handling dynamic updates of interface addresses #7630

Closed pfalcon closed 1 year ago

pfalcon commented 6 years ago

This RFC ticket is prompted by https://github.com/zephyrproject-rtos/zephyr/issues/7500, and discussion in it started at https://github.com/zephyrproject-rtos/zephyr/issues/7500#issuecomment-389157028

IP addresses of network interfaces may change dynamically. For example, a static IP may be assigned initially (on app startup), but as soon as the DHCP handshake completes, a DHCP address may be assigned, which can later expire and be replaced by a different DHCP address. Also, "change" is not the only operation that can happen: in the general case there are several IP addresses per interface (e.g. IPv6 mandates that, and in some cases multiple addresses may be supported even for IPv4, as an extension). So, besides a change of an existing address, a new address may be assigned to an interface, and likewise it may be removed later.

There are 2 polar ways to deal with the address changes described above:

1. Try to do as much "magic" as possible to accommodate them in the IP stack, e.g. try to rewrite endpoint addresses of the connections to keep them valid across address changes.
2. Don't do anything peculiar on the IP stack side; instead, have a clear API to notify a user application of any changes of the networking environment (including but not limited to interface address changes), and leave handling of them to the user app. The baseline handling would be to "reinitialize networking", i.e. close all the existing connections and recreate them, which ensures that up-to-date addresses, etc. are used.

Between these 2 polar ways, some middle-ground can be found too, e.g. for a particular clearly defined usecase, where it's beneficial to have some "automatic" handling in the stack, that can be done.

pfalcon commented 6 years ago

Don't do anything peculiar on the IP stack side, leave handling to a user app.

One benefit of this solution is the following. Interface address changes are only one kind of networking environment change which may happen. Interfaces may go up and down, their parameters may change (like MTU, routing, etc.). At least some of these would need to be handled by a user app, because only it is "smart enough" to make decisions suitable for a particular usecase.

As an example, if the network goes down, then the most obvious reaction is to wait until it's up, or to take active steps to bring it up. But this may happen due to instability of the carrier, and the network may be just flaky, and then blind infinite retries may e.g. just drain the battery. So, application-specific logic (which knows that there's a battery) can do e.g. exponential back-offs with some "dead time" (e.g. if it couldn't get a stable connection within the last 30 mins, don't try again for the next 1 hr).
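
To make the back-off idea concrete, here is a minimal sketch of such an application-level policy. The function name and all the constants are hypothetical (only the 30 min / 1 hr figures come from the example above); a real product would pick its own values.

```c
#include <stdint.h>

/* Hypothetical policy parameters, chosen only for illustration. */
#define BACKOFF_BASE_S   1u     /* delay before the first retry */
#define BACKOFF_MAX_S    300u   /* cap for a single back-off delay */
#define GIVE_UP_AFTER_S  1800u  /* 30 min without a stable connection */
#define DEAD_TIME_S      3600u  /* then stay idle for 1 hr */

/*
 * Return how many seconds to wait before the next reconnect attempt.
 * attempt:   number of consecutive failed attempts so far (0 = first retry)
 * elapsed_s: seconds since the connection was last known to be stable
 */
static uint32_t next_retry_delay_s(uint32_t attempt, uint32_t elapsed_s)
{
    if (elapsed_s >= GIVE_UP_AFTER_S) {
        /* Network looks flaky: stop hammering it and save the battery. */
        return DEAD_TIME_S;
    }

    /* Double the delay on every attempt; clamp the shift to avoid overflow. */
    uint32_t shift = attempt < 16u ? attempt : 16u;
    uint32_t delay = BACKOFF_BASE_S << shift;

    return delay < BACKOFF_MAX_S ? delay : BACKOFF_MAX_S;
}
```

The app would call this after each failed attempt and sleep for the returned time. The point is only that this policy lives in the application, not in the IP stack.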

Then, given that some of network status changes will need to be handled by an app anyway, it may make sense to leave all of them to an app, and keep complexity of the IP stack down. Besides complexity, doing too much magic in the IP stack may also raise security concerns (e.g. rewriting IP addresses behind application's back may leave out some security checks on app's level).

pfalcon commented 6 years ago

To actually proceed with this issue, we'd need to enumerate different usecases and user stories, and spec out what the behavior should be.

As an example, in https://github.com/zephyrproject-rtos/zephyr/pull/5750, the following user story was presented:

For testing and development purposes, it should be possible to have a single configuration allowing a Zephyr networking device to communicate with either a development host or to (via) a networking router. For that, it should be possible to configure a Zephyr application with both static IP address and DHCP enabled. If DHCP is not available (as in case of a workstation, which is itself a DHCP client), the static IP should be used. Otherwise, if DHCP is available (e.g. when connected to a router, which usually includes DHCP server), a DHCP address should replace the static one.

In https://github.com/pfalcon/zephyr/issues/6, it's tested that this user story works for a "server" style application binding its server sockets to INADDR_ANY.
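
For reference, a "server" style socket of this kind can be sketched with plain POSIX calls as below. The helper name listen_on_any() is hypothetical and this is not the code used in the linked test. It only illustrates that a socket bound to INADDR_ANY is not tied to one particular interface address, which is presumably why the static-to-DHCP switch does not affect it.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a TCP listening socket bound to INADDR_ANY on the given port
 * (host byte order; 0 lets the stack pick a free port).
 * Returns the socket descriptor, or -1 on error.
 */
static int listen_on_any(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0) {
        return -1;
    }

    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    /* Wildcard address: accept connections on whatever addresses exist. */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }

    return fd;
}
```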

pfalcon commented 6 years ago

Proceeding to test samples/net/sockets/big_http_download, a typical client application, with CONFIG_NET_DHCPV4=y: it "almost" works. And it works by "cheating". As the first step, it does a DNS resolution, and if that fails, it won't proceed further. And if it doesn't fail, that means it received a response from the DHCP-assigned DNS server, which means that the DHCP IP address is assigned too, and thus won't change afterwards, while the long-running HTTP transaction executes.

Now the only problem is that DHCP request takes time to complete, and initial attempt at DNS resolution (using statically configured address) almost certainly times out. Then, we just need to add retries to DNS resolution. This is implemented in https://github.com/zephyrproject-rtos/zephyr/pull/7635.

Essentially, this is a dumb/naive way to handle an (assumed) network state failure at the application level, without any specialized APIs beyond POSIX: if there's an error from getaddrinfo(), we just sleep for a few secs and retry, up to a limited number of retries. Of course, any production IoT device would need to be smarter than that (the specific smartness is up to its vendor, that's where differentiation happens), but the basic idea is the same.
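
A minimal sketch of that retry loop, using only POSIX calls, could look as follows. This is not the code from the PR above; the helper name resolve_with_retry() and its max_tries/delay_s parameters are made up for illustration.

```c
#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Resolve host:port, sleeping and retrying when getaddrinfo() fails.
 * max_tries and delay_s are hypothetical tuning knobs.
 * Returns 0 and stores the result list in *res on success (release it
 * with freeaddrinfo()); otherwise returns the last getaddrinfo() error.
 */
static int resolve_with_retry(const char *host, const char *port,
                              int max_tries, unsigned int delay_s,
                              struct addrinfo **res)
{
    struct addrinfo hints;
    int err = EAI_FAIL; /* reported if max_tries <= 0 */

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;

    for (int i = 0; i < max_tries; i++) {
        err = getaddrinfo(host, port, &hints, res);
        if (err == 0) {
            return 0;
        }
        /* Assume the network is not ready yet; wait before the next try. */
        if (i + 1 < max_tries) {
            sleep(delay_s);
        }
    }

    return err;
}
```

A client would call e.g. resolve_with_retry("example.com", "80", 5, 2, &res) before connecting, and give up (or apply its own smarter policy) if it returns non-zero.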

pfalcon commented 6 years ago

Another point which was discussed in https://github.com/zephyrproject-rtos/zephyr/issues/7500#issuecomment-389157028 and followups is to assess how other IP stacks deal with the situation of interface IP address change while the connection is in progress.

Let's consider the "client" case (making a connection from a local to a remote system). A typical setup is used: a SOHO router and a workstation connected to it, IP addresses assigned via DHCP. The idea is to override the address with another one from the same subnet using ifconfig wlan0 <ip>. An attempt to do that shows that the default gateway setting gets reset, so we need to reinstate it with:

ifconfig wlan0 <ip>; route add default gw <gw>

For example, assuming that the current IP is 192.168.16.106 and the gateway is 192.168.16.1, we can switch our host to new address with (adding +10 to not stomp on other devices' addresses):

ifconfig wlan0 192.168.16.116; route add default gw 192.168.16.1

Now, what we do is:

$ telnet zephyrproject.org 80
Trying 23.185.0.1...
Connected to zephyrproject.org.
Escape character is '^]'.
GET / HTTP/1.0

Note that we typed "GET / HTTP/1.0" and pressed Enter once. Now we switch the interface IP behind our connection:

ifconfig wlan0 192.168.16.116; route add default gw 192.168.16.1

At this point we go back to the telnet window and press Enter again (empty line, end of HTTP headers). Now the remote peer should send us the HTTP response. But nothing happens. So, changing the IP address dynamically while a connection on that interface is in progress "doesn't work".

Let's double-check that. In the Zephyr tree, we have the samples/net/sockets/big_http_download sample, which can also be built for a POSIX system. It's also a "long-running" sample, downloading several MBs, so the idea is to start it, then immediately switch to another window and change the IP, then see how it behaves. As expected, the download hangs in the middle. Even more interesting is to switch the IP address back: the transfer resumes. So everything works pretty well in accordance with the TCP spec: TCP is a reliable protocol, so you can disturb its channel (changing the IP is such a disturbance), it will keep retrying, and if the channel is re-established, the TCP connection will continue too. (Note that classical TCP is timeout-less, so you can disturb it for 10 years and it can recover afterwards.)

therealprof commented 6 years ago

I think there needs to be a middle ground. The automatisms you call "magic" are in big parts not magic at all, e.g. a client software using sockets for outgoing connections on any OS does not need to know anything about available network connections or IP addresses and even less about changes: You open the socket, you connect it and if everything works out you send data and will automatically receive it, too.

OTOH (as you've mentioned) there're use cases where it makes sense to actually do something if certain changes on the system happen; if the link changes you want to fetch new ip addresses, if the link goes down you might want to switch into power save mode, if the IP addresses change you may want to notify peers about that change (e.g. if the device is under management, you'd want to notify the management server; if the device announces itself locally using MDNS, you'd want to update the MDNS record, and so on).

The first part is just regular POSIX socket behaviour and the second part doesn't exist in any standard form but there're two common ways to implement it:

Using a communication bus one can attach to and filter out relevant messages, something like Linux netlink
Using an event and callback registration mechanism
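
As a rough illustration of the second approach, an event and callback registration mechanism can be sketched in plain C as below. All names here (EV_*, net_event_register(), net_event_notify(), count_event()) are hypothetical and are not any real stack's API; the sketch only shows the shape: the app registers a handler with an event mask, and the stack side calls every handler whose mask matches.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event bits, for illustration only. */
#define EV_ADDR_ADDED   (1u << 0)
#define EV_ADDR_REMOVED (1u << 1)
#define EV_IF_DOWN      (1u << 2)

typedef void (*net_event_handler)(uint32_t event, void *user_data);

struct net_event_cb {
    uint32_t mask;             /* events this handler is interested in */
    net_event_handler handler;
    void *user_data;
};

#define MAX_CALLBACKS 4
static struct net_event_cb callbacks[MAX_CALLBACKS];
static size_t callback_count;

/* App side: register a handler for the events in mask. 0 on success, -1 if full. */
static int net_event_register(uint32_t mask, net_event_handler handler,
                              void *user_data)
{
    if (handler == NULL || callback_count == MAX_CALLBACKS) {
        return -1;
    }
    callbacks[callback_count++] =
        (struct net_event_cb){ mask, handler, user_data };
    return 0;
}

/* Stack side: deliver one event; returns the number of handlers invoked. */
static int net_event_notify(uint32_t event)
{
    int delivered = 0;

    for (size_t i = 0; i < callback_count; i++) {
        if (callbacks[i].mask & event) {
            callbacks[i].handler(event, callbacks[i].user_data);
            delivered++;
        }
    }
    return delivered;
}

/* Example handler: counts events in the int that user_data points to. */
static void count_event(uint32_t event, void *user_data)
{
    (void)event;
    (*(int *)user_data)++;
}
```

An app interested in address changes would register with EV_ADDR_ADDED | EV_ADDR_REMOVED and e.g. update its mDNS record or notify a management server from the handler.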

pfalcon commented 6 years ago

@therealprof : Thanks for the comments.

The automatisms you call "magic" are in big parts not magic at all

What I call "magic" is: things that are not required by POSIX, or, for cases not specified by it (which are many), not done by well-known IP stacks, which can be seen as reference models. The Linux IP stack is a good example, adjusted for its size, i.e. we won't be able to implement all the features it may have. In the comment https://github.com/zephyrproject-rtos/zephyr/issues/7630#issuecomment-389975170, there are results of an experiment which shows that even Linux doesn't update connections based on the IP address change of the underlying interface.

Using an event and callback registration mechanism

Yes, Zephyr already offers such an API: http://docs.zephyrproject.org/subsystems/networking/network-management-api.html .

therealprof commented 6 years ago

In the comment #7630 (comment), there're results of an experiment which shows that even Linux doesn't update connections based on the IP address change of the underlying interface.

That "test" and the assumptions behind it are flawed: At the point where you're changing the connectivity, the TCP handshake has been long completed and the connection is established; after that point you're basically pushing data into a dormant connection. There's even a good chance that, if you change your connectivity back to the previous value, it'll be able to resume the connection and everything will be just fine. That's one of the biggest problems of the TCP protocol: these things happen legitimately and cause trouble due to the large timeouts involved until a connection is detected as dead and signalled to the application so it can re-connect.

Anyhow, I don't see how that relates to #7500 at all since there are no connections in UDP and each datagram is sent individually with current information and of course the way back needs to work in the exact same way. In fact, with a UDP "server" you often run into the exact opposite problem: You don't know which address the packet was sent to, which is relevant because your reply needs to come from exactly that address and port to be routed back, and there's no connection or connection info you can rely on.

pfalcon commented 6 years ago

Anyhow, I don't see how that relates to #7500

This ticket is to discuss general approach to the problem of handling dynamic updates of the interface IP address(es). Hopefully, that clarifies it.

therealprof commented 6 years ago

This ticket is to discuss general approach to the problem of handling dynamic updates of the interface IP address(es). Hopefully, that clarifies it.

Not really, it seems like the management API should be able to take care of all cases where the application actually has to do something. Everything else is just standard socket behaviour and thus should work as such.

jukkar commented 5 years ago

Just FYI, there is the experimental IPv6 privacy extension support in #9905 where IPv6 address might be removed from the interface.

tbursztyka commented 4 years ago

This ticket is to discuss general approach to the problem of handling dynamic updates of the interface IP address(es). Hopefully, that clarifies it.

Not really, it seems like the management API should be able to take care of all cases where the application actually has to do something. Everything else is just standard socket behaviour and thus should work as such.

Just read this thread, linked from #29200, and so far I don't see why we should change anything. The management API is indeed already sending the proper events (IP address added/removed), so it's up to the application to do its job.

pfalcon commented 4 years ago

Just read this thread, linked from #29200, and so far I don't see why we should change anything. The management API is indeed already sending the proper events (IP address added/removed), so it's up to the application to do its job.

So, as the baseline, this ticket called to: a) survey ways to handle that across existing well-known IP stacks (big OSes like Linux, popular embedded stacks); b) choose the "official" approach for Zephyr (i.e. document it). Whether anything needs to be "changed" depends on how well the above is done.

carlescufi commented 1 year ago

Closing as stale, the IP stack already provides mechanisms to deal with IP address changes.