whole-tale / terraform_deployment

Terraform deployment setup for WT prod
BSD 3-Clause "New" or "Revised" License

DHCP lease timeout #4

Open craig-willis opened 6 years ago

craig-willis commented 6 years ago

We're seeing frequent log entries indicating network configuration changes:

Dec 19 14:00:27 host-192-168-149-8 systemd-timesyncd[605]: Network configuration changed, trying to establish connection.
Dec 19 14:00:27 host-192-168-149-8 systemd-timesyncd[605]: Synchronized to time server 129.114.97.2:123 (129.114.97.2).
Dec 19 14:00:47 host-192-168-149-8 systemd-timesyncd[605]: Network configuration changed, trying to establish connection.
Dec 19 14:00:47 host-192-168-149-8 systemd-timesyncd[605]: Synchronized to time server 129.114.97.2:123 (129.114.97.2).
...
Dec 19 14:02:41 host-192-168-149-8 systemd-timesyncd[605]: Network configuration changed, trying to establish connection.
Dec 19 14:02:41 host-192-168-149-8 systemd-timesyncd[605]: Synchronized to time server 129.114.97.2:123 (129.114.97.2).
Dec 19 14:03:00 host-192-168-149-8 systemd-timesyncd[605]: Network configuration changed, trying to establish connection.
Dec 19 14:03:00 host-192-168-149-8 systemd-timesyncd[605]: Synchronized to time server 129.114.97.2:123 (129.114.97.2).

Since we're using Docker swarm, we're also seeing frequent "node join" events as the system responds to the network change.

This may be related to a short DHCP lease timeout:

$ cat /run/systemd/netif/leases/2
..
MTU=9000 T1=133 T2=245 LIFETIME=300
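To compare lease settings across sites, the KEY=VALUE fields shown above can be pulled into a dictionary. This is only a minimal sketch: `parse_lease` is a hypothetical helper, not part of this repository, and it assumes the fields are whitespace- or newline-separated as in the excerpt.

```python
def parse_lease(text):
    """Parse KEY=VALUE tokens (whitespace- or newline-separated) into a dict."""
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens without '=', such as the elided ".."
            fields[key] = int(value) if value.isdigit() else value
    return fields


# Values copied from the lease excerpt in this issue.
lease = parse_lease(".. MTU=9000 T1=133 T2=245 LIFETIME=300")
print(f"lease lifetime: {lease['LIFETIME'] / 60:.0f} min")  # lease lifetime: 5 min
```

A 300-second lifetime means the lease is renegotiated every few minutes, far more often than with a 24-hour default.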

According to the OpenStack docs, the default value of dhcp_lease_duration is 24 hours.

Next step: confirm with TACC why the lease is so short and consider the impact.

craig-willis commented 6 years ago

tickets.xsede.org #80694

DHCP leases are short primarily because of suspend and migration issues.

Essentially, during either Suspend or non-live Migration, the VM's internal clock stops. Back in the "real" world, the DHCP server's clock didn't stop.

When the VM resumes, its lease may be expired, but it doesn't think to ask for a new lease until its internal timer goes off; only then does it renegotiate. If we left it at 24 hours, then we'd either have the pool of IPs used up and/or VMs might wait up to 24 hours to renegotiate.
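The reasoning above can be illustrated with a rough model. This is a sketch under stated assumptions, not a description of how any particular DHCP client behaves: the guest is assumed to renew only when its own frozen-during-suspend timer reaches a renewal point, taken here as half the lifetime, and `stale_seconds` is a hypothetical helper.

```python
def stale_seconds(lifetime, renew_at, ran_before_suspend, suspended_for):
    """Seconds after resume during which the lease has expired on the server
    but the guest has not yet tried to renew.

    Assumes ran_before_suspend <= renew_at, i.e. the guest was suspended
    before its first renewal attempt.
    """
    resume_wall = ran_before_suspend + suspended_for  # wall-clock time at resume
    renew_wall = renew_at + suspended_for  # guest timer was frozen while suspended
    stale_from = max(lifetime, resume_wall)
    return max(0, renew_wall - stale_from)


DAY = 24 * 3600
for lifetime in (300, DAY):
    gap = stale_seconds(
        lifetime,
        renew_at=lifetime // 2,  # assumed renewal point
        ran_before_suspend=60,
        suspended_for=2 * DAY,
    )
    print(f"lifetime={lifetime}s -> stale for {gap}s after resume")
```

Under these assumptions, a VM suspended for two days stays on an expired lease for 90 seconds with a 300-second lifetime, versus 43140 seconds (about 12 hours) with a 24-hour lifetime.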

Worth noting:

SDSC/Cloud

MTU=1458 T1=78090 T2=142890 LIFETIME=172800

NCSA/Nebula

MTU=1454 T1=40307 T2=72707 LIFETIME=86400

craig-willis commented 6 years ago

Comment from SDSC about the high lease time:

Don't think we have given it much thought; suspend and non-live migration are not very common at SDSC. Each project has its own subnet and pool of IPs, so unused IPs have not been a concern. Perhaps if we couldn't live migrate then this would be a concern, but there are very few circumstances where we can't.

craig-willis commented 6 years ago

@Xarthisius I don't think TACC will change the DHCP lease timeout based on the above -- it seems to have been an intentional decision. Do you have any further questions? I expect we should go forward expecting frequent network config changes and swarm join log entries.

craig-willis commented 6 years ago

Actually, just had another response from Jetstream:

since you’re already playing with fire, I’ll assume you’re willing to play with explosives as well ;) If you want to adjust the DHCP life time for an instance, on that instance edit the file (on RHEL anyway) /etc/dhcp/dhclient.conf and add the line supersede dhcp-lease-time XXXXX; where XXXXX is the number of seconds you want your lease to have. Obviously, if you’re unable to communicate with an instance after resuming it, you may have to wait for the lease time to run out. Please, let us know how that works out,

So perhaps we can override somehow, if needed.
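If the override is ever needed, the edit Jetstream describes could be scripted roughly as follows. This is a hedged sketch: `set_lease_time` is a hypothetical helper, the sample `timeout 60;` content is illustrative, and it only applies if the instance actually uses dhclient. The lease file under /run/systemd/netif suggests the hosts in this issue may use systemd-networkd instead, in which case this file would likely have no effect.

```python
import re

# Matches an existing "supersede dhcp-lease-time N;" line on its own line.
_SUPERSEDE = re.compile(
    r"^[ \t]*supersede[ \t]+dhcp-lease-time[ \t]+\d+[ \t]*;[ \t]*$",
    re.MULTILINE,
)


def set_lease_time(conf_text, seconds):
    """Return dhclient.conf text with the lease-time override set to `seconds`."""
    line = f"supersede dhcp-lease-time {seconds};"
    if _SUPERSEDE.search(conf_text):
        # Replace an existing override so repeated runs do not add duplicates.
        return _SUPERSEDE.sub(line, conf_text)
    if conf_text and not conf_text.endswith("\n"):
        conf_text += "\n"
    return conf_text + line + "\n"


# Illustrative input only; a real run would read and write
# /etc/dhcp/dhclient.conf (the RHEL path given in the quote) as root.
print(set_lease_time("timeout 60;\n", 86400), end="")
```

Working on the text rather than the file directly keeps the helper easy to test and makes the change idempotent.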

Xarthisius commented 6 years ago

No, until we have a concrete issue that this is causing, I don't think we can push them.