Open craig-willis opened 6 years ago
tickets.xsede.org #80694
DHCP leases are short primarily because of suspend and migration issues.
Essentially, during either Suspend or non-live Migration, the VM's internal clock stops. Back in the "real" world, the DHCP server's clock didn't stop.
When the VM resumes, its lease may have expired, but it doesn't think to ask for a new lease until its internal timer goes off; only then does it renegotiate. If we left it at 24 hours, we'd have the pool of IPs used up, VMs waiting up to 24 hours to renegotiate, or both.
Worth noting:
SDSC/Cloud
MTU=1458 T1=78090 T2=142890 LIFETIME=172800
NCSA/Nebula
MTU=1454 T1=40307 T2=72707 LIFETIME=86400
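To make the two sets of numbers easier to compare, here is a minimal sketch that converts them to hours and to fractions of the lease lifetime. It assumes T1 and T2 are the standard DHCP renewal and rebinding timers and that all values are in seconds; the `LEASES` table and `summarize` helper are illustrative names, not anything from either site's tooling.

```python
# Lease timers as reported above, assumed to be in seconds.
LEASES = {
    "SDSC/Cloud": {"t1": 78090, "t2": 142890, "lifetime": 172800},
    "NCSA/Nebula": {"t1": 40307, "t2": 72707, "lifetime": 86400},
}


def summarize(timers):
    """Convert lease timers to hours and to fractions of the lifetime."""
    lifetime = timers["lifetime"]
    return {
        "lifetime_hours": lifetime / 3600,
        "t1_hours": round(timers["t1"] / 3600, 2),
        "t2_hours": round(timers["t2"] / 3600, 2),
        # Fraction of the lease that elapses before renew (T1) / rebind (T2).
        "t1_fraction": round(timers["t1"] / lifetime, 3),
        "t2_fraction": round(timers["t2"] / lifetime, 3),
    }


if __name__ == "__main__":
    for site, timers in LEASES.items():
        print(site, summarize(timers))
```

The lifetimes work out to 48 hours at SDSC and 24 hours at NCSA, so a suspended VM at either site could in principle hold a stale lease for a long time before its own timers fire.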
Comment from SDSC about the high lease time:
Don't think we have given it much thought; suspend and non-live migration are not very common at SDSC. Each project has its own subnet and pool of IPs, so unused IPs have not been a concern. Perhaps if we couldn't live migrate then this would be a concern, but there are very few circumstances where we can't.
@Xarthisius I don't think TACC will change the DHCP lease timeout based on above -- it seems to have been an intentional decision. Do you have any further questions? I expect we should go forward expecting frequent network config changes and swarm join log entries.
Actually, just had another response from Jetstream:
since you’re already playing with fire, I’ll assume you’re willing to play with explosives as well ;) If you want to adjust the DHCP life time for an instance, on that instance edit the file (on RHEL anyway) /etc/dhcp/dhclient.conf and add the line supersede dhcp-lease-time XXXXX; where XXXXX is the number of seconds you want your lease to have. Obviously, if you’re unable to communicate with an instance after resuming it, you may have to wait for the lease time to run out. Please, let us know how that works out,
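If we do try the override, something like the following could add or update the line idempotently. This is only a sketch under assumptions: `set_lease_override` is a hypothetical helper, it works on the file's text rather than touching `/etc/dhcp/dhclient.conf` directly, and it only handles a top-level `supersede dhcp-lease-time` statement (not one nested inside an `interface` block).

```python
import re

# Matches an existing top-level "supersede dhcp-lease-time N;" line.
# [ \t] is used instead of \s so the match never spans line breaks.
SUPERSEDE_RE = re.compile(
    r"^[ \t]*supersede[ \t]+dhcp-lease-time[ \t]+\d+[ \t]*;[ \t]*$",
    re.MULTILINE,
)


def set_lease_override(conf_text: str, seconds: int) -> str:
    """Return conf_text with the lease-time override set to `seconds`."""
    if seconds <= 0:
        raise ValueError("lease time must be a positive number of seconds")
    line = f"supersede dhcp-lease-time {seconds};"
    if SUPERSEDE_RE.search(conf_text):
        # Replace any existing override rather than adding a duplicate.
        return SUPERSEDE_RE.sub(line, conf_text)
    if conf_text and not conf_text.endswith("\n"):
        conf_text += "\n"
    return conf_text + line + "\n"


if __name__ == "__main__":
    print(set_lease_override("timeout 300;\n", 86400), end="")
```

Writing the result back to the file and restarting the DHCP client would still be manual steps, and the caveat from Jetstream applies: if the instance is unreachable after a resume, we may have to wait out the lease.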
So perhaps we can override somehow, if needed.
No, until we have a concrete issue that this is causing, I don't think we can push them.
We're seeing frequent log entries indicating network configuration changes:
Since we're using Docker swarm, we're also seeing frequent "node join" events as the system responds to the network change.
This may be related to the short DHCP lease timeout.
According to the OpenStack docs, the default value of dhcp_lease_duration is 24 hours (86400 seconds).
Confirm with TACC why the lease is so short and consider impacts