NTP out of sync each night on ESXi
The clock of each VM mysteriously drifts every night: who is responsible? Veeam, ESXi, NTP?
In the last few weeks I installed CheckMK to try it out. I have already added several hosts to CheckMK monitoring, mainly VMs running its Linux agent.
Since I'm only trying it out, I'm not giving it much attention right now. Yesterday, some late alarms caught my attention. It seems that the clocks on some servers have drifted, even though they have NTP installed and (I hope) configured correctly.
It’s time to dig.
First pattern
While digging I found a strange pattern. While I was connected over SSH to one VM, another one raised the same NTP alert. Very, very strange. This is an alarm bell: it shouldn't be an issue specific to one VM, but a much wider one. It's time to use the power of CheckMK. I searched for the NTP service alert and found a very interesting behavior. This is what I can see in the output:
- it happens every ~24h
- it resolves automatically after some time
- it happens on a lot of VMs
But why?!
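The ~24h spacing can be checked mechanically rather than by eye. Below is a minimal Python sketch, assuming the alert start times have been copied out of the monitoring history by hand; the timestamps and the function names `gaps_in_hours` and `looks_daily` are made up for illustration and are not part of CheckMK.

```python
from datetime import datetime

# Hypothetical alert start times, one per night (not real CheckMK output).
alert_times = [
    "2024-01-10 02:14",
    "2024-01-11 02:16",
    "2024-01-12 02:13",
    "2024-01-13 02:15",
]

def gaps_in_hours(timestamps):
    """Return the gaps, in hours, between consecutive alert times."""
    parsed = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M") for t in timestamps)
    return [(b - a).total_seconds() / 3600 for a, b in zip(parsed, parsed[1:])]

def looks_daily(timestamps, tolerance_hours=1.0):
    """True if every gap between consecutive alerts is within tolerance of 24 h."""
    gaps = gaps_in_hours(timestamps)
    return bool(gaps) and all(abs(g - 24) <= tolerance_hours for g in gaps)

print(looks_daily(alert_times))
```

If the same check passes for several unrelated VMs, the cause is very likely something scheduled outside the guests.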
Matching ideas
It cannot be an NTP issue on a specific VM, as several VMs have the same issue. It cannot be an issue specific to a Debian version or to systemd, as it also happens on Devuan and on several VMs with different Debian releases. So? Idea: backups! The Veeam backup jobs run at exactly these hours… but that does not make sense, based on what I already know… I googled and… found it!
VMware NTP sync during snapshots
I discovered that, by design, when VMware takes a snapshot (and Veeam takes snapshots to do its backups), it syncs the time of the ESXi host to the VM via VMware Tools. This is properly documented here; the interesting part is:
One-off time sync: Guest system clock is synchronized to host time upon specific VM life-cycle events that can cause guest clock to become incorrect, such as resuming from vMotion or a snapshot. This capability is recommended for use and turned on by default. VMware Tools does not set time backwards (when guest time is ahead of the host), except once when periodic time synchronization is turned on.
Clock assessment on the VMware infrastructure
I checked the configuration on each ESXi host:
- NTP is configured on each ESXi;
- the clock is wrong on all ESXi; …
Checking each ESXi host more deeply, I found that the NTP servers configured on them belong to a temporary infrastructure of mine that has since been decommissioned. These NTP servers are definitely not working… and this is why all the clocks are wrong.
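One way to confirm that a configured NTP server is really dead is to send it a bare SNTP request and see whether anything comes back. The following is a minimal Python sketch under that assumption, not a full NTP client: it ignores round-trip delay, and the function names are purely illustrative.

```python
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800

def build_request():
    """Build a 48-byte SNTP client request (LI=0, version 3, mode 3)."""
    return b"\x1b" + 47 * b"\x00"

def parse_transmit_time(packet):
    """Extract the server transmit timestamp as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32

def query(server, timeout=2.0):
    """Return the server's time as Unix seconds, or None if it does not answer."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(build_request(), (server, 123))
            packet, _ = sock.recvfrom(512)
        return parse_transmit_time(packet)
    except (OSError, ValueError):
        # Covers DNS failures, timeouts and malformed replies.
        return None
```

Calling `query()` with each server name taken from the ESXi NTP configuration and getting `None` for all of them would match what I found here: servers that no longer exist cannot keep any clock in sync.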
Final timeline of the events
What’s happening:
- all ESXi servers have bad NTP servers
- all ESXi clocks have drifted
- at every Veeam backup, VMware synchronizes the VM clock to the ESXi clock (which is wrong)
- CheckMK detects the drift and raises an alert
- after some minutes, the ntpd running on each VM corrects the VM's clock
- the CheckMK alerts are cleared
This was happening cyclically each night, at each backup window… the mystery has been solved.
I replaced the NTP servers on each ESXi with valid ones; the clocks have been corrected and the issue has been resolved for good.