Hosts not accessible - troubleshooting
This is one of the most common issues a VMware admin will run into from time to time:
a host showing 'Not Responding' in vCenter and being unmanageable through the DCUI.
Here is a summary of what happened with a client, what causes this, how the root cause may be determined (usually only before rebooting), and what to do if it happens again.
The symptoms you will see are:
-The host is showing 'Not Responding' and VMs running on it show 'Disconnected' in vCenter. You likely won't be able to connect a vSphere client directly to the host. If you can, it will work very slowly.
-You are unable to navigate through the DCUI, or there are long delays when you try. If you can get to the shell, commands don't work properly or just stall.
-Attempting to reconnect the host in vCenter fails.
-Attempting to SSH to the host times out. Pings may still work.
-VMs on the host are still running. Normally you won't see any VM performance issues, though you can't power on additional VMs.
***You CANNOT migrate VMs to another host. A reboot will be required, though there is a small chance of reconnecting the host by following some steps (below), which lets you migrate the VMs off before rebooting.
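The network-side symptoms above (pings still work, SSH times out, the client cannot connect) can be checked from an admin workstation with a short script. The following is a minimal Python sketch, not a definitive diagnostic. It assumes SSH on port 22 and the client connection on HTTPS port 443, and the function names and triage labels are hypothetical. Note that SSH is often disabled on ESXi hosts, so a closed port 22 alone proves nothing, and an open port does not rule out a host that is merely responding very slowly.

```python
import socket

# Assumed ports: 22 (SSH) and 443 (HTTPS, used for client connections).
SSH_PORT = 22
HTTPS_PORT = 443


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def classify(ping_ok, ssh_ok, https_ok):
    """Map three probe results to a rough triage label (hypothetical labels)."""
    if ssh_ok or https_ok:
        # At least one management port answers; the host may still be slow.
        return "management-reachable"
    if ping_ok:
        # Pingable, but nothing answers on the management ports.
        return "matches-symptoms"
    # Nothing answers at all: more likely a network or power problem.
    return "unreachable"


def probe(host, ping_ok):
    """Probe the management ports; the ping result is supplied by the caller."""
    return classify(
        ping_ok,
        tcp_port_open(host, SSH_PORT),
        tcp_port_open(host, HTTPS_PORT),
    )
```

For example, after pinging the host manually, `probe("esxi01.example.local", ping_ok=True)` (a placeholder hostname) returns "matches-symptoms" when neither port accepts a connection. Treat the result as a hint to go and check the DCUI, not as a diagnosis.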
If the host is rebooted, only the vmkernel logs will persist (after a full PSOD crash you may also get a core dump). Unless the kernel was involved in putting the host into that state (rare), those logs won't show anything meaningful. Other logs won't help unless you have syslog set up and it collected them before the reboot. Because of this, it is usually not possible to identify the specific root cause after rebooting the host (and if you can't run commands before the reboot, you are unlikely to identify it then either).
That said, the direct cause is most often known from the symptoms:
-The hypervisor will protect the VMs at the cost of being able to run itself effectively. Resources allocated to running VMs are left alone, and ESXi will let go of management agents (mainly hostd) when the resources they need (almost always memory) are no longer available.
-Running out of memory available to the hypervisor is mainly due to processes that connect to the hostd agent. If memory is not released afterwards, the host can eventually run out over time. Other issues can make this happen more quickly. As products develop, there are often improvements to how this works and known causes of hostd issues are fixed, so it is essential to keep hosts updated (this also means regular reboots, which significantly reduce the likelihood of running into this issue).
There are some things you can do if a host becomes unresponsive. Rather than type it all out here, I will provide links to the important KB articles, which describe the condition, explain how to troubleshoot it, and cover much of what is summarized above:
https://kb.vmware.com/s/article/1003409 Troubleshooting an ESXi host in non responding state
https://kb.vmware.com/s/article/1003490 Restarting the Management agents in ESXi
https://kb.vmware.com/s/article/56450 ESXi host is non-responsive and disconnected in vCenter
https://kb.vmware.com/s/article/1017135 Determining why an ESX/ESXi host does not respond to user interaction at the console