Thursday, June 25, 2009

ESX Server Craziness

Had two Server 2008 research VMs (deployed from a template) that wouldn't power on. After the initial power-on attempt, neither would allow any other changes, because there was already a "task in progress," "fault null," or something to that effect. Couldn't remove from inventory, couldn't delete, could sometimes migrate, couldn't power on.

Now, does VMware 2.5 support Server 2008 templates? Nope. I knew that going into this. But somehow I've gotten away with it before on other test servers, so it was worth a shot. Research suggested that a solution would be to kill the process on the ESX server that is hung trying to reboot the server. I tried all sorts of things, including stuff like

ps -ef | grep <vm_name>
kill -9 <pid>
rm -rdf machine_name

etc. What it came down to was that our VMware VirtualCenter thought a task was being performed on these screwed-up VMs, although the breakdown could have been either in the communication between vCenter and the ESX server, or between the ESX server and the guest OS.
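For what it's worth, the "find the hung process" step can be sketched as a small helper. This is only a sketch: find_vm_pids is a made-up name, and it assumes the VM's process shows its .vmx path in the ps output, which may not hold on every ESX build. It only prints PIDs; killing them is still a manual (and risky) step.

```shell
#!/bin/sh
# find_vm_pids NAME (hypothetical helper): read `ps -ef`-style lines on
# stdin and print the PID (second column) of every line that mentions
# "/NAME.vmx". Assumes the VM process shows its .vmx path on its command line.
find_vm_pids() {
    vm_name="$1"
    # index() does a plain substring match, so dots in the name are literal.
    awk -v vm="$vm_name" 'index($0, "/" vm ".vmx") > 0 { print $2 }'
}

# Typical use on the service console (prints nothing if no process matches):
#   ps -ef | find_vm_pids research01
```

If it prints a PID, that is the one you would feed to kill -9, after checking by eye that it really is the stuck VM.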

Anyway, this was resolved not by leet Linux commands, but by a good old restart, which was what I had been trying to avoid the whole time. First I tried a "service mgmt-vmware restart," which made all the guest OSes appear offline in vCenter. I have since learned that you should accompany this command with a "service vmware-vpxa restart."
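A minimal sketch of running the two restarts together, for next time. The run and restart_mgmt_agents names are my own, and the order (mgmt-vmware first, then vmware-vpxa) is an assumption based on how I described it above. It defaults to a dry run that only prints the commands, so nothing gets restarted by accident.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0
# on the ESX service console to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Restart the host agent, then the vCenter agent (order is an assumption).
restart_mgmt_agents() {
    run service mgmt-vmware restart && run service vmware-vpxa restart
}

restart_mgmt_agents
```

The second restart only happens if the first one returns success, which seemed like the safer default.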

The two services are tied together in some way - I didn't have a chance to research this today. When the servers did not come back online after 10+ minutes, I did a restart of the entire machine using

/sbin/reboot

You should preferably put the ESX server in maintenance mode before doing this:
vimsh -n -e /hostsvc/maintenance_mode_enter
bounce
vimsh -n -e /hostsvc/maintenance_mode_exit

I satisfied my engineering curiosity of "what if" by just bouncing it :-) Not to worry, it only hosts research VMs, so this was as risk-free as you get. There are two ESX servers in this cluster, and I rebooted B. Interestingly, A eventually went down (red mark on the ESX server in vCenter). Then it came back up. Then it went down again. B stayed down. After 10+ minutes, both ESX servers came up within moments of each other, happy and refreshed.

And I was able to remove those two VMs from inventory, delete them, and free up space to continue my GPO testing :-D

Speaking of which, I found two issues today with my GPO ADMX testing.
1. "Policy Definitions" is not the same thing as "PolicyDefinitions" in the sysvol. Only the second will be recognized by the DC.
2. You need SOME ADML files to support the ADMX files. ADMLs are only optional after you implement the first set. Otherwise you'll get swarmed with errors when you open GPMC.
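Both gotchas can be checked mechanically. Here is a rough sketch, assuming you point it at a copy (or a mounted share) of the folder and that the ADML files live in a language subfolder such as en-US. check_policy_store is a made-up name, and it is stricter than my note above: it flags every ADMX that has no matching ADML.

```shell
#!/bin/sh
# check_policy_store DIR [LOCALE] (hypothetical helper): verify the folder
# is named exactly "PolicyDefinitions" and that every .admx in DIR has a
# matching .adml in DIR/LOCALE (default en-US). Returns 1 on any problem.
check_policy_store() {
    store="$1"
    locale_dir="${2:-en-US}"
    status=0

    # A folder called "Policy Definitions" (with a space) will not be picked up.
    if [ "$(basename "$store")" != "PolicyDefinitions" ]; then
        echo "bad folder name: $(basename "$store")"
        status=1
    fi

    for admx in "$store"/*.admx; do
        [ -e "$admx" ] || continue   # glob matched nothing: no .admx files
        base=$(basename "$admx" .admx)
        if [ ! -e "$store/$locale_dir/$base.adml" ]; then
            echo "missing adml: $locale_dir/$base.adml"
            status=1
        fi
    done
    return "$status"
}
```

Run it as check_policy_store /path/to/PolicyDefinitions and it stays quiet when everything lines up.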

There's one of the problems I solved today.

Carry on!
