Nutanix Node Removed from MetaData Store
My team at Dell is working feverishly to bring the new Dell and Nutanix partnership to fruition through new and exciting hardware platforms and solutions. Part of the process is learning and playing…a LOT. A few weeks ago there was a very bad storm in Austin that knocked my team’s datacenter offline. Our four-node Nutanix cluster, along with everything else, went down hard. Fortunately, three nodes came back up without issue, but the 4th came back only partially. At this point we’re just doing POC work and functionality testing, so there is no alerting configured. We had a failed Controller VM (CVM) on one of our nodes and didn’t know it. This is probably an extraordinary situation, which is why I want to document it.
This cluster is running Server 2012 R2 Hyper-V with an SMB3 storage pool. The architecture is simple: all nodes and their disks participate in two layers of clustering, one private cluster controlled by and run for Nutanix, the other a failover cluster for the Hyper-V hypervisors. SMB resources cannot be owned by a cluster in Hyper-V anyway, as there are no disks to add to the cluster; this is simply a namespace that you utilize via UNC pathing. The two clusters operate completely independently of one another. The physical disks are owned by the Nutanix CVMs and are completely obscured from Hyper-V. So even though our 4th node was fine from a Hyper-V perspective, able to run and host VMs, the CVM living on that node was kaput, as were its local disks, from a Nutanix cluster perspective.
In Prism, it was plain to see that this node was having trouble, and the easy remediation steps didn’t work. Rebooting the CVM, rebooting the host, and enabling the metadata store all had no effect; neither did trying to start the cluster services manually. I removed the host from the cluster via Prism, hoping I could easily add it back.
Once the disks had been completely removed in Prism, the remaining nodes could see that this CVM and its physical resources were gone. Unfortunately, I was unable to expand the cluster and easily add this node back into the mix. I could open neither the Prism console nor the cluster init page on this CVM. To clean up the metadata and “factory reset” this CVM, I ran the following command in the CVM’s console:
cluster -f -s 127.0.0.1 destroy
Once complete, I tried to expand the cluster again in Prism and this time the CVM was discovered. Woot!
Cluster expanded successfully and all is well in the world again, right? Not quite. The disks of the 4th node never joined the pool even though Prism now saw four nodes in my cluster. Hmm. Running a “cluster status” on the CVM revealed this:
We’re not out of the woods yet. Enter the Nutanix CLI to check the disk tombstone entries. Both of my SSDs and three of my SATA disks had been tombstoned as part of the previous configuration, so they were now being prevented from being assigned.
One by one these entries needed to be removed so that the CVM and these disk resources could again be free to join the pool.
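With five entries to clear, typing each removal by hand gets tedious. Here is a minimal Python sketch that builds one removal command per disk serial so they can be reviewed before being pasted into the CVM. It executes nothing. The ncli subcommand and its serial-number argument are my assumption from memory of the NCLI reference, not something shown in this post, so check them against the ncli help on your own cluster first. The function name and the serials are hypothetical placeholders.

```python
import shlex

# ASSUMPTION: this ncli subcommand and its serial-number argument are recalled
# from the NCLI reference, not taken from the post. Verify them on your own
# CVM before running anything.
REMOVE_TEMPLATE = "ncli disk remove-tombstone-entry serial-number={serial}"


def build_removal_commands(serials, template=REMOVE_TEMPLATE):
    """Return one removal command per unique serial, preserving input order.

    Nothing is executed; the commands are returned as strings for review.
    """
    seen = set()
    commands = []
    for serial in serials:
        serial = serial.strip()
        # Skip blank lines and serials we have already handled.
        if not serial or serial in seen:
            continue
        seen.add(serial)
        # Quote the serial so an unexpected character cannot alter the command.
        commands.append(template.format(serial=shlex.quote(serial)))
    return commands


if __name__ == "__main__":
    # Hypothetical placeholder serials; substitute the tombstoned serials
    # that your own CVM reports.
    for command in build_removal_commands(["SERIAL-A", "SERIAL-B", "SERIAL-A"]):
        print(command)
```

Passing the template in as a parameter means that if the subcommand name differs on your AOS version, only that one string needs to change.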
Now, perform a “cluster start” to start the services on this CVM and voila, back in business.
Check the current activities in Prism, which now reflect the node getting added (for real this time).
All disk resources are now present and accounted for.
Pretty simple fix for what should be a fairly irregular situation. If you ever run into this issue, this is what solved it for us. Big shout-out to Mike Tao and Rob Tribe at Nutanix for the assist!