In the context of the Red Hat Enterprise Virtualization environment, fencing is a host reboot initiated by the Manager using a fence agent and performed by a power management device. Fencing allows a cluster to react to unexpected host failures as well as enforce power saving, load balancing, and virtual machine availability policies.
Fencing is essential in ensuring that the role of SPM is always assigned to a functional host. If the fenced host was the SPM, the SPM role is relinquished and reassigned to a responsive host. Because the host with the SPM role is the only host that is able to write data domain structure metadata, a non-responsive, unfenced SPM host causes its environment to lose the ability to create and destroy virtual disks, take snapshots, extend logical volumes, and perform any other action that requires changes to data domain structure metadata.
When a host becomes non-responsive, all of the virtual machines that are currently running on that host can also become non-responsive. However, the non-responsive host retains the lock on the hard disk images of the virtual machines it is running. Attempting to start one of those virtual machines on a second host, and assigning the second host write privileges for the virtual machine hard disk image, can cause data corruption. Fencing allows the Red Hat Enterprise Virtualization Manager to safely release the lock on a virtual machine hard disk image because the Manager can use a fence agent to confirm that a problem host has truly been rebooted. When this confirmation is received, the Red Hat Enterprise Virtualization Manager can start a virtual machine from the problem host on another host without risking data corruption. Fencing is the basis for highly available virtual machines. A virtual machine that has been marked highly available cannot be safely started on an alternate host without the certainty that doing so will not cause data corruption.
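The locking rule described above can be illustrated with a minimal Python sketch. It is a toy model, not the Manager's implementation: the `DiskImage` class and the `release_lock_after_fence` and `start_vm` functions are hypothetical names invented for this example. It shows only that the lock is released once the reboot of the problem host is confirmed, and that a second host cannot take over the image before then.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiskImage:
    """Hypothetical model of a virtual machine hard disk image and its lock."""
    name: str
    lock_holder: Optional[str] = None  # name of the host holding the lock


def release_lock_after_fence(image: DiskImage, host: str, reboot_confirmed: bool) -> bool:
    """Release the lock held by `host` only if its reboot has been confirmed."""
    if image.lock_holder != host:
        return False  # this host does not hold the lock
    if not reboot_confirmed:
        return False  # host may still be writing; releasing would risk corruption
    image.lock_holder = None
    return True


def start_vm(image: DiskImage, new_host: str) -> None:
    """Give `new_host` write access, refusing while another host holds the lock."""
    if image.lock_holder is not None:
        raise RuntimeError(
            f"{image.name} is still locked by {image.lock_holder}; fence that host first"
        )
    image.lock_holder = new_host
```

In this model, calling `start_vm` while the non-responsive host still holds the lock raises an error; it succeeds only after `release_lock_after_fence` has been called with `reboot_confirmed=True`.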
When a host becomes non-responsive, the Red Hat Enterprise Virtualization Manager allows a grace period of 30 seconds to pass before taking any action, so that the host can recover from any temporary errors. If the host has not become responsive by the time the grace period has passed, the Manager automatically begins to mitigate any negative impact from the non-responsive host. The Manager uses the fencing agent for the power management card on the host to stop the host, confirm that it has stopped, start the host, and confirm that it has started. When the host finishes booting, it attempts to rejoin the cluster that it was a part of before it was fenced. If the reboot has resolved the issue that caused the host to become non-responsive, the host is automatically set to Up status and is capable of starting and hosting virtual machines.
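The sequence above (grace period, stop, confirm, start, confirm) can be sketched in Python as follows. This is an illustrative simulation under stated assumptions, not the Manager's code: `FakeFenceAgent`, `fence_host`, the `is_responsive` callback, and the returned status strings are all hypothetical. Only the 30-second grace period and the order of the steps come from the text.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

GRACE_PERIOD_SECONDS = 30  # grace period stated in the text


@dataclass
class FakeFenceAgent:
    """Hypothetical stand-in for a fence agent driving a power management device."""
    powered_on: bool = True
    log: List[str] = field(default_factory=list)

    def stop(self) -> None:
        self.log.append("stop")
        self.powered_on = False

    def start(self) -> None:
        self.log.append("start")
        self.powered_on = True

    def status(self) -> str:
        self.log.append("status")
        return "on" if self.powered_on else "off"


def fence_host(
    agent: FakeFenceAgent,
    is_responsive: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Wait out the grace period, then stop, confirm, start, and confirm."""
    sleep(GRACE_PERIOD_SECONDS)
    if is_responsive():
        return "recovered"  # temporary error cleared; no fencing needed
    agent.stop()
    if agent.status() != "off":
        return "fence_failed"  # stop not confirmed; unsafe to proceed
    agent.start()
    if agent.status() != "on":
        return "fence_failed"  # host did not power back on
    return "rebooted"
```

The `sleep` parameter is injectable so the sketch can be exercised without actually waiting, for example `fence_host(FakeFenceAgent(), lambda: False, sleep=lambda _: None)`.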