|
|
Requirements for Parallel ClustersLet us examine the critical issues pertaining to the scalable cluster and the parallel database:
Avoiding Split-BrainIt is reasonable to expect that server components are prone to failures. It is the responsibility of the cluster to detect, monitor, and stabilize the application running on the cluster. Cluster systems are geared to handle peculiar situations such as amnesia and split-brain conditions. We mentioned earlier that amnesia occurs when the cluster restarts after a shutdown, with cluster data older than at the time of the shutdown. This can happen if multiple versions of the framework data are stored on disk, and a new incarnation of the cluster is started when the latest version is not available. Split-brain condition occurs when a single cluster has a failure that results in reconfiguration of the cluster into multiple partitions; each partition forms its own sub-cluster without knowledge of the existence of the other. This leads to data collision and the corruption of shared data, because each sub-cluster assumes ownership of shared data. As an example, when two systems have access to the shared storage, the integrity of the data depends on the communication of heartbeats through the private interconnects. When the private links fail, or if one of the systems is hung or too busy to transmit heartbeats, each system thinks the other system has exited the cluster. Each system then tries to become master (or form a sub-cluster), and claim exclusive access to the shared storage. This condition leads to split-brain. There are definite methods, called fencing, to avoid such a tricky and undesirable situation. There are two basic approaches to fencing: resource-based fencing, and system reset (or STOMITH or STONITH) fencing. Resource-based fencing includes I/O fencing and the maintenance of quorum disks. In resource-based fencing, a hardware mechanism is employed which immediately disables or disallows access to shared resources. If the shared resource is a SCSI disk or disk array, one can use SCSI reserve/release (or better yet persistent reserve/release operations). If the shared resource is a fiber channel disk or disk array, then one can instruct a fiber channel switch to deny the problem node access to shared resources. In general, the errant node itself is left undisturbed, and its resources are instructed to deny access to it. If the node is able to later become part of a cluster with quorum, it will then go through the normal channels to reacquire its resources. STOMITH stands for ‘Shoot the Other Machine in the Head’. STOMITH fencing takes a completely different approach. In STOMITH systems, the errant cluster node is simply reset and forced to reboot. When it rejoins the cluster, it acquires resources in the normal way. In many cases, STOMITH operations are performed via smart power switches, which simply remove power from the errant node for a brief period of time. The method of avoiding split-brain varies from vendor to vendor. It also depends on the type of shared storage in use for the cluster. For example, Sun Cluster avoids split-brain by using the majority vote principle, coupled with quorum disks and a Linux cluster using a Polyserve Matrix Server employing fabric fencing. Let us examine these techniques in detail. I/O Fencing – Exclusion StrategyThere will be some situations where the leftover write operations from failed database instances (the cluster function failed on the nodes, but the nodes are still running at OS level) reach the storage system after the recovery process starts. Since these write operations are no longer in the proper serial order, they can damage the consistency of the stored data. Therefore, when a cluster node fails, the failed node needs to be fenced off from all the shared disk devices or disk groups. This methodology is called I/O fencing, disk fencing, or failure fencing. The two main functions of I/O fencing are to prevent updates by failed instances, and to detect failure and prevent split-brain in the cluster. Cluster Volume Manager (in association with the shared storage unit) and Cluster File System play a significant role in preventing the failed nodes from accessing shared devices. For example, in a Sun Cluster system, disk fencing is done with a SCSI-2 reservation for dual-hosted SCSI devices, and a SCSI-3 PR environment for multi-hosted devices. The Veritas Advance Cluster uses a SCSI-3 persistent reservation to perform I/O fencing. In the case of Linux clusters, a cluster fencing system such as Polyserve or Sistina GFS is able to perform I/O fencing, using different methods like fabric fencing that employs a SAN access control mechanism. SCSI-3 PRSCSI-3 PR (persistent reservation) supports device access through multiple nodes, while at the same time blocking access to other nodes. SCSI-3 PR reservations are persistent across SCSI bus resets or node reboots, and they also support multiple paths from host to disk. For SCSI-2 disks, reservations are not persistent—they do not survive node reboots. SCSI-3 PR uses a concept of registration and reservation. Systems that participate, register a ‘key’ with the SCSI-3 device. Each system registers its own key. Registered systems can then establish a reservation. With this method, blocking write access is as simple as removing the registration from a device. When a system wishes to eject another system, it issues a ‘pre-empt and abort’ command, which ejects another node. Once a node is ejected, it has no key registered so it cannot eject others. This method effectively avoids the split-brain condition. Another benefit of the SCSI-3 PR method is that since a node registers the same key on each path, ejecting a single key blocks all I/O paths from that node. For example, SCSI-3 PR is implemented by EMC Symmetrix, Sun T3, and Hitachi Storage systems. In the case of SCSI-2 reservations, it works with only one path and one host. Arbitration through Quorum DisksIn the case of SCSI-2 reservations, the clusterware seeks to reserve a quorum disk to break the tie if the cluster splits. A quorum disk is a nominated device in the shared storage connected to the relevant nodes. The reservation is enacted as a SCSI-2 ioctl. The node that is granted the reservation causes the second attempt to fail. The SCSI-2 reservation ioctl is part of the SCSI-2 command set. This is commonly implemented in most modern disk firmware. However, the reservation call is neither persistent (i.e. capable of surviving reboots) nor able to cope with multiple paths to the same device. A quorum disk must be defined for a two-node cluster. This arrangement enables any single node that obtains the vote of the quorum disk to maintain majority and continue as a viable cluster. Clusterware forces the loosing node out of the cluster. Fabric FencingThe Polyserve matrix server (cluster file system), which is widely used on Linux clusters, implements node exclusion strategy by following the fabric-fencing approach. The Polyserve matrix server includes a storage control layer that uses a SAN access control mechanism to arbitrate which servers have access to which storage resources. This is achieved by turning off the fiber-channel ports to which the offending node is attached. Advantages of this approach include:
Exclusion with STOMITH approachThis method uses a network-controlled power switch to cut off a server’s power supply when it is no longer deemed to be a reliable member of the cluster. Some of the characteristics of this approach:
Linux clusters adopted this method in an earlier period of their growth, when the Linux SCSI reserve/release support was immature and not consistently implemented. Some of the problems with this method include:
In another example, Sistina GFS supports multiple, cascading I/O fencing methods including manual, network power control, and fiber channel switch zone control. Oracle’s Instance Membership RecoveryWe need to note that fault scenarios are more complicated than a generic cluster system. For example, when an instance dies or crashes, clusterware may not be aware of it. The quorum mechanism maintained by Oracle helps to manage this. This method is implemented by the IMR (Instance Membership Recovery). Each active instance in the cluster writes a bitmap to the control file. This is part of the checkpoint progress record. Thus, every instance has a membership vote. IMR perceives members as expired when they do not provide normal periodic heartbeats to the control file. All instances read this CFVRR (Control File Voting Results Record) to do arbitration. Cache Coherency and Lock ManagementOne of the most critical features for a parallel database is its ability to control global concurrency of the data (pages or blocks) located in the individual node’s cache. As each of the nodes has its own local cache containing current data blocks, their status and access need to be controlled globally. Another node’s cache might need to access concurrently. Blocks are moved frequently across the nodes when needed. Also, there should be effective and accurate monitoring of the status of blocks in the cache. Lock acquisition, lock release, and lock conversions should be performed at rapid speeds. Low latency and high speed communication between the nodes is an essential requirement. Since a data block can be present in the database buffers of more than one node, when an update occurs all other buffered copies become obsolete. Global cache control mechanisms invalidate the obsolete data blocks. Another important feature is the way in which the cache reconfiguration occurs when a node fails. To maintain the integrity of the data blocks, a failed instance’s resources need to be taken over or re-mastered by another node’s instance. For more information, see the book Oracle 11g Grid and Real Application Clusters - 30% off if you buy it directly from Rampant TechPress . Written by top Oracle experts, this RAC book has a complete online code depot with ready to use RAC scripts.
|
|
|