Contents
Overview
Cluster Maintenance
Troubleshooting Cluster Service
Lab A: Cluster Maintenance
Review
Module 7: Server Cluster Maintenance and Troubleshooting
Information in this document is subject to change without notice. The names of companies,
products, people, characters, and/or data mentioned herein are fictitious and are in no way intended
to represent any real individual, company, product, or event, unless otherwise noted. Complying
with all applicable copyright laws is the responsibility of the user. No part of this document may
be reproduced or transmitted in any form or by any means, electronic or mechanical, for any
purpose, without the express written permission of Microsoft Corporation. If, however, your only
means of access is electronic, permission to print one copy is hereby granted.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual
property rights covering subject matter in this document. Except as expressly provided in any
written license agreement from Microsoft, the furnishing of this document does not give you any
license to these patents, trademarks, copyrights, or other intellectual property.
© 2000 Microsoft Corporation. All rights reserved.
Lead Technology Manager: Sid Benavente
Lead Product Manager, Content Development: Ken Rosen
Group Manager, Courseware Infrastructure: David Bramble
Group Product Manager, Content Development: Julie Truax
Director, Training & Certification Courseware Development: Dean Murray
General Manager: Robert Stewart
Instructor Notes
This module is intended to prepare the students to successfully back up and
restore a server cluster. Students also need to know how to use the tools
available for troubleshooting server cluster problems. The module covers
common Cluster service problems and possible resolutions.
After completing this module, you will be able to:
Perform the steps to successfully back up a server cluster.
Perform the steps to successfully restore a server cluster.
Evict a node from a server cluster.
Identify the tools that are necessary to troubleshoot a cluster failure.
Interpret the entries in the cluster log.
Identify and troubleshoot common server cluster failures: network
communications, small computer system interface (SCSI) configuration
problems, group, resource, and quorum failures.
Materials and Preparation
This section provides the materials and preparation tasks that you need to teach
this module.
Required Materials
To teach this module, you need the Microsoft® PowerPoint® file 2087A_02.ppt
• Backup: Backing up the system state backs up the cluster configuration
files; however, you also need to back up each node’s data and operating
system and the cluster disks.
• Restoring the First Node: The overall procedure for restoring a cluster is
outlined on this page. The first step, restoring the operating system on
the first node, is also covered. The remaining steps are covered in detail
on the following pages.
• Restoring Cluster Disks: Cluster service uses the disk signature file to
identify the cluster disk. To replace this disk, you must write the disk
signature file of the old disk onto the new disk.
• Restoring the Second Node: Restoring the remaining nodes of the cluster
is similar to restoring the first node, except that after it is restored, you
need to test the failover capabilities of the cluster before putting the
cluster back into the production environment.
• Evicting a Node: Evicting a node is a manual process through Cluster
Administrator. As always, it is important to have a good backup of the
server prior to the eviction process.
Troubleshooting Cluster Service
The key point of this section is to give the students the tools and techniques
that are useful in reducing the time it takes to find a root cause for common
Cluster service problems.
• Troubleshooting Tools: The tools that are used to help troubleshoot a
problem with Cluster service are the same tools that are used to help
troubleshoot a server running Microsoft Windows® 2000.
• Examining the Cluster Log: Cluster service logs every configuration
change and problem to the cluster log. It is important for the
students to become familiar with the syntax of the log.
procedures. Both nodes will uninstall Cluster service.
Overview
Cluster Maintenance
Troubleshooting Cluster Service
*****************************
ILLEGAL FOR NON-TRAINER USE******************************
Server cluster maintenance and troubleshooting are considered two separate
disciplines. Maintenance is continuous, whereas troubleshooting has a
beginning when the problem is discovered, and an end when the problem is
resolved. The two disciplines are complementary, however. When every
troubleshooting procedure that you follow fails, you will need to rebuild the
cluster from a backup tape that was generated during a maintenance procedure.
After completing this module, you will be able to:
Perform the steps to successfully back up a server cluster.
Perform the steps to successfully restore a server cluster.
Evict a node from a server cluster.
Identify the tools that are necessary to troubleshoot a cluster failure.
Interpret the entries in the cluster log.
Identify and troubleshoot common server cluster failures: network
communications, small computer system interface (SCSI) configuration
problems, group, resource, and quorum failures.
Topic Objective
To provide an overview of
the module topics and
objectives.
fundamental tasks for
maintaining a server cluster.
Lead-in
The only maintenance
performed on a cluster is
backing up and restoring
Cluster service.
Backup
Backing Up the System State
Backing Up the Local Disk
Backing Up the Cluster Disk
Backing up the cluster is no different from backing up Microsoft
Windows 2000 Advanced Server. It is recommended that you perform regular
backups by using the Windows 2000 Backup program (NTBackup), or other
compatible backup programs. Additional backup agents are still necessary to
back up applications running on the cluster, such as Microsoft SQL Server™
and Microsoft Exchange.
A cluster-aware backup program will be able to perform the same backup
operations as NTBackup, especially with regard to backing up the System State
and the cluster configuration database.
Backing Up the System State
Backup is essential, but regular testing to make sure that backups and
restores actually work as expected is also necessary. A good practice is to
schedule test backup and restore operations frequently.
Backing Up the Cluster Disks
It is critical to back up cluster files on the quorum disk and data on the cluster
disks, because Cluster service will write information to files in the
\mscs directory on the quorum disk and cluster-aware applications will likely be
placing data on the cluster disk. Because either node of the cluster could own
the cluster disk resource at any time, it is possible for each node to back up the
data on the drive. However, having each node back up data would require you
to install backup hardware and software on each cluster node, which is not the
best solution.
One possibility is to identify a nonclustered server running Windows 2000
Server and schedule it to back up data remotely through a network connection
to the cluster disk’s administrative share or a hidden share that you create. For
example, you might create FBackup$, GBackup$, HBackup$, and WBackup$
file share resources on the virtual server for the root of drives F, G, H, and W.
F, G, and H would be cluster disks with data, and W would be the drive letter
for the quorum disk. Hidden shares would not appear in a browse list and you
could configure them to allow access only to members of the Backup Operators
group.
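The share-naming scheme in the example above can be sketched in a few lines of Python. This is only an illustration of the naming pattern, not part of any backup tool; the virtual server name VSERVER1 and the helper function name are hypothetical.

```python
def backup_share_paths(virtual_server, drive_letters):
    """Map each cluster drive letter to the UNC path of its hidden backup share.

    Follows the naming used in the example above: drive F -> FBackup$.
    The trailing "$" is what keeps the share out of browse lists.
    """
    paths = {}
    for letter in drive_letters:
        letter = letter.upper()
        paths[letter] = "\\\\{}\\{}Backup$".format(virtual_server, letter)
    return paths


# VSERVER1 is a hypothetical virtual server name used only for illustration.
shares = backup_share_paths("VSERVER1", ["F", "G", "H", "W"])
for drive, unc in sorted(shares.items()):
    print(drive, unc)
```

A remote backup server would then be scheduled against these paths rather than against a physical node name, so that the job follows the disks regardless of which node currently owns them.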
Restoring the First Node
Steps For Restoring a Server Cluster:
1. Restore the first node
and covers the first step,
Restoring a Node. Details
about the other three steps
follow on the next pages.
Restoring a Node of the Cluster
To restore a node in a server cluster, you follow the same procedure that you
would use in restoring a Windows 2000 operating system.
1. Install a fresh copy of Windows 2000 Advanced Server on the node to be
restored.
2. Log on as Administrator and restore the system and boot partition, system
state, and associated volumes from the backup. Make sure that you select
the option to restore the system state to the original location in the backup
program.
3. Restart the node.
4. Perform the steps for restoring the cluster disk. These steps follow in the
next section.
Note: The difference between the time of the backup and the time of the
restoration to the new computer may affect the computer account on the domain
controller. You may have to join a workgroup and then rejoin the domain.
Restoring Cluster Disks
Restoring Disk Signature Files
Restoring the Data on the Cluster Disk
To describe how to restore
the cluster disk by restoring
signature files, data and
cluster configuration files.
Lead-in
Restoring a cluster disk
involves restoring the disk
signature file.
For Your Information
Be familiar with Q224075,
“Disk Replacement for
Windows 2000 Server
Cluster,” found on the
Student compact disk.
Restoring the Data on the Cluster Disk
Restoring the data on the cluster disk is the same as a restore of a local disk.
Before restoring the data, make sure that you have associated each cluster disk
to the same drive letter as before the disaster or failure. When restoring, make
sure that you restore the data to the original location and verify the integrity
after you have completed the restore.
Restoring the Cluster Configuration Files
The cluster configuration files include the cluster database and the quorum log.
The cluster database is the database of configuration data (cluster objects and
their settings) that is pertinent to the cluster. This database is the product of
the cluster registry key checkpoint and the changes that are recorded in the
quorum log. Every node of the cluster maintains a local copy of this database
in the cluster hive of the node's local registry.
restoring the first node of a cluster, except that you will not have to restore the
cluster disks.
Performing Node Testing
Testing the failover and failback policy is recommended before putting the
cluster back into production.
1. Verify that the disk and cluster resources are available on the correct node.
2. Fail over each group and resource to verify that they can successfully start
on the other node of the cluster.
3. Test the failback policy of each resource by allowing the resource to fail
back to a preferred owner after the node has come back online.
Topic Objective
To describe how to restore
the second or remaining
nodes of a cluster and test
the failover and failback
policies.
Lead-in
The last step in restoring the
cluster is to restore the
second node and then test
the components of the
cluster.
Evicting a Node
Steps for Evicting a Node
1. Back up both nodes
2. Verify backup
3. Move all groups to the remaining node
You must first evict a node
from the cluster to add a
new node to the cluster.
Troubleshooting Cluster Service
Troubleshooting Tools
Examining the Cluster Log
Troubleshooting Network Communications
SCSI Configuration Problems
Group and Resource Failures
Quorum Log Corruption
Troubleshooting a problem with Cluster service can be more complex than
troubleshooting a single server because of the virtual servers and the need for
intracluster communications. Virtual servers change ownership from one node
to another, which may cause network connectivity problems. Applications
running on the cluster are difficult to troubleshoot, because they are running on
a virtual server instead of a physical server. Node-to-node communication
problems can also occur, because cluster nodes must work together, whereas
stand-alone servers work independently of each other. You might also experience
hardware problems with the shared bus and the cluster disk resources.
The most common failures are due to improper configurations within groups
and resources. Cluster service will fail if the quorum log becomes corrupt. It is
important to know how to repair the quorum log to restart the cluster.
When troubleshooting Cluster service, you can use the same tools and
methodologies that you would when troubleshooting Windows 2000 Advanced
Server.
Cluster service writes logging information to the system log of every node in
the cluster. Cluster service also writes a more detailed log of cluster activity to
the cluster log on each node. Use these two sources to gather information when
you begin troubleshooting a problem. You will be able to determine whether
the problem is related to the network, to services or applications, or to physical
components in the cluster.
Use Event Viewer to filter the system log on the event source ClusSvc. You
can then view general events, such as whether Cluster service failed to join the
cluster on this node or successfully created a cluster on this node.
After you have determined the type of problem, you can use the following tools
to search for the source of the problem. You must check each node individually
when using any of these tools.
Disk Manager. Check Disk Manager to determine the health of the cluster
disk. You can check whether the operating system recognizes the disks, and
whether the cluster disks are basic or dynamic. You also need to verify
that the drive letters of the cluster disks are the same on both nodes.
Task Manager. You can verify that Cluster service is running in Microsoft
Windows 2000 Task Manager. You can also use Task Manager as a
performance monitor, although it does not provide the level of detail that a
dedicated performance monitor does. In Task Manager, you can check the
CPU utilization percentage and the memory usage on the node.
through the services snap-in to ensure that the default properties have not
changed. Verify that Cluster service:
• Is set to start automatically.
• Is set to log on as the designated domain service account.
• Is set to restart after a failure.
Make sure that the following four services have started:
• Network Connections (Network Connections has a Remote Procedure
Call (RPC) dependency)
• RPC
• Windows Management Instrumentation Driver Extensions
• Windows Time
Examining the Cluster Log
Copy of cluster - Wordpad
Creates a new cluster group
000003b8.000003b4::2000/10/02-19:44:12.946 [CS] Cluster Service started – Cluster Node Vers
000003b8.000003b4::2000/10/02-19:44:12.946 OS Version 5.0.21
000003b8.000002f0::2000/10/02-19:44:12.957 [CS] Service Starting…
000003b8.000002f0::2000/10/02-19:44:13.007 [EP] Initialization…
000003b8.000002f0::2000/10/02-19:44:13.057 [DM]: Initialization
000003b8.000002f0::2000/10/02-19:44:13.097 [DM]: Loading cluster database form D:\WINNT\clu
000003b8.000002f0::2000/10/02-19:44:13.397 [DM] DmpStartFlusher: Entry
000003b8.000002f0::2000/10/02-19:44:13.397 [DM] DmpStartFlusher: thread created
000003b8.000002f0::2000/10/02-19:44:13.427 [NM] Initializing…
000003b8.000002f0::2000/10/02-19:44:13.427 [NM] Local node name = SERVER1.
000003b8.000002f0::2000/10/02-19:44:13.427 [NM] Local node ID = 1.
000003b8.000002f0::2000/10/02-19:44:13.427 [NM] Creating object for node 1 (SERVER1)
000003b8.000002f0::2000/10/02-19:44:13.437 [NM] Initializing networks.
The cluster log is enabled by default when you install Cluster service, but will
not start logging information until after the first restart of the node. Cluster log
output is written to %SystemRoot%\Cluster\Cluster.log, and you can view it
with Microsoft WordPad.
Setting the Logging Level
You can set one of four logging levels for the cluster log. The default level
is 2, which logs enough information for normal troubleshooting. To set a
different logging level, click Start, point to Settings, click Control Panel,
and then double-click the System icon. On the Advanced tab, click Environment
Variables, and then create a system variable named ClusterLogLevel with a value
of 0, 1, 2, or 3, where 0=no logging, 1=errors only, 2=errors and warnings, and
3=everything that happens.
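As a quick reference, the four levels can be expressed as a small lookup. The following is a minimal Python sketch, assuming only the values and meanings listed above; the function name is hypothetical, and the script merely interprets the variable's value, it does not change how Cluster service logs.

```python
import os

# Meanings of the ClusterLogLevel values, as listed in the text above.
LOG_LEVELS = {
    0: "no logging",
    1: "errors only",
    2: "errors and warnings",
    3: "everything that happens",
}
DEFAULT_LEVEL = 2  # level used when the variable has not been created


def describe_log_level(raw_value=None):
    """Return (level, description) for a ClusterLogLevel setting.

    raw_value is the string stored in the environment variable. When it is
    None, the variable is treated as unset and the default level applies.
    """
    if raw_value is None:
        return DEFAULT_LEVEL, LOG_LEVELS[DEFAULT_LEVEL]
    level = int(raw_value)
    if level not in LOG_LEVELS:
        raise ValueError("ClusterLogLevel must be 0, 1, 2, or 3")
    return level, LOG_LEVELS[level]


# Interpret the variable as seen by the current process, if it is set.
print(describe_log_level(os.environ.get("ClusterLogLevel")))
```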
Setting the Log File Size
The log file defaults to a maximum size of 8 megabytes (MB). When the log
file reaches 8 MB, new entries start overwriting the existing data in the file.
To specify a larger file size, add the registry entry ClusterLogSize under
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters.
ClusterLogSize has a type of DWORD and specifies the maximum size of the log
file in MB. If this value is set to 0, logging is disabled.
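The rules for this value can be summarized in a short check. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes that 1 MB means 1,048,576 bytes, which the text above does not state. It validates a proposed value against the DWORD range and describes its effect; it does not touch the registry.

```python
MAX_DWORD = 0xFFFFFFFF  # largest value that a 32-bit DWORD can hold


def describe_log_size(size_mb):
    """Explain the effect of a proposed ClusterLogSize value (in MB)."""
    if not isinstance(size_mb, int) or not 0 <= size_mb <= MAX_DWORD:
        raise ValueError("ClusterLogSize must be a DWORD (0 to 4294967295)")
    if size_mb == 0:
        return "logging disabled"
    # Assumption: 1 MB is taken as 1024 * 1024 bytes.
    return "log file limited to {} MB ({} bytes)".format(
        size_mb, size_mb * 1024 * 1024
    )


print(describe_log_size(8))   # the default maximum size
print(describe_log_size(0))   # the value that disables logging
```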
Topic Objective
To learn how to use the
cluster log to troubleshoot
Cluster service problems.
Lead-in
The cluster log is the best
source of information that
you have available to
troubleshoot a cluster.
Delivery Tip
In the following example, [NM] indicates the component that wrote the event to
the cluster log; in this case, NM stands for node manager.
378.380::1999/06/09-18:00:50.881 [NM] Forming cluster membership.
Resource DLL Log Entries
The following example is a cluster log entry for a resource DLL event. This
example is one of the entries from the disk arbitration process.
15c.458::1999/06/09-18:00:47.897 Physical Disk <Disk D:>: [DISKARB] Arbitration Parameters (1 9999).
Instead of listing an abbreviated component name between the timestamp and
the event description, as component log entries do, entries describing resource
DLL events list the following information:
Resource type (Physical Disk)
Resource name (<Disk D:>)
The event description in this example is [DISKARB] Arbitration Parameters
(1 9999).
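The two entry formats shown above can be told apart mechanically. The following Python sketch parses the two sample entries from this section into their fields. It is a simplified illustration inferred from those samples, not a complete cluster log parser. The field names are hypothetical, and reading the two hexadecimal numbers before the double colon as process and thread IDs is an assumption that this section does not state.

```python
import re
from datetime import datetime

# Prefix common to every entry: <hex>.<hex>::<date>-<time> <rest of entry>
ENTRY = re.compile(
    r"^(?P<pid>[0-9a-fA-F]+)\.(?P<tid>[0-9a-fA-F]+)::"
    r"(?P<stamp>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<rest>.*)$"
)
# Component entry: [NM] message
COMPONENT = re.compile(r"^\[(?P<component>[A-Za-z]+)\]:?\s*(?P<message>.*)$")
# Resource DLL entry: Resource Type <Resource Name>: message
RESOURCE = re.compile(r"^(?P<rtype>.+?) <(?P<rname>[^>]*)>:\s*(?P<message>.*)$")


def parse_entry(line):
    """Split one cluster log line into a dict, or return None if it does not match."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    entry = {
        "process_id": match.group("pid"),  # assumed meaning, see lead-in
        "thread_id": match.group("tid"),   # assumed meaning, see lead-in
        "time": datetime.strptime(match.group("stamp"), "%Y/%m/%d-%H:%M:%S.%f"),
    }
    rest = match.group("rest")
    component = COMPONENT.match(rest)
    if component:
        entry.update(kind="component",
                     component=component.group("component"),
                     message=component.group("message"))
        return entry
    resource = RESOURCE.match(rest)
    if resource:
        entry.update(kind="resource",
                     resource_type=resource.group("rtype"),
                     resource_name=resource.group("rname"),
                     message=resource.group("message"))
        return entry
    entry.update(kind="other", message=rest)
    return entry


samples = [
    "378.380::1999/06/09-18:00:50.881 [NM] Forming cluster membership.",
    "15c.458::1999/06/09-18:00:47.897 Physical Disk <Disk D:>: "
    "[DISKARB] Arbitration Parameters (1 9999).",
]
for sample in samples:
    print(parse_entry(sample))
```

Sorting parsed entries by the time field, or filtering on the component or resource name, is a convenient way to follow a single resource through a failover.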
Troubleshooting Network Communications
Troubleshooting Node-to-Node Communication
Verify RPC Communications
Verify Cluster Heartbeats
Troubleshooting Client-to-Node Communications
Check NetBT Cache with Nbtstat
Ping IP Address
WINS Static Mappings
To describe how to
troubleshoot node-to-node
and client-to-node
communication.
Lead-in
Depending on the symptom,
you may have to
troubleshoot node-to-node
or client-to-node
communications.
Verifying Cluster Heartbeats
As with RPC communication, to verify that cluster heartbeats are occurring
between the nodes of a cluster, you must use a network capture utility.
Cluster service uses User Datagram Protocol (UDP) port 3343 to send
heartbeats on the network. Use Network Monitor to capture traffic on port 3343
and verify that both nodes of the cluster are sending and receiving cluster heartbeats.
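Once a capture has been reduced to a list of packets, the check described above amounts to confirming that each node appears as both a sender and a receiver of UDP traffic to port 3343. The following Python sketch shows that logic under stated assumptions: the tuple layout is a hypothetical simplification rather than the export format of Network Monitor or any other capture tool, and the node addresses are made up for illustration.

```python
HEARTBEAT_PORT = 3343  # UDP port that Cluster service uses for heartbeats


def heartbeat_status(packets, node_addresses):
    """Report, per node, whether it sent and received heartbeat packets.

    packets is an iterable of (protocol, source_ip, dest_ip, dest_port)
    tuples. This record layout is a hypothetical simplification.
    """
    status = {addr: {"sent": False, "received": False} for addr in node_addresses}
    for protocol, source_ip, dest_ip, dest_port in packets:
        if protocol.upper() != "UDP" or dest_port != HEARTBEAT_PORT:
            continue  # not a heartbeat packet
        if source_ip in status:
            status[source_ip]["sent"] = True
        if dest_ip in status:
            status[dest_ip]["received"] = True
    return status


# Hypothetical private-network addresses for the two nodes.
nodes = ["10.0.0.1", "10.0.0.2"]
capture = [
    ("UDP", "10.0.0.1", "10.0.0.2", 3343),
    ("UDP", "10.0.0.2", "10.0.0.1", 3343),
    ("TCP", "10.0.0.1", "10.0.0.2", 135),
]
for node, flags in heartbeat_status(capture, nodes).items():
    print(node, flags)
```

A node that shows sent without received, or the reverse, points to a one-way communication problem on the network between the nodes.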
Troubleshooting Client-to-Node Communications
After a failover occurs, clients must still be able to gain access to a cluster, even
though they will be accessing a different node. Clients must be able to
resolve any cluster network names so that they always connect to the node
on which the resources are online. If clients cannot connect to virtual servers,
verify that:
The client is accessing the cluster by using the correct network name or IP
address.
The client has the Transmission Control Protocol/Internet Protocol (TCP/IP)
protocol correctly installed and configured.
SCSI Configuration Problems
SCSI Controllers
SCSI Termination
SCSI Cabling
If hardware fails, you may have to replace hardware
components of the cluster. If you replace components in the SCSI subsystems,
you need to make sure that the new SCSI configurations conform to the
following guidelines.
SCSI Controllers
SCSI IDs
Each device on the shared SCSI bus must have a unique SCSI ID. Most SCSI
controllers default to SCSI ID 7. Therefore, you must change the SCSI ID for
one of the controllers on the shared SCSI bus to something other than ID 7.
Boot Time SCSI Bus Reset
Cluster service uses SCSI bus resets, but in a controlled way during a
membership regroup operation. Some SCSI controllers reset the SCSI bus when
they initialize at start time, before Windows 2000 is loaded. If the SCSI
controllers reset the SCSI bus, the bus reset can interrupt any data transfers
between the other node and drives on the shared SCSI bus. Therefore, you
should disable automatic SCSI bus resets, if possible,
access the quorum disk.
On-Card Termination
Many SCSI controllers provide on-card termination; however, the on-card
termination does not provide termination when the computer is not turned on.
On-card termination only becomes an issue when external terminators are not
used. When using external terminators, the on-card termination should be
disabled.
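The unique-ID guideline for the shared SCSI bus lends itself to a simple consistency check when you document a replacement configuration. The following Python sketch flags SCSI IDs that are assigned to more than one device on the same bus. The device names and the function name are hypothetical, and the device list would have to be gathered by hand from the controller and drive settings.

```python
from collections import defaultdict


def find_scsi_id_conflicts(devices):
    """Return {scsi_id: [device names]} for IDs used by more than one device.

    devices is an iterable of (name, scsi_id) pairs for one shared SCSI bus.
    """
    by_id = defaultdict(list)
    for name, scsi_id in devices:
        by_id[scsi_id].append(name)
    return {scsi_id: names for scsi_id, names in by_id.items() if len(names) > 1}


# Both controllers left at the common default of ID 7: a conflict.
before = [("Node1 controller", 7), ("Node2 controller", 7), ("Disk 1", 0)]
# One controller moved to ID 6 (an arbitrary free ID chosen for illustration).
after = [("Node1 controller", 7), ("Node2 controller", 6), ("Disk 1", 0)]

print(find_scsi_id_conflicts(before))
print(find_scsi_id_conflicts(after))
```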
SCSI Cabling
Tri-Link or Y-cable SCSI
Connectors
Attaching Y-cables or tri-link connectors to the back of
the SCSI controllers at each end of the bus is one
method that you can use to allow the SCSI bus to
remain terminated even when one node is turned off.
These components allow you to use external
terminators that will continue to provide termination if
a node is turned off. You must ensure that the SCSI
cards in the nodes are not providing termination when
using these connectors.
Long Cables
It is very common to have multiple external SCSI
drives on the shared SCSI bus. When configuring
multiple external drives, it is very important not to
exceed the maximum combined cable length that the
controller manufacturer recommends. The SCSI
specifications specify the maximum combined cable
length when using different types of cabling. If the
manufacturer of the controller recommends a shorter