Troubleshooting Cluster Shared Volume Auto-Pauses – Event 5120

In the previous post http://blogs.msdn.com/b/clustering/archive/2014/10/27/10567706.aspx we have discussed how CSVFS abstracts failures from applications by going through the pause/resume state machine and we have also explained what the auto-pause is. Focus for this blog post will be auto-pauses.

CSV auto pauses when it receives any failure from Direct IO or Block Redirected IO with a few exceptions like STATUS_INVALID_USER_BUFFER, STATUS_CANCELLED, STATUS_DEVICE_DATA_ERROR or STATUS_VOLMGR_PACK_CONFIG_OFFLINE, which indicate either user error or that storage is misconfigured. In both cases there is no value in trying to abstract the failure in CSV because as soon as the IO is retried it will get the same error.

When File System Redirected IO fails (including any metadata IO) then CSV auto pauses only when error is one of the well knows status codes. Here is the list that we have as of Windows Server Technical Preview for vNext:

STATUS_BAD_NETWORK_PATH
STATUS_BAD_NETWORK_NAME
STATUS_CONNECTION_DISCONNECTED
STATUS_UNEXPECTED_NETWORK_ERROR
STATUS_NETWORK_UNREACHABLE
STATUS_IO_TIMEOUT
STATUS_CONNECTION_RESET
STATUS_CONNECTION_ABORTED
STATUS_NO_SUCH_DEVICE
STATUS_DEVICE_DOES_NOT_EXIST
STATUS_VOLUME_DISMOUNTED
STATUS_NETWORK_NAME_DELETED
STATUS_VOLMGR_VOLUME_LENGTH_INVALID
STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR
STATUS_LOGON_FAILURE
STATUS_NETWORK_SESSION_EXPIRED
STATUS_CLUSTER_CSV_VOLUME_DRAINING
STATUS_CLUSTER_CSV_VOLUME_DRAINING_SUCCEEDED_DOWNLEVEL
STATUS_DEVICE_BUSY
STATUS_DEVICE_NOT_CONNECTED
STATUS_CLUSTER_CSV_NO_SNAPSHOTS
STATUS_FT_WRITE_FAILURE
STATUS_USER_SESSION_DELETED

This list is based on our experience and many years of testing, and includes status codes that you would see when communication channel fails or when storage stack is failing. Please note that this list evolves and changes as we discover new scenarios that we can help to make more resilient using auto pause. This list contains status codes that indicate communication/authentication/configuration failure, status codes that indicate that NTFS/disk on coordinating node are failing, and few CSV specific status codes.

There are also few cases when CSV might auto-pause itself to handle some inconsistency that it observes in its state or when it cannot get to the desired state without compromising data correctness. An example would be when file is opened from multiple computers, and on write from one cluster node CSV needs to purge cache on another node, and that purge fails because someone has locked pages then we would auto-pause to see if retry would avoid the problem. In these cases you might see auto-pauses with status codes like STATUS_UNSUCCESSFUL or STATUS_PURGE_FAILED or STATUS_CACHE_PAGE_LOCKED.

  Troubleshooting Cluster Shared Volume Auto Pauses – Event 5120

When CSV conducts an auto pause, an event 5120 is written to the System event log.  The description field will contain the specific status code that resulted in the auto pause.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Event ID:      5120
Task Category: Cluster Shared Volume
Level:         Error
Description:
Cluster Shared Volume ‘Volume1′ (‘Cluster Disk 1′) is no longer available on this node because of ‘STATUS_VOLUME_DISMOUNTED(C000026E)’. All I/O will temporarily be queued until a path to the volume is reestablished.

Additional information is available in the CSV operational log channel Microsoft-Windows-FailoverClustering-CsvFs/Operational.  This can be found in Event Viewer under ‘Applications and Services Logs \ Microsoft \ Windows \ FailoverClustering-CsvFs \ Operational’.  Here is an Event 9296 logged to that channel:

Log Name:      Microsoft-Windows-FailoverClustering-CsvFs/Operational
Source:        Microsoft-Windows-FailoverClustering-CsvFs-Diagnostic
Event ID:      9296
Task Category: Volume Autopause
Level:         Information
Keywords:      Volume State
Description:
Volume {ca4ce06f-6bAE-4405-b328-fd9d123469b3} is autopaused. Status 0xC000026E. Source: Tunneled metadata IO

Event Xml:

   
    0xffffe000badfb1b0
    {CA4CE06F-6B06-4405-B058-FD9D1CF869B3}
    Volume1me3
    false
    0xffffcf800fb72990
    0xc000026e
    11
    0x0
    0x0
 

In addition to status code the Event 9296 will contain the source of the auto-pause, and in some cases may contain additional parameters helping to further narrow down the scenario. Here is the complete list of sources.

  1. Unknown
  2. Tunneled metadata IO
  3. Apply byte range lock on down-level file system
  4. Remove all byte range locks
  5. Remove byte range lock
  6. Continues availability resume complete
  7. Continues availability resume complete for paging file object
  8. Continues availability set bypass
  9. Continues availability suspend handle on close
  10. Stop buffering on file close
  11. Remove all byte range locks on file close
  12. User requested
  13. Purge on oplock break
  14. Advance VDL on oplock break
  15. Flush on oplock break
  16. Memory allocation to stop buffering
  17. Stopping buffering
  18. Setting maximum oplock level
  19. Oplock break acknowledge to CSV filter
  20. Oplock break acknowledge
  21. Downgrade buffering asynchronous
  22. Oplock upgrade
  23. Query oplock status
  24. Single client notification complete
  25. Single client notification stop oplock

One of common auto-pause reasons is STATUS_IO_TIMEOUT, because of intra-cluster communication over the network.  This is happening when SMB client observes that an IO is taking over 1-4 minutes (depending on IO type). If IO times out then SMB client would attempt to fail IOs to another channel in multichannel configuration or if all channels are exhausted then it would fail IO back to the caller.

 You can learn more about SMB multichannel in the following blog posts

Configuring IP Addresses and Dependencies for Multi-Subnet Clusters
http://blogs.msdn.com/b/clustering/archive/2011/01/05/10112055.aspx

Configuring IP Addresses and Dependencies for Multi-Subnet Clusters – Part II
http://blogs.msdn.com/b/clustering/archive/2011/01/19/10117423.aspx

Configuring IP Addresses and Dependencies for Multi-Subnet Clusters – Part III
http://blogs.msdn.com/b/clustering/archive/2011/08/31/10204142.aspx

Force network traffic through a specific NIC with SMB multichannel
http://blogs.msdn.com/b/emberger/archive/2014/09/15/force-network-traffic-through-a-specific-nic-with-smb-multichannel.aspx

 Troubleshooting Cluster Shared Volume Auto Pauses – Event 5120

On the diagram above you can see two node cluster, Node 2 is coordinator node. Let’s say Application running on Node 1 issued an IO or metadata operation that CSVFS forwarded to NTFS over SMB (follow the red path on the diagram above). Any of the components along the red path (network, file system drivers attached to NTFS, volume and disk drivers, software and hardware on the storage box, firmware on the disk) can take a long time. Once SMB Client sent IO it starts a timer. If IO does not complete in 1-4 minutes then SMB Client will suspect that there might be something wrong with the network. It will disconnect the socket and would retry all IOs using another socket on another channel. If all channels were tried then IO would fail with STATUS_IO_TIMEOUT. In case of CSV there are some internal controls (for example oplock request) that cannot be simply retried on another channel so SMB Client would fail them back to CSVFS, which would trigger an auto pause with STATUS_IO_TIMEOUT.

Please note that CSVFS on the coordinating node would not use SMB to communicate to NTFS so these IOs would not complete with STATUS_IO_TIMEOUT from SMB Client.

The next question is how we can find what operation is taking time, and why?

First please note that auto pause with STATUS_IO_TIMEOUT would be reported on a non-coordinating node (Node 1 on the diagram above) while IO is stuck on the coordinating node (Node 2 on the diagram above).

Second please note that the nature of the issue we are dealing with is a hang, and in this case traces are not particular helpful because in the traces it is hard to tell what activity took time, and where it was stuck. We found two approaches to be helpful when troubleshooting this sorts of issues:

  1. Collect a dump file on the coordinating node while hanging IO is in flight. There are number of options how you can create a dump file from most brutal:

    1. Bugchecking your machine using sysinternals notmyfault (http://technet.microsoft.com/en-us/sysinternals/bb963901)
    2. Configuring KD and using sysnternals livekd (http://technet.microsoft.com/en-us/sysinternals/bb897415.aspx )
    3. Windbg. In fact this approach was so productive that starting from Windows Server Technical Preview cluster on observing an auto-pause due to STATUS_IO_TIMEOUT on non-coordinating node would collect kernel live dump on the coordinating node. We can open dump file using windbg (http://msdn.microsoft.com/en-us/library/windows/hardware/ff551063(v=vs.85).aspx ) and try to find out what IO is taking long time and why.
  2. On the coordinating node keep running Windows Performance Toolkit (http://msdn.microsoft.com/en-us/library/windows/apps/dn391696.aspx ) session with wait analysis enabled (http://channel9.msdn.com/Shows/Defrag-Tools/Defrag-Tools-43-WPT-Wait-Analysis ). When non-coordinating node auto pauses with STATUS_IO_TIMEOUT stop WPT session and collect etl file. Open etl using WPA and try to locate the IO that is taking long time, the thread that is executing this IO and what this thread has been blocked on. In some cases it might be helpful to also keep WPT sampling profiler enabled in cases if thread that is handling IO is not stuck forever, but periodically makes some forward progress.

The reason for STATUS_IO_TIMEOUT might very from software, configuration, to hardware issue. Always check your system event log for events indicating HBA or disk failures. Make sure you have all the latest updates.

Recommended hotfixes and updates for Windows Server 2012 R2-based failover clusters
http://support.microsoft.com/kb/2920151

Recommended hotfixes and updates for Windows Server 2012-based failover clusters
http://support.microsoft.com/kb/2784261

Make sure your storage and disks have latest firmware supported for your environment. If it is not going away then troubleshoot it using one of the ways described above and analyze the dump or trace.

It is expected that you may at times see Event 5120’s in the System event log, I would suggest not to worry about infrequent 5120’s as long it is happening once is a while (once a month or once a week), if cluster recovers from that, and you do not see workload failures. But I would suggest to monitor them and do some data mining for frequency and type (source and status code) of auto pauses.  In some scenarios, an Event 5120 may be expected.  This blog is an example of when an Event 5120 is expected during snapshot deletion:  http://blogs.msdn.com/b/clustering/archive/2014/02/26/10503497.aspx  

For instance if you see that frequency of auto pauses increased after certain date then check perhaps you have installed or enabled certain feature that was not on or was not used before.

You might be able to correlate auto-pause with some other activity that was happening on one of the cluster nodes around the same time. For example backup or antivirus scan.

Or perhaps you see auto pause is happening only when certain node is coordinating. Then there might be some issue with hardware on that node.

Or perhaps physical disk is going bad causing failure then try to look for storage errors in the system event log and query disk resiliency counters using powershell

Get-PhysicalDisk | Get-StorageReliabilityCounter |  ft DeviceId,ReadErrorsTotal,ReadLatencyMax,WriteErrorsTotal,WriteLatencyMax -AutoSize

The list above is not exhaustive, but might give you some idea on how to approach the problem.

In this blog post we went over possible causes for the event 5120, what they might mean and how to approach troubleshooting. Windows Server has plenty of tools that would help you with troubleshooting 5120. Keep in mind that 5120 does not mean that your workload failed. Most likely cluster will successfully recover from that failure, and your workload will keep running. If recovery was not successful when you see event 5142, and that will be the subject of the next post.

Thanks!
Vladimir Petter
Principal Software Engineer
Clustering & High-Availability
Microsoft

To learn more, here are others in the Cluster Shared Volume (CSV) blog series:

Cluster Shared Volume (CSV) Inside Out
http://blogs.msdn.com/b/clustering/archive/2013/12/02/10473247.aspx
 
Cluster Shared Volume Diagnostics
http://blogs.msdn.com/b/clustering/archive/2014/03/13/10507826.aspx

Cluster Shared Volume Performance Counters
http://blogs.msdn.com/b/clustering/archive/2014/06/05/10531462.aspx

Cluster Shared Volume Failure Handling
http://blogs.msdn.com/b/clustering/archive/2014/10/27/10567706.aspx

This entry passed through the Full-Text RSS service – if this is your content and you’re reading it on someone else’s site, please read the FAQ at fivefilters.org/content-only/faq.php#publishers.
Want something else to read? How about ‘Grievous Censorship’ By The Guardian: Israel, Gaza And The Termination Of Nafeez Ahmed’s Blog

Clustering and High-Availability