Summary
There are a number of common SAN problems that occur in deployed systems as well as pre-production environments. The most common problems, which are visible only with a good Fibre Channel analyzer, include credit starvation, undetected physical errors, file-system I/O splitting, and device bursting.

A hardware-based Fibre Channel protocol analyzer (such as the LeCroy SierraFC series) captures transmitted information from the physical layer of the Fibre Channel network. The analyzer is physically located on the network (rather than at a software re-assembly layer, as with many Ethernet analyzers). Well-designed Fibre Channel analyzers can monitor and display captured data from the lowest-level “native” transmitted state all the way up to the embedded upper-layer protocols, which for the storage industry are typically based on SCSI.

In order to maintain DC balance and signal lock for the differential receivers, Fibre Channel traffic is transmitted as encoded values called “symbols”. There are two encoding schemes used in Fibre Channel. For data rates up to 8G, each 8-bit byte is encoded into a 10-bit symbol; for 16G data rates, each 64-bit data block (8 bytes) is encoded into a 66-bit symbol to reduce overhead. A properly designed FC analyzer will capture the native traffic and be able to display the actual 10b or 66b symbols, as well as the higher-level decoded layers. For a more detailed discussion of the issues of designing analyzers to capture data transmission in the native state prior to decoding, see the article “Taking Full Advantage of 8b/10b Encoding in your USB 3.0 Designs” in the January 9, 2012 EE Times at http://i.cmpnet.com/audiencedevelopment/newsletters/-EmbeddedNL.html.
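The overhead difference between the two encoding schemes follows directly from the bit counts given above. The short Python sketch below works through that arithmetic; the function names are illustrative only and do not come from any analyzer software.

```python
def encoding_efficiency(payload_bits: int, encoded_bits: int) -> float:
    """Fraction of transmitted bits that carry payload data."""
    return payload_bits / encoded_bits


def overhead_percent(efficiency: float) -> float:
    """Share of the line rate spent on encoding, as a percentage."""
    return (1.0 - efficiency) * 100.0


# 8b/10b: each 8-bit byte becomes a 10-bit symbol (rates up to 8G).
eff_8b10b = encoding_efficiency(8, 10)

# 64b/66b: each 64-bit block becomes a 66-bit symbol (16G).
eff_64b66b = encoding_efficiency(64, 66)

print(f"8b/10b overhead:  {overhead_percent(eff_8b10b):.2f}%")
print(f"64b/66b overhead: {overhead_percent(eff_64b66b):.2f}%")
```

With these bit counts, 8b/10b spends 20% of the transmitted bits on encoding, while 64b/66b spends roughly 3%.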

Contrary to popular belief, Fibre Channel network devices, HBAs, switches, and storage subsystems are not capable of monitoring most SAN behavior patterns. Also, for a number of reasons, management tools that gather data from these devices are not necessarily made aware of problems occurring at the Fibre Channel physical, framing, or SCSI upper layers. They may collect or poll some information from the SAN environment, but can do little to pinpoint or re-create the error condition.

Fibre Channel devices spend the vast majority of their time dealing with the distribution and handling of incoming and outgoing data streams. When devices are under maximum loads, which is when problems most often occur, the device resources available for error reporting are typically at a minimum and are frequently inadequate for accurate error tracking. Also, Fibre Channel host bus adapters (HBAs) do not provide the ability to "sniff" raw network data, as is possible with many Ethernet network adapters.

Credit Starvation

Fibre Channel maintains strict flow control between devices by utilizing credits. Each credit received by a device allows that device to transmit one frame. In the FC protocol, when a frame is transmitted an R_RDY must be returned. In this manner credits are decremented and incremented during an exchange of data. When a device does not have credits available, it cannot transmit, and SAN performance can suffer significantly. This condition is commonly discussed under names such as bottleneck detection, high latency, or slow-drain devices.
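The credit mechanism described above can be pictured as a simple counter. The following Python sketch is a toy model, not an implementation of the Fibre Channel standard: the class name `CreditedPort` and its methods are hypothetical, and details such as login-time credit negotiation are left out.

```python
class CreditedPort:
    """Toy model of a transmitter governed by credit-based flow control."""

    def __init__(self, credits: int) -> None:
        self.credits = credits   # frames we may still send
        self.sent = 0            # frames actually transmitted
        self.stalled = 0         # send attempts blocked by lack of credit

    def try_send(self) -> bool:
        """Send one frame if a credit is available; otherwise stall."""
        if self.credits == 0:
            self.stalled += 1
            return False
        self.credits -= 1        # each transmitted frame consumes a credit
        self.sent += 1
        return True

    def receive_r_rdy(self) -> None:
        """An R_RDY from the receiver returns one credit."""
        self.credits += 1


port = CreditedPort(credits=4)
results = [port.try_send() for _ in range(5)]  # fifth attempt is blocked
port.receive_r_rdy()                           # receiver frees a buffer
resumed = port.try_send()                      # transmission can continue
print(results, resumed, port.stalled)
```

The fifth send attempt fails because all four credits are in use; transmission resumes only after an R_RDY arrives.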

Devices run out of credits for a number of reasons, including fabric congestion and the inability of a device to receive and process frames at Fibre Channel speeds. A number of factors affect this, including:

The number of available credits
Devices generally have a fixed number of credits. Many devices have eight credits, while some have 64 or more. Devices typically reserve half of their available credit buffers for frame transmission, while the other half is given out as credits. This means that the majority of devices on the market will only have four credits available for any given exchange.
Capabilities of the devices to quickly process data
A device has to be able to offload incoming data at the same rate that it is receiving it, or it will run out of credits. This can cause a domino effect: If one device runs out of credits, every device attempting to transmit data to it has to wait until credits become available again. Thus the others run out of credits as well. This is a typical Bottleneck scenario.
Link round-trip delays
The round-trip delay is affected not only by cable length, but also by the number of devices participating on a loop. Each device on a link adds a delay of approximately 0.5 microseconds, which is equivalent to adding 100 m of cable to the link per device. This may not seem like much, but 2 KB reads on an arbitrated loop (typical in database applications) can be degraded by more than 18% for every 100 m of cable delay.
Data loss during out-of-credit situations
The amount of data degradation due to out-of-credit situations will vary depending upon I/O sizes. Small I/O operations (512 bytes to 4 KB) are generally affected more than larger ones by out-of-credit situations, because the average frame size is smaller and more credits are necessary to sustain constant data flow.
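The figures in the list above can be combined into a rough back-of-the-envelope calculation. The Python sketch below takes the per-device delay and the cable equivalence from the text (0.5 microseconds per device, equal to 100 m of cable). The throughput bound is a simplified assumption that is not stated in the text: at most one frame per credit can be in flight during each round trip, and serialization time is ignored. All function names are illustrative.

```python
# From the text: 0.5 us (500 ns) of delay corresponds to 100 m of cable.
NS_PER_METER = 500 / 100


def usable_credits(total_buffers: int) -> int:
    """Half of the buffers are typically given out as credits."""
    return total_buffers // 2


def equivalent_cable_m(devices: int, per_device_ns: float = 500) -> float:
    """Cable length that would add the same delay as the loop devices."""
    return devices * per_device_ns / NS_PER_METER


def loop_delay_ns(cable_m: float, devices: int,
                  per_device_ns: float = 500) -> float:
    """Total delay from cable propagation plus per-device latency."""
    return cable_m * NS_PER_METER + devices * per_device_ns


def credit_limited_mb_per_s(credits: int, frame_bytes: int,
                            round_trip_ns: float) -> float:
    """Simplified upper bound: one frame per credit per round trip.

    bytes/ns equals 1000 MB/s, hence the factor of 1000.
    """
    return credits * frame_bytes / round_trip_ns * 1000


print(usable_credits(8))            # credits left out of eight buffers
print(equivalent_cable_m(4))        # four loop devices expressed as cable
print(credit_limited_mb_per_s(4, 2048, 10_000))  # hypothetical 10 us round trip
print(credit_limited_mb_per_s(4, 512, 10_000))   # same link, smaller frames
```

Under this simplified model, the same four credits sustain only a quarter of the throughput when the frame size drops from 2048 to 512 bytes, which illustrates why small I/O operations are hit harder by a shortage of credits.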

There are many factors that contribute to credit starvation. Several of these can be avoided by proper consideration of performance factors in the initial design of the SAN; however, it is often a good idea to analyze the data to test theory against reality.

Undetected Physical Errors

Undetected physical errors, which may not be visible at the user level, can be caused by a number of factors, including bad cable bends, termination problems, failing lasers, and faulty cables.

Take, for example, the 62.5-micron cable for FDDI or ATM that many facilities have installed in their infrastructure. Fibre Channel multimode lasers are designed for 50-micron cable, but will run on 62.5-micron cable as long as the distance is less than around 200 meters. Longer cable runs will cause intermittent errors, code violations, and other network problems. Because Fibre Channel analyzers view the actual network, they can be a critical component in troubleshooting this type of error.

As with the other "hidden behaviors," physical errors are often undetected by devices and management tools. Fibre Channel is defined in a set of standards that is fully redundant down to the ordered set level. In combination with the recovery methods for SCSI, many framing errors may be automatically recovered and retried at the lowest physical layers and never reported to management tools. In addition, devices can automatically discard and replace many link-level errors, such as code violations, which go unreported on the SAN.

Fibre Channel has many ways in which devices can recover from error situations. Existing SAN management tools traditionally look for link resets and CRC errors as indicators of problems, but these tools do not attempt to look for protocol-level error-recovery mechanisms in the traffic.

Devices are also not capable of seeing errors that they have transmitted due to a faulty SFP or cable; this leaves it up to the error-receiving device to report them.

When errors do get reported up to the operating system and/or file system, they can result in the SCSI subsystem being "throttled" to allow only one outstanding I/O at a time (per device). Since most enterprise servers rely heavily on overlapping I/Os for storage performance, this can drop throughput to a crawl. For most operating systems, the only recovery for this is to reboot the server.
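The impact of throttling to a single outstanding I/O can be estimated with Little's law, which relates throughput to the number of operations in flight and the latency of each one. The Python sketch below is only an illustration: the queue depth of 32 and the 500-microsecond latency are hypothetical values, and the model assumes that latency does not change with queue depth, which is a simplification.

```python
def max_iops(outstanding_ios: int, latency_us: int) -> float:
    """Little's law: throughput = operations in flight / latency."""
    return outstanding_ios * 1_000_000 / latency_us


LATENCY_US = 500                       # hypothetical per-I/O latency

normal = max_iops(32, LATENCY_US)      # hypothetical overlapped queue depth
throttled = max_iops(1, LATENCY_US)    # one outstanding I/O per device

print(f"normal:    {normal:.0f} IOPS")
print(f"throttled: {throttled:.0f} IOPS")
print(f"slowdown:  {normal / throttled:.0f}x")
```

Under these assumptions, throughput falls by a factor equal to the original queue depth, which is why a server that relies on overlapping I/Os slows so dramatically.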

For an example of inducing credit starvation for test purposes, you can use the LeCroy InFusion (“jammer”) application, which is available on both the SierraFC M8-4 and SierraFC M164. A screen shot is shown on the following page to help illustrate the simple scenario that will remove one R_RDY. A more complex scenario could be built that would remove multiple R_RDYs and simulate a credit starvation condition for your specific environment.
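The effect of removing R_RDYs can be previewed with a small simulation before building a jammer scenario. The Python sketch below is a toy model and does not use or represent the InFusion application: the function `run_link` and its parameters are hypothetical. It also assumes that each frame is answered by one R_RDY before the next frame is sent, and that no credit recovery or link reset takes place.

```python
def run_link(initial_credits: int, frames_to_send: int,
             dropped_r_rdy_indexes) -> tuple[int, int]:
    """Send frames while credits last; return (frames sent, credits left).

    dropped_r_rdy_indexes lists which R_RDYs (0-based, in order of
    return) are removed from the link before reaching the sender.
    """
    dropped = set(dropped_r_rdy_indexes)
    credits = initial_credits
    sent = 0
    r_rdy_index = 0
    while sent < frames_to_send and credits > 0:
        credits -= 1                    # transmitting a frame uses a credit
        sent += 1
        if r_rdy_index not in dropped:  # R_RDY arrives and restores it
            credits += 1
        r_rdy_index += 1                # a removed R_RDY is lost for good
    return sent, credits


print(run_link(4, 100, []))            # healthy link
print(run_link(4, 100, [10]))          # one R_RDY removed
print(run_link(4, 100, [0, 1, 2, 3]))  # every credit lost: sender starves
```

In this model, removing a single R_RDY leaves the sender running with one credit fewer, while removing as many R_RDYs as there are credits stops transmission entirely.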

Detecting Credit Starvation in Fibre Channel System Testing