Brett Allison

Encoding errors – are they distracting noise or important information that shouldn’t be ignored? There’s so much noise in a big fabric that it can be hard to know what should be investigated. In this blog, we discuss the meaning of a few of the common errors we see, the risk of ignoring them, and the best way to address them.

Encoding Errors in Your SAN Fabric

Encoding errors are low-level errors that indicate encoding disparity inside frames. They occur at the Fibre Channel FC-1 layer, where data is encoded from 8 bits to 10 bits and back (8b/10b) or, with 10G and 16G FC, from 64 bits to 66 bits and back (64b/66b).
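As a rough illustration of the two encoding schemes mentioned above, the following sketch computes how many of the transmitted bits carry data in each case. The function name `encoding_efficiency` is made up for this example and is not part of any Fibre Channel tool.

```python
from fractions import Fraction

def encoding_efficiency(data_bits: int, line_bits: int) -> Fraction:
    """Fraction of the bits on the wire that carry data."""
    return Fraction(data_bits, line_bits)

# Bit ratios for the two schemes named in the text.
schemes = {"8b/10b": (8, 10), "64b/66b": (64, 66)}

for name, (data_bits, line_bits) in schemes.items():
    efficiency = encoding_efficiency(data_bits, line_bits)
    overhead = 1 - efficiency
    print(f"{name}: {float(efficiency):.1%} data, {float(overhead):.1%} encoding overhead")
```

Every encoded block carries extra bits, and an encoding error means a received block did not decode to a valid pattern.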

Because these errors occur on bits that are part of a data frame, they are counted under this metric. If the Cyclic Redundancy Check (CRC) errors increase as well, there is likely a physical link issue, which could be resolved by cleaning connectors, replacing a cable, or replacing the small form-factor pluggable (SFP) transceiver and/or Host Bus Adapter (HBA). If there are no CRC errors, it is likely an issue with the cable only.

Figure 1 below shows the top 20 ports with Encoding Errors. The offending port, slot4 port6, is highlighted in the legend in bold.

Figure 1: Encoding Errors (top 20)
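A "top 20" view like Figure 1 can be approximated by ranking ports on their error counts. The sketch below is a minimal example under stated assumptions: the function `top_ports` and the sample counts are hypothetical and are not taken from the figure or from any product API.

```python
from typing import Dict, List, Tuple

def top_ports(error_counts: Dict[str, int], n: int = 20) -> List[Tuple[str, int]]:
    """Return the n ports with the highest error counts, largest first."""
    ranked = sorted(error_counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# Hypothetical sample data; only the port name "slot4 port6" comes from the text.
sample_counts = {
    "slot4 port6": 1520,
    "slot2 port1": 3,
    "slot7 port12": 0,
    "slot1 port9": 41,
}

for port, count in top_ports(sample_counts, n=3):
    print(f"{port}: {count}")
```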

 

In this case, we received an alert about the encoding errors as well as the CRC errors. Looking at the chart of CRC errors over time, shown in Figure 2, we noticed the same pattern.

Figure 2: Cyclic Redundancy Check (CRC) Errors

 

Figure 2 shows the number of CRC errors. Encoding errors might lead to a CRC error; however, this metric counts frames that were marked as invalid because of a CRC error earlier in the datapath.

According to the FC specifications, it is up to the implementer whether to discard the frame right away or to mark it as invalid and send it to the destination anyway. There are pros and cons to both approaches.
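The two policies described above can be sketched as follows. This is a simplified model, not real Fibre Channel framing: `zlib.crc32` stands in for the frame CRC, and the `Frame` class, `make_frame`, and `receive` are hypothetical names introduced only for this example.

```python
import zlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class Frame:
    payload: bytes
    crc: int               # CRC value carried with the frame
    invalid: bool = False  # set when a hop forwards a frame it knows is bad

def make_frame(payload: bytes) -> Frame:
    """Build a frame whose CRC matches its payload."""
    return Frame(payload, zlib.crc32(payload))

def receive(frame: Frame, discard_on_error: bool) -> Optional[Frame]:
    """Check the CRC, then apply one of the two policies to a bad frame."""
    if zlib.crc32(frame.payload) == frame.crc:
        return frame
    if discard_on_error:
        return None           # policy 1: discard the frame right away
    frame.invalid = True      # policy 2: mark it invalid and forward it anyway
    return frame

good = make_frame(b"SCSI data")
corrupted = Frame(b"SCSI dat\x00", good.crc)  # payload damaged in transit

print(receive(good, discard_on_error=True))
print(receive(corrupted, discard_on_error=True))
print(receive(corrupted, discard_on_error=False))
```

Under the second policy, a downstream port sees a frame that is already marked invalid, which is why a CRC counter can increase on a port whose own link is healthy.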

Essentially, if you see CRC errors, the port has received a frame with an incorrect CRC, but the corruption occurred further upstream. If the encoding errors increase as well, there is a physical link issue, which could be resolved by cleaning connectors, replacing a cable, or (in rare cases) replacing the SFP and/or HBA.
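The rules of thumb in this article can be collected into a small triage helper. This is only a sketch of the reasoning in the text: the function name `triage` and the idea of feeding it per-interval counter increases are assumptions, and real troubleshooting would take more signals into account.

```python
def triage(encoding_increase: int, crc_increase: int) -> str:
    """Map counter increases on one port to the likely cause described in the text."""
    if encoding_increase > 0 and crc_increase > 0:
        return ("physical link issue: clean connectors, replace the cable, "
                "or (rarely) replace the SFP and/or HBA")
    if encoding_increase > 0:
        return "likely a cable issue"
    if crc_increase > 0:
        return "bad frame received, corrupted further upstream in the datapath"
    return "no errors in this interval"

print(triage(encoding_increase=120, crc_increase=15))
print(triage(encoding_increase=120, crc_increase=0))
print(triage(encoding_increase=0, crc_increase=15))
```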

Fibre Channel design best practice stipulates that each host maintains at least two paths to its data so that redundancy is preserved at all times. If one path fails, the redundancy is gone and the host is vulnerable to losing its connection to the data.
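A redundancy check along these lines could flag hosts that have dropped below two healthy paths. The sketch assumes path health is already available as a list of booleans per host; the function `hosts_at_risk`, the host names, and the data layout are hypothetical.

```python
from typing import Dict, List

def hosts_at_risk(paths: Dict[str, List[bool]], minimum: int = 2) -> List[str]:
    """Return the hosts that have fewer than `minimum` healthy paths."""
    return sorted(
        host for host, path_states in paths.items()
        if sum(path_states) < minimum  # True counts as 1, False as 0
    )

# Hypothetical inventory: True means the path is healthy.
sample_paths = {
    "hostA": [True, True],
    "hostB": [True, False],   # one path down: no redundancy left
    "hostC": [False, False],  # both paths down: host has lost access
}

print(hosts_at_risk(sample_paths))
```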

In this case, we started to receive these errors along with a few others, and we alerted the customer to check the cable and SFP. From 10:00 AM on 11/11/2019 to 11:00 AM on 11/12/2019, while the host initiator port was throwing these errors, the host had only a single working path to the fabric. Fortunately, the other path was functional, but during this period the availability risk for this server was high.

On 11/12, the cable connection was inspected in the data center and found to be loose. Someone may have been working on the fabric and bumped the cable. It was reconnected tightly at around 11:00 AM, after which the errors stopped and datapath availability was restored.
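To put a number on the exposure in this incident, the time spent on a single path can be computed from the two timestamps given above. The sketch reads the dates as month/day/year and treats the approximate 11:00 AM fix time as exact, both of which are assumptions.

```python
from datetime import datetime

# Timestamps from the incident, read as month/day/year (an assumption).
errors_started = datetime(2019, 11, 11, 10, 0)
cable_reseated = datetime(2019, 11, 12, 11, 0)  # "around 11:00" in the text

exposure = cable_reseated - errors_started
exposure_hours = exposure.total_seconds() / 3600

print(f"Time without path redundancy: {exposure_hours:.0f} hours")
```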

Avoid Risks in your Fabric Environment

In this blog, we looked at some of the key indicators of whether your fabric is behaving well. It is essential to monitor your fabric and alert on errors, because they are often a warning that something is about to fail or a sign that something has already failed.

It is important to understand the errors so you can treat them according to their severity and impact. To do this effectively, you must collect the right data and set up appropriate alerting mechanisms.
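One simple alerting mechanism is to compare error counters between two polls and alert on the increase rather than on the raw value. The sketch below assumes the counters are cumulative and may be reset to zero between polls. The function names, the threshold, and the sample values are all hypothetical.

```python
from typing import Dict, List

def counter_increases(previous: Dict[str, int], current: Dict[str, int]) -> Dict[str, int]:
    """Per-port increase between two polls of cumulative error counters."""
    increases = {}
    for port, value in current.items():
        before = previous.get(port, 0)  # a port not seen before starts from 0
        # A value lower than the last poll suggests the counter was reset,
        # so count everything accumulated since the reset.
        increases[port] = value - before if value >= before else value
    return increases

def ports_to_alert(increases: Dict[str, int], threshold: int = 1) -> List[str]:
    """Ports whose increase meets or exceeds the alert threshold."""
    return sorted(port for port, delta in increases.items() if delta >= threshold)

# Hypothetical polls of an encoding-error counter.
last_poll = {"slot4 port6": 100, "slot2 port1": 7, "slot1 port9": 50}
this_poll = {"slot4 port6": 460, "slot2 port1": 7, "slot1 port9": 2}

deltas = counter_increases(last_poll, this_poll)
print(deltas)
print(ports_to_alert(deltas, threshold=10))
```

Alerting on increases rather than totals helps filter out ports that logged errors long ago and have been clean since.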

Are you monitoring your fabric for these types of issues? Do you have a way to filter out the false positives? Do you understand what all the errors signify, and which ones can be ignored? If you would like a review of the health of your SAN fabric, or would like to engage IntelliMagic for proactive monitoring of your fabric, please let us know how we can help by sending an email to info@intellimagic.com.

This article's author

Brett Allison
VP of Operations