By Brett Allison

This blog was originally published on October 22, 2015.

One of our customers recently came across a problem in their environment that I think warrants some attention. The VMware administrator had gone to the storage team and asked whether they saw any issues in the fabric or storage environment, because the infamous “state in doubt” message was popping up in the /var/log/vmkernel log file on one of their ESX hosts. The messages were similar to what is shown below:

<YYYY-MM-DD>T<TIME> esx12 vmkernel: 116:03:44:19.039 cpu4:4100)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device “sym.029010111831353837”state in doubt; requested fast path state update…

The error indicated that the HBA had timed out a command because it took longer than 5 seconds to complete.

Typically this implies a problem somewhere along the SAN path from your physical ESX host/HBA to the back-end storage, including:

  • Fabric F ports to which the host is connected 
  • Fabric ISLs (E ports) if in path 
  • Storage ports 

VMware has an excellent article in its knowledge base with more information about this error (1022026).
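To gauge how often the warning appears and which devices are affected, the vmkernel log can be scanned with a short script. The following is a minimal sketch, not a VMware tool: the function name `count_state_in_doubt` and the sample lines are hypothetical, and the pattern assumes the message wording shown earlier in this post.

```python
import re
from collections import Counter

# Matches the NMP warning quoted in this post. The device name may be wrapped
# in straight or curly quotes, with or without a space before "state".
STATE_IN_DOUBT = re.compile(r'NMP device\s*["“]([^"”]+)["”]\s*state in doubt')


def count_state_in_doubt(lines):
    """Return a Counter mapping device name -> number of 'state in doubt' warnings."""
    counts = Counter()
    for line in lines:
        match = STATE_IN_DOUBT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines modeled on the message in this post, not real log data.
    sample = [
        'vmkernel: WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: '
        'NMP device "sym.029010111831353837" state in doubt; requested fast path state update',
        'vmkernel: an unrelated message',
        'vmkernel: WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: '
        'NMP device "sym.029010111831353837" state in doubt; requested fast path state update',
    ]
    for device, n in count_state_in_doubt(sample).most_common():
        print(device, n)
```

Against a real host, the same function could be fed the lines of the vmkernel log file; a device that dominates the counts is a reasonable place to start tracing the path.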

Being the savvy storage engineers that they are, our customer did a quick check of the connected ports using IntelliMagic Vision’s topology views for the ESX host associated with this VM. 

Figure 1: ESX host topology view


Invalid Transmission Word Errors 

IntelliMagic Vision Port Errors identified issues with Invalid Transmission Words for the ESX Host Port 1 as demonstrated in Figure 2: 

Figure 2: Bad SFP – Invalid Transmission Word Errors 


A transmission word consists of four 10-bit codes that must be sent in a precise format. If they are not in the correct format, the switch detects that the transmission word is invalid. This can happen for a number of reasons, such as a faulty cable, a bad Small Form-factor Pluggable (SFP) transceiver, or a cable being temporarily unplugged. If it happens once in a while, a cable may simply have been unplugged. If it happens continually, however, there is an underlying issue that should be addressed.
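The distinction between occasional and continual errors can be made concrete with a small check over per-interval error counts. This is only an illustrative sketch: the function name `classify_port_errors` and the 50% cutoff are assumptions chosen for the example, not values taken from IntelliMagic Vision or from any switch vendor.

```python
def classify_port_errors(deltas, continual_fraction=0.5):
    """Classify a series of per-interval invalid transmission word counts.

    deltas: error counts, one per sampling interval.
    continual_fraction: assumed cutoff; if at least this share of intervals
    contains errors, the port is treated as erroring continually.
    """
    if not deltas:
        return "no data"
    intervals_with_errors = sum(1 for d in deltas if d > 0)
    if intervals_with_errors == 0:
        return "clean"
    if intervals_with_errors / len(deltas) >= continual_fraction:
        return "continual"
    return "occasional"


if __name__ == "__main__":
    # Hypothetical counts: one burst (e.g. a cable reseated) vs. a persistent problem.
    print(classify_port_errors([0, 0, 7, 0, 0, 0, 0, 0]))  # occasional
    print(classify_port_errors([12, 30, 8, 0, 25, 19]))    # continual
```

A fraction of intervals is used rather than a raw total so that a single large burst is not mistaken for a persistent fault.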

In this case, the only visible symptoms were “state in doubt” errors in the VMware error log and the red border in the IntelliMagic Vision Fabric Errors Report. If left unresolved, the situation could have degraded further and resulted in significant performance and connectivity impacts for all hosts connected to the offending port.

Because of these errors, the team decided to replace the SFP on this port. The graph shows that this was the right decision: the Invalid Transmission Word errors ended abruptly on 4/19/2023 at 4:19 PM when the SFP was replaced. After that, both the “state in doubt” errors and the Invalid Transmission Words on the switch port stopped.
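The same before-and-after comparison that the graph makes visually can be expressed as a simple sum on either side of the change. The sketch below assumes error counts are available as (timestamp, count) pairs; the function name `errors_before_and_after` and the sample counts are hypothetical, and only the replacement time comes from this post.

```python
from datetime import datetime


def errors_before_and_after(samples, change_time):
    """Sum error counts on either side of a change.

    samples: iterable of (timestamp, error_count) pairs, one per interval.
    Returns (total_before, total_after); a sample taken exactly at
    change_time is counted as 'after'.
    """
    before = after = 0
    for timestamp, count in samples:
        if timestamp < change_time:
            before += count
        else:
            after += count
    return before, after


if __name__ == "__main__":
    sfp_replaced = datetime(2023, 4, 19, 16, 19)
    # Hypothetical per-interval counts, for illustration only.
    samples = [
        (datetime(2023, 4, 19, 15, 0), 140),
        (datetime(2023, 4, 19, 15, 30), 95),
        (datetime(2023, 4, 19, 16, 0), 210),
        (datetime(2023, 4, 19, 16, 30), 0),
        (datetime(2023, 4, 19, 17, 0), 0),
    ]
    print(errors_before_and_after(samples, sfp_replaced))  # (445, 0)
```

A total of zero after the change, sustained over a reasonable observation window, supports the conclusion that the replaced component was the cause.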

SAN storage infrastructure is complicated because there are so many components. IntelliMagic Vision can help reduce the visibility gaps between hosts and connected storage by providing deep insights, practical drill downs and specialized domain knowledge for your SAN environment.

The short video below demonstrates how we allow you to visualize all of the components from the VMware host through the fabric to the storage volume.


To learn more about IntelliMagic’s support for VMware and our Topology Viewer, visit intellimagic.com/vmware.

This article's author

Brett Allison
VP of Operations