The Good, the Bad, and the Ugly – Part 3 “The Ugly”
In the first part of this three-part blog I introduced the SMI project and provided some reasons why it is a good thing. In the second installment I discussed some of the challenges with the SMI project, and in this last installment I will discuss the ugly parts. I will focus on the Block Services Performance (BSP) Sub-profile, as that is our domain area.
The BSP Sub-profile is the component of the SMI specification that describes the various physical and logical components and their associated attributes and metrics. The BSP is well designed, but as mentioned in Part 2 "The Bad," the conformance standard for the BSP is set so low that vendors can provide an implementation that conforms to the standard and is mostly useless at the same time.
Like the character Tuco, this type of implementation and the resulting data is just plain ugly.
I’m going to provide a few examples of the types of useless but conforming implementations (plus one good one for contrast):
Example 1: Insufficient metrics to conduct performance analysis. Some vendors have chosen to conceal response times that are available natively through their CLIs or APIs. This leads to the following ElementType 6 (Front-end Port Data) example:
| Description | ElementName | ElementType | KBytesTransferred | SampleInterval | TotalIOs |
|---|---|---|---|---|---|
| $VENDOR_BlockStatisticalDataFCPort | $VENDOR_BlockStatisticalDataFCPort | 6 | 22,423 | 00000000000300.000000:000 | 723 |
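As an aside, the SampleInterval above is a CIM datetime interval in the format ddddddddhhmmss.mmmmmm:000, so the value 00000000000300.000000:000 means 3 minutes. A minimal Python sketch of a parser (the function name is my own, not part of any SMI-S library):

```python
def parse_cim_interval(value: str) -> float:
    """Parse a CIM datetime interval (ddddddddhhmmss.mmmmmm:000) into seconds."""
    digits, rest = value.split(".")
    days = int(digits[0:8])
    hours = int(digits[8:10])
    minutes = int(digits[10:12])
    seconds = int(digits[12:14])
    micros = int(rest.split(":")[0])  # microseconds precede the ":000" suffix
    return days * 86400 + hours * 3600 + minutes * 60 + seconds + micros / 1_000_000

print(parse_cim_interval("00000000000300.000000:000"))  # → 180.0
```

So the two counters in the table above were accumulated over a 3-minute sample.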
In addition to the StatisticTime field, which is removed for brevity, you can see in the table that there are only two metrics: KBytesTransferred and TotalIOs. So a vendor that is not inspired to do much with regard to performance is compliant by supplying only a single ElementType (in this case the front-end port) and just two metrics. Additionally, the conformance test does not verify whether the counters are actually incrementing, as it has no awareness of workload. In summary, you can conform to the standard and provide absolutely no useful information. Alternatively, using the vendor's native interface in the example above, you can gather response times for each port. Port response times are a true indicator of port congestion and overall utilization; without them, performance analysis of ports is a whole lot of guesswork.
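A workload-aware check for non-incrementing counters is not hard to sketch. The snippet below (field names taken from the table above; the sample data is made up) flags any counter whose value never changes across successive samples taken under load:

```python
def stuck_counters(samples, fields=("TotalIOs", "KBytesTransferred")):
    """Return the fields whose values never change across the samples,
    a sign the provider reports static (and therefore useless) counters."""
    stuck = []
    for field in fields:
        distinct_values = {sample[field] for sample in samples}
        if len(distinct_values) <= 1:  # never incremented
            stuck.append(field)
    return stuck

# Two samples taken while a benchmark was driving I/O to the port:
samples = [
    {"TotalIOs": 723, "KBytesTransferred": 22_423},
    {"TotalIOs": 723, "KBytesTransferred": 22_423},  # unchanged under load
]
print(stuck_counters(samples))  # → ['TotalIOs', 'KBytesTransferred']
```

A conformance suite that ran a known workload and applied a check like this would reject providers that merely echo static values.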
Example 2: A good implementation of ElementType 6. Unlike the first example, which showed a conformant but poorly implemented ElementType 6, the following table shows an incredible range of per-storage-port statistics, ranging from read and write response times to remote connection response times to fabric errors. This is a good example of the type of information that is useful and exactly what performance analysts need.
| Counters | |
|---|---|
| InstanceID | SCSIReadTimeAccumulated |
| ElementType | SCSIWriteTimeOnChannel |
| StatisticTime | MaxKB |
| TotalIOs | MaxMS |
| AbbreviatedID | MaxCount |
| SCSIReadOperations | FCLinkFailureErrorCount |
| SCSIWriteOperations | FCLossSynErrorCount |
| PPRCReceivedOperations | FCLossSignalErrorCount |
| PPRCSendOperations | FCPrimitiveSequenceErrorCount |
| ECKDReadOperations | FCInvalidTransmissionWordCount |
| ECKDWriteOperations | FCCRCErrorCount |
| SCSIBytesRead | FCLRSentCount |
| SCSIBytesWritten | FCLRReceivedCount |
| PPRCBytesReceived | FCIllegalFrameCount |
| PPRCBytesSend | FCOutOfOrderDataCount |
| ECKDBytesRead | FCOutOfOrderACKCount |
| ECKDBytesWritten | FCDupFrameCount |
| ECKDAccumulatedReadTime | FCInvalidRelativeOffsetCount |
| ECKDAccumulatedWriteTime | FCSequenceTimeoutCount |
| PPRCReceivedTimeAccumulated | FCBitErrorRateCount |
| PPRCSendTimeAccumulated | |
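Counters like these let you derive exactly the response times that Example 1 concealed: the average read response time per port falls out of the interval deltas of SCSIReadTimeAccumulated and SCSIReadOperations. A minimal sketch (the time unit is vendor-specific; milliseconds are assumed here, and the figures are invented):

```python
def avg_response_time(time_accumulated_delta: float, operations_delta: int) -> float:
    """Average per-I/O response time over an interval, from the deltas of an
    accumulated-time counter and an operation counter. Unit follows the
    accumulated-time counter (assumed milliseconds here)."""
    if operations_delta == 0:
        return 0.0  # no I/O in the interval, so no meaningful response time
    return time_accumulated_delta / operations_delta

# Hypothetical interval: SCSIReadTimeAccumulated grew by 7,500 ms
# while SCSIReadOperations grew by 2,500 reads.
print(avg_response_time(7_500, 2_500))  # → 3.0 (ms per read)
```

This is the kind of derived metric that turns raw counters into an actual congestion indicator.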
Example 3: Ugly data. Ugly data can mean a lot of things, but what I mean is this: in fifteen years of looking at performance data from all kinds of sources (application, OS, middleware, database), I have never seen so many problems with data, ranging from missing data and counters that don't increment to incorrect column headers and counters that wrap.
In this example, ElementType 10 (Disk) data is shown. The table reflects two intervals for a single disk. To calculate how much activity occurred, you determine the delta between the values at Time 2 and the values at Time 1. Intuitively, the sum of the KBRead delta (1,256) and the KBWrite delta (258) should equal the measured total delta of 3,057. In this case the calculated sum is 1,514, which does not match the measured total of 3,057.
| Time | KBRead | KBWrite | KB Total (Measured) |
|---|---|---|---|
| Time 1 | 14,715,076 | 2,316,916 | 33,662,806 |
| Time 2 | 14,716,332 | 2,317,174 | 33,665,863 |
| Delta (T2-T1) | 1,256 | 258 | 3,057 |
| Calculated Total (Read+Write) | 1,256 + 258 = 1,514 | | |
Sometimes the vendors can help explain how to interpret the data. The good news for us is that we can add more interpretation in our product, sanitizing the numbers and providing more meaningful results.
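The sanitizing step can be sketched in a few lines: compute wrap-safe deltas between samples and flag intervals where the component counters disagree with the measured total. The counter width is an assumption (64-bit here); the figures are from the table above:

```python
def counter_delta(curr: int, prev: int, width: int = 64) -> int:
    """Wrap-safe difference between two readings of a monotonically
    increasing fixed-width counter."""
    d = curr - prev
    if d < 0:          # the counter wrapped around its fixed width
        d += 1 << width
    return d

t1 = {"KBRead": 14_715_076, "KBWrite": 2_316_916, "KBTotal": 33_662_806}
t2 = {"KBRead": 14_716_332, "KBWrite": 2_317_174, "KBTotal": 33_665_863}

read_d = counter_delta(t2["KBRead"], t1["KBRead"])     # 1,256
write_d = counter_delta(t2["KBWrite"], t1["KBWrite"])  # 258
total_d = counter_delta(t2["KBTotal"], t1["KBTotal"])  # 3,057

# Sanity check: the component deltas should add up to the measured total.
consistent = (read_d + write_d == total_d)
print(read_d + write_d, total_d, consistent)  # → 1514 3057 False
```

An interval that fails the check can then be flagged or corrected rather than silently passed through to the analyst.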
These are just a few examples of the ugly results of minimal conformance requirements and poor implementations. For these ugly results to be eliminated from SMI-S, all the major storage vendors need to:
1) Recognize that their native interfaces are not always the primary means of platform management. Do not provide a small subset of the data from the native interfaces; include the same counters in both the native and SMI-S interfaces. Or, ideally, abandon their native interfaces as the primary means of platform management and instead adapt their tools to leverage SMI-S for platform management and data collection (kudos to those that already have).
2) Take pride in the performance of the hardware and the associated measurements. Invest enough time and energy in this space to provide high-quality support for the BSP above and beyond the minimum requirements. This will show the masses that they have high-quality products and that they are not afraid to enable accurate measurement via an open standard. Design the metrics to allow problem diagnosis, rather than to hide problems.
3) Embrace changes to the BSP that put more teeth in the standard and increase the requirements for compliance such that the data becomes more broadly usable.
For more information on the standard, see "The Storage Management Initiative (SMI)."