Disk Queue is often thought of as the first indicator of poor application performance, but it’s frequently blamed too early. The following explanation is a quick and dirty guide to some basic approaches for demystifying Disk Queue.
Let’s break this up into two parts: the basic hit-and-run survival guide for those of you who don’t have the time to read on, and a more in-depth look at why Disk Queue got to be a focal point in the first place.
The Survival Guide:
An Optical Prime project shows the number of outstanding IOs from the OS's perspective for each sample throughout the recording period. If Disk Queue is the problem, it should be very tightly associated with latency in the same period. So, from good to worse:
- Low Disk Queue and Low Latency = A likely happy application and user experience
- High Disk Queue and Low Latency = As long as latency stays desirable this should be OK
- Low Disk Queue and High Latency = Needs attention, but it is unlikely to be your storage
- High Disk Queue and High Latency = Should look at your storage as a potential bottleneck
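The four cases above can be sketched as a small decision function. This is only an illustration: the function name and both thresholds (a queue depth of 4 and a latency of 20 ms) are assumptions chosen for the example, not values reported by Optical Prime, and "desirable" latency depends on the application.

```python
def classify_sample(queue_depth, latency_ms,
                    queue_threshold=4.0, latency_threshold_ms=20.0):
    """Map one (Disk Queue, latency) sample to one of the four cases.

    Both thresholds are illustrative assumptions; tune them for your workload.
    """
    high_queue = queue_depth > queue_threshold
    high_latency = latency_ms > latency_threshold_ms

    if not high_queue and not high_latency:
        return "likely healthy"
    if high_queue and not high_latency:
        return "OK while latency stays low"
    if not high_queue and high_latency:
        return "needs attention, storage unlikely"
    return "investigate storage"


# One sample for each of the four cases.
for queue_depth, latency_ms in [(1, 5), (12, 5), (1, 45), (12, 45)]:
    print(queue_depth, latency_ms, "->", classify_sample(queue_depth, latency_ms))
```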
This last one is the one we want to investigate and here is where the value of Optical Prime representing performance over time comes into play. If Disk Queue is causing the Latency, one should see tightly correlated patterns between the two values.
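One rough way to put a number on "tightly correlated" is a Pearson correlation between the two series over the same samples. The sketch below assumes you have exported Disk Queue and latency as two equal-length lists; the sample values are made up for illustration, and a high coefficient only shows that the two move together, not which one causes the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples: queue depth and latency (ms) for the same intervals.
queue = [1, 2, 8, 12, 3, 1]
latency = [4.0, 6.0, 25.0, 40.0, 9.0, 5.0]

r = pearson(queue, latency)
print(f"correlation between Disk Queue and latency: {r:.2f}")
```

A value close to 1 means latency rises and falls with Disk Queue; a value near 0 suggests the latency is coming from somewhere else.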
Here is an entry that shows an example of good correlation between latency and Disk Queue.
To understand the basics of Disk Queue Length think of a check-out line at your local “Food-Mart.” Everyone knows the drill… you select your items, you get in line to be checked out, when it’s your turn you pay, and finally you own the item.
Everyone has also been there on holiday or late at night when the line is long and the poor check-out clerk has a queue of upset people saying, “Why doesn’t management just open up more check-out lanes!”
At a very basic level of definition, Disk Queue is the number of outstanding disk operations that are “waiting in line,” which is why it’s often looked at to indicate storage trouble.
We all know that adding more check-out clerks at Food-Mart would make the line fan out and go faster, and it does so because we increased our degree of parallel work. The same basic principle applies to IO requests. If I had only one disk in my server trying to do all that work, or even, let’s say, a small RAID 5, then the application could generate a workload demand where the check-out line for IOs gets backed up. This high Disk Queue phenomenon is called being “Spindle Bound.” Plainly put, the disks just can’t keep up with the demand, so a line forms, and that manifests itself as Latency to the Operating System.
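The check-out analogy can be turned into a back-of-the-envelope calculation. The sketch below is a deliberately simplified model, not how any real disk or array behaves: it assumes every IO takes the same service time, the outstanding IOs spread evenly across the disks, and each disk works on one IO at a time. The function name and the numbers are illustrative.

```python
import math


def drain_time_ms(outstanding_ios, service_ms, disks):
    """Time to clear a backlog of IOs under the simplified model above."""
    if disks < 1:
        raise ValueError("need at least one disk")
    # Each "round" lets every disk complete one IO in parallel.
    rounds = math.ceil(outstanding_ios / disks)
    return rounds * service_ms


# 32 outstanding IOs at 5 ms each: one clerk versus four clerks.
one_disk = drain_time_ms(32, 5.0, 1)    # 160.0
four_disks = drain_time_ms(32, 5.0, 4)  # 40.0
print(one_disk, four_disks)
```

The last IO in line waits for everything ahead of it, so the same backlog shows up as four times the latency on the single disk.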
The basic rule of thumb is that a Disk Queue of more than 2-4 is bad.
Easy, right? Well it gets more complicated unfortunately.
That rule is actually that a Disk Queue of more than 2-4 per disk is bad… the reason it gets hard is that Optical Prime doesn’t tell you how many disks make up that “F: Drive.”
Seems trivial enough, so why don’t we just grab the number of disks and call it a day? Well, we can’t always do that. Some drives are really just partitions, and an E: and F: drive might be on the same disk. Storage arrays themselves mask the truth even better.
Any external disk array that presents a Volume or a LUN to an Operating System could be masking any number of drives from the OS. For example, an EMC array could have a RAID group of 4 or 9 disks making up the LUN that is presented to the Windows OS as the “F: Drive”… so if we have a Disk Queue of 15, is it bad… or is it OK?
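Working the example through makes the ambiguity concrete. The sketch below divides the observed Disk Queue by the number of disks behind the LUN and compares the result to the 2-4 per disk rule of thumb. Treating anything inside the 2-4 band as "borderline" is an interpretation made for this example, and the function names are hypothetical.

```python
def queue_per_disk(total_queue, disk_count):
    """Average outstanding IOs per physical disk behind a volume."""
    if disk_count < 1:
        raise ValueError("need at least one disk")
    return total_queue / disk_count


def rule_of_thumb(per_disk, low=2.0, high=4.0):
    """Compare a per-disk queue to the 2-4 band (the labels are assumptions)."""
    if per_disk <= low:
        return "fine"
    if per_disk <= high:
        return "borderline"
    return "high"


# The same Disk Queue of 15 on a 1-, 4-, or 9-disk volume.
for disks in (1, 4, 9):
    per_disk = queue_per_disk(15, disks)
    print(f"{disks} disks: {per_disk:.2f} per disk -> {rule_of_thumb(per_disk)}")
```

With 4 disks the queue works out to 3.75 per disk, right at the top of the band; with 9 disks it is about 1.67 per disk, which the rule of thumb would not flag at all.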
Application Interference
Some applications can actually manage the Disk Queue or be responsive to it. As a management technique, applications such as SQL Server can throttle back their IO so as not to create too many outstanding IOs. If they see the Disk Queue climbing, they can mask the problem by not allowing it to get out of hand.
Data Patterns
Back at the check-out counter at Food-Mart… When the manager does finally wake up and opens 3 new check-out lanes, the people can fan out and go through the lines because their purchases are totally unrelated. Each person goes through the line totally independent of anyone else. Random IO is basically the same. Each operation just wants to finish as quickly as possible and doesn’t really care too much about anyone else.
Sequential data is the opposite and can be thought of more like a movie. A movie is a series of still frames that are played “in sequence” to give you the effect of a motion picture. The frames need to be played in order for the movie to make sense. (minus any Quentin Tarantino films, of course)
Often sequential IO can’t be effectively broken into parallel activity. Depending on the nature of the program running the sequential workload, you may or may not see increased Disk Queue and Latency, but you might see a similar correlation extending to the IO Transfer Size. To learn more about this, read the post on how IO Transfer Size can affect Latency.
Summary
Today, with SSDs and virtualized storage, the chances of disks being the bottleneck aren’t what they were when 15K RPM drives were the highest tier. Nonetheless, it’s something worth looking into every time you are hunting down a latency issue.
It’s almost easier to rule out disk as the cause of the latency than to find the actual cause. But at least you will have one less place to look :)