Being Comfortable with Page Faults
This entry will deal with Page Faults (PFs) as they apply to Live Optics and how to make use of the data.
There are two types of PFs, but Live Optics tracks only one of them.
Soft Page Faults
When a program tries to access a piece of data that is already in memory, but the entry that maps the program to that data isn’t cataloged, we call this a Soft Page Fault. This is like knowing that a house is in a particular city, but not having the street address to locate it.
Soft Page Faults are relatively fast for an Operating System to resolve, so we have decided to focus on the more intrusive type, called Hard Page Faults.
Hard Page Faults
Alternatively, when a program tries to access a piece of data and the data isn’t in memory at all, the data must be retrieved from disk into memory.
This has more involved consequences. First, the OS needs to retrieve the data, which can create 1 or more read IOs from disk. Second, there needs to be a place in memory to store that data.
If there is a free location to store the data, then the data is brought into memory for use by the application. If there is no free location, then some data currently in memory must be written out to free up space. This can create 1 or more write IOs to disk. In other words, the OS must “swap” data in memory for data that needs to be moved into memory.
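The sequence above can be sketched as a small simulation. This is only an illustration, not how any particular OS is implemented: the function name simulate_hard_faults is hypothetical, the least-recently-used eviction rule is an assumption, and the model assumes that only pages modified while in memory cost a write IO when they are evicted.

```python
from collections import OrderedDict


def simulate_hard_faults(accesses, frame_count):
    """Count hard faults and the disk IOs they trigger.

    accesses: iterable of (page_id, is_write) tuples.
    frame_count: number of memory locations (frames) available.
    Returns (faults, read_ios, write_ios).
    """
    frames = OrderedDict()  # page_id -> modified flag, least recently used first
    faults = read_ios = write_ios = 0
    for page, is_write in accesses:
        if page in frames:
            # Data is already in memory: no hard fault.
            frames.move_to_end(page)
            frames[page] = frames[page] or is_write
            continue
        faults += 1
        read_ios += 1  # the data must be read in from disk
        if len(frames) >= frame_count:
            # No free location: evict the least recently used page (assumed policy).
            _, modified = frames.popitem(last=False)
            if modified:
                write_ios += 1  # modified data is written out before reuse
        frames[page] = is_write
    return faults, read_ios, write_ios


# Two frames, four accesses; page 1 is modified, evicted, then needed again.
print(simulate_hard_faults([(1, True), (2, False), (3, False), (1, False)], 2))
```

With only two frames, every access in this example faults, and the eviction of the modified page adds one write IO on top of the four reads. Give the same workload three frames and the last access is served from memory.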
Is it Bad? Is it Normal?
Page Faults are a very normal part of the OS behavior. This is especially true for Windows environments where Page Faulting will be very common. Linux-based OSes will also Page Fault, but at a much lower rate by design.
As a rule, Windows OSes will Page Fault frequently and it’s normal. Often the rate can even spike to very high values.
As a rule, Linux OSes will also Page Fault; however, the values will be much lower and much less frequent.
How to spot a problem and what to do
The following three examples will help you understand what might be an area of concern: one general warning sign, plus two patterns that point to known issues.
High PF Values
Linux machines will generally run in the 100s of Page Faults per second and sometimes in the low 1,000s at a spike. Anything in the Linux/Unix world that deviates from this should warrant your attention.
Windows, on the other hand, will often be in the 1,000s or higher. There is no definitive rule here, but the bottom line is that you must use some common sense. If a machine is faulting 5,000 times per second for minutes and it has 5 spinning hard drives, then you are potentially sending a punishing amount of activity to disk that is going to steal from your performance capabilities.
The first place to look if you feel this machine is suspect is the memory counters themselves. If the machine is running at 100% memory, then the suggestion might simply be that it needs more memory! It’s not a bad thing to use all your memory, but if you are using all of it because you don’t have enough, then that might be your issue.
However, if the machine is not using all its memory, then you might want to look at the memory allocated to the application itself. There is free memory to use, so why not give it to the applications that need it?
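To put rough numbers on that example, here is a back-of-the-envelope sketch. The function name per_disk_io_load is hypothetical, and it assumes each hard fault costs one IO and that the load spreads evenly across the drives. The 150 IOPS figure for a spinning drive is an assumed placeholder for illustration only, so substitute the rating of your own drives.

```python
def per_disk_io_load(faults_per_sec, disk_count, ios_per_fault=1):
    """IOs per second each disk absorbs if faults spread evenly across disks."""
    return faults_per_sec * ios_per_fault / disk_count


# Assumed, illustrative capability of one spinning drive (not a measured value).
ASSUMED_SPINNING_DISK_IOPS = 150

load = per_disk_io_load(5000, 5)
print(load)  # 1000.0 IOs per second per disk
print(round(load / ASSUMED_SPINNING_DISK_IOPS, 1))  # multiple of the assumed capability
```

Under these assumptions, each drive is being asked for several times what it can deliver, and that is before the application’s own IO is counted.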
Long Running or “Table Top” PF Graphs
Servers can absorb spikes, often even fairly large ones. The most commonly seen issue will be long running or “table top” patterns in your graphs. This means that the applications have sustained issues with trying to make data available in memory. This often means the application doesn’t have enough memory allocated, or perhaps even that the system isn’t able to sustain the rate at which data is being swapped.
The former can be solved by allocating more physical or logical memory. The latter can be addressed by determining where this swap space is allocated and what resources it has.
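If you have exported the PF samples and want to separate a “table top” from a harmless spike, one simple approach is to look for the longest run of consecutive samples above a level you care about. This is a minimal sketch, not a Live Optics feature: the function names, the threshold, and the minimum run length are all hypothetical and should be tuned to your environment.

```python
def longest_sustained_run(samples, threshold):
    """Length of the longest run of consecutive samples at or above threshold."""
    longest = current = 0
    for value in samples:
        if value >= threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # the run is broken, start counting again
    return longest


def looks_like_table_top(samples, threshold, min_run):
    """True if PF values stay at or above threshold for at least min_run samples."""
    return longest_sustained_run(samples, threshold) >= min_run


spike = [100, 9000, 120, 90]
plateau = [100, 5000, 200, 4000, 4200, 4100, 4300, 150]
print(looks_like_table_top(spike, 3000, 4))    # False
print(looks_like_table_top(plateau, 3000, 4))  # True
```

A single very high sample does not trip the check, while four samples in a row above the threshold do, which matches the idea that servers absorb spikes but struggle with sustained load.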
Sloping PF Graph of DOOM!
OK, pop quiz time!
When would you have an ever-increasing number of Page Faults that never seems to recover?
Well, if you guessed a memory leak, then you would be right. When a poorly-written application fails to manage memory correctly and never releases information it no longer needs, you will have an ever-increasing consumption of memory. Once memory is full, the system will be forced to get new data from outside of memory at an increasing rate.
Eventually, this problem self-corrects… and not in a positive way!
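One way to put a number on that slope, assuming you have the PF samples at a fixed interval, is a plain least-squares fit of the values against their position in the series. This is a sketch only: the name trend_slope is hypothetical, and deciding how steep a slope is worth investigating is left to your judgment.

```python
def trend_slope(samples):
    """Least-squares slope of samples against their index.

    The result is the average change in PF value per sample interval.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


print(trend_slope([100, 200, 300, 400, 500]))  # 100.0, climbing steadily
print(trend_slope([50, 50, 50]))               # 0.0, flat
```

A slope that stays positive over a long window, with no recovery, is the pattern to be suspicious of. Spikes that rise and fall again tend to average out to a slope near zero.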