Continuously Learning: Windows OS Performance Guidelines

These are notes from a workshop I attended a couple of weeks ago on Windows Performance Monitor (perfmon) and monitoring vital signs on Windows servers. Performance Monitor has a multitude of counters that you could potentially monitor on a server. I'm just going to point out a few critical ones that were highlighted in the workshop, along with their tolerances.

Some points to note before I get into these values:

Don't always look at the graph first. The counter graphs can be scaled right out to lunch so they aren't necessarily a good first glance indicator. For a particular counter, your focus should first be on the Minimum and Maximum values in perfmon just below the graph.
In the past, I would tend to keep an eye out for sympathetic counter relationships on the graphs. However, keeping an eye out for inverse counter relationships is also a good idea (where one counter is decreasing in value while another is increasing).
Counters can get corrupted. Apparently this happens more often than one would think. They can be rebuilt - directions are in this KB post.
You can attach a PID (process ID) to a perfmon counter in an OS older than Windows Server 2008 by modifying registry keys. Details are in this KB article. Versions of perfmon on 64-bit OSes come with this already set up.
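
To make the first two points concrete, here is a minimal sketch in Python, assuming you have already exported counter samples from perfmon into plain lists of numbers. The sample values and the function names (`summarize`, `pearson`) are made up for illustration; they are not part of perfmon.

```python
from statistics import mean

def summarize(samples):
    """Return (minimum, maximum, average) for one counter's samples."""
    return min(samples), max(samples), mean(samples)

def pearson(xs, ys):
    """Pearson correlation of two equally long series.

    Values near +1 suggest a sympathetic relationship,
    values near -1 suggest an inverse relationship.
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series has no meaningful correlation
    return cov / (var_x * var_y) ** 0.5

# Hypothetical samples: available memory falls while private bytes grow.
available_mb = [900, 800, 650, 500, 300]
private_mb = [100, 210, 340, 500, 700]

print(summarize(available_mb))
print(round(pearson(available_mb, private_mb), 3))
```

Looking at the minimum and maximum first sidesteps the graph scaling problem, and a strongly negative correlation is one way to spot an inverse relationship without eyeballing the lines.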

On to the objects/counters of note:

Process Object

Handle Count - greater than 500 handles may point to a problem.
Private Bytes - greater than 250 MB could be a problem. I've seen processes over a gigabyte, and they definitely were a problem (they can get that high).
Working set - greater than 250MB could be a problem
Thread Count - greater than 500 threads should be watched to ensure the count isn't increasing over time
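
As a rough sketch of how these process tolerances could be checked in a script (Python, with hypothetical names such as `PROCESS_LIMITS`, `process_warnings` and `is_growing`, and assuming the byte counters are supplied in bytes):

```python
MB = 1024 * 1024

# Per-process tolerances taken from the notes above.
PROCESS_LIMITS = {
    "Handle Count": 500,
    "Private Bytes": 250 * MB,
    "Working Set": 250 * MB,
    "Thread Count": 500,
}

def process_warnings(sample):
    """Return the counter names in `sample` that exceed their tolerance."""
    return [name for name, limit in PROCESS_LIMITS.items()
            if sample.get(name, 0) > limit]

def is_growing(samples):
    """True if the series never drops and ends higher than it started."""
    if len(samples) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_drops and samples[-1] > samples[0]

# Hypothetical readings for one process.
sample = {"Handle Count": 620, "Private Bytes": 100 * MB, "Thread Count": 500}
print(process_warnings(sample))
print(is_growing([510, 515, 530]))
```

The `is_growing` check is a deliberately simple stand-in for "increasing over time"; a real leak hunt would look at a much longer window.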

Processor Object

% Processor Time - watch all core instances; _Total gives you an overall trend. Greater than 91% utilization is potentially an issue.
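
A small sketch of why the per-core instances matter, assuming the readings have been collected into a dictionary keyed by instance name (the `busy_instances` helper and the sample values are hypothetical):

```python
CPU_LIMIT_PERCENT = 91.0  # tolerance from the notes above

def busy_instances(samples, limit=CPU_LIMIT_PERCENT):
    """Return the instance names whose utilization exceeds `limit`, sorted."""
    return sorted(name for name, pct in samples.items() if pct > limit)

# Hypothetical reading: core 2 is pegged while _Total still looks healthy.
reading = {"_Total": 48.5, "0": 20.0, "1": 35.0, "2": 97.0, "3": 42.0}
print(busy_instances(reading))
```

Here _Total sits comfortably under the limit even though one core is saturated, which is the case the _Total trend alone would hide.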

Network Interface Object - you need to know the spec for your network interface to determine its capacity. Anything over 80% of capacity could point to a problem.

Current Bandwidth - helps you determine the NIC's capacity
Output Queue Length - greater than 2 is a problem
Bytes Total - greater than 65% of capacity utilized is past the warning threshold (blinking red with siren)
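
The "percent of capacity" figure isn't a counter on its own, so here is a sketch of the arithmetic in Python. It assumes Current Bandwidth is reported in bits per second and the bytes counter in bytes per second, and it reads the two figures above as a warning level at 65% and a problem level at 80%; the function names are hypothetical.

```python
def nic_utilization_percent(bytes_total_per_sec, current_bandwidth_bps):
    """Convert bytes/sec to bits/sec and express it as a percent of capacity."""
    return bytes_total_per_sec * 8 * 100 / current_bandwidth_bps

def nic_status(utilization_percent, output_queue_length=0):
    """Classify a NIC reading using the tolerances from the notes above."""
    if utilization_percent > 80 or output_queue_length > 2:
        return "problem"
    if utilization_percent > 65:
        return "warning"
    return "ok"

# Hypothetical 1 Gbps NIC moving 90 MB/s.
gigabit = 1_000_000_000
pct = nic_utilization_percent(90_000_000, gigabit)
print(pct, nic_status(pct))
```

Multiplying by 8 is the step that is easy to forget: the bandwidth is quoted in bits while the traffic counter is in bytes.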

Memory Object

Free System Page Table Entries - the higher the better here. Lower than 5000 is considered critical. I've seen boxes 'run' (aka hobble) around 2500.
Available Megabytes - again higher here is desirable. Less than 100 MB or 5% free is very problematic
Pool Non Paged Bytes - greater than 80% consumed is out of spec (not at all good)
Pool Paged Bytes - same as Pool Non Paged Bytes. Anything between 60% and 80% should be watched.
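
A sketch of the memory tolerances as a single check, in Python. The pool counters report bytes in use, not a percentage, so this assumes you supply the pool's maximum size yourself (`pool_max` is a hypothetical input, as is the `memory_findings` name).

```python
def memory_findings(available_mb, total_mb, free_ptes, pool_used, pool_max):
    """Return a list of findings based on the tolerances in the notes above."""
    findings = []
    # Less than 100 MB or less than 5% of physical memory free.
    if available_mb < 100 or available_mb < 0.05 * total_mb:
        findings.append("available memory critical")
    # Fewer than 5000 free system page table entries.
    if free_ptes < 5000:
        findings.append("free system PTEs critical")
    # Pool consumption: watch at 60-80%, out of spec above 80%.
    pool_pct = 100.0 * pool_used / pool_max
    if pool_pct > 80:
        findings.append("pool usage critical")
    elif pool_pct >= 60:
        findings.append("pool usage watch")
    return findings

# Hypothetical box: 4 GB of RAM, 80 MB available, pool 70% consumed.
print(memory_findings(80, 4096, 12000, 70, 100))
```

The same function works for either the paged or the non-paged pool, since the notes give both the same tolerances.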

Logical and Physical Disk Objects - they have the same critical counters so I've put them together here

% Idle Time - 19 to 0 percent is critical. Anything over 55 is a warning.
Current or Avg. Disk Queue Length - 3 to 31 in the queue means you need to keep an eye on it. Greater than 32 is an issue.
Avg. Disk sec/Read or Avg. Disk sec/Write - 25 ms and above is critical
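
A sketch of the disk checks in Python. Two assumptions to flag: the read/write latency counters report seconds, so 25 ms is written as 0.025, and a queue length of exactly 32 is treated as an issue even though the notes only say "greater than 32". The `disk_findings` name is hypothetical.

```python
def disk_findings(idle_percent, queue_length, avg_sec_per_transfer):
    """Return findings for one disk using the tolerances from the notes above."""
    findings = []
    if idle_percent <= 19:               # 19 to 0 percent idle
        findings.append("idle critical")
    if queue_length >= 32:               # assumption: 32 itself counts
        findings.append("queue issue")
    elif queue_length >= 3:              # 3 to 31: keep an eye on it
        findings.append("queue watch")
    if avg_sec_per_transfer >= 0.025:    # 25 ms expressed in seconds
        findings.append("latency critical")
    return findings

# Hypothetical busy disk: 15% idle, 4 requests queued, 30 ms per read.
print(disk_findings(15, 4, 0.030))
```

The same check can be run against either the logical or the physical disk instance, since both objects share these counters.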

Other interesting points of note:
- Mark Russinovich is the co-creator of the Sysinternals tools (such as Process Explorer and Process Monitor). His blog is apparently pretty good and our facilitator was very impressed with him.
- Windows Server 2008 does processor/core parking. This means the server will 'retire' CPUs (effectively turn them off to save power) when there isn't a heavy load on the box. Our facilitator told us that one ISP moved all their boxes to Windows Server 2008 for this reason and their monthly power bills dropped by 15%.


Sunday, November 21, 2010
