Monday, February 14, 2011

Excessive logging kills AppPool (log4net)

We had an issue where one of our Application Pools (IIS 6) was 'dying'.  The interesting thing was that while the application itself was dead (error page, sometimes showing an OutOfMemory exception, other times a meaningless one), the appPool appeared in the GUI to be running.  The more troubleshooting I do with Windows OSes, the less I trust what the GUI says.  Appearances in Windows can be deceiving.  For several weeks - until we understood the root cause of the issue - our resolution for the dead application was to restart the 'running' appPool.

A side note - this appPool was running on a Network Load-balanced server, on the dominant node.  The appPool for this particular application only ever failed on the dominant node, never on the other node of the cluster, so we had reason to believe that the 'load' had something to do with the root issue.

In the end, we discovered the root cause of our dying application by accident.  We were cleaning up files on the prod servers when it became apparent that log4net in another application (and another appPool) was set to log at the 'ALL' level.  This resulted in about 1.5 MB of log output roughly every 5 minutes.  I had a hunch that all this logging was having a memory impact on the server - and we certainly didn't want to be logging at 'ALL' in production anyway - so we created a change request and fixed that issue ASAP.  Then we crossed our fingers, hoping that our dying AppPool issue would go away as well.
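
The change itself was tiny - just dialing the root logger back down in that application's log4net config section.  Something along these lines (the appender name here is only a placeholder, and 'WARN' stands in for whatever less-chatty level makes sense for your app):

```xml
<log4net>
  <root>
    <!-- was <level value="ALL" />, which logs everything including DEBUG noise -->
    <level value="WARN" />
    <!-- placeholder appender name; use whatever appender the application actually defines -->
    <appender-ref ref="RollingFileAppender" />
  </root>
</log4net>
```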

It's been more than two weeks since we implemented this log4net change and our flaky appPool hasn't died once.  Lessons learned:

- Watch for excessive logging on prod servers
- Excessive logging can apparently cause other appPools to throw OutOfMemory exceptions!

Wednesday, February 9, 2011

Windows VM False Performance Monitor Stats

We run VMs almost exclusively now in most of our environments. Recently, we added two new Windows VMs in production for a particular solution.  Thinking we were being prudent, we asked to have the boxes provisioned with 4 GB of memory since that's what the old 'hardware' boxes had. We deployed our solutions to the new boxes and tested them without a hitch.  So a week later a switch was flipped: the new VMs went live and the old boxes no longer took requests.  After a few days, we discovered (by looking at Task Manager) that we were getting some conflicting and concerning readings for memory utilization. Check this out:

- the Task Manager graph said we were using 3.7 GB of memory
- looking at the actual numbers below the graph, we only had 300 MB of memory available on the machine
- looking in the Processes tab, none of the processes were using a significant amount of memory (several were using between 200-300 MB, but that was within normal tolerances when compared with servers running the same system in other environments).  However, these same processes were paging memory like they shouldn't have been!  Page faults were very high.
- Totaling the memory used by the individual processes, we were NOT using 3.7 GB of memory (see the sketch after this list for a quick way to do that math)
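
As an aside, if you want to cross-check what the Task Manager graph is telling you, a little PowerShell will do the math for you.  This is just a sketch - it assumes PowerShell is installed on the box, which wasn't a given on our older servers; at the time we added the numbers up by hand:

```powershell
# Rough cross-check of Task Manager's memory numbers.

# Sum the working sets of every process - roughly what the Processes tab shows.
$totalMB = (Get-Process | Measure-Object WorkingSet64 -Sum).Sum / 1MB
"Sum of process working sets : {0:N0} MB" -f $totalMB

# Free physical memory as the OS reports it (WMI value is in KB).
$os = Get-WmiObject Win32_OperatingSystem
"Free physical memory        : {0:N0} MB" -f ($os.FreePhysicalMemory / 1KB)
```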

After doing some more investigation, we asked the team that built the servers how the memory had been provisioned for these new boxes.  It turned out that they had both originally been given 1 GB of memory, and then later (after they were up and running) 3 more GB were added.  At that point our hunch was that these boxes weren't seeing the 3 GB that had been added after the fact.  After a hard reboot on both boxes, this turned out to be the case: the OS only picked up the full 4 GB after the restart.
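
For what it's worth, the question we should have asked up front - "how much RAM does the OS actually see?" - is only a couple of WMI queries away.  Again just a sketch, not what we ran at the time, but the classes and properties are standard WMI:

```powershell
# How much physical memory does the OS actually see?
$cs = Get-WmiObject Win32_ComputerSystem       # TotalPhysicalMemory is in bytes
$os = Get-WmiObject Win32_OperatingSystem      # TotalVisibleMemorySize is in KB
"Physical memory reported by WMI : {0:N0} MB" -f ($cs.TotalPhysicalMemory / 1MB)
"Memory visible to the OS        : {0:N0} MB" -f ($os.TotalVisibleMemorySize / 1KB)
# Compare these numbers against what was actually provisioned for the VM.
```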

Lesson learned: make sure you follow VM procedure (hard reboots) for adding/removing resources on VM servers.  Shortcuts can create some pretty confusing statistics in the OS.