Saturday, July 10, 2010

Key Questions in Software Sustainment

Here are some of the questions I ask myself when a software system has stopped running properly...

The first big question is: what changed between the time the application was running well and the time it stopped running well? A number of other questions can help point the way to the answer.

Question isolation - Is this problem isolated in any way? Is it only on a specific network, environment, or group of servers? Does it only happen for a specific group of users or a specific client? Is there a period of time it's isolated to? Is it isolated to a particular 'item' in your data? I've seen users try to run an application in a browser version other than the documented supported versions. Sometimes it takes a while to get to the bottom of simple issues like that. We've also run into situations where an organization has not given users the rights on their machines to install the third-party ActiveX browser components that the application requires.
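
One rough way to check for isolation is to tally the failure reports you have along each dimension and see whether one value dominates. The sketch below assumes each report is a plain dictionary; the field names and sample values are made up for illustration.

```python
from collections import Counter

def tally_failures(reports, dimensions):
    """Count failure reports by value for each dimension of interest.

    `reports` is a list of dicts (hypothetical shape); a report that
    lacks a dimension is counted under "unknown".
    """
    return {
        dim: Counter(report.get(dim, "unknown") for report in reports)
        for dim in dimensions
    }

if __name__ == "__main__":
    # Made-up sample data: three failure reports gathered from users.
    sample = [
        {"browser": "IE6", "network": "vpn"},
        {"browser": "IE6", "network": "office"},
        {"browser": "IE8", "network": "vpn"},
    ]
    for dim, counts in tally_failures(sample, ["browser", "network"]).items():
        print(dim, counts.most_common())
```

A lopsided count is only a hint, not proof. It tells you where to ask the next question.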

Question data integrity - Whether the culprit is legacy data or missing logic to handle special characters in your data, you need to break the problem down. With legacy data, is the problem isolated to a particular user or group of users, or a particular item? If it's an ETL function, does relational integrity carry over correctly from the first DB to the second? Are you dropping 'types' of data that the second DB expects, or adding 'types' it doesn't expect? Sometimes you need to pinpoint the exact row or time when the issue occurred to determine what the problem was.
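
As a minimal sketch of the ETL check, the query below lists 'type' values in a staging table that have no match in the target's lookup table. It uses Python's built-in sqlite3 module so it can run anywhere; the table and column names are hypothetical, and a real ETL job would run the equivalent query against its own databases.

```python
import sqlite3

def unexpected_types(conn, source_table, type_column, lookup_table, lookup_column):
    """Return (type_value, row_count) pairs for source rows whose type
    has no matching row in the lookup table.

    Table and column names are interpolated into the SQL, so they must
    come from trusted code, never from user input.
    """
    query = f"""
        SELECT s.{type_column}, COUNT(*)
        FROM {source_table} AS s
        LEFT JOIN {lookup_table} AS t
               ON s.{type_column} = t.{lookup_column}
        WHERE t.{lookup_column} IS NULL
        GROUP BY s.{type_column}
        ORDER BY s.{type_column}
    """
    return conn.execute(query).fetchall()

if __name__ == "__main__":
    # Hypothetical schema and data, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE item_types (code TEXT PRIMARY KEY);
        CREATE TABLE staged_items (id INTEGER PRIMARY KEY, item_type TEXT);
        INSERT INTO item_types VALUES ('BOOK'), ('MAP');
        INSERT INTO staged_items (item_type)
            VALUES ('BOOK'), ('MAP'), ('ATLAS'), ('ATLAS');
    """)
    print(unexpected_types(conn, "staged_items", "item_type", "item_types", "code"))
```

Grouping by the offending value gives you the count per 'type', which is usually enough to tell a one-off bad row from a whole category the second DB was never told about.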

Question continuity - Did something stop running or listening? A service, appPool, web site, 3rd party server?  Are your cron jobs or scheduled tasks still there? Monitoring would quickly and easily answer this question for you.
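
If you don't have monitoring in place yet, even a tiny script can answer the "is it still listening?" question. This is a minimal sketch that only tests whether a TCP port accepts connections; the endpoint names, hosts, and ports are placeholders, and a successful connection does not prove the service behind the port is healthy.

```python
import socket

def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False

def check_all(endpoints):
    """Map each endpoint name to True/False depending on whether it accepts connections."""
    return {name: is_listening(host, port) for name, (host, port) in endpoints.items()}

if __name__ == "__main__":
    # Placeholder endpoints; replace with the services your application needs.
    endpoints = {
        "web site": ("localhost", 80),
        "database": ("localhost", 5432),
    }
    for name, up in check_all(endpoints).items():
        print(f"{name}: {'listening' if up else 'NOT listening'}")
```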

Question communication - Are the lines of communication open to all of the dependencies that your application has? Has a network cable been severed by a backhoe down the street? (I've seen that before.) Is there too much communication going on? Too many calls from one routine can take your system down. We had an issue like that with some JavaScript that called a data access function on a GIS server. As soon as one too many layers got added to the map, performance died because there were too many round-trip calls to the server in one request. Internal systems that need to do identity verification or IP geolocation are heavily dependent on external third parties to operate. These external vendors can in turn be dependent on other external services. Know all your dependencies and have monitoring and SLAs in place for all of them, or you could be sweating bullets.

Question dependencies - Here I'm thinking more in the context of internal dependencies: what internal third-party services are you dependent on? Databases, reporting tools, monitoring systems, document management systems: any internal system that your application depends on, but that you aren't responsible for, fits into this category. Are they up and running? How do you know? Are they running properly?

Question the unquestionable - Is the HVAC system working at your co-lo? As redundant as your service provider tries to be, there will always be something under the radar. Always. I've seen an external HVAC system take down an entire enterprise. Many of you have likely seen a McAfee or a Norton patch take down an enterprise. How secure is your UPS (Uninterruptible Power Supply) management console? I've logged into one I found by accident using username:admin password:password!

Question known changes - Software enhancements and patches, file or DB permissions, config changes, etc. I've seen a changed file path take down a critical FTP routine, a third-party software patch for a document management system cripple a production system by removing key indexes in the database, and an (unfortunately un-automated) database refresh leave roles or users missing or leave the DB in restricted mode. Doing anything manually can get you into trouble. A fat finger can push the wrong dll/library or mistype an entry in a config file. Even an automated deploy can be fat-fingered, because automated deploys still depend on data that is entered manually. We have had pointers to app servers entered wrong in our web.config files, and multiple entries in machine.config for a particular component.
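
One cheap way to catch a mistyped config entry is to diff the deployed file against a known-good copy. This is a minimal sketch using Python's standard difflib module; where the known-good copy comes from (source control, a backup, another server) is left as an assumption, and the sample settings are invented.

```python
import difflib

def config_diff(known_good, current, name="web.config"):
    """Return unified-diff lines between two versions of a config file's text.

    An empty list means the two versions are identical.
    """
    return list(difflib.unified_diff(
        known_good.splitlines(),
        current.splitlines(),
        fromfile=f"{name} (known good)",
        tofile=f"{name} (current)",
        lineterm="",
    ))

if __name__ == "__main__":
    # Invented settings: the app server name was mistyped during a deploy.
    known_good = "appServer=app01\nport=8080\n"
    current = "appServer=app10\nport=8080\n"
    print("\n".join(config_diff(known_good, current)))
```

In practice you would read both versions from disk and run this for every config file a deploy touches.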

Question security - Malicious changes are possible, either internally or externally. Unfortunately, the question of security gets larger every day, and it has to be scrutinized at every level. Does your system have a firewall and IPsec rules in place? Does your application provide you with an audit trail? How secure is your production data?

Resources at your disposal that can help in your investigation:
Log files - event logs and server logs. If you're logging to tables in the DB, don't forget to look there.
Users - it's like CSI: you need to get ALL the information you can about their problem. You cannot be afraid to ask.
Thread dumps - killing a struggling server in a way that makes it produce a thread dump as it terminates can be very effective in your search for the problem.
Networking tools - Wireshark, telnet, ping, netstat. These are great for checking your communication.
Monitoring tools - Nagios, SCOM, GroundWork Open Source.
Books - Michael Nygard's Release It! and Luke Hohmann's Beyond Software Architecture.
