Within the last 8 months, all of the production servers for a client of mine were moved out of the building they'd been in and into a 'bunker', i.e. a co-location facility. There were a variety of reasons for doing this. As an architect choosing a co-location provider and planning the move, you would want to make sure that as many systems as possible are redundant - power supply, UPS system, networking, etc.
Well, this week everyone painfully discovered a system that wasn't redundant (or if it was, it wasn't as redundant as it should have been). Apparently the HVAC system went down. This resulted in servers getting too hot and consequently having to be shut down. Production servers.
Everything was resolved in a couple of hours, but it goes to show: there always seems to be one system that gets forgotten.
I've actually run into a situation like this before. It wasn't nearly as big an issue for our shop, since it only affected development environments. However, the Calgary airport(!!), along with most of NE Calgary, was without a network connection for just as long as we were. Apparently a construction crew doing some digging at an intersection with a backhoe accidentally cut a main networking cable that supplied most of NE Calgary with its network connection. Kind of makes you wonder if municipalities should be considering redundant underground networking, doesn't it?
Network architects work hard to eliminate as many single points of failure in their systems as they can. Some are hard to control, though. Recently McAfee released a virus definition update that wasn't tested properly and as a result shut down a plethora of systems across North America. It flagged a Windows DLL as a virus (a false positive) and sent thousands of systems into perpetual reboots. It would be an interesting bit of process engineering to figure out the best way to protect your production systems from the most current viruses AND, at the same time, protect them from bugs like this shipped by the 'big' virus-scanning companies.
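One common answer to that process-engineering puzzle is a staged (canary) rollout: apply each new signature update to a small canary group first, let it soak, and only promote it to the production fleet if the canaries stay healthy. Here's a minimal sketch of that idea; the host names, version numbers, and the health-check function are all hypothetical, not anything McAfee actually ships:

```python
from dataclasses import dataclass, field

@dataclass
class SignatureUpdate:
    version: str
    # Hosts that reported problems (e.g. false positives, reboot loops)
    # after applying this update. Populated by monitoring, simulated here.
    failures: set = field(default_factory=set)

def stage_rollout(update, canary_hosts, production_hosts, soak_check):
    """Apply `update` to the canary hosts first; promote it to production
    only if the soak check passes. Returns the set of hosts updated."""
    updated = set(canary_hosts)           # canaries always get the update first
    if soak_check(update, canary_hosts):  # did the canaries survive the soak period?
        updated |= set(production_hosts)  # safe to promote fleet-wide
    return updated

# A trivial soak check: fail if any canary reported a problem.
def no_canary_failures(update, canaries):
    return not (update.failures & set(canaries))

canaries = ["canary-1", "canary-2"]
prod = ["web-1", "web-2", "db-1"]

# A bad update (a canary hit a false positive) stops at the canary group...
bad = SignatureUpdate("5958", failures={"canary-1"})
print(sorted(stage_rollout(bad, canaries, prod, no_canary_failures)))
# ...while a clean update is promoted to the whole fleet.
good = SignatureUpdate("5959")
print(sorted(stage_rollout(good, canaries, prod, no_canary_failures)))
```

The trade-off, of course, is the soak period itself: every hour you hold an update back from production is an hour those machines run with older signatures, which is exactly the tension the paragraph above describes.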