Downtime in 2010 – lessons?

This article about major reported downtimes triggered some reflection on my part.

2010 – The Year in Downtime

In the past week and a half I have been involved in quite a few discussions about levels of uptime.  I find it interesting how some topics pop up repeatedly in a short period of time.

The first uptime discussion had to do with establishing a design objective of 100% uptime using two data centers (DCs), with the conversation centering on what considerations would be needed.  The key was geographical separation: enough to ensure that a single disaster event wouldn’t impact both DCs simultaneously.  That separation has to be balanced against the fact that greater distance means longer latency, which makes high-performance synchronous transactions impossible.  This question came from a leader in the data center space – someone who knows the costs involved, the design tradeoffs, and the rigorous business analysis that goes into designing and justifying data centers.  It covered just one small topic among many that need to be considered, but given how important DC design is to overall availability, it was a pretty key one.  (Further note – this was a design-goal exercise, not an SLA-driven one; I can’t imagine how expensive a 100% uptime SLA would be.)
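The separation-versus-latency tradeoff comes down to physics: every kilometer of fiber adds propagation delay, and a synchronous commit can't acknowledge faster than the round trip. As a rough sketch (the distances and the ~200 km/ms figure for light in fiber are generic illustrations, not numbers from the discussions above):

```python
# Back-of-envelope: best-case round-trip latency imposed by geographic
# separation between two data centers doing synchronous replication.
# Assumes light in optical fiber covers roughly 200 km per millisecond;
# real networks add routing, queuing, and equipment delay on top.

C_FIBER_KM_PER_MS = 200.0

def round_trip_ms(separation_km: float) -> float:
    """Lower bound on round-trip time for one synchronous replication ack."""
    return 2 * separation_km / C_FIBER_KM_PER_MS

for km in (50, 500, 5000):
    print(f"{km:>5} km separation -> at least {round_trip_ms(km):.1f} ms per synchronous commit")
```

At a few thousand kilometers of disaster-safe separation, every synchronous transaction eats tens of milliseconds before any application work happens – which is why widely separated DCs usually force asynchronous replication.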

The second discussion was more of a theme – it cropped up in about five distinct conversations, on topics ranging from where MASAS goes long-term to approaches some teams take to building out solutions that had nothing to do with MASAS.  The common thread was not understanding what ongoing costs are like at high-calibre DCs.

I was speaking to a trusted advisor who specializes in offering highly available managed services.  He had quoted some folks for a system that needed high availability (>99.9% – the exact availability number is confidential), but his rates caused the vendor (a very small company) to consider self-hosting – what I call “in my garage” hosting.  Later that same day I was talking to another individual who was looking to create services that needed high availability, and he bemoaned what he felt was the unrealistic cost of the services he had been quoted.  Multiple groups had come to the same conclusion, which shows they don’t have a full understanding of what high availability entails.  They don’t understand the cost of deploying N+1 or 2(N+1) systems in data centers.  Add geospatial redundancy and they were so far out of alignment with normal costs in the space that they were thunderstruck.  The next day the topic popped up again.  I was quite amused – and concerned, especially as some of my other conversations this week have been about taking nascent technologies into operational use under a shared-services model.  I’ll dive into that topic in the future, as there are thought leaders out there working on how this can be supported in an informal community like MASAS.
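Part of the sticker shock is not grasping how small the downtime budget gets as the nines pile up. A quick illustration (the actual SLA number above is confidential, so these are the usual generic tiers):

```python
# Downtime budgets implied by common availability targets, assuming a
# 365-day year. Illustrative tiers only; the actual figure discussed
# in the article is confidential.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability: float) -> float:
    """Total unavailability allowed per year at a given availability level."""
    return (1 - availability) * MINUTES_PER_YEAR

for a in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{a:.5f} availability -> {downtime_minutes_per_year(a):8.1f} min/year allowed")
```

Three nines leaves under nine hours a year for *everything* – maintenance windows, failed transfer switches, human error.  Meeting that from a garage, with no N+1 power or cooling, is wishful thinking.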

I figure that if the pros (2010 – The Year in Downtime) are having significant downtime, perhaps this will help the non-data-center community better understand the difficulty of meeting design goals and SLA objectives.  The pros largely had things figured out, though the failures of power protection, transfer switches, and the like must be concerning, as these are items designed to do one main thing.  Granted, it’s a tough thing to transparently shunt from live power to UPS to generator.  The human mistakes in other instances really bring it home.  The best of the best have difficulties – how can someone in a garage think they can compete?

SIDENOTE:  These discussions are quite a change from the Y2K military tents we used as a “data center” in 1999 (disclosure: I am not ex-military – I was a key contractor providing support).  One claim at the time: “we have a generator and won’t go down” (let’s forget the logic there).  It worked great until someone put diesel into the gas generator …

One thought on “Downtime in 2010 – lessons?”

  1. You’re right, it’s all about risk. The more money spent, the lower the theoretical risk of downtime. Someone just has to do the math on what’s lost if the systems go down, and whether that justifies the cost of the added N+1 or 2N redundancy.
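The math the commenter describes can be sketched as an expected-loss comparison. Every number here is invented for illustration – the cost per down-hour, the availability levels, and the redundancy premium are all assumptions, not figures from the post:

```python
# Hypothetical cost-justification check: does the avoided downtime loss
# from a redundant build exceed its extra annual cost? All figures are
# invented placeholders, not numbers from the article or its comments.

HOURS_PER_YEAR = 365 * 24  # 8,760

def expected_annual_loss(availability: float, cost_per_down_hour: float) -> float:
    """Expected yearly revenue loss from unavailability."""
    return (1 - availability) * HOURS_PER_YEAR * cost_per_down_hour

COST_PER_HOUR = 10_000                      # assumed business loss per down hour
single_site = expected_annual_loss(0.999, COST_PER_HOUR)    # three nines
redundant   = expected_annual_loss(0.99999, COST_PER_HOUR)  # assumed five nines with 2N
redundancy_premium = 60_000                  # assumed extra annual cost of redundancy

print(f"expected loss, single site : ${single_site:,.0f}/yr")
print(f"expected loss, redundant   : ${redundant:,.0f}/yr")
print(f"avoided loss ${single_site - redundant:,.0f}/yr vs ${redundancy_premium:,} premium")
```

With these made-up inputs the redundancy pays for itself; flip the cost-per-hour low enough and it doesn’t, which is exactly the decision each business has to run for itself.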
