How Reliable are your Services?
We have very few outages, but they do occur.
Service delivery over the internet is generally very reliable,
and we have redundant servers and services to enable us to cope
if something goes wrong.
The 'cloud' involves a number of separate pieces:
- our servers
- our network
- our physical connection to the internet
- our ISP
- your ISP
- your physical connection to the internet
- your network
- your computer
If any one of these fails, the service breaks.
Even if each were 99.995% reliable, the overall reliability would be only about 99.96%,
which means the service would be down for roughly three and a half hours a year.
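The arithmetic above can be checked with a short sketch. It assumes the eight pieces fail independently of one another and that each is available 99.995% of the time. The function name `series_availability` and the 8760-hour year are illustrative choices for this example, not a description of how we measure our own uptime.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def series_availability(availabilities):
    """Availability of a chain in which every link must be up.

    Assumes the links fail independently of one another.
    """
    total = 1.0
    for a in availabilities:
        total *= a
    return total


# Eight links, as in the list above, each assumed to be 99.995% available.
links = [0.99995] * 8
overall = series_availability(links)
downtime_hours = (1.0 - overall) * HOURS_PER_YEAR

print(f"overall availability: {overall:.4%}")
print(f"expected downtime: {downtime_hours:.1f} hours per year")
```

Under these assumptions the result lands near the figures quoted above. Lowering any single entry in `links` shows how one weak link drags the total down.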
Based on past experience we have found the weakest link is the ISP.
In a typical year we have less than four hours of downtime.
The failures have mostly been ISP-related and have
never been caused by our servers.
The four most serious failures of the last ten years have been:
- Loss of our physical connection when work on a nearby major road
severed it for
most of one day.
We now have two ISPs on completely separate networks (cable and DSL)
to make such a thing less likely.
- A power outage that affected all of the North East and lasted more
than a day.
Since almost everyone was affected, with most shops and businesses
closed because of the power loss, our customers were not inconvenienced
by our service being down at the same time they were.
We are considering a standby generator, since some of our newer
customers are self-sufficient in electrical supply.
- When we moved premises our ISP was unable to reconnect us
for over a week.
We relocated a server to another site with a different IP and rerouted
internet services to there until our ISP could sort it out.
We were not 'down' at all as far as customer service was concerned
because the temporary solution was in place by the start of business
on Monday morning.
It was a nuisance all the same.
- Our ISP had a failure of their configuration server and 'lost'
our static IP along with those of over a thousand other subscribers.
It took over half an hour for us to find out there was a problem,
ten minutes to discover it was at the ISP end, and an hour of failed
promises from them before we decided to switch services.
It then took another hour to switch the equipment
and configuration to use the alternate ISP for the rest of the day.
We were down again in the afternoon for nearly 15 minutes when they
said it was fixed, but it wasn't.
Next time, customers, please don't just be polite and email us about your
problem.
Get on the phone and wake someone up if you have to.
We should have been quicker to switch services, and
we need to look into having a server permanently on both ISPs.
Apart from these we have rarely been offline, and then only for a few minutes.
We back up the databases several times a day as well as every night.
We have two of everything in our network so we can quickly replace a failed
component.
We have two separate ISPs on completely separate physical networks.
We can route around a problem with either one (and have done so twice).
Unless someone cuts both cables in the lawn outside our premises we
should be safe from losing internet service.
Even if that did happen we have access to a facility ten minutes away
in another part of the city that will host a server for us short term.
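As a rough illustration of why a second ISP helps, the sketch below assumes the two connections fail independently and that each is available 99.9% of the time. Both the 99.9% figure and the function name `parallel_availability` are hypothetical, chosen only to show the shape of the calculation. It also ignores the time taken to switch from one ISP to the other, and events that hit both at once, such as a regional power cut.

```python
def parallel_availability(availabilities):
    """Availability of redundant paths where any one working path is enough.

    Assumes failures are independent, which real links only approximate.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)  # probability that this path is also down
    return 1.0 - all_down


single_isp = 0.999  # hypothetical availability of one ISP
dual_isp = parallel_availability([single_isp, single_isp])

print(f"one ISP:  {single_isp:.4%}")
print(f"two ISPs: {dual_isp:.4%}")
```

The point of the sketch is only that redundant paths multiply their failure probabilities, so two modest links can beat one good one, provided the switch between them is quick.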
Can your own IT people cope with catastrophe any better?