If you work with critical JEE applications, you know that single points of failure must be avoided at all costs. So you have more than one server, more than one load balancer, redundant storage, clustered database servers, and so on. But have you thought about the load on your application servers? If one of the servers in your cluster goes down, will the other servers be able to pick up the load, or will they fail or provide terrible Quality of Service (QoS)?
A good place to start analyzing load as a single point of failure is the Service Level Agreement (SLA) of your application. The SLA is usually measured in number of nines. For example, 4 nines means application uptime of 99.99%, 3 nines means application uptime of 99.9%, and so on. Let's suppose that the SLA for your application servers is 99.99%. Also, a common SLA for commodity hardware is 97%. This means that such a server will be down for about 3% of its time in use.
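To make the number of nines more tangible, here is a minimal Java sketch that converts an SLA into the downtime it allows per year. The class and method names are hypothetical, and a 365-day year is assumed.

```java
class DowntimeBudget {
    // Minutes in a 365-day year (assumption: leap years are ignored).
    static final double MINUTES_PER_YEAR = 365 * 24 * 60;

    // sla is a fraction, e.g. 0.9999 for "4 nines".
    static double allowedDowntimeMinutesPerYear(double sla) {
        return (1.0 - sla) * MINUTES_PER_YEAR;
    }

    public static void main(String[] args) {
        double[] slas = {0.97, 0.999, 0.9999};
        for (double sla : slas) {
            System.out.printf("SLA %.2f%% allows %.1f minutes of downtime per year%n",
                    sla * 100, allowedDowntimeMinutesPerYear(sla));
        }
    }
}
```

Under these assumptions, a 99.99% SLA allows roughly 52.6 minutes of downtime per year, while a 97% server may be down for about 11 days.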
Can you meet the 99.99% application SLA with just one such server? No. The SLA of that server is 97%, which means that over its life the server will be down too often to meet the application SLA. So you need more than one server. Will 2 servers of that type meet the SLA? Let's calculate:
p1 = probability of one server being down = 0.03
p2 = probability of two servers being down at the same time = 0.03 * 0.03 = 0.0009 (assuming the servers fail independently)
Is this probability good enough? No, because to meet the SLA the probability of both servers being down has to be less than or equal to 0.0001 (1 - 0.9999).
Would 3 servers be enough? Let's calculate:
p3 = probability of three servers being down = 0.03 * 0.03 * 0.03 = 0.000027 < 0.0001. Yes! Three servers will be enough. Now you might think that with 3 servers you are ready to go. Not so fast. There is one more factor to consider: if 1 server is down, will the other two be able to take the load? And if two are down, will the last one be able to take the load of all of them? Do you need to plan for two servers being down, or is the probability of two servers being down acceptable for the SLA?
p1 = probability of one server down = 0.03, not acceptable for the SLA
p2 = probability of two servers down = 0.0009, not acceptable for the SLA
The analysis of p1 and p2 tells us that the probability of two servers being down is not acceptable. Therefore we need to plan for the case where one server has to take the entire application load. This translates into a maximum resource consumption limit on each server that is well below the server's maximum. Let's use CPU usage for this calculation. To comply with the SLA, no server may consume more than 33.3% of its CPU cycles. That way, if two servers fail, the remaining one is able to handle the entire application load. If we follow the same logic for 4 servers, we still have to plan for two servers being down (p2 = 0.0009 is not acceptable, while p3 = 0.000027 is), so the two remaining servers must carry the whole load and none of the servers can consume more than 50% of its CPU cycles.
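The arithmetic in this walk-through can be checked with a short Java sketch. It assumes, as the calculations above do, that servers fail independently with the same probability. The class and method names are made up for illustration.

```java
class FailureMath {
    // Probability that k given servers are down at the same time,
    // assuming independent failures with the same per-server probability.
    static double allDownProbability(double failureProbability, int k) {
        return Math.pow(failureProbability, k);
    }

    // Smallest number of servers for which "all servers down" is rare enough for the SLA.
    static int minServers(double sla, double failureProbability) {
        if (sla >= 1.0 || failureProbability <= 0.0 || failureProbability >= 1.0) {
            throw new IllegalArgumentException("need sla < 1 and 0 < failureProbability < 1");
        }
        double allowed = 1.0 - sla;
        int servers = 1;
        while (allDownProbability(failureProbability, servers) > allowed) {
            servers++;
        }
        return servers;
    }

    // Number of simultaneous failures the cluster must be able to absorb:
    // counts up while the probability of one more server being down is still too high.
    // A result equal to `servers` means the SLA cannot be met at all.
    static int failuresToPlanFor(double sla, double failureProbability, int servers) {
        double allowed = 1.0 - sla;
        int k = 0;
        while (k < servers && allDownProbability(failureProbability, k + 1) > allowed) {
            k++;
        }
        return k;
    }

    public static void main(String[] args) {
        double sla = 0.9999;
        double p = 0.03;
        System.out.println("Minimum servers: " + minServers(sla, p));
        for (int servers = 3; servers <= 4; servers++) {
            int k = failuresToPlanFor(sla, p, servers);
            // The surviving (servers - k) machines must carry the whole load.
            double cap = (double) (servers - k) / servers;
            System.out.printf("%d servers: plan for %d down, CPU cap %.1f%%%n",
                    servers, k, 100 * cap);
        }
    }
}
```

With an SLA of 99.99% and a failure probability of 0.03, this gives a minimum of 3 servers, and in both the 3-server and 4-server cases two simultaneous failures to plan for.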
I derived an inelegant mathematical formula to express the relation between SLA, probability of failure, number of servers and the resulting max load. Instead of showing you that formula (you never know if one of my former professors from grad school might be reading this and be ready to comment on how ugly the formula is), I wrote a small Java function to express it:
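// Returns the max load per server as a fraction of its capacity (0.33 = 33%).
public double getMaxLoad(int servers, double SLA, double failurePossibility) {
    double resultLoad = -1;
    // n = servers still running; (servers - n) = servers down at the same time
    for (int n = 0; servers > n; n++) {
        // Is (servers - n) servers being down too likely for the SLA?
        if ((1.0 - Math.pow(failurePossibility, servers - n)) < SLA) {
            // Even all servers being down is too likely: the SLA cannot be met
            if (n == 0)
                return -1.0;
            // The n surviving servers must be able to carry the full load
            return ((double) n / (double) servers);
        }
    }
    // Reached only when even a single server being down is acceptable for the SLA
    return resultLoad;
}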
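This function returns the max load a server can have according to the number of servers, the SLA and the probability of one server going down. Below is a table indicating the max loads for different scenarios, using a failure probability of 0.03 per server; -1.00 means that the SLA cannot be met with that number of servers.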
| # Servers | SLA 99% | SLA 99.9% | SLA 99.99% |
|---|---|---|---|
| 1 | -1.00 | -1.00 | -1.00 |
| 2 | 0.50 | 0.50 | -1.00 |
| 3 | 0.67 | 0.67 | 0.33 |
| 4 | 0.75 | 0.75 | 0.50 |
| 5 | 0.80 | 0.80 | 0.60 |
| 6 | 0.83 | 0.83 | 0.67 |
| 7 | 0.86 | 0.86 | 0.71 |
| 8 | 0.88 | 0.88 | 0.75 |
| 9 | 0.89 | 0.89 | 0.78 |
| 10 | 0.90 | 0.90 | 0.80 |
| 11 | 0.91 | 0.91 | 0.82 |
| 12 | 0.92 | 0.92 | 0.83 |
| 13 | 0.92 | 0.92 | 0.85 |
| 14 | 0.93 | 0.93 | 0.86 |
| 15 | 0.93 | 0.93 | 0.87 |
| 16 | 0.94 | 0.94 | 0.88 |
| 17 | 0.94 | 0.94 | 0.88 |
| 18 | 0.94 | 0.94 | 0.89 |
| 19 | 0.95 | 0.95 | 0.89 |
| 20 | 0.95 | 0.95 | 0.90 |
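The table above can be regenerated with a small driver around the function. The sketch below copies getMaxLoad as a static method so that it is self-contained. The class name MaxLoadTable is hypothetical, and the 0.03 failure probability is taken from the earlier example.

```java
class MaxLoadTable {
    // Copy of the function from the article, made static for this sketch.
    static double getMaxLoad(int servers, double SLA, double failurePossibility) {
        for (int n = 0; servers > n; n++) {
            if ((1.0 - Math.pow(failurePossibility, servers - n)) < SLA) {
                if (n == 0)
                    return -1.0;
                return ((double) n / (double) servers);
            }
        }
        return -1.0;
    }

    public static void main(String[] args) {
        double[] slas = {0.99, 0.999, 0.9999};
        double failureProbability = 0.03; // assumed, as in the example above
        System.out.println("# Servers | 99% | 99.9% | 99.99%");
        for (int servers = 1; servers <= 20; servers++) {
            StringBuilder row = new StringBuilder(String.valueOf(servers));
            for (double sla : slas) {
                row.append(String.format(" | %.2f", getMaxLoad(servers, sla, failureProbability)));
            }
            System.out.println(row);
        }
    }
}
```

Changing the failureProbability value or the slas array is enough to explore other scenarios.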
As you can see, if for example your SLA is 99.99% and you have 10 servers for the application, the load on any of the servers can't be over 80% CPU utilization (if you are using CPU utilization to measure load). That means that when one of those servers hits 81%, a new single point of failure is introduced: the servers can't take the load of other failing servers, even when the probability of that failure complies with the SLA.
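One way to act on this limit is to compare measured utilization against it. The following is only a sketch: the class name, the method, and the sample utilization readings are hypothetical, and the limit is passed in rather than computed.

```java
import java.util.ArrayList;
import java.util.List;

class LoadLimitCheck {
    // Returns the indexes of servers whose CPU utilization (0.0 to 1.0) exceeds maxLoad.
    static List<Integer> serversOverLimit(double[] cpuUtilization, double maxLoad) {
        List<Integer> over = new ArrayList<>();
        for (int i = 0; i < cpuUtilization.length; i++) {
            if (cpuUtilization[i] > maxLoad) {
                over.add(i);
            }
        }
        return over;
    }

    public static void main(String[] args) {
        double[] sample = {0.62, 0.81, 0.79}; // made-up readings for three servers
        System.out.println("Servers over the 80% limit: " + serversOverLimit(sample, 0.80));
    }
}
```

Any server reported by such a check is running above the load at which the remaining servers could still absorb the failures the SLA requires you to plan for.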