Downtime may be planned or unplanned. Planned downtime is caused by the need to upgrade, add new features, or conduct preventative maintenance.

Unplanned downtime is caused by system failures or operator error that usually results from poor training, over-complexity, inadequate usability, or

under-skilled personnel.

According to studies from the Network Reliability Steering Committee (NRSC), procedural errors were the root cause for 33 per cent of reported service outages. The frequency of procedural outages has been on an upward trend, as shown in the figure on the right.

Industry analysis shows that people and process issues cause approximately 80 per cent of unplanned downtime, while the remainder can be attributed to product failures.

Human Factors in Unplanned Downtime

Humans are fallible and make errors. An often-quoted research report from the Gartner Group5 attributes up to 40 per cent of unplanned downtime to operator errors alone. This includes the operators, the administrators, and everyone who physically touches the communications systems. The people making procedural errors tend to be semi-skilled personnel who are more familiar with hardware installation and cabling. Administrators, with a broader skill set, are expected to remotely handle the more complex tasks.

Service providers are well versed with these issues and design their networks with such considerations. They do not like to deal with complicated cabling and they like to be able to remotely diagnose problems from the safety of controlled environments — keeping as many hands off the actual system as possible. In addition, comprehensive training, certifications, and education programs can help increase technical knowledge and reduce basic human errors.

Reliance on human intervention can significantly increase the MTTR of a system. A person has to show up on site (which can not always be guaranteed), and human response time is also usually far inferior to an automated recovery process. In addition, humans tend to make mistakes and can decrease the MTTF of other components in the system or significantly impede the MTTR of the failed component. Although system designers typically strive to remove the human element from the delivery of service as much as possible, they must determine up front if a human will be part of the fault management process in order to implement interfaces to minimize MTTR.

A fifth recommendation when designing a comprehensive availability strategy is to show that due consideration has been shown for human factors. As per the NRSC recommendation, high availability systems must strive to remove the human element from the delivery of service. If an error occurs, the system must be able to capture adequate diagnostic information and quickly return the system to service without waiting for human intervention.

Not only does this prevent human errors, it also reduces labor costs by requiring fewer people and shifts. More tasks require fewer people and cheaper labor can operate the system.

Ensuring that your systems enable service providers to test newer versions of software while the system is running is another good way to reduce the possibility of human error. This testing allows them to cut over to new software easily. If a problem is detected with the new software version, the system can be rolled back to a known stable version of the software.

Managing Unplanned Downtime

In spite of the best components and the best quality control procedures, component faults are inevitable and both fault detection and fault repair impact MTTR. The rate at which faults can be detected directly affects the time it takes for a system to recover. If a backup component is available and is able to assume at least some of the failed component’s responsibilities, a level of service availability is maintained. If the failed component has no backup or load sharing capabilities, then a service interruption occurs.

In order to properly manage unplanned downtime, a system must have a fault management plan. Fault management is typically a five-stage process, the tenets of which determine the efficiency of MTTR.

Detection – a fault is registered, but the failed component is not located

Diagnosis – the determination of which component has failed

Isolation – ensuring a fault does not cause a system failure. (Isolation does not necessarily make a system function correctly.)

Recovery – restoring system to expected behavior

Repair – restoring a system to full capability including all redundancy

Notification of the fault must occur at many points in this process. Examples of notification events include a change in system topology – when a board is taken out of service, put back in service, removed from the system, or inserted into the system. Between each of the five steps there must be notification to the next step or steps in the process. On fault detection, notification may occur to the diagnosis, isolation, and perhaps recovery software components simultaneously.

Perhaps a service provider’s greatest need is for better visibility into the system. They require visibility in order to determine the health of the system, predict impending failures, and perform fault detection, diagnosis, isolation, and repair. Service providers need proactive indications when anything changes in the system beyond a certain threshold and are also asking for remote notification and alarm functionality.

Share on LinkedIn Share with Google+