The location of the equipment in the network also affects availability. As equipment moves toward the core infrastructure of the public network, the availability requirements become more stringent; at the edge, the requirements are more relaxed.

The local loop, for example, does not have much built-in protection to guard against failures.

In fact, Telcordia specifies availability objectives for local exchange networks as low as 99.93 per cent, which it claims represents a balance of benefits and costs consumers find acceptable. However, the core infrastructure interconnecting these local exchanges must provide far greater availability.
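These percentages are easier to compare when converted into expected downtime. The sketch below is a hypothetical helper (not from the source) that translates an availability figure into minutes of downtime per year, contrasting the 99.93 percent local-exchange objective with the "five nines" often expected of core infrastructure.

```python
# Hypothetical helper: convert an availability percentage into
# expected downtime per year (525,600 minutes in a non-leap year).
def annual_downtime_minutes(availability_pct):
    return (1 - availability_pct / 100) * 365 * 24 * 60

# Telcordia's 99.93% local-exchange objective vs. "five nines":
print(round(annual_downtime_minutes(99.93)))   # ~368 minutes/year
print(round(annual_downtime_minutes(99.999)))  # ~5 minutes/year
```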

The availability expectation for the different service types is also different. Critical and essential services, such as 911, are required to have much higher levels of availability than other non-essential services.

The first consideration in determining a system's or component's availability requirement is to determine where in the network the component will sit, what it will be used for, and how it will be combined with other systems to form the ultimate end-user solution.

Measuring hardware availability takes into account the individual components that make up the system: integrated circuits, transistors, diodes, resistors, capacitors, relays, switches, connectors, and more.

There are a number of established methods for predicting the reliability and availability of hardware components. For this study, various hardware suppliers provided aggregated platform-level and telephony-board-level MTTF data derived via the Bellcore methodology. That data was used as input and a starting point; the availability characteristics were not determined from the individual electronic components.
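MTTF data feeds the standard steady-state availability formula, A = MTTF / (MTTF + MTTR). The snippet below is a minimal sketch of that relationship; the MTTF and MTTR figures are illustrative assumptions, not values from the study.

```python
# Steady-state availability from mean time to failure (MTTF)
# and mean time to repair (MTTR): A = MTTF / (MTTF + MTTR).
def availability(mttf_hours, mttr_hours):
    return mttf_hours / (mttf_hours + mttr_hours)

# Illustrative board-level figures (assumptions, not study data):
board = availability(mttf_hours=200_000, mttr_hours=4)
print(f"{board:.6f}")  # 0.999980
```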

The way components are combined has a big effect on the total availability of the solution.

If components are combined in series, the solution relies on the availability of every component, and the total system availability can be much lower than the availability of the weakest component. If components are assembled in parallel, however, there is some leeway in the level of individual component availability, and the total system availability can even be higher than that of the most available component.
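The series and parallel cases follow directly from basic probability: series availabilities multiply, while a parallel (1-of-N redundant) arrangement is down only when every component is down at once. A short sketch, with illustrative component figures:

```python
from math import prod

# Series: all components must be up, so availabilities multiply.
def series_availability(components):
    return prod(components)

# Parallel (1-of-N redundancy): the system is down only when
# every component is down simultaneously.
def parallel_availability(components):
    return 1 - prod(1 - a for a in components)

parts = [0.999, 0.995, 0.99]
print(series_availability(parts))    # ~0.984 -- below the weakest part
print(parallel_availability(parts))  # ~0.99999995 -- above the best part
```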

The second consideration for developers is to utilize parallel availability wherever possible.

Typically, architecting a solution for parallel availability does not add much cost to the total solution, as the cost is not realized until the parallel components are actually added. Service providers can deploy systems without the parallel components at first, then easily increase availability later, when the additional cost can be justified.

As redundancy is introduced, the availability characteristics of a system change significantly. Availability calculations must account for the effects of such redundancies, the success rate of failover to redundant components, the effect of the MTTR of failed components, and the like, and can become a laborious and error-prone endeavor. Better results can be achieved by using platform and telephony-board MTTF data as input and employing Reliability Block Diagrams (RBDs) to accurately and precisely determine the system-level availability characteristics.

With RBDs, interconnected blocks can be constructed to show and analyze the effects of failure of any component in the system. RBDs can also account for probabilities of successful failover in cases where there is redundancy built into the system along with operational factors, such as the lack of immediately available spare parts. Software from companies such as Relex Software Corporation* can be used to derive the system-level availability characteristics. These packages calculate the comprehensive set of failure paths to determine the overall reliability and availability of the system under thousands of failure scenarios. Since the number of failure paths grows exponentially as the number of components in the system grows, the software performs Monte Carlo simulations to arrive at the different reliability figures of merit for different confidence levels.
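The Monte Carlo approach these packages use can be sketched in a few lines. The toy model below (all figures are illustrative assumptions, not tool output) samples a small RBD: a control board in series with a redundant pair of line boards, where failover to the standby succeeds only with some probability.

```python
import random

# Toy Monte Carlo sketch of an RBD: a control board in series with
# two parallel line boards, where failover to the standby succeeds
# only with probability p_failover. Real RBD tools also model repair
# times, spares availability, and many more failure paths.
def trial(a_control=0.9999, a_line=0.999, p_failover=0.98):
    if random.random() > a_control:
        return False                      # series element down: system down
    primary_up = random.random() < a_line
    standby_up = random.random() < a_line
    if primary_up:
        return True
    # Primary down: survive only if the standby is up AND failover works.
    return standby_up and random.random() < p_failover

random.seed(1)
n = 200_000
up = sum(trial() for _ in range(n))
print(f"estimated availability ~ {up / n:.5f}")
```

Analytically this model gives about 0.99988; the simulation converges toward that value as the trial count grows.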

The third consideration for developers is to avail themselves of these relatively inexpensive tools and thoroughly analyze the availability characteristics of their solutions under the different availability configuration options. Such testing applies the necessary rigor to completely determine a system’s unique availability fingerprint.

Causes of Downtime

Overload

One of the leading causes of service outage is overload of the system or network: too few resources processing too many calls. Examples include the initial introduction of a new service, or unexpected usage spikes.

When new services are introduced, it is very difficult to predict end user response or how the service will actually behave under real world conditions. Modeling helps; but too often, when trying to predict the actual behavior of a complex system, integral factors may be overlooked.

Usage spikes occur either when advertising campaigns work too well or during cyclical peaks such as Mother’s Day.

Unless systems are properly designed to degrade gracefully under load, they will fail. To compound the problem, minor failures can cascade into the fault management systems themselves. Ensuring that systems degrade gracefully is therefore a key consideration for developers. Systems must provide some form of load shedding and a graceful scale-back of services when things start to go wrong. Operations, administration, and management (OA&M) systems, often relied on to help prevent excess load, must also be made highly available and fault resilient. Otherwise, they can bring down an entire system or compound the availability problems.
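A minimal sketch of such load shedding, under assumed thresholds and a hypothetical `CallAdmission` gate (names and limits are illustrative, not from the source): above a soft limit, the system sheds low-priority traffic first so that essential services, such as 911, stay up.

```python
# Hypothetical call-admission gate illustrating graceful load shedding.
class CallAdmission:
    def __init__(self, capacity=100, soft_limit=80):
        self.capacity = capacity      # hard limit: shed everything above this
        self.soft_limit = soft_limit  # soft limit: shed non-essential traffic
        self.active = 0

    def admit(self, priority):
        if self.active >= self.capacity:
            return False              # overloaded: reject all new calls
        if self.active >= self.soft_limit and priority != "essential":
            return False              # nearing overload: scale back gracefully
        self.active += 1
        return True

gate = CallAdmission()
gate.active = 85                      # simulate a usage spike
print(gate.admit("normal"))           # False -- non-essential call shed
print(gate.admit("essential"))        # True  -- 911-class traffic still served
```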

Part three will focus on planned and unplanned downtime.
