ATIS-0100037
Impact Weighted MTBF - A Metric For Assessing Reliability of Hierarchical Systems
| Organization: | ATIS |
| Publication Date: | 1 July 2013 |
| Status: | inactive |
| Page Count: | 16 |
Scope & Purpose
Modern-day Service Provider (SP) network architectures are continuously evolving to provide a growing and complex range of telecommunications services to their customers. From a reliability perspective, the evolving nature of telecommunications networks, with their inherently complex elements, poses a difficult challenge. Reliability metrics need to address network design and operations, service delivery, and element functions, and to provide guidance on the development of Service Level Agreements (SLAs).
Two traditional reliability measures - Mean Time Between Failures (MTBF) and Availability - can be applied only to elements with two states: Up State (Element is Functioning) and Down State (Element is Failed). Availability is a steady state metric defined as:
A = MTBF / (MTBF + MTTR)
where MTTR is the Mean Time to Repair. Unavailability is the complement of Availability:
U = 1 - A = MTTR / (MTTR + MTBF)
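As a quick numerical illustration of the two formulas above, the following Python sketch computes A and U for a two-state element. The function names and the sample values (an MTBF of 9,990 hours and an MTTR of 10 hours) are assumptions chosen for illustration and do not come from ATIS-0100037.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or (mtbf_hours + mttr_hours) == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def unavailability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state unavailability U = MTTR / (MTTR + MTBF) = 1 - A."""
    if mtbf_hours < 0 or mttr_hours < 0 or (mtbf_hours + mttr_hours) == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mttr_hours / (mttr_hours + mtbf_hours)


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from the standard).
    a = availability(9990.0, 10.0)
    u = unavailability(9990.0, 10.0)
    print(f"A = {a:.4f}, U = {u:.4f}")
```

With these assumed inputs the element is up 99.9% of the time (A = 0.999, U = 0.001). Both functions use a single Up/Down pair of states, which is exactly the limitation the metrics described next are meant to address.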
For systems with partial failures, the telecommunication industry developed two reliability measures: Defects per Million (DPM) and Mean Time Between Outages (MTBO), which incorporate the impact of failures. DPM is a generalization of Unavailability where downtime of each failure is weighted by the number of customers impacted. DPM is calculated for a given period of time T as a fraction of the total weighted downtime during that period multiplied by 1,000,000 [ATIS-0100008].
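The customer weighting described above can be sketched in Python as follows. This is one plausible reading of the description, not the normative definition, which is given in ATIS-0100008: it assumes the weighted downtime is normalized by the total number of customers served multiplied by the period T. The `Outage` class, the `dpm` function, and all sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Outage:
    """One failure event (hypothetical record layout)."""
    duration_hours: float      # downtime of this failure
    customers_impacted: int    # number of customers affected by it


def dpm(outages: Iterable[Outage], total_customers: int, period_hours: float) -> float:
    """Customer-weighted downtime as a fraction of total customer-time, times 1,000,000.

    Assumption: the normalizing quantity is total_customers * period_hours.
    """
    if total_customers <= 0 or period_hours <= 0:
        raise ValueError("total_customers and period_hours must be positive")
    weighted_downtime = sum(o.duration_hours * o.customers_impacted for o in outages)
    return 1_000_000 * weighted_downtime / (total_customers * period_hours)


if __name__ == "__main__":
    # Assumed example: 10,000 customers observed over T = 1,000 hours.
    events = [
        Outage(duration_hours=2.0, customers_impacted=500),     # partial failure
        Outage(duration_hours=0.5, customers_impacted=10_000),  # full outage
    ]
    print(dpm(events, total_customers=10_000, period_hours=1_000.0))
```

Under this reading, a failure that takes down every customer for the whole period yields 1,000,000, and when every failure affects all customers the result reduces to Unavailability multiplied by 1,000,000, consistent with DPM being a generalization of Unavailability.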
MTBO is an extension of MTBF. It is a field metric that removes the main limitation of field MTBF in practice, namely that only failures leading to the replacement of a Field Replacement Unit (FRU) are counted. MTBO counts all customer-impacting failures, including software reboots that do not result in element replacement.
The impact of failures in modern systems for voice and data transmission (e.g., IP routers or a Radio Network Controller), as well as in mobility and wire-line communication networks with hierarchical design, increases progressively with the hierarchical level. For example, a silent failure of just one component in a router switch fabric (top hierarchical level) results in unacceptable packet loss across all router line cards (bottom hierarchical level) until the failure is detected and the router is taken out of service. As described in clause 5, DPM and MTBO have limitations that preclude reliability assessments in the design phase of such systems. The intent of this document is to define a new metric that can be applied to the reliability assessment of such systems at the design stage.
The proposed metric, Impact Weighted MTBF (IW-MTBF), is an extension of the MTBF metric that combines the MTBF values for all hierarchical levels of a given network element or network segment, weighting the MTBF for each level by the impact of failures at that level. The scope of this document is restricted to the definition of this new metric and the derivation of illustrative examples to demonstrate its power and usage. These illustrations provide guidance on estimating the reliability of complex systems in the design phase via IW-MTBF and then comparing this value with the actual field reliability via MTBO. The development of the measurement capabilities needed to collect data for metric estimation is for further study.
Document History