The complexity of a fault-tolerant system is one of the major problems facing the system designer. This complexity arises partly from the addition of redundant components and partly from the interactions between components. The complexity of the reliability model, in turn, directly mirrors that of the fault-tolerant system.
As the number of system components and their failure modes increases, the number of system states grows exponentially, making the resulting reliability model more difficult to analyze. The large number of states makes it difficult to solve the model, to interpret state probabilities, and to conduct sensitivity analyses; in particular, it is difficult to identify the critical components. The basic issue to address is how and when to use techniques for largeness tolerance and for largeness avoidance. The former depends on automated design tools with adequate computer resources, while the latter requires designer ingenuity and experience.
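To make the growth concrete, the following minimal sketch counts the raw, unreduced state space. It assumes that each of n components is either operational or in one of m distinct failure modes, so that each component has m + 1 states and the system has (m + 1)^n state combinations before any simplification. The function name `state_count` is hypothetical and used only for illustration.

```python
def state_count(num_components: int, modes_per_component: int) -> int:
    """Unreduced number of system state combinations.

    Assumption (illustrative only): each component is either operational
    or in one of `modes_per_component` distinct failure modes, giving it
    modes_per_component + 1 states. With no states merged, the system
    state space is the Cartesian product of the component states.
    """
    return (modes_per_component + 1) ** num_components


if __name__ == "__main__":
    # Compare one failure mode per component (2 states each)
    # with two failure modes per component (3 states each).
    for n in (1, 5, 10, 20):
        print(n, state_count(n, 1), state_count(n, 2))
```

With a single failure mode per component, 20 components already yield 2^20 = 1,048,576 combinations, and with two failure modes the count exceeds three billion. This unreduced count is what largeness-avoidance techniques such as state simplification aim to shrink.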
This book addresses the problems associated with the practical design of complex yet reliable systems. Such systems are used in many applications, including avionics and banking. Designers of these systems must be methodical and careful to meet all specifications and requirements, many of which originate in safety-critical or mission-critical applications. Throughout the text, we stress Markov modeling as a unified approach to evaluating reliability, performability, and system- and cost-effectiveness.
This book is written for practicing systems and reliability engineers engaged in designing complex redundant systems.
It contains the necessary background material on probability theory and Markov analysis. It also includes an interactive Windows-based computer program suitable for solving small- to medium-sized problems.
Chapter 1 introduces the field of fault-tolerant systems reliability modeling. It discusses ever-increasing system complexity and the tools needed to cope with it, with emphasis on the practical aspects of the problem.
Chapter 2 defines the concept of a system and the analysis and modeling framework required to design a fault-tolerant system.
Chapter 3 discusses the foundations of probability theory and the probability definitions needed in later chapters.
Chapter 4 relates the concepts of probabilistic faults and failures to state-based reliability modeling. In particular, it discusses reliability measures and specifications, and it introduces the state-based reliability model.
Chapter 5 reviews the basic probabilistic reliability models including reliability block diagrams, fault trees, stochastic Petri nets, and Markov models. It also covers the fundamentals of reliability model development.
Chapter 6 reviews various stochastic processes and introduces the state-based Markov model, covering matrix evaluation, state diagram mapping, and approximate solutions.
Chapter 7 applies the state diagram Markov modeling approach to various nonredundant and redundant hardware configurations to evaluate reliability. It covers single components, parallel configurations, standby configurations, and other hybrid configurations. Fault coverage and fault monitor modeling are also covered.
Chapter 8 applies the state diagram Markov modeling approach to various redundant software configurations that can fail. Topics include reliability growth models and redundancy configurations.
Chapter 9 applies the state diagram Markov modeling approach to combined hardware and software configurations.
Chapter 10 discusses approaches to reducing large and complex Markov models to a manageable size through system state partitioning and mapping. Approaches include state simplification, system partitioning and reduction, and the mapping of subsystem states to the resulting system states.
Chapter 11 applies Markov modeling to the evaluation of maintainability for systems that can undergo failure and repair. A general approach to fault detection, isolation, and reconfiguration (FDIR) is also introduced.
Chapter 12 defines the concept of system availability and distinguishes between dynamic and static availability.
Chapter 13 introduces the field of safety analysis. Because system safety and reliability are closely interrelated, a general model capable of handling both is introduced.
Chapter 14 details the important factors in Markov model evaluation, in particular computer-aided solutions. It discusses how the models are set up, how the equations are derived and solved, and how the results are displayed. The chapter also discusses approximate solutions and their limitations. It continues with a discussion of the available evaluation tools and ends with an overview of the CARMS program.
Chapter 15 discusses the current approaches to system effectiveness modeling including availability, dependability, and capability. It also presents a general, state-based effectiveness model.
Chapter 16 lists the important analyses that support reliability evaluation, including mission definition, failure mode analysis, and trade-off evaluation.
Chapter 17 gives an extended system effectiveness evaluation example and describes other potential applications of the Markov model.
Chapter 18 is the User's Guide to the CARMS reliability evaluation program.
Chapter 19 is a model library of common redundant system configurations that can be evaluated through the CARMS program. For each configuration there is a short description of the application and the corresponding Markov diagram. These models are ready to run and can be modified as needed.
Chapter 20 is the reference manual for CARMS, listing all of the commands and keyboard functions.