## Abstract

Cyber-physical systems (CPS) are physical systems whose individual components have functional identities in both physical and cyber spaces. Given the vastly diversified CPS components in dynamically evolving networks, designing an open and resilient architecture with flexibility and adaptability is important. To enable a resilience engineering approach for systems design, researchers have proposed quantitative measures of resilience. Yet, domain-dependent system performance metrics are required to quantify resilience. In this paper, generic system performance metrics for CPS are proposed: entropy, conditional entropy, and mutual information associated with the probabilities of successful prediction and communication. A new probabilistic design framework for CPS network architecture is also proposed for resilience engineering, where several information fusion rules can be applied for data processing at the nodes. Sensitivities of the metrics with respect to the probabilistic measurements are studied. Fine-grained discrete-event simulation models of communication networks are used to demonstrate the applicability of the proposed metrics.

## Introduction

Cyber-physical systems (CPS) [1] are physical systems whose individual components have new capabilities of data collection, information processing, network communication, and even control, and have functional identities in both physical and cyber spaces. The Internet of things (IoT) is an example application of CPS. IoT refers to uniquely identifiable physical objects that form an Internet-like structure in cyber space [2]. The original idea of IoT was to extend the capability of radio-frequency identification chips with Internet connectivity. Later, the concept was generalized to any physical objects with data collection, processing, and communication capabilities. We can imagine that in the future, any object we interact with in our daily lives would probably have the functions of data collection and exchange, be it a thermostat, pen, car seat, or traffic light. The objects in the physical environment also form a virtual space of information gathering and sharing. This information can affect every decision we make daily, such as which jacket to wear, which medicine to take, and which commute route to follow. These physical objects are realizations of CPS, and IoT is formed by the networked CPS objects or components.

There are some new challenges in designing CPS components. The complexity of CPS components has increased compared with traditional products. Designing each product requires the consideration of hardware, software, and network connectivity, which goes beyond existing mechatronic systems, where hardware and software are designed simultaneously but with much lower complexity. CPS components are meant to be Internet-ready. Each component is an open system that can be reconfigured and readapted as the Internet itself evolves. Therefore, the concept of open system design with robust and diverse connectivity becomes important. In addition, the functions of networked CPS are the collective efforts of individual components. The confederated systems formed by individuals do not have centralized control and monitoring units. Ad hoc networks are formed by vastly different and heterogeneous components. The reliabilities as well as the working conditions of the individual components can be highly diverse. It would also be common for CPS networks to experience disruptions because of harsh working environments or security breaches. Good adaptability and resilience are important in designing the architecture of such networked systems. Yet, different from traditional communication networks, CPS networks do not just transfer information. Each node of the networks also generates new information through its sensing units. CPS networks are also different from traditional sensor networks, where the main task of sensors is collecting information whereas the logical reasoning for decision making is still done at centralized computers. In CPS networks, the levels of computational intelligence and reasoning capability of the nodes are much higher, and a major portion of decisions is made locally at individual nodes.

In this work, the resilience of CPS network architecture is studied. The term resilience had been loosely used and semantically overloaded until researchers recently started looking into more quantitative and rigorous definitions [3–11]. Generally speaking, resilience refers to the capability of a system to regain its function or performance after temporary degradation or breakdown. Different definitions of how to measure resilience have been developed. All available quantitative definitions of resilience rely on some metrics of system function or performance. Nevertheless, how to quantify the functionality or performance of systems such as communication and transportation networks still remains at a very abstract level in these studies. The performance metrics can be domain dependent. There is a need to develop quantitative performance metrics for systems of CPS. Based on the performance metrics, the resilience of CPS networks can then be measured and compared. In this paper, formal metrics to quantify the functionality and performance of CPS networks are proposed, which are based on entropy and mutual information associated with the prediction and communication capabilities of networks. The performance metrics are defined based on a generic probabilistic model of CPS networks and demonstrated with detailed network simulations. The design and optimization of CPS network architecture based on the performance metrics for resilience are also demonstrated.

In the remainder of this paper, an overview of resilience research is provided in Sec. 2, which includes the quantitative studies of resilience and the applications in engineering and networks. It is seen that resilience is a common and interdisciplinary subject for complex system study across many domains. Yet, the effort of quantitative analysis for resilience engineering and system design is still very limited. A probabilistic model of CPS networks is described in Sec. 3, where the performance metric to quantify resilience is proposed. In Sec. 4, the metrics are applied in system design and sensitivity studies. In Sec. 5, the proposed metrics are demonstrated and their applicability is verified with detailed network simulations. Section 6 presents the discussion and Sec. 7 concludes the paper.

## Background

### The Multidisciplinary Concept of Resilience.

The history of systematic resilience study can be traced back to the early 1960s, when ecologists became interested in ecosystem stability. An ecosystem may be stabilized at more than one stable equilibrium. In contrast, resilience studied in engineering focuses on the system behavior near one stable equilibrium and studies the rate at which a system approaches the steady state following a perturbation. The studies are about how to improve the ability to resist change and how to reduce the time of recovery.

The resilience perspective emerged in ecology more than four decades ago through the study of interacting populations of predators and prey in an ecosystem [12–15]. Resilience is regarded as the capacity to absorb shocks and maintain dynamic stability in the constant transient states. The accepted definition of resilience in ecology is the capacity to persist within one or several stability domains. Resilience determines the persistence of relationships within an ecosystem and is a measure of the ability of these systems to absorb changes of state variables, driving variables, and parameters, and still persist [15]. The measure of resilience is the size of stability domains, or the amount of disturbance a system can take before its controls shift to another set of variables and relationships that dominate another stability region [16]. The concept of slow and fast variables at multiple time scales is observed in ecosystems. Because of the dynamic nature of the ecosystem, the terms “regimes” and “attractors” were proposed to replace “stable states” and “equilibria” [17]. The resilience of ecosystems emphasizes not only persistence and robustness upon disturbance, but also the adaptive capacity to regenerate and renew in terms of recombination and self-reorganization. Ecosystem resilience has also been proposed to be a major index of environmental sustainability during economic growth. Economic activities are sustainable only if the life-support ecosystems on which they depend are resilient [18].

The resilience of regional economies is generally considered as the capability of returning to a preshock state, as defined and measured by employment, output, and other variables, after disturbances or adverse events such as economic crises, recessions, and natural disasters [19,20]. Several notions of regional resilience have been proposed. For example, Foster [21] defined regional resilience as the ability of a region to anticipate, prepare for, respond to, and recover from a disturbance. Hill et al. [22] defined it as the ability of a region to recover successfully from shocks to its economy that either throw it off its growth path or have the potential to throw it off its growth path. Yet, there is no standard and precise definition and measurement. Unlike physical or ecological systems, a regional economy may never be in an equilibrium state. It can grow continuously. Therefore, regional economic resilience emphasizes returning to the preshock path or state, regardless of whether it was in equilibrium or not. The four dimensions of regional resilience are *resistance* (the vulnerability or sensitivity of a regional economy to disturbances and disruptions), *recovery* (the speed and extent of the return to the preshock state), *re-orientation* (the adaptation and re-alignment of the regional economy and its impact on the region's output, jobs, and incomes), and *renewal* (the resumption of the growth path) [20].

The term resilience has been used in materials science for decades. A material with good resilience is similar to a spring. It reacts to compression, tension, or shearing forces elastically and rebounds to its original shape. The term appeared in the literature of textile material [23–25] and rubber [26–28] as early as the 1930s. The resilience of a material is generally regarded as the energy dissipation property of storing and releasing energy elastically, and can be characterized as the ratio of the energy given up in recovery from deformation to the energy applied to produce the deformation, which is measured through the energy loss during repeated load and unload cycles [28].

With the continuing downscaling of complementary metal–oxide–semiconductor technologies and reduction of power voltage, sporadic timing errors, device degradation, and external environment radiation may cause the so-called single-event transient errors in computer chips and microelectronic systems. Designers of such computing systems use resilience to describe the systems' fault tolerance [29–32]. The main approaches to enhance error resilience include error checking for recovery, co-design of hardware and software, and application-aware hardware implementation. Hardware resilience can be achieved by applying machine learning algorithms to process data collected from fault-affected hardware and perform classification for inference and decision making [33,34]. Statistical error compensation [35] can be applied to maximize the probability of correct prediction given hardware errors.

The reliability and resilience of cyberinfrastructure and cybersecurity have been the research focus for decades [36,37]. Resilience of a computer network is regarded as the ability of the network to provide and maintain an acceptable level of service in the face of various faults and challenges to normal operation [38]. The considered factors for computer network resilience include fault tolerance due to accidents, failures, and human errors; disruption tolerance due to the external environment such as weather, power outages, weak connectivity, and malicious attacks; and traffic tolerance because of legitimate flash crowds or denial-of-service attacks. Fault tolerance typically relies on redundancy if the failures of components are independent, whereas survivability depends on diversity for correlated failures.

To improve the reliability and safety of socio-technical systems with a proactive and systems engineering approach, the term resilience engineering was coined to promote the concept of enabling the capability of anticipating and adapting to potential accidents and system failures [39]. It is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions. The emphasized capabilities are anticipation, learning, monitoring, and responding. It is concerned with exploiting insights on failures in complex systems, organizational contributors to risk, and human performance drivers in order to develop proactive engineering practices. In resilience engineering, failure is seen as the inability to perform adaptations to cope with the dynamic conditions in the real world, rather than as breakdown or malfunction [40]. The scope of systems includes both physical components and humans, as human error is one of the major sources of system failures. Domain experts' over-confidence could also impede the proper development of anticipation of unexpected severe situations [41]. The important issues of resilience engineering include the dynamics and stability of complex systems.

### Quantification of Resilience.

Most of the existing studies in resilience focus on the conceptual and qualitative level of system analysis. Although various definitions of resilience have been proposed [3,4], there are limited quantification methods to measure the resilience of systems for analysis and comparison. These methods calculate resilience based on the curve of recovery, which shows the dynamic process in which the function or performance of a system degrades during a shock and recovers afterward. The typical concepts are illustrated in Fig. 1, which Francis and Bekera [4] used to define resilience factors. In the figure, $F_o$ is the original stable system performance level, $F_d$ is the performance level immediately postdisruption, $F_r^{*}$ is the performance level after an initial postdisruption equilibrium state has been achieved, $F_r$ is the performance at a new stable level after recovery efforts have been exhausted, $t_\delta$ is the slack time before recovery ensues, and $t_r$ is the time to final recovery. Other researchers used the curves with minor variations, for instance, without explicit consideration of the initial postdisruption equilibrium state $F_r^{*}$, or with the new stable state $F_r$ being the same as the original stable state $F_o$. Definitions of resilience from the perspective of reliability are also available. For example, Youn et al. [5] and Yodo and Wang [6] defined resilience as the sum of system reliability and probability of restoration, which can be estimated from the information of probabilities that a system is at different states. Hu and Mahadevan [7] defined resilience with the considerations of probability of failure, probabilities of failure and recovery times, and performance.

In the resilience factor of Francis and Bekera [4], $S_p$ is the speed recovery factor calculated from recovery times to new equilibrium. In this metric of resilience, $F_d/F_o$ captures the absorptive capacity of the system and $F_r/F_o$ expresses the adaptive capability. Therefore, the more functionality retained relative to the original capacity, the higher the resilience is.

In another metric, $Q(t)$ is a dimensionless functionality function that has a value between 0 and 1, $t_i$ is the time when the adverse event occurs that causes the loss of functionality, and $t_r$ is the time of full recovery. That is, resilience is the area under the curve of performance divided by the time of duration, which is the average functionality. Among the four factors of resilience that the authors proposed, namely rapidity, robustness, resourcefulness, and redundancy, the first two are quantified. Rapidity is the slope of the functionality curve during recovery, $dQ(t)/dt$, whereas robustness is quantified as $1-L$, where $L$ is a random variable that represents the loss of functionality due to the adverse event.

In another metric, $F$ is the performance curve as a stochastic variable, and $F^{*}$ is the target performance curve. The resistant, absorptive, and restorative capabilities are considered all together in the integral form.

A further metric uses $T_d=t_d-t_i$ and $T_r=t_r-t_d$, which are the disruption and recovery time periods, respectively. This metric provides the additional measures of failure and recovery speeds.

Notice that the above resilience definitions are based upon some performance measure $F$ or $Q$. This measure can be domain specific. The performance metrics proposed in this paper provide a formal way to quantify the performance of CPS networks so that the resilience can be assessed according to most of the above quantities.
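To illustrate how curve-based definitions of this kind operate on a sampled performance measure, the following Python sketch computes the average functionality (the area under $Q(t)$ divided by the duration) and the two ratios $F_d/F_o$ and $F_r/F_o$. The function names, the trapezoidal integration, and the sample curve are assumptions made for illustration only; they are not taken from the cited works.

```python
def average_functionality(times, q):
    """Area under Q(t) over the sampled interval divided by its duration.

    Uses the trapezoidal rule (an assumption; any quadrature would do).
    times[0] plays the role of t_i and times[-1] the role of t_r.
    """
    if len(times) != len(q) or len(times) < 2:
        raise ValueError("need at least two matching samples")
    area = 0.0
    for k in range(1, len(times)):
        area += 0.5 * (q[k] + q[k - 1]) * (times[k] - times[k - 1])
    return area / (times[-1] - times[0])


def capacity_ratios(f_o, f_d, f_r):
    """Return (F_d / F_o, F_r / F_o): the absorptive and adaptive ratios."""
    return f_d / f_o, f_r / f_o


# Hypothetical curve: functionality drops to 0.4 at t_i = 2 and
# recovers linearly to 1.0 at t_r = 6.
times = [2.0, 3.0, 4.0, 5.0, 6.0]
q = [0.4, 0.55, 0.7, 0.85, 1.0]

print(average_functionality(times, q))
print(capacity_ratios(1.0, 0.4, 0.9))
```

Any performance measure $F$ or $Q$ sampled over time can be passed to such functions, which is why a well-defined performance metric is the prerequisite for comparing resilience across systems.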

### Resilience of Networks.

The domain most relevant to CPS network resilience is the resilience of telecommunication networks such as the Internet, wireless networks, and vehicular networks [38,42]. Resilience can be qualitatively measured in a state space formed by service parameters and operational state. The quantitative approaches measure system resilience by message delivery failure probabilities due to packet loss [43], payload error [44], or delay [45] during transmission. For topological analysis, the communication failures are quantified based on the connectivity in the Erdős–Rényi random graph [46]. Simulation models [47] have also been developed. The performance and resilience of networks are measured by packet delivery ratio [47], route diversity [48], node valence and connectivity [49,50], or quality of service [51,52].

The resilience of supply chain, logistics, and transportation networks has also been studied in the recent decade [53–56]. Most of the studies remain conceptual. In addition to the concepts of response and recovery, supply chain management also emphasizes a proactive approach for readiness before and growth after disruption. Only limited efforts are given to quantitative analysis, particularly on resource allocation optimization under uncertainty, such as with differentiation between disruption and regular supply variability [57], facility location design [58,59], postdisaster recovery [60], multisourcing [61], and inventory control [62–64]. For network design, node valence and topological distances are used to quantify accessibility, robustness, flexibility, and responsiveness of networks [65].

Different from the above efforts which focus only on the capability of information exchange or material supply in networks, both communication and reasoning capabilities of CPS networks are considered in this study. A probabilistic model is proposed to quantify the capabilities of CPS networks, which is described in Sec. 3.

## Probabilistic Model of Cyber-Physical Systems Network Architecture

The CPS network is modeled as a graph with a set of $N$ nodes representing IoT-compatible products, where $E=\{(v_i,v_j)\}$ is a set of edges that indicate the information flow from node $v_i$ to node $v_j$. An adjacency matrix $A\in I^{N\times N}$ is used to model the topology, with its elements indicating whether the corresponding edges exist.

In the probabilistic model, the correlations among nodes are represented with the correlation probability matrix $C\in[0,1]^{N\times N}$, whose elements are the conditional probabilities $C_{ij}=P(x_j|x_i)$ with random state variables $x$'s associated with the nodes. Therefore, the edges in the probabilistic graph model are directed.
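As a minimal sketch of this representation, the Python snippet below stores a directed topology as an adjacency matrix and the pairwise conditional probabilities as a correlation probability matrix. The helper names, the zero-based node indices, the choice of 0.0 for unconnected pairs, and the example probabilities are all assumptions made for illustration.

```python
def adjacency_matrix(n, edges):
    """Binary N x N matrix with a[i][j] = 1 for each directed edge (v_i, v_j)."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
    return a


def correlation_matrix(n, cond_probs):
    """N x N matrix with c[i][j] = P(x_j | x_i).

    cond_probs maps (i, j) to a probability; pairs that are not listed
    are stored as 0.0 (an assumption of this sketch).
    """
    c = [[0.0] * n for _ in range(n)]
    for (i, j), prob in cond_probs.items():
        if not 0.0 <= prob <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        c[i][j] = prob
    return c


# Hypothetical three-node network with a directed cycle 0 -> 1 -> 2 -> 0.
cond_probs = {(0, 1): 0.9, (1, 2): 0.7, (2, 0): 0.4}
A = adjacency_matrix(3, cond_probs.keys())
C = correlation_matrix(3, cond_probs)
print(A)
print(C)
```

Because $C_{ij}$ and $C_{ji}$ are stored separately, the asymmetry of the directed edges is preserved.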

### Probabilistic Model.

The *prediction probability* that node $v_i$ detects the true state of world $\theta$ is $P(x_i=\theta)$, where $x_i$ is the state variable of the $i$th node. The information dependency between nodes is modeled with the *P-reliance probability* $p_{ij}$ and the *Q-reliance probability* $q_{ij}$.

The *entropy* corresponding to the prediction probability of the $i$th node is given in Eq. (8). The *conditional entropies* that quantify the information inter-dependency between state variables $x_i$'s are defined in Eq. (10), and the *mutual information* between state variables $x_i$ and $x_j$ is defined in Eq. (11).
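The following Python sketch shows one way these three quantities can be computed for two-state variables. It uses the textbook Shannon definitions and should be read as an assumption rather than as a transcription of the paper's equations: the base-2 logarithm, the reading of $p_{ij}$ as $P(x_j=\theta\,|\,x_i=\theta)$ and $q_{ij}$ as $P(x_j=\theta\,|\,x_i\neq\theta)$, the form $M(x_i,x_j)=H(x_j)-H(x_j|x_i)$, and all function names are illustrative choices.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of a two-state variable with P(x = theta) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def conditional_entropy(p_i, p_ij, q_ij):
    """H(x_j | x_i), weighting the two reliance cases by P(x_i = theta) = p_i."""
    return p_i * binary_entropy(p_ij) + (1.0 - p_i) * binary_entropy(q_ij)


def mutual_information(p_i, p_j, p_ij, q_ij):
    """Entropy of node j's prediction minus the conditional entropy given node i."""
    return binary_entropy(p_j) - conditional_entropy(p_i, p_ij, q_ij)


print(binary_entropy(0.5))                       # maximum uncertainty
print(mutual_information(0.5, 0.5, 1.0, 0.0))    # node j fully determined by node i
print(mutual_information(0.9, 0.9, 0.8, 0.3))    # partially informative edge
```

With this form, setting both reliance probabilities of an edge to zero removes the conditional entropy term while leaving the entropy of node $j$ unchanged, which is the mechanism the simulations later use to explain a temporary rise in mutual information during disruption.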

### Performance Metrics of Cyber-Physical Systems Networks.

A metric that measures the performance of a system should have the following properties [66]. First, the metric should be deterministic and monotone so that one-to-one correspondence between systems and measures can be established. Mutual information of two random variables $x$ and $y$ is non-negative. It is zero when the two variables are independent. It reaches its maximum when the two are the same variable. That is, $0\leq M(x,y)\leq M(x,x)$. In addition, mutual information is a symmetric metric, and $M(x,y)=M(y,x)$.

Second, the metric should be dimensionality independent so that the performances of systems can be compared regardless of their sizes. Calculating the average value of pairwise mutual information is necessary so that the measure is independent of the number of nodes. In addition, mutual information of random variables with discrete probability distributions also depends on the number of possible values for the random state variables, i.e., the size of state space or the probability mass functions associated with the state variables. A dimensionless measure for probabilistic design should incorporate the degrees-of-freedom for the system and the sizes of the state space.

Third, the metric should be sensitive to the change of systems when used for resilience measurement. The function and reliability of a system are sensitively dependent on those of subsystems and components. The metric should also be sensitive enough to reflect the changes at the component level.

The performance metric of a system with $N$ nodes and $D$-nary state variables is defined in Eq. (12). Here, $D=2$ (i.e., $x_i=\theta$ or $x_i\neq\theta$).
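A system-level measure with the properties listed above can be sketched as the average pairwise mutual information normalized by the size of the state space. The Python snippet below is such a sketch under stated assumptions: the normalization by $\log_2 D$, the averaging over all ordered pairs $i\neq j$, the form $M(x_i,x_j)=H(x_j)-H(x_j|x_i)$, and the function names are illustrative and may differ from the paper's Eq. (12).

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of a two-state variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def mutual_information(p_i, p_j, p_ij, q_ij):
    """H(x_j) - H(x_j | x_i) for two-state variables (assumed form)."""
    cond = p_i * binary_entropy(p_ij) + (1.0 - p_i) * binary_entropy(q_ij)
    return binary_entropy(p_j) - cond


def system_performance(pred, p, q, d=2):
    """Average pairwise mutual information, normalized by log2(d).

    pred[i] is the prediction probability of node i; p[i][j] and q[i][j]
    are the P- and Q-reliance probabilities for the ordered pair (i, j).
    """
    n = len(pred)
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += mutual_information(pred[i], pred[j], p[i][j], q[i][j])
                pairs += 1
    return total / (pairs * math.log2(d))


# Hypothetical two-node system in which each node fully determines the other.
pred = [0.5, 0.5]
p = [[0.0, 1.0], [1.0, 0.0]]
q = [[0.0, 0.0], [0.0, 0.0]]
print(system_performance(pred, p, q))
```

Averaging over pairs keeps the value comparable across networks of different sizes, and dividing by $\log_2 D$ keeps it comparable across state spaces of different sizes.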

Under the best-case rule in Eq. (13), any correct prediction as a result of the information cue from any connected node leads to a success. The sampling iterations continue until enough samples for all nodes are drawn for one time-step. The prediction probabilities for all nodes are then updated based on the frequencies of correct predictions from the samples. The mutual information for each pair is calculated and the system performance in Eq. (12) is estimated. With the updated prediction probabilities, the system moves on to the next time-step, and the same sampling and update procedures continue until the predetermined time limit is reached.

During the simulation, the system disruption and recovery occur at certain time steps, which are modeled with the changes of reliance probabilities. When the disruption occurs, the reliance probabilities (both $pij$ and $qij$) of some randomly selected pairs are set to be zeros. At the recovery stage, these disconnected pairs are reconnected with the previous reliance probabilities recovered.
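The sampling, disruption, and recovery steps can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not specified in the text: each sample first draws whether every node is correct from its current prediction probability, each neighbor then sends a positive cue with probability $p_{ij}$ or $q_{ij}$ depending on whether it was correct, a node's own observation is excluded, and all function names are hypothetical.

```python
import random


def step(pred, p, q, samples, rng):
    """One time-step: update prediction probabilities from sampled frequencies.

    A node succeeds in a sample if any other node sends it a positive cue
    (best-case rule).
    """
    n = len(pred)
    hits = [0] * n
    for _ in range(samples):
        correct = [rng.random() < pred[i] for i in range(n)]
        for j in range(n):
            for i in range(n):
                if i == j:
                    continue
                cue_prob = p[i][j] if correct[i] else q[i][j]
                if rng.random() < cue_prob:
                    hits[j] += 1
                    break  # one positive cue is enough
    return [h / samples for h in hits]


def disrupt(p, q, k, rng):
    """Zero p_ij and q_ij for k randomly selected connected pairs.

    Returns the saved values so that the pairs can be reconnected later.
    """
    n = len(p)
    live = [(i, j) for i in range(n) for j in range(n)
            if i != j and (p[i][j] > 0.0 or q[i][j] > 0.0)]
    saved = {}
    for i, j in rng.sample(live, min(k, len(live))):
        saved[(i, j)] = (p[i][j], q[i][j])
        p[i][j] = 0.0
        q[i][j] = 0.0
    return saved


def recover(p, q, saved):
    """Reconnect disrupted pairs with their previous reliance probabilities."""
    for (i, j), (p_ij, q_ij) in saved.items():
        p[i][j] = p_ij
        q[i][j] = q_ij


rng = random.Random(0)
n = 4
p = [[0.0 if i == j else 0.8 for j in range(n)] for i in range(n)]
q = [[0.0 if i == j else 0.2 for j in range(n)] for i in range(n)]
pred = step([0.9] * n, p, q, 500, rng)
saved = disrupt(p, q, 6, rng)
pred_disrupted = step(pred, p, q, 500, rng)
recover(p, q, saved)
print(pred, pred_disrupted)
```

A full run would call `step` at every time-step, call `disrupt` during the disruption window, call `recover` during the recovery window, and evaluate the performance metric after each update.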

Figure 2 shows the performance measures from the simulation of a system with 10 nodes. For each iteration, 500 samples are drawn. The disruption starts at time-step 50 and ends at time-step 100, during which a number of connections are randomly selected as disrupted edges at each time-step. By the time-step of 100, the total number of disrupted connections is 39 for the case in Fig. 2(a) and is 76 for the case in Fig. 2(b). The recovery period starts from time-step 150 and ends at time-step 200. The system is fully recovered by time-step 250 and reaches the new equilibrium. It is seen that the proposed performance metric can sensitively detect disruptions from its trend. The volatility is mostly due to the relatively small number of nodes and sample sizes.

The dynamics of entropies and probabilities in the system in Fig. 2(b) are shown in Fig. 3. The average values of conditional entropies calculated from Eq. (10) and the average values of entropies calculated from the prediction probabilities in Eq. (8) are shown in Fig. 3(a). During the disruption, the conditional entropies decrease, while the entropies associated with the prediction probabilities increase. The entropies have small values during the normal working period, because the prediction probabilities are relatively high. This is illustrated in Fig. 3(b), where the maximum and minimum values of prediction probabilities among the ten nodes are compared. The highest prediction probability is one. During the disruption, the differences between the prediction probabilities significantly increase. In other words, disruption affects the prediction capabilities of some nodes, and their prediction probabilities drop. This in turn affects other nodes. It is seen that the highest value of prediction probability among the nodes becomes less than one.

The number of nodes affects the overall performance and reliability of the system. Figure 4 shows the simulation results when the number of nodes increases to 30 and the total number of connections is 870. It is seen in Fig. 4(a) that the system performs fairly robustly when the maximum number of disrupted connections is 49. The mutual information increases slightly instead of decreasing during the disruption. This is because mutual information includes two components, entropy and conditional entropy, according to Eq. (11). During the disruption period, the conditional entropies associated with the disrupted edges reduce to zero, whereas the prediction probabilities, and thus the entropies, of the relevant nodes are not affected. As a result, the mutual information increases. This phenomenon is also observed in Fig. 4(b), where the maximum number of disrupted connections is 828. Shortly after the disruption starts at time-step 50, the average mutual information increases. Again, this is due to the reduction of conditional entropies while the entropies associated with prediction probabilities remain unchanged, which is verified by plotting the average entropies and conditional entropies in Fig. 5(a) and the maximum and minimum prediction probabilities in Fig. 5(b). As the number of disconnected edges keeps increasing, prediction probabilities are affected. Mutual information then decreases until the maximum number of 828 disconnections is reached at time-step 100. The system is stabilized in the next 50 time steps until recovery starts. During recovery, mutual information retraces its path in reverse and returns to the level prior to disruption. After time-step 200, the system is fully recovered.

Notice that the average entropies are zero at the normal working condition for the large network of 30 nodes in Fig. 5(a). This is because the prediction probabilities of all nodes are one before disruption, as shown in Fig. 5(b). The network is fully connected at the beginning because all pair-wise reliance probabilities are randomly generated. The predictions by all nodes are accurate. The predictions become unreliable once the number of disconnected edges reaches a certain level after the disruption has started. Some of the prediction probabilities decrease. As a result, the average entropy increases. The prediction capabilities of the nodes quickly recover after some of the connections resume. Intuitively, the system should become more resilient to disruption when the number of nodes increases. This is confirmed by the simulation results. The examples show that the mutual information based performance measure is sensitive to the topological change of the system. It provides detailed information about the changes of prediction and reliance probabilities. The entropy and mutual information based metrics allow us to quantify the resilience of CPS networks or IoT systems described with the probabilistic model. These performance metrics can be applied in further studies of system resilience and probabilistic design of the system architecture.

## Probabilistic Design of Cyber-Physical Systems Network Architecture

With the performance metric quantitatively defined, system design and optimization can be performed. The overall goal of the system architecture design for CPS networks is to find the optimum network topology such that the system performance is maximized.

It is seen that the reliability of prediction is related to the number of nodes in the system and the connections that are available during disruption. Larger systems with more nodes and more connections tend to be more robust and more likely to give correct predictions than smaller systems. Therefore, the design decision variables need to include the number of nodes, the respective prediction probabilities, and the pair-wise reliance probabilities. Note that the topology of networks in the proposed probabilistic model is quantified by reliance probabilities instead of binary connectivity. In addition, the performance of prediction is also related to the information fusion rules, based on which the prediction probabilities are updated. Design decisions also include the selection of the rules.

In this section, several information fusion rules for reasoning at the CPS component level are described. The sensitivities of system performance with respect to the prediction and reliance probabilities are also analyzed. Sensitivity analysis of design variables provides some insight into the search domains in design optimization.

### Information Fusion Rules at Cyber-Physical Systems Component Level.

The prediction probabilities are also sensitively dependent on the rules of information fusion during prediction update. When receiving different cues from topologically correlated neighbors, a node needs to update its prediction probability to reflect the true state of the world. Several rules can be devised in addition to the best-case rule in Eq. (13). They are listed as follows.

- Best-case (optimistic). If any of the *M* correlated nodes provides a positive cue, the prediction of the node is positive:

  $$P(x_j)=1-\prod_{i=1}^{M}\left[1-P(x_j|x_i)\right] \tag{14}$$

  Some variations of the rule include when the cases of negatively correlated nodes are also considered, as

  $$P(x_j)=1-\prod_{i=1}^{M}\left[1-P(x_j|x_i)\right]\left[1-P(x_j|x_i^{C})\right] \tag{15}$$

  as well as when the node's own observation is excluded, as

  $$P(x_j)=1-\prod_{i=1,\,i\neq j}^{M}\left[1-P(x_j|x_i)\right] \tag{16}$$

- Worst-case (pessimistic). The prediction of the node is positive only if all of the *M* correlated nodes provide positive cues:

  $$P(x_j)=\prod_{i=1}^{M}P(x_j|x_i) \tag{17}$$

  Similarly, there could be some variations of the rule, such as

  $$P(x_j)=\prod_{i=1,\,i\neq j}^{M}P(x_j|x_i) \tag{18}$$

- Bayesian. The prediction of the node is updated to $P'$ from the prior prediction $P$ and the cues that the *M* correlated nodes provide, among which *r* of them provide a positive cue:

  $$P'(x_j)\propto P(x_j)\,P(x_j)^{r}\left[1-P(x_j)\right]^{M-r} \tag{19}$$

Figure 6 shows the simulation results based on the Bayesian fusion rule, where the update of prediction probabilities is gradual and much slower than the update based on the other two rules. Some other rules can be defined for information fusion, such as product-sum, weighted average, and evidence-based. Those empirical rules are less restrictive than the above three conventional ones.
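The three rules listed above can be sketched in Python as follows. The best-case and worst-case functions follow the product forms of Eqs. (14), (15), and (17) directly. For the Bayesian rule, Eq. (19) is stated only up to proportionality; this sketch normalizes it against the complementary hypothesis with the roles of $P$ and $1-P$ swapped, which is an assumption of the example and not something stated in the text. All function names are hypothetical.

```python
import math


def best_case(cond_probs):
    """Eq. (14): positive if any correlated node provides a positive cue."""
    return 1.0 - math.prod(1.0 - c for c in cond_probs)


def best_case_with_negative(cond_probs, cond_probs_neg):
    """Eq. (15): also includes cues P(x_j | x_i^C) from negatively correlated nodes."""
    return 1.0 - math.prod((1.0 - c) * (1.0 - cn)
                           for c, cn in zip(cond_probs, cond_probs_neg))


def worst_case(cond_probs):
    """Eq. (17): positive only if all correlated nodes provide positive cues."""
    return math.prod(cond_probs)


def bayesian_update(prior, m, r):
    """Eq. (19) with an assumed normalization.

    The weight prior * prior**r * (1 - prior)**(m - r) is divided by the sum
    of itself and the mirrored weight for the complementary hypothesis.
    """
    pos = prior * prior ** r * (1.0 - prior) ** (m - r)
    neg = (1.0 - prior) * (1.0 - prior) ** r * prior ** (m - r)
    if pos + neg == 0.0:
        return prior
    return pos / (pos + neg)


cues = [0.5, 0.5]
print(best_case(cues), worst_case(cues))
print(bayesian_update(0.6, 3, 2))
```

For the same cues, the best-case value is never smaller than the worst-case value, so the two rules bracket the optimistic and pessimistic extremes of the fused prediction.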

### Sensitivities of Performance Metrics With Respect to Probabilities.

It is seen in Eqs. (20) and (21) that the first derivatives of conditional entropy with respect to reliance probabilities are positive when $p_{ij}<0.5$ and $q_{ij}<0.5$. That is, for small reliance probabilities, increasing their values would increase the conditional entropies. On the other hand, the derivatives become negative when $p_{ij}>0.5$ and $q_{ij}>0.5$, and the trend is the opposite.

The first derivatives of conditional entropies with respect to prediction probabilities are not monotonic, as seen in Eq. (22). They are functions of reliance probabilities, which have $(0.5,0.5)$ as a saddle point, as shown in Fig. 7. When $q_{ij}<0.5$ and $q_{ij}<p_{ij}<1-q_{ij}$, or $q_{ij}>0.5$ and $1-q_{ij}<p_{ij}<q_{ij}$, the sensitivities are in the positive domain.

Understanding the local sensitivity of conditional entropies is useful for local adjustment of probabilities, especially when the system's prediction probabilities are not sensitive to changes in the reliance probabilities. For the uninterrupted nodes, either increasing the large reliance probabilities (greater than 0.5) or decreasing the small ones (less than 0.5) will reduce the conditional entropies. Figure 7 also suggests that it is better to focus the adjustment of reliance probabilities in either the upper right quarter of the domain, where both P- and Q-reliance probabilities are larger than 0.5, or the lower left quarter, where both are less than 0.5. In the other two quarters, the individual effects of adjusting the probabilities could be similar, but in combination the overall trend can be compromised and dampened.

The sensitivity analysis is verified by the simulation results shown in Fig. 8, which are obtained by varying the levels of reliance probabilities. Six different situations are tested: increasing and reducing all reliance probabilities by 25%, increasing and reducing only the large probabilities (greater than 0.5) by 25%, and increasing and reducing only the small probabilities (less than 0.5) by 25%. If a probability value exceeds 1 after such a perturbation, it is set to the upper bound of 1. It is seen in Fig. 8(a) that increasing the reliance probabilities reduces the average conditional entropy, whereas reducing them increases it. Increasing or reducing only the large reliance probabilities has the same effect on the conditional entropy as adjusting all of them. That is, adjusting only the large reliance probabilities is effective enough to obtain desirable system performance. The trend of adjusting small reliance probabilities is the opposite: increasing only the small reliance probabilities increases the conditional entropy. However, in this case, the end effect of adjusting small probabilities is not as significant as that of adjusting large ones. The end effect of adjusting probabilities on the average entropy is the same. Both conditional entropies and entropies are more sensitive to the large reliance probabilities than to the small ones. Similarly, in Fig. 8(b), changing only the large reliance probabilities gives results on the mutual information similar to those of changing all of the probabilities.
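
The six perturbation situations can be sketched as follows. The function name `perturb`, the `which` selector, and the scenario labels are hypothetical names introduced for illustration; only the 25% scaling, the 0.5 threshold, and the cap at 1 come from the description above.

```python
def perturb(probs, factor, which="all"):
    """Scale selected reliance probabilities by `factor`, capping them at 1.

    which = "all"   scales every probability,
    which = "large" scales only probabilities greater than 0.5,
    which = "small" scales only probabilities less than 0.5.
    A probability of exactly 0.5 is treated as neither large nor small.
    """
    out = []
    for p in probs:
        selected = (
            which == "all"
            or (which == "large" and p > 0.5)
            or (which == "small" and p < 0.5)
        )
        out.append(min(1.0, p * factor) if selected else p)
    return out


# The six situations: (scaling factor, subset of probabilities to scale).
SCENARIOS = {
    "all +25%": (1.25, "all"),
    "all -25%": (0.75, "all"),
    "large +25%": (1.25, "large"),
    "large -25%": (0.75, "large"),
    "small +25%": (1.25, "small"),
    "small -25%": (0.75, "small"),
}

reliance = [0.2, 0.9]  # hypothetical reliance probabilities
perturbed = {name: perturb(reliance, f, w) for name, (f, w) in SCENARIOS.items()}
```

Keeping the scenarios in a single dictionary makes it straightforward to run the same simulation once per situation and compare the resulting metrics.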

Therefore, improving the relatively reliable connections or sources of information with large reliance probabilities is more effective for optimizing the system performance than simultaneously considering all connections in a system. In other words, resilience engineering for these networks needs to focus more on the relatively good and trustworthy communication channels than on the weakest links, which are the usual focus of reliability considerations.

The sensitivity of the system also depends on the information fusion rules. When the Bayesian rule is applied, the system is no longer sensitive to changes in the reliance probabilities. As shown in Fig. 9, the variation of the average mutual information as a result of different reliance probabilities is small.

According to the quantitative definitions of resilience in Sec. 2.2, the systems with the Bayesian rule are more robust, but less resilient, than the ones with the best-case rule. Notice that robustness, rather than resilience, is directly related to sensitivity. A system is less resilient if its performance is more likely to deteriorate under a small disruption. A less resilient system can also be robust if it is not sensitive to the change or adjustment of system parameters and its performance always deteriorates quickly. In the above sensitivity studies, common random numbers are used in the comparison among different systems in order to reduce the variance introduced in the simulation.
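
The idea of common random numbers can be illustrated with a toy sketch. The Bernoulli packet-delivery model and the function names below are hypothetical and far simpler than the simulations used in this study. The point is only that both systems are driven by the same random draws, so that differences in output come from the system parameters rather than from sampling noise.

```python
import random


def delivered_packets(success_prob, n_packets, seed):
    """Count delivered packets in a toy Bernoulli model with a seeded generator."""
    rng = random.Random(seed)
    return sum(rng.random() < success_prob for _ in range(n_packets))


def difference_with_crn(prob_a, prob_b, n_packets, seed):
    """Compare two systems using the same seed, i.e., the same random draws."""
    a = delivered_packets(prob_a, n_packets, seed)
    b = delivered_packets(prob_b, n_packets, seed)
    return a - b
```

Because both runs see identical draws, a system with a higher success probability never delivers fewer packets than the other in this toy model, and two identical systems always produce a difference of zero.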

## Demonstration With Discrete-Event Simulations

To demonstrate how the proposed performance metrics can be applied to actual CPS networks and how effectively they measure network performance, discrete-event simulation models of computer networks are used. The fine-grained simulation models, which are built with ns-2 [67], are as detailed as the physical networks, with models of data packets and different Internet protocols such as the transmission control protocol and the user datagram protocol. Data are generated and transmitted from one node to another.

In the first example, a ring network with nine nodes is modeled, as shown in Fig. 10(a). The transmission control protocol is used as the communication protocol. Application data flows with file transfer protocol sources are modeled from nodes #0 to #5, #2 to #6, #4 to #8, #7 to #3, #5 to #1, and #8 to #3. All connections have a packet loss rate of 0.01. The model is run to simulate the traffic for 10 s. At clock time 3.0 s, a network disruption occurs, where one, two, or three edges are disconnected. The connections are resumed at clock time 5.0 s. The numbers of packets that are sent and received for each data flow path are summarized in Table 1, where each column corresponds to a flow path. Four scenarios (no disruption, and one-edge, two-edge, and three-edge disconnections during disruption) are simulated.

In this model, the sensing and prediction capabilities of CPS are not simulated; only communication is modeled. It is assumed that only positive prediction information is transferred between nodes. Therefore, the prediction probability associated with each source node is estimated as the ratio between the number of packets sent and a reference number, assuming that sending more implies a higher capability of prediction. The common reference number can be set as the theoretical upper limit on the number of packets that a source can send under any circumstance during the time period under consideration. The upper limit used as the reference in this example is 5000. The P-reliance probability for each path is estimated as the ratio between the number of packets received by the sink and the number sent by the source. The ratio can be less than one because of packet loss and traffic congestion. Assuming that the Q-reliance probabilities are zero, entropy, conditional entropy, and mutual information are calculated from the prediction and P-reliance probabilities. The average entropy, conditional entropy, and mutual information for all paths are also listed in the last column of Table 1.
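
The calculation described above can be sketched for a single flow path as follows. The sketch assumes binary entropy in bits, a conditional entropy of the form $P(x_i)\,h(p_{ij}) + [1-P(x_i)]\,h(q_{ij})$, and mutual information equal to entropy minus conditional entropy. These forms are assumptions that appear consistent with the tabulated values rather than a restatement of the defining equations, and the function and key names are hypothetical.

```python
from math import log2

REFERENCE_PACKETS = 5000  # reference upper limit used in this example


def binary_entropy(p):
    """Binary entropy in bits, with h(0) = h(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def path_metrics(sent, received, q_reliance=0.0, reference=REFERENCE_PACKETS):
    """Estimate the probabilities and metrics of one flow path from packet counts."""
    prediction = sent / reference   # prediction probability of the source
    p_reliance = received / sent    # P-reliance probability of the path
    entropy = binary_entropy(prediction)
    # Assumed form: prediction-weighted binary entropies of the reliance terms.
    cond_entropy = (prediction * binary_entropy(p_reliance)
                    + (1.0 - prediction) * binary_entropy(q_reliance))
    return {
        "prediction": prediction,
        "p_reliance": p_reliance,
        "entropy": entropy,
        "conditional_entropy": cond_entropy,
        "mutual_information": entropy - cond_entropy,
    }


# Path #0 to #5 without disruption: 2079 packets sent, 2055 received.
metrics = path_metrics(2079, 2055)
```

For this path, the sketch gives values close to those listed for the no-disruption scenario in Table 1, which suggests that the assumed forms are compatible with the reported calculation.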

| | #0 to #5 | #2 to #6 | #4 to #8 | #7 to #3 | #5 to #1 | #8 to #3 | Average |
|---|---|---|---|---|---|---|---|
| (a) No disruption | | | | | | | |
| Packets sent by source | 2079 | 1264 | 1191 | 1177 | 1226 | 734 | |
| Packets received by sink | 2055 | 1247 | 1191 | 1160 | 1211 | 727 | |
| Prediction probability | 0.4158 | 0.2528 | 0.2382 | 0.2354 | 0.2452 | 0.1468 | |
| P-reliance probability | 0.9885 | 0.9866 | 1.0 | 0.9856 | 0.9878 | 0.9905 | |
| Entropy | 0.9794 | 0.8157 | 0.7920 | 0.7873 | 0.8036 | 0.6018 | 0.7966 |
| Conditional entropy | 0.0378 | 0.0260 | 0.0 | 0.0257 | 0.0234 | 0.0114 | 0.0207 |
| Mutual information | 0.9417 | 0.7897 | 0.7920 | 0.7617 | 0.7802 | 0.5904 | 0.7759 |
| (b) Disruption (edges 6–7) | | | | | | | |
| Packets sent by source | 1490 | 1436 | 466 | 484 | 1034 | 569 | |
| Packets received by sink | 1481 | 1419 | 466 | 476 | 1027 | 567 | |
| Prediction probability | 0.2980 | 0.2872 | 0.0932 | 0.0968 | 0.2068 | 0.1138 | |
| P-reliance probability | 0.9940 | 0.9882 | 1.0 | 0.9835 | 0.9932 | 0.9965 | |
| Entropy | 0.8788 | 0.8651 | 0.4471 | 0.4588 | 0.7353 | 0.5113 | 0.6494 |
| Conditional entropy | 0.0159 | 0.0266 | 0.0 | 0.0118 | 0.0121 | 0.0038 | 0.0117 |
| Mutual information | 0.8629 | 0.8384 | 0.4471 | 0.4470 | 0.7232 | 0.5074 | 0.6377 |
| (c) Disruption (edges 6–7, 2–3) | | | | | | | |
| Packets sent by source | 1471 | 586 | 721 | 909 | 225 | 205 | |
| Packets received by sink | 1435 | 579 | 715 | 897 | 218 | 195 | |
| Prediction probability | 0.2942 | 0.1172 | 0.1442 | 0.1818 | 0.0450 | 0.0410 | |
| P-reliance probability | 0.9925 | 0.9881 | 0.9917 | 0.9868 | 0.9689 | 0.9512 | |
| Entropy | 0.8741 | 0.5213 | 0.5951 | 0.6840 | 0.2648 | 0.2469 | 0.5310 |
| Conditional entropy | 0.0187 | 0.0110 | 0.0100 | 0.0184 | 0.0090 | 0.0115 | 0.0131 |
| Mutual information | 0.8554 | 0.5103 | 0.5851 | 0.6656 | 0.2558 | 0.2353 | 0.5179 |
| (d) Disruption (edges 6–7, 2–3, 0–8) | | | | | | | |
| Packets sent by source | 1045 | 966 | 285 | 484 | 230 | 343 | |
| Packets received by sink | 1037 | 964 | 280 | 476 | 222 | 336 | |
| Prediction probability | 0.2090 | 0.1932 | 0.0570 | 0.0968 | 0.0460 | 0.0686 | |
| P-reliance probability | 0.9923 | 0.9979 | 0.9825 | 0.9835 | 0.9652 | 0.9796 | |
| Entropy | 0.7396 | 0.7081 | 0.3154 | 0.4588 | 0.2692 | 0.3607 | 0.4753 |
| Conditional entropy | 0.0135 | 0.0041 | 0.0073 | 0.0118 | 0.0100 | 0.0099 | 0.0094 |
| Mutual information | 0.7260 | 0.7040 | 0.3082 | 0.4470 | 0.2592 | 0.3508 | 0.4659 |

It is seen from this example that the proposed metrics of entropy, conditional entropy, and mutual information are sensitive to changes in the network traffic pattern. From the scenario of no disruption to that of three-edge disruption, the performance of the network is reduced gradually. The average values of entropy and mutual information decrease monotonically, whereas the average conditional entropy decreases overall (from 0.0207 to 0.0094) but not strictly monotonically across the four scenarios.

As a further comparison, the ring network in Fig. 10(a) is modified to the network in Fig. 10(b), where a new node and four edges are inserted. The same four scenarios are simulated in the second ring network, and the statistics of packets are collected in the same way. The calculated metrics for the four scenarios are average entropy (0.8869, 0.7524, 0.7524, and 0.7524), conditional entropy (0.0150, 0.0194, 0.0194, and 0.0194), and mutual information (0.8719, 0.7331, 0.7331, and 0.7331), respectively. The metrics of the two examples are compared in Fig. 11. The metrics indicate that model 2 is more resilient than model 1, which is consistent with the topology, since model 2 includes more edges and is less susceptible to disruptions.

## Discussions

The simulation studies in this research demonstrated that entropy and mutual information can be applied as metrics of functionality and performance for CPS in order to assess resilience. The proposed probabilistic design framework requires prediction and reliance probabilities as the inputs. These quantities may be derived from historical data or solicitation. Obtaining reliable and consistent estimations of probabilities is a challenging research issue in itself. The studies here mostly focus on communication. More comprehensive investigations are needed for sensing, reasoning, and prediction capabilities.

At individual node level, several information fusion rules such as best-case, worst-case, and Bayesian can be defined so that the prediction probability associated with a node is updated based on the received information from neighboring nodes during reasoning. It is seen that the system resilience and robustness are sensitively dependent on the fusion rules. During the system design process, information aggregation rules also need to be optimized based on the expected dynamics of performance.

The proposed metrics perform reasonably well with the simple reasoning scheme based on the information fusion rules. As future extensions, the proposed performance metrics need to be further tested with some other information fusion rules. Choosing appropriate rules is expected to be an important task in designing CPS networks and systems.

The sensitivity studies also show that the system performance is influenced more by the tightly coupled nodes, where reliance probabilities are high, than by the loosely coupled ones. When the resources available for improvement are limited, the optimization of systems is more effective if efforts are focused on the connections with high reliance probabilities. Design optimization methods also need to be further explored based on the preliminary results of the sensitivity analysis. System design and optimization based on the performance and resilience metrics will mostly require a multi-objective optimization approach, since these metrics provide a multifaceted assessment. If system dynamics need to be considered, dynamic programming approaches can also be taken.

Although the proposed metrics and probabilistic measure are in the context of CPS networks, the methodology can potentially be extended for other networked systems where strong interdependency exists among individual components. Information, energy, and material flows can all be modeled similarly. For instance, in supply chain or transportation networks, prediction probability can correspond to the probability that goods or supplies satisfy the demand at a node, probability distribution of demand, or the distribution of inventory levels at a node, whereas reliance probabilities characterize the correlations between demands at different nodes (percentage of supply from one node goes to another), percentage of transport capacities being employed, or probability that transportation is not interrupted. Different node types (source, sink, warehouse, hub, retailer, etc.) and edge types (shortest path, minimum cut, etc.) can be differentiated with different types of prediction and reliance probabilities.

## Conclusion

In this paper, generic CPS network performance metrics are proposed based on entropy, conditional entropy, and mutual information to allow for quantitative resilience engineering of such networks. In CPS networks, each node corresponds to a CPS component. The processes of communication during information exchange between nodes and reasoning at individual nodes are characterized with reliance and prediction probabilities, respectively, in a probabilistic design framework. The resilience of the system then can be quantified with the proposed performance metrics of entropy and mutual information. Simulation studies show that these metrics are reasonable and consistent quantities to measure how communication and reasoning capabilities are affected during network disruption. The metrics are shown to be sensitive to the changes of network topology.

## Funding Data

Division of Civil, Mechanical and Manufacturing Innovation, U.S. National Science Foundation (Grant No. CMMI-1663227).

### Appendix

For a continuous variable, the integral operator is used instead of the summation in Eq. (A1).