Sensor fault detection and isolation via networked estimation: rank-deficient dynamical systems

This paper considers model-based fault detection of large-scale (possibly rank-deficient) dynamic systems. Assuming only global (and not local) observability over a sensor network, we introduce a single time-scale networked estimator/observer. Sensors take local outputs/measurements of the system states with partial observability and share their information (including estimates and/or outputs) over a communication network to gain distributed observability. We derive conditions on the network structure that ensure distributed observability and stabilise the error dynamics. However, system outputs are prone to faults and uncertainties, which affect the state estimation of all sensors as a consequence of communicating (possibly) faulty data. From the cyber-physical-systems (CPS) perspective, such faults add bias to the data transferred from the physical layer (dynamic system) to the cyber layer (sensor network). In this work, we propose a localised fault detection and isolation (FDI) mechanism at the sensors to secure distributed estimation. This protocol enables every sensor to locally identify a possible fault in its measurement and, via local detection and isolation, to prevent the spread of biased/faulty information over the network. This distributed isolation and localisation of faults follows from our partial observability assumption, instead of full observability at every sensor. Then, other sensors can estimate/track the system by using observationally-equivalent output information to recover from the possible loss of observability. In particular, we study rank-deficient systems, as they are known to demand more information-sharing and are thus more vulnerable to the spread of possible faults over the network. One challenge is the detection of faults in the presence of system/output noise without making (simplifying and unrealistic) upper-bound assumptions on the noise support. We resolve this by adopting probabilistic threshold designs on the residuals.
Further, we show that additive faults at rank-deficiency-related outputs affect the residuals at all sensors, a consequence that mandates more constraints on the (distributed) FDI strategy. We address this problem by a constrained LMI design of the feedback gain matrix. Finally, we design q-redundant distributed estimators, resilient to the isolation/removal of up to q faulty sensors, and, further, we consider thresholding the residual history over a sliding time-window, known as stateful FDI.


Introduction
Designing sensor networks resilient to faults or attacks has motivated simultaneous estimation and FDI via Linear Dynamical State-space (LDS) models (Barboni et al., 2020; Giraldo et al., 2018; Yılmaz et al., 2016). This is to detect possible anomalies in the system outputs and to avoid biased (and unstable) tracking of the system states. In the CPS setup, such faults appear at the connection from the physical layer to the cyber layer, where well-designed detection strategies can significantly improve CPS reliability. This paper provides a "distributed" (or networked) estimation and FDI protocol to locally detect possible output faults with no need for central coordination. In particular, secure networked estimation with application to CPS (Doostmohammadian et al., 2020; Nweke et al., 2021) and the Internet-of-Things (IoT) (Chen et al., 2018) is of interest in this paper. This is motivated by many existing large-scale applications, such as smart-grid monitoring (Abbaszadeh, 2019; Khan & Doostmohammadian, 2011; Khan & Stanković, 2013) and social network inference (Doostmohammadian et al., 2021).
CONTACT M. Doostmohammadian doost@semnan.ac.ir, mohammadreza.doostmohammadian@aalto.fi. This article has been republished with minor changes. These changes do not impact the academic content of the article.
In networked/distributed estimation, a collaborating group of (possibly spatially distributed) sensors tracks the state of the (large-scale) dynamical system with partial observations, local data-processing, and information-sharing over a communication network. The main concern in networked estimation is that the sensors' observations are prone to faults and anomalies, or can even be manipulated by adversaries/malicious threats. Sharing such faulty data over the sensor network affects the estimation performance of some (or all) other sensors. Our remedy is to design a distributed and local FDI strategy to detect faulty sensors and, further, to reconfigure the sensor network setup by removing the faulty sensors and recovering the loss of observability. In contrast to the existing centralised solutions (e.g. joint estimation and detection in Yılmaz et al., 2016), various networked estimation scenarios are considered in the literature, ranging from multi time-scale (MTS) methods (Battilotti & Cacace, 2021; He, Hu, et al., 2019; He, Ren, et al., 2019; He et al., 2021; Olfati-Saber, 2009) to single time-scale (STS) approaches (Azizi & Khorasani, 2014; Battistelli et al., 2012; Boukhobza et al., 2009; Kar et al., 2012; Lopes & Sayed, 2008; Park & Martins, 2012; Ren & Al-Saggaf, 2017). More recently, such protocols have been further developed to detect possible attacks/faults in a distributed way (Deghat et al., 2019; Guan & Ge, 2017; He, Ren, et al., 2019; He et al., 2021; Mitra & Sundaram, 2018). Such distributed FDI methods are a significant improvement over the classic centralised FDI counterparts (Chong et al., 2015; Davoodi et al., 2014; Giraldo et al., 2018; Kodakkadan et al., 2017; Lee et al., 2015; Li & Jaimoukha, 2009; Li et al., 2012; Pajic et al., 2015; Rahimian & Preciado, 2015; Rank & Niemann, 1999; Zhang et al., 2021), since the fault can be detected and isolated locally without the help of centralised supervision. Such strategies avoid a single point of failure and, further, can be improved by changing
the network connectivity; for example, Doostmohammadian et al. (2018) considers cost-optimal network design while addressing distributed observability. A practical approach is to add observability redundancy (Lee et al., 2019; Mitra et al., 2021) to make the estimation resilient to the isolation of a few faulty sensors (or sensor failures). In this direction, some works based on structured systems theory consider centralised cases, e.g. the centralised graph-theoretic FDI on system digraphs (Commault et al., 2006) and preventive structural design to avoid the so-called zero-dynamics attacks (Weerakkody et al., 2017). Structural observability analysis of distributed CPS, with applications in distributed estimation and networked control systems, is discussed in Doostmohammadian and Khan (2020). In some literature, Static Linear State-space (SLS) solutions are considered with no knowledge of the system dynamics. Such centralised SLS scenarios are mostly data-driven, using machine learning and binary classification tools (e.g. support-vector-machines) to classify faulty output data from normal data (Abbaszadeh, 2019; Kurt et al., 2018). In general, as compared to LDS solutions, SLS solutions may need more system outputs (in theory, as many as the system states) to gain system observability. This is also referred to as static observability, as compared to dynamic observability; the difference is more significant in large-scale applications with an increasing number of states. Overall, distributed LDS detection scenarios are more compelling at large scale, as they do not require a large number of outputs/sensing-resources or long-range, costly (even infeasible) coordination with a centralised detection unit. Further, SLS models only aim for FDI, while LDS models can aim for joint FDI and estimation by knowledge of the system dynamics, with possible applications in, e.g. the satellite attitude control system (Nasrolahi & Abdollahi, 2018) and Cyber-Physical-Energy-Systems (CPES) (Ilić et al., 2010), as shown in Figure 1.
Main contributions: We propose a joint estimation, detection, and isolation method in a 'distributed setup'. A group of sensors tracks the entire state of a large-scale dynamical system with system matrix A, while the system is only partially observable to any single sensor. Sensors share their estimates and/or measurements to 'gain observability over the network' (referred to as distributed observability). To overcome the lack of local observability, some works propose multi time-scale (MTS) distributed estimation (Battilotti & Cacace, 2021; He, Hu, et al., 2019; He, Ren, et al., 2019; He et al., 2021; Olfati-Saber, 2009), where sensors need to perform a large number of communication and consensus iterations over the sensor network between every two consecutive time-steps of the system dynamics. Consequently, with more iterations than the sensor network diameter, every sensor receives the information of every other sensor and, therefore, the underlying system becomes observable to all sensors between every two system time-steps. This solution imposes a high communication load/traffic over the sensor network and requires fast, costly processing/communication units. In particular, this is more challenging for large-scale CPS as the number of sensors and either the sensor network diameter or linking grow larger. To avoid this, single time-scale (STS) distributed estimation protocols are proposed (Azizi & Khorasani, 2014; Boukhobza et al., 2009; Kar et al., 2012), where sensors perform only one step of communication and consensus update at every system time-step (i.e.
at the same time-scale). However, these works assume the system is observable in the direct neighbourhood of all (or some) sensors. This requires densely-connected sensor networks where every sensor is directly linked to many other sensors. Both mentioned solutions are costly in terms of either the network traffic or the need for fast communication and computation units, or both. At large scale, these might even be infeasible, as they require (i) very fast long-range communication and data-processing or (ii) high network traffic at every time-step, which may increase congestion, latency, and packet losses. In this paper, we provide a single time-scale (STS) protocol with no local observability assumption at any sensor node, no need for costly fast communication/processing facilities, and less communication requirement over the sensor network (to reduce the mentioned networking issues). The proposed STS protocol is significant in the following aspects: (i) Unlike (Battistelli et al., 2012; Doostmohammadian et al., 2018; Lopes & Sayed, 2008; Park & Martins, 2012) and similar to Deghat et al. (2019) and Mitra and Sundaram (2018), it is not assumed that the system is full-rank or self-damped, and the solution is valid for general, possibly rank-deficient systems; (ii) Unlike (Azizi & Khorasani, 2014; Boukhobza et al., 2009; Kar et al., 2012) and similar to Deghat et al. (2019) and Mitra and Sundaram (2018), no local observability in the neighbourhood of any sensor is assumed, reducing the required network connectivity and the communication traffic/load on sensors.
The performance of the proposed STS protocol is analysed under possible sensing faults. We adopt a residual-based FDI scenario to detect and isolate possible faults (or bias) in the state measurements by sensors (considering possible system rank-deficiency). Recall that localisation of faults is more challenging here, as the system is not necessarily observable in the neighbourhood of any sensor. This follows from our partial (and not full) observability assumption in the local neighbourhood of sensors; however, it demands more complex residual analysis and more constraints on the observer gain design. From the structured systems theory perspective (as in Commault et al., 2006), the adjacency graph (or system digraph) of such possibly rank-deficient systems contains so-called contraction components (the dual of the dilation components for controllability; Doostmohammadian, 2019; Liu et al., 2011). Outputs of state nodes in contractions play key roles in (distributed) structural observability (Doostmohammadian & Khan, 2020). In this direction, this paper advances the current state-of-the-art in distributed FDI by addressing how rank-deficiency affects networked estimation and detection. We provide a sensor classification that divides sensors into three groups based on their state measurements and the system digraph structure. One class of sensors recovers the system rank-deficiency (with outputs of contractions), the second class recovers the output-connectivity (with outputs of parent-SCCs), and the third class is auxiliary or unnecessary for observability. We show that for the first class any measurement/output fault affects the residuals at all sensors (making it harder to isolate the fault), while for the other two classes the fault only affects the residual at the faulty sensor. This demands different detection and isolation methods for the different classes. In this direction, this work extends and generalises our previous works (Doostmohammadian et al., 2021; Doostmohammadian & Meskin,
2020) in which only full-rank dynamical systems are considered. In the proposed STS estimation protocol, the system rank-deficiency is addressed by adding a step of measurement sharing (or innovation-update) over the sensor network. In this work, probabilistic thresholds are designed (in contrast to deterministic thresholds) on the residuals of the noise-corrupted sensors to detect possible additive faults. Stateful measures (also known as distance measures) considering the history of the residuals over a sliding time-window can also be considered, as in Doostmohammadian et al. (2021). The faulty sensor needs to be removed (or isolated) to avoid cascading faulty data over the distributed estimation network. Further, to recover distributed observability, a graph-theoretic algorithm is proposed to replace the faulty sensor with an observationally-equivalent one. From another viewpoint, using the notion of q-redundant observability, we propose a mechanism to design q-redundant distributed estimators/observers, where the observability (and observer error stability) is preserved after removal/isolation of up to q failed/faulty sensors. The main contributions of this paper are summarised as follows:
• We propose joint (STS) distributed estimation and (LDS) detection over sensor networks, with no assumption on the system rank and no local observability assumption at any sensor. As detailed above, a protocol in this framework outperforms existing ones in terms of the needed communication traffic, network connectivity, and data-processing load on sensors.
• This work advances the previous works on full-rank system dynamics (Doostmohammadian et al., 2021; Doostmohammadian & Meskin, 2020) to rank-deficient models. We show that system rank-deficiency imposes more outputs and network connectivity on the distributed estimation network. This makes the detection strategy more challenging in terms of extra constraints on the LMI gain design as compared to the full-rank systems in
Doostmohammadian and Meskin (2020), Doostmohammadian et al. (2021), Doostmohammadian and Khan (2013a) and Khan and Jadbabaie (2011).
• Unlike (Chong et al., 2015; Kar & Moura, 2011; Kodakkadan et al., 2017; Lee et al., 2015; Pajic et al., 2015; Rank & Niemann, 1999), we do not assume that the system and/or measurement noise are of bounded support. We avoid this simplifying assumption by considering Gaussian noise with no upper-bound. This is more realistic as compared to the existing literature, and results in our probabilistic threshold design. We clearly define the probability of false-alarm and false-negative for the proposed probabilistic thresholds.
• We consider an observability recovery method based on the concept of observational-equivalence (Doostmohammadian & Khan, 2016; Doostmohammadian et al., 2018). In particular, we provide a graph-theoretic algorithm to replace a faulty sensor (of each class) with an observationally-equivalent one of the same class. This is closely tied with the concept of redundant observability, relating a security index for attack detectability/recovery to observability preservation (Lee et al., 2019). In this direction, a graph-theoretic algorithm is proposed in this paper to design q-redundant networked observers/estimators, i.e. distributed observers robust to the failure or removal of up to q sensors. In other words, after removal of up to q faulty sensors (and cutting their network links), the remaining sensors can still estimate the system over the (reduced-connectivity) sensor network with stable steady-state error. Similarly, solutions resilient to the removal of up to q communication links over the network can be considered.
• We extend the stateful detection methods in Doostmohammadian et al. (2021) to rank-deficient systems. We introduce distance measures based on the residual history and probabilistic thresholds via the χ^2 CDF to improve the detection probability and reduce the false-alarm rate.
Paper organisation: Section 2 gives the preliminaries and states the problem. Section 3 describes the networked estimation protocol. In Section 4, the fault detection scenario and the sensor classification are presented. Section 5 presents the algorithms for (i) the fault compensation scenario, to recover from the loss of observability, and (ii) designing q-redundant distributed estimators. Section 6 provides an illustrative example and a comparison with recent literature. Finally, Section 7 concludes the paper.

The framework
In this work, we consider noise-corrupted linear systems in discrete-time as
x_{k+1} = A x_k + ν_k, (1)
where x_k ∈ R^n represents the system state (vector) at time-step k, A = [a_ij] ∈ R^{n×n} is the system matrix, and ν_k ∼ N(0, Q) is the system noise with covariance Q ∈ R^{n×n}. Note that in this paper we make no assumption on the rank of the system matrix A; it may be rank-deficient. Also, to avoid trivial solutions, we make no assumption on the stability of the system, i.e. it is possible that ρ(A) > 1, where ρ(A) denotes the spectral radius of matrix A. System outputs are taken by a group of N sensors as
y^i_k = C_i x_k + f^i_k + ζ^i_k, i = 1, …, N, (2)
where C_i is the measurement/output matrix at sensor i. The above can be written in brief as y_k = C x_k + f_k + ζ_k, where ζ_k ∼ N(0, R) is the measurement noise, which is assumed to be independent at each sensor. Matrix R is the covariance matrix, which is a diagonal matrix of the variances, and f_k represents the additive faults at sensors. We make standard assumptions on the Gaussianity and statistical independence of the noise terms, i.e. E(ν_k ν_m) = 0 and E(ζ_k ζ_m) = 0 for all time-steps k ≠ m. Further, without loss of generality, we assume that every sensor i takes the output of one system state (i.e. C_i is a row-vector and y^i_k is a scalar) and, hence, C = [C_1; …; C_N] ∈ R^{N×n}. Remark 2.1: Typically in the fault/attack detection literature (Chong et al., 2015; Kar & Moura, 2011; Kodakkadan et al., 2017; Lee et al., 2015; Pajic et al., 2015; Rank & Niemann, 1999) it is assumed that the noise terms are of bounded support, i.e. |ζ^i_k| < ζ_max and ‖ν_k‖ < ν_max, with ζ_max and ν_max as some upper-bounds. In this paper, we make no such assumptions and consider the noise in its most general form.
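As a concrete illustration, the model (1)-(2) can be simulated in a few lines. All matrices below are hypothetical placeholders (a random rank-deficient A, identity-row outputs, and an additive fault injected at one sensor), not the paper's example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small example: n = 4 states, N = 4 sensors, each
# measuring one state (so C is the identity here).
n, N = 4, 4
A = rng.standard_normal((n, n))
A[:, 0] = A[:, 1]          # make A rank-deficient (two equal columns)
Q = 0.01 * np.eye(n)       # system-noise covariance
R = 0.01 * np.eye(N)       # measurement-noise covariance (diagonal)

x = rng.standard_normal(n)
for k in range(10):
    nu = rng.multivariate_normal(np.zeros(n), Q)     # system noise nu_k
    x = A @ x + nu                                   # x_{k+1} = A x_k + nu_k
    f = np.zeros(N)
    if k >= 5:
        f[2] = 1.0                                   # additive fault at sensor 3
    zeta = rng.multivariate_normal(np.zeros(N), R)   # measurement noise zeta_k
    y = x + f + zeta                                 # y_k = C x_k + f_k + zeta_k

print(np.linalg.matrix_rank(A))   # 3 < n: rank-deficient system matrix
```

Note that the fault f enters the outputs only, not the state recursion, matching the additive-fault model above.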

Preliminaries on graph theory
We introduce some graph-theoretic concepts needed for structural analysis of system observability. Denote a graph by G = {V, E} with V = {1, …, n} as the set of nodes and E as the set of links i → j (or pairs (i, j)) with i, j ∈ V. A graph is strongly-connected (SC) if there exists a directed path from every node to every other node. In a non-SC digraph, define a strongly-connected-component (SCC) as a component/subgraph in which every node is strongly-connected to every other node in that component. Among the SCCs, define a parent SCC, S^p_i, as an SCC with no outgoing link to any other SCC. The number of parent SCCs is denoted by |S^p| (with |·| as the set cardinality). The decomposition into SCCs and their partial order are obtained via the Depth-First-Search (DFS) algorithm (Cormen et al., 2009). Next, define a maximum matching M as a maximum set of links in E that share no end-node or begin-node. Define a matched node as the end-node of any link in M; otherwise, the node is unmatched. Denote the set of unmatched nodes by δM. One can find the maximum matching and the unmatched nodes via the Dulmage-Mendelsohn (DM) decomposition (Dulmage & Mendelsohn, 1958; Murota, 2000). In graph theory, a connected graph G is called κ-connected (or κ-vertex-connected) if it remains connected after removing κ (or fewer) nodes. From Menger's theorem, the size of the minimum node-cut (or vertex-cut) between two nodes i and j (the minimum number of vertices whose removal disconnects i and j) is equal to the maximum number of pairwise disjoint paths (also referred to as (q+1)-linking (Weerakkody et al., 2017)) from i to j. In particular, we use q-connected digraphs (with q pairwise disjoint paths) to design q-redundant observer networks in Section 5.2, along with a simulation example in Section 6. We refer interested readers to Hellwig and Volkmann (2008) for sufficient conditions for designing κ-connected digraphs and to Lau et al. (2009), Sadeghi and Fan (2019) and Umsonst (2019) for a similar concept in survivable network design.
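To make the SCC notions concrete, here is a minimal stdlib-only sketch (using Kosaraju's algorithm rather than the DFS bookkeeping of Cormen et al.) that computes the SCCs of a digraph and flags the parent SCCs; the node labels and the toy digraph are illustrative:

```python
from collections import defaultdict

def sccs(nodes, edges):
    """Strongly-connected components via Kosaraju's two-pass DFS."""
    adj, radj = defaultdict(list), defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)
    order, seen = [], set()
    def dfs1(u):
        seen.add(u)
        for v in adj[u]:
            if v not in seen:
                dfs1(v)
        order.append(u)          # post-order on the original digraph
    for u in nodes:
        if u not in seen:
            dfs1(u)
    comp = {}                    # node -> component id
    def dfs2(u, c):
        comp[u] = c
        for v in radj[u]:
            if v not in comp:
                dfs2(v, c)
    cid = 0
    for u in reversed(order):    # second pass on the reversed digraph
        if u not in comp:
            dfs2(u, cid)
            cid += 1
    return comp

def parent_sccs(nodes, edges):
    """Parent SCCs: SCCs with no outgoing link to any other SCC."""
    comp = sccs(nodes, edges)
    has_out = {comp[u] for u, v in edges if comp[u] != comp[v]}
    groups = defaultdict(set)
    for u, c in comp.items():
        groups[c].add(u)
    return [g for c, g in groups.items() if c not in has_out]

# Toy digraph: {1,2} and {3,4} are SCCs; only {3,4} has no outgoing link.
nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 3)]
print(parent_sccs(nodes, edges))   # [{3, 4}]
```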

System digraph representation
For the linear system given by (1), define the system digraph G_A = {V_A, E_A} as the graph associated with the system matrix A, where V_A = {1, …, n} is the set of state-nodes and E_A is the set of links. There is a link from state-node i to node j, i.e. (i, j) ∈ E_A or i → j, if A_ji ≠ 0. Many generic system properties, including system observability/controllability (Dion et al., 2003; Lin, 1974), are tightly related to the structure of the system digraph. Moreover, one can detect input faults (Commault et al., 2008) or determine the generic/structural rank (or S-rank) of the system matrix (Harary, 1962) using the system digraph G_A. The following theorem is proved in our previous works (Doostmohammadian & Khan, 2016; Doostmohammadian et al., 2020).
Theorem 2.1: Given the system digraph G_A associated with the dynamic system (1), the following state-outputs are necessary for system observability: (i) the output of (at least one state-node in) every parent SCC; (ii) the outputs of all the unmatched state-nodes in δM.
Note that the number of unmatched nodes in the system digraph is tightly related to the S-rank of the system. It can be proved (Doostmohammadian & Khan, 2013b; Doostmohammadian et al., 2018) that |δM| = n − S-rank(A), (3) with |·| denoting the set cardinality. Recall that the S-rank of the system matrix is the maximum number of non-zero entries in distinct rows and distinct columns of the matrix A (Harary, 1962).
In G_A, the S-rank equals the maximum size of a disjoint family of cycles spanning all state nodes (Reinschke, 1988). Note that, based on (3), the number of unmatched nodes equals the rank-deficiency of the system (matrix). Therefore, there is no unmatched node in the system digraph associated with a full-rank dynamical system, e.g. a self-damped system. It is known that condition (i) in Theorem 2.1 recovers the output connectivity and condition (ii) recovers the system S-rank. Therefore, for full-rank systems (as in Doostmohammadian et al., 2021; Doostmohammadian & Meskin, 2020), condition (ii) is already satisfied.
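The relation (3) between the unmatched nodes and the S-rank can be checked with a small augmenting-path matching on the system digraph. The sketch below uses the bipartite (begin-node vs end-node) representation, and the 4-node path digraph is a made-up example of a rank-deficient structure:

```python
def max_matching(n, edges):
    """Maximum matching of a system digraph via augmenting paths on its
    bipartite (begin-node vs end-node) representation; the matching size
    equals the S-rank, and end-nodes left unmatched form delta-M."""
    adj = [[] for _ in range(n)]
    for i, j in edges:           # link i -> j, i.e. A_ji != 0
        adj[i].append(j)
    match_to = [-1] * n          # match_to[j] = begin-node matched to end-node j
    def augment(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_to[v] == -1 or augment(match_to[v], seen):
                match_to[v] = u
                return True
        return False
    size = sum(augment(u, set()) for u in range(n))
    unmatched = [j for j in range(n) if match_to[j] == -1]
    return size, unmatched

# Hypothetical 4-state path digraph 0 -> 1 -> 2 -> 3 (no cycles/self-loops).
srank, delta_M = max_matching(4, [(0, 1), (1, 2), (2, 3)])
print(srank, delta_M)   # 3 [0]: S-rank 3, one unmatched node (rank-deficiency 1)
```

Here |δM| = 1 = n − S-rank, consistent with (3); adding a self-loop at every node would make the system full-rank with no unmatched node.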

q-redundant observability
Recall that observability refers to the possibility of reconstructing the entire system state x_k from the state measurements (outputs) y_k over a finite time-interval. Denote by O_{A,C} the observability Gramian matrix defined as O_{A,C} = [C; CA; …; CA^{n−1}]. Having rank(O_{A,C}) = n implies that the system (1) is observable via the outputs (2), and the states (at every time k) can be obtained by solving the set of linear algebraic equations (y_k; y_{k+1}; …; y_{k+n−1}) = O_{A,C} x_k on the outputs (in the noise- and fault-free case).
Recall that if conditions (i)-(ii) in Theorem 2.1 hold on the structural system representation, the observability Gramian O_{A,C} is full-rank for almost all numerical values of the non-zero entries in the system matrix A and the output matrix C (Dion et al., 2003; Doostmohammadian & Khan, 2020).
Definition 2.1 (Lee et al., 2019): Define C̄ as the reduced-size output matrix obtained by removing the outputs associated with a given index set, i.e. by setting the corresponding rows of C to all-zeros. The pair (A, C) in (1)-(2) is then said to be q-redundant observable if the pair (A, C̄) remains observable for every such index set of size at most q. The above definition simply implies that the system is observable by using any subset of outputs of size n−q (or larger), or it remains observable after removing any subset of outputs of size q (or smaller), i.e. rank(O_{A,C̄}) = n. We use this notion along with the structural observational-equivalence to design q-redundant networked observers/estimators in Section 5.2.
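For small systems, both the Gramian rank-condition and Definition 2.1 can be checked numerically by brute force over all removals of q outputs. The system and output matrices below are illustrative, not taken from the paper:

```python
import numpy as np
from itertools import combinations

def obsv_rank(A, C):
    """Rank of the observability Gramian O = [C; CA; ...; CA^{n-1}]."""
    n = A.shape[0]
    O = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])
    return np.linalg.matrix_rank(O)

def is_q_redundant(A, C, q):
    """(A, C) is q-redundant observable if it stays observable after
    zeroing any q rows of C (brute force; small examples only)."""
    n, N = A.shape[0], C.shape[0]
    for rows in combinations(range(N), q):
        Cr = C.copy()
        Cr[list(rows), :] = 0.0
        if obsv_rank(A, Cr) < n:
            return False
    return True

# Hypothetical chain system (x1 drives x2 drives x3), 4 sensors with
# one duplicated output so that any single sensor can be removed.
A = np.array([[0.5, 0.0, 0.0], [1.0, 0.5, 0.0], [0.0, 1.0, 0.5]])
C = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
print(obsv_rank(A, C) == 3, is_q_redundant(A, C, 1))   # True True
```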

Problem statement
Given the system and measurements (1)-(2), for (both centralised and distributed) estimation it is necessary that the pair (A, C) be observable. Thus, assuming the observability conditions in Theorem 2.1, we introduce an STS networked estimation protocol that requires only one step of estimate sharing and innovation-update (possibly via sharing outputs) between every two time-steps k and k+1 of the system dynamics. We provide a structural sensor classification to derive the necessary conditions for distributed observability and design the network topology and the gain matrix to stabilise the networked estimation error. In the case of non-zero additive faults at sensors, a residual-based fault detection logic is proposed. A possible bias f^i_k ≠ 0 (deviating the real output from the predicted one) causes the residual (or distance measure) at sensor i to exceed certain probabilistic thresholds, which triggers the fault alarm at that sensor. The schematic of the networked estimator/observer and the local fault detectors is shown in Figure 2. In the case of detecting a faulty sensor i and isolating it (to prevent the distribution of faulty data over the sensor network), a compensation method is proposed that uses observationally-equivalent sensors to recover from the loss of observability. We further propose a technique to design q-redundant networked estimators, such that after the isolation (or removal) of up to q faulty (or failed) sensors the system remains observable (in the distributed sense) to the rest of the sensors over the network.

Single time-scale networked estimation
In this section, we propose a single time-scale estimator. At every time-step k, every sensor sends (to its direct out-neighbours) and receives (from its direct in-neighbours) one packet of data consisting of its estimate and/or output/measurement. Every sensor performs one step of (i) averaging (consensus) on the a priori estimates (to improve its predictions) and (ii) measurement-update (known as innovation) to refine the estimate by the output information, which gives the a posteriori estimate. This represents a distributed predict-and-update mechanism, resembling an alpha-beta-type filter as in the Kalman filter. The innovation phase further helps to find the residual between the predicted output and the true output to detect possible output bias (for model-based FDI). The proposed distributed protocol is as follows:
x̂^i_{k|k−1} = Σ_{j∈N_β(i)} W_ij A x̂^j_{k−1|k−1}, (5)
x̂^i_{k|k} = x̂^i_{k|k−1} + K_i Σ_{j∈N_α(i)} U_ij C_j^⊤ (y^j_k − C_j x̂^i_{k|k−1}), (6)
where x̂^i_{k|k} is the estimate of the system state x_k at sensor i given all the measurements up to time-step k. Define N_β(i) as the in-neighbourhood of sensor i over which sensors share their estimates and N_α(i) as the in-neighbourhood of sensor i over which sensors share their measurements. In this direction, two networks are considered: (i) the network G_β for estimate-sharing and (ii) the network G_α for output-sharing. The combination of the two makes the communication network, or the sensor network. Both G_β and G_α need to be designed such that distributed observability is satisfied (as we explain in the rest of this section). The matrix W includes the consensus weights for averaging the prior estimates, i.e. W_ij is the consensus gain at sensor i on the prior estimate of the in-neighbour sensor j. The structure of W follows the graph topology (or structure) of G_β, while the entries satisfy row-stochasticity (e.g. see Charalambous & Hadjicostis, 2013; Xiao et al., 2005) to ensure the consensus nature of Equation (5). We make no constraining assumption on the entries W_ij, and they follow random consensus weights. In case of using non-random entries, e.g. the Metropolis-Hastings fusion rule (Xiao et al., 2005), the observability should be rechecked numerically (see Doostmohammadian et al., 2021 for more details on this). Define the matrix U as the 0-1 adjacency matrix of G_α, i.e. U_ij = 1 if j → i and U_ij = 0 otherwise. Note that we assume self-cycles (or self-loops) at every node, i.e. U_ii = 1 and W_ii ≠ 0 for all i, since all sensors use their own outputs and a priori estimates. The matrix K_i, representing the feedback gain for stabilising the estimation error at sensor i, needs to fulfil specific constraints for FDI design (discussed in Section 4.1).
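A minimal sketch of one step of the protocol follows, assuming the consensus-on-prior-estimates plus innovation-update form described above; the system, weights, and gains are illustrative placeholders, not a designed example:

```python
import numpy as np

def sts_step(xhat, y, A, C, W, U, K):
    """One single time-scale step at all sensors: (i) consensus on the
    propagated a priori estimates over G_beta (row-stochastic W), then
    (ii) one innovation update with the outputs shared over G_alpha
    (0-1 adjacency U). C[i] is sensor i's output row, K[i] its local gain."""
    N = len(xhat)
    xnew = np.empty_like(xhat)
    for i in range(N):
        pred = sum(W[i, j] * (A @ xhat[j]) for j in range(N))      # a priori
        innov = sum(U[i, j] * C[j] * (y[j] - C[j] @ pred) for j in range(N))
        xnew[i] = pred + K[i] @ innov                              # a posteriori
    return xnew

# Illustrative fault-free run: a stable 2-state system, 2 sensors, each
# measuring one state and sharing both estimates and outputs.
A = np.array([[0.5, 0.1], [0.0, 0.4]])
C = np.array([[1.0, 0.0], [0.0, 1.0]])
W = np.full((2, 2), 0.5)           # row-stochastic consensus weights
U = np.ones((2, 2))                # all outputs shared over G_alpha
K = [0.1 * np.eye(2), 0.1 * np.eye(2)]
x = np.array([1.0, -1.0])
xhat = np.zeros((2, 2))
for _ in range(60):
    x = A @ x
    y = C @ x
    xhat = sts_step(xhat, y, A, C, W, U, K)
print(np.allclose(xhat[0], x, atol=1e-6))   # True: estimates track the state
```

Each sensor performs exactly one consensus step and one innovation step per system time-step, matching the single time-scale premise.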
Remark 3.1: Recall that in the LDS model, knowledge of the system dynamics A at all sensors significantly reduces the number of necessary sensors to satisfy system-observability (the so-called Gramian rank-condition in Section 2), while in the SLS model, with no information of the system dynamics, more sensors (in theory, as many as system states) are needed.
Note that the proposed estimator outperforms the MTS networked estimation proposed in Olfati-Saber (2009), He et al. (2021), He, Ren, et al. (2019), He, Hu, et al. (2019) and Battilotti and Cacace (2021) in terms of real-time capabilities and the network communication load at sensors. In these works, sensors need to perform L steps of averaging (consensus) and L steps of communication between every two steps of the system dynamics k−1 and k. This requires the communication and processing units at the sensors to be L times faster than the sampling of the system dynamics. However, the proposed protocol (5)-(6) only needs one step of consensus and communication between time-steps k−1 and k of the system dynamics; see Figure 3.

Remark 3.2:
In MTS networked estimation, the number of communication iterations L is (at least) more than the diameter d N of the sensor network.This implies that the information of every sensor reaches every other sensor in the network between every two time-steps k−1 and k, and thus, the system becomes observable to every sensor at every time-step k, which makes the solution trivial by imposing high communication/consensus load on sensors.
We need to design the structure of the W and U matrices (to address distributed observability) and to find the structured gain matrix K to stabilise the estimation error in the steady-state. In this direction, define the estimation error at every sensor i at time-step k as e^i_k = x_k − x̂^i_{k|k}. This error can be formulated based on the system parameters in Equations (1) and (2) and the estimation parameters in Equations (5) and (6). Substituting (1)-(2) into (5)-(6) and using the row-stochastic property of the W matrix (so that x_k = Σ_{j∈N_β(i)} W_ij A x_{k−1} + ν_{k−1}), the error at sensor i follows
e^i_k = (I − K_i D^i_C) Σ_{j∈N_β(i)} W_ij A e^j_{k−1} + η^i_k, with D^i_C := Σ_{j∈N_α(i)} U_ij C_j^⊤ C_j,
where η^i_k collects the noise and fault terms as η^i_k = (I − K_i D^i_C) ν_{k−1} − K_i Σ_{j∈N_α(i)} U_ij C_j^⊤ (ζ^j_k + f^j_k). Let e_k = (e^1_k; …; e^N_k) be the collective error vector. Then,
e_k = (W ⊗ A − K D_C (W ⊗ A)) e_{k−1} + η_{k−1} =: Â e_{k−1} + η_{k−1}, (9)
where K := blockdiag(K_i) is the block-diagonal gain matrix (to be designed to stabilise the error dynamics (9)) and D_C := blockdiag(D^1_C, …, D^N_C). Using these definitions, the collective noise vector η_k stacks the terms η^i_k; it can be written compactly via the Kronecker product with 1_N (the column vector of 1s of size N) and the entry-wise (or Hadamard) product '∘' with the adjacency structure of U. Based on Kalman (1960), the error dynamics (9) is stabilisable (in the fault-free case) if the pair (W ⊗ A, D_C) is observable. This condition is referred to as distributed (or networked) observability (Doostmohammadian & Khan, 2016). In this direction, the structure of the matrices W and U (i.e. the topology of the G_β and G_α networks) needs to satisfy (W ⊗ A, D_C)-observability. This is discussed in the following lemma.
Lemma 3.1: The pair (W ⊗ A, D_C) is observable if (i) G_β is strongly-connected and (ii) the outputs necessary for observability of (A, C) in Theorem 2.1 are available over G_α; in particular, every sensor measuring an unmatched state (α-sensor) shares its output with all other sensors over G_α.
Proof: The proof follows from the (minimal) sufficient conditions for observability of Kronecker product networks in Doostmohammadian and Khan (2020). Note that the digraph associated with W ⊗ A (as the overall distributed system matrix) can be represented as the Kronecker network product (also called the tensor graph product) of the graphs G_A and G_β, and then the lemma directly follows from the results in Doostmohammadian and Khan (2020).
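For small examples, the (W ⊗ A, D_C)-observability condition can be verified numerically. The sketch below assumes D_C is block-diagonal with ith block Σ_j U_ij C_j^⊤ C_j, matching the innovation term of the protocol; the matrices are illustrative:

```python
import numpy as np

def distributed_obsv(A, C, W, U):
    """Numerical check of (W (x) A, D_C)-observability, assuming
    D_C = blockdiag(sum_j U_ij C_j^T C_j) as in the innovation update."""
    N, n = U.shape[0], A.shape[0]
    WA = np.kron(W, A)
    DC = np.zeros((N * n, N * n))
    for i in range(N):
        Di = sum(U[i, j] * np.outer(C[j], C[j]) for j in range(N))
        DC[i*n:(i+1)*n, i*n:(i+1)*n] = Di
    O = np.vstack([DC @ np.linalg.matrix_power(WA, k) for k in range(N * n)])
    return np.linalg.matrix_rank(O) == N * n

# Illustrative: 2 states, 2 sensors, each measuring one state.
A = np.array([[1.0, 0.0], [0.0, 2.0]])
C = np.array([[1.0, 0.0], [0.0, 1.0]])
W = np.full((2, 2), 0.5)                              # connected G_beta
print(distributed_obsv(A, C, W, np.ones((2, 2))))     # True: outputs shared
print(distributed_obsv(A, C, np.eye(2), np.eye(2)))   # False: no sharing at all
```

The failing case (identity W and U, i.e. fully disconnected networks) illustrates why information-sharing is essential when no sensor is locally observable.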
We compared the communication rate and network connectivity of different distributed estimation/detection methods in Table 1, considering a large-scale system A with rank(A) = n − n_α, where n_α ≪ n is the system rank-deficiency. From Lemma 3.1, for the proposed STS estimator (and the LDS model), the minimum number of links to ensure distributed observability is n_α(n − 1) on G_α and n on G_β, with a communication rate of 1 per step (O(n_α n) links). This is illustrated in Section 6 with an example. Note that, for large n, the STS estimator for the SLS dynamics needs an all-to-all network with O(n²) communication links, since the system dynamics A is not (needed to be) known. Note also that, later in Section 5, we show that α-sensor i needs to share its output information with sensors that are not observationally-equivalent. Satisfying the above sufficient conditions for distributed observability, one can design the block-diagonal gain matrix K such that the matrix Â in (9) is Schur stable, i.e. ρ(Â) < 1. Further, this gain matrix K is constrained to be block-diagonal, to address the distributed nature of the networked estimator. We design this K by using iterative cone-complementarity optimisation methods (see Ghaoui et al., 1997) as the solution to the Linear-Matrix-Inequality (LMI) problem (10), min trace(XY) subject to the associated LMI constraints, with stopping criteria ρ(Â) < 1 or trace(Y^t X + X^t Y) < 2nN + ε (with predefined ε > 0). The iterative algorithm solving this LMI can be implemented at the same time-scale as the system dynamics, where the agents update the gain matrix K_k and share it at each time-step k to address real-time implementations. For a static sensor network and fixed K, however, it is typical to find this K matrix once offline and provide the K_i s to the sensors before the estimation process. See Doostmohammadian and Meskin (2020), Doostmohammadian and Khan (2013a) and Khan and Jadbabaie (2011) for more information.
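The two structural requirements on the designed gain — Schur stability of the closed-loop error matrix and block-diagonality of K — are cheap to verify numerically. A minimal sketch (the matrices below are hypothetical placeholders, not the paper's design):

```python
import numpy as np

def spectral_radius(M):
    """rho(M): the stability test for the error dynamics is rho(A_hat) < 1."""
    return max(abs(np.linalg.eigvals(M)))

def is_block_diagonal(K, block):
    """Check that K has the blockdiag(K_i) structure required of the distributed gain."""
    n = K.shape[0]
    for i in range(n):
        for j in range(n):
            if (i // block) != (j // block) and K[i, j] != 0:
                return False
    return True

# hypothetical numbers: a Schur-stable 2x2 matrix and a 4x4 block-diagonal gain
A_hat = np.array([[0.5, 0.1], [0.0, 0.8]])
K = np.kron(np.eye(2), np.array([[1.0, 2.0], [3.0, 4.0]]))
print(spectral_radius(A_hat) < 1)   # True: error dynamics would be stable
print(is_block_diagonal(K, 2))      # True
```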

Sensor fault detection: threshold design and sensor classification
This section presents our sensor fault detection logic based on the residual definition and the design of probabilistic thresholds via a specific sensor/output classification. Following the methodology in Sundaram (2012), we first define the residuals for all sensors. Given the estimate x̂^i_k at sensor i, define the estimated output at this sensor as ŷ^i_k = C_i x̂^i_k. Define the residual r^i_k as the difference between the original output y^i_k and the estimated (or predicted) output ŷ^i_k, and consider its absolute value |r^i_k|, where Â_i is the ith hyper-row of Â (i.e. the block of rows n(i − 1) + 1 through ni). The results of the previous section imply that, in the absence of faults, i.e. f^i_k = 0 for all i, the estimation error e^i_k and the residual r^i_k are bounded steady-state stable for all i. Then, one can detect possible faults in the case of sufficiently large and biased residuals. The schematic of the proposed fault detection at every estimator/sensor i is presented in Figure 4.
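The residual computation above amounts to one inner product and a subtraction per sensor. A minimal sketch (the output vector, states, and fault magnitude below are hypothetical):

```python
import numpy as np

def residual(C_i, y_i, x_hat_i):
    """Absolute residual at sensor i: |y_i - C_i x_hat_i|."""
    return abs(float(y_i - C_i @ x_hat_i))

# hypothetical single-state measurement: C_i picks out state 3 with unit gain
C_i = np.array([0.0, 0.0, 1.0])
x_true = np.array([1.0, -2.0, 0.5])
x_hat = np.array([0.9, -2.1, 0.45])
fault = 0.8                           # additive sensor fault f^i_k
y_clean = C_i @ x_true
y_faulty = y_clean + fault
print(residual(C_i, y_clean, x_hat))   # ~0.05: noise-level, no alarm
print(residual(C_i, y_faulty, x_hat))  # ~0.85: biased, over threshold
```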
From the Schur stability of Â, we skip the first term in (11) and focus on the second (fault-dependent) term: in case f^j_k ≠ 0 for (at least) one sensor j ∈ N_α(i), this term biases the error and the residual at sensor i. This implies that, following the G_α network connectivity in Lemma 3.1, the residuals at sensors in the out-neighbourhood of the faulty sensor j might get biased. In case N_α(j) = {j}, sensor j does not share its output over G_α, as j ∉ N_α(i) for any i ≠ j; then f^j_k only appears in the residual r^j_k and the fault can be easily isolated. This motivates the sensor classification in the next subsection.

Sensor classification
Following Lemma 3.1, we classify the sensors based on the type of their state measurement and the structure of the system digraph. For a given system digraph G_A, the three classes of sensors are as follows:
• i ∈ α: every sensor i with output of an unmatched state node (in δM). This type of sensor recovers the system rank-deficiency.
• i ∈ β: every sensor i with output of a state in a parent SCC S_p^l. This type of sensor recovers the output-connectivity of the system.
• i ∈ γ: every sensor i with output of any state unnecessary for observability, i.e. every sensor i ∉ α ∪ β.
The network connectivity requirement of every class (from Lemma 3.1) is defined as follows:3
• Type-α: every α-sensor sends its output via a direct link to every other sensor over the G_α network.
• Type-β: every β-sensor sends its estimates via a path to every other sensor over the network G_β.
• Type-γ: there is no connectivity requirement for a γ-sensor to share its output/prediction. Every γ-sensor only needs to receive output information from α-sensors over G_α and estimates from β-sensors over G_β.
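Identifying the β class requires the parent SCCs of the system digraph. A minimal sketch via Kosaraju's algorithm, under the assumed convention that a parent SCC is an SCC with no outgoing edges to other SCCs (the example graph is hypothetical):

```python
def sccs(adj):
    """Kosaraju's algorithm: returns (list of SCCs, node -> SCC-index map)."""
    order, seen = [], set()
    for root in adj:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(adj[root]))]
        while stack:                      # iterative DFS, record finish order
            v, it = stack[-1]
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(adj[w])))
                    break
            else:
                order.append(v)
                stack.pop()
    radj = {u: [] for u in adj}
    for u in adj:
        for v in adj[u]:
            radj[v].append(u)
    comps, cid = [], {}
    for u in reversed(order):             # DFS on the reverse graph
        if u in cid:
            continue
        comp, stack = [], [u]
        cid[u] = len(comps)
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in radj[v]:
                if w not in cid:
                    cid[w] = len(comps)
                    stack.append(w)
        comps.append(comp)
    return comps, cid

def parent_sccs(adj):
    """SCCs with no outgoing edge to another SCC: at least one state in each
    must be measured (the assumed 'parent SCC' condition)."""
    comps, cid = sccs(adj)
    has_out = [False] * len(comps)
    for u in adj:
        for v in adj[u]:
            if cid[u] != cid[v]:
                has_out[cid[u]] = True
    return [set(comps[i]) for i in range(len(comps)) if not has_out[i]]

# hypothetical digraph: {1,2} is a cycle feeding 3; 3 has no outgoing edges
g = {1: [2], 2: [1, 3], 3: []}
print(parent_sccs(g))  # [{3}]
```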
In case some contractions and parent SCCs share state nodes, the output of such a state node is treated as both Type-α and Type-β; see Doostmohammadian (2019) for more details on the dual case of input classification for controllability. Next, assume a nonzero additive fault on at least one sensor, e.g. f^j_k ≠ 0. Recall from Equation (13) that this may affect the residual of only sensor j or of many other sensors, as discussed in the following:
• Type-α: C_j K_j C_j f^j_k ≠ 0 at the faulty sensor j and C_i K_i C_j f^j_k ≠ 0 at all its out-neighbours i, i.e. sensors with j ∈ N_α(i).
• Type-β and Type-γ: C_j K_j C_j f^j_k ≠ 0 only at the faulty β/γ-sensor j, and C_i K_i C_j f^j_k = 0 for i ≠ j, since N_α(j) = {j}.
In other words, an additive fault at any α-sensor (i.e. ∀i ∈ α) affects the residuals at all of its out-neighbouring sensors, while a fault at any β/γ-sensor only affects the residual at that same sensor. This results in the following fault-detection logic: if only the residual at sensor i is biased (over a certain threshold), then sensor i is faulty; if the residuals at all sensors are biased, one of the α-sensors is faulty. Since, in the latter case, f^j_k ≠ 0 at α-sensor j results in C_i K_i C_j f^j_k ≠ 0 at all out-neighbouring sensors i with j ∈ N_α(i), we need to add a new constraint on the gain K in (10). We redesign the matrix K such that |1 − C_j K_j C_j| is sufficiently large for every α-sensor j, while |C_i K_i C_j| is sufficiently small for all i ≠ j. This implies that the residual at the faulty α-sensor j is greater than the residuals at all other non-faulty sensors i ≠ j, and one can detect the faulty α-sensor as the one with the greatest residual compared to all other sensors.
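The detection logic above can be sketched as a small decision function. This is an illustrative sketch with hypothetical names and inputs, not the paper's formal protocol:

```python
def classify_fault(residuals, thresholds, alpha_sensors):
    """Localised FDI logic sketch: if only sensor i's residual is biased, sensor i
    is faulty; if several residuals are biased, blame the alpha-sensor with the
    largest residual (its self-term |1 - C_j K_j C_j| dominates by the gain design)."""
    biased = [i for i, r in enumerate(residuals) if r > thresholds[i]]
    if not biased:
        return None                  # no alarm
    if len(biased) == 1:
        return biased[0]             # isolated beta/gamma fault
    # many residuals biased: a shared alpha output is faulty
    return max(alpha_sensors, key=lambda i: residuals[i])

# hypothetical 4-sensor case, sensors 1 and 3 of Type-alpha, threshold 0.5 each
print(classify_fault([0.1, 2.5, 0.6, 0.7], [0.5] * 4, {1, 3}))  # 1
```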
Remark 4.1: The design of the gain matrix K for distributed FDI on rank-deficient systems is more constrained than for full-rank systems. In order to isolate the residuals at every α-sensor j, the constrained LMI design in (10) is revised as in (14) to suppress the cascaded bias at the residuals of sensors i (caused by the bias at sensor j ∈ N_α(i)) by a constant 0 < τ < 1: min trace(XY) subject to the revised constraints, with the same stopping criteria as (10).
Note that a smaller constant τ implies smaller residual values at the neighbouring sensors i and, thus, better detection and isolation of the faulty α-sensor j; on the other hand, it may increase the convergence time of the LMI optimisation problem, as it reduces the set of feasible solutions for K. In the next subsection, we design the fault-detection thresholds on the residuals.

Threshold design
In this subsection, using the noise statistics, we design the thresholds to alarm faults whenever the residual is biased over the threshold. Unlike much of the literature (Chong et al., 2015; Kar & Moura, 2011; Kar et al., 2011; Kodakkadan et al., 2017; Lee et al., 2015; Pajic et al., 2015; Rank & Niemann, 1999), which assumes that the noise terms are supported on a bounded interval, i.e. |ζ^i_k| ≤ ζ̄ and/or ‖ν_k‖ ≤ ν̄, we do not limit the system/measurement noise to be of bounded support; see Remark 2.1. This is more realistic in real applications of FDI and estimation, since the (commonly assumed) Gaussian noise (as a random variable) has unbounded support over all real values. This mandates probabilistic threshold design instead of the deterministic thresholds in Pajic et al. (2015), Kar et al. (2011), Chong et al. (2015), Lee et al. (2015), Kodakkadan et al. (2017), Kar and Moura (2011) and Rank and Niemann (1999), which are designed based on the bounds ζ̄ and ν̄.
Let P_k = E(e_k e_k^⊤) and Q = E(η_k η_k^⊤). From Equation (9), P_{k+1} = Â P_k Â^⊤ + Q. Following the Schur stability of Â, in the steady state (as k → ∞) the covariance satisfies P_∞ = Â P_∞ Â^⊤ + Q. Following a similar analysis as in Khan and Jadbabaie (2014), one can bound ‖P_∞‖_2 in terms of ‖Q‖_2, for some b < 1 as a function of ‖Â‖_2. Moreover, in the absence of any faults, the collective noise η_k is expressed in terms of the system and measurement noise, where 1_{NN} denotes the N-by-N matrix of all 1s. Applying the E-operator on both sides of this equation and taking the 2-norm of both sides, define a_1 := ‖I_{Nn} − K D_C‖_2^2 and a_2 := ‖K‖_2^2. Then, using Equation (16), ‖P_∞‖_2 gives the 2-norm of the covariance of the collective error e_k. To find the 2-norm of the covariance of e^i_k at every sensor i, ‖P_∞‖_2 is scaled by the number of sensors N in Equation (20); this equation defines the bound on the error covariance at every sensor i. A similar setup is adopted in Acemoglu et al. (2008) for social learning dynamics. As compared to Acemoglu et al. (2008), in case ‖Q‖_2 → 0 over time (referred to as diminishing innovations) and N → ∞, the error-covariance bound vanishes, implying perfect estimation for any potentially unstable dynamics, in contrast to the neutrally-stable systems in Acemoglu et al. (2008).
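The steady-state covariance is the fixed point of a discrete Lyapunov recursion, which can be computed by simple iteration when the error matrix is Schur stable. A minimal numerical sketch (the matrices below are hypothetical):

```python
import numpy as np

def steady_state_cov(A_hat, Q, iters=500):
    """Iterate P <- A_hat P A_hat^T + Q; converges to P_inf when rho(A_hat) < 1."""
    P = np.zeros_like(Q)
    for _ in range(iters):
        P = A_hat @ P @ A_hat.T + Q
    return P

A_hat = np.array([[0.5, 0.0], [0.2, 0.4]])   # Schur stable (eigenvalues 0.5, 0.4)
Q = 0.01 * np.eye(2)
P_inf = steady_state_cov(A_hat, Q)
# fixed-point check: P_inf satisfies the discrete Lyapunov equation
print(np.allclose(P_inf, A_hat @ P_inf @ A_hat.T + Q))  # True
```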
Following Khan and Jadbabaie (2014), for the fault-free case (and k → ∞ in the steady state), E(e^i_k) = 0 and E(r^i_k) = 0. Then, following the Gaussianity assumption on e^i_k and r^i_k, one can bound the residual probabilistically, where c = ‖C_i‖_1 denotes the 1-norm (the sum of absolute values) of the output vector (c is also referred to as the measurement gain). Assuming one state measurement at every sensor i, c equals the absolute value of the non-zero entry of C_i. Following the notion of confidence intervals, the probabilities of |r^i_k| < 2c + 2R and |r^i_k| < 3c + 3R are (more than) 95% and 99%, respectively. This simply implies that, assuming a non-zero fault at any sensor i, one can introduce probabilistic thresholds on the residual r^i_k to detect the fault. Define T_68% = c + R, T_95% = 2c + 2R, and T_99% = 3c + 3R as the probabilistic thresholds on the residuals. If the residual at sensor i (or at more sensors) is biased over T_68%, the detector triggers the alarm and declares a possible fault, where the probability of false alarm (also known as false positive (Yan et al., 2019)) is less than 32%. This implies a high rate (32%) of incorrectly raising the alarm when there is no true fault at the sensors. To reduce this rate, one can choose higher thresholds on the residuals, e.g. T_99% with (less than) 1% probability of false alarm. For a given low false-alarm rate κ, the threshold T_π = m(c + R) is defined by choosing m such that κ = 1 − erf(m/√2). For example, to have a false-alarm rate less than κ = 3%, the threshold is approximately 2.17(c + R). Similar claims can be stated for stronger fault-detection thresholds, e.g. T_99% = 3c + 3R. In general, for a given residual value r^i_k, the highest probability of false alarm (and the associated threshold) can be defined accordingly (see the illustration in Figure 5). Similar to the false-alarm rate, the probability of false negative (i.e. the detector raises no alarm while there is indeed a non-zero fault at the sensor/output (Yan et al., 2019)) can be defined as illustrated in Figure 5, where the normalised residuals in the presence (red curve) and the absence (blue curve) of the fault follow normal PDFs. Therefore, in the fault-free case, with probability 1 − π the absolute residual |r^i_k| ≥ m(c + R) = T_π and the fault detector triggers the alarm (false positive). With a similar line of reasoning for the faulty PDF (red curve), with probability (1 − π)/2 the detector does not trigger the alarm (false negative).4 Note that these thresholds closely depend on the parameters in (20). To find tighter upper-bounds on ‖P_∞‖_2, one can reduce ρ(Â), ‖I_{Nn} − K D_C‖_2^2, and ‖K‖_2^2 via optimal design of the gain matrix K. This further improves the convergence rate of the networked estimation error in the fault-free case (a direction of our future research). By embedding these thresholds at every sensor node, sensors can detect possible faults locally at the state output, with no need for any centralised decision making. Our proposed simultaneous networked estimation and FDI is summarised in Algorithm 1.
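The thresholds above are multiples m of (c + R), with m tied to the false-alarm rate κ through the Gaussian tail, κ = 1 − erf(m/√2). A minimal sketch of recovering m from a target κ, using only the standard library (an illustrative computation, not the paper's exact design):

```python
from statistics import NormalDist

def threshold_multiplier(kappa):
    """Return m such that a zero-mean Gaussian residual exceeds m standard
    deviations in absolute value with probability kappa (false-alarm rate):
    kappa = 1 - erf(m / sqrt(2))  <=>  m = Phi^{-1}(1 - kappa / 2)."""
    return NormalDist().inv_cdf(1 - kappa / 2)

# 95%-confidence threshold: ~1.96 sigma (the paper's T_95% uses the round value 2)
print(round(threshold_multiplier(0.05), 2))   # 1.96
# ~3 sigma corresponds to a false-alarm rate of about 0.3%
print(round(threshold_multiplier(0.003), 1))  # 3.0
```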

Chi-squared detector via the residual history
One can extend the results of this paper to consider the residual's history over a predefined sliding time-window θ, instead of the residual at every time-step. This is also referred to as stateful detection (Giraldo et al., 2018) via distance measures. Define the sum of the normalised squared residuals (at sensor i) over the sliding time-window θ as the distance measure v^i_k in (22). It is known that the distribution of this distance measure v^i_k follows the Chi-square PDF with θ degrees of freedom, i.e. E[v^i_k] = θ (Doostmohammadian et al., 2021; Greenwood & Nikulin, 1996; Renganathan et al., 2020). Extending the results in Doostmohammadian et al. (2021) to rank-deficient systems, one can design thresholds using the CDF of the χ²_θ-distribution, with κ as the probability of false alarm and Γ⁻¹(·, ·) denoting the inverse regularised lower incomplete gamma function (Greenwood & Nikulin, 1996). However, such a detection method typically raises the alarm after a certain delay depending on θ. This is because, instead of the instantaneous residual at one time-step, the summation in (22) over a (possibly) large time-window θ is considered and compared with the thresholds; therefore, it is the deviation of the residuals over many time-steps within θ that raises the alarm. This is further illustrated by simulation in Section 6.

Algorithm 1: The proposed algorithm for distributed estimation and FDI.
1 Input: System matrix A and digraph G_A, system output matrix C, false-alarm rate κ or detection probability π = 1 − κ.
2 Classify the sensors (α, β, or γ) as in Section 4.1;
3 Design the networks G_α and G_β (the sensor network) via Lemma 3.1;
4 Design the block-diagonal gain matrix K via LMI (14) with small τ < 1;
5 begin at every time-step k every sensor j:
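The windowed detector compares the distance measure against a quantile of the χ²_θ distribution, whose CDF is the regularised lower incomplete gamma function P(θ/2, x/2). A minimal sketch, with the incomplete gamma evaluated by its standard series (illustrative only; a library routine such as scipy's would normally be used):

```python
import math

def chi2_cdf(x, theta):
    """CDF of the chi-square distribution with theta degrees of freedom,
    via the regularised lower incomplete gamma P(theta/2, x/2) (series form)."""
    if x <= 0:
        return 0.0
    s, half_x = theta / 2.0, x / 2.0
    term = 1.0 / s
    total = term
    k = 1
    while term > 1e-12 * total:        # series: sum x^k / (s (s+1) ... (s+k))
        term *= half_x / (s + k)
        total += term
        k += 1
    return total * math.exp(-half_x + s * math.log(half_x) - math.lgamma(s))

def chi2_alarm(v, theta, kappa=0.01):
    """Alarm when the distance measure v falls in the upper kappa-tail."""
    return chi2_cdf(v, theta) > 1 - kappa

# sanity check: for theta = 2 the CDF reduces to 1 - exp(-x/2)
print(abs(chi2_cdf(2.0, 2) - (1 - math.exp(-1.0))) < 1e-9)  # True
```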

Observationally-equivalent state nodes
In this subsection, we propose a strategy to replace a faulty sensor (detected via the logic in the previous section) with an observationally-equivalent sensor. The concept of observational equivalence, introduced in Doostmohammadian and Khan (2016) and Doostmohammadian et al. (2018), provides a methodology to recover the loss of observability by replacing the biased/faulty output with an equivalent output in terms of observability. This concept is better defined on the system digraph (see Section 2.2). Recall from Theorem 2.1 that outputs/measurements of (i) one state in every parent SCC and (ii) every unmatched node are necessary for observability. All state nodes in the same parent SCC S_p^l are observationally equivalent, in the sense that the output of any one of them fulfils the observability condition in Theorem 2.1. On the other hand, the set of unmatched nodes δM and the maximum matching M are not unique in general (Murota, 2000). In this direction, a contraction C_l is defined as a particular subset of state nodes in the system digraph (see Doostmohammadian et al., 2018 for the formal definition). Let the set of all contractions in G_A be denoted by C; for every choice of maximum matching M, there exists one unmatched node in every contraction C_l, and the number of all contractions equals the number of unmatched nodes in G_A, i.e. C = {C_1, . . ., C_|δM|}. It is known that every observation/output of a state node in a contraction C_l recovers the system S-rank by 1 (Doostmohammadian et al., 2018). In other words, denoting the outputs of all nodes in C_l by C_{C_l}, the S-rank recovery holds for any nonempty subset of C_l, which implies the observational equivalence of all state nodes in C_l. From Section 2.3, outputs of states in C_l also recover the rank of the observability matrix O_{A,C_{C_l}} by (at least) 1. We refer interested readers to our previous work (Doostmohammadian et al., 2018) for more information. Define the sets A_i ⊂ α, B_j ⊂ β, with i ∈ {1, . . ., |δM|}, j ∈ {1, . . ., |S_p|}, as observationally-equivalent sensors with outputs from states in C_i and S_p^j, respectively. Using the notion of observational equivalence, our fault compensation logic is as follows:
• Type-α: replace the faulty sensor a ∈ A_l by another sensor a′ ∈ A_l (both a and a′ possess outputs from the same contraction C_l).
• Type-β: replace the faulty sensor b ∈ B_l by sensor b′ ∈ B_l (both b and b′ possess outputs from the same parent SCC S_p^l).
• Type-γ: remove the faulty sensor of this type; it plays no necessary role in system observability.
The above compensation logic assumes that |A_i| ≥ 2 and |B_j| ≥ 2, i.e. there exist (at least) two (observationally-equivalent) state nodes in every parent SCC and contraction. The connectivity of the new substitute α- or β-sensor follows the same connectivity as the removed faulty one (see the connectivity conditions in Section 4.1). For Type-γ, there is no need for a substitute sensor, and the network connectivity of the remaining sensors needs to be adjusted to fulfil the conditions in Section 4.1.
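The unmatched nodes δM (and hence the S-rank deficiency n − S-rank(A)) come from a maximum matching over the bipartite graph of the nonzero pattern of A. A minimal sketch via Kuhn's augmenting-path algorithm (the example structure below is hypothetical):

```python
def max_matching(adj, n_right):
    """Kuhn's augmenting-path maximum matching.
    adj: left node -> list of right nodes (bipartite graph of nonzeros of A).
    Returns the matching size (= S-rank) and the unmatched right nodes."""
    match_r = {}
    def try_augment(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            # v is free, or its current partner can be re-matched elsewhere
            if v not in match_r or try_augment(match_r[v], seen):
                match_r[v] = u
                return True
        return False
    size = sum(try_augment(u, set()) for u in adj)
    unmatched_right = [v for v in range(n_right) if v not in match_r]
    return size, unmatched_right

# hypothetical 3-state structure with S-rank 2: right node 2 stays unmatched,
# so the rank deficiency is n_alpha = 1 and one alpha-sensor is needed
adj = {0: [0], 1: [0, 1], 2: [1]}
size, unmatched = max_matching(adj, 3)
print(size, unmatched)  # 2 [2]
```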

Proposed algorithm for q-redundant networked observability
Following the definitions in Section 2, the concept of observational equivalence is closely related to q-redundant observability. We are interested in designing resilient networked estimators that tolerate the isolation/removal of faulty (or failed) sensors. In this direction, Algorithm 3 designs q-redundant observable estimators: q detected faulty sensors (via the FDI logic in Algorithm 1) can be isolated/removed while the remaining sensor network preserves distributed observability. Note that q is limited by the minimum size of the observationally-equivalent sets (i.e. contractions and parent SCCs); this minimum size is denoted by q̄ in Algorithm 3. Similarly, one can extend the results to design q-edge-connected networks which remain SC after the removal of (up to) q links. This is referred to as survivable network design (Lau et al., 2009; Sadeghi & Fan, 2019; Umsonst, 2019) and is particularly related to the connectivity requirement of the Type-β sensors. In other words, designing a q-edge-connected sensor network ensures strong connectivity after the removal of (up to) q links (or q lost-connectivity/missing-packet events), which guarantees the connectivity requirement of sensors over G_β. Similarly, for G_α, one can add more links from observationally-equivalent α-sensors.

Illustrative simulation
We consider a linear system of 12 states with the digraph shown in Figure 6. The link weights (chosen randomly in [0.02, 1.1]) represent the non-zero entries of the system matrix A; for example, the link from state 1 to 2 with weight 0.79 implies that a_21 = 0.79 (a_21 is the entry at column 1 and row 2 of A). The system is considered in the discrete-time form (1) with noise ν_k ∼ N(0, 0.01).

Algorithm 3 (excerpt): For 1 ≤ q ≤ q̄, define N = (q + 1)(|S_p| + |δM|) as the minimum number of necessary state outputs for q-redundant observability; assign N sensors to q + 1 state outputs from every contraction and parent SCC; any sensor l with outputs of states in C_i ∩ S_p^j ≠ ∅ is both Type-α and Type-β, i.e. l ∈ A_i, l ∈ B_j.

We assign sensors β_1, β_2 (Type-β) and α_1, α_2 (Type-α), with measurement noise ζ ∼ N(0, 0.01). These outputs fulfil the necessary conditions for (A, C)-observability (see the observability conditions in Doostmohammadian & Khan, 2020). The networks G_α and G_β are designed as given in Figure 6, satisfying the distributed observability conditions in Lemma 3.1. The networked estimation protocol follows Equations (5) and (6), where sensors share their estimates over the SC network G_β (solid arrows) and their measurements over the network G_α (dashed arrows). The link weights in G_β are chosen randomly such that the incoming link weights sum to 1, to satisfy the row-stochasticity of the consensus matrix W; for the network G_α the link weights are equal to 1. The output gains, i.e. the non-zero entries of the C matrix, are also equal to 1. The system is unstable, as ρ(A) = 1.155 > 1. In G_α, the α-sensors are the hubs of the network and directly send their outputs to the other sensors. We choose τ = 0.2 in the LMI gain design (14), so that the residual at a faulty α-sensor (say α_2) is about 1/0.2 = 5 times greater than the residuals at the other non-faulty sensors (say α_1, β_1, β_2). The resulting conditions follow (14); for the sake of space, only K_{α_2} (the feedback gain at sensor α_2) is given as an example and the rest are skipped.
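The row-stochastic consensus matrix W used above can be generated by drawing random positive weights on the incoming links of each sensor and normalising each row. A minimal sketch (the 4-sensor cycle below is a hypothetical G_β, not the one in Figure 6):

```python
import random
import numpy as np

def row_stochastic(adj, n, seed=0):
    """Build a consensus matrix W for a digraph: random positive weights on
    each sensor's incoming links (plus a self-link), normalised per row."""
    rng = random.Random(seed)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = adj[i] + [i]              # in-neighbours plus self-loop
        w = [rng.uniform(0.1, 1.0) for _ in nbrs]
        total = sum(w)
        for j, wj in zip(nbrs, w):
            W[i, j] = wj / total
    return W

adj = {0: [3], 1: [0], 2: [1], 3: [2]}   # hypothetical 4-sensor cycle
W = row_stochastic(adj, 4)
print(np.allclose(W.sum(axis=1), 1.0))   # True: W is row-stochastic
```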
Next, we replace the two faulty sensors with new observationally-equivalent sensors, via the results in Section 5, to recover the loss of observability. The unmatched state 10 (with output to sensor α_2) belongs to the contraction C_2 = {10, 11}; we replace α_2 by the non-faulty sensor α′_2 with output of state node 11. Similarly, sensor β_1 with output of state node 1 in the parent SCC S_p^1 = {1, 2, 3} is replaced by the non-faulty sensor β′_1 with output of state node 2. This new setup is shown in Figure 8, where the state nodes in the contraction C_2 and the parent SCC S_p^1 are coloured.

Figure 8. This figure shows the setup to compensate for the loss of observability (after isolating/removing the faulty sensors α_2 and β_1 in Figure 6). The sets of orange-coloured and green-coloured states respectively belong to a parent SCC and a contraction. The faulty sensors α_2 and β_1, with outputs from states 10 and 1 (in Figure 6), are replaced with new sensors α′_2 and β′_1, respectively with outputs of the equivalent states 11 and 2 in the system digraph. The new setup ensures system observability and results in bounded stable MSEE, as shown in Figure 9.

Figure 9 presents the performance of the networked estimation protocol under the compensated setup (after replacing the faulty sensors with their equivalents). As is clear, the MSEE is bounded steady-state stable at all sensors with no (considerable) residual, implying the recovery of distributed observability in the new setup of Figure 8.
Note that, for this example system digraph, q̄ = 1 from Algorithm 3, as the minimum size of the contractions and parent SCCs is equal to 2. Thus, one can design a 1-redundant observer, as shown in Figure 10. The design follows from Algorithm 3: we have N = 8 and one overlap (the contraction C_1 and a parent SCC share the states {4, 9}), so the minimum number of necessary outputs in this case is 7. Therefore, the sensor with output of state 9 is both Type-α and Type-β, shown as α_1β_1 in Figure 10. Every sensor receives two direct links, one from each member of the sets A_1 = {α_1, α′_1} and A_2 = {α_2, α′_2}. The undirected network G_β, as an undirected cycle of all sensors, is 1-connected (see Figures 8 and 9).

Comparison with recent literature
In this subsection, we compare our proposed methodology with the recent works He et al. (2021) and He, Ren, et al. (2019), as they make similar assumptions on system stability and distributed observability, while most of the literature considers stable dynamics and local observability in the direct neighbourhood of the sensors. The threshold design in He et al. (2021) and He, Ren, et al. (2019) is deterministic and claims to detect faults of a certain magnitude. We run this MTS protocol over the same setup as in the previous subsection (Figure 6) for comparison, using the same system parameters and outputs. The only difference is that in He et al. (2021) and He, Ren, et al. (2019) the sensor network needs to be undirected, so we used the bidirectional G_β with weights W̃ = (W + W^⊤)/2 instead. Table 2 gives the chosen parameters of the MTS protocol. Adopting these parameters, the MSEEs at all 4 sensors are shown in Figure 11. According to He et al. (2021) and He, Ren, et al. (2019), a fault is detected whenever the measurement-update at a sensor is over the (deterministic) threshold. From the figure, one can see that both faults, at sensors α_2 and β_1, are detected. However, the detector falsely triggers the alarm at sensor β_2.
The improvements of our proposed distributed strategy over this MTS solution are as follows:
• First (and most important), the MTS strategy in He et al. (2021) and He, Ren, et al. (2019) requires L (in this example, 20) consensus and communication iterations between time-steps k and k + 1, where L needs to be (at least) greater than the diameter d_N of the sensor network and, further, large enough for resilient and fault-tolerant filtering. L > d_N guarantees that the information of every sensor reaches every other sensor between k and k + 1; thus, the system becomes locally observable to all sensors at all time-steps k via the many iterations of consensus/communication. In contrast, our proposed networked estimator gains distributed observability at the same time-scale k via the established network topology and LMI gain design. The MTS approach in He et al. (2021) and He, Ren, et al. (2019) requires (L-times) faster communication facilities along with (L-times) faster computation units to process the received information at every sensor. In large-scale systems, e.g. geographically distributed grid monitoring with long-range communications and sparse network connectivity (large diameter d_N), the MTS approach is more costly and even infeasible. Our proposed strategy demands low-cost equipment with equivalent (and even better) estimation and FDI performance (compare the MSEE performance in Figures 7 and 11). Following Table 1, here the 'network-connectivity' × 'communication-rate' is 8 × 20, as compared to (6 + 4) × 1 in Figure 6.
• The strategy in He et al. (2021) and He, Ren, et al. (2019) detects the fault but provides no recovery solution for faulty sensors, i.e. all other sensors keep receiving biased information from the faulty ones. Having few averaging iterations (small L) may bias the MSEEs at all sensors (see Figure 11). To overcome this, the number of averaging iterations L needs to be increased, which, in turn, demands even faster communication/data-processing. In contrast, using our recovery method via observational equivalence, one can recover the loss of observability and even add redundant sensors to make the system q-redundant observable. For example, compare the MSEE performance of Figure 9 (after observability recovery) and Figure 11.

Figure 10. (Right) The proposed network topology to design a 1-redundant networked observer/estimator (following Section 4.1). The networked estimation can tolerate removing/isolating any one faulty sensor. For example, assume sensor α_1 is isolated by cutting all its communications (coloured red). The black-coloured links show the reduced communication networks G_β (solid links) and G_α (dashed links). The conditions in Lemma 3.1 and Theorem 2.1 hold on the reduced setup, i.e. the system (digraph) is structurally observable (in the distributed sense) to all other sensors communicating over the reduced network. This 1-redundant networked observer can tolerate the isolation/removal of one sensor from every observationally-equivalent set A_1, A_2, B_1, B_2, as shown in Figures 6-7 and 8-9.

Figure 11. The MSEE performance of the MTS protocol in He et al. (2021) and He, Ren, et al. (2019), with L = 20 steps of averaging/consensus between every two consecutive steps of the system dynamics. The measurement-updates (while tracking the unstable system in Figure 6 via the MTS protocol) are given at all 4 sensors. The figure shows that the faults at sensors α_2 and β_1 are above the threshold (detected), while the detector raises a false alarm at sensor β_2.

Conclusion
In this paper, a residual-based distributed FDI method over distributed estimation networks, together with probabilistic thresholds on the residuals, is proposed to detect possible additive faults. We validated our results on an academic simulation example.
The performance measures of the proposed joint estimation and FDI algorithm are summarised as follows: (i) no need for high communication/computation rates and costly networking/processing resources, as compared to MTS estimators; (ii) less network traffic/connectivity, as compared to locally observable STS estimators; (iii) more accurate detection via probabilistic threshold design with no simplifying upper-bound assumption on the noise support, as compared to deterministic thresholds; and (iv) estimation recovery via adding observationally-equivalent sensors and designing q-redundant, fault-tolerant distributed estimators.
It is worth mentioning that the proposed methodology is of polynomial-order complexity, which makes it applicable to large-scale systems. The LMI approach to design the gain matrix is of polynomial-order complexity (Nesterov & Nemirovskii, 1994; Ye, 1993). Moreover, the computational complexity of the DFS algorithm is O(n²) (Cormen et al., 2009), and the computational complexity of the most efficient algorithm for DM decomposition and finding contractions in the system digraph is O(n^2.5) (Micali & Vazirani, 1980). As a direction of future research, we aim to consider possible time-delays over the communication network, using the results in Hadjicostis and Charalambous (2014) and Doostmohammadian et al. (2021).

Notes
1. The relevant concepts of contractions and parent SCCs (strongly-connected components) and other structural observability notions are defined later in Section 2.1.
2. Throughout this paper, the terms output and measurement are used interchangeably.
3. Note that in this section we do not consider observationally-equivalent sensors, e.g. with outputs from the same parent SCC S_p^l; this is discussed in the next section.
4. Note that the given probability of false negative is an approximate value, as we consider the absolute residual |r^i_k|. This implies a folded normal distribution, i.e. the probability mass to the left of 0 is folded over to the right by taking the absolute value; the exact probability of false negative follows accordingly. For m ≥ 2, one can approximate κ by 1 − π.

Figure 1 .
Figure 1. This figure shows a possible application of distributed fault detection and state estimation in CPES (Ilić et al., 2010), in contrast to centralised solutions. A geographically distributed sensor network monitors a renewable-energy grid including wind and solar farms. Localised estimation and detection strategies enable distributed data-processing (e.g. in a cloud-based architecture) and monitoring with no need for central coordination. This prevents a single point of failure and a global shut-down of the entire grid, by localising the detection and isolation of faulty assets.

Figure 2 .
Figure 2. This figure shows the schematic of the proposed localised detection via a network of local estimators. All sensors are subject to noise and possible faults. The networked estimators at the sensors cooperatively track the states of the dynamical system (e.g. an energy grid) locally, while having partial observability. A localised detector is embedded at every sensor, which triggers the alarm in case the residual is over a probabilistic threshold (with specified false-alarm (false-positive) and false-negative rates).

Figure 3 .
Figure 3. This figure shows the difference between single and multi time-scale estimators (STS versus MTS). In STS (left), every sensor performs only one step of communication and consensus update between two consecutive time-steps k and k + 1 of the system dynamics (hence, the same time-scale). In the MTS scenario (right), sensors perform many steps of communication and consensus update between k and k + 1 to gain local observability (hence, multi time-scale).

Figure 4 .
Figure 4. Following Figure 2, this figure shows the procedure of fault detection at every sensor/estimator. The difference between the true output and the estimated output gives the residual at sensor/estimator i, i.e. r^i_k = |y^i_k − C_i x̂^i_k|. The detector at estimator i compares the residual with the probabilistic threshold T_π (designed in Section 4.2) and raises the alarm if r^i_k ≥ T_π. The threshold T_π can be designed based on a specific false-alarm rate.

Figure 5. This figure shows an example distribution of the non-faulty residual (blue curve) versus the faulty residual (red curve) at sensor i. The green and red vertical lines represent two example residuals r^i_k: the green residual is very close to the expected value (zero) of the fault-free PDF and, thus, is most likely due to system/measurement noise; the red residual is far from the expected value (zero) and, thus, faulty with probability more than 68.3% and less than 95.4% (based on the confidence intervals shown as grey vertical lines). More accurate probabilities of detection and false alarm for this example can be defined as π = erf(1.5/√2) = 86.7% and κ = 1 − π = 13.3% (for r^i_k ≥ T_π = 1.5(c + R)), respectively. Considering the absolute value of the residual r^i_k, the blue shaded area equals (1 − π)/2, representing half of the probability of false-positive (the other half on the left side is not shown for simplicity), and the red shaded area equals (1 − π)/2, representing the (approximate) false-negative probability.
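The probabilities in this example follow from the Gaussian error function alone, so they can be checked directly. The sketch below assumes, as in the caption, a zero-mean Gaussian fault-free residual, for which P(|r| < m·σ) = erf(m/√2); the 1.5-σ level is the caption's choice, and the resulting 0.8664 is what the caption rounds to 86.7%.

```python
from math import erf, sqrt

def detection_prob(m):
    """P(|r| < m*sigma) for a zero-mean Gaussian residual."""
    return erf(m / sqrt(2))

pi = detection_prob(1.5)   # confidence that a residual beyond 1.5 sigma is faulty
kappa = 1 - pi             # corresponding false-alarm rate

assert abs(pi - 0.8664) < 1e-3
assert 0.683 < pi < 0.954  # strictly between the 1-sigma and 2-sigma levels
```

Raising the threshold multiplier m trades false alarms (κ shrinks) for missed detections, which is the design knob behind the probabilistic thresholds of Section 4.2.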

Figure 6. This figure shows a setup for networked estimation over an example system digraph (left), with outputs of the blue-coloured state nodes taken by the red-coloured "sensor" nodes. The communication network (right) of the sensors includes the network G_α (dashed arrows), over which α-sensors share their outputs with every other sensor, and the SC network G_β (solid arrows), over which sensors share their estimates. See Section 3 for details.

Figure 7. This figure shows the residuals and MSEEs for the setup given in Figure 6 with given faults at sensors β1 and α2. The residuals at the faulty sensors β1 and α2 are over the threshold T_95%, implying possible faults with probability more than 95%. For the same faults, the distance measures are over the threshold T_99.7%, and the Chi-squared detector reveals a higher detection probability of 99.7%; however, the distance measures go over the thresholds with a certain delay (i.e. a delay in raising the alarm).
He et al. (2021) and He, Ren, et al. (2019) propose an MTS protocol for distributed filtering and fault detection, which requires L steps of consensus between two consecutive steps of the system dynamics. The detection logic in He et al. (2021) and He, Ren, et al. (

Figure 9. The MSEEs, residuals, and distance measures at all 4 sensors in the recovered setup of Figure 8 are shown here. The steady-state stability of the MSEEs at all sensors implies the networked observability of the proposed recovered estimation setup.

Figure 10. (Left) Two outputs from states in every parent SCC and contraction are considered in the system digraph, so that the pair (A, C) is 1-redundant observable. (Right) The proposed network topology to design a 1-redundant networked observer/estimator is shown (following Section 4.1). The networked estimation can tolerate removing/isolating any one faulty sensor. For example, assume S_α1 is isolated by cutting all its communications (coloured red). The black-coloured links show the reduced communication networks G_β (solid links) and G_α (dashed links). The conditions in Lemma 3.1 and Theorem 2.1 hold for the reduced setup, i.e. the system (digraph) is structurally observable (in the distributed sense) to all other sensors communicating over the reduced network. This 1-redundant networked observer can tolerate the isolation/removal of 1 sensor from every observationally-equivalent set A1, A2, B1, B2, as shown in Figures 6-7 and 8-9.
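The 1-redundancy property of the network side can be verified mechanically: after isolating any single sensor, the remaining G_β must stay strongly connected. The sketch below checks this for a hypothetical 4-sensor digraph (an illustrative topology, not the paper's exact G_β), using a plain two-way reachability test for strong connectivity.

```python
def reachable(adj, src):
    """All nodes reachable from src by depth-first search."""
    seen, stack = {src}, [src]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def strongly_connected(nodes, edges):
    """True if every node reaches, and is reached from, every other node."""
    adj, radj = {}, {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        radj.setdefault(v, []).append(u)
    s = next(iter(nodes))
    return reachable(adj, s) >= set(nodes) and reachable(radj, s) >= set(nodes)

nodes = {1, 2, 3, 4}
edges = [(1, 2), (2, 3), (3, 4), (4, 1),   # outer cycle
         (1, 3), (3, 1), (2, 4), (4, 2)]   # redundant chords

assert strongly_connected(nodes, edges)
for s in nodes:                             # isolate each sensor in turn
    rest = nodes - {s}
    kept = [(u, v) for u, v in edges if u != s and v != s]
    assert strongly_connected(rest, kept)   # G_beta survives the removal
```

A plain directed cycle would fail this loop (removing any node breaks it), which is why the redundant chords, i.e. the extra network connectivity demanded in Section 2.1, are needed for 1-redundant designs.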

Figure 11. The MSEEs at the 4 sensors are shown using the MTS protocol of He et al. (2021) and He, Ren, et al. (2019) via L = 20 steps of averaging/consensus between every two consecutive steps of the system dynamics. The measurement updates (while tracking the unstable system in Figure 6 via the MTS protocol) are given at all 4 sensors. This figure shows that the faults at sensors α2 and β1 are above the threshold (detected), while the detector raises a false alarm at sensor β2.

Table 1. Comparison between different distributed estimation/detection methods in terms of 'network-connectivity' × 'communication-rate'.

Table 2. He, Ren, et al. (2019); the MTS protocol in He et al. (2021) and He, Ren, et al. (2019).

(Section 2.1), i.e. by removing any 1 sensor (or any 1 link) the network G_β remains SC. Therefore, after the removal/isolation of any α or β sensor (for example, sensor α1), the remaining sensor network still satisfies Lemma 3.1 and Theorem 2.1 (i.e. distributed observability is preserved). Recall that the performance of this 1-redundant networked observer/estimator after removing one sensor from every observationally-equivalent set A1, A2, B1, B2 is given, for example, in Figures 6 and 7 and in Figures