Low cost network traffic measurement and fast recovery via redundant row subspace-based matrix completion

Traffic matrices (TMs) are essential for managing networks, but obtaining complete TMs is difficult because of the high measurement cost. Several recent studies propose sparse measurement schemes that reduce this cost by measuring only a subset of origin-destination pairs (OD pairs) and inferring the data of unmeasured OD pairs through matrix completion. However, existing sparse network measurement schemes suffer from high computation costs and low recovery quality. This paper investigates the coherence feature of real traffic flow data traces (Abilene and GÈANT). Both data sets are highly coherent, with column coherence greater than row coherence. Based on this coherence feature, we propose Redundant Row Subspace-based Matrix Completion (RRS-MC). RRS-MC involves several techniques. First, we design an algorithm that identifies subspace rows (OD pairs) from historical data. Second, based on the identified subspace rows, we design a sampling scheduling algorithm that takes full measurement samples in the subspace rows and only partial measurement samples in the remaining rows. Moreover, we propose a redundant sampling rule to prevent the loss of recovery accuracy caused by variation in the subspace rows. Finally, we design a completion algorithm to recover the partially measured rows. Extensive experiments show that the proposed scheme is superior to state-of-the-art sampling and completion schemes in both computation cost and recovery accuracy.


Introduction
A traffic matrix (TM) reveals the network traffic volumes between OD pairs in a time slot. Viewed over multiple time slots, the results form traffic matrices (TMs). TMs are essential for many network tasks, such as capacity planning, traffic engineering, anomaly detection, and congestion control (Huang et al., 2022; Li et al., 2023a; Roughan et al., 2003; Tang et al., 2021; Xie et al., 2016, 2017, 2019).
The traffic flow exchanged between OD pairs can be measured using flow-based measurement tools such as NetFlow (Claise, 2004) and sFlow (Wang et al., 2004). However, because these tools frequently consume excessive CPU and memory resources (Yu et al., 2013), obtaining the TMs by measuring the traffic volumes between all OD pairs at all time slots is impractical due to the high measurement overhead (Li, Liang et al., 2023; Xie et al., 2017, 2021, 2022). To reduce measurement overhead, matrix completion-based sparse network monitoring has attracted much recent interest (Li et al., 2023b; Xie, Wang et al., 2017). Sparse network monitoring aims to obtain the entire TM by taking measurements on only a portion of the OD pairs and inferring the unmeasured data through matrix completion. According to matrix completion theory, if the target matrix has a low-rank structure, it can be accurately recovered from a subset of samples. Some studies (Roughan et al., 2011; Xie et al., 2021) have shown that TMs exhibit temporal and spatial correlation, implying that TMs meet the low-rank prerequisite for matrix completion. Although promising, existing matrix completion-based sparse network monitoring suffers from two major problems:
• Low recovery accuracy: In addition to the low-rank feature, matrix coherence is another essential feature in matrix completion theory. When the target matrix is highly coherent, not all entries reveal the same amount of information, so most of the information will be missed if we take samples randomly. Using such random samples to recover the target matrix leads to poor recovery accuracy. Some studies (Roughan et al., 2011) apply matrix completion to infer unmeasured data from random measurement samples; however, these samples may not carry enough information to recover the unmeasured data accurately.
• Long processing time: To achieve better recovery accuracy, some studies (Liu & Wang, 2020; Xie et al., 2015) select measurement sampling locations adaptively within the matrix completion framework. However, selecting better sampling locations usually requires multi-round iterations, which results in high computational costs. Besides, existing matrix completion algorithms (e.g. SRMF (Roughan et al., 2011), SVT (Cai et al., 2010), ALM (Lin et al., 2010)) involve an iterative parameter-training process, which also incurs high computational costs.
Subspace-based matrix completion (Balcan & Zhang, 2016; Jin et al., 2022; Krishnamurthy & Singh, 2013, 2014; Wan et al., 2018) has a speed advantage among matrix completion techniques. It recovers the entire matrix using only simple matrix calculations and does not involve multi-round iteration. However, using the subspace-based method for sparse network monitoring faces two problems: (1) for subspace-based matrix completion, we can select some columns to form a column subspace or some rows to form a row subspace; which is better for the sparse network monitoring scenario? (2) Even though the completion algorithm itself is fast, identifying the subspace through adaptive sampling is time-consuming.
Unlike the above sparse network monitoring schemes (Liu & Wang, 2020; Xie et al., 2015), we model the network traffic traces as a matrix whose rows denote OD pairs and whose columns denote time slots, taking advantage of both spatial and temporal information. We study the coherence feature of real network traffic traces and, based on it, propose a matrix completion scheme built on the row subspace. Our contributions can be summarised as follows:
• We study the coherence feature of two real traffic flow traces (Abilene, Zhang et al., 2003, and GÈANT, Uhlig et al., 2006). We show the relationship between the numerical values of a matrix and high coherence, and illustrate the negative impact of high column coherence on column subspace-based sampling and completion algorithms.
• Based on the coherence feature of network traffic data, we propose Redundant Row Subspace-based Matrix Completion (RRS-MC). RRS-MC constructs a subspace by fully sampling the subspace rows; each remaining row is partially sampled and recovered through the subspace. We identify the subspace rows by training on a small amount of historical data. Moreover, we add some redundant samples so that the subspace is obtained robustly, preventing the loss of recovery accuracy caused by data variation.
• We conduct comprehensive experiments on the two real traffic flow traces (Abilene, Zhang et al., 2003, and GÈANT, Uhlig et al., 2006). The experimental results show that our method obtains the whole monitoring data with high recovery quality at a lower measurement cost and has a clear advantage in processing speed.
The rest of this paper is organised as follows. We present notations and basic definitions in Section 2. In Section 3, we introduce the related work. We study the features of real traffic traces in Section 4. We formulate the problem and give a solution overview in Section 5. In Section 6, we propose our sampling schedule algorithm and the corresponding completion algorithm. We evaluate the performance of the proposed algorithms through extensive experiments in Section 7. Finally, we conclude the work in Section 8.

Preliminaries
In this section, we give the notations and definitions. Table 1 shows a list of the notations used in this paper.
Definition 2.1 (Full SVD and Skinny SVD, Liu et al., 2012): For an n1 × n2 matrix M, its Singular Value Decomposition (SVD) is defined as M = UΣV^T, where U ∈ R^(n1×n1) and V ∈ R^(n2×n2) are unitary matrices, and Σ ∈ R^(n1×n2) is a diagonal matrix with elements in decreasing order, i.e. Σ = diag(σ_1, σ_2, ..., σ_r, 0, ..., 0), where r is the rank of the matrix and σ_i is the ith singular value. The diagonal elements are also known as singular values. The SVD defined in this way is called the full SVD.
If we keep only the positive σ_i, the reduced form is called the skinny SVD. For a matrix M of rank r, its skinny SVD is computed as M = U_r Σ_r V_r^T, where Σ_r = diag(σ_1, σ_2, ..., σ_r) with {σ_i}, i = 1, ..., r, being the positive singular values. More precisely, U_r and V_r are formed by taking the first r columns of U and V, respectively.
Definition 2.2 (Leverage Scores, Chen et al., 2015): For an n1 × n2 real-valued matrix M of rank r with skinny SVD M = U_r Σ_r V_r^T, its (normalised) leverage scores are defined as
μ_i(M) = (n1/r) ||U_r^T e_i||_2^2, i = 1, ..., n1,   ν_j(M) = (n2/r) ||V_r^T e_j||_2^2, j = 1, ..., n2,
where μ_i(M) corresponds to any row i, ν_j(M) corresponds to any column j, and e_i (resp. e_j) denotes the ith (resp. jth) standard basis vector of the appropriate dimension.
Definition 2.3 (Coherence): For an n1 × n2 matrix M of rank r whose skinny SVD is M = U_r Σ_r V_r^T, where U_r ∈ R^(n1×r) and V_r ∈ R^(n2×r) have orthonormal columns, let U_r and V_r span the column subspace and row subspace of M. The column-coherence is defined as
μ(U) = (n1/r) max_{1≤i≤n1} ||U_r^T e_i||_2^2,   (3)
and the row-coherence is defined as
μ(V) = (n2/r) max_{1≤j≤n2} ||V_r^T e_j||_2^2,   (4)
where e_i (resp. e_j) denotes the ith (resp. jth) standard basis vector of the appropriate dimension. Column-coherence and row-coherence are bounded by 1 ≤ μ(U) ≤ n1/r and 1 ≤ μ(V) ≤ n2/r. The coherence parameter of M is defined as μ(r) := max(μ(U), μ(V)). The leverage scores can be viewed as a local version of the coherence parameter.
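The leverage scores and coherences above follow directly from a skinny SVD. The following NumPy sketch (Python, whereas the paper's experiments use Matlab; the function name is ours) illustrates Definitions 2.2 and 2.3:

```python
import numpy as np

def leverage_and_coherence(M, r):
    """Leverage scores (Definition 2.2) and coherences (Definition 2.3) of M at rank r."""
    n1, n2 = M.shape
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    Ur, Vr = U[:, :r], Vt[:r, :].T            # first r singular vectors (skinny SVD)
    mu = (n1 / r) * np.sum(Ur ** 2, axis=1)   # row leverage scores mu_i(M)
    nu = (n2 / r) * np.sum(Vr ** 2, axis=1)   # column leverage scores nu_j(M)
    # Coherences mu(U), mu(V) are the maxima of the leverage scores.
    return mu, nu, mu.max(), nu.max()
```

For a perfectly incoherent matrix (e.g. an all-ones matrix at rank 1) every leverage score equals 1, the lower bound.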

Related work
In this section, we review related work.First, we introduce the existing traffic trace recovery techniques and matrix completion-based sparse network monitoring methods.Then, we describe the subspace-based matrix completion and the corresponding adaptive sampling schemes.

Traffic data recovery techniques and sparse network monitoring schemes
A series of studies has been proposed to deal with missing traffic data, utilizing spatial (Lakhina et al., 2004; Zhang et al., 2005) or temporal (Barford et al., 2002) features of the traffic data to infer the missing entries. Most of these methods have poor data recovery performance.
Utilizing both spatial and temporal features of traffic matrices, SRMF (Roughan et al., 2011) proposes the first spatio-temporal model, which recovers missing data by matrix factorization (MF). Following SRMF, several studies (Y. C. Chen et al., 2014; Diao et al., 2022; Hu et al., 2022; Hu, Hsieh et al., 2022; Li et al., 2023b; Liang et al., 2022) have been proposed to recover missing data from partial measurement samples. However, such schemes do not consider the quality of the measurement samples: when the measured data does not carry enough information, these methods may suffer from low recovery accuracy.
To ensure recovery performance, i.e. to make the measurement samples carry enough information to recover the unmeasured data, several studies adaptively select measurement samples within the matrix completion framework. MC-E2E (Xie et al., 2015) proposes a dynamic sampling strategy that determines which OD pairs need to be measured based on some initial random samples. Modelling the traffic matrix with rows representing origin nodes and columns representing destination nodes, the main steps of MC-E2E are as follows: first, apply a random sampling strategy to obtain some initial measurement samples; then use the previous measurement samples to recover the traffic matrix and adaptively select the measurement locations with larger recovery error; the sampling process iterates until a stop condition is met. The stop condition determines whether the current measurement samples contain enough information to recover the traffic matrix. After obtaining enough measurement samples, MC-E2E uses SVT (Cai et al., 2010) to infer the unmeasured data. MC-E2E has an advantage in terms of recovery accuracy; however, to achieve high recovery quality it requires multi-round iterations to find informative sampling locations, resulting in high computational costs.
Liu and Wang (2020) use an adaptive sampling algorithm (Y. Chen et al., 2014) (called Local Coherence in this paper) for sparse network monitoring. Local Coherence (LC) addresses highly coherent matrix completion with a two-stage adaptive sampling strategy: first, randomly sample some entries and perform SVD on these samples to estimate the leverage scores (the local version of coherence) of the remaining unsampled data; second, use the estimated leverage scores to determine further sampling locations. Liu and Wang (2020) also use the stop condition proposed in Xie et al. (2015) to decide whether the current measurement samples are sufficient to recover the matrix. Before the stop condition is satisfied, LC must continually perform SVD to obtain the leverage scores that determine further sampling locations, resulting in high computational costs. In addition, LC calculates the leverage scores from partially measured data, which may lead to inaccurate sampling locations and thus affect the recovery quality.

Subspace-based matrix completion and corresponding adaptive sampling scheme
A novel type of matrix completion algorithm has been proposed in recent years (Balcan & Zhang, 2016; Jin et al., 2022; Krishnamurthy & Singh, 2013, 2014; Wan et al., 2018). These methods (called subspace-based matrix completion in this paper) use a subspace to fill partially sampled vectors. The rationale of these completion methods is that if a vector belongs to a subspace U, it can be recovered from U and a subset of its entries.
In detail, the subspace-based methods (Balcan & Zhang, 2016; Jin et al., 2022; Krishnamurthy & Singh, 2013, 2014; Wan et al., 2018) identify some columns to compose the column subspace U, then use U to fill the other partially sampled columns. Given a subspace U, for any column M_t we sample the entries indexed by a set Ω and denote the samples as M_{Ω,t}. If M_t ∈ U, we can represent M_t = Uα_t, where α_t is an r-dimensional coefficient vector. We can solve the following optimization problem to get the coefficient:
α*_t = arg min_α ||U_Ω α − M_{Ω,t}||_2^2,   (5)
where U_Ω keeps the rows of U indexed by Ω. Problem (5) has the closed-form solution α*_t = (U_Ω^T U_Ω)^(−1) U_Ω^T M_{Ω,t}, so M_t is recovered by
M̂_t = Uα*_t = U(U_Ω^T U_Ω)^(−1) U_Ω^T M_{Ω,t}.   (6)
An adaptive sampling strategy is usually used in these schemes to identify the columns that compose the subspace (called subspace columns). Specifically, given a subspace U (initially it can be ∅), for any column M_t we randomly sample |Ω| entries; when |Ω| is large enough, we can determine whether the sub-samples of M_t belong to the current subspace U through the estimator res = ||M_{Ω,t} − P_{U_Ω} M_{Ω,t}||, where P_{U_Ω} = U_Ω(U_Ω^T U_Ω)^(−1) U_Ω^T. In the absence of noise, res > 0 implies that M_t does not belong to U; in that case we adaptively sample all entries of M_t and add it to U. When res = 0, we can recover M_t by Equation (6). Finding subspace columns by adaptive sampling usually requires the assumption that U is incoherent (Balcan & Zhang, 2016), i.e. that μ(U) is small. When U is incoherent, each row of the matrix reveals roughly the same information, so we need not consider the sampling locations when taking samples for each column. When |Ω| is large enough, we have sufficient information to judge whether a partially sampled column belongs to the current subspace and, if so, to recover it.
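Equations (5) and (6) amount to a least-squares fit on the sampled entries followed by an expansion over the full subspace basis. A minimal NumPy sketch (our own illustration, not the paper's code) of recovering a partially observed column in the noiseless case:

```python
import numpy as np

rng = np.random.default_rng(0)

# A known 2-D column subspace U (orthonormal basis) and a column lying in it.
U = np.linalg.qr(rng.standard_normal((50, 2)))[0]
M_t = U @ rng.standard_normal(2)

# Observe only the entries indexed by Omega.
Omega = rng.choice(50, size=10, replace=False)
U_O, M_O = U[Omega, :], M_t[Omega]

# Estimator: residual of the samples against the projection P_{U_Omega}.
P = U_O @ np.linalg.solve(U_O.T @ U_O, U_O.T)
res = np.linalg.norm(M_O - P @ M_O)   # ~0, since M_t lies in the subspace

# Closed-form coefficient of (5), then recovery by Equation (6).
alpha = np.linalg.solve(U_O.T @ U_O, U_O.T @ M_O)
M_hat = U @ alpha                      # recovers the full column exactly
```

Because the column truly lies in U and there is no noise, ten samples of a fifty-entry column suffice for exact recovery.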
To provide a sparse monitoring scheme with high recovery accuracy and low measurement cost, this paper studies the coherence feature of two real network traffic traces. Based on this feature, we propose RRS-MC, comprising a measurement algorithm and a corresponding completion algorithm. Our measurement algorithm takes full measurement samples on some rows and only partial measurement samples on the remaining rows; the proposed completion algorithm then recovers the unmeasured data. We also demonstrate that the completion algorithm is feasible for sparse network monitoring thanks to its low computation cost.

Empirical study with real traces, and common cases of highly coherent matrices
According to matrix completion theory, a low-rank feature is necessary for low cost sparse network monitoring. In addition to low rank, matrix coherence is another feature that affects the accurate recovery of network monitoring data. In this section, we conduct experimental studies to demonstrate the low-rank and high-coherence features of the traffic monitoring data (Abilene, Zhang et al., 2003, and GÈANT, Uhlig et al., 2006). We then investigate the data structures that lead to high matrix coherence.

Low-rank feature
According to Markovsky (2018) and Xie et al. (2020), if a matrix has a good low-rank structure, the top k singular values occupy nearly the total energy, i.e. Σ_{i=1}^{k} σ_i^2 ≈ Σ_{i=1}^{r} σ_i^2, where σ_i is the ith singular value of the matrix. In this study, we apply SVD to the real traffic traces and check their low-rank feature using the fraction of total variance captured by the top k singular values, g(k) = Σ_{i=1}^{k} σ_i^2 / Σ_{i=1}^{r} σ_i^2. Figure 1 shows the fraction g(k) for both traffic data sets, Abilene (Zhang et al., 2003) and GÈANT (Uhlig et al., 2006). The top ten singular values of both network traffic data sets capture more than 95% of the variance, indicating that both data sets have good low-rank approximations and meet the prerequisite for matrix completion-based methods.
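The fraction g(k) can be computed from the singular values alone. A short NumPy sketch (our own, mirroring the definition above):

```python
import numpy as np

def gk(M, k):
    """Fraction of total variance captured by the top-k singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)   # singular values, descending order
    return np.sum(s[:k] ** 2) / np.sum(s ** 2)
```

For an exactly rank-3 matrix, g(3) is 1 up to floating-point error, while g(1) falls short of it.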

High-coherence feature
In addition to the low-rank feature, another essential feature in matrix completion theory is matrix coherence. The incoherence assumption (i.e. low coherence) is required because of uniform sampling (Y. Chen et al., 2014). When the target matrix is incoherent, informally speaking, each entry of the target matrix reveals the same amount of information. Since each entry carries the same information load, we can accurately recover the matrix as long as the minimum required number of samples is met. However, this incoherence assumption is unlikely to hold for a real network traffic matrix: the traffic matrix is coherent (i.e. highly coherent). When the matrix is coherent, adaptive sampling schemes are usually used instead of random sampling to ensure that the samples carry enough information; in the high-coherence case, we prefer to select the elements that carry more information so as to provide enough information to recover the target matrix. In this subsection, we study the coherence feature of both data sets.
According to Definition 2.3, the matrix coherence is the maximum of the leverage scores, so we first study the leverage scores of both data sets. For the monitoring matrix M ∈ R^(N×T), by Definition 2.2 the leverage scores depend on N, T, and the rank r. We take T = N, and compute r as the smallest k that makes g(k) larger than 99%, where g(k) is given in Section 4.1. Figure 2 shows the leverage scores of both data sets, where the X-axis is the index of rows and columns and the Y-axis is the leverage score. As shown in Figure 2, the column leverage scores are markedly higher than the row leverage scores, indicating that both traces are highly coherent, with column coherence exceeding row coherence.

Cause for the high coherence of traffic traces
The low-rank feature means that the columns (or rows) lie in a low-dimensional subspace. When the columns (or rows) lie uniformly in this low-dimensional subspace, the coherence parameter is provably very low (Candès & Recht, 2009). However, real-world data rarely lie uniformly in a low-dimensional subspace, which causes high matrix coherence. To illustrate why the traffic traces are highly coherent, in this subsection we show the relationship between the numerical values of a matrix and high coherence.
• Case 1: The underlying subspace is composed of several smaller subspaces. Figure 4(a) shows a small example: in a rank-2 matrix, the first three rows belong to one row subspace while the last row belongs to another. This structure is called a mixture of subspaces in Balcan and Zhang (2016) and Liu and Li (2016).
• Case 2: The maximum and minimum values in the matrix differ by several orders of magnitude. Figure 4(b) shows an example of this case.
• Case 3: A matrix simultaneously exhibits the data features of both Case 1 and Case 2. Figure 4(c) shows a small example of this case.
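Case 2 can be reproduced numerically: scaling one row of a rank-1 matrix by several orders of magnitude leaves the rank unchanged but drives the coherence of the column subspace from its lower bound 1 toward its upper bound n1/r. A small NumPy demonstration (our own construction, using the coherence formula from Definition 2.3):

```python
import numpy as np

def col_subspace_coherence(M, r):
    """mu(U): coherence of the rank-r column subspace of M (Definition 2.3)."""
    U = np.linalg.svd(M, full_matrices=False)[0][:, :r]
    n1 = M.shape[0]
    return (n1 / r) * np.max(np.sum(U ** 2, axis=1))

uniform = np.outer(np.ones(4), np.ones(6))   # rank-1, all rows identical
skewed = uniform.copy()
skewed[3] *= 1000.0                          # Case 2: one OD pair 3 orders larger

c_uniform = col_subspace_coherence(uniform, 1)  # approximately 1 (lower bound)
c_skewed = col_subspace_coherence(skewed, 1)    # approaches n1/r = 4 (upper bound)
```

The skewed matrix is still exactly rank 1, yet its coherence is nearly maximal: uniform random sampling would then almost always miss the single dominant row.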
The high column coherence feature of network traffic data implies that either one or more OD pairs have much more traffic volume than other OD pairs or that some OD pairs belong to an independent subspace.

Problem formulation
The aim of this paper is to lower network measurement cost while obtaining accurate network monitoring data for all OD pairs. Matrix completion provides a way to recover a low-rank matrix from a subset of sampled entries, with the prerequisite that the target matrix satisfies the low-rank constraint. In Section 4.1, we verified the low-rank feature of the traffic traces, which makes it feasible to use matrix completion to reduce the measurement cost. As shown in Figure 5, our low cost network monitoring scheme includes two major tasks: (1) a sampling scheduling algorithm that determines the locations of measurement samples, and (2) a matrix completion algorithm that, after samples are collected, recovers (infers) the unmeasured (unsampled) data.
Subspace-based matrix completion schemes use adaptive sampling to obtain the subspace and then recover partially sampled columns (or rows) through it. Applying subspace-based matrix completion to low cost traffic monitoring faces the following problems:
• For subspace-based matrix completion, we can adaptively find columns to compose a column subspace or adaptively find rows to compose a row subspace. In Section 4.2, we verified the coherence feature of the traffic traces: the column coherence is very high while the row coherence is relatively low, in which case a column subspace-based adaptive sampling approach may lead to recovery failure or poor recovery accuracy. Figure 6 shows an example of this case. The given rank-2 matrix with noise is shown in Figure 6(a); the coherence parameters of the matrix are μ(U) = 2 and μ(V) = 1.582, so μ(U) > μ(V). Assuming we take random samples as shown in Figure 6(b), we must use an adaptive sampling strategy to select another column to form the column subspace of the given matrix. As our estimator, we calculate the residual between each partially sampled column and the fully sampled column (the second column). Wang and Singh (2015, 2017) use the magnitude of the residuals as a probability to identify subspace columns. In this case, the first column has the largest residual and is most likely to be chosen as a subspace column. After collecting samples, we use Equation (6) to recover the given matrix; the result is shown in Figure 6(c). As Figure 6 shows, one entry whose value is 9.001 in the original matrix becomes 12 in the recovered matrix, i.e. poor recovery accuracy.
• High computational cost to identify the subspace.The recovery performance of using an adaptive sampling scheme to identify the subspace is better than using a naive random sampling scheme in the high coherence case.However, the high computational cost of the adaptive sampling scheme makes it difficult to meet the speed requirements of subsequent network tasks.

Solution overview
Both real traces have high column coherence, while their row coherence is relatively low; in this paper, we therefore use row subspace-based matrix completion. Whether to use a column or row subspace-based method depends on the data feature. Specifically, the adaptive sampling scheme first takes random samples and then obtains the subspace by adaptive sampling. To obtain higher recovery accuracy, we prefer to run the random sampling process on the low-coherence side and the adaptive sampling process on the side with higher coherence. Since the column coherence is higher than the row coherence, we should use the row subspace-based method.
Figure 7 gives an overview of our sampling schedule and recovery scheme, which consists of two stages. Figure 7(a) shows the training stage. The main goal of this stage is to find the rows, i.e. OD pairs, that compose the row subspace of the target network; these rows are named subspace rows in this paper. We use adaptive sampling on historical data to obtain the subspace rows and record their index set.
Figure 7(b) shows the second stage of our scheme, the measurement and recovery stage. Because of the spatio-temporal feature of the real traces, the subspace rows do not change much over time, so we can use the rows marked in the first stage to form the estimated row subspace. In this stage, we take full measurement samples for all subspace rows. Furthermore, we randomly select some redundant rows and add them to the subspace rows set to guard against subspace variation, which would otherwise reduce recovery accuracy. For each remaining row, we take only a few measurement samples. After collecting the data, our completion algorithm recovers the remaining, partially measured rows.

Training stage
The main goal of this stage is to identify the rows that compose the row subspace. We use adaptive sampling to identify subspace rows from the historical data. Adaptive sampling usually requires active data acquisition (Krishnamurthy & Singh, 2014), which means that if a row is identified as a subspace row, we need to acquire all the data of this row. Since we cannot determine in advance which rows will be identified as subspace rows, we need access to the entire historical data matrix.

Algorithm 1 Training the indexes of subspace rows
Input: The matrix of the historical data M, the number of subspace rows k, the expected number of samples per row |Ω|.
Output: An index set of OD pairs I.
1: Denote the subspace rows set as S and the index set of subspace rows as I.
2: Randomly select a row, add it into S, and add its index into I.
3: Get the current row subspace: V ← the first |S| left singular vectors of S^T, computed by skinny SVD.
4: For each remaining row, randomly take |Ω| samples.
5: for j = 1 to k − 1 do
6: Among the rows not indexed in I, find the row t with the largest residual res = ||M_{Ω,t} − P_{V_Ω} M_{Ω,t}||_2^2, and take full measurement samples of row t.
7: Insert M_t into S, mark the tth row as a subspace row, and add t into I.
8: Update the current row subspace: V ← the first |S| left singular vectors of S^T, computed by skinny SVD.
9: end for
10: Return the index set I.
The main process of the training stage is shown in Algorithm 1. Assume the number of subspace rows is k, denote the set of subspace rows as S, and denote the index set of subspace rows as I. Initially, we randomly select a row as a subspace row, add it to S, add its index to I, and obtain the current row subspace by performing skinny SVD on S^T. For each remaining row, we randomly take |Ω| samples. Then we find the other k − 1 subspace rows through adaptive sampling, as shown in step 5. Among all other rows in the historical data M, we select the row with the largest residual from the current row subspace V as the next subspace row, where the residual is calculated as res = ||M_{Ω,t} − P_{V_Ω} M_{Ω,t}||_2^2 with P_{V_Ω} = V_Ω(V_Ω^T V_Ω)^(−1) V_Ω^T. The rationale for selecting the largest-residual row is that rows with large residuals against the current row subspace can hardly be represented by V. As shown in Figure 8, assume we take the initial samples as in Figure 8(b) and select the third row as a subspace row. We then calculate the residual between each remaining row and the current subspace. The largest residual is 0.2015, so we adaptively sample the fourth row, add it to S, and add its index 4 to I. In this example, I = [3, 4].
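The training stage above can be sketched as follows. This is our own NumPy illustration of Algorithm 1, not the authors' Matlab implementation; `m` stands for |Ω|, and the residual is the projection residual onto the current row subspace restricted to the sampled positions:

```python
import numpy as np

def train_subspace_rows(M, k, m, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 1: pick k subspace-row indexes from historical data M."""
    N, T = M.shape
    I = [int(rng.integers(N))]                # step 2: random first subspace row
    for _ in range(k - 1):                    # steps 5-9: adaptive selection
        S = M[I, :]
        # Current row subspace: first |S| left singular vectors of S^T.
        V = np.linalg.svd(S.T, full_matrices=False)[0][:, :len(I)]
        best_t, best_res = None, -1.0
        for t in range(N):
            if t in I:
                continue
            Omega = rng.choice(T, size=m, replace=False)   # |Omega| random samples
            V_O, M_O = V[Omega, :], M[t, Omega]
            proj = V_O @ np.linalg.lstsq(V_O, M_O, rcond=None)[0]
            res = np.sum((M_O - proj) ** 2)    # res = ||M_Ot - P_V M_Ot||^2
            if res > best_res:
                best_res, best_t = res, t
        I.append(best_t)                       # fully sample row t, add it to S
    return I
```

On a rank-2 matrix, two training rounds select two rows that together span the row space, which is exactly what the completion stage needs.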

Measurement and completion stage
Algorithm 2 Measurement and completion
Input: The trained index set I, the redundancy parameter α, the expected number of samples per row |Ω|.
Output: The recovered matrix M̂.
1: // Measurement stage.
2: Randomly add α row indexes into I.
3: Denote the subspace rows set as S.
4: for each OD pair indexed in I do
5: Take full measurement samples and add the row to S.
6: end for
7: for each OD pair not indexed in I do
8: Randomly measure |Ω| samples.
9: end for
10: // Recovery stage.
11: Calculate the rank r of S.
12: Denote the row subspace of M as V̂. V̂ ← the first r left singular vectors of S^T, computed by skinny SVD.
13: for each OD pair t not indexed in I do
14: Recover M̂_t = (V̂(V̂_Ω^T V̂_Ω)^(−1) V̂_Ω^T M_{Ω,t})^T.
15: end for
16: return M̂.
We now describe our measurement and completion algorithm; the pseudo-code is shown in Algorithm 2. After obtaining the index set of trained subspace rows I, we know which OD pairs compose the row subspace. Since the data changes over time, the subspace rows may change slightly. To obtain the estimated subspace more accurately, we design a redundant sampling rule to keep the recovery accuracy robust: as shown in step 2, we randomly add α row indexes into I. After obtaining the index set I, we perform the sampling schedule accordingly: we take full measurement samples for the indexed subspace rows (OD pairs), and for the remaining rows, i.e. rows not indexed in I, we take only |Ω| measurement samples. After gathering the measurement data, we recover the unmeasured data as follows:
• Denote the subspace rows set as S. We calculate the rank r of S and obtain the estimated row subspace V̂ by performing skinny SVD on S^T.
• For the rows not indexed in I, the unmeasured data are recovered by M̂_t = (V̂(V̂_Ω^T V̂_Ω)^(−1) V̂_Ω^T M_{Ω,t})^T.
There are three parameters in our algorithms: k, α, and |Ω|. These parameters affect the number of samples and the recovery accuracy. k and α determine the number of subspace rows: if they are too small, the selected subspace cannot represent the matrix well, which compromises the recovery accuracy; if they are too large, many rows are selected as subspace rows, leading to a high sampling cost. |Ω| determines the number of samples in each recovered row: if it is too small, we do not have enough information to recover the data of each row; if it is too large, it also leads to a high sampling cost. In Section 7, we run experiments to set proper parameters.
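The measurement and completion stage can be sketched end to end. The NumPy code below is our own illustration of Algorithm 2 (not the authors' Matlab code); the ground-truth matrix `M_true` stands in for taking real measurements, `alpha` is the number of redundant rows, and `m` is |Ω|:

```python
import numpy as np

def measure_and_recover(M_true, I, alpha, m, rng=np.random.default_rng(1)):
    """Sketch of Algorithm 2: redundant full sampling of subspace rows,
    partial sampling of the rest, then row-subspace recovery."""
    N, T = M_true.shape
    I = list(I)
    # Step 2: redundant sampling rule -- add alpha random extra rows to I.
    extra = [t for t in range(N) if t not in I]
    I += [int(t) for t in rng.choice(extra, size=alpha, replace=False)]

    S = M_true[I, :]                          # fully measured subspace rows
    r = np.linalg.matrix_rank(S)
    # Estimated row subspace: first r left singular vectors of S^T.
    V = np.linalg.svd(S.T, full_matrices=False)[0][:, :r]

    M_hat = np.zeros_like(M_true)
    M_hat[I, :] = S
    for t in range(N):
        if t in I:
            continue
        Omega = rng.choice(T, size=m, replace=False)     # partial measurement
        V_O, M_O = V[Omega, :], M_true[t, Omega]
        coef = np.linalg.lstsq(V_O, M_O, rcond=None)[0]  # (V_O^T V_O)^-1 V_O^T M_O
        M_hat[t, :] = V @ coef                # recover the whole row from the subspace
    return M_hat
```

When the trained rows span the true row space and |Ω| exceeds the rank, each partially measured row is recovered exactly in the noiseless case.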

Complexity analysis
This subsection investigates the sample count and computational complexity of our scheme. Assume the recovered matrix is M ∈ R^(N×T1). After collecting the measurement samples, we have a set of subspace rows S ∈ R^((k+α)×T1), and for each remaining partially measured row of M we take |Ω| measurement samples. Thus, the total number of samples in our scheme is (k + α)T1 + (N − k − α)|Ω|. Next, we study the computational complexity of our scheme. To recover the matrix M, we first calculate the rank r of S, which has time complexity O(T1(k + α)^2). We then obtain the estimated row subspace V̂ of M by performing skinny SVD on S^T, with complexity O(T1(k + α)r). Using M̂_t = (V̂(V̂_Ω^T V̂_Ω)^(−1) V̂_Ω^T M_{Ω,t})^T, we recover each remaining partially measured row. Since V̂ ∈ R^(T1×r), V̂_Ω^T ∈ R^(r×|Ω|), and M_{Ω,t} ∈ R^(|Ω|×1), and we typically set |Ω| > r, the complexity of recovering one partially measured row is O(T1|Ω|r). The total number of recovered rows is N − (k + α), so the total complexity of recovering the entire matrix is O((N − k − α)T1|Ω|r).

Performance evaluations
We conduct extensive experiments on two real-world network traffic data sets to evaluate the performance of our RRS-MC scheme.Moreover, to verify the processing time advantage of RRS-MC, we also conduct experiments on a relatively large-scale synthetic data set.

Data description and preprocessing
Table 2 shows the data sets used in this study. Data normalisation (Aksoy & Haralick, 2001) is often applied in data preprocessing to scale the features of data into the range [0, 1].

Synthetic
The Synthetic data set is generated as follows: (1) generate a low-rank matrix M of size 3000 × 20000; (2) randomly select 5 rows and modify their values; (3) inject a Gaussian noise matrix Z into M, with ||Z||_F / ||M||_F = 0.2, where ||·||_F denotes the Frobenius norm.

We normalise the Abilene and GÈANT data in the following manner: M'_{i,j} = (M_{i,j} − min{M_{i,j}}) / (max{M_{i,j}} − min{M_{i,j}}), where min{M_{i,j}} and max{M_{i,j}} are the minimum and maximum values of the data set. This normalisation process does not affect our recovery results: after obtaining the recovery results, we first apply the inverse of the normalisation to them and then calculate the recovered error ratio.

Evaluation metrics and experimental setup
We use the proposed and compared schemes to recover the network traffic data from partial measurement samples. To evaluate the performance of different schemes, we use three evaluation metrics.
Error ratio (inferred): we calculate the error ratio of the recovered data against the raw data as
Error ratio(inferred) = sqrt(Σ_{(i,j)∈Ω̄} (M̂_{i,j} − M_{i,j})^2) / sqrt(Σ_{(i,j)∈Ω̄} M_{i,j}^2),
where M̂_{i,j} and M_{i,j} denote the recovered data and the raw data at location (i, j) of M, respectively, and Ω̄ denotes the index set of unsampled elements. Error ratio (inferred) measures the relative error between the values recovered by the completion algorithm and the raw data.
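This metric is straightforward to compute from a boolean sampling mask. A short NumPy sketch (our own helper, with a hypothetical function name):

```python
import numpy as np

def error_ratio_inferred(M_hat, M, mask_sampled):
    """Relative error over the un-sampled entries.

    mask_sampled is True where an entry was measured; the metric is evaluated
    only on the complement (the inferred entries).
    """
    unsampled = ~mask_sampled
    num = np.sqrt(np.sum((M_hat[unsampled] - M[unsampled]) ** 2))
    den = np.sqrt(np.sum(M[unsampled] ** 2))
    return num / den
```

A recovery that returns zeros for every inferred entry scores an error ratio of 1.0, which is the natural baseline for this metric.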
The processing time of a scheme consists of two parts: the sampling scheduling time and the recovery time of the completion algorithm. Thus, we use the following metrics to evaluate the processing time of all schemes.
Sampling time: to evaluate the speed of the different sampling algorithms, we measure the average number of seconds needed to perform the sampling scheduling task.
Recovery time: to evaluate the speed of the different completion algorithms, we measure the average number of seconds needed to perform the data recovery task.
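The Error ratio(inferred) metric defined above can be computed directly; this is a small illustrative sketch with a hypothetical 2×2 example, assuming Ω̄ is given as arrays of row and column indices:

```python
import numpy as np

def error_ratio(M_hat, M, omega_bar):
    """Error ratio(inferred) over the un-sampled index set Omega-bar:
    sqrt( sum (M_hat - M)^2 / sum M^2 ), restricted to Omega-bar."""
    i, j = omega_bar
    diff = M_hat[i, j] - M[i, j]
    return np.sqrt(np.sum(diff ** 2) / np.sum(M[i, j] ** 2))

M = np.array([[1.0, 2.0],
              [3.0, 4.0]])
M_hat = np.array([[1.0, 2.2],
                  [3.0, 3.6]])
omega_bar = (np.array([0, 1]), np.array([1, 1]))  # entries (0,1) and (1,1)
print(error_ratio(M_hat, M, omega_bar))           # ~0.1
```

Restricting the sums to Ω̄ means the metric only scores the inferred entries, not the measured ones, which is why it is labeled "inferred".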
We implement our scheme and the comparison schemes in MATLAB. The experiments run on a laptop with an Intel(R) Core(TM) i7-8750H CPU @ 2.20 GHz and 24 GB of memory.

Parameter impacts
According to Algorithm 2, the parameter |Ω| indicates how many elements need to be measured in each recovered row. The index set of subspace rows I consists of two parts: a training part that identifies k subspace rows via Algorithm 1, and a redundancy part that randomly selects α redundant subspace rows. These parameters affect the sample ratio and the error ratio. Increasing either k or α increases the number of subspace rows. Therefore, for visualization purposes, we fix the redundancy parameter α = 0.03N, which corresponds to an approximately 3% redundant sample ratio for each data set.
Figure 9 plots the sample ratio and error ratio when varying k and |Ω|. From Figure 9, we can see that: (1) For the parameter k, the recovery performance initially improves as k increases. After k reaches a certain value, the recovery performance becomes stable. This is because once k is large enough, we have already obtained sufficient subspace information to recover the matrix, and further increasing k affects the recovery accuracy only slightly. (2) The impact of the parameter |Ω| on recovery performance is similar: once |Ω| reaches a certain value, the recovery performance becomes stable. For example, in the GÈANT data set, the recovery accuracy when we take |Ω| = 0.1T is similar to that when we take |Ω| = 0.15T, but the former has a 4% lower sample ratio than the latter.
As shown in Figure 9, we select the parameters with the lowest sample ratio for each data set such that the error ratio is close to 0.2. Hence, we set k = 7, |Ω| = 0.25T for Abilene; k = 7, |Ω| = 0.1T for GÈANT; and k = 10, |Ω| = 0.03T for Synthetic. We use the corresponding sample ratio and error ratio in the following experiments.

Performance comparison with baseline
RRS-MC includes a sampling algorithm and a completion algorithm. To conduct a comprehensive comparison, the compared schemes should also include both sampling and completion processes. We compare our sampling and completion algorithms with the following baselines:
• SVT-E2E: performs the E2E (Xie et al., 2015) sampling algorithm, then uses the SVT (Cai et al., 2010) based matrix completion algorithm to recover the matrix.
• ALM-Local Coherence: performs the local coherence (Y. Chen et al., 2014) sampling algorithm, then uses the ALM (Augmented Lagrangian Method) (Lin et al., 2010) based matrix completion algorithm to recover the matrix.
• SRMF-Random: performs random sampling, then uses SRMF (Roughan et al., 2011) as the completion algorithm to recover the matrix.
Figure 10 shows the recovery accuracy of the different algorithms, where the X-axis is the sample ratio and the Y-axis is the error ratio. To investigate the recovery performance of the compared schemes, we raise their sample ratios until the error ratio is similar to ours. As shown in Figure 10, among all implemented algorithms, RRS-MC achieves the low error ratio (0.2) with the smallest sample ratio, needing only a 35% sample ratio for Abilene, 13% for GÈANT, and 6% for Synthetic.
We also draw Figure 11 to show the sample number needed to reach the same recovery error ratio (0.2). As a reference, we also show the total number of measurements in T time slots, denoted as "All" in Figure 11.
We draw Figures 12 and 13 to show the sampling time and recovery time needed to achieve the same recovery error ratio over T time slots of data. Both the E2E and Local-coherence sampling algorithms require multiple rounds of sample selection. In each round, the E2E algorithm uses the previous round's results to select the current round's samples, while the Local-coherence algorithm uses SVD to determine the sampling locations. Both algorithms involve complex matrix calculations. For a fair comparison, we take the sampling time of E2E and Local-coherence as the time of the last iteration round. For RRS-MC, after we train the index set of subspace rows I, we take full samples for the indexed rows and randomly take |Ω| samples for each of the remaining rows. Since I can be trained in advance, the sampling time of RRS-MC is close to that of the random sampling algorithm. Our sampling algorithm takes only 0.018 s in Abilene, 0.085 s in GÈANT, and 0.202 s in Synthetic.
After gathering the measurement samples, we use the different matrix completion algorithms to recover the un-sampled data. RRS-MC performs skinny SVD once to obtain the row subspace; for each recovered row, it only performs a simple matrix calculation. Compared with the other completion algorithms, which involve iteration, RRS-MC has an obvious advantage in recovery speed. It takes only 0.021 s in Abilene, 0.1 s in GÈANT, and 4.01 s in Synthetic.

Performance comparison with column-based scheme
Besides comparing with the above dynamic sampling and recovery schemes, we also compare RRS-MC with an existing column-based scheme and a random scheme to show the negative impact of high column coherence and the effectiveness of RRS-MC. We draw the CDF (cumulative distribution function) of the error ratio in Figure 14. Each point in Figure 14 represents one result of the sampling and recovery process. In this experiment, the sample ratio of the three schemes is the same, and we run each scheme one hundred times.
As expected, compared with the randomly selected row subspace algorithm (termed random row space), RRS-MC has advantages in both accuracy and stability. For example, on the GÈANT data set (Figure 14(b)), the recovery error of RRS-MC is around 0.2, while the recovery error of random row space spans a large range between 0.2 and 0.6, which indicates that the subspace identified by our algorithm is precise and can recover the data accurately. RRS-MC also has an advantage in recovery accuracy over the existing column subspace-based algorithm.

Performance comparison with non-redundant scheme
In this subsection, we verify the effectiveness of the redundant sampling rule by comparing RRS-MC with a non-redundant scheme (termed RS-MC). For RS-MC, we only use Algorithm 1 to identify the subspace. In this experiment, we set the parameters of both schemes according to the results of Section 7.3.1; for each data set, we set the redundancy parameter α = 0.03N, i.e. the redundant sample ratio is approximately 3%. We use 5T time slots of data following the training data to verify the recovery accuracy of both schemes.
Figure 15 shows the results, where the X-axis is the starting time slot following the training data and the Y-axis is the error ratio. From Figure 15, we can see that: (1) The recovery error of RS-MC tends to increase with time on the real data sets, which indicates that the subspace does change over time. (2) The recovery error of RRS-MC for every T time slots is approximately 0.2, which indicates that the subspace estimated from the combination of training on historical data and redundant sampling can robustly recover data continuously. (3) Since the subspace of the synthetic data set does not change, the trained subspace rows can recover the matrix well, and redundant sampling does not affect the recovery accuracy.

Conclusion
In this work, we focus on low-cost network traffic monitoring. By studying real network traffic traces, we find that column coherence is larger than row coherence in most cases. Based on this coherence feature, we propose RRS-MC, which constructs the subspace by taking full measurements in subspace rows; the remaining rows take only partial measurements, and the unmeasured data are recovered through the subspace. Accurately monitoring network traffic at a low cost can benefit network providers. For example, service providers can use our scheme to monitor network traffic and optimise their network infrastructure to serve their customers better. Extensive experiments on two real data sets and a synthetic data set show that our scheme ensures high recovery quality of network measurement data with low measurement costs.

Figure 1. Good low-rank structure of real traffic traces.

Figure 2. Leverage score of two real network traffic flow traces.
As shown in Figure 2, the max value of the leverage scores is 35.292 (Abilene) and 102.2354 (GÈANT), so the coherence values are 35.292 and 102.2354, respectively. We use a black dashed line to indicate the upper bound of the leverage scores for the two data sets, i.e. N/r. In addition to the coherence feature of the first M ∈ R^(N×T), where the first M collects the monitoring data from time slot 1 to time slot T, we also study the coherence feature of the subsequent data. The results are shown in Figure 3, where the X-axis indicates the starting time slot and the Y-axis indicates the coherence parameters of each M ∈ R^(N×T) from varying starting time slots. Combining Figures 2 and 3, we find that: (1) both data sets have high coherence; (2) column coherence is very high, while row coherence is relatively lower.
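The leverage-score comparison above can be sketched numerically, assuming the standard definitions: the row (resp. column) leverage score is (n/r)·||U_i||² (resp. (m/r)·||V_j||²) from the rank-r skinny SVD, and coherence is the largest leverage score, upper-bounded by n/r (resp. m/r). The spiked toy matrix below is a hypothetical example, not the traffic data:

```python
import numpy as np

def coherences(M, r):
    """Row and column coherence from the rank-r skinny SVD: the largest
    leverage score (n/r)*||U_i||^2 over rows and (m/r)*||V_j||^2 over
    columns, each upper-bounded by n/r (resp. m/r)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    U, V = U[:, :r], Vt[:r].T
    n, m = M.shape
    row_coh = (n / r) * np.max(np.sum(U ** 2, axis=1))
    col_coh = (m / r) * np.max(np.sum(V ** 2, axis=1))
    return row_coh, col_coh

# A rank-5 matrix whose energy concentrates in one column has high
# column coherence, mirroring the pattern observed in the traffic data.
rng = np.random.default_rng(2)
M = rng.standard_normal((50, 4)) @ rng.standard_normal((4, 60))
M[:, 0] += 100.0 * rng.standard_normal(50)     # spike one column
row_coh, col_coh = coherences(M, r=5)
print(col_coh > row_coh)
```

Concentrated energy in a few columns drives column coherence toward its upper bound m/r, which is exactly the regime where column-based sampling schemes degrade and row-subspace sampling remains effective.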

Figure 3. Coherence feature of two real network traffic flow traces.

Figure 4. Example of high-coherence matrices.

Figure 6. Large recovery error when using column subspace-based methods in the high column-coherence case.

Figure 8. An example of selecting the largest residual from the current row subspace.

Algorithm 2 Measurement and recovery of un-measured data via subspace-based matrix completion
Input: an index set of OD pairs I; expected number of samples per recovered row |Ω|.
Output: recovered M.
1: // Redundancy sampling rule.
2: Randomly add α redundant row indices into I.
3: // Sample scheduling stage.
4: for each OD pair indexed in I do
5: …

We use T = N time slots of data as the training data to obtain the index of subspace rows, and use the T time slots following the training data to investigate how |Ω| and k impact the sample ratio and recovery accuracy. The scale of the training data is as follows: we set T = N = 144 in Abilene, T = N = 529 in GÈANT, and T = N = 3000 in Synthetic, respectively.
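The sampling schedule of Algorithm 2 can be sketched as a boolean measurement mask. This is an illustrative assumption of the un-specified details (in particular, that each row outside I samples |Ω| uniformly random time slots), not the paper's exact procedure:

```python
import numpy as np

def schedule_samples(N, T, I_trained, n_omega, alpha, rng):
    """Sketch of steps 1-4 of Algorithm 2. Assumed detail: each row
    outside I samples n_omega random time slots. Returns a boolean
    mask (True = take a measurement) and the final index set I."""
    I = set(I_trained)
    # Redundancy sampling rule: add alpha random redundant row indices.
    candidates = [i for i in range(N) if i not in I]
    I |= set(rng.choice(candidates, size=alpha, replace=False).tolist())
    mask = np.zeros((N, T), dtype=bool)
    for i in range(N):
        if i in I:
            mask[i, :] = True                        # full measurement
        else:
            cols = rng.choice(T, size=n_omega, replace=False)
            mask[i, cols] = True                     # partial measurement
    return mask, I

rng = np.random.default_rng(3)
mask, I = schedule_samples(N=20, T=30, I_trained=[0, 1, 2],
                           n_omega=5, alpha=2, rng=rng)
# (k + alpha)*T + (N - k - alpha)*|Omega| = 5*30 + 15*5 = 225 samples
print(mask.sum())
```

The mask total matches the sample-count formula from the complexity analysis, (k + α)T + (N − k − α)|Ω|, which is what makes the scheme's measurement cost predictable in advance.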

Figure 10. Recovery error for all schemes.

Figure 11. Sample number to achieve the same recovery accuracy.

Figure 12. Sampling time to achieve the same recovery accuracy.

Figure 13. Recovery time to achieve the same recovery accuracy.

Table 1. List of notations.

Table 2. Data description. The GÈANT data set records data every 15 minutes over several months in the GÈANT network, which consists of 23 routers and thus has 529 OD pairs.