An Attempt to Analyse Baarda’s Iterative Data Snooping Procedure based on Monte Carlo Simulation

William Sealy Gosset, otherwise known as "Student" and a disciple of Fisher, was one of the pioneers in the development of modern statistical methods and their application to the design and analysis of experiments. Although there were no computers in his time, he discovered the form of the t distribution by a combination of mathematical and empirical work with random numbers, which is now recognized as an early application of Monte Carlo simulation. Today, with fast computers and large data storage systems, probability distributions can be estimated by computerized simulation. Here, we use Monte Carlo simulation to investigate the efficiency of Baarda's iterative data snooping procedure as a test statistic for outlier identification in the Gauss-Markov model. We highlight that the iterative data snooping procedure can identify more observations than the actual number of simulated outliers, a case that receives particular attention in this work. The estimated probability of over-identification refines the estimate of the type III error and, probably, of outlier identifiability. With this approach, for the network analysed, a significance level of 0.001 was in general the best scenario for avoiding the wrong exclusion of a good observation. Thus, the data snooping procedure is described more realistically when the over-identification cases are considered in the simulation. Finally, we conclude that, for the GNSS network considered, the iterative data snooping procedure analysed by Monte Carlo can locate an outlier of magnitude 4.5σ or larger with a high success rate.


Introduction
The reliability of outlier identification is one of the major challenges in the quality control of geodetic measurements. In the sense of Least Squares Estimation (LSE), outliers are nuisance observations that spoil both the estimated parameters and their standard deviations, causing incorrect results. Thus, we often try to minimize the magnitude of undetectable outliers in the observations, as well as to reduce the effect of the undetected ones on the estimated parameters. Two categories for the treatment of observations contaminated by outliers have been developed: robust adjustment procedures (for an overview see e.g. Wilcox, 2012; Klein et al. 2015a) and outlier detection based on statistical tests (e.g. Baarda, 1968; Pope, 1976; Lehmann and Lösler, 2016; Klein et al. 2017).
The first category is outside the scope of this paper. Besides the undoubted advantages of robust adjustment, outlier tests are also widely used. The following advantages of outlier analysis are mentioned by Lehmann (2013):
• Detected outliers provide the opportunity to investigate the causes of gross measurement errors;
• Detected outliers can be re-measured.
One of the best-known methods for outlier identification in geodetic data analysis is Baarda's testing procedure (Baarda, 1968). This method consists of three steps (see e.g. Teunissen 2006): detection (also known as the overall model test), identification (also known as data snooping) and adaptation (a corrective action, such as elimination of the observation identified as an outlier).
At each iteration, only a single observation can be identified in the data snooping procedure.
Once an identified observation is excluded, the LSE adjustment is restarted without the rejected observation and the identification step (data snooping) is applied again. Of course, if redundancy permits, this procedure is repeated until no further observation is identified. This procedure is called "iterative data snooping" (e.g. Teunissen, 2006). In this paper, we are exclusively concerned with the iterative data snooping procedure.
Since data snooping is based on statistical hypothesis testing with an alternative hypothesis for each observation, it may lead to a false decision as follows:
• Type I error or false alert (probability level α): the probability of identifying an outlier when there is none;
• Type II error or missed detection (probability level β): the probability of not identifying an outlier when there is at least one; and
• Type III error or wrong exclusion (probability level κ): the probability of misidentifying a non-outlying observation as an outlier, instead of the outlying one. This type of decision error was introduced by Hawkins (1980) and Förstner (1983).
The rate of type I decision error in a binary hypothesis test (i.e., with a single alternative hypothesis) can be selected by the user; the rate of type II decision error cannot. Lehmann and Voß-Böhme (2017) also point out that a test statistic with a low rate of type II error is said to be powerful in the binary hypothesis case, when only a single alternative hypothesis is considered. However, in the case of multiple alternative hypotheses (i.e., data snooping), there is a high risk of over-estimating the successful identification probability if the type III error is not considered (see e.g. Yang et al. 2013). On the other hand, the confidence level is the probability that a non-outlying observation is correctly ignored, and the power of the test is the probability that an outlier is correctly identified.
Therefore, the confidence level and the power of the test are the probabilities of the test result leading to correct decisions, as opposed to the occurrence of type I, II and III errors (see, for example, Förstner, 1983;Teunissen, 2006;Klein et al. 2015b).
Here, we extend the decision errors above: we highlight that the iterative data snooping procedure can identify more observations than the actual number of outliers (which we call "over-identification"). The latter receives particular attention in this work. The aim of this paper is to compute the statistical quantities (power of the test, type II error, type III error and over-identification) for iterative data snooping by means of the well-established Monte Carlo simulation (MCS). This simulation technique has been widely applied in geodesy since the pioneering work of Hekimoglu and Koch (1999).
Many of the relevant probabilities in this contribution are multivariate integrals over complex regions (Teunissen, 2017). They therefore need to be computed by means of numerical simulation such as MCS. MCS methods are used whenever the functional relationships are not analytically tractable, as is the case for the data snooping testing procedure (Lehmann, 2013). The MCS method replaces analytical treatment of random variables by a large number of computerized random experiments. In other words, the statistical quantities are determined from the frequency distributions of computer experiments performed with random numbers. MCS has already been applied in outlier detection (e.g. Lehmann and Scheffler, 2011; Lehmann, 2012; Klein et al. 2012; Klein et al. 2015a, 2015b; Erdogan, 2014; Niemeier and Tengen, 2017). Following this line of thought, our goal here is to apply MCS to analyse the efficiency of the iterative data snooping procedure for the correct identification (or not) of a single simulated outlier at a time.
The rest of the paper is organised as follows: Section 'Theoretical overview' brings a theoretical background about data snooping statistical tests for outlier identification in the LSE. Section 'Monte Carlo approach for data snooping procedure' presents the method for determining the statistical quantities numerically. Section 'Experiments and results' contains the experiments setup, results and discussions. Finally, last section offers conclusions and recommendations for future studies.

Theoretical overview
The mathematical model generally adopted in geodetic data analysis is the linear(ised) Gauss-Markov model, given by (Koch, 1999):

y = Ax + e [1]

where e is the n×1 random error vector, A is the n×u design (or Jacobian) matrix, x is the u×1 vector of unknown parameters and y is the n×1 observations vector. The most widely employed solution for a redundant system of equations (n > rank(A) = u) is the weighted least-squares estimator (WLSE) of the vector of unknowns:

x̂ = (AᵀWA)⁻¹AᵀWy [2]

where W is the n×n weight matrix of the observations, taken as W = Σy⁻¹ (Teunissen, 2003). However, the WLSE is no longer optimal in the presence of systematic and/or gross errors (blunders) in the observations. In other words, despite its optimal properties, the WLSE lacks robustness, i.e. insensitivity to outliers in the observations (Huber, 1964; Rousseeuw and Leroy, 1987; Lehmann, 2013). Therefore, statistical testing procedures for the detection and identification of outliers have been developed.
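The WLSE above translates directly to a few lines of linear algebra. The following minimal sketch in Python/NumPy is ours; the toy numbers (four direct observations of a single unknown) are illustrative only, not from the paper:

```python
import numpy as np

def wlse(A, W, y):
    """Weighted least-squares estimate x_hat = (A^T W A)^(-1) A^T W y."""
    N = A.T @ W @ A                        # normal-equation matrix
    return np.linalg.solve(N, A.T @ W @ y)

# toy example of our own: four direct observations of one unknown
A = np.ones((4, 1))
W = np.eye(4)
y = np.array([10.01, 9.99, 10.02, 9.98])
print(wlse(A, W, y))   # the weighted mean of the observations
```

With equal weights this reduces, as expected, to the arithmetic mean of the four observations.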
Quality control to identify outliers in geodetic measurements has been widely investigated since the pioneering work of Baarda (1968). In the sense of LSE, statistical testing procedures for the detection and identification of outliers are based on the maximum likelihood ratio. Consider a null hypothesis H0 for the parameters of the population probability distribution of an observation vector y. Consider further an alternative hypothesis HA for these parameters, constructed in such a way that H0 is a subset of HA. Thus, in the general case, the maximum likelihood ratio between H0 and HA is given by (Larson, 1974):

λ(y) = max p(y|H0) / max p(y|HA) [3]

where max p(y|H0) is the maximum of the probability density function (pdf) of y under H0 and max p(y|HA) is the maximum of the pdf of y under HA. As the null hypothesis is defined so that its sample space is contained in the sample space of the alternative hypothesis, the ratio in Equation 3 lies in the interval 0 ≤ λ(y) ≤ 1 (Teunissen, 2006). The test criterion for the maximum likelihood ratio is to reject H0 if (Larson, 1974):

λ(y) ≤ c [4]

where c > 0 is the critical value of the test according to the stipulated significance level α (for more details, see Larson, 1974; Teunissen, 2006).
Assuming normally distributed observation errors, a general case of hypothesis testing in linear models is formulated as (Teunissen, 2006):

H0: E(y) = Ax    versus    HA: E(y) = Ax + Cy∇ [5]

where H0 is the null hypothesis (namely, absence of outliers in the observations) and HA is an alternative hypothesis (presence of q outlying observations at certain known locations). The n×q matrix Cy defines the non-random error model (in this context called the outlier model) and ∇ is the corresponding vector of q outliers. The dimension q of ∇ should satisfy 1 ≤ q ≤ n − u. For example, if n = 5 and q = 2, a possible outlier model is the 5×2 matrix whose columns are the unit vectors with 1 in the first and third rows, respectively (one outlier in each of the observations y1 and y3). For more details about error models, see e.g. Lehmann and Lösler (2016).
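The outlier model matrix Cy for the example above (n = 5, outliers in y1 and y3) is easy to build programmatically; the helper name outlier_model below is ours:

```python
import numpy as np

def outlier_model(n, rows):
    """Build the n x q outlier-model matrix C_y: one unit column per
    suspected observation (0-based row indices)."""
    C = np.zeros((n, len(rows)))
    for k, i in enumerate(rows):
        C[i, k] = 1.0
    return C

# n = 5, q = 2, outliers in observations y1 and y3 (indices 0 and 2)
print(outlier_model(5, [0, 2]))
```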
Considering the maximum of the pdf of y under H0 and HA, the maximum likelihood ratio λ between the two hypotheses leads to the test statistic (Teunissen, 2006):

Tq = ê0ᵀ W Cy (Cyᵀ W Σê0 W Cy)⁻¹ Cyᵀ W ê0 [6]

where ê0 and Σê0 are, respectively, the estimated random error vector and its a posteriori covariance matrix, computed by LSE under H0. H0 is rejected if Tq > Κα, where Κα is the critical value of the test according to the significance level α. For more details see Koch (1999) and Teunissen (2003). Under the null hypothesis, the test statistic Tq follows the central chi-squared distribution with q degrees of freedom; under the alternative hypothesis, Tq follows the non-central chi-squared distribution with q degrees of freedom and non-centrality parameter δ. The data snooping procedure is a particular case of the maximum likelihood ratio test in which only one outlier (i.e. q = 1) is assumed to be present in the data set at a time (see e.g. Baarda, 1968; Pope, 1976; Berber and Hekimoglu, 2003; Lehmann, 2012). Thus, it is formulated by the following test hypotheses (called the individual model test or w-test) (Baarda, 1968; Teunissen, 2006):

H0: E(y) = Ax    versus    HA: E(y) = Ax + cy∇ [7]

where cy is the outlier model for q = 1, i.e. the n×1 unit vector with 1 in its i-th entry and zeros elsewhere (e.g. cy = [0 1 0 … 0]ᵀ for i = 2), and ∇ is a scalar containing the gross error (outlier) of the i-th observation being tested. Therefore, under the null hypothesis (H0) it is assumed that there are no outliers in the observations, while under the alternative hypothesis (HA) it is assumed that the i-th observation being tested (for i = 1, …, n) is contaminated by a gross error of magnitude ∇.
If we consider one outlying observation at a certain known location (q = 1), then the maximum likelihood ratio test statistic for data snooping is given by (Teunissen, 2006):

Tq=1 = (cyᵀ W ê0)² / (cyᵀ W Σê0 W cy) [8]

It should be noted that the test statistic in Equation 8 is a particular case of the generalized test statistic, for q = 1. It is also important to mention that the critical value follows from a chi-squared distribution with one degree of freedom at significance level α in a one-tailed test. Baarda (1968) and Teunissen (2006) demonstrate that for q = 1 the test statistic can also be formulated based on a standard normal distribution in a two-tailed test (the so-called w-test):

w = cyᵀ W ê0 / √(cyᵀ W Σê0 W cy) [9]

Both the chi-squared and normal distribution tests are equivalent. Usually in geodesy, the value of α is set between 0.1% and 1% (Kavouras, 1982; Aydin and Demirel, 2004; Lehmann, 2013). Furthermore, data snooping involves multiple alternative hypotheses, as each observation is tested individually. Therefore, the only observation considered contaminated by an outlier is the one whose test statistic satisfies Tq=1 > Κα; in the case that two or more observations exceed the critical value Κα, only the observation with the largest Tq=1 is flagged as an outlier. After the observation most suspected of being an outlier (at the given α) has been identified, it is usually excluded from the model, and WLSE and data snooping are applied iteratively until no further outliers are identified in the observations (the iterative data snooping procedure) (Teunissen, 2006; Berber and Hekimoglu, 2003).
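The w-test for all observations can be sketched as follows, assuming σ0 = 1 and a known covariance matrix of the observations; the function name and the toy data (five repeated measurements, the last one contaminated) are ours:

```python
import numpy as np

def w_tests(A, Qy, y):
    """Baarda's w-test statistic for every observation (q = 1 case),
    assuming sigma0 = 1 and known covariance matrix Qy."""
    W = np.linalg.inv(Qy)
    N = A.T @ W @ A
    x_hat = np.linalg.solve(N, A.T @ W @ y)
    e_hat = y - A @ x_hat                     # estimated residuals under H0
    Qe = Qy - A @ np.linalg.solve(N, A.T)     # covariance of the residuals
    n = len(y)
    w = np.empty(n)
    for i in range(n):
        c = np.zeros(n); c[i] = 1.0           # outlier model c_y for q = 1
        w[i] = (c @ W @ e_hat) / np.sqrt(c @ W @ Qe @ W @ c)
    return w

# toy example of our own: 5 repeated measurements, the last one an outlier
A = np.ones((5, 1)); Qy = np.eye(5)
y = np.array([0.0, 0.0, 0.0, 0.0, 5.0])
w = w_tests(A, Qy, y)
print(np.argmax(np.abs(w)))   # index 4 gets the largest |w|
```

In the iterative procedure, the observation with the largest |w| exceeding the critical value would be excluded and the computation repeated on the remaining observations.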
However, three types of incorrect decisions may occur in data snooping, and their occurrence rates are associated with probability levels: the significance level α is the probability (when H0 is true) of a non-outlying observation being misidentified as an outlier (type I error or false positive); β is the probability of an outlying observation not being identified as an outlier (type II error or false negative); finally, κ is the probability of a non-outlying observation being misidentified as an outlier, instead of the outlying one (type III error). On the other hand, the confidence level (CL) is the probability that a non-outlying observation is correctly ignored, i.e. CL = 1 − α; the power of the test (γ) is the probability that an outlier is correctly identified, i.e. γ = 1 − (β + κ). Therefore, CL and γ are the probabilities of the test result leading to correct decisions, as opposed to the occurrence of type I, II and III errors (see, for example, Förstner, 1983; Teunissen, 2006; Klein et al. 2015b). Baarda (1968) tabulated the non-centrality parameter δ as a function of α0 and γ0 (for a given degree of freedom). Alternatively, Aydin and Demirel (2004) presented a procedure to obtain the same values through approximations of the non-central chi-squared distribution. The need to obtain the non-centrality parameter is widespread in geodesy (Baarda, 1968; Kavouras, 1982; Teunissen, 2006; Knight et al. 2010).
In addition to these probabilities, the iterative data snooping procedure can identify more observations than the actual number of outliers (here we call this "over-identification"). The over-identification case may contain one or more observations correctly identified or, in the worst-case scenario, all erroneously identified.

Monte Carlo approach to analyse the iterative data snooping procedure
The MCS is applied to compute the probability levels. To do so, a sequence of m random error vectors e_k, k = 1, …, m, of a desired statistical distribution is generated; m is known as the number of Monte Carlo experiments. Usually, one assumes that the random errors of the good measurements are normally distributed with zero expectation. Thus, we generate the random errors using the well-known Box-Muller method (Box and Muller, 1958) based on the multivariate normal distribution, since the assumed stochastic model for the random errors is based on the covariance matrix of the observations, i.e. e_k ~ N(0, Σy).
On the other hand, an outlier (q = 1) is selected based on magnitude intervals for each of the m Monte Carlo experiments. We use the standard uniform distribution to select the outlier magnitude. The uniform distribution is a rectangular distribution with constant probability, which implies that each range of values of the same length on the distribution's support has an equal probability of occurrence (see e.g. Lehmann and Scheffler, 2011). For example, over 10,000 Monte Carlo experiments, if one chooses a magnitude interval of |3σ to 9σ|, the probability of a 3σ error occurring is virtually the same as that of −3σ, and so on. At each iteration of the simulation, a specific observation is chosen to receive a gross error based on the discrete uniform distribution (i.e., all observations have the same probability of being selected).
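The generation step described above can be sketched as follows. NumPy's normal generator stands in for an explicit Box-Muller implementation, and the function name and interval bounds are ours, for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_total_error(Sigma_y, low, high):
    """One Monte Carlo draw of the total error: correlated normal random
    errors plus a single outlier whose magnitude is uniform in
    [low, high] * sigma_i, with random sign, at a uniformly chosen
    observation."""
    n = Sigma_y.shape[0]
    L = np.linalg.cholesky(Sigma_y)            # maps N(0, I) draws to N(0, Sigma_y)
    e = L @ rng.standard_normal(n)
    i = int(rng.integers(n))                   # discrete uniform choice of observation
    mag = rng.uniform(low, high) * np.sqrt(Sigma_y[i, i])
    eps = e.copy()
    eps[i] = rng.choice([-1.0, 1.0]) * mag     # contaminated entry: random error replaced
    return eps, i
```

Note that the contaminated entry has its random error replaced by the gross error, matching the convention adopted in Equation 10 below.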
Random errors and gross errors are assumed to be independent (by definition), and both are combined into the total error as follows (see e.g. Kavouras, 1982):

ε = e + cy∇, ∇ ≠ 0 [10]
where ε is the n×1 total error vector, e is the n×1 random error vector, cy is the outlier model for q = 1, and ∇ is a scalar containing the outlier at the i-th observation. Here, we assume that ∇ > e. In order to avoid compensation and potentiation problems, i.e. ∇ − e and ∇ + e respectively, the observation selected to contain a gross error has its random error removed in Equation 10. Before computing the test statistic Tq=1 it is necessary to relate the random error vector e and the total error vector ε, since this test statistic depends on the estimated random error vector ê0.
In the sense of LSE, this relationship is given by:

ê0 = Rε [11]

In Equation 11, the reader should note that the multiplication of the redundancy matrix R by the total error ε provides the estimated random error vector under H0. This vector is not composed only of random errors: one of its elements is also contaminated by the effect of the outlier. The redundancy matrix R in Equation 11 is based on the network geometry and the covariance matrix, and is given by:

R = I − A(AᵀWA)⁻¹AᵀW [12]

where R is the n×n redundancy matrix and I is the n×n identity matrix. The diagonal elements of R are the local redundancy numbers (r). The local redundancy numbers indicate the fraction of a possible outlier in an observation that is reflected in the corresponding residual of that observation.
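Equation 12 translates directly to code. As a sanity check, the local redundancy numbers sum to the overall redundancy n − u; the toy design matrix below is ours:

```python
import numpy as np

def redundancy_matrix(A, W):
    """R = I - A (A^T W A)^(-1) A^T W (Equation 12). The diagonal holds
    the local redundancy numbers; trace(R) = n - u."""
    n = A.shape[0]
    N = A.T @ W @ A
    return np.eye(n) - A @ np.linalg.solve(N, A.T @ W)

# toy example of our own: four equal-weight observations of one unknown
A = np.ones((4, 1)); W = np.eye(4)
R = redundancy_matrix(A, W)
print(np.diag(R))      # each local redundancy number r_i = 0.75 here
print(np.trace(R))     # 3.0 = n - u
```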
Reliability measures such as the local redundancy numbers are intrinsically related to the network geometry/configuration and to the observation weights. Such measures are widely used for the quality analysis of geodetic networks, both in the design stage and for quality control (Kuang, 1991; Klein et al. 2012). It is desirable to have an approximately constant value for all redundancy numbers, so that the ability to detect outliers is the same in every part of the network. In other words, one seeks to minimize the magnitude of undetectable outliers in the observations by increasing the redundancy numbers, in order to obtain an optimal network configuration (see e.g. Amiri-Simkooei, 2001a, 2001b). Furthermore, the redundancy numbers are correlated with the robustness parameters proposed by Vaníček et al. (1990) and Vaníček et al. (2001).
Note that the method described here for evaluating iterative data snooping depends only on the matrices A and W and on the magnitude of the simulated outlier.

Experiments and results
In this study, the previously described method was applied to a GNSS (Global Navigation Satellite System) network. It is important to mention that the design matrix must initially have a minimum configuration to avoid rank deficiency, as well as to allow the identification of at least one outlier; as mentioned by Xu (2005), 'in order to identify outliers, one also has to further assume that for each model parameter, there must, at least, exist two good data that contain the information on such a parameter'. For example, consider one unknown height in a levelling network (one-dimensional, 1D). Two observations would lead to different solutions and allow the detection of an inconsistency between them. Three observations would lead to different solutions and the identification of one outlying observation, and so on. Thus, in the general case, the value of q equals the minimal number of redundant observations across each and every point, minus one. For more details on choosing the number of outliers to be considered, see Klein et al. (2017).
We highlight that the redundancy numbers (diagonal elements of Equation 12) were between 0.43 (minimum) and 0.78 (maximum) for the network, being classified as a good controllability; further considerations about GNSS networks are outside the scope of this study.
Here, the significance level for iterative data snooping is varied, taken as α = 0.001 (0.1%), α = 0.005 (0.5%), α = 0.01 (1%), α = 0.025 (2.5%) and α = 0.05 (5%). Each Monte Carlo simulation has a unique combination of significance level and magnitude interval of the outliers. We ran 10,000 experiments for each simulation and computed the rates of type II error, type III error, the power of the test and the over-identification in iterative data snooping. It is important to emphasize that the proposed method does not depend on the unknown parameters vector or on the vector of observations, as can be seen in the previous section.
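One such Monte Carlo simulation, combining the pieces of the previous section, can be condensed into the sketch below. The function name, the toy network (a single unknown with identity covariance) and the simplified classification logic are our own stand-ins for the paper's actual GNSS setup; σ0 = 1 and a two-tailed w-test are assumed:

```python
import numpy as np
from statistics import NormalDist

def snooping_rates(A, Sigma_y, alpha, low, high, m=1000, seed=1):
    """Monte Carlo estimate of the outcome rates of iterative data snooping
    for a single simulated outlier, classified into the five outcome
    classes discussed in the text."""
    rng = np.random.default_rng(seed)
    k_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # two-tailed critical value
    counts = dict(success=0, typeII=0, typeIII=0, over_plus=0, over_minus=0)
    n, u = A.shape
    L = np.linalg.cholesky(Sigma_y)
    for _ in range(m):
        eps = L @ rng.standard_normal(n)                # random errors ~ N(0, Sigma_y)
        j = int(rng.integers(n))                        # contaminated observation
        eps[j] = rng.choice([-1.0, 1.0]) * rng.uniform(low, high) * np.sqrt(Sigma_y[j, j])
        keep, flagged = list(range(n)), []
        while len(keep) > u + 1:                        # iterative data snooping
            Ak, Qk = A[keep], Sigma_y[np.ix_(keep, keep)]
            Wk = np.linalg.inv(Qk)
            Nk = Ak.T @ Wk @ Ak
            ek = eps[keep]
            e_hat = ek - Ak @ np.linalg.solve(Nk, Ak.T @ Wk @ ek)
            Qe = Qk - Ak @ np.linalg.solve(Nk, Ak.T)
            w = np.abs(Wk @ e_hat) / np.sqrt(np.diag(Wk @ Qe @ Wk))
            i = int(np.argmax(w))
            if w[i] <= k_alpha:
                break
            flagged.append(keep.pop(i))                 # exclude and re-adjust
        if not flagged:
            counts['typeII'] += 1
        elif flagged == [j]:
            counts['success'] += 1
        elif j in flagged:
            counts['over_plus'] += 1                    # outlier found, plus others
        elif len(flagged) == 1:
            counts['typeIII'] += 1                      # single wrong exclusion
        else:
            counts['over_minus'] += 1                   # several, all wrong
    return {k: v / m for k, v in counts.items()}
```

Running this for every (α, magnitude interval) pair yields tables of rates analogous to those discussed below; note that the method uses only A, Σy and the simulated errors, never the observation vector itself.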
Random errors and outliers are synthetically generated and added to the observations. Each unknown station is involved in at least four baseline vectors; thus, the local-scale redundancy equals three. In each experiment, positive and negative outliers are drawn between 3σ and 3.5σ; 3.5σ and 4σ; 4σ and 4.5σ; 4.5σ and 5σ; 5σ and 5.5σ; 5.5σ and 6σ; 6σ and 6.5σ; 6.5σ and 7σ; 7σ and 7.5σ; 7.5σ and 8σ; 8σ and 8.5σ; 8.5σ and 9σ. The outcomes of the experiments were classified as: the number of experiments in which the procedure identified only the outlying observation (power of the test, γ); the number of experiments in which the procedure identified no observation (type II error, β); and the number of experiments in which the procedure identified a single observation but at the wrong location (type III error, κ). In addition to these classes, we consider the "over-identification" case (more than one observation identified) and divide it into two categories: the number of experiments in which the procedure identified the outlying observation along with others ("over-identification +"), see Figure 6; and the number of experiments in which the procedure identified only non-outlying observations ("over-identification -"), see Figure 7. The "over-identification -" case was also added to the type III error case (κ), see Figure 8. Tables 1-5 show the success (γ), misidentification (β and κ) and over-identification rates for the various significance levels and intervals of outlier size considered in this work. Figure 8 shows the "over-identification -" rate added to the type III error for the GNSS network vs. the magnitude intervals of the outliers for each probability level α. In general, the greater the magnitude of the outliers, the greater the efficiency of the iterative data snooping procedure. It is important to note that from 4.5-5σ onwards the type II error is practically absent.
The results also show that the probability of committing the different types of error depends more on the critical value than on the outlier magnitude, for one-dimensional identification and the GNSS network analysed.
The iterative data snooping procedure was more efficient for outliers larger than 4.5σ. Considering all results for the GNSS network, the mean success rate was 90.6% for α = 0.001 (0.1%) and 82% for α = 0.005 (0.5%). Thus, we consider α = 0.001 and α = 0.005 satisfactory significance levels for the GNSS network analysed. These results show the importance of a correct choice of α, as pointed out by Lehmann (2012); they also highlight the challenges of controlling the error rate in multiple-hypothesis and iterative tests.
Regarding the two classes of over-identification rates, in general, the occurrence of "over-identification+" and "over-identification-" is directly related to the probability level α, i.e. the greater the type I error, the greater the over-identification rates. From 5.5-6σ onwards the "over-identification-" rate is practically null. Furthermore, if we disregard the "over-identification-" case, one could erroneously consider the significance level of 0.05 the best scenario for the type III error (see Figure 5). However, considering the type III error plus "over-identification-", the lowest probability of making a wrong exclusion is actually obtained for a significance level of 0.001, as can be seen in Figure 8. Therefore, the "over-identification-" rates should be considered for a more accurate and thorough analysis of the type III error, as well as to avoid a false interpretation of the results.
To conclude, it is important to emphasize that the significance level used to determine the theoretical critical value of the test is not the probability of the type I error of the iterative data snooping procedure; it is only the type I error of the local test (w-test) for a single alternative hypothesis. The type I error of the iterative data snooping procedure can also be estimated based on Monte Carlo simulations. This topic will be covered in future work.

Conclusions
Monte Carlo methods are tools for solving problems using random numbers. Although this might sound rather limited, Monte Carlo methods are fundamental tools in many areas of modern science. Here, the goal was to analyse the iterative data snooping testing procedure for locating an outlier by means of Monte Carlo simulation (MCS).
The MCS dispenses with the observation vector of the Gauss-Markov model. In fact, the method proposed here depends only on the design matrix A, the uncertainties of the observations Σy, and the magnitude intervals of the simulated outlier. The random errors (or residuals) are generated artificially from the normal distribution, while the size of the outliers is selected using the standard uniform distribution.
Iterative data snooping showed high success rates in the experiments on a GNSS network for a single outlier randomly generated between four and five standard deviations. The efficiency of data snooping also depends on the significance level α. Here, the optimal value of the significance level was 0.001 (0.1%) for the GNSS network analysed. This value also depends on the functional and stochastic models.
Furthermore, we note that outlier identifiability in the iterative data snooping procedure is much more complex than that addressed by Prószyński (2015), because in our case a removal operation of observations is performed and we are not restricted to the first adjustment run.
Here, the reliable identification of an outlier depends not only on type III errors, but also on the over-identification cases. The estimated probability of "over-identification-" refines the estimate of the type III error and avoids misinterpretation. The case of "over-identification+" requires further clarification in terms of theoretical basis and experiments (e.g. issues related to the Minimal Detectable Bias and the Minimal Identifiable Bias). The latter case will be investigated more closely in the next study.
Finally, we show that MCS is a feasible method for computing the probability levels associated with a statistical testing procedure without recourse to statistical tables. Future studies should consider the case of multiple outliers, which is a much more complicated problem and will be the topic of future research.

Acknowledgments
The authors thank CNPq for financial support provided to the second author (proc. n. 305599/2015-1) and IBGE for the GNSS data. We also appreciate the support of all members of the research group: "Quality Control in Geodesy" (http://dgp.cnpq.br/dgp/espelhogrupo/3674873915161650). Last but not least we are also thankful to the two anonymous reviewers who provided contribution to the revision of the paper.