COMPARISON OF OUTLIER DETECTION TECHNIQUES IN NON-STATIONARY TIME SERIES DATA

This study examined the performance of six outlier detection techniques on a non-stationary time series dataset. Two issues were of interest: first, which method correctly detects the number of outliers introduced into the dataset; and second, which method over-detects the number of outliers introduced, when the dataset contains only extreme maxima, only extreme minima, or both. The air passenger dataset was used, with the number of introduced outliers (extreme values) ranging from 1 to 10, plus 40. The six outlier detection techniques were the Mahalanobis distance, depth-based, robust kernel-based outlier factor (RKOF), generalized dispersion, Kth nearest neighbors distance (KNND), and principal component (PC) methods. When detecting extreme maxima, the Mahalanobis and principal component methods performed better in correctly detecting outliers in the dataset. The Mahalanobis method also identified more outliers than the others in the extreme minima category, making it the "best" method there. The Kth nearest neighbors distance method was the "best" method for not over-detecting the number of outliers for extreme minima, whereas the Mahalanobis distance and principal component methods were the "best" performing methods for not over-detecting the number of outliers in the extreme maxima category. Therefore, the Mahalanobis outlier detection technique is recommended for detecting outliers in non-stationary time series data.


INTRODUCTION
There are two notable definitions of an outlier in the literature. According to Barnett and Lewis (1994), an outlier is an observation that appears to deviate evidently from the other observations of the sample in which it occurs. Similarly, Johnson (1992) defines an outlier as an observation in a dataset that appears inconsistent with the rest of the observations in that dataset. Outliers arise mainly from human error, instrument error, natural deviations in populations, fraudulent behavior, changes in systems' behavior, and/or faults in systems (Hodge and Austin, 2004).
Outlier detection refers to the task of identifying patterns in data that do not conform to expected behavior (Ané et al., 2008; Angiulli and Pizzuti, 2002). Because an outlier can reveal unexpected but useful patterns in a dataset, it plays a crucial role in decision making, clustering, and pattern classification. Outlier detection is widely applied in public health anomaly, credit card fraud, and intrusion detection studies, and has become of great interest in data mining (Barnett and Lewis, 1994; Fox, 1972; Glendinning, 1998). The literature offers several outlier detection algorithms. Popular categories of outlier detection techniques include z-score or extreme value analysis, probabilistic and statistical modeling, linear regression models, proximity-based models, and information theory models. Graphically, box plots and scatter plots are also used to detect outliers in a given dataset. Several studies have compared some of these outlier detection methods. Notable recent ones are as follows. Hodge and Austin (2004) surveyed the outlier detection methods used in machine learning and statistics, while Chandola et al. (2009) reviewed outlier detection techniques with respect to different assumptions. According to Xiaodan et al. (2018), other literature on outlier detection mainly focused on applications, such as network data (Gogoi et al., 2011) and temporal data (Gupta et al., 2014), or on particular learning techniques, such as subspace learning and ensemble learning. The critical question is: which method can better detect outliers in a given time series dataset? This study compares the performance of six outlier detection methods with respect to their ability to correctly identify the exact number of outliers introduced into the dataset.
This study differs from the reviewed literature in two ways: (1) varying numbers of outliers are introduced into the dataset; and (2) two types of outliers, extreme minima and extreme maxima, are considered in the dataset.

METHODS AND MATERIALS

Data Source and Nature
The performance of six outlier detection methods was compared using the air passenger dataset, which spans 01/1960 to 12/1971, consists of 144 observations (data points), and exhibits both trend and seasonality. The dataset was obtained from the Time Series Analysis (TSA) package in R (Cryer and Chan, 2012). The analysis followed the steps below:
Step 1: Check whether the dataset contains any outliers using the classical box plot approach to create lower and upper fences. Any value below the lower fence (extreme minima) or above the upper fence (extreme maxima) is an outlier. The lower and upper fences are given by $Q_1 - (1.5 \times IQR)$ and $Q_3 + (1.5 \times IQR)$, respectively, where $Q_1$ is the first quartile, $Q_3$ is the third quartile, and $IQR$ is the inter-quartile range of the dataset.
Step 2: Check the minimum value of the dataset and separately introduce arbitrary extreme minima and extreme maxima values, with the number of outliers set to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and 40. In each case, the sample size increases by the number of outliers introduced.
Step 3: Compare the performance of the six outlier detection methods to find which method can correctly detect all the outliers that were introduced into the dataset. In this context, an outlier detection method is considered the "best" performing method if it identifies all or the maximum number of outliers that were introduced into the dataset.
Step 4: Introduce both extreme maxima and minima into a particular dataset, which in this study is termed the mixture dataset, with sample sizes (i.e., the number of outliers) 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, and 80. Thus, sample size 2 would contain one extreme minimum and one extreme maximum.
Step 5: Compare the performance of the six outlier detection methods for this mixture dataset. This is to check the methods' performance when both extreme maxima and minima are present in the data.
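Steps 1 and 2 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the series below is a synthetic stand-in with trend and seasonality (not the air passenger data), the function names are hypothetical, and the quartile convention (`method="inclusive"`) is one of several in use, so fences may differ slightly from those produced by a particular box plot routine.

```python
import statistics

def box_plot_fences(values):
    """Return (lower, upper) fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def flag_outliers(values):
    """Indices of values lying outside the box plot fences."""
    lower, upper = box_plot_fences(values)
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Assumption: a synthetic monthly series with linear trend plus a repeating
# seasonal pattern, standing in for a non-stationary dataset of 144 points.
seasonal = [0, 5, 10, 15, 20, 30, 40, 35, 25, 15, 5, 10]
series = [100 + 2 * t + seasonal[t % 12] for t in range(144)]

# Step 1: check the clean series for outliers.
clean_flags = flag_outliers(series)

# Step 2: introduce k arbitrary extreme maxima; the sample size grows by k.
k = 3
contaminated = series + [3 * max(series) + j for j in range(k)]
flags = flag_outliers(contaminated)

print(clean_flags)  # no index is flagged in the clean series
print(flags)        # only the appended positions are flagged
```

Extreme minima can be introduced the same way by appending values far below `min(series)`; the mixture dataset of Step 4 appends both kinds.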

OUTLIER DETECTION METHODS
The outlier detection techniques considered in this study are the Mahalanobis distance, depth-based, robust kernel-based outlier factor (RKOF), generalized dispersion, Kth nearest neighbors distance (KNND), and principal component (PC) methods.

The Mahalanobis distance method is a well-known criterion that depends on the estimated parameters of the multivariate distribution. Given n observations from a p-dimensional dataset $X$, we define the sample covariance matrix by

$$V_n = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^{T},$$

where $\bar{x}$ denotes the sample mean vector. Observations with a large Mahalanobis distance from $\bar{x}$, computed with $V_n$, are indicated as outliers.

The depth-based method assigns each observation a depth computed from $S(\cdot)$, the simplex generated by its arguments, where the sum and average are taken over all p-plets $x[i_1], \ldots, x[i_p]$ such that $1 \le i_1 < \cdots < i_p \le n$.
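As a rough illustration of the Mahalanobis criterion, the sketch below computes squared Mahalanobis distances for two-dimensional points using the sample covariance matrix with divisor n - 1. Several choices here are assumptions rather than details taken from the study: pairing each value with its time index to obtain p = 2, the synthetic series, the function name, and the cutoff (the 0.975 quantile of a chi-square distribution with 2 degrees of freedom, which has the closed form -2 ln(0.025)).

```python
import math

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^T V^{-1} d, using the closed-form inverse of a 2x2 matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

# Assumption: synthetic (time index, value) pairs with trend and seasonality.
seasonal = [0, 5, 10, 15, 20, 30, 40, 35, 25, 15, 5, 10]
points = [(t, 100 + 2 * t + seasonal[t % 12]) for t in range(144)]
points.append((144, 2000))  # one injected extreme maximum

d2 = mahalanobis_sq_2d(points)
cutoff = -2.0 * math.log(1.0 - 0.975)  # chi-square (2 df) 0.975 quantile
flagged = [i for i, d in enumerate(d2) if d > cutoff]
print(flagged)
```

Because the mean and covariance are themselves estimated from the contaminated data, many injected outliers can inflate the covariance and mask one another, which is one reason detection counts can fall short of the number introduced.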
For the robust kernel-based outlier factor (RKOF), the local kernel density estimate of an object p is defined in terms of the following quantities: $h$ is the smoothing parameter, $\gamma$ is the sensitivity parameter, $K(x)$ is the multivariate kernel function, $\lambda = \{f(o)/g\}^{-\alpha}$ is the local bandwidth factor, $f(x)$ is a pilot density estimate that satisfies $f(x) > 0$ for all the objects, $\alpha$ is the sensitivity parameter that satisfies $0 \le \alpha \le 1$, and $g$ is the geometric mean of $f(x)$.

The generalized dispersion method computes a leave-one-out (LOO) dispersion matrix for each observation (that is, without the current observation) and labels the observation as an outlier based on the difference between the determinant of the LOO dispersion matrix and the determinant of the full dispersion matrix.

The principal component outlier statistic is defined, and the extremity of an observation with respect to a particular group is evaluated with this statistic, to assess a new observation $x$. The numerator in this expression (equation (4)

The Kth nearest neighbors distance (KNND) method uses a distance-based approach to find outliers in a dataset, based on the k nearest neighbors of each point. For each point, the local outlier factor (LOF) compares the point's local reachability density (LRD) with the LRDs of its k nearest neighbors. The LRD (a density estimate that reduces variability) of an object p is defined as:

$$\mathrm{lrd}(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \text{reach-dist}_k(p, o) \right)^{-1}$$

The final local outlier factor score is given as:

$$\mathrm{LOF}(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}(o)}{\mathrm{lrd}(p)}$$

where $\mathrm{lrd}(p)$ and $\mathrm{lrd}(o)$ are the local reachability densities of p and o respectively, $N_k(p)$ is the set of k nearest neighbors of p, and $\text{reach-dist}_k(p, o) = \max\{k\text{-distance}(o),\ d(p, o)\}$.
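The local outlier factor computation can be sketched as follows for one-dimensional values. This is a simplified illustration, not the implementation used in the study: the function name and toy data are hypothetical, exactly k neighbors are kept per point (ties at the k-distance are broken by index), and the values are assumed distinct so that no reachability distance is zero.

```python
def lof_scores(values, k):
    """Local outlier factor of each value, using absolute difference as distance."""
    n = len(values)
    neighbours, k_dist = [], []
    for i in range(n):
        # Distances from point i to every other point, nearest first.
        d = sorted((abs(values[i] - values[j]), j) for j in range(n) if j != i)
        k_dist.append(d[k - 1][0])              # distance to the k-th neighbour
        neighbours.append([j for _, j in d[:k]])

    def reach_dist(i, o):
        # Reachability distance of i from o: never smaller than o's k-distance.
        return max(k_dist[o], abs(values[i] - values[o]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, o) for o in neighbours[i]) for i in range(n)]

    # LOF: mean ratio of the neighbours' densities to the point's own density.
    return [sum(lrd[o] for o in neighbours[i]) / (k * lrd[i]) for i in range(n)]

values = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 60]  # 60 is an extreme maximum
scores = lof_scores(values, k=3)
most_outlying = scores.index(max(scores))
print(most_outlying, round(scores[most_outlying], 2))
```

Scores near 1 indicate a point whose local density matches that of its neighbors; scores well above 1 indicate a point in a much sparser region than its neighbors.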

RESULTS AND DISCUSSION
In this study, the two issues of interest are the correct detection of the number of outliers introduced into a non-stationary time series dataset and the over-detection of the number of outliers introduced into the dataset. An outlier detection method is considered the "best" performing method if it identifies all or the maximum number of outliers introduced into the dataset. Table 1 presents the descriptive summary of the dataset. Using the classical box plot approach, it was evident that the air passenger dataset has no outliers. Therefore, artificial outliers were introduced into the dataset.

Correct Detection of Number of Outliers
Artificial outliers in varying numbers were introduced into the air passenger dataset, and the technique that could identify them was considered the "best" outlier detection method. In all, ninety-five (95) outliers were introduced in each of the extreme minima and extreme maxima categories, and one hundred and ninety (190) outliers in the mixture dataset (containing both minima and maxima). The method with the maximum number of detections is therefore considered the "best" method in each extreme category.
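The totals of 95 and 190 follow directly from the outlier counts listed in the procedure (1 to 10 and 40 for a single extreme category; 2 to 20 in steps of two and 80 for the mixture). A short check, with variable names chosen here purely for illustration:

```python
# Outlier counts per contaminated dataset in a single extreme category.
single_sizes = list(range(1, 11)) + [40]        # 1, 2, ..., 10 and 40

# Mixture datasets pair each minimum with a maximum, doubling every count.
mixture_sizes = [2 * k for k in single_sizes]   # 2, 4, ..., 20 and 80

total_single = sum(single_sizes)
total_mixture = sum(mixture_sizes)
print(total_single, total_mixture)  # 95 190
```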
For extreme minima with varying numbers of outliers, the Mahalanobis method identified 36 out of 95 outliers in the air passenger dataset (see Table 2), the most of any method. The worst-performing method was the principal component method, which could not detect any outliers. For the extreme maxima category, the generalized dispersion method was the worst at detecting the outliers, with only 3 out of 95 detected, while the Mahalanobis distance and principal component methods were the "best" in correctly detecting the number of outliers.

Correct detection is shown in boldface
In the mixture dataset (containing both minima and maxima) category, it was evident in Table 3 that for the extreme minima, the generalized dispersion, Mahalanobis, and principal component methods could not detect any of the outliers introduced into the dataset. However, the depth-based method was "best" in detecting the number of outliers introduced into the dataset. For extreme maxima, the principal component and Mahalanobis methods were the "best" in detecting the number of outliers introduced into the dataset. The generalized dispersion could not detect any outlier in the dataset.

Over Detection of Number of Outliers
The performance of the six methods with respect to over-detecting outliers in the dataset is assessed next. The "best" detection technique is the one that records the minimum number of over-detections.
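One way to tally the two criteria for a single contaminated dataset is sketched below. The definition used is an assumption made for illustration: a correct detection is a flagged index that was actually introduced, and an over-detection is a flagged index that was not. The function name and the example indices are hypothetical.

```python
def detection_counts(flagged, injected):
    """Return (correct detections, over-detections) for one dataset."""
    flagged, injected = set(flagged), set(injected)
    correct = len(flagged & injected)  # introduced outliers that were flagged
    over = len(flagged - injected)     # flagged points that were never introduced
    return correct, over

# Hypothetical example: three outliers introduced at positions 144-146,
# and a method that flags two of them plus one original observation.
correct, over = detection_counts(flagged=[3, 144, 145], injected=[144, 145, 146])
print(correct, over)  # 2 1
```

Summing `correct` and `over` across all contaminated datasets gives per-method totals of the kind compared in the tables, with a high `correct` total and a low `over` total both being desirable.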
From Table 4, it was evident that the Kth nearest neighbors distance method was the "best" method for not over-detecting the number of outliers for the extreme minima. The principal component method was the worst in performance, since it recorded the highest number of over-detections for extreme minima. For extreme maxima, the Kth nearest neighbors distance method was the "worst" performing method, since it had the highest number of over-detections. The Mahalanobis distance and principal component methods were the "best" performing methods, with only three over-detections.

From Table 5, the principal component and Mahalanobis methods were the "best" methods, since they did not over-detect any outlier. In contrast, the generalized dispersion method was the worst in performance, since it recorded the highest number of over-detections in the mixture scenario.

CONCLUSION
The performance of six different methods for detecting outliers was compared using the air passenger dataset. The air passenger dataset did not contain any outliers; therefore, artificial outliers were introduced into the dataset. Performance was evaluated by the number of introduced outliers that a detection method could correctly identify. For the extreme minima category, the "best" performing outlier detection technique was the Mahalanobis method, while the worst-performing method was the principal component method. For the extreme maxima category, the generalized dispersion method was the worst-performing detection technique, while the Mahalanobis distance and principal component methods were the "best" in correctly detecting the number of outliers. For the mixture dataset (containing both minima and maxima), the Mahalanobis and principal component methods were the "best" performing methods in correctly detecting outliers. Lastly, the Kth nearest neighbors distance method was the "best" method for not over-detecting the number of outliers for extreme minima, whereas the Mahalanobis distance and principal component methods were the "best" performing methods for not over-detecting the number of outliers in the extreme maxima category. Therefore, the Mahalanobis outlier detection technique is recommended for detecting outliers in a non-stationary time series dataset.