Assessing Image Classification Accuracy with Principal Component Analysis Algorithm

Abstract: This study assesses image classification accuracy using Principal Component Analysis (PCA). It evaluates the benefits of PCA as part of an image preprocessing procedure for image classification. Land-use-land-cover (LULC) and accuracy assessment datasets were obtained with remote sensing and geographic information system software. PCA was used to assess the level of correlation amongst the bands of Landsat 8. Image classification was based on the Maximum Likelihood classifier for land use land cover analysis. To ascertain the accuracy of the classified images, the producer’s accuracy, user’s accuracy and Kappa coefficient were calculated. The results revealed that the first three PCs of the raw Landsat data accounted for 99.37% of the variance of the original Landsat data, while the last three PCs represented only 0.63%. Land use land cover based on the raw bands composite was Forest (41%), Shrubs (33%) and Built-up (26%). On the other hand, land use land cover based on PCA was Forest (39%), Shrubs (39%) and Built-up (22%). The Kappa coefficient of the LULC classification based on the raw bands composite was 0.88, while that of the PCA-based classification was 0.91. In conclusion, there is a significant level of difference between the classification outputs of the PCA-derived classification and those of the raw Landsat bands composite.

The dimensionality of a data set can be mathematically reduced using Principal Component Analysis (PCA) (Munyati 2004). Reducing the dimensionality of a data set can further accentuate the visual characteristics of digital images. Consequently, it makes data processing and analysis more concise and manageable. When combined with the digital image processing of satellite images, PCA can reduce a number of correlated image bands to a few uncorrelated bands (David 2017; Estornell et al. 2013). Researchers from different fields have successfully used the PCA model in their studies (Gasmi et al. 2016; Lan et al. 2017; Zhao et al. 2017; Marchetti et al. 2020). For instance, Li et al. (2020) proposed the use of PCA and high-dimensional model representation to estimate probabilistic power flow; Geng et al. (2020) used a PCA-based memory network to predict short-term wind speed; Schwartz et al. (2020) proposed a PCA method for change detection in radar images; and Gao et al. (2019) worked on extreme learning machine and adaptive PCA methods for network intrusion.
Remote sensing is the art or science of obtaining and analyzing information about a phenomenon, area or object using a physical device without physical contact (Jeeva and Naraana 2016). Object classification can be performed through spectral analysis of the reflected or emitted radiant energy of the target (Meera et al. 2015). Remote sensing often deals with multispectral images with highly correlated bands. In order to save data storage space and computing time, such bands can be combined into new, less correlated images by PCA. Multi-dimensional PCA works directly on the vector data of a digital image, where each band is taken as a dimension of the matrix. The method is based on the principle of applying PCA, and it has been tested on several standard images (Dwivedi et al. 2006).
Many researchers are exploring new ways of improving the performance of PCA in image analysis. To improve the performance of image compression, an extended PCA-based method can be used to compress a single image rather than a set of separate images. This method uses the correlations between the three color components of an image (Mofarreh et al. 2015). The PCA approach to the identification and analysis of multi-layer images presents comparatively better results than previously used techniques (Imran et al. 2005). The most common feature-extraction method is PCA, which transforms the data into a new set of principal components (PCs) that describe the underlying structure of the original dataset (Zhang and Mishra 2012). Multi-collinearity is a high degree of correlation among predictive variables in multiple regression (Klainbaum et al. 1998). One way of solving the problem of multi-collinearity is to apply PCA, a traditional multivariate statistical method commonly used to reduce the number of predictive variables (Bair et al. 2006).
Accuracy assessment of land cover maps produced from remotely sensed data involves comparing thematic maps with reference data (Congalton 1991). Since there were no suitable existing reference data that could be used for all locations on the earth's surface, Zhu et al. (2000) designed a practical and statistically sound sampling plan to characterize the accuracy of common and rare classes for the map product, using National Aerial Photography Program (NAPP) photographs as the reference data.
The sampling design was developed based on the following criteria: (1) ensure the objectivity of sample selection and the validity of statistical inferences drawn from the sample data, (2) distribute sample sites spatially across the region to ensure adequate coverage of the entire region, (3) reduce the variance of the estimated accuracy parameters, (4) provide a low-cost approach in terms of budget and time, and (5) be easy to implement and analyze (Zhu et al. 2000). The need to assess the accuracy of a map generated from any remotely sensed data has become universally recognized as an integral project component. In the last few years, most projects have required that a certain level of accuracy be achieved for the project and map to be deemed a success (Ross and John, 2004). Therefore, the aim of this study is to assess image classification accuracy using Principal Component Analysis in Odeda LGA of Ogun State, Southwest Nigeria.

MATERIALS AND METHODS
The study area, Odeda Local Government Area, is one of the twenty (20) Local Government Areas in Ogun State, southwest Nigeria. It is located between latitudes 7°13′ and 7°30′ N and longitudes 3°11′ and 3°46′ E (Fig. 1) and covers a total land area of about 1,560 km². It has a population of 109,449 according to the 2006 population census (NPC, 2006). The study area is predominantly rural, with about 25-30 semi-urban areas and 860 villages and hamlets (Adedeji et al. 2020).
The Landsat imagery used for this study was downloaded from the official website of the Global Land Cover Facility (GLCF) (http://www:glcf.umiacs.umd.edu). Satellite imagery of March 16th, 2021 from Path 191 and Row 055 was used. Image pre-processing helps to enhance and improve the quality of the image (Mussie 2011). Radiometric and geometric corrections were performed on the image to enhance output quality. Image preprocessing was carried out using the nearest neighbor interpolation algorithm. Compared with other interpolation algorithms such as linear, bilinear and bi-cubic interpolation, nearest neighbor interpolation is simpler and faster to compute. The nearest neighbor interpolation method assigns to each interpolated output pixel the value of the nearest sample point in the input image. The interpolation kernel for the nearest neighbor is represented in equations 1 and 2 (Venkata 2019).
Where is the pixel value?
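The nearest neighbor rule described above can be illustrated with a short sketch. It assumes a single-band image held as a NumPy array; the function name `nearest_neighbor_resample` and the pixel-centre mapping are illustrative choices, not the procedure of the software used in this study.

```python
import numpy as np

def nearest_neighbor_resample(image, out_rows, out_cols):
    """Resample a 2-D image so each output pixel copies its nearest input pixel."""
    in_rows, in_cols = image.shape
    # Map the centre of every output pixel back to input coordinates
    # and take the input cell that contains it (assumed convention).
    row_idx = np.floor((np.arange(out_rows) + 0.5) * in_rows / out_rows).astype(int)
    col_idx = np.floor((np.arange(out_cols) + 0.5) * in_cols / out_cols).astype(int)
    # Guard against indices falling outside the input grid.
    row_idx = np.clip(row_idx, 0, in_rows - 1)
    col_idx = np.clip(col_idx, 0, in_cols - 1)
    return image[np.ix_(row_idx, col_idx)]

# Hypothetical 2 x 2 band, upsampled to 4 x 4.
band = np.array([[1, 2],
                 [3, 4]])
upsampled = nearest_neighbor_resample(band, 4, 4)
```

Because every output value is copied from an input pixel, no new digital numbers are created, which is one reason this method is often preferred when the resampled image is to be classified.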
The vector map of the study area was used to clip the Landsat 8 (OLI) image. Bands 2, 3, 4, 5, 6 and 7 were stacked for further processing in ArcGIS 10.4. Table 1 shows the band statistics of the bands that were used for this study. The clipped Landsat bands 2, 3, 4, 5, 6 and 7 were subjected to Principal Component Analysis using the Spatial Analyst tools of the ArcGIS software. The derived PC1, PC2 and PC3 were composited as a dataset for image classification. The raw Landsat bands were also composited as a second dataset for classification, which was used for comparative analysis.
The pixel values of the stacked bands can be arranged as a matrix (equation 3):
$$X_{n,b} = \begin{bmatrix} x_{1,1} & \cdots & x_{1,b} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,b} \end{bmatrix} \tag{3}$$
Where n represents the number of pixels and b stands for the number of bands.
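The matrix (equation 3) can be simplified by considering each band as a vector of n pixel values:
$$X_k = \begin{bmatrix} x_{1,k} & x_{2,k} & \cdots & x_{n,k} \end{bmatrix}^{T}$$
Where k is the band number.
To reduce the dimensionality of the original bands, the eigenvalues of the covariance matrix must be calculated. The covariance matrix can be calculated as shown in equation 4:
$$C = \begin{bmatrix} \sigma_{1,1} & \cdots & \sigma_{1,b} \\ \vdots & \ddots & \vdots \\ \sigma_{b,1} & \cdots & \sigma_{b,b} \end{bmatrix}, \qquad \sigma_{i,j} = \frac{1}{n-1} \sum_{p=1}^{n} \left(DN_{p,i} - \mu_i\right)\left(DN_{p,j} - \mu_j\right) \tag{4}$$
Where σ_{i,j} is the covariance of each pair of bands i and j; DN_{p,i} is the digital number of pixel p in band i; DN_{p,j} is the digital number of pixel p in band j; and μ_i and μ_j are the averages of the DN for bands i and j, respectively.
The eigenvalues λ are obtained from the characteristic equation (equation 5):
$$\det(C - \lambda I) = 0 \tag{5}$$
Where C is the covariance matrix of the bands and I is the identity matrix.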
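The covariance and eigenvalue steps described above, followed by projection of the centred data onto the eigenvectors, can be sketched with NumPy. This is a minimal illustration under stated assumptions: the band stack is an array of shape (bands, rows, columns), the data are synthetic, and the function name `pca_from_bands` is hypothetical. It is not the implementation of the ArcGIS tool used in this study.

```python
import numpy as np

def pca_from_bands(stack):
    """PCA of a band stack with shape (bands, rows, cols)."""
    b = stack.shape[0]
    x = stack.reshape(b, -1).T.astype(float)    # n pixels x b bands
    x_centered = x - x.mean(axis=0)             # subtract the band means
    cov = np.cov(x_centered, rowvar=False)      # b x b covariance, 1/(n-1) scaling
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric matrix, ascending order
    order = np.argsort(eigvals)[::-1]           # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = 100.0 * eigvals / eigvals.sum() # percentage of variance per PC
    scores = x_centered @ eigvecs               # principal component values per pixel
    pcs = scores.T.reshape(stack.shape)         # back to (PCs, rows, cols)
    return eigvals, explained, pcs

# Synthetic stand-in for six correlated bands (not the study data).
rng = np.random.default_rng(0)
base = rng.normal(size=(20, 20))
stack = np.stack([base + 0.1 * rng.normal(size=(20, 20)) for _ in range(6)])
eigvals, explained, pcs = pca_from_bands(stack)
```

The cumulative sum of `explained` gives the share of the total variance retained when only the leading components are kept.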
The principal components are then obtained by a linear transformation of the original data (equation 6):
$$y = W x \tag{6}$$
Where y is the principal component vector, W is the transformation matrix and x is the original data vector.
In this study, the Maximum Likelihood (ML) supervised classification method was used. It is derived from Bayes' theorem, which gives the a posteriori distribution P(i|ω), i.e., the probability that a pixel with feature vector ω belongs to class i (Asmala and Shaun 2012), as shown in equation 7:
$$P(i \mid \omega) = \frac{P(\omega \mid i)\, P(i)}{P(\omega)} \tag{7}$$
Where P(ω|i) is the likelihood function, P(i) is the a priori information, i.e., the probability that class i occurs in the study area, and P(ω) is the probability that ω is observed, which can be written as:
$$P(\omega) = \sum_{i=1}^{M} P(\omega \mid i)\, P(i) \tag{8}$$
Where M is the number of classes.
In ML classification, each class is enclosed in a region of multispectral space where its discriminant function is larger than that of all other classes. These class regions are separated by decision boundaries, where the decision boundary between classes i and j occurs when:
$$g_i(\omega) = g_j(\omega) \tag{9}$$
Where g_i(ω) is the discriminant function of class i.
Accuracy assessment is a general term for comparing a classification with geographical data that are assumed to be true (reference data), in order to determine the accuracy of the classification process. The error matrix has become a standard in the accuracy assessment of remote-sensing classification results (Nagamani 2015). The most common means of reporting the reliability of a land cover map derived from satellite data is the error or confusion matrix (Table 2), also called a contingency table (Congalton 1991). The error or confusion matrix tabulates the errors made in a classification. The columns stand for the categories on the ground, while the rows represent the categories assigned in the mapping project. The overall accuracy is the sum of the diagonal elements divided by the total number of pixels in the table. The producer's accuracy is calculated by dividing the number of pixels correctly classified in a given category by the total number of pixels of that category that were sampled on the ground. The user's accuracy is obtained by dividing the number of pixels correctly classified in a category by the total number of pixels that were assigned to that category in the classification. The Kappa coefficient and the error matrix are considered common techniques for measuring the accuracy of thematic maps generated by the classification process.
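The three accuracy measures defined above can be computed directly from an error matrix. The sketch below assumes the layout described in the text (rows are the classified categories, columns are the ground categories) and that every class has at least one sample. The function name `accuracy_summary` and the 3-class matrix are hypothetical and serve only as an illustration.

```python
import numpy as np

def accuracy_summary(error_matrix):
    """Overall, producer's and user's accuracy from an error matrix.

    Rows are classified categories, columns are reference (ground) categories.
    """
    m = np.asarray(error_matrix, dtype=float)
    correct = np.diag(m)                   # correctly classified pixels per class
    overall = correct.sum() / m.sum()      # diagonal sum over all pixels
    producers = correct / m.sum(axis=0)    # divide by column (ground) totals
    users = correct / m.sum(axis=1)        # divide by row (classified) totals
    return overall, producers, users

# Hypothetical 3-class error matrix (not from the study).
example = [[48, 2, 0],
           [6, 40, 4],
           [1, 8, 41]]
overall, producers, users = accuracy_summary(example)
```

For this hypothetical matrix the first class has a producer's accuracy of 48/55 and a user's accuracy of 48/50, which shows that the two measures can differ for the same class.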
The Kappa coefficient can be calculated using equation 10:
$$KAPPA = \frac{N \sum_{i=1}^{k} X_{ii} - \sum_{i=1}^{k} \left(X_{i+} \times X_{+i}\right)}{N^{2} - \sum_{i=1}^{k} \left(X_{i+} \times X_{+i}\right)} \tag{10}$$
Where KAPPA = Kappa index, k = number of rows in the matrix, X_{ii} = number of observations in row i and column i (along the diagonal), X_{i+} and X_{+i} = marginal totals for row i and column i, respectively, and N = total number of observations.
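Equation 10 can be checked numerically with a short sketch. The cell counts below are those of the 508-point error matrix tabulated in the results section; the function name `kappa_coefficient` is hypothetical, and the row and column totals are recomputed from the cells rather than read from the table.

```python
import numpy as np

def kappa_coefficient(error_matrix):
    """Kappa index from a square error matrix, following equation 10."""
    m = np.asarray(error_matrix, dtype=float)
    n = m.sum()                                        # N, total observations
    diagonal = np.trace(m)                             # sum of X_ii
    marginals = (m.sum(axis=1) * m.sum(axis=0)).sum()  # sum of X_i+ * X_+i
    return (n * diagonal - marginals) / (n ** 2 - marginals)

# Rows and columns: Water body, Forest, Shrubs, Built-up.
matrix = [[10, 1, 2, 0],
          [0, 248, 3, 6],
          [0, 6, 77, 5],
          [0, 5, 1, 144]]
kappa = kappa_coefficient(matrix)
print(round(kappa, 2))  # 0.91
```

With these counts the diagonal sum is 479 of 508 points, and the resulting Kappa of about 0.91 agrees with the value reported for the PCA-based classification.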

RESULTS AND DISCUSSION
The correlation matrix of the raw bands (2 to 7) of Landsat 8 (OLI/TIRS) showed that the highest correlations are between bands 3 and 4, and bands 2 and 3, with correlations of 0.97 and 0.98 respectively (Table 2). Some band pairs have weak correlations, most of them negative, such as bands 2 and 5 (-0.25), bands 3 and 5 (-0.13), bands 4 and 5 (0.17) and bands 5 and 7 (-0.04). These indicate very low inter-band correlation of the spectral data. High inter-band correlation (Table 2), on the other hand, indicates that the bands contain almost the same spectral information. Therefore, the use of bands with high inter-band correlation in data processing often leads to a multi-collinearity problem in data analysis. For instance, since bands 2 and 3 are highly correlated, it is pertinent to pick only one of the two bands for the image analysis. The same rule applies to bands 3 and 4, which are also highly correlated. The inter-band correlations with their coefficients of determination (R²) in Figure 2 give a more detailed statistical representation of the inter-band relationships. In interpreting the factor loading pattern (Table 3) of the relationships amongst bands vis-à-vis a given PC, a band is said to load heavily on a given PC if its factor loading is greater than or equal to 0.50 (Alphonsus and Raji 2019). Bands 6 and 7 loaded heavily on PC1 (0.65982 and 0.68504), while Band 5 loaded heavily on PC2 (0.95140).
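An inter-band correlation matrix of the kind discussed above can be computed as in the following sketch. The data are synthetic, the function names are hypothetical, and the 0.9 cut-off for "highly correlated" is an assumption made for illustration; the study itself does not state a threshold.

```python
import numpy as np

def band_correlation(stack):
    """Pearson correlation between bands; stack has shape (bands, rows, cols)."""
    flat = stack.reshape(stack.shape[0], -1)  # one row of pixel values per band
    return np.corrcoef(flat)

def highly_correlated_pairs(corr, band_ids, threshold=0.9):
    """List band pairs whose absolute correlation reaches the assumed threshold."""
    pairs = []
    for i in range(len(band_ids)):
        for j in range(i + 1, len(band_ids)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((band_ids[i], band_ids[j], float(corr[i, j])))
    return pairs

# Synthetic stand-in: two nearly identical bands and one independent band.
rng = np.random.default_rng(0)
base = rng.normal(size=(20, 20))
stack = np.stack([
    base,
    base + 0.05 * rng.normal(size=(20, 20)),
    rng.normal(size=(20, 20)),
])
corr = band_correlation(stack)
pairs = highly_correlated_pairs(corr, band_ids=[2, 3, 5])
```

Pairs returned by `highly_correlated_pairs` are candidates for keeping only one band, or for replacing the bands with principal components.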

Principal component analysis in classification:
The six raw Landsat bands were subjected to Principal Component Analysis (PCA), and the percentage of variance of each of the six resulting PCs is shown in Table 4. The results showed that PC_1 (81.34%), PC_2 (15.69%) and PC_3 (2.3376%) are the highest in terms of percentage of variance. The first three PCs of the original Landsat (OLI) data therefore described 99.37% of the original Landsat dataset, while the last three PCs accounted for only 0.63% of the original dataset.
In this study, the raw bands of the Landsat 8 (OLI/TIRS) satellite image (bands 2 to 7) and PC_1, PC_2 and PC_3 were composited and used for land use land cover (LULC) classification (Fig. 3). A visual comparison of the two composite outputs (Fig. 3) shows dissimilarities between the two composite images. Supervised classification was carried out using the maximum likelihood classifier. Four thematic classes (water body, forest, shrubs and built-up) were classified in the study area (Fig. 4 and 5). Although the main objective of this study is not premised on the LULC classes, it suffices to look at the statistics derived from the two LULC classifications.
Classification results (Table 5) showed disparities between the LULC classification with the bands 2 to 7 composite (LULC_Bands) and the LULC classification based on PCs 1 to 3 (LULC_PCA). For instance, the forest class in LULC_Bands occupied 41% of the total study area, while the forest class in LULC_PCA occupied 39%, a difference of 2%. The shrubs class was 33% in LULC_Bands and 39% in LULC_PCA. Built-up was 26% (LULC_Bands) and 22% (LULC_PCA).
Tables 7 and 8 show matrices that compare the reference data (Google Earth image) of the study area with the classified Landsat satellite image of the same study area. The cells of the matrices indicate the level of agreement between the reference image and the classified image for each LULC class.
A total of 508 random points were chosen for the accuracy assessment of this study. The error matrices for the two classified images were calculated and their performances were statistically compared as shown in Tables 6 and 7.
Class           Water body   Forest   Shrubs   Built-up   Total
Water body              10        1        2          0      13
Forest                   0      248        3          6     257
Shrubs                   0        6       77          5      88
Built-up                 0        5        1        144     150
Column Total            10      260       83        155     508
Although the observed difference in accuracy could be attributable to many underlying factors, it can be inferred that the PCA operations carried out on the dataset may have contributed to the higher Kappa coefficient recorded (Table 8).
Conclusion: This research showed that the PCA approach was a useful image pre-processing technique for reducing the dimensionality of the data for the study area. The first three PCs derived from the raw bands contain most of the information of the original data. Using the first three PCs resulted in a better classification than using the original dataset. The accuracy assessments of the two LULC types added credence to the importance of image pre-processing using PCA.