
Comparative Analysis of Clustering Algorithms on High Dimensionality Data


M. Abdulraheem
I.D. Oladipo
G.B. Balogun
M.O. Adeleke
D.S. Ricketts

Abstract

Data mining is an emerging research area employed by many evolving computing technologies because it reduces dataset complexity and provides insight into the data. It requires the ability to creatively envision enormous, heterogeneous datasets and to extract meaningful knowledge from them through the practical application of appropriate algorithms. Clustering algorithms are commonly categorized as partitioning, hierarchical, density-based, and grid-based. Partitioning clustering divides the data objects into several groups known as partitions, each of which represents a cluster. Hierarchical clustering algorithms build a hierarchy, or tree, of clusters over the data objects. Density-based algorithms aggregate data objects according to a particular neighbourhood, so that clusters form in areas of high density. Grid-based algorithms divide the data object space into a finite number of cells, which together form the grid structure. Because clustering is frequently used in data mining to examine data, the authors were motivated to compare clustering approaches. Data mining analysis is useful for gaining an understanding of the distribution of data, observing the characteristics of clusters, and focusing on certain clusters for further analysis. This work determines which of Expectation Maximization (EM) and Hierarchical Algorithms (HA) performs better on high-dimensionality data, using cluster accuracy and evaluation time as the parameters for comparison. Cluster analysis was performed using WEKA 3.8.5. The results show that the EM method performs better in both runtime and accuracy when clustering high-dimensional data, and that its performance improves as the number of clusters increases. In the HA method, however, running time and accuracy barely improved as the dataset varied. It is therefore observed that the HA method falls short of the EM method in performance.
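The two comparison parameters named above, cluster accuracy and evaluation time, can be illustrated with a minimal Python sketch. It is not the authors' WEKA procedure: the abstract does not state how accuracy was computed, so the sketch assumes a majority-class mapping (each cluster is credited with its most frequent true class), and `threshold_clusterer` is a hypothetical stand-in for an EM or hierarchical clusterer.

```python
import time
from collections import Counter, defaultdict


def cluster_accuracy(true_labels, cluster_ids):
    """Fraction of points whose class equals the majority class of their cluster.

    Assumption: majority-class mapping; the source does not specify its metric.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one data point is required")
    members = defaultdict(list)
    for label, cid in zip(true_labels, cluster_ids):
        members[cid].append(label)
    # Count, per cluster, how many points carry that cluster's most common class.
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in members.values())
    return correct / len(true_labels)


def timed(clusterer, data):
    """Run clusterer(data) once and return (cluster_ids, elapsed_seconds)."""
    start = time.perf_counter()
    cluster_ids = clusterer(data)
    return cluster_ids, time.perf_counter() - start


def threshold_clusterer(data):
    # Hypothetical placeholder algorithm: split on the sign of the first coordinate.
    return [0 if row[0] < 0 else 1 for row in data]


if __name__ == "__main__":
    data = [(-2.0, 0.1), (-1.5, 0.3), (-0.2, 0.9), (0.4, 0.2), (1.8, 0.5), (2.1, 0.7)]
    truth = ["a", "a", "b", "b", "b", "b"]
    ids, seconds = timed(threshold_clusterer, data)
    print(f"accuracy={cluster_accuracy(truth, ids):.3f} time={seconds:.6f}s")
```

Any callable that maps a dataset to a list of cluster identifiers can be passed to `timed`, so two algorithms can be scored on the same data with the same accuracy function and timer.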


Journal Identifiers


eISSN: 2006-5523
print ISSN: 2006-5523