
Theoretical Utility of Data Value Metric and Genetic Algorithms for Variable Clustering in an Unsupervised Learning Environment


Okpako A. Ejaita
Ojie D. Voke

Abstract

Cluster analysis is regarded as one of the most important unsupervised learning tasks. Its natural application is dividing data into meaningful groups, known as clusters, using only the information in the data: objects are described in terms of their relationships so that the data's natural structure is captured. Many traditional performance evaluation metrics for clustering algorithms exist in the literature, and they treat all attributes or variables equally when measuring similarity; however, different attributes or variables may contribute differently, because the amount of information they contain can vary greatly. The Data Value Metric (DVM) is an information-theoretic measure based on the concept of mutual information that has been shown to be a good metric for validating data quality and utility both in a big data ecosystem and in traditional data. Because it uses a forward selection search strategy, DVM suffers from local minima and loss of diversity in the population; hybridizing it with a Genetic Algorithm will overcome the problem of local minima, because the evolutionary search ensures a balance between exploration and exploitation of the search space. This paper proposes a hybrid model of the Genetic Algorithm and DVM as an information-theoretic metric for quantifying the quality and utility of variable clustering selection that can be applied to traditional data.
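
The hybrid idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' model: the abstract does not define DVM, so the `fitness` function below is a hypothetical stand-in (mean mutual information between each selected variable and a given cluster labelling, minus a size penalty), and the names `mutual_information`, `fitness`, `genetic_search`, `penalty`, and all parameter values are assumptions made for illustration. The sketch assumes discrete variables and at least two candidate variables.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def fitness(mask, columns, labels, penalty=0.05):
    """Hypothetical stand-in for DVM (an assumption, not the paper's formula):
    mean MI of the selected columns with the labels, minus a size penalty."""
    chosen = [col for bit, col in zip(mask, columns) if bit]
    if not chosen:
        return float("-inf")  # an empty selection is never acceptable
    fidelity = sum(mutual_information(col, labels) for col in chosen) / len(chosen)
    return fidelity - penalty * len(chosen) / len(columns)


def genetic_search(columns, labels, pop_size=20, generations=30,
                   mutation_rate=0.1, seed=0):
    """Evolve bit masks over the variables; returns the best mask found."""
    rng = random.Random(seed)
    n = len(columns)  # assumed to be at least 2

    def score(mask):
        return fitness(mask, columns, labels)

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # Exploitation: keep the better half unchanged (truncation selection).
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n - 1)            # one-point crossover
            child = a[:cut] + b[cut:]
            # Exploration: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = parents + children
    return max(population, key=score)
```

Unlike a forward selection search, which only ever adds one variable at a time to the current subset, the crossover and mutation steps here let the search jump between distant subsets, which is the exploration and exploitation balance the abstract refers to.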


Journal Identifiers


eISSN: 2705-3121
print ISSN: 2705-313X