Comparison of machine learning methods for the prediction of type 2 diabetes in primary care setting using EHR data

George Ochieng
Kenneth Rucha


Diabetes remains a major global public health challenge, creating a need for better methods of managing it. Machine learning could provide reliable support for the early detection and management of diabetes. This study compared selected machine learning approaches to determine their suitability for early detection of diabetes in the primary care setting. A retrospective study was conducted using an electronic health record (EHR) dataset of confirmed cases of diabetes collected during routine care at Nairobi Hospital. Institutional ethical approvals were obtained, and data were retrieved from the database through stratified sampling based on gender. Diagnoses were confirmed using ICD-10 codes. Records with approximately 5% missing values were excluded from the analysis. The data were cleaned by correcting errors and replacing missing values with measures of central tendency, then normalized using the decimal-scaling method. Data analysis was conducted using selected supervised and unsupervised learning algorithms, and model performance was validated using evaluation metrics for classification and clustering results, respectively. Random Forest achieved the highest accuracy (0.95) and the lowest error rate (0.05), while Gradient Boosting and a Multilayer Perceptron (MLP) with 3 hidden layers each obtained an accuracy of 0.94 and an error rate of 0.06. The process of selecting machine learning algorithms should explore both supervised and unsupervised learning techniques. In addition, an appropriately designed MLP architecture could yield strong results for classification tasks in primary care settings.
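
The preprocessing and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and sample values are hypothetical, the mean is assumed as the measure of central tendency (the abstract does not say which was used), and the error rate is taken to be 1 minus accuracy, which is consistent with the reported pairs (0.95 and 0.05, 0.94 and 0.06).

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries.

    Assumption: the mean is used; the abstract only says
    "measures of central tendency".
    """
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def decimal_scale(values):
    """Decimal-scaling normalization: divide by 10**j, where j is the
    smallest non-negative integer such that max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10**j >= 1:
        j += 1
    return [v / 10**j for v in values], j


def accuracy_and_error_rate(y_true, y_pred):
    """Accuracy is the fraction of correct predictions; error rate is its complement."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    return accuracy, 1 - accuracy


# Hypothetical feature column (not taken from the study's dataset).
glucose = impute_mean([120.0, None, 986.0, 45.0])
scaled, j = decimal_scale(glucose)

# Hypothetical labels and predictions.
acc, err = accuracy_and_error_rate([1, 0, 1, 1], [1, 0, 0, 1])
```

In this sketch the largest absolute value, 986, needs j = 3 so that every scaled value falls strictly inside the interval (-1, 1).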

Journal Identifiers

eISSN: 1561-7645