Clustering by Partitioning around Medoids using Distance-Based Similarity Measures on Interval-Scaled Variables

Abstract: This paper reports the results of a study of the partitioning around medoids (PAM) clustering algorithm applied to four datasets, both standardized and not, of varying sizes and numbers of clusters. The angular distance proximity measure was used to compute object-object similarity, in addition to the two more traditional proximity measures, namely the Euclidean distance and the Manhattan distance. The data used in the study comprise three widely available datasets and one that was constructed from publicly available climate data. The results replicate some of the well-known facts about the PAM algorithm, namely that the quality of the clusters generated tends to be much better for small datasets, that the silhouette value is a good, even if not perfect, guide to the optimal number of clusters to generate, and that human intervention is required to interpret generated clusters. The results also indicate that the angular distance measure, which traditionally has not been widely used in clustering, outperforms both the Euclidean and Manhattan distance metrics in certain situations.


I. INTRODUCTION
Cluster analysis (or clustering) is an unsupervised machine learning task used to find structure in unlabelled data. The clustering task groups a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other clusters (Aldenderfer and Blashfield, 1984; Han et al, 2006). Objects to be clustered are typically represented as a matrix, with each row of the matrix representing an object as a feature vector, and each dimension of the vector representing a feature or variable used to describe the object (Aldenderfer and Blashfield, 1984; Han et al, 2006; Kruskal et al, 1978; Tversky, 1977). Interpretation of generated clusters often requires human intervention to explain patterns that are common to members of the clusters.
Application areas of clustering are wide-ranging, including the following: biology, in classification of plants and animals; fraud detection in insurance, by identifying groups of customers with unusually high claims; automatic classification of similar web documents; and marketing, by identifying classes of customers with similar buying habits.
Several clustering approaches have been developed to address different types of data. These include partitioning approaches, hierarchical approaches, density-based methods, grid-based methods, model-based methods, special techniques for clustering high-dimensional data, and constraint-based clustering (Han et al, 2006; Yinghua et al, 2016). In the partitioning approaches, given a collection of n objects to cluster, k (k ≤ n) clusters of the objects are created, with each cluster containing at least one object, and each object belonging to exactly one cluster. Two common heuristics used to determine cluster membership in partitioning algorithms are a centroid-based technique, K-means, where each cluster centre is represented by the mean value of the objects in the cluster, and a representative object-based technique, K-Medoids, where each cluster centre is represented by one of the n objects. Partitioning Around Medoids (PAM) is a widely used K-Medoids method.
The PAM algorithm starts by selecting k random objects (medoids) as representatives of the k clusters, and each of the remaining n-k objects is assigned to one of the k clusters based on how similar it is to the corresponding medoid. The algorithm then attempts to improve on the initial clustering as follows: for each medoid object, swap each of the non-medoid objects with the medoid and compute the cost, i.e., the average dissimilarity, of the new clustering that results from this swap. If the cost increases, then the swap is undone.
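The swap procedure described above can be sketched in C as follows. This is a minimal illustration and not the code used in the study: it assumes a precomputed n x n dissimilarity matrix stored row by row, takes the initial medoids from the caller instead of drawing them at random, and keeps a swap only when the cost strictly decreases (so a swap that leaves the cost unchanged is also undone). The names `clustering_cost`, `is_medoid`, and `pam_swap` are hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Average dissimilarity of every object to its nearest medoid.
 * d is an n x n dissimilarity matrix stored row by row (assumption). */
double clustering_cost(const double *d, size_t n,
                       const size_t *medoids, size_t k)
{
    assert(n > 0 && k > 0 && k <= n);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        double best = DBL_MAX;
        for (size_t m = 0; m < k; m++) {
            double dist = d[i * n + medoids[m]];
            if (dist < best)
                best = dist;
        }
        total += best;
    }
    return total / (double)n;
}

/* Returns 1 if object obj is currently one of the k medoids. */
int is_medoid(const size_t *medoids, size_t k, size_t obj)
{
    for (size_t m = 0; m < k; m++)
        if (medoids[m] == obj)
            return 1;
    return 0;
}

/* Swap phase: try replacing each medoid with each non-medoid object,
 * keep the swap only if the cost decreases, and repeat until a full
 * pass makes no improvement. Returns the final cost. */
double pam_swap(const double *d, size_t n, size_t *medoids, size_t k)
{
    double cost = clustering_cost(d, n, medoids, k);
    int improved = 1;
    while (improved) {
        improved = 0;
        for (size_t m = 0; m < k; m++) {
            for (size_t h = 0; h < n; h++) {
                if (is_medoid(medoids, k, h))
                    continue;
                size_t old = medoids[m];
                medoids[m] = h;               /* tentative swap */
                double trial = clustering_cost(d, n, medoids, k);
                if (trial < cost) {
                    cost = trial;             /* keep the swap */
                    improved = 1;
                } else {
                    medoids[m] = old;         /* undo the swap */
                }
            }
        }
    }
    return cost;
}
```

Each accepted swap strictly lowers the cost, so the loop terminates. Recomputing the full cost for every trial swap is simple but expensive; it is chosen here for clarity only.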
Determination of object-object dissimilarity, and hence cluster membership of objects, has traditionally been done using the Euclidean and Manhattan (or city block) distance measures. Angular distance (the angular separation between two objects), another distance metric, has not been as widely studied in cluster analysis. The objective of this study was to implement and apply the PAM algorithm to both standardized and non-standardized versions of four datasets and, for each version of a dataset, compare the quality of clusters obtained from the Euclidean distance, Manhattan distance and angular distance similarity metrics. In each case, the quality of the generated clusters was determined using two metrics: the proportion of objects placed in the correct cluster as  et al, 2015; Linchman, 2013c). So, altogether, there were 72 x 15, or 1,080 measurements per protein. Each measurement can be considered as an independent sample/mouse. However, there are some missing data in the dataset. In this work, all the samples with missing data were removed, leaving 552 out of the original 1,080 samples.
The mice are placed into eight classes based on features such as genotype (control or trisomic), behavior (context-shock for mice that have been stimulated to learn, and shock-context for mice that have not been stimulated to learn), and treatment using the drug memantine in recovering the ability to learn in trisomic mice (some mice injected with the drug and others not injected). The resulting eight classes were as follows:
i.) c-CS-s: control mice, stimulated to learn, injected with saline (5 mice, 75 measurements)
ii.) c-CS-m: control mice, stimulated to learn, injected with memantine (3 mice, 45 measurements)
iii.) c-SC-s: control mice, not stimulated to learn, injected with saline (5 mice, 75 measurements)
iv.) c-SC-m: control mice, not stimulated to learn, injected with memantine (4 mice, 60 measurements)
v.) t-CS-s: trisomy mice, stimulated to learn, injected with saline (5 mice, 75 measurements)
vi.) t-CS-m: trisomy mice, stimulated to learn, injected with memantine (6 mice, 90 measurements)
vii.) t-SC-s: trisomy mice, not stimulated to learn, injected with saline (5 mice, 72 measurements)
viii.) t-SC-m: trisomy mice, not stimulated to learn, injected with memantine (4 mice, 60 measurements)

B. Theoretical Framework and Methodology
1) Determining an Optimal Value for k, the number of clusters to generate:
A common approach makes use of a silhouette (Rousseeuw, 1987; "Silhouette (clustering)," 2016). A silhouette value is a computed number that ranges from -1 to +1, and is a compact measure of cohesion (how similar an object is to its cluster) and separation (how dissimilar an object is to other clusters). A high silhouette value suggests that the object is well matched to its cluster, and a low silhouette value suggests that the object is not well matched to its cluster. Hence, in determining an appropriate value for k, clusters that have many objects with high silhouette values and few objects with low silhouette values were sought. The average silhouette score for each cluster is an indicator of how good the cluster is; similarly, the average silhouette value for all the clusters is an indicator of how good all the clusters put together are.
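As an illustration of how such a value can be computed, the sketch below follows the usual definition from Rousseeuw (1987): for object i, a(i) is its average dissimilarity to the other members of its own cluster, b(i) is the smallest average dissimilarity to the members of any other cluster, and s(i) = (b(i) - a(i)) / max(a(i), b(i)). The row-major dissimilarity matrix, the integer label array, the value 0 for an object alone in its cluster, and the function names are assumptions made for this example, not details taken from the study.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Silhouette value s(i) of object i.
 * d: n x n dissimilarity matrix stored row by row (assumption).
 * label[j]: cluster index of object j, in 0..k-1 (assumption). */
double silhouette(const double *d, size_t n, const int *label, int k, size_t i)
{
    assert(i < n && k >= 2);
    double a = 0.0;        /* cohesion: mean dissimilarity within own cluster */
    double b = DBL_MAX;    /* separation: smallest mean to another cluster */
    size_t own_count = 0;

    for (int c = 0; c < k; c++) {
        double sum = 0.0;
        size_t count = 0;
        for (size_t j = 0; j < n; j++) {
            if (j != i && label[j] == c) {
                sum += d[i * n + j];
                count++;
            }
        }
        if (c == label[i]) {
            own_count = count;
            if (count > 0)
                a = sum / (double)count;
        } else if (count > 0) {
            double mean = sum / (double)count;
            if (mean < b)
                b = mean;
        }
    }
    /* Convention assumed here: an object alone in its cluster, or with no
     * other non-empty cluster to compare against, gets s(i) = 0. */
    if (own_count == 0 || b == DBL_MAX)
        return 0.0;
    double larger = a > b ? a : b;
    return larger > 0.0 ? (b - a) / larger : 0.0;
}

/* Average silhouette value over all n objects. */
double average_silhouette(const double *d, size_t n, const int *label, int k)
{
    assert(n > 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += silhouette(d, n, label, k, i);
    return total / (double)n;
}
```

Running `average_silhouette` for several candidate values of k and comparing the results is one way to apply the guideline described above.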
2) Object Representation: An important consideration in clustering algorithms is the data types of the variables used to represent objects, as these determine the approach to compute similarity between objects. Variable types include the following (Bramer, 2013; Han et al, 2006): interval-scaled variables, binary variables, categorical variables, ordinal variables, and ratio-scaled variables.
Interval-scaled variables are continuous measurements on a roughly linear scale (Han et al, 2006). Examples are weight, length, and temperature. One concern with interval-scaled variables is that the measuring unit may affect the generated clusters. Expressing a variable in smaller units (for example, centimetres instead of metres) leads to a larger range for that variable, and thus a larger effect on the resulting clustering structure (Han et al, 2006). One way to avoid dependence of the clustering algorithm on the measurement unit is to standardize the data, either by giving each variable the same weight or, in some cases, by giving higher weights to important variables (e.g., height of a basketball player when clustering potential recruits for a basketball team). It is important to note, though, that standardization is not useful in every application. This work considers only objects represented with interval-scaled variables, and limits discussion to this category of variables.
3) Similarity Measures: Several proximity measures are available to choose from, including spatial models, set-theoretic models and graph-theoretic models (Corter, 1996); correlation coefficients, distance measures, association coefficients, and probabilistic similarity measures (Aldenderfer and Blashfield, 1984). Distance-based proximity measures consider objects as points in a coordinate space; with such a representation, object-object similarity is a measure of how close the objects are in this space. A true distance-based similarity metric must meet four criteria (Aldenderfer and Blashfield, 1984), summarized below for objects x, y, and z, separated by distance d:
i.) Symmetry: the direction of measurement of distance is immaterial, i.e., d(x,y) = d(y,x) ≥ 0
ii.) Distinguishability of nonidenticals: if the two points differ, then the distance between them is not equal to zero, and from the symmetry criterion, must be greater than zero. In other words, if x ≠ y, then d(x,y) ≠ 0
iii.) Indistinguishability of identicals: if two points coincide, then the distance between them is zero, i.e., d(x,x) = 0
iv.) Triangle inequality: for any three points x, y, z, the following relationship holds true: d(x,y) ≤ d(x,z) + d(y,z)
Two of the most commonly used distance metrics are Euclidean distance and Manhattan distance (Aldenderfer and Blashfield, 1984; Han et al, 2006; Tversky, 1977). The cosine similarity measure has been widely used in areas like information retrieval and text mining that make use of high-dimensional data (Han et al, 2006; Manning., 2008; Nkweteyim, 2014; Salton, 1988; Salton and McGill, 1986), and can also be used in cluster analysis.
The cosine similarity metric works well in information retrieval systems using the vector-space model to represent objects. In that representation, document vectors are represented by term weights, which are all non-negative, guaranteeing that the similarity score always ranges from 0 to 1. However, in other applications in which object variables may be negative, the cosine similarity score ranges from -1 to 1. In such cases, cosine similarity does not meet the triangle inequality requirement for a distance-based similarity metric, and the angular distance metric, which meets this requirement, could be used instead.
Once determined, object-object proximity measures can be stored in a look-up table to be consulted during the clustering process.
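A look-up table of this kind might be built as sketched below: a symmetric n x n array filled once from any distance function and then indexed during clustering. The flat row-major layout, the function-pointer type `distance_fn`, and the names used are assumptions for illustration; a Manhattan distance is included only so that the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Any distance function over two objects with dims variables each. */
typedef double (*distance_fn)(const double *x, const double *y, size_t dims);

/* Manhattan distance, included only to make the example self-contained. */
double manhattan_distance(const double *x, const double *y, size_t dims)
{
    double sum = 0.0;
    for (size_t v = 0; v < dims; v++) {
        double diff = x[v] - y[v];
        sum += diff < 0.0 ? -diff : diff;
    }
    return sum;
}

/* Builds an n x n look-up table of pairwise dissimilarities.
 * data holds n objects of dims variables each, stored row by row.
 * Returns NULL if memory cannot be allocated; the caller frees the table. */
double *build_dissimilarity_table(const double *data, size_t n, size_t dims,
                                  distance_fn dist)
{
    assert(data != NULL && dist != NULL && n > 0);
    double *table = malloc(n * n * sizeof *table);
    if (table == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        table[i * n + i] = 0.0;
        for (size_t j = i + 1; j < n; j++) {
            double value = dist(&data[i * dims], &data[j * dims], dims);
            table[i * n + j] = value;   /* symmetric: store both halves */
            table[j * n + i] = value;
        }
    }
    return table;
}
```

Each pair is computed once and mirrored, which relies on the symmetry criterion d(x,y) = d(y,x). Storing only the upper triangle would halve the memory at the cost of slightly more involved indexing.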

4) Standardization of Object Vectors:
As mentioned in the introduction, an object can be represented in standard form as a feature vector, with the weight given to each feature dependent on the unit of measure used. An alternative representation is to standardize the vector to give each variable the same weight. One common way to standardize data is to compute a z-score (Equation 3) for each variable. Note that -1 ≤ s(i) ≤ 1 always holds.

III. METHODOLOGY
In this study, C code was designed and implemented to generate clusters using the PAM algorithm and to compute silhouette values for different numbers of clusters generated from the four datasets, both standardized and not standardized. In each case, the Euclidean, Manhattan, and angular distances were used to determine the dissimilarities between objects. Guided by the number of classes of objects, k, suggested by each dataset, five sets of clusters were generated for all but the iris dataset, comprising k-2, k-1, k, k+1, and k+2 clusters respectively, in order to assess the usefulness of silhouette values as a guide to the ideal number of clusters to generate for a given dataset. For the iris dataset, with only three categories of flowers, the numbers of clusters generated were 2, 3, 4, 5, and 6.

A. Clusters
Tables 1-4 show the clustering results for the four datasets. The results are based on the optimal number of clusters generated for the number of classes of objects suggested by the dataset: 3, 6, 8, and 30 clusters respectively for the iris, climate, mice, and leaf datasets. The results show that for non-standardized data, the angular distance measure outperforms the Euclidean and Manhattan distance measures for three of the four datasets, while the Manhattan distance measure fares similarly to, or slightly better than, the Euclidean distance measure on all the non-standardized datasets. When the datasets are standardized, however, there is no clear advantage of any of the three measures over the others. These facts are further summarized and illustrated in Figures 1 and 2.

B. Silhouette widths
Table 5 shows the average silhouette widths for five consecutive numbers of clusters generated by the clustering algorithm, with the silhouette values highlighted for the optimal number of clusters as suggested by the number of classes in the dataset. The optimal numbers of classes as suggested by the respective datasets are printed in bold. With the exception of the iris dataset, for which there is a sharp change in silhouette values from one run to another, the changes for the other datasets are mild.
Secondly, the computed silhouette widths range from 0.16 (using the Manhattan distance dissimilarity measure on the generation of 8 clusters from the standardized mice protein dataset) to 0.7 (using the angular distance dissimilarity measure on the generation of 6 clusters from the non-standardized climate dataset).
Results suggest that angular distance, which in the past has not been prominently highlighted as a useful dissimilarity metric in clustering, does indeed result in higher quality clusters than the well-known Euclidean and Manhattan distance metrics in some situations. generate using a partitioning clustering algorithm as there is no fine line separating the silhouette values between the optimal number of clusters and a less optimal number. This work took advantage of the pre-determined number of classes in each dataset to know the optimal number of clusters to generate. In practice, though, it is not known beforehand how many clusters to generate. This work thus re-emphasizes the need for human intervention in interpreting generated clusters, as silhouette values alone can only serve as a guide to the number of clusters to generate.
Results reveal that the algorithm succeeded in correctly clustering larger percentages of objects in some datasets than others. Too much cannot be read into this, as several factors could be responsible, notably variation in dataset sizes, number of attributes, and number of classes. In general, the larger these values are, the more difficult the clustering problem is. The iris dataset, which had the best performance, was the simplest, with 150 objects, 3 clusters, and 4 attributes. The mice protein dataset on the other hand, which performed poorly, comprised 552 objects, 8 clusters, and 77 attributes.
There are some limitations in the work, which a similar study could consider, to get a better appreciation of the relative performances of different dissimilarity metrics.First, datasets with similar characteristics (size, number of attributes and number of known classes) could be selected.
With a variety of similarity metrics available for use in clustering, and none of them apparently outperforming all others in all situations, an approach to clustering can be envisaged in which cluster membership of an object is determined not from the object-object similarity score of a single similarity metric, but through a voting system in which an object is placed in a cluster only if a majority of the similarity metrics used supports its membership of that cluster.
V. CONCLUSION
This paper presented details of the clusters obtained when the PAM algorithm was applied to four datasets (both un-standardized and standardized using the z-score), using three metrics (Euclidean distance, Manhattan distance, and angular distance) to determine similarity between objects, and hence cluster membership of objects. Cluster silhouette widths were also computed in order to assess the usefulness of this metric in estimating the quality of generated clusters.
Results show that the seldom-used angular distance metric outperforms the widely used Euclidean and Manhattan distance dissimilarity metrics in certain situations, and so should be considered as a viable alternative distance measure by researchers in the area of clustering. Given that different proximity measures result in different clusters, perhaps automated distance-based clustering should use several proximity measures, and place an object in a cluster only if the majority of the proximity measures vote for the object to be placed in that cluster.
The work also confirms the fact that silhouette widths alone are not sufficient to determine the quality of generated clusters, and so human examination remains important in interpreting generated clusters.
Low-dimensional datasets were chosen in the work. This work can be extended to investigate other similarity measures to appreciate their usefulness in the PAM algorithm, as well as higher-dimensional data, in order to investigate the effects of dimensionality on the algorithm.

Figure 1: Percentage of objects per dataset correctly clustered on un-normalized datasets.

Figure 2: Percentage of objects per dataset correctly clustered on normalized datasets.