Characterizing KMeans clustering methods to accelerate experiment run-times

Material Information

Characterizing KMeans clustering methods to accelerate experiment run-times
Kravitz, Yaffe
Place of Publication:
Denver, CO
University of Colorado Denver
Publication Date:

Thesis/Dissertation Information

Master's ( Master of science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Electrical Engineering, CU Denver
Degree Disciplines:
Electrical engineering
Committee Chair:
Connors, Dan
Committee Members:
Liu, Chao
Gedney, Stephen

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
Copyright Yaffe Kravitz. Permission granted to University of Colorado Denver to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.


This item has the following downloads:

Full Text
A. A.S., Clinton Community College, 2010
B. S., University of Colorado Denver, 2016
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science Electrical Engineering

This thesis for the Master of Science degree by Yaffe Kravitz has been approved for the Electrical Engineering Program by
Dan Connors, Advisor Chao Liu Stephen Gedney
Date: July 29, 2017

Kravitz, Yaffe (M.S., Electrical Engineering Program)
Thesis directed by Professor Dan Connors
Machine learning algorithms have the potential to unlock solutions to ambitious problems in a myriad of scientific and industrial fields. Clustering is an unsupervised machine learning technique in which data objects are grouped, based on similarity, into clusters. Objects within a cluster are alike in some mathematical way and objects in different clusters are distinctly different. In this way, clustering uncovers the hidden structure of a dataset to discover how many data objects are similar and the number of clusters. Unfortunately clustering techniques vary in two experimental ways: clustering execution time and the accuracy of the final clustering results. Furthermore, many clustering algorithms require multiple experimental evaluations that, in turn, generate prohibitive clustering execution time [1].
Kmeans is the primary clustering algorithm in the data mining and unsupervised machine learning spheres that defines a single clustering experiment by defining the set of similar points that are each closest to ’K’ cluster centroids. The algorithm finds a solution of centroids in the dataset by iteratively calculating the multi-dimensional ’D’ distances between each of the experimental ’K’ cluster centers and the ’N’ data points. While the base Kmeans algorithm is parallelizable across these iterations, there are two exacerbating problematic conditions of the algorithm’s use: starting seed selection and the selectivity the number of K-centroids for a dataset. Each of these conditions requires multiple ’E’ experiments to evaluate within each dataset. The work outlined in this thesis investigates the optimization and the restructuring of the Kmeans algorithm to use multiple sub-samplings of a target dataset to generate accurate K-centroid seeds that eliminate excessive iterations of the base clustering algorithm.

The form and content of this abstract are approved. I recommend its publication.
Approved: Dan Connors

I dedicate this work to my love, Rebekah, and our four amazing children, Anna, Sarah, Jonathan, and Samuel who have stood by me and supported me throughout my educational career. Without their love, support, and never-ending supply of patience, this endeavor would have been nearly impossible.

This thesis would not have been possible without the generous support of Professor Dan Connors. The valuable expertise and depth of knowledge in the computer engineering held that Dr. Connors provided was instrumental in achieving this milestone. I also owe a great debt of gratitude to those who encouraged and motivated me to start this journey and see it through to completion.

I. INTRODUCTION...................................................... 1
II. BACKGROUND........................................................ 4
Machine Learning ................................................. 4
Categories of Learning ....................................... 5
Supervised, Semi-supervised, and Unsupervised Learning . . 5
Regression.................................................... 6
Classification................................................ 6
Clustering.................................................... 9
Dimensionality Reduction..................................... 10
Kmeans Clustering................................................ 11
Initialization Methods....................................... 17
Initialization Sensitivity............................... 17
Cluster Affinity............................................. 21
Available Kmeans Code Packages............................... 22
Variations of Clustering Algorithms.......................... 22
Kmeans Algorithms............................................ 23
Machine Learning Datasets.................................... 23
III. APPROACH........................................................ 25
Refined Initialization of Centroids.............................. 25
Running the Experiments.......................................... 28
IV. EXPERIMENTAL RESULTS............................................ 31
Results.......................................................... 31
Algorithm vs Clustering Time................................. 31
V. CONCLUSION...................................................... 40
REFERENCES............................................................... 41

Machine learning algorithms have the potential to unlock solutions to ambitious problems in a myriad of scientific and industrial fields. Traditional machine learning concepts can be divided into four categories: regression, classification, clustering, and dimensionality analysis. Varieties of algorithms in each of these machine learning domains require multiple parameters to explore to achieve the optimal model and analysis. As clustering is one approach to extracting hidden structures of data, it is significantly useful at the start of exploring the utility of a dataset. For that reason, it is critical to cut through the computational barriers of the deployment early in an investigation of a dataset to unearth advantageous insights related to exploring the efficiency and expediency of machine learning concepts of regression and classification.
Cluster analysis (or clustering) is the task of grouping a set of similar objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters) [2], The clustering task is deployed in exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics [3].
While the overall concept of clustering is to place data objects into groups based on similarity, the definition of similarity may vary depending on the intended goal of exploring the dataset. In general, objects within a cluster are alike in some mathematical way and objects in different clusters are distinctly different. Common views of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Similarly, clustering uncovers the hidden structure of a dataset to discover how many data objects are similar and the number of clusters [4], Cluster analysis is not one specific algorithm

but the general task to be solved. In this way, there are many various clustering algorithms that differ significantly in their notion of what constitutes grouping similarity in a cluster and how efficiently to find the similarity within clusters.
Kmeans is one of the primary clustering algorithms used in the data mining and unsupervised machine learning environments. Kmeans defines a single clustering experiment by defining the set of similar points that are each closest to ’K’ cluster centroids. The algorithm finds a solution of centroids in the dataset by iteratively calculating the multi-dimensional ’D’ distances between each of the experimental ’K’ cluster centers and the ’N’ data points. While the base Kmeans algorithm is paral-lelizable across these iterations, there are two exacerbating problem space conditions of the algorithm’s use: starting seed selection and the selectivity of the number of K-centroids for a dataset. Each of these conditions require multiple ’E’ experiments to evaluate within each dataset. Thus to explore the hidden structure of a single dataset requires significant computational work in calculating ’E’ experiments of ’K’ centroids for an ’NxD’ sized data [5].
The bulk of the Kmeans calculations are performed to find centroids that minimize a mathematical score between the data-points and each centroid. One of the inherent issues with the Kmeans algorithm arises in the way the initial starting centroids are selected and that a number of initial iterations are exhausted to migrate centroids from their initial seed positions to positions within the multidimensional space that match the inherent characteristics. There are a number of approaches to initializing Kmeans centroids, most commonly is to use randomly selected data points and other generalized heuristics.
The refinement Kmeans approach overcomes the computational barrier of clustering by sampling a large dataset to uncover candidate centroids to initiate the search of initial seed points for the full dataset. This thesis investigates the optimization of the refinement Kmeans algorithm by evaluating various sampling sizes across a wide

range of target datasets. Moreover, a variety of Kmeans algorithms are evaluated to additionally determine further impact to the computational requirements of finding the optimal clustering in a dataset. The goal is to uncover data characteristics that might guide data science researchers in setting optimal Kmeans evaluation parameters that overcome the prohibitive computational costs of performing full dataset exploration. The methods for characterizing the results are clustering execution time and the clustering affinity (a measure of accuracy of the final clustering results).
This thesis is organized as follows: Chapter II discusses the background of the Kmeans clustering algorithm. Chapter III examines the Kmeans refinement approach and thesis experimental framework for evaluating multiple clustering evaluations. The experimental results section, Chapter IV, shows performance data for the clustering of selected datasets. Finally, Chapter V concludes this thesis.

BACKGROUND Machine Learning
Machine learning is a subfield of computer science that refers to the equipping of computers with the ability to learn without explicit programming. The terms “machine learning” and “artificial intelligence” will often be used interchangeably and, while the concepts are connected, there are differences. The goal of artificial intelligence is to create a machine that can mimic the human mind. To do this, it will need a machine to have learning capabilities. Artificial intelligence also encompasses knowledge representation, abstract thinking, and reasoning. Machine learning’s only focus is to create software that can learn from past experience - hence the term “machine learning”. Thus, while artificial intelligence is the embodiment of recreating human characteristics in a machine, the focus of machine learning is only on the development of computer programs that can change and learn when exposed to new data.
Three basic elements of machine learning are seen throughout the tens of thousands of machine learning algorithms in existence, not to mention the hundreds of new algorithms developed every year. These three components are as follows:
• Representation: display of knowledge
• Evaluation: judging of hypotheses
• Optimization: generation of models
Representation simply refers to how the knowledge is represented. For example, depending on the characteristics of the dataset, it may be understood best with either a decision tree or a set of rules. Instances, graphical models, neural networks, and model ensembles are all examples of ways knowledge can be represented.

Evaluation is the way hypotheses are judged (or “evaluated”). Accuracy, prediction and recall, mean squared error, cost, and likelihood are all common examples of evaluation.
Optimization is the way models are generated; this is also known as the search process. Examples include combinatorial optimization, convex optimization, and constrained optimization. All machine learning algorithms are a basic combination of these three components. In other words, these three categories represent the framework to understand any machine learning algorithm [6].
Categories of Learning
Machine learning can be summed up into four different categories: classification, regression, clustering, and dimensionality reduction (see figures II. 1 - II.8). These four categories encompass the world of supervised, semi-, and unsupervised learning (sometimes, reinforcement is also included but can be considered a subcategory to the unsupervised learning category and not really a separate category).
• Regression: predicting a value
• Classification: predicting a type or category
• Clustering: Ending the hidden structure of data
• Dimensionality Reduction: getting the set of principal variables in data
Supervised, Semi-supervised, and Unsupervised Learning
Classification and regression are quite similar - for both systems, the goal is to generate a function to predict values for unseen observations. It is important that during the training of the function, labels are available to the algorithm.
Regression falls under the supervised learning focus, while classification is considered to be a semi-supervised technique. To assess the performance of a trained model,

the labels of real observations can be compared to the predicted labels. Labeling can be a tedious burden and is often done by humans, so techniques that do not require labeled data are sought out and they are known as unsupervised learning. An example of unsupervised learning includes clustering, where groups of observations that are similar are located. It does not require specific labeled observations. Assessing the performance of this trained model is more difficult than a more straightforward supervised learning algorithm because there are no real labels to use for comparison.
Some techniques overlap between the supervised and the unsupervised learning mediums into a land of gray. This so-called “semi-supervised learning” could involve a lot of observations that are not labeled and a few which are labeled. Clustering can be used to group similar observations and the information about the clusters (and the few labels) can be used to assign a class to the unlabeled observations (then this gives us more labeled information to perform unsupervised learning on) [7].
Regression problems are machine learning problems that try to predict an input value based on previous information. The variables input are called predictors and the output is known as the response (see Figure II.2). In some ways, regression and classification are very similar in that they both exemplify trying to estimate a function that maps input to output based on earlier observations. In regression, however, the search is to find an actual value, not just a class of an observation. An example of what regression could be useful for would be modeling credit scores based on a person’s record of payments. All regression models have at least two things in common: first, the response is always quantitative and second, they will always need input knowledge of previous input-to-output observations to erect a model.
A classification problem involves predicting whether a given observation belongs
to a certain category. Classification is the process of taking an input and mapping it


/ predicting
Figure II. 1: Scikit-Learn Algorithm Cheat Sheet - Choosing Estimator [8]
few features should be
A ^ RidgeRegression
SVR(kernel=’linear') |
Figure II.2: ScikitLearn Algorithm Cheat Sheet - Regression [8]
to a discrete label. For example, this is the case when an image is being assigned to the label of either a “cat” or a “dog”. With only two choices, this would be known as two-class or binomial classification. Often times, there are many more categories (for

instance, when trying to predict the winner of a sports tournament), and the problem is known as multi-class classification.
Classifiers not ' KNeighbors
Figure II.3: ScikitLearn Algorithm Cheat Sheet - Classification [8]
Classification involves training on earlier observations to create a classifier that will try to predict the class of future observations. In the binomial classification referenced above, a model would predict the difference between a dog and a cat given a photo with one or the other. The system might be trained on many different observations of dogs as well as many different observations of cats, to come up with classifiers that would be able to tell if a new, previously unseen, photo of a dog or cat is either classified as a dog or classified as a cat (see Figure II.4). Once a classifier is constructed, it can be used to take arbitrary input (like a photo of a dog or cat), and label the image as a dog or cat (see Figure II.5).
r 1 Earlier Observations r f ESTIMATE r ^ Classifier k A
Figure II.4: Classification: Building a Classifier
The possible applications of this technique are broad and promising. For example,
after significant studies of medical examinations determine that certain vital signs and

Figure II.5: Classification: Using a Classifier to Predict
symptoms are related to a specific disease, that information can be used to build a predictor that will allow a reasonable prediction to be made that a new patient with unseen signs and symptoms might have this disease and may need treatment. A simple example would be the yes-or-no prediction simplicity of “is this tumor cancerous?” A basic question with profound implications. It is important to use a qualitative output and to remember that there is foreknowledge of the classes to which observations can belong.
Clustering is an attempt to group similar objects together into clusters where the clusters themselves are dissimilar. Good clustering methods produce high quality clusters with high similarity within clusters and low similarity compared to other clusters [9]. This similarity can be calculated by using the notion of “affinity”. Clustering is different from classification and regression in that there is no need for foreknowledge of the labels and sometimes there is no clear right or wrong in clustering (see Figure II.7). Clustering can be useful when there is no previous knowledge about any patterns inside samples of data. Different clusterings can reveal new information (sometimes useful information) about a potential hidden structure within a data set. A practical example of clustering could be segmenting a market of potential consumers based on their demographic features and purchasing histories. Another example of clustering could be to find groups of movies based on reviews of movies or features in each movie to find movies most like another movie [10].

number of categories known
Spectral Clustering NOT I WORKING
MiniBatch KMeans ir,
Figure II.6: ScikitLearn Algorithm Cheat Sheet - Clustering [8] 15
-15 -10 -5 0 5 10 15
Figure 11.7: Clustering with Three Clusters
Dimensionality Reduction
In machine learning, dimensionality reduction is the process of reducing the number of random variables in consideration by obtaining a set of principal variables. The process is useful to speed up algorithm execution and it can actually be helpful in the final clustering accuracy or classification. If there is faulty input data or too much

noise, a less than desirable algorithm performance can be seen. Removing data that is actually dis-informative could find more general classification regions and rules and better performance can be seen on new datafll].
One of the ways to reduce input dimensionality involving a high number of missing or irrelevant values is called Principal Component Analysis (PCA). PCA is a statistical procedure for identifying the principal components or the number of uncorrelated variables from a large set of data. To explain the maximum amount of variance with the fewest number of principal components is the chief concern of PCA. Usually PCA is just one step in a series of analyses and is used to reduce the number of variables to avoid multicollinearity. The social sciences and marketing research arenas commonly use principal components analysis for their often very large data sets. Figure II.8 shows the ScikitLearn overview guidelines of using PCA for dimensionality reduction.
Figure II.8: ScikitLearn Algorithm Cheat Sheet - Dimensionality Reduction [8]
Kmeans Clustering
The Kmeans algorithm is a clustering algorithm that seeks to minimize the Euclidean distance between centroids and the members associated with those centroids, hi reality, it attempts to find a hidden structure in a dataset, if one exists, k (number of clusters) is specified manually and a dataset is provided while the algorithm does the rest. R will always output a solution, but the solution may not be correct. If the goal is to cluster based on some similarity, the answer Kmeans provides may not always meet that criterion.

Kmeans is known for its simplicity and its ability to be used for a myriad of data types. The Kmeans algorithm is a clustering method defined by its partitioning based, non-hierarchical methodology. Given a set of objects X, Kmeans partitions every data point ay into a cluster Cj (where j < k and k, the specified number of clusters, is a parameter that must be provided) that minimize the contained group sum of squared errors. Partitioning is done in such a way as to minimize the objective function.
Xi ^Ij ||
j= 1 XiEcj
where jij is the centroid of cluster Cj.
The kmeans algorithm begins by initializing k cluster centers. The data points are then assigned to one of the existing clusters in accordance with the square of the Euclidean distance from the center of each cluster (centroid). The centroid of each cluster is then updated to become the mean center of all the data points in the cluster as a result of the change in membership. The process of assigning and re-assigning membership and updating the centroids is repeated until there is no more change in the value of any of the cluster centers, also known as convergence.
Assignment step: Every point ay in X is assigned to the closest centroid’s cluster.
Update step: Centroid of every cluster is recalculated based on the mean point of all the data points.
One way to make kmeans work more expediently is to pick the best value for k. Another way is to intelligently select the starting centroids.
See Figures II.9 - 11.16 for an example of kmeans in action:

Figure 11.10: Membership Assignment

Move centroids to mean of points in clusters
Figure 11.11: Moving Centroids to Mean of Data Points
Figure 11.12: Membership Reassignment

Move centroids to mean of points in clusters
• •
Figure 11.13: Moving Centroids to Mean of Data Points
Figure 11.14: Membership Reassignment

Move centroids to mean of points in clusters
Already at mean No change
Figure 11.15: No Movement - Centroids at Local Minimum

Initialization Methods
There are several different methods used to initialize Kmeans, of them, the common ones are Forgy and Random Partition [12]. Random Partition does just that, randomly assigns all the data points to different clusters and then finds the mean of each of the clusters and uses the means as the initial starting centroids. Forgy takes the initial means and spreads them out more, whereas the Random Partition method tends to have the means closer to the center. Another initialization method, the one used in this thesis, by Bradley and Fayyad [13] generally does much better for getting the clustering results. Kmeans++ is another algorithm that initializes the centroids differently:
1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)2.
4. Repeat Steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard Kmeans clustering.
Initialization Sensitivity
Kmeans is an extremely valuable tool in the machine learning toolbox, but it doesn’t come without flaws. In computing, there is no such thing as a perfect algorithm that applies universally across the board to every possible dataset. Kmeans

is no exception. One of the issues with Kmeans is that it is highly sensitive to initialization. What that means is on the first step in the algorithm, centroids must be assigned before the iterative process begins. One of the iterative steps is to measure the distance from data points to centroids, so those initial centroids must be placed somewhere to start. It has been an area of much research to figure out a good way to initialize the centroids, and there are some great solutions out there, but this thesis proposes expanding on some of the ideas to improve the Kmeans algorithm even more.
There have also been several attempts to improve the efficiency and time it takes to run the Kmeans algorithm. Because of its iterative nature, larger dimensionality datasets can take a significant amount of time to produce a result. Improving the efficiency and speed that Kmeans takes to perform clustering alone is not much help if it produces an incorrect answer, an incorrect answer produced in a shorter amount of time is not usually helpful.
Improving the initialization can actually speed up the algorithm because if the the starting centroids can be placed intelligently in such a way that they are closer to where the final iteration would place the centroids, the number of iterations needed to converge is usually reduced, which typically equates to a reduction in total clustering time.
In some cases, Kmeans is not capable of ever finding the correct clustering result because of the nature of the dataset. In the following sample dataset, Kmeans did not cluster the result correctly (see Figure II. 7) and never would correctly cluster it because the hidden structure is not one that a mean-based clustering algorithm would fold. For this example, a density based algorithm, such as DBSCAN, would be more suitable to carry out this task (see Figure 11.18).
While Kmeans is probably the most intuitively straight-forward clustering method (where an algorithm groups data in k clusters based on some measure of similarity),

Figure 11.17: Kmeans (k=4) Clustering - Cannot Find Hidden Structure in Dataset


• %
i, *{•/
Figure 11.18: DBSCAN Clustering - Correctly Finds Hidden Structure in Dataset
it is not without its setbacks. This algorithm’s speed and simplicity are unparalleled, however, one of the issues with Kmeans is that if it is not seeded with good initial centroid points, it can end up with an incorrect clustering, even for data that should be easily clustered.
For example, Figure 11.19 shows an initial seeding of centroids where the initial points were selected randomly. In the resultant clustering, figure IF20 shows how a poor initialization selection for the centroids can result in a bad clustering result -

this would show up as a poor affinity score compared to the correct clustering.
Figure 11.19: Clustering (k=3) - Poor Initial Seeding of Centroids with Random Initialization
Figure 11.20: Clustering (k=3) - Poor Initial Seeding of Centroids Leads to Poor Clustering Results
The refinement algorithm has the following merits of operation:
• performs Kmeans on a subset of sampled data so that the system completes much faster.
• the answer potentially is better as it is derived by multiple findings, compared to that of the naive Kmeans that starts from a single set of centroids.

Cluster Affinity
A concept for evaluating the assignment of data objects to clusters is a mathematically derived score known generally as affinity. In short, affinity is a calculated measure of how well data objects are assigned to clusters. Figure 11.21 details an example of affinity. If there were only 2 data points, with 2 clusters (k=2), affinity would be zero because the centroids would be the data points and the data points would be 0 distance from each of their respective centroids.
Centroid 1
Centroid 2
Figure 11.21: Affinity for k=2 with Two Data Points
If there were 3 data points and 2 clusters (k=2), then the resultant clustering might look like Figure 11.22, where the affinity for one of the data points is zero, and the affinity of the other centroid is the average sum of squared distances of the data points to the cluster’s mean center point (centroid).
Figure 11.22: Affinity for k=2 with Three Data Points

Available Kmeans Code Packages
Several prominent code packages include the Kmeans clustering algorithm:
• Scikit learn -
• mlpack -
• OpenCV - http://docs.opencv.Org/2.4/modules/core/doc/clustering.html
• node-Kmeans -
• R-
• ELKI (Environment for DeveLoping KDD-Applications Supported by Index-Structures)
For this thesis, the mlpack Kmeans implementation is selected because it provides the necessary setup to carry out multiple experiments and provides the needed output information with minimal recoding. A Python framework that allows for carrying out several experiments with initializing the centroids was written for this thesis.
Variations of Clustering Algorithms
• Kmeans: Lloyd’s algorithm (the standard Kmeans algorithm)
• Jenks natural breaks optimization: Kmeans applied to univariate data
• Minkowski Weighted Kmeans: computes weights for features at each cluster
• Kmeans++: improved initial centroid seeding
• Mini-batch Kmeans: runs Kmeans on ’’mini batch” samples of larger datasets that exceed memory space
• Spherical Kmeans: for textual datasets
• Fuzzy C-Means Clustering: data points assigned to multiple clusters

Kmeans Algorithms
Several Kmeans algorithms will be evaluated in the thesis for their inherent differences when locating centroids under different circumstances: naive, pelleg-moore, elkan, hamerly, and dualtree. List of the Kmeans algorithms included in the ml-pack [14] Kmeans program that were used:
• naive: the standard Lloyd iteration; takes O(kN) time per iteration.
• pelleg-moore: the ’blacklist’ algorithm, which builds a kd-tree on the data. This can be fast when k is small and the dimensionality is reasonably low.
• elkan: Elkan’s algorithm for k-means, which maintains upper and lower distance bounds between each point and each centroid. This can be very fast, but it does not scale well to the case of large N or k, and uses a lot of memory.
• hamerly: Hamerly’s algorithm is a variant of Elkan’s algorithm that handles memory usage much better and thus can operate with much larger datasets than Elkan’s algorithm.
• dualtree: the dual-tree algorithm for k-means builds a kd-tree on both the centroids and the points in order to prune away as much work as possible. This algorithm is most effective when both N and k are large.
Machine Learning Datasets
The approach to testing new clustering approaches will evaluate several datasets to capture different data characteristics and trends.
The following datasets (see Table II. 1) with multiple height and width (dimensionality) variations were run using several different sample sizes along with different percentages of the original dataset. The datasets are included from the UC Irvine Machine Learning archive website [15].

Table II. 1: Machine Learning Datasets
Data File Number of Observations Dimensions
Basic 131000 35
Chem 47832 155
Diabetic 13820 41
GroupLens 100000 3
HAR 26523 127
HYG 119618 14
Iris 149 4
KDDCUP 131000 74
Poker 131000 11
Power 131000 6
Power2 1000 6
RelationNetwork 53413 22
Test 1000 3

Refined Initialization of Centroids
The common approach to using Kmeans clustering is to generate a minimum of 10 experiments for a full target dataset. The machine learning standard of scikit-learn deploys 10 experiments and carries out the affinity scoring to select the best final clustering. Running 10 experiments on large datasets is both time consuming and compute intensive. Many environments have limited computational power for an excessive number of trials. To get around needing to run multiple experiments on the entire dataset, the goal is to run multiple experiments on smaller samples of the dataset. Running Kmeans on 1%, or even 5%, of the original dataset saves a significant amount of time. The goal is to get the same clustering performance in much less time and with much less computing power usage.
The method proposed by Bradley and Fayyad in their paper “Refining initial points for Kmeans clustering” [13] is implemented in mlpack. This strategy samples points from the dataset and runs Kmeans clustering on those points multiple times, saving the resulting clusters. Then, Kmeans clustering is run on those clusters, yielding the original number of clusters. The centroids of those resulting clusters are used as initial centroids for Kmeans clustering on the entire dataset. The approach is as follows: starting with an original dataset with 1 million data points (shown in Figure III. 1.), multiple samples (8 samples of 1% of the original dataset) are gathered and can be visualized in Figure III.2. Kmeans is then run on all samples as depicted in Figure III.3. The final step of the refinement process is to take the 8 clusterings and run Kmeans again on them. The result provides the mean points of the clusters to use as the starting centroids for the final Kmeans run on the original dataset. Figure III.4) presents the final Kmeans run on the original dataset.

Figure III. 1: Simulated Original Data
Figure III. 2: Eight Random Samples - Each Containing 1% of Simulated Original Dataset (10,000 Datapoints per Sample)
■SSjcSfev-.' ' ' ■pytb .. ' • -|||;. t..v ■ &> a; ysj... . •• ;X;.5S®r-: - j£g£ : • A N' Vi * ‘ . • ' Vv‘.V, . ■ • • ’ '•••••
' u ■ . ■ - •• •U ■' ' ?:iTB &/*'■■- ? ^3££<.V.* : - . «c3p^%*
Figure III.3: Kmeans Clustering on Sub-Samples for k=4
Figure III.5 shows the clustering when the initial centroids are seeded from the combined centroids and Kmeans is run on the original data. Figure III.6 illustrates the run of Kmeans clustering on the original data set with the intelligently selected

Figure III.4: Combined Centroids from Subsamples - Getting the Mean of Each
Figure III.5: Seeding Kmeans with Intelligently Selected Starting Centroids
Figure III.6: Kmeans on Simulated Original Dataset with Refinement
starting centroids. Figure III.7 compares the clustering to randomly selected starting centroids. It shows how initializing the centroids to a “good” location can have a significant impact on the amount of computing required to reach the final clustering.
Figure III.7: Kmeans on Simulated Original Dataset without Refinement

Running the Experiments
As shown in Figure III.8, the process to begin running the experiments requires the import of relevant python modules, defining global variables, and providing the input and output path structure. The second step involves setting up the experimental parameters. This can be done either by editing the python script directly, or modifying a configuration hie that holds all the values that should be run by the script. The script begins by scanning the input directory for data hies. Once the data hies are found, the python module loads the paths, hlenames, and the size and dimensionality information into a dictionary. The output paths are also set up based on the experimental parameters. The path dictionary is indexed in a way that makes it easy to determine each experimental hie output.
As mlpack-Kmeans produces multiple data hies for each Kmeans run, it is necessary to establish an intuitive hie naming convention for the output hies. Each run of mlpack-Kmeans produces 3 output hies: a centroids hie that contains the coordinates of the centroids found, an associations hie that includes the original dataset along with an appended column to show cluster membership of each row, and a loghle that contains the information about the run, including clustering time, number of iterations to convergence, and any errors or other messages associated with the run. As an example, an experimental run on a single dataset with 4 different algorithms, 4 values of k, 4 sample sizes, and 4 sampling percentages will produce 256 calls to mlpack-Kmeans and 768 output hies. The convention that is used puts the parameter values right in the hie name. For example, HAR_4_100_0.10_elkan is the hlename base for the experiment that runs on the HAR dataset with a value of k=4, 100 samples of 10% sampling, and the elkan clustering algorithm.
Figure III.9 shows the remainder of the process. Once the paths are dehned, the python script sets up the command to send to the operating system to run the mlpack-Kmeans algorithm. There are several parameters that are passed to the program and

- Import python modules
- Define global variables
- Define output path
- Define input dataset path
Define lists of parameters for experimental run
•k values •sample sizes •sample percentages •Kmeans algorithms
- Scan input directory for valid datasets
- Load datafiles into python dictionary “all_data_dict"
Store path names for outputs in python dictionary “all_path_dict" in the following format:
• [exp] = dataset_k_sample-size_sample%_algorithm. output.txt
• (e.g. HAR_4_100_0.10_elkan)
Figure III.8: Code Path to Generate Kmeans Results
the python script formats those based on the experimental setup described previously. It will parse through all the experimental parameters and run the algorithm on the given dataset(s). While this process is running, after generating an assignment hie, there is a python function to reduce the size of the assignments hie by removing the original dataset from the hie and leaving the remaining cluster membership information. Once the original dataset is removed, the hie is renamed to ’membership’. Without this process, the original dataset would be reproduced hundreds of times and the limited hard drive space would be consumed.
After running the Kmeans algorithm, there is a separate function that calculates the affinity of each run. This process parses through all the experimental runs and calculates the affinity based on the membership hie, the centroids hie, and the membership hie. It uses the centroids hie to establish the centroid coordinates, and the membership hie to establish which points (rows) in the original dataset hie corre-

spond with which cluster. The affinity is summed and a weighted score is recorded in another hie labeled ’affinity’ - effectively adding a fourth file to the output run.
Once all the results are produced (it can take several hours to run all the experiments), another function gathers the results into a data dictionary in python. Clustering time is extracted from the the log file of each run and the affinity is pulled from the affinity file from each run. The data dictionary provides convenient access to the data without having to pull it from the files each time a plot is generated. The data is then plotted to display relevant trends and allow conclusions to be made about the different aspects of the experimental run.
Figure III.9: Code Path to Calculate and Assess Affinity

The quality of the clustering is important as it provides insight into how well the Kmeans algorithm performed. Affinity is the term for how close the data points are to the cluster they are members of after a Kmeans algorithm has completed a run. Execution time alone is not as critical as getting the correct result. Getting an incorrect result faster is not a solution that provides any help to the task. The final goal of this thesis is to try and establish a matrix (or score card) that shows what algorithm, sample size, and percent value (percent of original dataset for the samples) should be used when running the Kmeans algorithm on a given type or size of dataset. If execution time is more important than accuracy, then one algorithm might do the trick, and 10% sampling with 100 samples might not be the way to go. If the quality of the clustering result (affinity) is more important, then there may be an algorithm and sampling that works better on a particular dataset.
If the experiment is run with 10% sampling and 100 samples, the execution might be five times worse in execution time than the base, unrefined run. The quest might be for better affinity and the execution time might not matter as much. In this case though, if the results are the same as the unrefined run, the one with better execution time is typically the better choice.
Algorithm vs Clustering Time
Figure IV. 1 shows the various Kmeans clustering algorithms for the dataset of KDDCUP using k=5. The results are standardized to use naive execution time as the base scaled to 1. In that way, any Kmeans algorithm that runs faster than naive will have a lower bar plot and value below 1. The figure demonstrates that the different Kmeans algorithms have the capacity to achieve significantly faster execution than naive. The hamerly algorithm is approximately 5 times faster than base Kmeans,

although both the elkan and pelleg-moore techniques are close contenders respectively performing 3x and 2x faster than naive.
CD 0-911 C/)
O 0.781 "O
£ 0.651
cn 0.390 _c
§ 0.260 o
KDDCUP algorithms vs. clustering time (k=5)
Figure IV. 1: KDDCUP Dataset: Algorithms vs. Clustering Time for k=5
Figure IV.2 shows the trend between the dataset, clustering time and k. The trend of the results show that as k increases, it takes longer to cluster the dataset. KDDCUP, RelationNetwork, and HAR are larger datasets in terms of both observations and dimensionality, so it is expected that they would take longer to cluster. The resulting function of this figure helps to characterize the influence of running different algorithms for different datasets. The analysis of the experimental results can be used as a valuable reference for how to run the Kmeans algorithms on different shaped datasets. One algorithm might go really fast, but at the expense of a good clustering outcome - another algorithm might be slow, but provide a much better clustering solution. The trend of increasing the K number of clusters increases the execution time of clustering which has a higher impact for some datasets.
Figure IV.3 shows the trend between the HYG dataset, weighted affinity (or per-point affinity) and the value of k. The results show that there is a decreasing trend

Figure IV.2: Clustering Time for Datasets vs. k for Naive Algorithm
for affinity as k increases, but for k=6, the affinity was actually worse than when k=3. This observation points to the fact that sometimes Kmeans does not always cluster the data correctly which is why multiple experiments are often needed to verify the results.
^ 1.057 CD
(3 0.906 n
-g 0.755 N
E 0.604
< 0.302
HYG naive k vs. weighted affinity
Figure IV.3: HYG Dataset vs. k vs. Weighted Affinity for Naive Algorithm Figure IV.4 shows the trend between dataset size and execution time. The clus-

tering time has been normalized to the ’Basic’ dataset run and the plot is showing in log scale. In general, the trend shows an increase in execution time as the data size gets larger. The data size is a multiple of the number of observations and the dimensionality of the dataset.
Total clustering runtime per dataset
ro 100
â– 0
size of dataset (observations * dimensionality)
Figure IV.4: Total Clustering Time per Dataset
Figure IV.5 shows a breakdown of the clustering time with respect to k as well as the sampling size. The base run is the case where no refinement was used on the Kmeans run, and the rest of the lines represent clustering time for the different experimental runs with different sampling sizes of percentages of the original dataset. The plot shows that there is an improvement in clustering time when using a sample size of 10 and sampling 1% of the original dataset.
Figure IV.6 shows the sum of execution time compared to the sum of weighted affinity for each dataset on an example run with 8 clusters (k=8).

Weighted Affinity (log)
--- base_kmeans 5%_10_samples --- 10%_10_samples
--- l%_10_samples 5%_50_samples --- 10°/o_50_samples
--- l%_50_samples 5%_100_samples 10%_100_samples
--- l%_100_samples
3 4 5 6 7 8 9 10 11 12 13
k value
Figure IV.5: Clustering Time Compared to Base for Kmeans on Kddcup Algorithm
Total Weighted Affinity and exec time per dataset for k:8
Execution time
Figure IV.6: Total Weighted Affinity and Clustering Time over Dataset for k=8
Execution Time [s] (log)

Figure IV.7 shows how the affinity of each cluster is generally inversely proportional to k. This trend occurs because as the data is clustered into more separate clusters, typically the average within-cluster distance decreases as well. If all data points became clusters, there would be zero average within-cluster distance and the affinity would be zero - which does not help to find any hidden structures in the data.
Weighted affinity vs k (across all experiments), Diabetic
220 ----
Figure IV. 7: Affinity Trends Down as k Increases on Diabetic Dataset
Figure IV.8 shows how the clustering time is related to the algorithm and the number of clusters (k). This demonstrates how different algorithms perform across the spectrum of k values for the “Diabetic” dataset. As also previously shown, the affinity tends to trend down as the value of k increases. Because there were no refinement techniques used to plot this chart, the affinity scores did not see much deviation between the algorithms.
In this next chart, Figure IV.9 shows how refinement can actually reduce the affinity over some of the naive runs. The plot shows that the pelleg-moore algorithm,

Weighted Affinity Weighted Affinity
Weighted Affinity and Clustering Time vs K and Algorithm (Diabetic)
With No Refinement
â–  weighted_affinity â–  clustering_time
224 1
mn -j Tim nm
K value and Algorithm
Figure IV. 8: Naive Algorithm Takes Longer for Higher Values of k
Weighted Affinity and Clustering Time vs K and Algorithm (Diabetic) With Refinement (K=9)
IZZlweighted_affinity (^■clusteringtime ^^»sample_percent A sample_size
164.8 164.9 165.1 166.0 166.0 167 9 1689
171.9 172.0
0.01 0.01
0 o 0 0.01
u 9 9 9 9 9 9
pelleg- elkan dualtree naive pelleg- naive
moore moore
naive | dualtree
K value and Algorithm
Figure IV.9: Refinement Techniques Can Sometimes Yield Better Affinity
Clustering Time [s] Clustering Time [s]

with no refinement provided the lowest weighted affinity score, as well as a significantly lower execution time than the naive algorithm with no refinement (denoted on the plot as sample percent = 0 and sample size = 0). There is, however, a refinement technique that beats the naive in both weighted affinity and execution time. The dualtree algorithm with 100 samples and a 1% sampling rate scores just above the lowest affinity score. As would be expected, the elkan algorithm, using 100 samples at a 10% sampling rate, while it also beat the naive, no-refinement experimental run, the execution time is the highest on the chart. These results show that by characterizing the algorithm and experimental setup, the overall clustering time and outcome can be significantly improved by selecting the optimal refinement techniques, algorithms, and parameters.
Figure IV. 10 shows PCA analysis and how it could potentially improve upon the execution time if the primary components that contribute to a majority of the variance are filtered out and clustered. This plot shows that some of the components can probably be eliminated while still maintaining a statistically reasonable result after clustering. More work is needed to explore this in-depth as there is a potential for using PCA analysis before running the Kmeans algorithm, which could improve the process even more. As can be seen, roughly 3 of the variables account for 70% of the problem variance. With that in mind, refinement techniques may be able to use this information to better implement the experimental runs.

Cumulative PCAfor: Power, observations: 131000, dimensionality: 7
Figure IV. 10: PC A on the Power Dataset Showing Problem Variance Contribution

The outcome of running several different experiments on approximately a dozen datasets provided insight into how the Kmeans algorithm behaves on datasets of diverse sizes. With the limited computing power available, it took over 8 hours to run all the experiments on the 13 datasets. The extreme execution time is one reason why this research is important: not everyone has access to large computing resources, so running experiments like this on larger datasets can take several weeks, or even months.
The examination of the refinement Kmeans technique demonstrated good potential in overcoming execution time of the traditional Kmeans. Further exploration is warranted to help build on this research. Anything that can be done to help speed up the execution time while also improving the clustering process is valuable to the scientific and data research communities, not only in terms of energy savings, but also time savings and getting results faster to help solve the problem at hand. Whether it is mining data to predict elections, or evaluating cancer rates from patient studies, the world of machine learning and clustering will no doubt continue to be a significant part of the next several decades of technological advancement.

[1] Irani, Jasmine, Nitin Pise, and Madhura Phatak. ’’Clustering Techniques and the Similarity Measures used in Clustering: A Survey.” International Journal of Computer Applications 134, no. 7 (January 2016): 9-14. http: //
[2] Liu, Hsiang-Chuan, Wen-Pei Sung, and Wenli Yao. Information, Computer and Application Engineering. CRC Press.
[3] Zheliznyak, Iryna, Zoriana Rybchak, and Iryna Zavuschak. ’’Analysis of Clustering Algorithms.” Advances in Intelligent Systems and Computing 512 (September 27, 2016): 305-15.
[4] Chowdary, Sunil N., Sri D. Lakshmi Prasanna, and P. Sudhakar. ’’Evaluating and Analyzing Clusters in Data Mining using Different Algorithms.” International Journal of Computer Science and Mobile Computing 3, no. 2 (February 2014): 86-99.
[5] Clarke, Bertrand, Ernest Fokoue, and Hao Helen Zhang. Principles and theory for data mining and machine learning. Dordrecht: Springer, 2011.
[6] Kelleher, John D., Brian Mac Namee, and Aoife DArcy. Fundamentals of machine learning for predictive data analytics: algorithms, worked examples, and case studies. Cambridge, MA: The MIT Press, 2015.
[7] Schouwenaars, Filip, and Sebastian Perez Saaibi. ’’Introduction
to Machine Learning.” Online Course. Accessed May 06, 2017.
[8] Scikit-learn: Machine Learning in Python, Pedregosa, F. et ah, Journal of Machine Learning Research volume 12, pages 2825-2830, 2011
[9] Yan Jun, Zhang Benyu, Liu Ning, Yan Shuicheng, Cheng Qiansheng, Fan Weiguo, Yang Qiang, Xi Wensi, and Chen Zheng,2006. Effective and efficient dimensionality reduction for large-scale and streaming data preprocessing, IEEE transactions on Knowledge and Data Engineering, Vol. 18, No. 3, pp. 320- 333.
[10] Matloff, Norman S. Statistical regression and classification: from linear models to machine learning. Boca Raton: CRC Press, 2017.
[11] Xu, Guandong, Zhenglu Yang, and Yu Zong. Applied data mining. Boca Raton: CRC Press, 2013.

[12] Hamerly, G.; Elkan, C. (2002). ’’Alternatives to the k-means algorithm that find better clusterings” (PDF). Proceedings of the eleventh international conference on Information and knowledge management (CIKM).
[13] Bradley, Paul and Fayyad, Usama. (1998). ’’Refining Initial Points for K-Means Clustering” (PDF). In Proceedings of the Fifteenth International Conference on Machine Learning (ICML ’98), Jude W. Shavlik (Ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 91-99.
[14] Curtin, Ryan R., James R. Cline, Neil P. Slagle, William B. March, P. Ram, Nishant A. Mehta, and Alexander G. Gray. ’’MLPACK: A Scalable C Machine Learning Library.” Journal of Machine Learning Research 14 (2013): 801-05.
[15] Lichman M. Lichman 2013UCI Machine Learning Repository University of California, Irvine, School of Information and Computer Sciences

Full Text


CHARACTERIZINGKMEANSCLUSTERINGMETHODSTOACCELERATE EXPERIMENTRUN-TIMES by YAFFEKRAVITZ A.A.S.,ClintonCommunityCollege,2010 B.S.,UniversityofColoradoDenver,2016 Athesissubmittedtothe FacultyoftheGraduateSchoolofthe UniversityofColoradoinpartialfulllment oftherequirementsforthedegreeof MasterofScience ElectricalEngineering 2017


ThisthesisfortheMasterofSciencedegreeby YaeKravitz hasbeenapprovedforthe ElectricalEngineeringProgram by DanConnors,Advisor ChaoLiu StephenGedney Date:July29,2017 ii


Kravitz,YaeM.S.,ElectricalEngineeringProgram CHARACTERIZINGKMEANSCLUSTERINGMETHODSTOACCELERATE EXPERIMENTRUN-TIMES ThesisdirectedbyProfessorDanConnors ABSTRACT Machinelearningalgorithmshavethepotentialtounlocksolutionstoambitious problemsinamyriadofscienticandindustrialelds.Clusteringisanunsupervised machinelearningtechniqueinwhichdataobjectsaregrouped,basedonsimilarity, intoclusters.Objectswithinaclusterarealikeinsomemathematicalwayandobjectsindierentclustersaredistinctlydierent.Inthisway,clusteringuncoversthe hiddenstructureofadatasettodiscoverhowmanydataobjectsaresimilarandthe numberofclusters.Unfortunatelyclusteringtechniquesvaryintwoexperimental ways:clusteringexecutiontimeandtheaccuracyofthenalclusteringresults.Furthermore,manyclusteringalgorithmsrequiremultipleexperimentalevaluationsthat, inturn,generateprohibitiveclusteringexecutiontime[1]. Kmeansistheprimaryclusteringalgorithminthedataminingandunsupervised machinelearningspheresthatdenesasingleclusteringexperimentbydeningthe setofsimilarpointsthatareeachclosestto'K'clustercentroids.Thealgorithmnds asolutionofcentroidsinthedatasetbyiterativelycalculatingthemulti-dimensional 'D'distancesbetweeneachoftheexperimental'K'clustercentersandthe'N'data points.WhilethebaseKmeansalgorithmisparallelizableacrosstheseiterations, therearetwoexacerbatingproblematicconditionsofthealgorithm'suse:starting seedselectionandtheselectivitythenumberofK-centroidsforadataset.Eachof theseconditionsrequiresmultiple'E'experimentstoevaluatewithineachdataset. Theworkoutlinedinthisthesisinvestigatestheoptimizationandtherestructuringof theKmeansalgorithmtousemultiplesub-samplingsofatargetdatasettogenerate accurateK-centroidseedsthateliminateexcessiveiterationsofthebaseclustering algorithm. iii


Theformandcontentofthisabstractareapproved.Irecommenditspublication. Approved:DanConnors iv


DEDICATION Idedicatethisworktomylove,Rebekah,andourfouramazingchildren,Anna, Sarah,Jonathan,andSamuelwhohavestoodbymeandsupportedmethroughout myeducationalcareer.Withouttheirlove,support,andnever-endingsupplyof patience,thisendeavorwouldhavebeennearlyimpossible. v


ACKNOWLEDGMENT ThisthesiswouldnothavebeenpossiblewithoutthegeneroussupportofProfessor DanConnors.ThevaluableexpertiseanddepthofknowledgeinthecomputerengineeringeldthatDr.Connorsprovidedwasinstrumentalinachievingthismilestone. Ialsooweagreatdebtofgratitudetothosewhoencouragedandmotivatedmeto startthisjourneyandseeitthroughtocompletion. vi


TABLEOFCONTENTS CHAPTER I.INTRODUCTION............................1 II.BACKGROUND.............................4 MachineLearning............................4 CategoriesofLearning.......................5 Supervised,Semi-supervised,andUnsupervisedLearning..5 Regression..............................6 Classication.............................6 Clustering..............................9 DimensionalityReduction......................10 KmeansClustering............................11 InitializationMethods........................17 InitializationSensitivity....................17 ClusterAnity...........................21 AvailableKmeansCodePackages.................22 VariationsofClusteringAlgorithms................22 KmeansAlgorithms.........................23 MachineLearningDatasets.....................23 III.APPROACH...............................25 RenedInitializationofCentroids....................25 RunningtheExperiments........................28 IV.EXPERIMENTALRESULTS......................31 Results..................................31 AlgorithmvsClusteringTime...................31 V.CONCLUSION..............................40 REFERENCES...................................41 vii


CHAPTERI INTRODUCTION Machinelearningalgorithmshavethepotentialtounlocksolutionstoambitious problemsinamyriadofscienticandindustrialelds.Traditionalmachinelearning conceptscanbedividedintofourcategories:regression,classication,clustering,and dimensionalityanalysis.Varietiesofalgorithmsineachofthesemachinelearning domainsrequiremultipleparameterstoexploretoachievetheoptimalmodeland analysis.Asclusteringisoneapproachtoextractinghiddenstructuresofdata,itis signicantlyusefulatthestartofexploringtheutilityofadataset.Forthatreason, itiscriticaltocutthroughthecomputationalbarriersofthedeploymentearlyinan investigationofadatasettounearthadvantageousinsightsrelatedtoexploringthe eciencyandexpediencyofmachinelearningconceptsofregressionandclassication. Clusteranalysisorclusteringisthetaskofgroupingasetofsimilarobjects insuchawaythatobjectsinthesamegroupcalledaclusteraremoresimilarin somesenseoranothertoeachotherthantothoseinothergroupsclusters[2]. Theclusteringtaskisdeployedinexploratorydatamining,andacommontechnique forstatisticaldataanalysis,usedinmanyelds,includingmachinelearning,pattern recognition,imageanalysis,informationretrieval,bioinformatics,datacompression, andcomputergraphics[3]. Whiletheoverallconceptofclusteringistoplacedataobjectsintogroupsbased onsimilarity,thedenitionofsimilaritymayvarydependingontheintendedgoalof exploringthedataset.Ingeneral,objectswithinaclusterarealikeinsomemathematicalwayandobjectsindierentclustersaredistinctlydierent.Commonviewsof clustersincludegroupswithsmalldistancesamongtheclustermembers,denseareas ofthedataspace,intervalsorparticularstatisticaldistributions.Similarly,clustering uncoversthehiddenstructureofadatasettodiscoverhowmanydataobjectsare similarandthenumberofclusters[4].Clusteranalysisisnotonespecicalgorithm 1


butthegeneraltasktobesolved.Inthisway,therearemanyvariousclusteringalgorithmsthatdiersignicantlyintheirnotionofwhatconstitutesgroupingsimilarity inaclusterandhowecientlytondthesimilaritywithinclusters. Kmeansisoneoftheprimaryclusteringalgorithmsusedinthedataminingand unsupervisedmachinelearningenvironments.Kmeansdenesasingleclusteringexperimentbydeningthesetofsimilarpointsthatareeachclosestto'K'cluster centroids.Thealgorithmndsasolutionofcentroidsinthedatasetbyiteratively calculatingthemulti-dimensional'D'distancesbetweeneachoftheexperimental'K' clustercentersandthe'N'datapoints.WhilethebaseKmeansalgorithmisparallelizableacrosstheseiterations,therearetwoexacerbatingproblemspaceconditions ofthealgorithm'suse:startingseedselectionandtheselectivityofthenumberof K-centroidsforadataset.Eachoftheseconditionsrequiremultiple'E'experiments toevaluatewithineachdataset.Thustoexplorethehiddenstructureofasingle datasetrequiressignicantcomputationalworkincalculating'E'experimentsof'K' centroidsforan'NxD'sizeddata[5]. ThebulkoftheKmeanscalculationsareperformedtondcentroidsthatminimizeamathematicalscorebetweenthedata-pointsandeachcentroid.Oneofthe inherentissueswiththeKmeansalgorithmarisesinthewaytheinitialstartingcentroidsareselectedandthatanumberofinitialiterationsareexhaustedtomigrate centroidsfromtheirinitialseedpositionstopositionswithinthemultidimensional spacethatmatchtheinherentcharacteristics.Thereareanumberofapproachesto initializingKmeanscentroids,mostcommonlyistouserandomlyselecteddatapoints andothergeneralizedheuristics. The renementKmeans approachovercomesthecomputationalbarrierofclusteringbysamplingalargedatasettouncovercandidatecentroidstoinitiatethesearch ofinitialseedpointsforthefulldataset.Thisthesisinvestigatestheoptimizationof therenementKmeansalgorithmbyevaluatingvarioussamplingsizesacrossawide 2


rangeoftargetdatasets.Moreover,avarietyofKmeansalgorithmsareevaluatedto additionallydeterminefurtherimpacttothecomputationalrequirementsofnding theoptimalclusteringinadataset.Thegoalistouncoverdatacharacteristicsthat mightguidedatascienceresearchersinsettingoptimalKmeansevaluationparametersthatovercometheprohibitivecomputationalcostsofperformingfulldataset exploration.Themethodsforcharacterizingtheresultsareclusteringexecutiontime andtheclusteringanityameasureofaccuracyofthenalclusteringresults. Thisthesisisorganizedasfollows:ChapterIIdiscussesthebackgroundofthe Kmeansclusteringalgorithm.ChapterIIIexaminestheKmeansrenementapproach andthesisexperimentalframeworkforevaluatingmultipleclusteringevaluations.The experimentalresultssection,ChapterIV,showsperformancedatafortheclustering ofselecteddatasets.Finally,ChapterVconcludesthisthesis. 3


CHAPTERII BACKGROUND MachineLearning Machinelearningisasubeldofcomputersciencethatreferstotheequipping ofcomputerswiththeabilitytolearnwithoutexplicitprogramming.Theterms machinelearning"andarticialintelligence"willoftenbeusedinterchangeably and,whiletheconceptsareconnected,therearedierences.Thegoalofarticial intelligenceistocreateamachinethatcanmimicthehumanmind.Todothis,itwill needamachinetohavelearningcapabilities.Articialintelligencealsoencompasses knowledgerepresentation,abstractthinking,andreasoning.Machinelearning'sonly focusistocreatesoftwarethatcanlearnfrompastexperience-hencetheterm machinelearning".Thus,whilearticialintelligenceistheembodimentofrecreating humancharacteristicsinamachine,thefocusofmachinelearningisonlyonthe developmentofcomputerprogramsthatcanchangeandlearnwhenexposedtonew data. Threebasicelementsofmachinelearningareseenthroughoutthetensofthousandsofmachinelearningalgorithmsinexistence,nottomentionthehundredsof newalgorithmsdevelopedeveryyear.Thesethreecomponentsareasfollows: Representation:displayofknowledge Evaluation:judgingofhypotheses Optimization:generationofmodels Representationsimplyreferstohowtheknowledgeisrepresented.Forexample, dependingonthecharacteristicsofthedataset,itmaybeunderstoodbestwitheither adecisiontreeorasetofrules.Instances,graphicalmodels,neuralnetworks,and modelensemblesareallexamplesofwaysknowledgecanberepresented. 4


Evaluationisthewayhypothesesarejudgedorevaluated".Accuracy,predictionandrecall,meansquarederror,cost,andlikelihoodareallcommonexamplesof evaluation. Optimizationisthewaymodelsaregenerated;thisisalsoknownasthesearch process.Examplesincludecombinatorialoptimization,convexoptimization,andconstrainedoptimization.Allmachinelearningalgorithmsareabasiccombinationof thesethreecomponents.Inotherwords,thesethreecategoriesrepresenttheframeworktounderstandanymachinelearningalgorithm[6]. CategoriesofLearning Machinelearningcanbesummedupintofourdierentcategories:classication, regression,clustering,anddimensionalityreductionseeguresII.1-II.8.These fourcategoriesencompasstheworldofsupervised,semi-,andunsupervisedlearning sometimes,reinforcementisalsoincludedbutcanbeconsideredasubcategoryto theunsupervisedlearningcategoryandnotreallyaseparatecategory. Regression:predictingavalue Classication:predictingatypeorcategory Clustering:ndingthehiddenstructureofdata DimensionalityReduction:gettingthesetofprincipalvariablesindata Supervised,Semi-supervised,andUnsupervisedLearning Classicationandregressionarequitesimilar-forbothsystems,thegoalisto generateafunctiontopredictvaluesforunseenobservations.Itisimportantthat duringthetrainingofthefunction,labelsareavailabletothealgorithm. Regressionfallsunderthesupervisedlearningfocus,whileclassicationisconsideredtobeasemi-supervisedtechnique.Toassesstheperformanceofatrainedmodel, 5


thelabelsofrealobservationscanbecomparedtothepredictedlabels.Labelingcan beatediousburdenandisoftendonebyhumans,sotechniquesthatdonotrequire labeleddataaresoughtoutandtheyareknownasunsupervisedlearning.Anexampleofunsupervisedlearningincludesclustering,wheregroupsofobservationsthat aresimilararelocated.Itdoesnotrequirespeciclabeledobservations.Assessing theperformanceofthistrainedmodelismoredicultthanamorestraightforward supervisedlearningalgorithmbecausetherearenoreallabelstouseforcomparison. Sometechniquesoverlapbetweenthesupervisedandtheunsupervisedlearning mediumsintoalandofgray.Thisso-calledsemi-supervisedlearning"couldinvolve alotofobservationsthatarenotlabeledandafewwhicharelabeled.Clusteringcan beusedtogroupsimilarobservationsandtheinformationabouttheclustersand thefewlabelscanbeusedtoassignaclasstotheunlabeledobservationsthenthis givesusmorelabeledinformationtoperformunsupervisedlearningon[7]. Regression Regressionproblemsaremachinelearningproblemsthattrytopredictaninput valuebasedonpreviousinformation.Thevariablesinputarecalledpredictorsand theoutputisknownastheresponseseeFigureII.2.Insomeways,regression andclassicationareverysimilarinthattheybothexemplifytryingtoestimate afunctionthatmapsinputtooutputbasedonearlierobservations.Inregression, however,thesearchistondanactualvalue,notjustaclassofanobservation.An exampleofwhatregressioncouldbeusefulforwouldbemodelingcreditscoresbased onaperson'srecordofpayments.Allregressionmodelshaveatleasttwothingsin common:rst,theresponseisalwaysquantitativeandsecond,theywillalwaysneed inputknowledgeofpreviousinput-to-outputobservationstoerectamodel. Classication Aclassicationprobleminvolvespredictingwhetheragivenobservationbelongs toacertaincategory.Classicationistheprocessoftakinganinputandmappingit 6


FigureII.1:Scikit-LearnAlgorithmCheatSheet-ChoosingEstimator[8] FigureII.2:ScikitLearnAlgorithmCheatSheet-Regression[8] toadiscretelabel.Forexample,thisisthecasewhenanimageisbeingassignedto thelabelofeitheracat"oradog".Withonlytwochoices,thiswouldbeknownas two-classorbinomialclassication.Oftentimes,therearemanymorecategoriesfor 7


instance,whentryingtopredictthewinnerofasportstournament,andtheproblem isknownasmulti-classclassication. FigureII.3:ScikitLearnAlgorithmCheatSheet-Classication[8] Classicationinvolvestrainingonearlierobservationstocreateaclassierthat willtrytopredicttheclassoffutureobservations.Inthebinomialclassication referencedabove,amodelwouldpredictthedierencebetweenadogandacatgiven aphotowithoneortheother.Thesystemmightbetrainedonmanydierent observationsofdogsaswellasmanydierentobservationsofcats,tocomeupwith classiersthatwouldbeabletotellifanew,previouslyunseen,photoofadogorcat iseitherclassiedasadogorclassiedasacatseeFigureII.4.Onceaclassieris constructed,itcanbeusedtotakearbitraryinputlikeaphotoofadogorcat,and labeltheimageasadogorcatseeFigureII.5. FigureII.4:Classication:BuildingaClassier Thepossibleapplicationsofthistechniquearebroadandpromising.Forexample, aftersignicantstudiesofmedicalexaminationsdeterminethatcertainvitalsignsand 8


FigureII.5:Classication:UsingaClassiertoPredict symptomsarerelatedtoaspecicdisease,thatinformationcanbeusedtobuilda predictorthatwillallowareasonablepredictiontobemadethatanewpatientwith unseensignsandsymptomsmighthavethisdiseaseandmayneedtreatment.Asimple examplewouldbetheyes-or-nopredictionsimplicityofisthistumorcancerous?"A basicquestionwithprofoundimplications.Itisimportanttouseaqualitativeoutput andtorememberthatthereisforeknowledgeoftheclassestowhichobservationscan belong. Clustering Clusteringisanattempttogroupsimilarobjectstogetherintoclusterswhere theclustersthemselvesaredissimilar.Goodclusteringmethodsproducehighquality clusterswithhighsimilaritywithinclustersandlowsimilaritycomparedtootherclusters[9].Thissimilaritycanbecalculatedbyusingthenotionofanity".Clustering isdierentfromclassicationandregressioninthatthereisnoneedforforeknowledgeofthelabelsandsometimesthereisnoclearrightorwronginclusteringsee FigureII.7.Clusteringcanbeusefulwhenthereisnopreviousknowledgeabout anypatternsinsidesamplesofdata.Dierentclusteringscanrevealnewinformationsometimesusefulinformationaboutapotentialhiddenstructurewithinadata set.Apracticalexampleofclusteringcouldbesegmentingamarketofpotential consumersbasedontheirdemographicfeaturesandpurchasinghistories.Another exampleofclusteringcouldbetondgroupsofmoviesbasedonreviewsofmovies orfeaturesineachmovietondmoviesmostlikeanothermovie[10]. 9


FigureII.6:ScikitLearnAlgorithmCheatSheet-Clustering[8] FigureII.7:ClusteringwithThreeClusters DimensionalityReduction Inmachinelearning,dimensionalityreductionistheprocessofreducingthenumberofrandomvariablesinconsiderationbyobtainingasetofprincipalvariables.The processisusefultospeedupalgorithmexecutionanditcanactuallybehelpfulinthe nalclusteringaccuracyorclassication.Ifthereisfaultyinputdataortoomuch 10


noise,alessthandesirablealgorithmperformancecanbeseen.Removingdatathat isactuallydis-informativecouldndmoregeneralclassicationregionsandrulesand betterperformancecanbeseenonnewdata[11]. Oneofthewaystoreduceinputdimensionalityinvolvingahighnumberofmissing orirrelevantvaluesiscalledPrincipalComponentAnalysisPCA.PCAisastatisticalprocedureforidentifyingtheprincipalcomponentsorthenumberofuncorrelated variablesfromalargesetofdata.Toexplainthemaximumamountofvariancewith thefewestnumberofprincipalcomponentsisthechiefconcernofPCA.UsuallyPCA isjustonestepinaseriesofanalysesandisusedtoreducethenumberofvariablestoavoidmulticollinearity.Thesocialsciencesandmarketingresearcharenas commonlyuseprincipalcomponentsanalysisfortheiroftenverylargedatasets.FigureII.8showstheScikitLearnoverviewguidelinesofusingPCAfordimensionality reduction. FigureII.8:ScikitLearnAlgorithmCheatSheet-DimensionalityReduction[8] KmeansClustering TheKmeansalgorithmisaclusteringalgorithmthatseekstominimizetheEuclideandistancebetweencentroidsandthemembersassociatedwiththosecentroids. Inreality,itattemptstondahiddenstructureinadataset,ifoneexists.knumber ofclustersisspeciedmanuallyandadatasetisprovidedwhilethealgorithmdoes therest.Itwillalwaysoutputasolution,butthesolutionmaynotbecorrect.If thegoalistoclusterbasedonsomesimilarity,theanswerKmeansprovidesmaynot alwaysmeetthatcriterion. 11


Kmeansisknownforitssimplicityanditsabilitytobeusedforamyriadof datatypes.TheKmeansalgorithmisaclusteringmethoddenedbyitspartitioning based,non-hierarchicalmethodology.Givenasetofobjects X ,Kmeanspartitions everydatapoint x i intoacluster c j where j k and k ,thespeciednumberof clusters,isaparameterthatmustbeprovidedthatminimizethecontainedgroup sumofsquarederrors.Partitioningisdoneinsuchawayastominimizetheobjective function. k X j =1 X x i 2 c j k x i )]TJ/F19 11.9552 Tf 11.955 0 Td [( j k 2 where j isthecentroidofcluster c j . Thekmeansalgorithmbeginsbyinitializing k clustercenters.Thedatapoints arethenassignedtooneoftheexistingclustersinaccordancewiththesquareofthe Euclideandistancefromthecenterofeachclustercentroid.Thecentroidofeach clusteristhenupdatedtobecomethemeancenterofallthedatapointsinthecluster asaresultofthechangeinmembership.Theprocessofassigningandre-assigning membershipandupdatingthecentroidsisrepeateduntilthereisnomorechangein thevalueofanyoftheclustercenters,alsoknownasconvergence. Assignmentstep:Everypoint x i in X isassignedtotheclosestcentroid'scluster. Updatestep:Centroidofeveryclusterisrecalculatedbasedonthemeanpoint ofallthedatapoints. Onewaytomakekmeansworkmoreexpedientlyistopickthebestvaluefork. Anotherwayistointelligentlyselectthestartingcentroids. SeeFiguresII.9-II.16foranexampleofkmeansinaction: 12


FigureII.9:ClusterInitialization FigureII.10:MembershipAssignment 13


FigureII.11:MovingCentroidstoMeanofDataPoints FigureII.12:MembershipReassignment 14


FigureII.13:MovingCentroidstoMeanofDataPoints FigureII.14:MembershipReassignment 15


FigureII.15:NoMovement-CentroidsatLocalMinimum FigureII.16:Convergence-NoChangeinMembership 16


InitializationMethods ThereareseveraldierentmethodsusedtoinitializeKmeans,ofthem,thecommononesareForgyandRandomPartition[12].RandomPartitiondoesjustthat, randomlyassignsallthedatapointstodierentclustersandthenndsthemeanof eachoftheclustersandusesthemeansastheinitialstartingcentroids.Forgytakes theinitialmeansandspreadsthemoutmore,whereastheRandomPartitionmethod tendstohavethemeansclosertothecenter.Anotherinitializationmethod,theone usedinthisthesis,byBradleyandFayyad[13]generallydoesmuchbetterforgetting theclusteringresults.Kmeans++isanotheralgorithmthatinitializesthecentroids dierently: 1.Chooseonecenteruniformlyatrandomfromamongthedatapoints. 2.Foreachdatapointx,computeDx,thedistancebetweenxandthenearest centerthathasalreadybeenchosen. 3.Chooseonenewdatapointatrandomasanewcenter,usingaweightedprobabilitydistributionwhereapointxischosenwithprobabilityproportionalto Dx. 4.RepeatSteps2and3untilkcentershavebeenchosen. 5.Nowthattheinitialcentershavebeenchosen,proceedusingstandardKmeans clustering. InitializationSensitivity Kmeansisanextremelyvaluabletoolinthemachinelearningtoolbox,butit doesn'tcomewithoutaws.Incomputing,thereisnosuchthingasaperfectalgorithmthatappliesuniversallyacrosstheboardtoeverypossibledataset.Kmeans 17


isnoexception.OneoftheissueswithKmeansisthatitishighlysensitivetoinitialization.Whatthatmeansisontherststepinthealgorithm,centroidsmustbe assignedbeforetheiterativeprocessbegins.Oneoftheiterativestepsistomeasure thedistancefromdatapointstocentroids,sothoseinitialcentroidsmustbeplaced somewheretostart.Ithasbeenanareaofmuchresearchtogureoutagoodway toinitializethecentroids,andtherearesomegreatsolutionsoutthere,butthisthesisproposesexpandingonsomeoftheideastoimprovetheKmeansalgorithmeven more. Therehavealsobeenseveralattemptstoimprovetheeciencyandtimeittakes toruntheKmeansalgorithm.Becauseofitsiterativenature,largerdimensionality datasetscantakeasignicantamountoftimetoproducearesult.Improvingthe eciencyandspeedthatKmeanstakestoperformclusteringaloneisnotmuchhelp ifitproducesanincorrectanswer,anincorrectanswerproducedinashorteramount oftimeisnotusuallyhelpful. Improvingtheinitializationcanactuallyspeedupthealgorithmbecauseifthe thestartingcentroidscanbeplacedintelligentlyinsuchawaythattheyarecloserto wherethenaliterationwouldplacethecentroids,thenumberofiterationsneededto convergeisusuallyreduced,whichtypicallyequatestoareductionintotalclustering time. Insomecases,Kmeansisnotcapableofeverndingthecorrectclusteringresult becauseofthenatureofthedataset.Inthefollowingsampledataset,Kmeansdid notclustertheresultcorrectlyseeFigureII.7andneverwouldcorrectlyclusterit becausethehiddenstructureisnotonethatamean-basedclusteringalgorithmwould nd.Forthisexample,adensitybasedalgorithm,suchasDBSCAN,wouldbemore suitabletocarryoutthistaskseeFigureII.18. WhileKmeansisprobablythemostintuitivelystraight-forwardclusteringmethod whereanalgorithmgroupsdatainkclustersbasedonsomemeasureofsimilarity, 18


FigureII.17:Kmeansk=4Clustering-CannotFindHiddenStructureinDataset FigureII.18:DBSCANClustering-CorrectlyFindsHiddenStructureinDataset itisnotwithoutitssetbacks.Thisalgorithm'sspeedandsimplicityareunparalleled, however,oneoftheissueswithKmeansisthatifitisnotseededwithgoodinitial centroidpoints,itcanendupwithanincorrectclustering,evenfordatathatshould beeasilyclustered. Forexample,FigureII.19showsaninitialseedingofcentroidswheretheinitial pointswereselectedrandomly.Intheresultantclustering,gureII.20showshowa poorinitializationselectionforthecentroidscanresultinabadclusteringresult19


thiswouldshowupasapooranityscorecomparedtothecorrectclustering. FigureII.19:Clusteringk=3-PoorInitialSeedingofCentroidswithRandom Initialization FigureII.20:Clusteringk=3-PoorInitialSeedingofCentroidsLeadstoPoor ClusteringResults Therenementalgorithmhasthefollowingmeritsofoperation: performsKmeansonasubsetofsampleddatasothatthesystemcompletes muchfaster. theanswerpotentiallyisbetterasitisderivedbymultiplendings,compared tothatofthenaiveKmeansthatstartsfromasinglesetofcentroids. 20


ClusterAnity Aconceptforevaluatingtheassignmentofdataobjectstoclustersisamathematicallyderivedscoreknowngenerallyasanity.Inshort,anityisacalculated measureofhowwelldataobjectsareassignedtoclusters.FigureII.21detailsan exampleofanity.Iftherewereonly2datapoints,with2clustersk=2,anity wouldbezerobecausethecentroidswouldbethedatapointsandthedatapoints wouldbe0distancefromeachoftheirrespectivecentroids. FigureII.21:Anityfork=2withTwoDataPoints Iftherewere3datapointsand2clustersk=2,thentheresultantclustering mightlooklikeFigureII.22,wheretheanityforoneofthedatapointsiszero,and theanityoftheothercentroidistheaveragesumofsquareddistancesofthedata pointstothecluster'smeancenterpointcentroid. FigureII.22:Anityfork=2withThreeDataPoints 21


AvailableKmeansCodePackages SeveralprominentcodepackagesincludetheKmeansclusteringalgorithm: Scikitlearn- mlpack- OpenCV- node-Kmeans- R- ELKIEnvironmentforDeveLopingKDD-ApplicationsSupportedbyIndexStructures Forthisthesis,themlpackKmeansimplementationisselectedbecauseitprovides thenecessarysetuptocarryoutmultipleexperimentsandprovidestheneededoutput informationwithminimalrecoding.APythonframeworkthatallowsforcarryingout severalexperimentswithinitializingthecentroidswaswrittenforthisthesis. VariationsofClusteringAlgorithms Kmeans:Lloyd'salgorithmthestandardKmeansalgorithm Jenksnaturalbreaksoptimization:Kmeansappliedtounivariatedata MinkowskiWeightedKmeans:computesweightsforfeaturesateachcluster Kmeans++:improvedinitialcentroidseeding Mini-batchKmeans:runsKmeanson"minibatch"samplesoflargerdatasets thatexceedmemoryspace SphericalKmeans:fortextualdatasets FuzzyC-MeansClustering:datapointsassignedtomultipleclusters 22


KmeansAlgorithms SeveralKmeansalgorithmswillbeevaluatedinthethesisfortheirinherentdifferenceswhenlocatingcentroidsunderdierentcircumstances:naive,pelleg-moore, elkan,hamerly,anddualtree.ListoftheKmeansalgorithmsincludedinthemlpack[14]Kmeansprogramthatwereused: naive:thestandardLloyditeration;takes O kN timeperiteration. pelleg-moore:the'blacklist'algorithm,whichbuildsakd-treeonthedata.This canbefastwhenkissmallandthedimensionalityisreasonablylow. elkan:Elkan'salgorithmfork-means,whichmaintainsupperandlowerdistance boundsbetweeneachpointandeachcentroid.Thiscanbeveryfast,butitdoes notscalewelltothecaseoflargeNork,andusesalotofmemory. hamerly:Hamerly'salgorithmisavariantofElkan'salgorithmthathandles memoryusagemuchbetterandthuscanoperatewithmuchlargerdatasets thanElkan'salgorithm. dualtree:thedual-treealgorithmfork-meansbuildsakd-treeonboththe centroidsandthepointsinordertopruneawayasmuchworkaspossible.This algorithmismosteectivewhenbothNandkarelarge. MachineLearningDatasets Theapproachtotestingnewclusteringapproacheswillevaluateseveraldatasets tocapturedierentdatacharacteristicsandtrends. ThefollowingdatasetsseeTableII.1withmultipleheightandwidthdimensionalityvariationswererunusingseveraldierentsamplesizesalongwithdierent percentagesoftheoriginaldataset.ThedatasetsareincludedfromtheUCIrvine MachineLearningarchivewebsite[15]. 23


TableII.1:MachineLearningDatasets DataFileNumberofObservationsDimensions Basic13100035 Chem47832155 Diabetic1382041 GroupLens1000003 HAR26523127 HYG11961814 Iris1494 KDDCUP13100074 Poker13100011 Power1310006 Power210006 RelationNetwork5341322 Test10003 24


CHAPTERIII APPROACH RenedInitializationofCentroids ThecommonapproachtousingKmeansclusteringistogenerateaminimumof 10experimentsforafulltargetdataset.Themachinelearningstandardofscikitlearndeploys10experimentsandcarriesouttheanityscoringtoselectthebest nalclustering.Running10experimentsonlargedatasetsisbothtimeconsuming andcomputeintensive.Manyenvironmentshavelimitedcomputationalpowerfor anexcessivenumberoftrials.Togetaroundneedingtorunmultipleexperiments ontheentiredataset,thegoalistorunmultipleexperimentsonsmallersamplesof thedataset.RunningKmeanson1%,oreven5%,oftheoriginaldatasetsavesa signicantamountoftime.Thegoalistogetthesameclusteringperformancein muchlesstimeandwithmuchlesscomputingpowerusage. ThemethodproposedbyBradleyandFayyadintheirpaperReninginitial pointsforKmeansclustering"[13]isimplementedinmlpack.Thisstrategysamples pointsfromthedatasetandrunsKmeansclusteringonthosepointsmultipletimes, savingtheresultingclusters.Then,Kmeansclusteringisrunonthoseclusters,yieldingtheoriginalnumberofclusters.Thecentroidsofthoseresultingclustersareused asinitialcentroidsforKmeansclusteringontheentiredataset.Theapproachisas follows:startingwithanoriginaldatasetwith1milliondatapointsshowninFigureIII.1.,multiplesamplessamplesof1%oftheoriginaldatasetaregathered andcanbevisualizedinFigureIII.2.Kmeansisthenrunonallsamplesasdepicted inFigureIII.3.Thenalstepoftherenementprocessistotakethe8clusteringsand runKmeansagainonthem.Theresultprovidesthemeanpointsoftheclustersto useasthestartingcentroidsforthenalKmeansrunontheoriginaldataset.Figure III.4presentsthenalKmeansrunontheoriginaldataset. 25


FigureIII.1:SimulatedOriginalData FigureIII.2:EightRandomSamples-EachContaining1%ofSimulatedOriginal Dataset,000DatapointsperSample FigureIII.3:KmeansClusteringonSub-Samplesfork=4 FigureIII.5showstheclusteringwhentheinitialcentroidsareseededfromthe combinedcentroidsandKmeansisrunontheoriginaldata.FigureIII.6illustrates therunofKmeansclusteringontheoriginaldatasetwiththeintelligentlyselected 26


FigureIII.4:CombinedCentroidsfromSubsamples-GettingtheMeanofEach FigureIII.5:SeedingKmeanswithIntelligentlySelectedStartingCentroids FigureIII.6:KmeansonSimulatedOriginalDatasetwithRenement startingcentroids.FigureIII.7comparestheclusteringtorandomlyselectedstarting centroids.Itshowshowinitializingthecentroidstoagood"locationcanhavea signicantimpactontheamountofcomputingrequiredtoreachthenalclustering. FigureIII.7:KmeansonSimulatedOriginalDatasetwithoutRenement 27


RunningtheExperiments AsshowninFigureIII.8,theprocesstobeginrunningtheexperimentsrequires theimportofrelevantpythonmodules,deningglobalvariables,andprovidingthe inputandoutputpathstructure.Thesecondstepinvolvessettinguptheexperimentalparameters.Thiscanbedoneeitherbyeditingthepythonscriptdirectly, ormodifyingacongurationlethatholdsallthevaluesthatshouldberunbythe script.Thescriptbeginsbyscanningtheinputdirectoryfordatales.Oncethe datalesarefound,thepythonmoduleloadsthepaths,lenames,andthesizeand dimensionalityinformationintoadictionary.Theoutputpathsarealsosetupbased ontheexperimentalparameters.Thepathdictionaryisindexedinawaythatmakes iteasytodetermineeachexperimentalleoutput. Asmlpack-KmeansproducesmultipledatalesforeachKmeansrun,itisnecessarytoestablishanintuitivelenamingconventionfortheoutputles.Eachrunof mlpack-Kmeansproduces3outputles:acentroidslethatcontainsthecoordinates ofthecentroidsfound,anassociationslethatincludestheoriginaldatasetalong withanappendedcolumntoshowclustermembershipofeachrow,andalogle thatcontainstheinformationabouttherun,includingclusteringtime,numberof iterationstoconvergence,andanyerrorsorothermessagesassociatedwiththerun. Asanexample,anexperimentalrunonasingledatasetwith4dierentalgorithms, 4valuesofk,4samplesizes,and4samplingpercentageswillproduce256callsto mlpack-Kmeansand768outputles.Theconventionthatisusedputstheparameter valuesrightinthelename.Forexample,HAR 4 100 0.10 elkanisthelenamebase fortheexperimentthatrunsontheHARdatasetwithavalueofk=4,100samples of10%sampling,andtheelkanclusteringalgorithm. FigureIII.9showstheremainderoftheprocess.Oncethepathsaredened,the pythonscriptsetsupthecommandtosendtotheoperatingsystemtorunthemlpackKmeansalgorithm.Thereareseveralparametersthatarepassedtotheprogramand 28


FigureIII.8:CodePathtoGenerateKmeansResults thepythonscriptformatsthosebasedontheexperimentalsetupdescribedpreviously. Itwillparsethroughalltheexperimentalparametersandrunthealgorithmonthe givendatasets.Whilethisprocessisrunning,aftergeneratinganassignmentle, thereisapythonfunctiontoreducethesizeoftheassignmentslebyremovingthe originaldatasetfromtheleandleavingtheremainingclustermembershipinformation.Oncetheoriginaldatasetisremoved,theleisrenamedto'membership'. Withoutthisprocess,theoriginaldatasetwouldbereproducedhundredsoftimes andthelimitedharddrivespacewouldbeconsumed. AfterrunningtheKmeansalgorithm,thereisaseparatefunctionthatcalculates theanityofeachrun.Thisprocessparsesthroughalltheexperimentalrunsand calculatestheanitybasedonthemembershiple,thecentroidsle,andthemembershiple.Itusesthecentroidsletoestablishthecentroidcoordinates,andthe membershipletoestablishwhichpointsrowsintheoriginaldatasetlecorre29


spondwithwhichcluster.Theanityissummedandaweightedscoreisrecorded inanotherlelabeled'anity'-eectivelyaddingafourthletotheoutputrun. Oncealltheresultsareproduceditcantakeseveralhourstorunalltheexperiments,anotherfunctiongatherstheresultsintoadatadictionaryinpython. Clusteringtimeisextractedfromthethelogleofeachrunandtheanityispulled fromtheanitylefromeachrun.Thedatadictionaryprovidesconvenientaccess tothedatawithouthavingtopullitfromtheleseachtimeaplotisgenerated. Thedataisthenplottedtodisplayrelevanttrendsandallowconclusionstobemade aboutthedierentaspectsoftheexperimentalrun. FigureIII.9:CodePathtoCalculateandAssessAnity 30


CHAPTERIV EXPERIMENTALRESULTS Thequalityoftheclusteringisimportantasitprovidesinsightintohowwell theKmeansalgorithmperformed.Anityisthetermforhowclosethedatapoints aretotheclustertheyaremembersofafteraKmeansalgorithmhascompleteda run.Executiontimealoneisnotascriticalasgettingthecorrectresult.Getting anincorrectresultfasterisnotasolutionthatprovidesanyhelptothetask.The nalgoalofthisthesisistotryandestablishamatrixorscorecardthatshows whatalgorithm,samplesize,andpercentvaluepercentoforiginaldatasetforthe samplesshouldbeusedwhenrunningtheKmeansalgorithmonagiventypeorsize ofdataset.Ifexecutiontimeismoreimportantthanaccuracy,thenonealgorithm mightdothetrick,and10%samplingwith100samplesmightnotbethewaytogo. Ifthequalityoftheclusteringresultanityismoreimportant,thentheremaybe analgorithmandsamplingthatworksbetteronaparticulardataset. Iftheexperimentisrunwith10%samplingand100samples,theexecutionmight bevetimesworseinexecutiontimethanthebase,unrenedrun.Thequestmight beforbetteranityandtheexecutiontimemightnotmatterasmuch.Inthiscase though,iftheresultsarethesameastheunrenedrun,theonewithbetterexecution timeistypicallythebetterchoice. Results AlgorithmvsClusteringTime FigureIV.1showsthevariousKmeansclusteringalgorithmsforthedatasetof KDDCUPusingk=5.Theresultsarestandardizedtousenaiveexecutiontimeasthe basescaledto1.Inthatway,anyKmeansalgorithmthatrunsfasterthannaivewill havealowerbarplotandvaluebelow1.Theguredemonstratesthatthedierent Kmeansalgorithmshavethecapacitytoachievesignicantlyfasterexecutionthan naive.Thehamerlyalgorithmisapproximately5timesfasterthanbaseKmeans, 31


althoughboththeelkanandpelleg-mooretechniquesareclosecontendersrespectively performing3xand2xfasterthannaive. FigureIV.1:KDDCUPDataset:Algorithmsvs.ClusteringTimefork=5 FigureIV.2showsthetrendbetweenthedataset,clusteringtimeandk.The trendoftheresultsshowthataskincreases,ittakeslongertoclusterthedataset. KDDCUP,RelationNetwork,andHARarelargerdatasetsintermsofbothobservationsanddimensionality,soitisexpectedthattheywouldtakelongertocluster. Theresultingfunctionofthisgurehelpstocharacterizetheinuenceofrunning dierentalgorithmsfordierentdatasets.Theanalysisoftheexperimentalresults canbeusedasavaluablereferenceforhowtoruntheKmeansalgorithmsondierent shapeddatasets.Onealgorithmmightgoreallyfast,butattheexpenseofagood clusteringoutcome{anotheralgorithmmightbeslow,butprovideamuchbetter clusteringsolution.ThetrendofincreasingtheKnumberofclustersincreasesthe executiontimeofclusteringwhichhasahigherimpactforsomedatasets. FigureIV.3showsthetrendbetweentheHYGdataset,weightedanityorperpointanityandthevalueofk.Theresultsshowthatthereisadecreasingtrend 32


FigureIV.2:ClusteringTimeforDatasetsvs.kforNaiveAlgorithm foranityaskincreases,butfork=6,theanitywasactuallyworsethanwhenk=3. ThisobservationpointstothefactthatsometimesKmeansdoesnotalwayscluster thedatacorrectlywhichiswhymultipleexperimentsareoftenneededtoverifythe results. FigureIV.3:HYGDatasetvs.kvs.WeightedAnityforNaiveAlgorithm FigureIV.4showsthetrendbetweendatasetsizeandexecutiontime.Theclus33


teringtimehasbeennormalizedtothe'Basic'datasetrunandtheplotisshowing inlogscale.Ingeneral,thetrendshowsanincreaseinexecutiontimeasthedata sizegetslarger.Thedatasizeisamultipleofthenumberofobservationsandthe dimensionalityofthedataset. FigureIV.4:TotalClusteringTimeperDataset FigureIV.5showsabreakdownoftheclusteringtimewithrespecttokaswell asthesamplingsize.Thebaserunisthecasewherenorenementwasusedon theKmeansrun,andtherestofthelinesrepresentclusteringtimeforthedierent experimentalrunswithdierentsamplingsizesofpercentagesoftheoriginaldataset. Theplotshowsthatthereisanimprovementinclusteringtimewhenusingasample sizeof10andsampling1%oftheoriginaldataset. FigureIV.6showsthesumofexecutiontimecomparedtothesumofweighted anityforeachdatasetonanexamplerunwith8clustersk=8. 34


FigureIV.5:ClusteringTimeComparedtoBaseforKmeansonKddcupAlgorithm FigureIV.6:TotalWeightedAnityandClusteringTimeoverDatasetfork=8 35


FigureIV.7showshowtheanityofeachclusterisgenerallyinverselyproportionaltok.Thistrendoccursbecauseasthedataisclusteredintomoreseparate clusters,typicallytheaveragewithin-clusterdistancedecreasesaswell.Ifalldata pointsbecameclusters,therewouldbezeroaveragewithin-clusterdistanceandthe anitywouldbezero-whichdoesnothelptondanyhiddenstructuresinthedata. FigureIV.7:AnityTrendsDownaskIncreasesonDiabeticDataset FigureIV.8showshowtheclusteringtimeisrelatedtothealgorithmandthe numberofclustersk.Thisdemonstrateshowdierentalgorithmsperformacross thespectrumofkvaluesfortheDiabetic"dataset.Asalsopreviouslyshown,the anitytendstotrenddownasthevalueofkincreases.Becausetherewereno renementtechniquesusedtoplotthischart,theanityscoresdidnotseemuch deviationbetweenthealgorithms. Inthisnextchart,FigureIV.9showshowrenementcanactuallyreducethe anityoversomeofthenaiveruns.Theplotshowsthatthepelleg-moorealgorithm, 36


FigureIV.8:NaiveAlgorithmTakesLongerforHigherValuesofk FigureIV.9:RenementTechniquesCanSometimesYieldBetterAnity 37


withnorenementprovidedthelowestweightedanityscore,aswellasasignicantly lowerexecutiontimethanthenaivealgorithmwithnorenementdenotedonthe plotassamplepercent=0andsamplesize=0.Thereis,however,arenement techniquethatbeatsthenaiveinbothweightedanityandexecutiontime.The dualtreealgorithmwith100samplesanda1%samplingratescoresjustabovethe lowestanityscore.Aswouldbeexpected,theelkanalgorithm,using100samplesat a10%samplingrate,whileitalsobeatthenaive,no-renementexperimentalrun,the executiontimeisthehighestonthechart.Theseresultsshowthatbycharacterizing thealgorithmandexperimentalsetup,theoverallclusteringtimeandoutcomecan besignicantlyimprovedbyselectingtheoptimalrenementtechniques,algorithms, andparameters. FigureIV.10showsPCAanalysisandhowitcouldpotentiallyimproveupon theexecutiontimeiftheprimarycomponentsthatcontributetoamajorityofthe variancearelteredoutandclustered.Thisplotshowsthatsomeofthecomponents canprobablybeeliminatedwhilestillmaintainingastatisticallyreasonableresult afterclustering.Moreworkisneededtoexplorethisin-depthasthereisapotential forusingPCAanalysisbeforerunningtheKmeansalgorithm,whichcouldimprove theprocessevenmore.Ascanbeseen,roughly3ofthevariablesaccountfor70%of theproblemvariance.Withthatinmind,renementtechniquesmaybeabletouse thisinformationtobetterimplementtheexperimentalruns. 38


FigureIV.10:PCAonthePowerDatasetShowingProblemVarianceContribution 39


CHAPTERV CONCLUSION Theoutcomeofrunningseveraldierentexperimentsonapproximatelyadozen datasetsprovidedinsightintohowtheKmeansalgorithmbehavesondatasetsof diversesizes.Withthelimitedcomputingpoweravailable,ittookover8hoursto runalltheexperimentsonthe13datasets.Theextremeexecutiontimeisonereason whythisresearchisimportant:noteveryonehasaccesstolargecomputingresources, sorunningexperimentslikethisonlargerdatasetscantakeseveralweeks,oreven months. TheexaminationoftherenementKmeanstechniquedemonstratedgoodpotentialinovercomingexecutiontimeofthetraditionalKmeans.Furtherexplorationis warrantedtohelpbuildonthisresearch.Anythingthatcanbedonetohelpspeed uptheexecutiontimewhilealsoimprovingtheclusteringprocessisvaluabletothe scienticanddataresearchcommunities,notonlyintermsofenergysavings,butalso timesavingsandgettingresultsfastertohelpsolvetheproblemathand.Whether itisminingdatatopredictelections,orevaluatingcancerratesfrompatientstudies, theworldofmachinelearningandclusteringwillnodoubtcontinuetobeasignicant partofthenextseveraldecadesoftechnologicaladvancement. 40


REFERENCES [1]Irani,Jasmine,NitinPise,andMadhuraPhatak."ClusteringTechniques andtheSimilarityMeasuresusedinClustering:ASurvey."InternationalJournalofComputerApplications134,no.7January2016: 9-14. [2]Liu,Hsiang-Chuan,Wen-PeiSung,andWenliYao.Information,Computerand ApplicationEngineering.CRCPress. [3]Zheliznyak,Iryna,ZorianaRybchak,andIrynaZavuschak."AnalysisofClusteringAlgorithms."AdvancesinIntelligentSystemsandComputing512September27,2016:305-15. [4]Chowdary,SunilN.,SriD.LakshmiPrasanna,andP.Sudhakar."Evaluatingand AnalyzingClustersinDataMiningusingDierentAlgorithms."International JournalofComputerScienceandMobileComputing3,no.2February2014: 86-99. [5]Clarke,Bertrand,ErnestFokoue,andHaoHelenZhang.Principlesandtheory fordataminingandmachinelearning.Dordrecht:Springer,2011. [6]Kelleher,JohnD.,BrianMacNamee,andAoifeDArcy.Fundamentalsofmachine learningforpredictivedataanalytics:algorithms,workedexamples,andcase studies.Cambridge,MA:TheMITPress,2015. [7]Schouwenaars,Filip,andSebastianPerezSaaibi."Introduction toMachineLearning."OnlineCourse.AccessedMay06,2017. [8]Scikit-learn:MachineLearninginPython,Pedregosa,F.etal.,JournalofMachineLearningResearchvolume12,pages2825{2830,2011 [9]YanJun,ZhangBenyu,LiuNing,YanShuicheng,ChengQiansheng,Fan Weiguo,YangQiang,XiWensi,andChenZheng,2006.Eectiveandecient dimensionalityreductionforlarge-scaleandstreamingdatapreprocessing,IEEE transactionsonKnowledgeandDataEngineering,Vol.18,No.3,pp.320-333. [10]Matlo,NormanS.Statisticalregressionandclassication:fromlinearmodels tomachinelearning.BocaRaton:CRCPress,2017. [11]Xu,Guandong,ZhengluYang,andYuZong.Applieddatamining.BocaRaton: CRCPress,2013. 41


[12]Hamerly,G.;Elkan,C.."Alternativestothek-meansalgorithmthatnd betterclusterings"PDF.Proceedingsoftheeleventhinternationalconference onInformationandknowledgemanagementCIKM. [13]Bradley,PaulandFayyad,Usama.."ReningInitialPointsforK-Means Clustering"PDF.InProceedingsoftheFifteenthInternationalConference onMachineLearningICML'98,JudeW.ShavlikEd..MorganKaufmann PublishersInc.,SanFrancisco,CA,USA,91-99. [14]Curtin,RyanR.,JamesR.Cline,NeilP.Slagle,WilliamB.March,P.Ram, NishantA.Mehta,andAlexanderG.Gray."MLPACK:AScalableCMachine LearningLibrary."JournalofMachineLearningResearch14:801-05. [15]LichmanM.Lichman2013UCIMachineLearningRepository,Irvine,SchoolofInformationandComputerSciences 42