Doing so poses challenges for k-means clustering, and all variants mis-classified some points.

Advanced users can modify the following Centaurus parameters: the maximum number of clusters to fit to the data; the number of experiments per K to run; the number of times to initialize the k-means clustering; the type of covariance matrix to use for the analysis; and whether to scale the data so that each dimension has zero mean and unit standard deviation. Centaurus considers each parameterization that the user chooses as a “job”. Each job consists of multiple tasks that Centaurus deploys. Users can check the status of a job or view its report. The status page provides an overview of all the tasks, with a progress bar for the percentage of tasks completed and a table showing task parameters and outcomes available for download and visualization. Centaurus provides its recommendation on a report page; the recommendation consists of the number of clusters and the k-means variant that produced the best score. This page also shows the cluster assignments and spatial plots using longitude and latitude. For additional analysis, users can select “advanced report” to see the correlation among features in the dataset, scores for each k-means variant, the best clusterings for each of the variants, etc.

We implement Centaurus using Python and integrate a number of open-source software packages and cloud services. These include the Python Flask web framework Flask, RabbitMQ RabbitMQ and Python Celery Celery for messaging and queuing support, and a PostgreSQL SQL database PostgreSQL and a MongoDB Community Edition NoSQL database MongoDB, which we use to store parameters and results for jobs and tasks. Other packages include Numpy Walt et al., Pandas McKinney et al., SciKit-Learn Pedregosa et al., and SciPy Jones et al. for data processing, and Matplotlib Hunter and Seaborn Seaborn for data visualization. Centaurus can execute on any virtualized cluster or cloud system and autoscales deployments by starting and stopping virtual servers as required by the computation. In our evaluation in this chapter, we deploy Centaurus on a private cloud, Aristotle, that runs Eucalyptus v4.4 Nurmi et al. and has multiple virtual servers with different CPU, memory, and storage capabilities. We build upon, generalize, and extend this system in Chapter 5 of this dissertation.
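To make this parameterization concrete, the sketch below shows one way a job specification and its fan-out into tasks could be expressed in Python; the field names and the fan-out rule are illustrative assumptions of this sketch, not Centaurus's actual schema.

```python
# A hypothetical Centaurus job specification; the field names are
# illustrative, not Centaurus's actual schema.
job = {
    "max_k": 10,            # maximum number of clusters to fit
    "n_experiments": 2048,  # experiments to run per value of K
    "n_init": 1,            # k-means initializations per experiment
    "covariance": "full",   # "full", "diag", or "spherical"
    "tied": False,          # share one covariance matrix across clusters?
    "scale": True,          # zero mean, unit standard deviation per dimension
}

# Each job fans out into one task per (K, experiment) pair, so the
# number of tasks grows as max_k * n_experiments.
tasks = [(k, e)
         for k in range(1, job["max_k"] + 1)
         for e in range(job["n_experiments"])]
```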

Centaurus stores the cluster assignments for each experiment, i.e., the result with the largest log-likelihood value across initial assignments. In its computation of BIC and AIC, this Centaurus instance only considers clustering results in which all clusters have at least 30 points. Finally, as described above, Centaurus reports the result with the highest average BIC score as the “best” clustering across every K considered for all variants. Note that Dataset-1 was generated using a GMM where all dimensions are independent of each other and are identically distributed. Thus the “perfect” classification results generated by the Full and Diagonal methods indicate that they correctly disregard any observed sample variance or covariance. The results for Full-Untied with Dataset-2 and Dataset-3 illustrate Centaurus's ability to correct for cross-dimensional correlation. The generating GMM in both cases is untied. Also, unlike in Dataset-1, where there are three distinct clusters with separated centers, we purposefully placed the cluster centers of Dataset-2 and Dataset-3 near each other and generated distributions that overlap.

To visualize the effect of different k-means variants on BIC score, we perform 2048 single k-means runs for each variant for the synthetic datasets described in Section 3.4. Figure 3.2 shows histograms of the BIC scores for each of the three synthetic datasets. We divide the scores among 100 bins. For each dataset, we present six histograms, one for each of the k-means variants, represented in different colors, where each variant has a total of 2048 single k-means runs. The X-axis depicts BIC scores from experiments; farther right corresponds to larger BIC and thus higher-quality clusterings.

For the Cal Poly dataset, the best clustering has four clusters with cardinalities of 2188, 531, 308, and 205, respectively. The second-best clustering has a BIC score of -8925.4 and three clusters with cardinalities 1733, 973, and 526, respectively, as shown in Figure 3.4b. Figure 3.4c shows the difference between these two clusterings: a specific data point is shown if it has a different cluster assignment when we rank clusters by cardinality. For this data, these clusterings clearly differ. Thus, doubling the number of experiments from 1024 to 2048 allows Centaurus to find a clustering with a better BIC score.
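The reported scores are negative and larger is better, which is consistent with the Schwarz form BIC = ln L̂ − (p/2) ln n. The sketch below assumes that form, together with the 30-point degeneracy filter described above; the exact formulation Centaurus uses is an assumption here.

```python
import numpy as np

def bic_score(log_likelihood, n_params, n_samples):
    """Schwarz-form BIC in the larger-is-better convention assumed
    here: BIC = ln(L) - (p / 2) * ln(n). Other texts negate this."""
    return log_likelihood - 0.5 * n_params * np.log(n_samples)

def is_degenerate(labels, min_size=30):
    """Flag clusterings with any cluster smaller than min_size points;
    such results are excluded from the BIC/AIC computation."""
    _, counts = np.unique(labels, return_counts=True)
    return counts.min() < min_size
```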

The Sedgwick dataset has a more stable outcome in terms of the best BIC score when increasing the number of experiments. Figure 3.3b shows that even with 256 experiments, we achieve the same maximum BIC score as with 2048 experiments. The best result has a BIC score of -7468.0 and three clusters with 1111, 996, and 568 elements. This result is consistent over many repeated jobs with a sufficiently large number of experiments, i.e., any job with 256 or more experiments produced this same clustering as the one corresponding to the largest score. The second-best clustering agrees with the best result on the number of clusters, with cluster cardinalities of 963, 879, and 833, and a BIC score of -7529.8. While these clusters do differ, Figure 3.5c shows that the differences are scattered spatially. Thus the best and second-best clusterings may not differ in terms of actionable insight.

For the UNL field, the best and second-best clusterings are shown in Figure 3.6. These are both from Job-2048. The best clustering has six clusters with cardinalities 2424, 1493, 1138, 561, 111, and 70, respectively. The second-best clustering has four clusters with cardinalities 2730, 1615, 838, and 614, respectively. From these features and the differences shown in Figure 3.6c, it is clear that the best and second-best clusterings are dissimilar. Further, the second-best clustering from Job-2048 is the best clustering in Job-64, Job-512, and Job-1024. As with the Cal Poly data, doubling the number of experiments from 1024 to 2048 “exposed” a better and significantly different clustering.

Unlike the results for the synthetic datasets, the best clustering for the Veris EC datasets is produced by the Full-Untied variant for sufficiently large job sizes. This result is somewhat surprising, since the Full-Untied variant incurs the largest penalty in the BIC score computation among all of the variants: the score is penalized for the mean, variance, and covariance estimates from each cluster, while the other variants require fewer parameter estimates. Related work has also argued for using fewer estimated parameters to produce the best clustering Fridgen et al., leading to an expectation that a simpler variant would produce the best clustering, but this is not the case for these datasets.
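The size of this penalty can be made concrete by counting the parameters each variant estimates. The sketch below uses one common convention (k·d means, k−1 mixture weights, plus the covariance terms); whether Centaurus includes the weights in its count is an assumption of this sketch.

```python
def n_free_params(k, d, covariance="full", tied=False):
    """Number of estimated parameters entering the BIC penalty, under
    one common convention: k*d means, k-1 mixture weights, plus the
    covariance terms for the given variant."""
    per_cluster = {"full": d * (d + 1) // 2,   # full covariance matrix
                   "diag": d,                   # one variance per dimension
                   "spherical": 1}[covariance]  # a single shared variance
    cov = per_cluster if tied else k * per_cluster
    return k * d + (k - 1) + cov

# For d = 3 (three-dimensional data) and k = 4, Full-Untied pays the
# largest penalty: n_free_params(4, 3, "full", False) == 39, versus
# n_free_params(4, 3, "spherical", True) == 16.
```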

Because Centaurus considers all variants, it will find the best clustering even if this effect is not general to all Veris data. To evaluate the differences among clustering variants as applied to farm datasets, we present, for each farm dataset, the best BIC scores and the variant that produced them for the three largest jobs: Job-512, Job-1024, and Job-2048. The results show that for most of the multivariate datasets, the best clusterings came from the Full-Untied variant, with a very small number of exceptions: the ALM dataset and Job-1024 for the TC1 dataset. For the univariate datasets, the only variant possible is spherical, since there are no additional dimensions with which to compute the covariance. For these datasets, the Spher-Untied variant performs best.

Degenerate clusters are a surprisingly frequent, yet under-studied, phenomenon when clustering farm datasets. We next investigate the frequency with which degeneracy occurs for different datasets and different numbers of clusters. Figure 3.7 illustrates the search space for the best BIC score with the estimated joint distribution of BIC scores and the number of elements in the smallest cluster. The figure represents a Job-2048 for the CAP dataset, with all six variants and all values of K. Darker colors on the graph represent higher-density regions. Our system uses all of the k values and all of the variants when choosing the model with the highest BIC score. Per-component distributions are available on the sides of the graph. The graph indicates that the highest BIC scores often come from clusterings that have one or more nearly empty clusters. Particularly for variants that rely on an estimate of covariance between dimensions, inferences made about the means of these clusters are suspect when their sizes are small. Note that for larger values of k, such clusterings can be common. For example, this particular job had 50961 (41.5%) degenerate and 71916 non-degenerate experiments.

To illustrate this effect more fully, we divide the experiments based on the number of clusters, k, and illustrate how degeneracy behaves with increasing k in Table 3.6. The total number of experiments for multivariate datasets was 12288 for each k across all six variant types. The univariate datasets had 4096 experiments and include only two variants for each k. For each farm, the results show the number of non-degenerate clusterings for each k. In some cases, the number of non-degenerate clusterings decreases as k increases. To emphasize the overall degeneracy across all Job-2048 experiments for each dataset, we summarize the percentages of experiments with fewer than 30 elements in their smallest cluster in Table 3.7. The smallest percentage of degenerate clusters is 6% for the GR1 dataset and the largest is 72% for the TC2 dataset. We have chosen 30 as a reasonable rule of thumb for a cluster size from which to make an inference about the mean of each cluster in the experiments having three-dimensional data.

Because k-means converges to a locally optimal solution for non-convex solution spaces, the choice of initial assignment can affect the clustering it finds. Often, users of k-means will run it once, or a small number of times, assuming that the local minimum it finds is “close” to the global minimum. In this subsection, we investigate the validity of this conjecture for soil EC data across farms.
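One way to test this conjecture is to run many independently seeded fits and keep the best score, as sketched below. Scikit-learn's GaussianMixture is used here only as a stand-in for a Centaurus k-means task; note that sklearn's .bic() is lower-is-better, the opposite sign convention from the scores quoted in this chapter.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def best_of_n_trials(X, k, n_trials, covariance_type="full"):
    """Fit n_trials independently seeded models and keep the best.
    GaussianMixture stands in for a Centaurus k-means task here;
    sklearn's .bic() is lower-is-better, unlike the scores in the text."""
    best_score, best_model = np.inf, None
    for seed in range(n_trials):
        gm = GaussianMixture(n_components=k,
                             covariance_type=covariance_type,
                             n_init=1, random_state=seed).fit(X)
        score = gm.bic(X)
        if score < best_score:
            best_score, best_model = score, gm
    return best_score, best_model
```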
Figure 3.8 presents the best BIC scores for different experiment sizes for all of the farm datasets. In some of the jobs, the best BIC score occurs only once among all of the experiments, while in others it occurs multiple times. Thus a small number of trials is likely to result in the most common clustering rather than the best one. More importantly, an increase in the number of experiments increases the chance of finding the best BIC score.

The x-axis for each graph in Figure 3.8 is the power of 2 in the number of experiments. For most of the graphs, k-means finds the best BIC consistently beyond some large number of experiments. However, for a few of them, it appears that an even greater number of experiments may be necessary before a single consistently large BIC is determined. The graph in Figure 3.3c, for example, seems to indicate that an even larger number of experiments may yield a larger BIC. Thus, for the EC data available to our study, it is clear that the best clustering is often rare and thus requires a large number of independent trials to determine.

Even though the best clustering may be rare, it may also differ from the most common clustering by so little as to make the effort required to find it unnecessary or wasteful. Table 3.8 compares the clustering with the best BIC score to the most common clustering for each of the datasets across their largest jobs. We limit the clusterings to those with at least 30 elements in each cluster to prevent degenerate clusterings from clouding the results. For each dataset we show two rows. The row marked “B” shows, for the clustering having the best BIC score, the BIC score, the value of k, the cardinality of each of the k clusters, and the number of occurrences of this clustering. The row marked “MC” shows the same information for the most common clustering.
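A minimal sketch of the bookkeeping behind such a comparison follows. It identifies a clustering by its sorted cluster cardinalities, which is only a proxy (distinct assignments can share the same cardinalities, as Figures 3.4c-3.6c illustrate), and the (bic, labels) result format is hypothetical.

```python
from collections import Counter

def signature(labels):
    """Proxy identity for a clustering: its sorted cluster sizes."""
    return tuple(sorted(Counter(labels).values(), reverse=True))

def best_vs_most_common(results):
    """results: list of (bic, labels) pairs, pre-filtered so that
    every cluster has at least 30 elements (no degenerate runs)."""
    best_bic, best_labels = max(results, key=lambda r: r[0])
    freq = Counter(signature(labels) for _, labels in results)
    most_common_sig, occurrences = freq.most_common(1)[0]
    return signature(best_labels), most_common_sig, occurrences
```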