K-means Clustering


Clusters data by trying to separate samples in n groups of equal variance

Documentation

KMeans is an unsupervised clustering algorithm. It clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia. It scales well to large numbers of samples and has been used across a large range of application areas in many different fields.

The set of samples is divided into a given number of clusters, where each cluster is described by the mean of the samples in the cluster. The inertia is the sum of squared distances from each sample to its cluster mean.

Inertia can be recognized as a measure of how internally coherent clusters are. It suffers from various drawbacks:

  • Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters, or manifolds with irregular shapes.

  • Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated. Running a dimensionality reduction algorithm such as Principal component analysis (PCA) prior to k-means clustering can alleviate this problem and speed up the computations.
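The PCA-before-k-means remedy mentioned above can be sketched as follows. This assumes the node wraps scikit-learn's KMeans, as the documentation text suggests; the data here is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 200 synthetic samples in 50 dimensions: two well-separated groups
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 50)),
    rng.normal(5.0, 1.0, size=(100, 50)),
])

# Reduce to 2 principal components before clustering to counter
# inflated Euclidean distances in high-dimensional space
X_reduced = PCA(n_components=2).fit_transform(X)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_reduced)

print(km.labels_.shape)  # one label per sample
```

Clustering in the reduced space is also cheaper, since each distance computation involves 2 rather than 50 coordinates.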

Attributes

cluster_centers_

Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.

inertia_

Sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided.

labels_

Labels of each point
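A minimal sketch of the fitted attributes listed above, assuming the node exposes scikit-learn's KMeans estimator; the toy data is illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of three points each
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # shape (n_clusters, n_features)
print(km.labels_)           # one cluster label per sample
print(km.inertia_)          # sum of squared distances to closest center
```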

Definition

Output ports

model
Type: model
Description: Model

Configuration

K-means algorithm (algorithm)

K-means algorithm to use. The classical EM-style algorithm is “lloyd”. The “elkan” variation can be more efficient on some datasets with well-defined clusters, by using the triangle inequality. However it’s more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).

Changed in version 0.18: Added Elkan algorithm

Changed in version 1.1: Renamed “full” to “lloyd”, and deprecated “auto” and “full”. Changed “auto” to use “lloyd” instead of “elkan”.

Initialization method (init)

Method for initialization:

  • ‘k-means++’ : selects initial cluster centroids using sampling based on an empirical probability distribution of the points’ contribution to the overall inertia. This technique speeds up convergence. The algorithm implemented is “greedy k-means++”. It differs from the vanilla k-means++ by making several trials at each sampling step and choosing the best centroid among them.

  • ‘random’: choose n_clusters observations (rows) at random from data for the initial centroids.

  • If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

  • If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization.

For an example of how to use the different init strategies, see sphx_glr_auto_examples_cluster_plot_kmeans_digits.py.

For an evaluation of the impact of initialization, see the example sphx_glr_auto_examples_cluster_plot_kmeans_stability_low_dim_dense.py.
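The init options above can be sketched like this, assuming scikit-learn's KMeans. Note that when explicit centers are passed as an array, a single run (n_init=1) suffices, since the initialization is deterministic.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])

# Built-in initialization strategies
km_pp = KMeans(n_clusters=2, init="k-means++", n_init=1, random_state=0).fit(X)
km_rand = KMeans(n_clusters=2, init="random", n_init=5, random_state=0).fit(X)

# Explicit initial centers: array of shape (n_clusters, n_features)
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
km_arr = KMeans(n_clusters=2, init=centers, n_init=1).fit(X)
```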

Maximum number of iterations (max_iter)

Maximum number of iterations of the k-means algorithm for a single run.

Number of clusters/centroids (n_clusters)

The number of clusters to form as well as the number of centroids to generate.

For an example of how to choose an optimal value for n_clusters refer to sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py.

Number of runs (n_init)

Number of times the k-means algorithm is run with different centroid seeds. The final result is the best output of n_init consecutive runs in terms of inertia. Several runs are recommended for sparse high-dimensional problems (see kmeans_sparse_high_dim).

When n_init=’auto’, the number of runs depends on the value of init: 10 if using init=’random’ or init is a callable; 1 if using init=’k-means++’ or init is an array-like.

Added in version 1.2: Added ‘auto’ option for n_init.

Changed in version 1.4: Default value for n_init changed to ‘auto’.
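The effect of multiple restarts can be sketched as follows, assuming scikit-learn's KMeans: since the best of n_init runs is kept, adding restarts can only keep or lower the final inertia.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three well-separated 2-D blobs of 30 points each
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0.0, 4.0, 8.0)])

single = KMeans(n_clusters=3, n_init=1, random_state=0).fit(X)
multi = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Best-of-10 is at least as good as a single run
print(single.inertia_, multi.inertia_)
```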

Random seed (random_state)

Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See random_state.

Tolerance (tol)

Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.
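Stopping early, whether via tol or max_iter, trades accuracy for speed; as noted under cluster_centers_, a run stopped before convergence may leave the centers inconsistent with labels_. A sketch, assuming scikit-learn's KMeans:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))

# Identical initialization; only the iteration budget differs
early = KMeans(n_clusters=5, max_iter=1, n_init=1, random_state=0).fit(X)
full = KMeans(n_clusters=5, max_iter=300, n_init=1, random_state=0).fit(X)

# n_iter_ reports how many iterations each run actually performed;
# Lloyd iterations never increase the inertia, so more iterations
# can only keep or improve it
print(early.n_iter_, full.n_iter_)
```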

Examples

Example flows demonstrating this node:

The node can be found in:

Implementation

class node_clustering.KMeansClustering