K-means Clustering

Clusters data by trying to separate samples into n groups of equal variance, minimizing the within-cluster sum of squared distances (the inertia).

Documentation

Attributes

cluster_centers_

Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.

inertia_

Sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided.

labels_

Label of each point, that is, the index of the cluster the point is assigned to.
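To make the relationship between the three attributes concrete, the following is a minimal pure-Python sketch, not the node's or scikit-learn's implementation. The helper name assign_and_score and the toy data are hypothetical. Given fixed centers, it derives the per-point labels and the (optionally weighted) inertia.

```python
def assign_and_score(points, centers, sample_weight=None):
    """Return (labels, inertia) for a fixed set of centers."""
    if sample_weight is None:
        sample_weight = [1.0] * len(points)
    labels = []
    inertia = 0.0
    for point, weight in zip(points, sample_weight):
        # Squared Euclidean distance from this point to every center.
        sq_dists = [
            sum((p - c) ** 2 for p, c in zip(point, center))
            for center in centers
        ]
        # The label is the index of the closest center.
        nearest = min(range(len(centers)), key=sq_dists.__getitem__)
        labels.append(nearest)
        # Inertia accumulates the weighted squared distance to that center.
        inertia += weight * sq_dists[nearest]
    return labels, inertia


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
labels, inertia = assign_and_score(points, centers)
print(labels, inertia)  # [0, 0, 1, 1] 4.0
```

If the centers passed in are not the ones the labels were computed from, as can happen when a run stops at max_iter, the labels and centers no longer describe the same partition.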

Definition

Output ports

model model

Model

Configuration

K-means algorithm (algorithm)

K-means algorithm to use. The classical EM-style algorithm is “lloyd”. The “elkan” variation can be more efficient on some datasets with well-defined clusters, by using the triangle inequality. However it’s more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).

“auto” and “full” are deprecated and will be removed in scikit-learn 1.3. Both are aliases for “lloyd”.

Changed in version 0.18: Added Elkan algorithm

Changed in version 1.1: Renamed “full” to “lloyd”, and deprecated “auto” and “full”. Changed “auto” to use “lloyd” instead of “elkan”.
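The triangle-inequality idea behind “elkan” can be illustrated with a small pure-Python sketch. This is a simplified illustration only: the function name assign_with_pruning is hypothetical, and the full Elkan algorithm additionally keeps per-sample distance bounds across iterations (the extra array mentioned above), which this sketch omits. It shows only how some point-to-center distances can be skipped without changing the result.

```python
import math


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping distance
    computations that the triangle inequality proves unnecessary."""
    k = len(centers)
    # Pairwise distances between centers, computed once.
    center_dist = [[math.dist(a, b) for b in centers] for a in centers]
    labels, skipped = [], 0
    for point in points:
        best = 0
        best_dist = math.dist(point, centers[0])
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(x, c_best), then
            # d(x, c_j) >= d(x, c_best), so c_j cannot be closer.
            if center_dist[best][j] >= 2.0 * best_dist:
                skipped += 1
                continue
            d = math.dist(point, centers[j])
            if d < best_dist:
                best, best_dist = j, d
        labels.append(best)
    return labels, skipped


centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(0.5, 0.0), (9.5, 0.2), (0.1, 9.0), (1.0, 1.0)]
labels, skipped = assign_with_pruning(points, centers)
print(labels, "distance computations skipped:", skipped)
```

The more clearly separated the clusters are, the more often the pruning test succeeds, which is consistent with the note that “elkan” helps most on datasets with well-defined clusters.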

Initialization method (init)

Method for initialization:

‘k-means++’: selects initial cluster centroids using sampling based on an empirical probability distribution of the points’ contribution to the overall inertia. This technique speeds up convergence, and is theoretically proven to be O(log k)-optimal. See the description of n_init for more details.

‘random’: choose n_clusters observations (rows) at random from data for the initial centroids.

If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization.
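The sampling idea behind ‘k-means++’ can be sketched in a few lines of pure Python. This is the basic textbook seeding under stated assumptions, not necessarily the exact variant used by the library; the function name kmeans_pp_init and the seed parameter are hypothetical. Each new center is drawn with probability proportional to a point's squared distance from its nearest already-chosen center.

```python
import random


def kmeans_pp_init(points, n_clusters, seed=0):
    """Pick n_clusters initial centers from points, k-means++ style."""
    rng = random.Random(seed)
    # The first center is chosen uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < n_clusters:
        # Squared distance from each point to its nearest chosen center.
        # Points that are already centers get weight 0.
        weights = [
            min(sum((p - c) ** 2 for p, c in zip(point, center))
                for center in centers)
            for point in points
        ]
        # Draw the next center proportionally to those weights.
        # (Raises ValueError if all weights are zero, e.g. when there are
        # fewer distinct points than requested clusters.)
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
initial_centers = kmeans_pp_init(points, n_clusters=3)
print(initial_centers)
```

Because far-away points are favored, the initial centers tend to be spread across the data, which is why this initialization usually needs fewer iterations than ‘random’.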

Maximum number of iterations (max_iter)

Maximum number of iterations of the k-means algorithm for a single run.

Number of clusters/centroids (n_clusters)

The number of clusters to form as well as the number of centroids to generate.

Number of runs (n_init)

Number of times the k-means algorithm is run with different centroid seeds. The final result is the best output of the n_init consecutive runs in terms of inertia.
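The interplay of n_init, max_iter and inertia can be sketched as follows. This is a simplified pure-Python illustration with hypothetical helper names (sq_dist, lloyd, best_of_n_init), random initialization, and an exact-equality stopping rule instead of tol; it is not the library's implementation.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, n_clusters, rng, max_iter=300):
    """One k-means run from a random start; returns (inertia, centers)."""
    centers = rng.sample(points, n_clusters)
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for each point.
        labels = [min(range(n_clusters), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # empty cluster: keep center
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers
    inertia = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return inertia, centers


def best_of_n_init(points, n_clusters, n_init=10, seed=0):
    """Run lloyd n_init times and keep the run with the lowest inertia."""
    rng = random.Random(seed)
    runs = [lloyd(points, n_clusters, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[0])


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
best_inertia, best_centers = best_of_n_init(points, n_clusters=2)
print(best_inertia, best_centers)
```

Since k-means only finds a local optimum that depends on the starting centers, repeating the run and keeping the lowest inertia reduces the chance of ending up with a poor partition, at the cost of proportionally more computation.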

Random seed (random_state)

Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See the random_state entry in the scikit-learn glossary.
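The effect of a fixed seed can be shown with a small sketch of ‘random’ initialization, using Python's standard random module as a stand-in for the library's random number generator. The function name random_init is hypothetical.

```python
import random


def random_init(points, n_clusters, random_state=None):
    """Choose n_clusters observations at random as initial centroids."""
    # A fixed integer seed gives the same draw every time;
    # None seeds from the operating system, so draws differ between calls.
    rng = random.Random(random_state)
    return rng.sample(points, n_clusters)


points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0), (9.0, 0.0)]
first = random_init(points, n_clusters=2, random_state=42)
second = random_init(points, n_clusters=2, random_state=42)
print(first == second)  # True
```

With the same seed and the same data, the initial centroids, and therefore the final clustering, are reproducible between executions.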

Tolerance (tol)

Relative tolerance with regard to the Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.
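A convergence test of this kind can be sketched as below. The helper names scaled_tolerance and has_converged are hypothetical. The scaling of tol by the mean per-feature variance of the data, and the comparison against the squared norm of the shift, are assumptions about how the tolerance is made relative; check the library source before relying on the exact rule.

```python
import statistics


def scaled_tolerance(points, tol):
    """Assumption: make tol relative by scaling it with the mean
    per-feature variance of the data."""
    variances = [statistics.pvariance(column) for column in zip(*points)]
    return tol * statistics.fmean(variances)


def has_converged(old_centers, new_centers, tolerance):
    """Compare the squared Frobenius norm of the center shift to tolerance."""
    shift = sum(
        (a - b) ** 2
        for old, new in zip(old_centers, new_centers)
        for a, b in zip(old, new)
    )
    return shift <= tolerance


points = [(0.0, 0.0), (2.0, 4.0)]
tolerance = scaled_tolerance(points, tol=1e-4)

small_move = has_converged([(0.0, 0.0)], [(0.01, 0.0)], tolerance)
large_move = has_converged([(0.0, 0.0)], [(0.1, 0.0)], tolerance)
print(small_move, large_move)  # True False
```

A larger tol stops the iterations earlier, which saves time but can leave cluster_centers_ slightly inconsistent with labels_, as noted under Attributes.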

Implementation

class node_clustering.KMeansClustering