K-means Clustering¶

../../../../_images/dataset_blobs1.svg

Clusters data by trying to separate samples in n groups of equal variance

Documentation¶

Attributes¶

cluster_centers_
Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.

inertia_
Sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided.

labels_
Labels of each point

Definition¶

Output ports¶

model model
Model

Configuration¶

K-means algorithm (algorithm)
K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation is more efficient on data with well-defined clusters, by using the triangle inequality. However it’s more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).

For now “auto” (kept for backward compatibility) chooses “elkan” but it might change in the future for a better heuristic.

Changed in version 0.18: Added Elkan algorithm

Initialization method (init)
Method for initialization:

‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

‘random’: choose n_clusters observations (rows) at random from data for the initial centroids.

If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization.

Maximum number of iterations (max_iter)
Maximum number of iterations of the k-means algorithm for a single run.

Number of clusters/centroids (n_clusters)
The number of clusters to form as well as the number of centroids to generate.

Number of runs (n_init)
Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.

Random seed (random_state)
Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See random_state.

Tolerance (tol)
Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.

Implementation¶

class node_clustering.KMeansClustering[source]