K-means Clustering¶
Clusters data by trying to separate samples in n groups of equal variance
Documentation¶
Attributes¶
- cluster_centers_
Coordinates of cluster centers. If the algorithm stops before fully converging (see
tol
andmax_iter
), these will not be consistent withlabels_
.- inertia_
Sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided.
- labels_
Labels of each point
Definition¶
Output ports¶
- model model
Model
Configuration¶
- K-means algorithm (algorithm)
K-means algorithm to use. The classical EM-style algorithm is “lloyd”. The “elkan” variation can be more efficient on some datasets with well-defined clusters, by using the triangle inequality. However it’s more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).
Changed in version 0.18: Added Elkan algorithm
Changed in version 1.1: Renamed “full” to “lloyd”, and deprecated “auto” and “full”. Changed “auto” to use “lloyd” instead of “elkan”.
- Initialization method (init)
Method for initialization:
‘k-means++’ : selects initial cluster centroids using sampling based on an empirical probability distribution of the points’ contribution to the overall inertia. This technique speeds up convergence. The algorithm implemented is “greedy k-means++”. It differs from the vanilla k-means++ by making several trials at each sampling step and choosing the best centroid among them.
‘random’: choose n_clusters observations (rows) at random from data for the initial centroids.
If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization.
For an example of how to use the different init strategy, see the example entitled sphx_glr_auto_examples_cluster_plot_kmeans_digits.py.
- Maximum number of iterations (max_iter)
Maximum number of iterations of the k-means algorithm for a single run.
- Number of clusters/centroids (n_clusters)
The number of clusters to form as well as the number of centroids to generate.
For an example of how to choose an optimal value for n_clusters refer to sphx_glr_auto_examples_cluster_plot_kmeans_silhouette_analysis.py.
- Number of runs (n_init)
Number of times the k-means algorithm is run with different centroid seeds. The final results is the best output of n_init consecutive runs in terms of inertia. Several runs are recommended for sparse high-dimensional problems (see kmeans_sparse_high_dim).
When n_init=’auto’, the number of runs depends on the value of init: 10 if using init=’random’ or init is a callable; 1 if using init=’k-means++’ or init is an array-like.
Added in version 1.2: Added ‘auto’ option for n_init.
Changed in version 1.4: Default value for n_init changed to ‘auto’.
- Random seed (random_state)
Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See random_state.
- Tolerance (tol)
Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.
Implementation¶
- class node_clustering.KMeansClustering[source]