Mini-batch K-means Clustering
A variant of the KMeans algorithm that uses mini-batches to reduce computation time.
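To illustrate the mini-batch idea, the sketch below (assuming the underlying
scikit-learn estimator, sklearn.cluster.MiniBatchKMeans, per the note at the end
of this page) refines the centroids from small chunks of the data via partial_fit
instead of from the full dataset at once; the data and chunk size are purely
illustrative::

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    rng = np.random.RandomState(0)
    X = rng.rand(10000, 2)               # illustrative data, not part of this node

    mbk = MiniBatchKMeans(n_clusters=3, random_state=0)
    for start in range(0, len(X), 100):
        # Each call refines the current centroids using only one mini-batch.
        mbk.partial_fit(X[start:start + 100])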
Configuration (a constructor sketch follows the parameter list):
n_clusters
The number of clusters to form as well as the number of
centroids to generate.
max_iter
Maximum number of iterations over the complete dataset before
stopping, independently of any early-stopping heuristic.
max_no_improvement
Control early stopping based on the number of consecutive mini-batches
that do not yield an improvement on the smoothed inertia.
To disable convergence detection based on inertia, set
max_no_improvement to None.
batch_size
Size of the mini batches.
init_size
Number of samples to randomly sample to speed up the
initialization (sometimes at the expense of accuracy): the
algorithm is initialized by running a batch KMeans on a
random subset of the data. This needs to be larger than n_clusters.
n_init
Number of random initializations that are tried.
In contrast to KMeans, the algorithm is only run once, using the
best of the n_init initializations as measured by inertia.
init
Method for initialization, defaults to ‘k-means++’:
‘k-means++’: selects initial cluster centers for k-means
clustering in a smart way to speed up convergence. See section
Notes in k_init for more details.
‘random’: choose k observations (rows) at random from data for
the initial centroids.
If an ndarray is passed, it should be of shape (n_clusters, n_features)
and gives the initial centers.
compute_labels
Compute label assignment and inertia for the complete dataset
once the minibatch optimization has converged in fit.
reassignment_ratio
Control the fraction of the maximum number of counts for a
center to be reassigned. A higher value means that low-count
centers are more easily reassigned, which means that the
model will take longer to converge, but should converge to a
better clustering.
tol
Control early stopping based on the relative center changes as
measured by a smoothed, variance-normalized estimate of the mean
squared change of the center positions. This early-stopping heuristic
is closer to the one used for the batch variant of the algorithm,
but it induces a slight computational and memory overhead over the
inertia heuristic.
To disable convergence detection based on normalized center
change, set tol to 0.0 (default).
random_state
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by np.random.
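The configuration options above mirror the keyword arguments of scikit-learn's
sklearn.cluster.MiniBatchKMeans (the library these docstrings were extracted
from, per the note at the end of this page). A minimal constructor sketch with
illustrative values, not necessarily this node's defaults::

    from sklearn.cluster import MiniBatchKMeans

    model = MiniBatchKMeans(
        n_clusters=8,             # number of clusters / centroids to generate
        init='k-means++',         # or 'random', or an (n_clusters, n_features) ndarray
        max_iter=100,             # full passes over the dataset before forced stop
        batch_size=100,           # size of the mini-batches
        tol=0.0,                  # 0.0 disables the normalized-center-change criterion
        max_no_improvement=10,    # None disables the smoothed-inertia criterion
        init_size=None,           # None lets scikit-learn derive it from batch_size
        n_init=3,                 # initializations tried; the best by inertia is kept
        compute_labels=True,      # label the full dataset once fitting has converged
        reassignment_ratio=0.01,  # higher values reassign low-count centers more eagerly
        random_state=0,           # seed for reproducibility
    )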
Attributes (a fitting sketch follows this list):
cluster_centers_
Coordinates of cluster centers
labels_
Labels of each point (if compute_labels is set to True).
inertia_
The value of the inertia criterion associated with the chosen
partition (if compute_labels is set to True). The inertia is
defined as the sum of squared distances of samples to their nearest
cluster center.
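A short sketch (again assuming the scikit-learn estimator underneath, with
synthetic data from sklearn.datasets.make_blobs) of how these attributes look
after fitting::

    from sklearn.cluster import MiniBatchKMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)  # synthetic 2-D data

    mbk = MiniBatchKMeans(n_clusters=4, compute_labels=True, random_state=0).fit(X)

    print(mbk.cluster_centers_.shape)   # (4, 2): one coordinate row per center
    print(mbk.labels_[:10])             # cluster index assigned to the first ten samples
    print(mbk.inertia_)                 # sum of squared distances to the nearest center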
Inputs:
Outputs:
model : model
Model
Some of the docstrings for this module have been automatically
extracted from the scikit-learn library
and are covered by their respective licenses.
class node_clustering.MiniBatchKMeansClustering