.. _`Mini-batch K-means Clustering`:
.. _`org.sysess.sympathy.machinelearning.mini_batch_k_means`:

Mini-batch K-means Clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: dataset_blobs.svg
   :width: 48

Variant of the KMeans algorithm that uses mini-batches to reduce computation time.

:Configuration:

    - *n_clusters*

      The number of clusters to form as well as the number of centroids to generate.

    - *max_iter*

      Maximum number of iterations over the complete dataset before stopping, independently of any early stopping criterion heuristics.

    - *max_no_improvement*

      Control early stopping based on the number of consecutive mini-batches that do not yield an improvement on the smoothed inertia. To disable convergence detection based on inertia, set max_no_improvement to None.

    - *batch_size*

      Size of the mini-batches.

    - *init_size*

      Number of samples to randomly sample for speeding up the initialization (sometimes at the expense of accuracy): the algorithm is initialized by running a batch KMeans on a random subset of the data. This needs to be larger than n_clusters.

    - *n_init*

      Number of random initializations that are tried. In contrast to KMeans, the algorithm is only run once, using the best of the ``n_init`` initializations as measured by inertia.

    - *init*

      Method for initialization, defaults to 'k-means++':

      - 'k-means++': selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See the Notes section of k_init for more details.
      - 'random': chooses k observations (rows) at random from the data as the initial centroids.
      - If an ndarray is passed, it should have shape (n_clusters, n_features) and gives the initial centers.

    - *compute_labels*

      Compute label assignment and inertia for the complete dataset once the mini-batch optimization has converged in fit.

    - *reassignment_ratio*

      Control the fraction of the maximum number of counts for a center to be reassigned. A higher value means that low count centers are more easily reassigned, which means that the model will take longer to converge, but should converge to a better clustering.

    - *tol*

      Control early stopping based on the relative center changes, as measured by a smoothed, variance-normalized estimate of the mean squared center position changes. This early stopping heuristic is closer to the one used for the batch variant of the algorithm, but induces a slight computational and memory overhead over the inertia heuristic. To disable convergence detection based on normalized center change, set tol to 0.0 (default).

    - *random_state*

      Controls the random number generator:

      - If int, random_state is the seed used by the random number generator.
      - If RandomState instance, random_state is the random number generator.
      - If None, the random number generator is the RandomState instance used by ``np.random``.

:Attributes:

    - *cluster_centers_*

      Coordinates of cluster centers.

    - *labels_*

      Labels of each point (if compute_labels is set to True).

    - *inertia_*

      The value of the inertia criterion associated with the chosen partition (if compute_labels is set to True). The inertia is defined as the sum of squared distances of samples to their nearest cluster center.

:Inputs:

:Outputs:
    **model** : model
        Model

.. automodule:: node_clustering

.. class:: MiniBatchKMeansClustering
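
The configuration options and attributes listed above match the parameter names of scikit-learn's ``MiniBatchKMeans``. The sketch below assumes that the node wraps that estimator, which this page does not state explicitly. It shows how the options might be set when using the estimator directly, and it recomputes the inertia by hand to illustrate its definition. The synthetic dataset and all parameter values are illustrative choices, not defaults of the node.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Illustrative data: 300 two-dimensional points around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Parameter names mirror the configuration options documented above;
# the values are arbitrary examples.
model = MiniBatchKMeans(
    n_clusters=3,
    batch_size=100,
    max_iter=100,
    max_no_improvement=10,
    n_init=3,
    init="k-means++",
    compute_labels=True,      # needed for labels_ and inertia_
    reassignment_ratio=0.01,
    tol=0.0,                  # disables the center-change stopping rule
    random_state=0,
)
model.fit(X)

# Inertia by hand: squared distance from each sample to its nearest
# cluster center, summed over all samples.
dists = np.linalg.norm(
    X[:, None, :] - model.cluster_centers_[None, :, :], axis=2
)
manual_inertia = float((dists.min(axis=1) ** 2).sum())

print(model.cluster_centers_.shape)
print(np.isclose(manual_inertia, model.inertia_, rtol=1e-6))
```

With ``compute_labels=True`` the fitted estimator exposes ``labels_`` and ``inertia_`` for the complete dataset, so the hand-computed value should agree with ``model.inertia_`` up to floating-point error.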