.. _`Mini-batch K-means Clustering`:

.. _`org.sysess.sympathy.machinelearning.mini_batch_k_means`:

Mini-batch K-means Clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: dataset_blobs.svg
   :width: 48


Variant of the KMeans algorithm that uses mini-batches to reduce computation time.

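The mini-batch idea can be illustrated with a minimal pure-Python sketch.
It is not the implementation used by this node. It assumes random
initialization from the data, a fixed number of batches, and a per-center
learning rate of one over the number of samples assigned to that center so
far. The names ``mini_batch_k_means``, ``nearest_center`` and
``squared_distance`` are hypothetical helpers introduced for illustration.

.. code-block:: python

    import random


    def squared_distance(a, b):
        # Squared Euclidean distance between two points.
        return sum((x - y) ** 2 for x, y in zip(a, b))


    def nearest_center(point, centers):
        # Index of the center closest to the point.
        return min(range(len(centers)),
                   key=lambda k: squared_distance(point, centers[k]))


    def mini_batch_k_means(data, n_clusters, batch_size, n_batches, seed=0):
        rng = random.Random(seed)
        # Assumption: start from randomly chosen observations.
        centers = [list(p) for p in rng.sample(data, n_clusters)]
        counts = [0] * n_clusters
        for _ in range(n_batches):
            # Only a small random subset of the data is touched per step.
            batch = rng.sample(data, min(batch_size, len(data)))
            # Assign the whole batch using the current centers.
            assignments = [nearest_center(p, centers) for p in batch]
            for point, k in zip(batch, assignments):
                counts[k] += 1
                eta = 1.0 / counts[k]  # per-center learning rate
                # Move the center towards the point (running mean).
                centers[k] = [(1 - eta) * c + eta * x
                              for c, x in zip(centers[k], point)]
        return centers


    # Two synthetic, well separated blobs (made-up data).
    rng = random.Random(1)
    blob_a = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(200)]
    blob_b = [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(200)]
    centers = mini_batch_k_means(blob_a + blob_b, n_clusters=2,
                                 batch_size=20, n_batches=50)
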
:Configuration:


  - *n_clusters*

    The number of clusters to form as well as the number of
    centroids to generate.


  - *max_iter*

    Maximum number of iterations over the complete dataset before
    stopping independently of any early stopping criterion heuristics.


  - *max_no_improvement*

    Control early stopping based on the number of consecutive mini
    batches that do not yield an improvement on the smoothed inertia.

    To disable convergence detection based on inertia, set
    max_no_improvement to None.


  - *batch_size*

    Size of the mini batches.


  - *init_size*

    Number of samples to randomly draw to speed up the
    initialization (sometimes at the expense of accuracy): the
    algorithm is initialized by running a batch KMeans on a
    random subset of the data. This needs to be larger than n_clusters.


  - *n_init*

    Number of random initializations that are tried.
    In contrast to KMeans, the algorithm is only run once, using the
    best of the ``n_init`` initializations as measured by inertia.


  - *init*

    Method for initialization, defaults to 'k-means++':

    'k-means++': selects initial cluster centers for k-means
    clustering in a smart way to speed up convergence. See section
    Notes in k_init for more details.

    'random': choose k observations (rows) at random from data for
    the initial centroids.

    If an ndarray is passed, it should be of shape (n_clusters, n_features)
    and gives the initial centers.


  - *compute_labels*

    Compute label assignment and inertia for the complete dataset
    once the minibatch optimization has converged in fit.


  - *reassignment_ratio*

    Control the fraction of the maximum number of counts for a
    center to be reassigned. A higher value means that low count
    centers are more easily reassigned, which means that the
    model will take longer to converge, but should converge to a
    better clustering.


  - *tol*

    Control early stopping based on the relative center changes as
    measured by a smoothed, variance-normalized estimate of the mean
    squared center position changes. This early stopping heuristic is
    closer to the one used for the batch variant of the algorithm
    but induces a slight computational and memory overhead over the
    inertia heuristic.

    To disable convergence detection based on normalized center
    change, set tol to 0.0 (default).


  - *random_state*

    If int, random_state is the seed used by the random number generator;
    if RandomState instance, random_state is the random number generator;
    if None, the random number generator is the RandomState instance used
    by ``np.random``.

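As a hedged illustration of how the options above fit together, the sketch
below passes them to scikit-learn's ``MiniBatchKMeans``. It assumes that the
node forwards its configuration to that estimator unchanged, which this page
does not state explicitly. The synthetic data and all parameter values are
made up for the example.

.. code-block:: python

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    # Synthetic data: three Gaussian blobs in two dimensions.
    rng = np.random.RandomState(0)
    X = np.vstack([
        rng.normal(loc=0.0, scale=0.5, size=(300, 2)),
        rng.normal(loc=5.0, scale=0.5, size=(300, 2)),
        rng.normal(loc=(0.0, 5.0), scale=0.5, size=(300, 2)),
    ])

    model = MiniBatchKMeans(
        n_clusters=3,             # number of clusters and centroids
        init="k-means++",         # initialization method
        n_init=3,                 # initializations tried, best one is kept
        init_size=150,            # samples used for initialization
        batch_size=100,           # size of each mini batch
        max_iter=100,             # upper bound on passes over the data
        max_no_improvement=10,    # inertia-based early stopping
        tol=0.0,                  # 0.0 disables center-change stopping
        reassignment_ratio=0.01,  # how easily low-count centers move
        compute_labels=True,      # fill labels_ and inertia_ after fit
        random_state=0,           # seed for reproducibility
    )
    model.fit(X)

Here ``init_size`` is chosen larger than ``n_clusters``, as required above.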

:Attributes:


  - *cluster_centers_*

    Coordinates of cluster centers.


  - *labels_*

    Labels of each point (if compute_labels is set to True).


  - *inertia_*

    The value of the inertia criterion associated with the chosen
    partition (if compute_labels is set to True). The inertia is
    defined as the sum of squared distances of samples to their nearest
    cluster center.

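The relationship between these attributes can be checked with a short
sketch. It assumes the fitted model is a scikit-learn ``MiniBatchKMeans``
instance. The data are synthetic, and the variables ``manual_labels`` and
``manual_inertia`` are introduced here for illustration only. With
``compute_labels=True`` the recomputed values should agree with ``labels_``
and ``inertia_`` up to floating point error.

.. code-block:: python

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    # Synthetic data: two well separated blobs.
    rng = np.random.RandomState(0)
    X = np.vstack([
        rng.normal(0.0, 0.5, size=(200, 2)),
        rng.normal(5.0, 0.5, size=(200, 2)),
    ])

    model = MiniBatchKMeans(n_clusters=2, n_init=3, batch_size=50,
                            compute_labels=True, random_state=0).fit(X)

    # Squared distance from every sample to every cluster center.
    diffs = X[:, np.newaxis, :] - model.cluster_centers_[np.newaxis, :, :]
    sq_dist = (diffs ** 2).sum(axis=2)

    # Each sample belongs to its nearest center; the inertia sums the
    # squared distances to those nearest centers.
    manual_labels = sq_dist.argmin(axis=1)
    manual_inertia = sq_dist.min(axis=1).sum()
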

:Inputs:


:Outputs:
    **model** : model
        Model

*Ports*:

    **Outputs**:

        :model: model

            Model

*Configuration*:

    **n_clusters**
        The number of clusters to form as well as the number of
        centroids to generate.
    **max_iter**
        Maximum number of iterations over the complete dataset before
        stopping independently of any early stopping criterion heuristics.
    **max_no_improvement**
        Control early stopping based on the number of consecutive mini
        batches that do not yield an improvement on the smoothed inertia.

        To disable convergence detection based on inertia, set
        max_no_improvement to None.
    **batch_size**
        Size of the mini batches.
    **init_size**
        Number of samples to randomly draw to speed up the
        initialization (sometimes at the expense of accuracy): the
        algorithm is initialized by running a batch KMeans on a
        random subset of the data. This needs to be larger than n_clusters.
    **n_init**
        Number of random initializations that are tried.
        In contrast to KMeans, the algorithm is only run once, using the
        best of the ``n_init`` initializations as measured by inertia.
    **init**
        Method for initialization, defaults to 'k-means++':

        'k-means++': selects initial cluster centers for k-means
        clustering in a smart way to speed up convergence. See section
        Notes in k_init for more details.

        'random': choose k observations (rows) at random from data for
        the initial centroids.

        If an ndarray is passed, it should be of shape (n_clusters, n_features)
        and gives the initial centers.
    **compute_labels**
        Compute label assignment and inertia for the complete dataset
        once the minibatch optimization has converged in fit.
    **reassignment_ratio**
        Control the fraction of the maximum number of counts for a
        center to be reassigned. A higher value means that low count
        centers are more easily reassigned, which means that the
        model will take longer to converge, but should converge to a
        better clustering.
    **tol**
        Control early stopping based on the relative center changes as
        measured by a smoothed, variance-normalized estimate of the mean
        squared center position changes. This early stopping heuristic is
        closer to the one used for the batch variant of the algorithm
        but induces a slight computational and memory overhead over the
        inertia heuristic.

        To disable convergence detection based on normalized center
        change, set tol to 0.0 (default).
    **random_state**
        If int, random_state is the seed used by the random number generator;
        if RandomState instance, random_state is the random number generator;
        if None, the random number generator is the RandomState instance used
        by ``np.random``.

.. automodule:: node_clustering

.. class:: MiniBatchKMeansClustering