.. _`Generate classification dataset`:

.. _`org.sysess.sympathy.machinelearning.generate_classification`:

Generate classification dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: dataset_classes.svg
   :width: 48


Generates an artificial dataset useful for testing classification algorithms.

The node generates a random n-class classification problem.
It first creates clusters of points normally distributed (std=1) about the
vertices of an n_informative-dimensional hypercube with sides of length
2 * class_sep, and assigns an equal number of clusters to each class. It then
introduces interdependence between these features and adds further noise to
the data.

Prior to shuffling, X stacks a number of these primary 'informative' features,
'redundant' linear combinations of these, 'repeated' duplicates of sampled
features, and arbitrary noise for any remaining features.
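
The parameter names and descriptions match those of scikit-learn's
``sklearn.datasets.make_classification``. Assuming the node forwards its
configuration to that function (an assumption; this page does not state it),
the column layout described above can be sketched as follows. ``random_state``
is not one of the options listed here and is used only to make the sketch
reproducible.

.. code-block:: python

    from sklearn.datasets import make_classification

    # Assumption: the node delegates to scikit-learn's make_classification.
    X, y = make_classification(
        n_samples=200,
        n_features=10,
        n_informative=3,
        n_redundant=2,
        n_repeated=1,
        n_classes=2,
        n_clusters_per_class=2,
        shuffle=False,    # keep the unshuffled column order
        random_state=0,   # illustration only, not a listed node option
    )

    # Without shuffling, the columns are ordered as:
    # 0-2 informative, 3-4 redundant, 5 repeated, 6-9 noise.
    print(X.shape)  # (200, 10)
    print(y.shape)  # (200,)

With ``shuffle=False`` the repeated column (index 5) is an exact copy of one
of the informative or redundant columns (indices 0 to 4).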


*Configuration*:


  - *n_samples*

    The total number of samples generated.

  - *n_features*

    The number of features for each sample.

  - *n_informative*

    The number of informative features.

    Each class is composed of a number of gaussian clusters each located
    around the vertices of a hypercube in a subspace of dimension
    n_informative. For each cluster, informative features are drawn
    independently from N(0, 1) and then randomly linearly combined within
    each cluster in order to add covariance. The clusters are then placed
    on the vertices of the hypercube.

  - *n_redundant*

    The number of redundant features. These features are generated as random linear combinations of the informative features.

  - *n_repeated*

    The number of duplicated features, drawn randomly from the informative and the redundant features.

  - *n_classes*

    The number of classes (labels) of the classification problem.

  - *n_clusters_per_class*

    The number of clusters per class.

  - *weights*

    Comma-separated list of float weights, one per class.

    Determines the proportions of samples assigned to each class. If None,
    then classes are balanced. Note that if len(weights) == n_classes - 1,
    then the last class weight is automatically inferred. More than
    n_samples samples may be returned if the sum of weights exceeds 1.

  - *flip_y*

    The fraction of samples whose class is randomly exchanged.

  - *class_sep*

    Factor multiplying the hypercube size. Larger values spread the clusters
    further apart and make the classification task easier.

  - *shift*

    Shift features by the specified comma-separated value(s). If None, then features are shifted by a random value drawn from [-class_sep, class_sep].

  - *scale*

    Multiply features by the specified comma-separated value(s). If None, then features are scaled by a random value drawn from [1, 100].
    Note that scaling happens after shifting.

  - *hypercube*

    If true, the clusters are put on the vertices of a hypercube; otherwise they are put on the vertices of a random polytope.

  - *shuffle*

    Shuffle datapoints (otherwise given in cluster order)
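
The *weights* option is given as comma-separated text, and the last class
weight is inferred when only ``n_classes - 1`` values are supplied. A minimal
sketch of that rule is shown below. ``parse_weights`` is a hypothetical helper
written for illustration; it is not part of the node's API, and the node's own
parsing may differ.

.. code-block:: python

    def parse_weights(text, n_classes):
        """Hypothetical helper: turn '0.7, 0.2' into one weight per class."""
        text = text.strip()
        if not text or text == "None":
            # No weights given: classes are balanced.
            return [1.0 / n_classes] * n_classes
        weights = [float(part) for part in text.split(",")]
        if len(weights) == n_classes - 1:
            # The last class weight is inferred from the others.
            weights.append(1.0 - sum(weights))
        if len(weights) != n_classes:
            raise ValueError("expected n_classes or n_classes - 1 weights")
        return weights

    inferred = parse_weights("0.7, 0.2", n_classes=3)   # third weight is about 0.1
    balanced = parse_weights("None", n_classes=4)       # four equal weights
    print(inferred)
    print(balanced)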


*Input ports*:


*Output ports*:
    **X** : table
        The generated features, one row per sample.
    **Y** : table
        The class label of each sample.


**n_samples** (n_samples)
    The total number of samples generated.
**n_features** (n_features)
    The number of features for each sample.
**n_informative** (n_informative)
    The number of informative features.

    Each class is composed of a number of gaussian clusters each located
    around the vertices of a hypercube in a subspace of dimension
    n_informative. For each cluster, informative features are drawn
    independently from N(0, 1) and then randomly linearly combined within
    each cluster in order to add covariance. The clusters are then placed
    on the vertices of the hypercube.
**n_redundant** (n_redundant)
    The number of redundant features. These features are generated as random linear combinations of the informative features.
**n_repeated** (n_repeated)
    The number of duplicated features, drawn randomly from the informative and the redundant features.
**n_classes** (n_classes)
    The number of classes (labels) of the classification problem.
**n_clusters_per_class** (n_clusters_per_class)
    The number of clusters per class.
**weights** (weights)
    Comma-separated list of float weights, one per class.

    Determines the proportions of samples assigned to each class. If None,
    then classes are balanced. Note that if len(weights) == n_classes - 1,
    then the last class weight is automatically inferred. More than
    n_samples samples may be returned if the sum of weights exceeds 1.
**flip_y** (flip_y)
    The fraction of samples whose class is randomly exchanged.
**class_sep** (class_sep)
    Factor multiplying the hypercube size. Larger values spread the clusters
    further apart and make the classification task easier.
**shift** (shift)
    Shift features by the specified comma-separated value(s). If None, then features are shifted by a random value drawn from [-class_sep, class_sep].
**scale** (scale)
    Multiply features by the specified comma-separated value(s). If None, then features are scaled by a random value drawn from [1, 100].
    Note that scaling happens after shifting.
**hypercube** (hypercube)
    If true, the clusters are put on the vertices of a hypercube; otherwise they are put on the vertices of a random polytope.
**shuffle** (shuffle)
    Shuffle datapoints (otherwise given in cluster order)
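
Because scaling happens after shifting, each feature value ``v`` becomes
``(v + shift) * scale`` rather than ``v * scale + shift``. The sketch below
illustrates that ordering with per-feature values. ``shift_then_scale`` is a
hypothetical helper written for this example, not a function provided by the
node.

.. code-block:: python

    def shift_then_scale(rows, shift, scale):
        """Hypothetical helper: apply per-feature shift first, then scale."""
        return [
            [(value + s) * c for value, s, c in zip(row, shift, scale)]
            for row in rows
        ]

    rows = [[1.0, 2.0], [3.0, 4.0]]
    result = shift_then_scale(rows, shift=[1.0, 0.0], scale=[10.0, 2.0])
    print(result)  # [[20.0, 4.0], [40.0, 8.0]]

Scaling first and shifting afterwards would instead give ``11.0`` for the
first value, so the order matters whenever both options are set.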

.. automodule:: node_io

.. class:: MakeClassification