Generate classification dataset


Generates an artificial dataset useful for testing classification algorithms.

Generates a random n-class classification problem. It first creates clusters of points normally distributed (std=1) about the vertices of an n_informative-dimensional hypercube with sides of length 2 * class_sep, and assigns an equal number of clusters to each class. It then introduces interdependence between these features and adds various types of further noise to the data.
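The placement of clusters described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the node's implementation: the helper name place_cluster_centroids and its seed argument are hypothetical, the hypercube is assumed to be centred at the origin, and clusters are assumed to be handed to classes in turn.

```python
import random
from itertools import product


def place_cluster_centroids(n_informative, n_classes, n_clusters_per_class,
                            class_sep, seed=0):
    """Pick distinct hypercube vertices as centroids and assign them to classes."""
    rng = random.Random(seed)
    # All 2**n_informative vertices of a hypercube centred at the origin;
    # each coordinate is +/- class_sep, so every side has length 2 * class_sep.
    vertices = [
        tuple(sign * class_sep for sign in signs)
        for signs in product((-1.0, 1.0), repeat=n_informative)
    ]
    n_clusters = n_classes * n_clusters_per_class
    if n_clusters > len(vertices):
        raise ValueError("not enough hypercube vertices for the requested clusters")
    # Sample without replacement so no two clusters share a vertex.
    centroids = rng.sample(vertices, n_clusters)
    # Round-robin assignment gives every class the same number of clusters.
    return [(centroid, k % n_classes) for k, centroid in enumerate(centroids)]


placement = place_cluster_centroids(
    n_informative=3, n_classes=2, n_clusters_per_class=2, class_sep=1.5
)
for centroid, label in placement:
    print(label, centroid)
```

Points of each cluster would then be drawn around its centroid, so a larger class_sep moves the centroids further apart.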

Prior to shuffling, X stacks a number of these primary ‘informative’ features, ‘redundant’ linear combinations of these, ‘repeated’ duplicates of sampled features, and arbitrary noise for any remaining features.
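The column layout before shuffling can be inspected with a short sketch. It assumes the node forwards its configuration to scikit-learn's make_classification, which the parameter descriptions on this page closely mirror; the concrete parameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

# shuffle=False keeps the columns in their stacked order;
# shift=0 and scale=1 leave the feature values untouched.
X, y = make_classification(
    n_samples=200, n_features=8,
    n_informative=3, n_redundant=2, n_repeated=1,
    n_classes=2, n_clusters_per_class=2,
    shift=0.0, scale=1.0, shuffle=False, random_state=0,
)

informative = X[:, :3]   # primary features
redundant = X[:, 3:5]    # linear combinations of the informative features
repeated = X[:, 5]       # duplicate of one informative or redundant column
noise = X[:, 6:]         # remaining columns are filled with noise

# The redundant columns should be reproduced by a linear fit on the informative ones.
coef, *_ = np.linalg.lstsq(informative, redundant, rcond=None)
redundant_residual = np.abs(informative @ coef - redundant).max()

# The repeated column should match one of the first five columns exactly.
repeated_is_copy = any(np.array_equal(repeated, X[:, j]) for j in range(5))

print(X.shape, redundant_residual, repeated_is_copy)
```

With shuffling enabled, the same four kinds of columns are present but their positions are no longer known.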

Configuration:
  • n_samples

    The total number of samples generated.

  • n_features

    The number of features for each sample.

  • n_informative

    The number of informative features.

    Each class is composed of a number of Gaussian clusters, each located around a vertex of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within the cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube.

  • n_redundant

    The number of redundant features. These features are generated as random linear combinations of the informative features.

  • n_repeated

    The number of duplicated features, drawn randomly from the informative and the redundant features.

  • n_classes

    The number of classes (labels) for the classification problem.

  • n_clusters_per_class

    The number of clusters per class.

  • weights

    Comma-separated list of float weights, one per class.

    Determines the proportions of samples assigned to each class. If None, then classes are balanced. Note that if len(weights) == n_classes - 1, then the last class weight is automatically inferred. More than n_samples samples may be returned if the sum of weights exceeds 1.

  • flip_y

    The fraction of samples whose class label is assigned at random. Larger values add more label noise.

  • class_sep

    Factor multiplying the hypercube size. Larger values spread the clusters further apart and make the classification task easier.

  • shift

    Shift features by the specified comma-separated value(s). If None, then features are shifted by a random value drawn from [-class_sep, class_sep].

  • scale

    Multiply features by the specified comma-separated value(s). If None, then features are scaled by a random value drawn from [1, 100]. Note that scaling happens after shifting.

  • hypercube

    If true, clusters are placed on the vertices of a hypercube; otherwise they are placed on the vertices of a random polytope.

  • shuffle

    Shuffle the data points (otherwise they are returned in cluster order).
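The effect of weights and the order of shift and scale can be checked with a small sketch. As an assumption, it calls scikit-learn's make_classification directly with keyword arguments named like the configuration options above; the values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification

# Settings shared by both calls; flip_y=0 disables label noise so the
# class proportions follow the weights directly.
common = dict(
    n_samples=2000, n_features=4,
    n_informative=2, n_redundant=1, n_repeated=0,
    n_classes=2, n_clusters_per_class=1,
    weights=[0.9, 0.1], flip_y=0.0,
    class_sep=1.0, random_state=0,
)

# Same random_state, different shift and scale.
X_base, y = make_classification(shift=0.0, scale=1.0, **common)
X_moved, y_moved = make_classification(shift=1.0, scale=10.0, **common)

# Number of samples per class; roughly 90 % and 10 % of n_samples.
class_counts = np.bincount(y)

# Scaling is applied after shifting: X_moved = (X_base + shift) * scale.
shift_then_scale = np.allclose(X_moved, (X_base + 1.0) * 10.0)

print(class_counts, shift_then_scale)
```

Passing explicit shift and scale values in both calls keeps the random draws identical, which is what makes the two outputs directly comparable.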

Inputs:
Outputs:
X : table

X

Y : table

Y

class node_io.MakeClassification