Generate classification dataset


Generates an artificial dataset useful for testing classification algorithms.

Generate a random n-class classification problem. This initially creates clusters of points normally distributed (std=1) about the vertices of an n_informative-dimensional hypercube with sides of length 2 * class_sep, and assigns an equal number of clusters to each class. It then introduces interdependence between these features and adds various types of further noise to the data.

Prior to shuffling, X stacks a number of these primary ‘informative’ features, ‘redundant’ linear combinations of these, ‘repeated’ duplicates of sampled features, and arbitrary noise for any remaining features.
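The column budget implied by this description can be sketched in plain Python. The helper name `feature_layout` is hypothetical, and treating every leftover column as noise is read off the paragraph above rather than taken from the node's source.

```python
def feature_layout(n_features, n_informative, n_redundant, n_repeated):
    """Return how many columns of each kind X holds before shuffling."""
    # Whatever is not informative, redundant or repeated is filled with noise.
    n_noise = n_features - n_informative - n_redundant - n_repeated
    if n_noise < 0:
        raise ValueError(
            "n_informative + n_redundant + n_repeated must not exceed n_features"
        )
    return {
        "informative": n_informative,
        "redundant": n_redundant,
        "repeated": n_repeated,
        "noise": n_noise,
    }


layout = feature_layout(n_features=20, n_informative=2, n_redundant=2, n_repeated=0)
print(layout)
```

With these example values, 16 of the 20 columns carry no class information.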

Definition

Output ports

X table

X

Y table

Y

Configuration

class_sep (class_sep)

Factor multiplying the hypercube size. Larger values spread the clusters further apart and make the classification task easier.

flip_y (flip_y)

The fraction of samples whose class labels are randomly reassigned.
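A minimal sketch of this kind of label noise, assuming each sample is selected independently with probability flip_y and then given a uniformly drawn class. The function `flip_labels` and its `seed` argument are hypothetical; the node's actual procedure may differ.

```python
import random


def flip_labels(labels, n_classes, flip_y, seed=0):
    """Reassign a random class to roughly a fraction flip_y of the labels."""
    rng = random.Random(seed)
    flipped = list(labels)
    for i in range(len(flipped)):
        if rng.random() < flip_y:
            # The new class is drawn uniformly, so it may equal the old one.
            flipped[i] = rng.randrange(n_classes)
    return flipped


labels = [0] * 50 + [1] * 50
noisy = flip_labels(labels, n_classes=2, flip_y=0.1)
changed = sum(a != b for a, b in zip(labels, noisy))
print(f"{changed} of {len(labels)} labels changed")
```

Because a selected sample can be assigned its original class again, the number of labels that actually change is usually smaller than flip_y * n_samples under this assumption.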

hypercube (hypercube)

If true, clusters are placed on the vertices of a hypercube; otherwise, on the vertices of a random polytope.

n_classes (n_classes)

The number of classes (labels) for the classification problem

n_clusters_per_class (n_clusters_per_class)

The number of clusters per class.

n_features (n_features)

The number of features for each sample.

n_informative (n_informative)

The number of informative features.

Each class is composed of a number of Gaussian clusters, each located around a vertex of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within each cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube.
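The placement described above can be illustrated by enumerating the hypercube corners directly. This is a sketch under stated assumptions: the hypercube is taken to be centred on the origin, clusters are handed to classes in round-robin order, and the function names are hypothetical. Since a hypercube in n_informative dimensions has 2 ** n_informative vertices, the sketch rejects configurations that need more clusters than that.

```python
from itertools import product


def hypercube_vertices(n_informative, class_sep):
    """List the corners of a hypercube with side length 2 * class_sep."""
    return [list(v) for v in product((-class_sep, class_sep), repeat=n_informative)]


def assign_centroids(n_classes, n_clusters_per_class, n_informative, class_sep=1.0):
    """Pair each cluster with a class label and a distinct vertex."""
    n_clusters = n_classes * n_clusters_per_class
    vertices = hypercube_vertices(n_informative, class_sep)
    if n_clusters > len(vertices):
        raise ValueError(
            "n_classes * n_clusters_per_class must not exceed 2 ** n_informative"
        )
    # Cluster k goes to class k % n_classes (an assumed ordering).
    return [(k % n_classes, vertices[k]) for k in range(n_clusters)]


centroids = assign_centroids(
    n_classes=2, n_clusters_per_class=2, n_informative=2, class_sep=1.5
)
for label, vertex in centroids:
    print(label, vertex)
```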

n_redundant (n_redundant)

The number of redundant features. These features are generated as random linear combinations of the informative features.

n_repeated (n_repeated)

The number of duplicated features, drawn randomly from the informative and the redundant features.

n_samples (n_samples)

The total number of samples generated.

scale (scale)

Multiply features by the specified comma-separated value(s). If None, then features are scaled by a random value drawn from [1, 100]. Note that scaling happens after shifting.

shift (shift)

Shift features by the specified comma-separated value(s). If None, then features are shifted by a random value drawn from [-class_sep, class_sep].
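Because scaling is applied after shifting, each feature value x ends up as (x + shift) * scale rather than x * scale + shift. A small sketch of that ordering, using a hypothetical helper `shift_then_scale` with one shift and one scale value per feature:

```python
def shift_then_scale(row, shift, scale):
    """Apply a per-feature shift first, then a per-feature scale."""
    return [(x + s) * c for x, s, c in zip(row, shift, scale)]


row = [1.0, -2.0]
result = shift_then_scale(row, shift=[0.5, 0.5], scale=[10.0, 2.0])
print(result)  # [15.0, -3.0]
```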

shuffle (shuffle)

Shuffle datapoints (otherwise given in cluster order)

weights (weights)

Comma-separated list of float weights for each class.

Determines the proportions of samples assigned to each class. If None, then classes are balanced. Note that if len(weights) == n_classes - 1, then the last class weight is automatically inferred. More than n_samples samples may be returned if the sum of weights exceeds 1.
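The inference of the last class weight can be sketched as follows. The function `parse_weights` is hypothetical, and treating an empty string as None (balanced classes) is an assumption about how the configuration field is interpreted.

```python
def parse_weights(text, n_classes):
    """Turn a comma-separated string into one weight per class."""
    weights = [float(w) for w in text.split(",") if w.strip()]
    if not weights:
        # Assumed equivalent of None: balanced classes.
        return [1.0 / n_classes] * n_classes
    if len(weights) == n_classes - 1:
        # The missing last weight is whatever remains up to 1.
        weights.append(1.0 - sum(weights))
    if len(weights) != n_classes:
        raise ValueError("expected n_classes or n_classes - 1 weights")
    return weights


weights = parse_weights("0.7, 0.2", n_classes=3)
print(weights)
```

Here the third class receives the remaining share of about 0.1. The sketch does not check that the weights sum to 1, which matches the note above that a larger sum can yield more than n_samples samples.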

Implementation

class node_io.MakeClassification