Generate classification dataset

Generates an artificial dataset useful for testing classification algorithms.

Generate a random n-class classification problem. The generator first creates clusters of points, normally distributed (std=1) about the vertices of a hypercube with sides of length 2 * class_sep, and assigns an equal number of clusters to each class. It then introduces interdependence between these features and adds several kinds of further noise to the data.

Prior to shuffling, X stacks a number of these primary ‘informative’ features, ‘redundant’ linear combinations of these, ‘repeated’ duplicates of sampled features, and arbitrary noise for any remaining features.
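The column layout described above can be sketched in plain Python. The helper name feature_layout is hypothetical and not part of the node; it only counts how many columns of each kind X would hold before shuffling, assuming the remaining columns are filled with noise as the text states.

```python
def feature_layout(n_features, n_informative, n_redundant, n_repeated):
    """Return the kind of each column of X before shuffling (hypothetical helper)."""
    # Whatever is left after the three named groups is filled with noise.
    n_noise = n_features - n_informative - n_redundant - n_repeated
    if n_noise < 0:
        raise ValueError("informative + redundant + repeated exceeds n_features")
    return (["informative"] * n_informative
            + ["redundant"] * n_redundant
            + ["repeated"] * n_repeated
            + ["noise"] * n_noise)


layout = feature_layout(n_features=20, n_informative=2, n_redundant=2, n_repeated=0)
print(layout.count("noise"))  # 16
```

With 20 features, 2 informative and 2 redundant, the other 16 columns carry no class information.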

Documentation

Configuration:

n_samples

The total number of samples generated.

n_features

The number of features for each sample.

n_informative

The number of informative features.

Each class is composed of a number of Gaussian clusters, each located around a vertex of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within the cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube.

n_redundant

The number of redundant features. These features are generated as random linear combinations of the informative features.
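A minimal sketch of what "random linear combinations" means here, using only the standard library. The function name add_redundant and the coefficient range [-1, 1] are assumptions made for illustration; the node's actual coefficients are not documented on this page.

```python
import random


def add_redundant(informative_rows, n_redundant, seed=0):
    """Append n_redundant random linear combinations to each row (illustrative only)."""
    rng = random.Random(seed)
    n_informative = len(informative_rows[0])
    # One coefficient vector per redundant feature, shared by all samples.
    # The range [-1, 1] is an assumption made for this sketch.
    coefs = [[rng.uniform(-1.0, 1.0) for _ in range(n_informative)]
             for _ in range(n_redundant)]
    out = []
    for row in informative_rows:
        redundant = [sum(c * x for c, x in zip(coef, row)) for coef in coefs]
        out.append(list(row) + redundant)
    return out, coefs


rows, coefs = add_redundant([[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]], n_redundant=1)
```

Because the same coefficients are applied to every sample, a redundant column adds no information beyond the informative columns it is built from.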

n_repeated

The number of duplicated features, drawn randomly from the informative and the redundant features.
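Duplicated features can be pictured as copies of existing columns. In the sketch below the name add_repeated is hypothetical, and drawing the source columns with replacement is an assumption.

```python
import random


def add_repeated(rows, n_repeated, seed=0):
    """Append copies of randomly chosen existing columns (illustrative only)."""
    rng = random.Random(seed)
    n_existing = len(rows[0])  # informative + redundant columns
    # Pick source columns at random; the same column may be picked twice.
    picked = [rng.randrange(n_existing) for _ in range(n_repeated)]
    return [list(row) + [row[j] for j in picked] for row in rows], picked


rows, picked = add_repeated([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], n_repeated=2)
```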

n_classes

The number of classes (labels) for the classification problem.

n_clusters_per_class

The number of clusters per class.
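The relation between n_classes, n_clusters_per_class (taken here as the number of clusters in each class) and n_informative can be sketched as follows. Assuming every cluster sits on its own hypercube vertex, the total number of clusters cannot exceed 2 ** n_informative. The round-robin assignment is one way to give each class an equal number of clusters; the helper name assign_clusters is hypothetical.

```python
def assign_clusters(n_classes, n_clusters_per_class, n_informative):
    """Return the class that owns each cluster (hypothetical helper)."""
    n_clusters = n_classes * n_clusters_per_class
    # A hypercube in n_informative dimensions has 2 ** n_informative vertices.
    if n_clusters > 2 ** n_informative:
        raise ValueError("not enough hypercube vertices for all clusters")
    # Round-robin: cluster k belongs to class k % n_classes.
    return [k % n_classes for k in range(n_clusters)]


owners = assign_clusters(n_classes=3, n_clusters_per_class=2, n_informative=3)
print(owners)  # [0, 1, 2, 0, 1, 2]
```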

weights

Comma-separated list of float weights, one per class.

Determines the proportions of samples assigned to each class. If None, then classes are balanced. Note that if len(weights) == n_classes - 1, then the last class weight is automatically inferred. More than n_samples samples may be returned if the sum of weights exceeds 1.
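The weight rules above can be sketched as a small parser. The name parse_weights is hypothetical, and treating an empty string like None is an assumption about how the configuration field behaves.

```python
def parse_weights(text, n_classes):
    """Parse a comma-separated weight string into one weight per class (sketch)."""
    if text is None or not text.strip():
        # Balanced classes when no weights are given.
        return [1.0 / n_classes] * n_classes
    weights = [float(part) for part in text.split(",")]
    if len(weights) == n_classes - 1:
        # The last class weight is inferred from the others.
        weights.append(1.0 - sum(weights))
    if len(weights) != n_classes:
        raise ValueError("expected n_classes or n_classes - 1 weights")
    return weights


weights = parse_weights("0.7, 0.2", n_classes=3)
```

Here the third class receives the remaining weight of about 0.1, so roughly 70%, 20% and 10% of the samples fall into the three classes.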

flip_y

The fraction of samples whose class is randomly reassigned. Larger values add more label noise.
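Label noise of this kind can be sketched with the standard library. The function flip_labels is hypothetical; it assumes each sample is selected independently with probability flip_y and then given a uniformly drawn class, which may coincide with its original class.

```python
import random


def flip_labels(labels, n_classes, flip_y, seed=0):
    """Reassign a random class to roughly a fraction flip_y of the labels (sketch)."""
    rng = random.Random(seed)
    flipped = list(labels)
    for i in range(len(flipped)):
        if rng.random() < flip_y:
            # The new class is drawn uniformly and may equal the old one.
            flipped[i] = rng.randrange(n_classes)
    return flipped


labels = [0, 1] * 50
noisy = flip_labels(labels, n_classes=2, flip_y=0.1)
changed = sum(a != b for a, b in zip(labels, noisy))
```

Under these assumptions the number of labels that actually change is usually smaller than flip_y times the number of samples, because some redrawn labels land on the original class.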

class_sep

Factor multiplying the hypercube size. Larger values spread the clusters further apart.
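To see how class_sep controls cluster placement, the sketch below lists the vertices of a hypercube centred at the origin with sides of length 2 * class_sep, as in the description at the top of the page. The helper name hypercube_vertices is hypothetical.

```python
from itertools import product


def hypercube_vertices(n_informative, class_sep):
    """Vertices of a hypercube centred at the origin with side 2 * class_sep."""
    return [tuple(sign * class_sep for sign in signs)
            for signs in product((-1.0, 1.0), repeat=n_informative)]


verts = hypercube_vertices(n_informative=2, class_sep=1.5)
print(verts)  # [(-1.5, -1.5), (-1.5, 1.5), (1.5, -1.5), (1.5, 1.5)]
```

Doubling class_sep doubles the distance between neighbouring cluster centres while the spread of each cluster stays the same.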

shift

Shift features by the specified comma-separated value(s). If None, then features are shifted by a random value drawn from [-class_sep, class_sep].

scale

Multiply features by the specified comma-separated value(s). If None, then features are scaled by a random value drawn from [1, 100]. Note that scaling happens after shifting.
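Because scaling happens after shifting, each feature value x ends up as (x + shift) * scale. A minimal sketch with one shift and one scale value per feature (the function name shift_then_scale is hypothetical):

```python
def shift_then_scale(row, shift, scale):
    """Apply a per-feature shift followed by a per-feature scale."""
    return [(x + s) * c for x, s, c in zip(row, shift, scale)]


print(shift_then_scale([1.0, -2.0], shift=[2.0, 2.0], scale=[10.0, 0.5]))  # [30.0, 0.0]
```

Applying the two steps in the opposite order would give x * scale + shift, which is 12.0 instead of 30.0 for the first feature.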

hypercube

If true, clusters are placed on the vertices of a hypercube; otherwise, on the vertices of a random polytope.

shuffle

Shuffle the data points; otherwise they are returned in cluster order.
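When shuffling, the rows of X and the labels in Y have to be permuted together so that each sample keeps its label. A minimal sketch of that idea (shuffle_together is a hypothetical name, and this is not the node's own code):

```python
import random


def shuffle_together(X, y, seed=0):
    """Return X and y reordered by the same random permutation."""
    rng = random.Random(seed)
    order = list(range(len(X)))
    rng.shuffle(order)
    return [X[i] for i in order], [y[i] for i in order]


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
X_shuffled, y_shuffled = shuffle_together(X, y)
```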

Input ports:

Output ports:

X (table): X
Y (table): Y

Definition

Input ports

Output ports

X (table): X

Y (table): Y

class node_io.MakeClassification
