Generate classification dataset
Generates an artificial dataset useful for testing classification algorithms.
Generate a random n-class classification problem. This initially creates clusters of points normally distributed (std=1) about the vertices of a hypercube with sides of length 2 * class_sep, and assigns an equal number of clusters to each class. It introduces interdependence between these features and adds various types of further noise to the data.
Prior to shuffling, X stacks a number of these primary ‘informative’ features, ‘redundant’ linear combinations of these, ‘repeated’ duplicates of sampled features, and arbitrary noise for any remaining features.
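The description above matches scikit-learn's make_classification function. Assuming the node wraps that function (an assumption; this page does not say so), a roughly equivalent call in Python might look like the sketch below. The parameter values are arbitrary, and random_state is a scikit-learn argument that is not listed among the node's settings.

```python
from sklearn.datasets import make_classification

# Assumption: the node's settings map one-to-one onto these keyword arguments.
X, y = make_classification(
    n_samples=200,
    n_features=10,
    n_informative=4,
    n_redundant=2,
    n_repeated=1,
    n_classes=3,
    n_clusters_per_class=2,
    class_sep=1.0,
    random_state=0,  # fixed seed so repeated runs give the same data
)

print(X.shape)                  # (200, 10)
print(sorted(set(y.tolist())))  # [0, 1, 2]
```

X holds one row per sample and one column per feature; y holds the class label of each row.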
Documentation
Configuration:
n_samples
The total number of samples generated.
n_features
The number of features for each sample.
n_informative
The number of informative features.
Each class is composed of a number of Gaussian clusters, each located around a vertex of a hypercube in a subspace of dimension n_informative. For each cluster, informative features are drawn independently from N(0, 1) and then randomly linearly combined within the cluster in order to add covariance. The clusters are then placed on the vertices of the hypercube.
n_redundant
The number of redundant features. These features are generated as random linear combinations of the informative features.
n_repeated
The number of duplicated features, drawn randomly from the informative and the redundant features.
n_classes
The number of classes (or labels) of the classification problem.
n_clusters_per_class
The number of clusters per class.
weights
Comma-separated list of float weights, one for each class.
Determines the proportions of samples assigned to each class. If None, then classes are balanced. Note that if len(weights) == n_classes - 1, then the last class weight is automatically inferred. More than n_samples samples may be returned if the sum of weights exceeds 1.
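As a rough illustration of the weight rules above, the plain-Python sketch below fills in a missing last weight and converts weights to per-class sample counts. The helper name class_counts is hypothetical and the rounding is simplified; this is not the node's actual implementation.

```python
def class_counts(n_samples, n_classes, weights=None):
    """Hypothetical helper: approximate per-class sample counts."""
    if weights is None:
        # Balanced classes: every class gets the same weight.
        weights = [1.0 / n_classes] * n_classes
    elif len(weights) == n_classes - 1:
        # The last weight is inferred so that all weights sum to 1.
        weights = list(weights) + [1.0 - sum(weights)]
    # Simplified rounding; a real generator may distribute remainders differently.
    return [round(n_samples * w) for w in weights]


print(class_counts(100, 3))               # [33, 33, 33]
print(class_counts(100, 3, [0.5, 0.25]))  # [50, 25, 25]
print(class_counts(100, 2, [0.8, 0.6]))   # [80, 60]
```

The last call shows why more than n_samples samples can come back: the weights sum to 1.4, so the counts add up to 140.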
flip_y
The fraction of samples whose class labels are randomly reassigned. Larger values add more label noise.
class_sep
Factor multiplying the hypercube size. Larger values spread the clusters further apart.
shift
Shift features by the specified comma-separated value(s). If None, then each feature is shifted by a random value drawn from [-class_sep, class_sep].
scale
Multiply features by the specified comma-separated value(s). If None, then each feature is scaled by a random value drawn from [1, 100]. Note that scaling happens after shifting.
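Because scaling happens after shifting, a feature value x becomes (x + shift) * scale rather than x * scale + shift. A minimal plain-Python sketch, using a hypothetical helper name:

```python
def shift_then_scale(values, shift, scale):
    """Hypothetical helper: apply the shift first, then the scale."""
    return [(v + shift) * scale for v in values]


column = [0.0, 1.0, -2.0]
print(shift_then_scale(column, shift=1.0, scale=10.0))  # [10.0, 20.0, -10.0]
```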
hypercube
If true, clusters are placed on the vertices of a hypercube; otherwise, on the vertices of a random polytope.
shuffle
Shuffle the data points (otherwise they are given in cluster order).
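With shuffling disabled, the column layout described for the informative, redundant and repeated features can be inspected directly. The sketch below assumes the node behaves like scikit-learn's make_classification (an assumption, not something this page states) and leaves shift and scale at scikit-learn's defaults, so that repeated columns stay exact copies.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100,
    n_features=8,
    n_informative=3,
    n_redundant=2,
    n_repeated=1,
    shuffle=False,   # keep columns and rows in generation order
    random_state=0,  # scikit-learn argument, fixed for reproducibility
)

# Columns 0-2 are informative, 3-4 redundant, 5 repeated, 6-7 noise.
informative_and_redundant = X[:, :5]
repeated = X[:, 5]

# Redundant columns are linear combinations of the 3 informative ones,
# so the first 5 columns together still only have rank 3.
rank = np.linalg.matrix_rank(informative_and_redundant)

# The repeated column duplicates one of the first 5 columns.
is_copy = any(np.array_equal(repeated, X[:, j]) for j in range(5))

print(rank, is_copy)
```

Once shuffling is enabled, the same columns are still present but no longer appear in this order.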
Input ports:
Output ports:
- X (table): X
- Y (table): Y
Definition
Input ports
Output ports
- X
table
X
- Y
table
Y
- class node_io.MakeClassification