Expand description
§Model Selection methods
In statistics and machine learning we usually split our data into two sets: one for training and one for testing. We fit the model to the training data and then make predictions on the test data; this lets us detect overfitting and underfitting. Overfitting is bad because the model fits the training data too closely and generalizes poorly to new data. Underfitting is bad because the model is undertrained and does not capture the structure of the training data well. Splitting the data into multiple subsets helps us find the right combination of hyperparameters, estimate model performance, and choose the right model for the data.
In smartcore a random split into training and test sets can be quickly computed with the train_test_split helper function.
use smartcore::linalg::basic::matrix::DenseMatrix;
use smartcore::model_selection::train_test_split;
use smartcore::linalg::basic::arrays::Array;
//Iris data
let x = DenseMatrix::from_2d_array(&[
&[5.1, 3.5, 1.4, 0.2],
&[4.9, 3.0, 1.4, 0.2],
&[4.7, 3.2, 1.3, 0.2],
&[4.6, 3.1, 1.5, 0.2],
&[5.0, 3.6, 1.4, 0.2],
&[5.4, 3.9, 1.7, 0.4],
&[4.6, 3.4, 1.4, 0.3],
&[5.0, 3.4, 1.5, 0.2],
&[4.4, 2.9, 1.4, 0.2],
&[4.9, 3.1, 1.5, 0.1],
&[7.0, 3.2, 4.7, 1.4],
&[6.4, 3.2, 4.5, 1.5],
&[6.9, 3.1, 4.9, 1.5],
&[5.5, 2.3, 4.0, 1.3],
&[6.5, 2.8, 4.6, 1.5],
&[5.7, 2.8, 4.5, 1.3],
&[6.3, 3.3, 4.7, 1.6],
&[4.9, 2.4, 3.3, 1.0],
&[6.6, 2.9, 4.6, 1.3],
&[5.2, 2.7, 3.9, 1.4],
]).unwrap();
let y: Vec<f64> = vec![
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
];
let (x_train, x_test, y_train, y_test) = train_test_split(&x, &y, 0.2, true, None);
println!("X train: {:?}, y train: {}, X test: {:?}, y test: {}",
x_train.shape(), y_train.len(), x_test.shape(), y_test.len());
When we partition the available data into two disjoint sets, we drastically reduce the number of samples that can be used for training.
One way to solve this problem is to use k-fold cross-validation. With k-fold validation, the dataset is split into k disjoint sets. A model is trained using k - 1 of the folds, and the resulting model is validated on the remaining portion of the data.
The simplest way to run cross-validation is to use the cross_validate helper function with your estimator and the dataset.
use smartcore::linalg::basic::matrix::DenseMatrix;
use smartcore::model_selection::{KFold, cross_validate};
use smartcore::metrics::accuracy;
use smartcore::linear::logistic_regression::LogisticRegression;
use smartcore::api::SupervisedEstimator;
use smartcore::linalg::basic::arrays::Array;
//Iris data
let x = DenseMatrix::from_2d_array(&[
&[5.1, 3.5, 1.4, 0.2],
&[4.9, 3.0, 1.4, 0.2],
&[4.7, 3.2, 1.3, 0.2],
&[4.6, 3.1, 1.5, 0.2],
&[5.0, 3.6, 1.4, 0.2],
&[5.4, 3.9, 1.7, 0.4],
&[4.6, 3.4, 1.4, 0.3],
&[5.0, 3.4, 1.5, 0.2],
&[4.4, 2.9, 1.4, 0.2],
&[4.9, 3.1, 1.5, 0.1],
&[7.0, 3.2, 4.7, 1.4],
&[6.4, 3.2, 4.5, 1.5],
&[6.9, 3.1, 4.9, 1.5],
&[5.5, 2.3, 4.0, 1.3],
&[6.5, 2.8, 4.6, 1.5],
&[5.7, 2.8, 4.5, 1.3],
&[6.3, 3.3, 4.7, 1.6],
&[4.9, 2.4, 3.3, 1.0],
&[6.6, 2.9, 4.6, 1.3],
&[5.2, 2.7, 3.9, 1.4],
]).unwrap();
let y: Vec<i32> = vec![
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
];
let cv = KFold::default().with_n_splits(3);
let results = cross_validate(
LogisticRegression::new(), //estimator
&x, &y, //data
Default::default(), //hyperparameters
&cv, //cross validation split
&accuracy).unwrap(); //metric
println!("Training accuracy: {}, test accuracy: {}",
results.mean_train_score(), results.mean_test_score());
The function cross_val_predict has a similar interface to cross_validate, but instead of a test score it returns predictions for all samples, each computed by a model whose training folds did not contain that sample.
Structs§
- CrossValidationResult - Cross validation results.
- KFold - K-Folds cross-validator
- KFoldIter - An iterator over indices that split data into training and test set.
Traits§
- BaseKFold - An interface for the K-Folds cross-validator
Functions§
- cross_val_predict - Generate cross-validated estimates for each input data point. The data is split according to the cv parameter. Each sample belongs to exactly one test set, and its prediction is computed with an estimator fitted on the corresponding training set.
- cross_validate - Evaluate an estimator by cross-validation using the given metric.
- train_test_split - Splits data into 2 disjoint datasets.