As an example, Kuhn and Johnson use data to model the daily ridership of Chicago’s public train system. These are the hypothetical inner monologues in chronological order:
1) EDA
The daily ridership values between stations seemed extremely correlated
Weekday and weekend ridership look very different
One day in the summer of 2010 has an abnormally large number of riders
Which stations had the lowest daily ridership values?
2) Feature Engineering
Dates should at least be encoded as day-of-the-week and year
Maybe PCA could be used on the correlated predictors to make it easier for the models to use them
Hourly weather records should probably be averaged into daily measurements
3) Model Fitting
Let’s start with simple linear regression, K-nearest neighbors, and a boosted decision tree
4) Model Tuning
How many neighbors should be used?
Should we run a lot of boosting iterations or just a few?
How many neighbors seemed to be optimal for these data?
5) Model Evaluation
Which models have the lowest root mean squared errors?
6) EDA
Which days were poorly predicted?
7) Model Evaluation
Variable importance scores indicate that the weather information is not predictive. We’ll drop them from the next set of models.
It seems like we should add a lot of boosting iterations for that model
8) Feature Engineering
We need to encode holiday features to improve predictions
9) Model Evaluation
Let’s drop KNN from the model list
Ames Housing Data
In this section, we’ll look into the Ames housing data set and do exploratory data analysis. The data set contains information on 2,930 properties in Ames.
Selecting the first 5 out of 74 columns:
As we can see, the data is skewed to the right (there are more inexpensive houses than super expensive ones). Let’s do a log transformation:
We will add the log transformation as a new outcome column:
Splitting Data
Splitting the data into 80% training set and 20% test set using random sampling via the rsample package:
When there is a large class imbalance in classification, it is better to use stratified sampling. The training/test split is conducted separately within each class. For regression problems, the outcome variable can be binned into quartiles, and then stratified sampling can be conducted 4 separate times.
Let’s create the density distribution and quartiles:
We can graph the quartiles using the following code:
In our Ames case, expensive houses are rarer, so we can use the following stratified sampling instead:
Extracting the training and test sets:
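A minimal sketch of what the stratified split and extraction might look like with rsample (the seed is arbitrary and the outcome column is assumed to be Sale_Price):

```r
library(tidymodels)

set.seed(502)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)
```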
For time series data, we would use:
It assumes that the data have been presorted in chronological order.
Fitting Model
We are going to fit models using parsnip:
Parsnip enables a consistent model interface for different packages. For example, using the ranger package:
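A minimal sketch of the parsnip interface, showing how the engine can be swapped while the rest of the code stays the same (formula and object names are placeholders):

```r
# Ordinary least squares via the "lm" engine
lm_spec <- linear_reg() %>%
  set_engine("lm")

lm_fit <- lm_spec %>%
  fit(Sale_Price ~ Longitude + Latitude, data = ames_train)

# Same interface, different engine: a random forest via ranger
rf_spec <- rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression")
```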
Using Model Results
The broom::tidy() function converts the result into a tibble:
Predictions
Merging the results from multiple calls to predict():
Let’s fit the same data using random forest. The steps are the same:
Model Workflow
A workflow is similar to a pipeline in mlr3.
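A minimal sketch of bundling a preprocessor (here a formula) and a model into a workflow (object names are placeholders):

```r
lm_wflow <- workflow() %>%
  add_model(lm_spec) %>%
  add_formula(Sale_Price ~ Longitude + Latitude)

lm_fit <- fit(lm_wflow, ames_train)
```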
Workflow can be updated as well:
Or adding formulas:
And using general selectors:
Special Formulas and Inline Functions
Some models use extended formulas that base R functions cannot parse or execute. One way to get around this problem is to add the variables first and attach the supplementary model formula afterwards:
And using the strata() function from the survival package for survival analysis:
Creating Multiple Workflows at Once
We can create a set of formulas:
These representations can be crossed with one or more models (in this case lm and glm):
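A sketch of the idea, assuming a small list of location-based formulas (names are illustrative):

```r
lm_spec  <- linear_reg() %>% set_engine("lm")
glm_spec <- linear_reg() %>% set_engine("glm")

location <- list(
  longitude = Sale_Price ~ Longitude,
  latitude  = Sale_Price ~ Latitude
)

location_models <- workflow_set(
  preproc = location,
  models  = list(lm = lm_spec, glm = glm_spec),
  cross   = TRUE
)
```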
And extract a particular model based on id:
The option and result columns are populated with specific types of objects that result from resampling.
For now, we fit each model using map():
Evaluating the Test Set
To test on the test set, we could invoke the convenience function last_fit() to fit on the training set and evaluate it on the test set:
The .workflow column contains the fitted workflow and can be extracted:
And the performance metrics and predictions:
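A minimal sketch of these steps (object names are placeholders):

```r
final_lm_res <- last_fit(lm_wflow, ames_split)

# The fitted workflow can be pulled out of the result
fitted_lm_wflow <- extract_workflow(final_lm_res)

# Test set metrics and predictions
collect_metrics(final_lm_res)
collect_predictions(final_lm_res)
```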
Recipes
Feature engineering includes transformations and encodings of the data. The recipes package can be used to combine different feature engineering and preprocessing tasks into a single object and then apply those transformations to the data.
Encoding Qualitative Data in a Numeric Format
step_unknown() can be used to change missing values to a dedicated factor level.
step_novel() can allocate a new level for factor values not encountered in the training data.
step_other() can be used to analyze the frequencies of the factor levels and convert infrequently occurring values to a level of “other”, with a specified threshold.
step_dummy() creates dummy variables and has a recipe naming convention.
There are other formats such as Feature Hashing and Effects/Likelihood encodings that we shall cover later.
For interactions, in the Ames example, step_interact(~ Gr_Liv_Area:starts_with("Bldg_Type_")) makes dummy variables and forms the interactions. In this case, an example interaction column is Gr_Liv_Area_x_Bldg_Type_Duplex.
Let’s create a recipe that (see the sketch after this list):
log transforms Gr_Liv_Area
creates dummy variables for all nominal predictors
lumps the bottom 1% of neighborhoods (by count) into an “other” level
adds an interaction between Gr_Liv_Area and Bldg_Type
adds a spline for Latitude
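A sketch of such a recipe; the step order and arguments are one reasonable choice, not the only one:

```r
ames_rec <- recipe(
    Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + Latitude,
    data = ames_train
  ) %>%
  step_log(Gr_Liv_Area, base = 10) %>%                       # log transform
  step_other(Neighborhood, threshold = 0.01) %>%             # lump rare neighborhoods
  step_dummy(all_nominal_predictors()) %>%                   # dummy variables
  step_interact(~ Gr_Liv_Area:starts_with("Bldg_Type_")) %>% # interactions
  step_ns(Latitude, deg_free = 20)                           # natural spline
```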
Adding the recipe to the workflow:
Extracting fitted parsnip object:
It is possible to call the tidy() function on a specific step based on the number identifier:
Interaction
Let us investigate the interaction between Gr_Liv_Area and Bldg_Type more deeply. By plotting Sale_Price vs Gr_Liv_Area and separating by Bldg_Type, we can see a difference in the slopes:
Spline
The relationship between Sale_Price and Latitude is essentially nonlinear. Latitude is similar to Neighborhood in the sense that both variables try to capture the variation in price due to location, but in different ways.
The following code graphs splines with different degrees of freedom (number of knots across latitude); a df of 20 (we will tune this parameter later) looks reasonable:
PCA
In the Ames data, several predictors measure size of the property such as total basement size (Total_Bsmt_SF), size of the first floor (First_Flr_SF), gross living area (Gr_Liv_Area), etc. We can use the following recipe step:
This will convert the columns matching the regex and turn them into PC1, …, PC5 (default number of components to keep is 5).
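A sketch of the idea; the regular expression used to capture the size-related columns is an assumption:

```r
ames_pca_rec <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_normalize(matches("(SF$)|(Gr_Liv)")) %>%
  step_pca(matches("(SF$)|(Gr_Liv)"), num_comp = 5)
```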
Row Sampling
Downsampling the data keeps the minority class and takes a random sample of the majority class so class frequencies are balanced.
Upsampling replicates samples from the minority class.
General Transformations
step_mutate() can be used to conduct straightforward transformations:
skip = TRUE
Certain recipe steps should not be applied when predict() is called. For example, downsampling should be applied during fit() but not during predict(). By default, skip = TRUE for step_downsample().
Getting Info from tidy()
We can get a summary of the recipe steps via tidy():
We can also get more info via the id parameter:
Column Roles
Suppose ames has an address column that you want to keep in the dataset but not use in modeling. This can be helpful for tracking rows when the data are resampled. This can be done with:
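A minimal sketch, assuming the hypothetical column is called address:

```r
ames_rec <- recipe(Sale_Price ~ ., data = ames_train) %>%
  update_role(address, new_role = "street address")
```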
Metrics
Regression Metrics
Using the yardstick package, we can produce performance metrics. For example, taking the regression model we fitted earlier, we evaluate the performance on the test set:
To compute multiple metrics at once, we can use a metric set:
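A sketch with yardstick, assuming ames_test_res is a data frame holding the observed outcome and a .pred column:

```r
ames_metrics <- metric_set(rmse, rsq, mae)

ames_test_res %>%
  ames_metrics(truth = Sale_Price, estimate = .pred)
```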
Binary Classification Metrics
We will use two_class_example dataset in modeldata package to demonstrate classification metrics:
A variety of yardstick metrics:
Or combining all 3:
The f_meas() metric emphasizes the positive class (event of interest). By default, the first level of the outcome factor is the event of interest. We can use the event_level argument to indicate which is the positive class:
There are other classification metrics that use the predicted probabilities as inputs rather than hard class predictions. For example, the Receiver Operating Characteristic (ROC) curve computes the sensitivity and specificity over a continuum of different event thresholds.
Plotting the ROC curve:
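A minimal sketch using the two_class_example data:

```r
two_class_curve <- roc_curve(two_class_example, truth, Class1)
autoplot(two_class_curve)

# The area under the curve as a single summary number
roc_auc(two_class_example, truth, Class1)
```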
If the curve is close to the diagonal line, the model’s predictions are no better than random guessing.
Multiclass Classification Metrics
Using the hpc_cv dataset:
The hpc_cv dataset contains the predicted classes and class probabilities for a linear discriminant analysis model fit to the HPC data set from Kuhn and Johnson (2013). These data are the assessment sets from a 10-fold cross-validation scheme. The data has columns for the true class (obs), the class prediction (pred), and columns for each class probability (VF, F, M, and L). Additionally, a column for the resample indicator is included.
Note that the estimator is “multiclass” vs the previous “binary”.
Sensitivity
There are a few ways to calculate sensitivity for multiclass classification. Below we set up some code that can be used to calculate sensitivity:
Macro-Averaging
Computes a set of one-versus-all metrics and averages them:
Macro-Weighted Averaging
The average is weighted by the number of samples in each class:
Micro-Averaging
Aggregates the totals for each class and then computes a single metric from the aggregates:
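A sketch of how the estimator argument selects among these averaging schemes:

```r
sensitivity(hpc_cv, obs, pred, estimator = "macro")
sensitivity(hpc_cv, obs, pred, estimator = "macro_weighted")
sensitivity(hpc_cv, obs, pred, estimator = "micro")
```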
Multiclass ROC:
There are multiclass analogs for the ROC as well:
We can plot the ROC curve for each group by the folds:
Grouping the accuracy by resampling groups:
Resampling
When we measure performance on the same data that we used for training, the data is said to be resubstituted.
Predictive models that are capable of learning complex trends are known as low bias models. Many black-box ML models have low bias, while models such as linear regression and discriminant analysis are less adaptable and are considered high bias models.
Resampling is only conducted on the training set. For each iteration of resampling, the data are partitioned into an analysis set and an assessment set. Model is fit on the analysis set and evaluated on the assessment set.
Cross-Validation
Generally we prefer a 10-fold cross-validation as a default. Larger values result in resampling estimates with small bias but large variance, while smaller values have large bias but low variance.
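A minimal sketch (the seed is arbitrary):

```r
set.seed(1001)
ames_folds <- vfold_cv(ames_train, v = 10)
ames_folds
```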
The splits column contains the information on how to partition the data for each fold. Note that the data are not replicated in memory across the folds; we can show this by noticing that the object size of ames_folds is not 10 times the size of ames_train:
To retrieve the first fold training data:
And test set:
Repeated CV
Depending on the size of the data, V-fold CV can be noisy. One way to reduce the noise is to repeat the V-fold procedure multiple times (R times). The standard error would then be \(\frac{\sigma}{\sqrt{V\times R}}\).
As long as we have a lot of data relative to \(V \times R\), the summary statistics will tend toward a normal distribution by the CLT.
Monte Carlo Cross-Validation (MCCV)
The difference between MCCV and V-fold CV is that the data assigned to the assessment set in each resample are selected at random:
This results in assessment sets that are not mutually exclusive.
Validation Sets
If the dataset is very large, it might be adequate to reserve a validation set and only do the validation once rather than CV.
Bootstrapping
A bootstrap sample is a sample that is drawn with replacement. This means that some data points might be selected more than once. Furthermore, each data point has a 63.2% chance of inclusion in the analysis set at least once. The assessment set contains all of the training set samples that were not selected for the analysis set; it is often called the out-of-bag sample.
Compared to CV, bootstrapping has low variance but high bias. For example, if the true accuracy of the model is 90%, bootstrapping will tend to estimate values less than 90%.
Rolling Forecasts
Time series forecasts require rolling resampling. In this example, a training set of 6 samples is used with an assessment (validation) size of 30 days and a 29-day skip:
The analysis() function shows the indices of the training set and the assessment() function shows the indices of the validation set:
Estimating Performance
Using the earlier CV folds, we can run a resampling:
.metrics contains the assessment set performance statistics.
.notes contains any warnings or errors generated during resampling.
.predictions contain out-of-sample predictions when save_pred = TRUE.
To obtain the performance metrics:
To obtain the assessment set predictions:
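A sketch of the whole sequence; the workflow name is a placeholder:

```r
keep_pred <- control_resamples(save_pred = TRUE)

set.seed(1003)
rf_res <- rf_wflow %>%
  fit_resamples(resamples = ames_folds, control = keep_pred)

collect_metrics(rf_res)      # performance averaged over the assessment sets
collect_predictions(rf_res)  # out-of-sample predictions
```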
We can check for predictions that are way off by first plotting the predicted values against the observed values:
There are two houses in the test set that are overpredicted by the model:
We then search for the data rows in the training set:
To use a validation set:
Parallel Processing
The number of possible worker processes is determined by the parallel package:
For fit_resamples() and other functions in tune, the user needs to register a parallel backend package. On Unix, we can load the doMC package:
For all operating systems, we can use the doParallel package:
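A sketch of registering a doParallel backend around a resampling call:

```r
library(doParallel)

# Create a cluster from the physical cores and register it as the backend
cl <- parallel::makePSOCKcluster(parallel::detectCores(logical = FALSE))
registerDoParallel(cl)

# ... run fit_resamples(), tune_grid(), etc. here ...

parallel::stopCluster(cl)
```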
Depending on the data and model type, the linear speed-up deteriorates after four to five cores. Note also that memory use increases because the data set is replicated across the worker processes.
Saving the Resampled Objects
The models created during resampling are not retained by default. However, we can extract information from them.
Let us review what we currently have for recipe:
To save and extract the linear model coefficients from a workflow:
An example of extracting the coefficients from two folds:
All the results can be flattened into a tibble:
Comparing Models
Let us create 3 different linear models that add interaction and spline incrementally:
Using workflow_map(), we will run resampling for each model:
Using helper functions to report performance statistics:
Now let’s add another random forest model to the workflow set:
To plot the \(R^{2}\) confidence intervals for easy comparison:
Let us gather and summarize the correlations of the resampling statistics across the models:
There are large correlations across models, implying large within-resample correlations. Another way to view this is to plot each model with lines connecting the resamples:
If there are no correlations within resamples, there would not be any parallel lines. Testing whether the correlations are significant:
So why is this important? Consider the variance of a difference of two variables:
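For two statistics A and B measured on the same resamples:

\[\operatorname{Var}(A - B) = \operatorname{Var}(A) + \operatorname{Var}(B) - 2\operatorname{Cov}(A, B)\]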
Hence, if there is a significant positive covariance, the variance of the difference is reduced, making it easier for a statistical test to detect a difference between models.
Simple Hypothesis Testing Methods
With the ANOVA model, the predictors are binary dummy variables for different groups. The \(\beta\) parameters estimate whether two or more groups are different from one another:
Suppose the outcomes \(y_{ij}\) are the individual resample \(R^{2}\) statistics and the X1, X2, and X3 columns are indicators for the models:
\(\beta_{0}\) is the mean \(R^{2}\) statistic for the basic linear model
\(\beta_{1}\) is the change in the mean \(R^{2}\) statistic when interactions are added to the basic linear model
\(\beta_{2}\) is the change in the mean \(R^{2}\) statistic between the basic linear model and the random forest model
\(\beta_{3}\) is the change in the mean \(R^{2}\) statistic between the basic linear model and the model with interactions and splines
We still need to contend with how to handle the resample-to-resample effect. However, when considering two models at a time, the outcomes are matched by resample and the differences do not contain the resample-to-resample effect:
A Random Intercept Model (Bayesian Method)
Historically, the resample groups are treated as a block effect and an appropriate term is added to the model. Alternatively, the resample effect can be treated as a random effect, where the resamples were drawn at random from a larger population of possible resamples. Methods for fitting an ANOVA model with this type of random effect include linear mixed models and Bayesian hierarchical models.
The ANOVA model assumes the errors \(e_{ij}\) to be independent and to follow a Gaussian distribution \(N(0, \sigma)\). A Bayesian linear model makes additional assumptions: we specify a prior distribution for the model parameters, for example fairly uninformative priors:
To adapt the ANOVA model so that the resamples are adequately modeled, we consider a random intercept model. We assume that the resamples impact the model only through the intercept and that the regression parameters \(\beta_{j}\) have the same relationship across resamples.
The perf_mod() function in the tidyposterior package can be used for comparing resampled models. For workflow sets, it creates an ANOVA model where the groups correspond to the workflows. If one of the workflows in the set had tuning parameters, the best tuning parameter set for each workflow is used in the Bayesian analysis. In this case, perf_mod() focuses on between-workflow comparisons.
For objects that contain a single model and tuned using resampling, perf_mod() makes within-model comparisons. In this case, the Bayesian ANOVA model groups the submodels defined by the tuning parameters.
perf_mod() uses default priors for all parameters except for the random intercepts which we would need to specify.
Next we take random samples from the posterior distribution:
Plotting the 3 posterior distributions from the random samples:
Or we could call autoplot() directly on the perf_mod() object:
Once we have the posterior distribution for the parameters, it is trivial to get the posterior distributions for combinations of the parameters. For example, the contrast_models() can compare the differences between models:
And plotting the difference histogram:
The summary() method computes the credible intervals as well as a probability column giving the proportion of the posterior that is greater than 0.
Although statistically significant, the mean difference is too small to be practically meaningful. To check against a practical significance threshold of 2%, we can employ the Region of Practical Equivalence (ROPE) estimate via the size option of the summary function:
The pract_equiv column is the proportion of the posterior that is within [-size, size]. pract_neg and pract_pos are proportions below and above this interval. As we can see, the mean is not practically significant.
The autoplot() method can show the pract_equiv results that compare each workflow to the current best (in this case the random forest model, which has a value of 1):
Model Tuning
Parameters that cannot be directly estimated from the training data but must be specified ahead of time are called hyperparameters or tuning parameters.
Some examples of tuning parameters are:
Boosting
Boosting is an ensemble model that combines a series of base models that are created sequentially and each model is dependent on the previous models. The number of boosting iterations is a tuning parameter.
Multilayer Perceptron
Also known as a single-layer artificial neural network. The hidden units are linear combinations of the predictors that are fed into an activation function such as a sigmoid. The number of hidden units and the type of activation function can be tuned.
Gradient Descent
Gradient descent parameters such as the learning rate, momentum, and the number of optimization iterations/epochs can be tuned.
PCA
The number of principal components to be used can be tuned.
Imputation by K-Nearest Neighbors
The number of neighbors can be tuned.
Logistic Regression
The link function (such as logit, probit, or complementary log-log) can be tuned.
But note that we should never tune a parameter in the prior distribution for Bayesian analysis as that will defeat the purpose of a prior distribution.
Likelihoods
But what criterion do we optimize? For cases where the statistical properties of the tuning parameter are tractable, we can maximize the likelihood or an information criterion. Note that these statistical properties may not align with the results achieved using accuracy-oriented metrics.
Consider fitting a logistic regression model to the iris dataset. We would like to optimize the link function. The three link functions under consideration are logit, probit, and complementary log-log.
The statistical approach is to compute the log-likelihood for each model and choose the model with the largest value. Traditionally, the likelihood is computed from the training data.
Let’s create a function to extract the likelihood from fitting the model:
According to the log-likelihood, the logit link function is the best.
But are the differences significant? We would need to see the confidence intervals as well, which we can get by resampling the statistics. In the yardstick package, the mn_log_loss() function is used to estimate the negative log-likelihood:
Next we test it on the different link functions:
Let’s graph the confidence intervals for log-likelihood:
As we can see from the graphs, there is no difference in the AUC but considerable difference in terms of log-likelihoods.
Search Methods
Tuning parameters optimization falls into 2 categories:
Grid Search
The user predefines a set of parameter values to evaluate. An example would be a space-filling design.
Iterative Search
Also known as sequential search. New parameter combinations are based on previous results. Almost any nonlinear optimization method is appropriate, and in some cases, an initial set of results is required to start the optimization process. An example would be the Nelder-Mead simplex method.
Hybrid Search
After an initial grid search, a sequential optimization can start from the best grid combination.
Tuning Parameters in Parsnip Model Specification
There are two kinds of parameter arguments in a parsnip model specification: main arguments, which are used across multiple engines, and engine-specific parameters. For example, for the rand_forest parsnip model, trees and min_n are main parameters and regularization.factor is an engine-specific parameter:
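A sketch of the distinction (argument values are arbitrary):

```r
rand_forest(trees = 2000, min_n = 10) %>%             # main arguments
  set_engine("ranger", regularization.factor = 0.5)   # engine-specific argument
```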
We can use the extract_parameter_set_dials() function to enumerate the tuning parameters for a mlp object:
When both recipe and model specification are combined in a workflow, extract_parameter_set_dials will show both tuning parameters:
And to extract the range from a particular parameter object:
Each tuning parameter argument has a corresponding function in the dials package that usually has the same name and gives the default range:
Some of them have a slight tweak in their names:
Some parameters do not have a default range until you use the finalize() function. For example, the following specification would have a “?” for the mtry value because the algorithm doesn’t know the number of columns:
The finalize() function can execute the recipe once to obtain the dimensions:
The range of the parameters can also be updated:
But note that certain tuning parameters are associated with transformations. For example, the penalty parameter has a transformation by default (making sure the penalty is positive):
As you can see above, the penalty is transformed. To pass in a range on the transformed scale:
Grid Search
Grid search is when we predefine a set of parameter values to evaluate. The parameter space can become unmanageable because of the curse of dimensionality.
We will consider a multilayer perceptron model with the following tuning parameters:
number of hidden units
number of epochs/iterations
amount of weight decay penalization
Regular Grids
The crossing() function is one way to create a regular grid:
Or using grid_*() functions:
The levels argument can also take a named vector of values:
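A sketch combining these options (parameter values are illustrative):

```r
# A regular grid via tidyr::crossing()
crossing(
  hidden_units = 1:3,
  penalty = c(0.0, 0.1),
  epochs = c(100, 200)
)

# Or via dials, with a named vector giving the number of levels per parameter
grid_regular(
  hidden_units(), penalty(), epochs(),
  levels = c(hidden_units = 3, penalty = 2, epochs = 2)
)
```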
Nonregular Grids
The grid_random() function generates independent uniform random numbers across parameter ranges:
The problem with grid_random() is the potential of overlapping parameter combinations. For example, rows with the same hidden_units value.
A better approach is to use a type of experimental design called a space-filling design that minimizes overlapping combinations and spaces the points farther apart from each other. The default design is the maximum entropy design.
Evaluating the Grid
To choose the best tuning parameter combination, each candidate set is evaluated on data that were not used to train the model. Resampling or a validation set can be used for this purpose.
In this example, we will use the “cell” dataset which consists of 56 imaging measurements on 2,019 human breast cancer cells. The predictors represent shape and intensity characteristics of different parts of the cells (e.g., the nucleus, the cell boundary, etc.). Each cell belongs to one of two classes (PS vs WS). There is a high correlation between the predictors and many of the predictors have skewed distributions.
Due to the high correlation, we will use PCA and tune the number of components to keep. A Yeo-Johnson transformation helps make the distributions more symmetric given the skewness. Normalization is applied twice because the lower-rank PCA components tend to have a wider range than the higher-rank components, even though they are technically on the same scale.
Note that in step_pca(), using zero PCA components is a shortcut to skip the feature extraction. In this way, resampling can include the original predictors and compared with results including PCA components.
The default epoch ranges are too wide and num_comp range too low:
So let’s increase it:
Let’s start the tuning. The interface to tune_grid() is similar to fit_resamples(). To start, we evaluate a regular grid with 3 levels:
It is possible to graph the performance profiles across tuning parameters:
The number of epochs doesn’t seem to matter much to the AUC. The number of hidden units appears to matter when the amount of regularization penalty is low. When the penalty is high, the ROC across hidden units is pretty flat.
We can see the top 5 ranked performance:
To use a space-filling design, the grid argument can be given an integer value. For example, to evaluate the same range using a maximum entropy design with 20 candidate values:
Because there is no fixed regular design, we can only graph the marginal effects plot via autoplot():
We can see the top 5 ranked performance:
Note the contrasting results from the regular grid. Generally, it is a good idea to evaluate the models over multiple metrics, and it is better to choose a slightly suboptimal but simpler model. In this case, that would be larger penalty values with fewer hidden units.
To save intermediary values in tune_grid(), use the extract option of control_grid() and set save_pred to TRUE.
Finalizing the Model
To replace all the tune parameters with the best result from tuning:
And fitting a model to the entire training set:
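A sketch combining the two steps; the object names (mlp_reg_tune, mlp_wflow, cells_train) are placeholders for the tuning result, workflow, and training data:

```r
best_params <- select_best(mlp_reg_tune, metric = "roc_auc")

final_mlp_wflow <- mlp_wflow %>%
  finalize_workflow(best_params)

final_mlp_fit <- final_mlp_wflow %>%
  fit(cells_train)
```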
Note there are different finalize functions depending on whether one is using a workflow, model or recipe:
finalize_workflow()
finalize_model()
finalize_recipe()
Tools for Creating Tuning Specifications
The usemodels package can produce R code for tuning a model given just the data frame and model formula. For example:
Parallel Processing
There are two types of parallelization strategies in tune functions, set via the parallel_over argument. The default is "resamples": each fold is preprocessed only once, and each worker evaluates all tuning combinations for its fold, so the number of usable workers cannot exceed the number of folds.
The other option is "everything": preprocessing is replicated multiple times across workers, but each tuning combination can be assigned to a different worker.
Access to Global Variables
When you pass a variable defined in the global environment, it can cause problems when running parsnip in parallel. It is better to pass values rather than references by using the !! operator.
When you have more than one value, use the !!! operator:
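A sketch of both patterns (the values shown are arbitrary):

```r
coef_penalty <- 0.1
mcmc_args <- list(chains = 3, iter = 1000, cores = 3)

# !! forces a single value to be inserted into the specification now
glmnet_spec <- linear_reg(penalty = !!coef_penalty) %>%
  set_engine("glmnet")

# !!! splices a list of values into the engine arguments
stan_spec <- linear_reg() %>%
  set_engine("stan", !!!mcmc_args)
```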
Racing Methods
In racing methods, tuning parameter combinations that perform poorly on the first few resamples are eliminated and not evaluated on subsequent resamples.
The finetune package contains functions for racing (mirrors tune_grid):
show_best() returns the best models, but only among the configurations that were never eliminated. Note that fewer than 10 configurations are returned:
Iterative Search
Also known as sequential search. Parameter combinations are discovered sequentially based on previous results.
We are going to use a SVM model and the cell dataset to demonstrate iterative search. The two tuning parameters are the SVM cost value and the RBF kernel parameter \(\sigma\). The workflow and tuning spec are:
To improve the visualization of the search, we will narrow down the kernel parameter range:
Bayesian Optimization
Bayesian optimization analyzes the current resampling results and creates a surrogate predictive model to approximate the underlying unknown black-box function (ROC vs cost and \(\sigma\)). An acquisition function applied to the surrogate suggests new tuning parameter values; the suggested combination is then resampled and the cycle repeats.
We first address the surrogate model, for which we use a Gaussian process model. Gaussian process models are specified by mean and covariance functions. The covariance is made up of RBF basis kernel functions, so as the distance between two tuning parameter combinations increases, the covariance decreases exponentially.
A class of acquisition functions facilitates the trade-off between mean and variance (exploration vs. exploitation). Exploration biases the search toward higher-variance regions, while exploitation biases it toward regions closer to existing results. Expected improvement is one of the most commonly used acquisition functions; a higher-variance region tends to give a higher expected improvement. Another approach uses confidence bounds, finding the tuning parameter combination associated with the largest confidence interval.
Given a starting initial param region:
Using the initial regular grid search, we continue with the Gaussian process model using the tune_bayes() function:
The autoplot() shows how the outcome changed over the search:
And also the parameter values over iterations:
Note that rather than using a regular design for the initial grid, a space-filling design would probably be a better choice.
Simulated Annealing
A key distinguishing feature of simulated annealing, unlike most gradient-based optimization routines, is that it can reassess previous solutions. The process starts with an initial value and creates a new candidate that is a small perturbation of the previous value.
The acceptance probability:
\[\begin{aligned}
P(\text{accept parameters at iteration }i) &= e^{c \times D_{i} \times i}\\
i &= \text{iteration number}\\
c &= \text{constant}\\
D_{i} &= \% \text{ difference between the old and new values}
\end{aligned}\]
The probability is then compared to a random uniform number. If the acceptance probability is lower than the random uniform number, the search discards the current parameters and the next iteration creates its candidate value in the neighborhood of the previous value. Otherwise, the next iteration forms its candidate in the neighborhood of the current values.
The neighborhood is determined by scaling the current candidate to be between 0 and 1 based on the parameter range. Then radius values between 0.05 and 0.15 are considered. For nonnumeric parameters, we assign a probability for how often the parameter value changes.
To illustrate simulated annealing, we use the glmnet model and tune the following parameters:
The no_improve argument is an integer specifying how many iterations the search will continue without improved results before stopping.
Running simulated annealing:
We can also plot the performance and the tuning parameters selection across iterations:
Screening Multiple Models
Workflow sets provide an interface to fit and compare multiple models. In this example we will use the concrete dataset, where we are predicting the compressive strength of concrete mixtures using ingredients as predictors.
The same set of ingredients is used multiple times. To prevent artificially inflating the performance estimates, we take the mean values:
And doing a 10-fold CV:
We will create a set of model specifications ranging from linear regression to neural nets. We will also need two parsnip extension packages:
Different models might require different recipes, so we created two of them:
For the first workflow set, we combine the recipe with only the models that require the predictors to be in the same units:
We can extract a specific workflow using id:
Or adding new spec:
Let us create another workflow set that uses dplyr selectors for the outcome and predictors:
And finally we assemble the set that uses nonlinear terms and interactions:
Binding all workflow sets:
To run the tuning for all workflow sets, we would use workflow_map(). We will use the same seed for each execution of tune_grid. Note that this would take a very long time to run.
Because the above would take a very long time to run, a more realistic option is to use the racing approach, in which parameter sets that perform badly early on are not considered in subsequent resamples.
Ranking the results:
And plotting the ranking:
If you want to look at the tuning parameter results for a specific model:
And performance metrics and prediction results:
Finalizing a Model
Manually picking “boosting” as the best result and finalizing workflow:
See the final test set results:
And visualizing the predictions:
Dimensionality Reduction
Principal Component Analysis (PCA) is one of the most popular methods for dimensionality reduction. For this section, we will use the beans dataset: images of 13,611 grains of 7 different registered dry bean varieties were taken with a high-resolution camera.
The training data come from a set of manually labelled images. In the bean data, 16 morphology features were computed: area, perimeter, major axis length, minor axis length, aspect ratio, eccentricity, convex area, equivalent diameter, extent, solidity, roundness, compactness, shape factor1, shape factor 2, shape factor 3, and shape factor 4.
We will split the dataset into training, validation, and test sets:
Many of the features are highly correlated with one another, as we can see by plotting the scatterplot matrix:
We can use step_orderNorm() from the bestNormalize package to enforce a symmetric distribution for the predictors. step_zv() removes variables that contain only a single value, and step_normalize() standardizes variables to mean 0 and variance 1.
Preparing a Recipe
prep(recipe, training) fits the recipe to the training set, while bake(recipe, new_data) applies the recipe operations to new_data. For prep(), when retain = TRUE (the default), the preprocessed training data is saved within the recipe. For a large training set, use retain = FALSE to avoid memory issues.
Baking the Recipe
Validation set samples can be accessed by:
To process the validation set samples:
Let us graph the histogram for the area predictor before and after processing:
As for bake(), if retain = TRUE, we can get the processed version of the training data using:
We will create a helper function (for use later) to prep(), bake(), and plot the data:
Principal Component Analysis (PCA)
PCA is an unsupervised method that uses linear combinations of the predictors to define new features that account for as much variation as possible. We specify the number of components to retain as new predictors:
The “learntidymodels” package has functions that visualize the top features for each component:
The top loadings of the first component are mostly related to size, while the second component’s loadings are related to measures of elongation.
Partial Least Squares (PLS)
PLS is a supervised version of PCA. On top of maximizing the variation in the predictors, it also maximizes the relationship between the components and the outcome.
Looking at the top loadings:
We can see the first two PLS components are similar to the first two PCA components. However the other components are different.
Independent Component Analysis (ICA)
ICA finds components that are as statistically independent from one another as possible. It can be thought of as maximizing the “Non-Gaussianity” of the ICA components. Rather than compressing information like PCA, it is trying to separate information.
However, the ICA does not separate the bean classes well for the first 2 components.
Uniform Manifold Approximation and Projection (UMAP)
UMAP is similar to the t-SNE method. UMAP uses a Euclidean-distance-based nearest neighbor method to find local areas where the data points are more likely to be related. The relationships between data points are saved as a directed graph model. To do this, the algorithm uses an optimization process based on cross-entropy to map the data points to a smaller set of features.
There is also a supervised version of UMAP:
UMAP is a powerful method to reduce the feature space. However, it can be very sensitive to tuning parameters such as the number of neighbors, etc.
Modelling
Let’s explore models with dimensionality reduction added, along with no transformation at all. We will consider the following models:
Naive Bayes
Bagged Trees
Single-Layer Neural Network
Flexible Discriminant Analysis (FDA)
Regularized Discriminant Analysis (RDA)
Next, we create recipes for PLS and UMAP that share a common base recipe:
The workflow_set() takes the preprocessors and crosses them with the models. The workflow_map() applies grid search across 10 parameter combinations and the metric is area under the ROC curve:
Next, we rank the models based on their validation set AUC:
We can plot the performance of various models:
We will select the second-best performing model, the RDA model with PLS features, as the final model. We will finalize the workflow with the best parameters, fit it to the training set, then evaluate on the test set:
Extracting the actual fit:
And the metrics:
Encoding Categorical Data
Some models, such as those based on trees and rules, can handle categorical data natively and do not require encoding or transformation. Naive Bayes is another example that can deal with categorical variables natively.
Other models require the transformation of categorical variables into a set of dummy or indicator variables. However, if there are too many categories, or new categories appear at prediction time, straightforward dummy variables are not a good fit.
For ordinal variables, the default encoding strategy in base R is to make new numeric columns that are polynomial expansions of the data. For example, if there are 5 ordinal values, the factor column is replaced with 4 new columns, namely linear, quadratic, cubic, and quartic terms. However, this form of encoding is not intuitive. Instead, consider recipe steps for ordered factors such as step_unorder(), which converts to unordered (regular) factors, and step_ordinalscore(), which maps ordinal levels to numeric scores.
Effects Encoding
Effects encoding replaces a categorical variable with a numeric column that measures the effect of those categories. In the embed package, there are several recipe step functions for different kinds of effect encodings:
step_lencode_glm()
step_lencode_mixed()
step_lencode_bayes()
These steps use a generalized linear model to estimate the effect of each level.
For example, in the Ames housing data, we can compute the mean or median sale price for each neighborhood:
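A minimal dplyr sketch:

```r
ames_train %>%
  group_by(Neighborhood) %>%
  summarize(
    mean_price   = mean(Sale_Price),
    median_price = median(Sale_Price),
    n            = n()
  )
```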
In this example, we use step_lencode_glm(), which replaces each category with a score derived from a generalized linear model:
We can use tidy() to see the results of the scores. Number 2 refers to the glm step:
Note that a level is created for values not seen in the training data:
When you create an effect encoding for your categorical variable, you are effectively layering a mini-model inside your actual model which might result in overfitting.
Effect Encodings with Partial Pooling
Some factor levels have very few samples, so it makes more sense to pool them. The effects for each level are modeled all at once using a mixed or hierarchical generalized linear model:
As we can see the number of levels is less:
The neighborhoods with the fewest homes in them have been pulled (either up or down) toward the mean effect. When we use pooling, we shrink the effect estimates toward the mean because we don’t have as much evidence about the price in those neighborhoods.
Explaining Models
For certain models, it is easy to explain why the model makes its predictions. For example, linear regression contains coefficients for each predictor that are easy to interpret. For models like random forests, it is less clear. But with the help of the DALEX package, we can apply model explainer algorithms to aid understanding.
Models trained with tidymodels have to be explained with supplementary R packages such as lime, vip, and DALEX.
vip functions use model-based methods that take advantage of the model structure and are generally faster.
DALEX functions use model-agnostic methods that can be applied to any model.
Local Explanations
To start, we use a simple linear regression model to showcase the functionality of DALEX:
We then create an explainer in R for the lm model:
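A sketch using DALEXtra’s explain_tidymodels(); lm_fit is a placeholder for the fitted workflow:

```r
library(DALEXtra)

explainer_lm <- explain_tidymodels(
  lm_fit,
  data    = ames_train %>% select(-Sale_Price),
  y       = ames_train$Sale_Price,
  label   = "lm",
  verbose = FALSE
)
```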
Taking an example from training data:
And attempting to explain by DALEX function predict_parts():
We can see that the contribution column is similar but not identical to the coefficients from lm. Let’s check the prediction from the fit:
We can see the prediction is exactly the same as the explained model. I reckon some of the explanation is shifted to Longitude from Latitude.
Now let’s expand the predictors which include interactions and splines:
Because this linear model is trained with spline terms for Latitude and Longitude, the contributions combines the effects of the spline terms.
Let’s fit another random forest model:
Notice that the importance ranking is different for random forest and linear regression.
If we choose the order for the random forest model explanation to be the same as the default for the linear model (chosen via a heuristic), we can change the relative importance of the features.
Because the explanations change based on order, we can use Shapley additive explanations (SHAP), where the average contributions of features are computed under different combinations of feature orderings. Let’s compute SHAP attributions for our duplex using B = 20 random orderings:
We can plot this result in R. The box plots display the distribution of contributions across all the orderings we tried, and the bars display the average attribution for each feature:
Global Explanations
Previously we addressed which variables are most important in predicting sale price for a single home. Global feature importance however addresses the most important variables for a model in aggregate.
One way to calculate this is to shuffle/permute the features to measure how much worse the model fits the data compared to before shuffling.
In DALEX, we can compute this kind of variable importance via model_parts():
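A sketch, assuming explainer_rf is the explainer built for the random forest model:

```r
set.seed(1803)
vip_rf <- model_parts(explainer_rf, loss_function = loss_root_mean_square)
plot(vip_rf)
```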
In this example, Gr_Liv_Area has the highest importance because shuffling it results in the largest increase in RMSE.
Building Global Explanations from Local Explanations
Previously we touched on local explanations for a single observation (via Shapley additive explanations) and also global model explanations for the whole dataset by permuting features. It is also possible to build global model explanations by aggregating local model explanations. Partial dependence profiles show how the expected value of a model prediction changes as a function of a feature.
One way is to build a profile by aggregating or averaging profiles for individual observations. A Ceteris Paribus (CP) profile shows how an individual observation prediction changes as a function of a given feature. We can employ DALEX model_profile() to accomplish this:
The profiles are grouped by “ids” and we can graph each and also aggregate a mean profile:
Prediction Quality
A predictive model can almost always produce a prediction, even when a new data point is well outside the range of the data used to create the model; this leads to inappropriate extrapolation.
There are two methods for quantifying potential prediction quality:
Equivocal Zones
Applicability
Equivocal Results
An equivocal zone is a range of results in which the prediction should not be reported because it is too uncertain.
Suppose we have the following true logistic model:
Both predictors follow a bivariate normal distribution with a correlation of 0.70. We create a training set of 200 samples and a test set of 50 samples:
We estimate a logistic regression model using Bayesian methods, with Gaussian prior distributions:
Preparing the prediction data:
Graphing the data:
The “probably” package contains functions for equivocal zones. Rows that are within \(0.50 \pm 0.15\) are given a value of [EQ]:
Contrasting the two confusion matrices:
Rather than using predicted class probability to disqualify data points, a slightly better approach is to use the standard error of the class probability. One important aspect of the standard error is that it will increase where there is significant extrapolation. The Bayesian model naturally estimates the standard error of prediction.
Applicability
Equivocal zones try to measure the reliability of a prediction based on the model outputs. However, model statistics such as the standard error of prediction might not be able to measure the impact of extrapolation.
Take the example of ridership in Chicago: in the book, the model performs really badly on 2020 data because of Covid. Here, we will use PCA to see whether new data are very far from the training data by computing PCA scores along with the distance to the center of the training set.
The applicable package can develop an applicability domain model using PCA. We indicate a threshold of 99%, such that enough components are retained to account for 99% of the variability in the predictors.
Plotting the percentile of each data point’s distance from the center of the training data:
Computing the distance score and percentile for test data:
And computing for 2020 data:
We can clearly see that the 2020 data is way off the center.
Ensembles of Models
A model ensemble is an aggregation of the predictions of multiple single learners. In this section, we will talk about aggregating different types of models, known as model stacking.
A stacking model can include different types of models (e.g. trees and neural networks) and different configurations of the same model (e.g. trees with different depths).
We will use the race_results workflow set that we created earlier on:
Next, we create an empty data stack and add candidate models:
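A minimal sketch with the stacks package:

```r
library(stacks)

concrete_stack <- stacks() %>%
  add_candidates(race_results)
```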
Meta-Learning Model
The different models are combined into a meta-learning model. Any model can be used, but the most commonly used is a regularized GLM, which includes linear, logistic, and multinomial models. Specifically, regularization via the lasso penalty can help to reduce the number of candidates and also alleviate the problem of high correlation between ensemble candidates. By default, the blending coefficients are nonnegative, but this can be changed via an optional argument.
Because the outcome is numeric, linear regression is used for the metamodel:
Plotting the performance metrics with the default penalization:
We can increase the penalty by passing down the parameter:
And plotting the performance metrics again with the new penalization range:
Let’s plot the contribution of each model type:
To be able to use the stacking model for prediction, we first need to fit the stacking model:
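A sketch, assuming ens holds the blended stack returned by blend_predictions():

```r
# Fit the member models retained by the blending step
ens_fit <- fit_members(ens)
```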
Recall that the best boosted tree had a test set RMSE of 3.33. Let’s see how the ensemble model compares on the test set:
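A sketch of the test set comparison; concrete_test is a placeholder for the held-out data:

```r
ens_test_pred <- predict(ens_fit, concrete_test) %>%
  bind_cols(concrete_test)

ens_test_pred %>%
  rmse(compressive_strength, .pred)
```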
Inferential Analysis
Inferential models are created not only for their predictions but also to make inferences about components of the model. In predictive modeling, predictions on holdout data are used to validate the quality of the model, while inferential methods focus on validating the probabilistic or structural assumptions that are made prior to fitting the model.
Inference for Count Data
We will use data on 915 PhD biochemistry graduates and try to explain the factors that impact their academic productivity, measured by the count of articles published within 3 years.
Plotting the histogram, we can see that many graduates did not publish any articles and the outcome is right-skewed:
Let’s start by determining whether there is a difference in publications between men and women:
Next, we generate a confidence interval for this mean by creating a bootstrap distribution via generate():
A percentile interval is calculated using:
We can visualize the confidence interval as well:
Permutation Test
If we require a p-value, the infer package can compute the value via a permutation test by adding a hypothesize() function call to state the type of assumption to test:
Similarly, we can visualize the distribution and draw a vertical line to indicate the observed value:
Log-Linear Models
The previous two-sample tests are probably suboptimal, as they do not account for other factors that might explain the observed relationship between publication rate and sex. Let’s move to a more complex model that considers additional covariates:
To double check the CI, we use the bootstrap confidence intervals for lm() and glm() models:
Note that the bootstrap confidence interval is wider than the parametric confidence interval. If the data were truly Poisson, their intervals would have more similar widths.
To determine which predictors to keep, we can conduct the Likelihood Ratio Test (LRT) between nested models. Based on the CI, the predictor phd is statistically insignificant. Let us test the following hypothesis:
We fit a reduced model and use ANOVA to check for significance:
We can tidy the results of the LRT to get the p-value:
Zero-Inflated Poisson (ZIP) Model
There are occasions where the number of zero counts is larger than what a simple Poisson distribution would prescribe. The equation for the mean \(\lambda\) is:
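A standard formulation, assuming \(\pi\) is the probability of a structural zero and \(\mu\) is the mean of the Poisson component:

\[\lambda = 0 \times \pi + \mu \times (1 - \pi)\]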
But unfortunately, anova() isn’t implemented for zeroinfl objects. Rather, we see which has a higher AIC:
Note that we are not conducting a formal statistical test but rather estimating the ability of each model to fit the data. The results indicate that the ZIP model is preferable.
It would be better to resample these two models a large number of times and assess the uncertainty of the AIC statistics. Note that apparent = TRUE creates an additional bootstrap sample whose holdout set is the same as the training set.
Next we compute the AIC for both models:
In only one resample does the ZIP model fail to have the lower AIC, so it seems definitive that accounting for the excessive number of zero counts is a good idea.
Let’s create a confidence interval for the coefficients and show the first sample:
Let’s visualize the bootstrap distribution of the coefficients:
We can compute the bootstrap confidence interval and the t-distribution confidence interval. The rsample package contains a set of functions named int_*() that compute different types of bootstrap intervals (standard percentile and t-distribution):
From this, we can get a pretty good idea which predictors to include in the ZIP model.