If you collaborated with anyone, you must include “Collaborated with: FIRSTNAME LASTNAME” at the top of your lab!
For this lab, note that there are tidyverse methods to perform cross-validation in R (see the `rsample` package). However, your goal is to understand and be able to implement the algorithm "by hand," meaning that automated procedures from the `rsample` package, or similar packages, will not be accepted.

To begin, load the popular `penguins` data set from the `palmerpenguins` package.
library(palmerpenguins)
data("penguins", package = "palmerpenguins")
Our goal here is to predict the output class `species` using covariates `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g`. All your code should be within a function `my_knn_cv`.
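As a starting point, one way to prepare the data is to keep only the response and the four covariates and drop rows with missing values, since `penguins` contains `NA`s. (The cleaning step and the name `penguins_clean` are assumptions on my part, not part of the lab's spec.)

```r
library(palmerpenguins)

# Keep the response plus the four covariates; drop rows with NAs.
# (Dropping NAs is an assumption -- check your lab's instructions.)
penguins_clean <- na.omit(
  penguins[, c("species", "bill_length_mm", "bill_depth_mm",
               "flipper_length_mm", "body_mass_g")]
)
str(penguins_clean)
```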
Input:
- `train`: input data frame
- `cl`: true class value of your training data
- `k_nn`: integer representing the number of neighbors
- `k_cv`: integer representing the number of folds

Please note the distinction between `k_nn` and `k_cv`!
Output: a list with objects
- `class`: a vector of the predicted class \(\hat{Y}_{i}\) for all observations
- `cv_err`: a numeric with the cross-validation misclassification error

You will need to include the following steps:
- Define a variable `fold` that randomly assigns observations to folds \(1,\ldots,k\) with equal probability. (Hint: see the example code on the slides for k-fold cross-validation.)
- Within each iteration, use `knn()` from the `class` package to predict the class of the \(i\)th fold using all other folds as the training data.
- Store `class` as the output of `knn()` with the full data as both the training and the test data, and `cv_err` as the average misclassification rate from your cross-validation.

Submission: To prove your function works, apply it to the `penguins`
data. Predict the output class `species` using covariates `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g`. Use \(5\)-fold cross-validation (`k_cv = 5`). Use a table to show the `cv_err` values for 1-nearest neighbor and 5-nearest neighbors (`k_nn = 1` and `k_nn = 5`). Comment on which value had the lower CV misclassification error and which had the lower training set error (compare your output `class` to the true class, `penguins$species`).
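The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive solution: the fold assignment follows the hint, per-fold misclassification rates are averaged for `cv_err`, and `train` is assumed to already be a clean numeric data frame with no missing values.

```r
library(class)  # for knn()

my_knn_cv <- function(train, cl, k_nn, k_cv) {
  n <- nrow(train)
  # Randomly assign each observation to a fold 1, ..., k_cv with equal probability
  fold <- sample(rep(1:k_cv, length.out = n))
  cv_misclass <- numeric(k_cv)
  for (i in 1:k_cv) {
    # Predict the i-th fold using all other folds as training data
    pred_i <- knn(train = train[fold != i, ],
                  test  = train[fold == i, ],
                  cl    = cl[fold != i],
                  k     = k_nn)
    # Misclassification rate on the held-out fold
    cv_misclass[i] <- mean(pred_i != cl[fold == i])
  }
  # Final predictions: full data as both the training and the test data
  class <- knn(train = train, test = train, cl = cl, k = k_nn)
  list(class = class, cv_err = mean(cv_misclass))
}
```

A call might then look like `my_knn_cv(train = covariates, cl = species_vec, k_nn = 1, k_cv = 5)`, where `covariates` holds the four numeric columns and `species_vec` the true classes.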
Now, we will predict the output `body_mass_g` using covariates `bill_length_mm`, `bill_depth_mm`, and `flipper_length_mm`. All your code should be within a function `my_rf_cv`.

Input:
- `k`: number of folds

Output: a numeric with the cross-validation mean squared error (MSE)
Your code will look very similar to Part 1! You will need the following steps:
- Define a variable `fold` within the `penguins` data that randomly assigns observations to folds \(1,\ldots,k\) with equal probability. (Hint: see the example code on the slides for k-fold cross-validation.)
- Within each iteration, use `randomForest()` from the `randomForest` package to train a random forest model with \(100\) trees to predict `body_mass_g` using covariates `bill_length_mm`, `bill_depth_mm`, and `flipper_length_mm`. `randomForest()` takes formula input, so your code here will probably look something like:
  MODEL <- randomForest(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm, data = TRAINING_DATA, ntree = 100)
- Within each iteration, predict the `body_mass_g` of the \(i\)th fold, which was not used as training data. `randomForest()` works similarly to `lm()`, so your code here will probably look something like:
  PREDICTIONS <- predict(MODEL, TEST_DATA[, -1])
  where we remove the first column, `body_mass_g`, from our test data.
- Within each iteration, evaluate the MSE: the average squared difference between the predicted `body_mass_g` and the true `body_mass_g`.

Submission: To prove your function works, apply it to the `penguins`
data. Predict `body_mass_g` using covariates `bill_length_mm`, `bill_depth_mm`, and `flipper_length_mm`. Run your function with \(5\)-fold cross-validation (`k = 5`) and report the CV MSE.
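A sketch of `my_rf_cv` following the same fold-assignment approach is below. The NA handling (dropping incomplete rows) and the data-preparation details inside the function are assumptions, not requirements stated in the lab.

```r
library(palmerpenguins)
library(randomForest)

my_rf_cv <- function(k) {
  # Keep the response and the three covariates; drop rows with NAs (an assumption)
  dat <- na.omit(penguins[, c("body_mass_g", "bill_length_mm",
                              "bill_depth_mm", "flipper_length_mm")])
  n <- nrow(dat)
  # Randomly assign observations to folds 1, ..., k with equal probability
  fold <- sample(rep(1:k, length.out = n))
  mse <- numeric(k)
  for (i in 1:k) {
    # Train a 100-tree random forest on all folds except the i-th
    model <- randomForest(
      body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
      data = dat[fold != i, ], ntree = 100)
    # Predict the held-out fold; drop the first column (body_mass_g)
    pred <- predict(model, dat[fold == i, -1])
    # Average squared difference between predicted and true body_mass_g
    mse[i] <- mean((pred - dat$body_mass_g[fold == i])^2)
  }
  mean(mse)  # CV estimate of the MSE
}
```

Calling `my_rf_cv(k = 5)` then returns the 5-fold CV MSE as a single numeric.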