Use the following code to generate data:
library(ggpubr)
# generate data
set.seed(302)
n <- 30
x <- sort(runif(n, -3, 3))
y <- 2*x + 2*rnorm(n)
x_test <- sort(runif(n, -3, 3))
y_test <- 2*x_test + 2*rnorm(n)
df_train <- data.frame("x" = x, "y" = y)
df_test <- data.frame("x" = x_test, "y" = y_test)
# store a theme
my_theme <- theme_bw(base_size = 16) +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5))
# generate plots
g_train <- ggplot(df_train, aes(x = x, y = y)) + geom_point() +
xlim(-3, 3) + ylim(min(y, y_test), max(y, y_test)) +
labs(title = "Training Data") + my_theme
g_test <- ggplot(df_test, aes(x = x, y = y)) + geom_point() +
xlim(-3, 3) + ylim(min(y, y_test), max(y, y_test)) +
labs(title = "Test Data") + my_theme
ggarrange(g_train, g_test) # from ggpubr, to put side-by-side
1a. For every k in between 1 and 10, fit a degree-k polynomial linear regression model with y
as the response and x
as the explanatory variable(s). (Hint: Use poly()
, as in the lecture slides.)
1b. For each model from (a), record the training error. Then predict y_test
using x_test
and also record the test error.
1c. Present the 10 values for both training error and test error on a single table. Comment on what you notice about the relative magnitudes of training and test error, as well as the trends in both types of error as \(k\) increases.
1d. If you were going to choose a model based on training error, which would you choose? Plot the data, colored by split. Add a line to the plot representing your selection for model fit. Add a subtitle to this plot with the (rounded!) test error. (Hint: You can use as much of my code as you want for this and part (e). See Lecture Slides 8!)
1e. If you were going to choose a model based on test error, which would you choose? Plot the data, colored by split. Add a line to the plot representing your selection for model fit. Add a subtitle to this plot with the (rounded!) test error.
1f. What do you notice about the shape of the curves from part (d) and (e)? Which model do you think has lower bias? Lower variance? Why?