We will start by plotting the data using a pairwise plot for exploratory data analysis. You will need the GGally
package to run the code below. First install the package if you haven’t already.
# install.packages("GGally")
library(palmerpenguins)
library(tidyverse)
library(GGally)
data(penguins)
Note the plotting code below comes from the excellent Allison Horst, available here.
penguins <- penguins %>% drop_na()
penguins %>%
select(species, where(is.numeric)) %>%
GGally::ggpairs(aes(color = species),
columns = c("flipper_length_mm", "body_mass_g",
"bill_length_mm", "bill_depth_mm"),
upper = list(continuous = wrap("cor", size = 3))) +
scale_colour_manual(values = c("darkorange","purple","cyan4")) +
scale_fill_manual(values = c("darkorange","purple","cyan4")) +
theme_bw()
1. Comment on what you see in the plot above. Use several sentences, as later in this short lab you will be comparing your conclusions here to those from PCA. Focus on the associations between species and the various quantitative variables, and how those associations differ across species.
There are many, many functions in R for implementing PCA. We will focus on prcomp()
. Note that you may be asked to use a different function someday!
prcomp()
parameters include the numeric data and a Boolean for whether or not to scale the data. See ?prcomp
for details. We will store the output of prcomp()
as pc_out
.
colnames(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
penguins_num <- penguins %>%
select(contains("_mm"), "body_mass_g")
pc_out <- prcomp(penguins_num, scale = TRUE)
Let’s first investigate this object.
names(pc_out)
## [1] "sdev" "rotation" "center" "scale" "x"
2. Use the documentation for prcomp()
and your intuition with data exploration to explain what the center
and scale
elements of the pc_out
object represent.
3. Print out a table of the proportion of variance explained by the principal components. You may find the summary()
function useful here. See prcomp()
for examples of how to use it. How many principal components would you suggest using in this application and why?
# YOUR CODE HERE
We are now going to plot the biplot for these data, using ggbiplot
within the ggbiplot
package. We need to install this package from GitHub! You will need the remotes
package, if you do not have it already to install the package from GitHub. (Remember: package installs never stay in our final RMarkdown document, hence why they are commented out here for your reference only.)
Here I will use the standard function, with some graphical options.
# install.packages("remotes")
# remotes::install_github("vqv/ggbiplot")
library(ggbiplot)
ggbiplot(pc_out, groups = as.factor(penguins$species), varname.size = 3.4) +
xlim(-3, 3) +
theme_bw() + scale_colour_manual(values = c("darkorange","purple","cyan4"))
4. First let’s discuss the variables. Discuss which variables (flipper_length_mm
, body_mass_g
, bill_length_mm
, bill_depth_mm
) seem to be correlated, which do not, and why you are coming to these conclusions.
5. Next let’s discuss the species. How might we distinguish the species of penguins using the first two principal components? (e.g. Which species tend to have higher scores for the first principal component? The second?)
6. Translate your answer in part 5 to the original variables (flipper_length_mm
, body_mass_g
, bill_length_mm
, bill_depth_mm
). What does this suggest about associations between the species and variables? How does this compare to what you observed in the pairwise plot from Part 1?