class: center, top, title-slide

# STAT 302, Lecture Slides 2
## Programming Fundamentals
### Bryan Martin

---

# Outline

1. Control flow: `if`, `else`, `else if`
2. Loops: `for` and `while`
3. Functions
4. Packages
5. Data
6. Managing Data

.middler[**Goal:** Use R's programming capabilities to build efficient functions and workflows.]

---
class: inverse

.sectionhead[Part 1: Control Flow]

---
layout: true

# Control Flow

---

## `if` statements

* `if` statements give our computer conditions for chunks of code
* If our condition is `TRUE`, then the chunk evaluates
* If our condition is `FALSE`, then it does not
* We must give our condition as a single Boolean

---

## `if` statements

```r
x <- 1
# Conditions go in parentheses after if
if (x > 0) {
  # code chunks get surrounded by curly brackets
  cat("x is equal to", x, ", a positive number!")
}
```

```
## x is equal to 1 , a positive number!
```

--

```r
x <- -1
# Conditions go in parentheses after if
if (x > 0) {
  # code chunks get surrounded by curly brackets
  cat("x is equal to", x, ", a positive number!")
}
```

---

## `else` statements

* We can use `else` to specify what we want to happen when our condition is `FALSE`

```r
x <- 1
if (x > 0) {
  cat("x is equal to", x, ", a positive number!")
} else {
  cat("x is equal to", x, ", a negative number!")
}
```

```
## x is equal to 1 , a positive number!
```

--

```r
x <- -1
if (x > 0) {
  paste0("x is equal to ", x, ", a positive number!")
} else {
  paste0("x is equal to ", x, ", a negative number!")
}
```

```
## [1] "x is equal to -1, a negative number!"
```

---

## `else if`

* Use `else if` to set a sequence of conditions
* The final `else` will evaluate anything left

```r
x <- 1
if (x > 0) {
  paste0("x is equal to ", x, ", a positive number!")
} else if (x < 0) {
  paste0("x is equal to ", x, ", a negative number!")
} else {
  paste0("x is equal to ", x, "!")
}
```

```
## [1] "x is equal to 1, a positive number!"
```

---

```r
x <- -1
if (x > 0) {
  paste0("x is equal to ", x, ", a positive number!")
} else if (x < 0) {
  paste0("x is equal to ", x, ", a negative number!")
} else {
  paste0("x is equal to ", x, "!")
}
```

```
## [1] "x is equal to -1, a negative number!"
```

--

```r
x <- 0
if (x > 0) {
  paste0("x is equal to ", x, ", a positive number!")
} else if (x < 0) {
  paste0("x is equal to ", x, ", a negative number!")
} else {
  paste0("x is equal to ", x, "!")
}
```

```
## [1] "x is equal to 0!"
```

---
layout: false
layout: true

# Control Flow: Examples

---

## Absolute Value

```r
x <- 5
if (x >= 0) {
  x
} else {
  -x
}
```

```
## [1] 5
```

--

```r
x <- -5
if (x >= 0) {
  x
} else {
  -x
}
```

```
## [1] 5
```

---

## Checking if `x` is even or negative

```r
x <- 6
if ((x %% 2) == 0 | x < 0) {
  TRUE
} else {
  FALSE
}
```

```
## [1] TRUE
```

--

```r
x <- -5
if ((x %% 2) == 0 | x < 0) {
  TRUE
} else {
  FALSE
}
```

```
## [1] TRUE
```

---

```r
x <- 5
if ((x %% 2) == 0 | x < 0) {
  TRUE
} else {
  FALSE
}
```

```
## [1] FALSE
```

---

## Check length of strings

Note: We will need the `stringr` package for this

```r
# Run this if you have never installed stringr before!
# install.packages("stringr")
library(stringr)
```

```r
x <- "cat"
if (str_length(x) <= 10) {
  cat("x is a pretty short string!")
} else {
  cat("x is a pretty long string!")
}
```

```
## x is a pretty short string!
```

---

```r
x <- "A big fluffy cat with orange fur and stripes"
if (str_length(x) <= 10) {
  cat("x is a pretty short string!")
} else {
  cat("x is a pretty long string!")
}
```

```
## x is a pretty long string!
```

---

## Check class

```r
x <- 5
if (is.numeric(x)) {
  cat("x is a numeric!")
} else if (is.character(x)) {
  cat("x is a character!")
} else {
  cat("x is some class I didn't check for in my code!")
}
```

```
## x is a numeric!
```

---

## Check class

```r
x <- list()
if (is.numeric(x)) {
  cat("x is a numeric!")
} else if (is.character(x)) {
  cat("x is a character!")
} else {
  cat("x is some class I didn't check for in my code!")
}
```

```
## x is some class I didn't check for in my code!
```

---
layout: false
class: inverse

.sectionhead[Part 2: Loops]

---
layout: true

# Loops

---

## `for` loops

`for` loops iterate along an input vector, store the current value of the vector as a variable, and repeatedly evaluate a code chunk until the vector is exhausted.

```r
for (i in 1:10) {
  print(i)
}
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
```

---

## `while` loops

`while` loops repeatedly evaluate the inner code chunk until the condition is `FALSE`.

Be careful here! If the condition never becomes `FALSE`, you will be stuck in an infinite loop!

```r
x <- 0
while (x < 5) {
  cat("x is currently", x, ". Let's increase it by 1!")
  x <- x + 1
}
```

```
## x is currently 0 . Let's increase it by 1!x is currently 1 . Let's increase it by 1!x is currently 2 . Let's increase it by 1!x is currently 3 . Let's increase it by 1!x is currently 4 . Let's increase it by 1!
```

---

## `while` loops

Let's see if we can clean up that output. Add `"\n"` to a string to force a line break.

```r
x <- 0
while (x < 5) {
  cat("x is currently ", x, ". Let's increase it by 1! \n", sep = "")
  x <- x + 1
}
```

```
## x is currently 0. Let's increase it by 1! 
## x is currently 1. Let's increase it by 1! 
## x is currently 2. Let's increase it by 1! 
## x is currently 3. Let's increase it by 1! 
## x is currently 4. Let's increase it by 1! 
```

---
layout: false
layout: true

# Loops: Examples

---

## String Input

```r
string_vector <- c("a", "b", "c", "d", "e")
for (mystring in string_vector) {
  print(mystring)
}
```

```
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"
```

---

## Nested Loops

```r
counter <- 0
for (i in 1:3) {
  for (j in 1:2) {
    counter <- counter + 1
    cat("i = ", i, ", j = ", j, ", counter = ", counter, "\n", sep = "")
  }
}
```

```
## i = 1, j = 1, counter = 1
## i = 1, j = 2, counter = 2
## i = 2, j = 1, counter = 3
## i = 2, j = 2, counter = 4
## i = 3, j = 1, counter = 5
## i = 3, j = 2, counter = 6
```

---

## Nested Loops

```r
for (i in 1:3) {
  for (j in 1:2) {
    print(i * j)
  }
}
```

```
## [1] 1
## [1] 2
## [1] 2
## [1] 4
## [1] 3
## [1] 6
```

---

## Filling in a vector

Note: Usually, this is an inefficient way to do this! Try to vectorize code wherever possible!

```r
# Inefficient
x <- rep(NA, 5)
for (i in 1:5) {
  x[i] <- i * 2
}
x
```

```
## [1]  2  4  6  8 10
```

```r
# Much better
x <- seq(2, 10, by = 2)
x
```

```
## [1]  2  4  6  8 10
```

---

## Filling in a vector

```r
library(stringr)
x <- rep(NA, 5)
my_strings <- c("a", "a ", "a c", "a ca", "a cat")
for (i in 1:5) {
  x[i] <- str_length(my_strings[i])
  print(x)
}
```

```
## [1]  1 NA NA NA NA
## [1]  1  2 NA NA NA
## [1]  1  2  3 NA NA
## [1]  1  2  3  4 NA
## [1] 1 2 3 4 5
```

---

## Filling in a matrix

Note: Usually, this is an inefficient way to do this! Try to vectorize code wherever possible!
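The nested loop on this slide fills a 4 x 3 matrix with the products `i * j`. For comparison, here is one vectorized sketch of the same table using base R's `outer()`; the name `x_vec` is only an illustration and is not part of the original slides.

```r
# Vectorized alternative to a nested loop: outer() multiplies
# every pair (i, j) with i in 1:4 and j in 1:3
x_vec <- outer(1:4, 1:3)
x_vec
```

The loop version is still worth reading closely, since it shows how `x[i, j]` indexing works.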
```r
x <- matrix(NA, nrow = 4, ncol = 3)
for (i in 1:4) {
  for (j in 1:3) {
    x[i, j] <- i * j
  }
}
x
```

```
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    2    4    6
## [3,]    3    6    9
## [4,]    4    8   12
```

---

## Continue until positive sample

```r
set.seed(3)
x <- -1
while (x < 0) {
  x <- rnorm(1)
  print(x)
}
```

```
## [1] -0.9619334
## [1] -0.2925257
## [1] 0.2587882
```

```r
x
```

```
## [1] 0.2587882
```

---
layout: false
class: inverse

.sectionhead[Part 3: Functions]

---
layout: true

# Functions

---

We've already seen and used several functions, but you can also create your own!

This is incredibly useful when:

* You use the same code chunk repeatedly
* You want to generalize your workflow to multiple inputs
* You want others to be able to use your code
* You want to complete your assignments for STAT 302

---

## Anatomy of a function

```r
function_name <- function(param1, param2 = "default") {
  # Body of the function
  return(output)
}
```

* `function_name`: the name you want to give your function, what you will use to call it
* `function()`: call this to define a function
* `param1`, `param2`: function parameters, what the user inputs. You can assign default values by setting them equal to something in the function definition
* **Body**: the actual code that is executed
* `return()`: what your function will return to the user

---
layout: false
layout: true

# Functions: Examples

---

## Square a number, add 2

```r
square_plus_2 <- function(x) {
  y <- x^2 + 2
  return(y)
}
square_plus_2(4)
```

```
## [1] 18
```

```r
square_plus_2(10)
```

```
## [1] 102
```

```r
square_plus_2(1:5)
```

```
## [1]  3  6 11 18 27
```

---

```r
square_plus_2("some string")
```

```
## Error in x^2: non-numeric argument to binary operator
```

What happened here? We wrote a function for numerics only but didn't check the input!
Let's try making our function more robust by adding a `stop`.

```r
square_plus_2 <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric!")
  } else {
    y <- x^2 + 2
    return(y)
  }
}
square_plus_2("some string")
```

```
## Error in square_plus_2("some string"): x must be numeric!
```

---

## Check if the input is positive

```r
check_pos <- function(x) {
  if (x > 0) {
    return(TRUE)
  } else if (x < 0) {
    return(FALSE)
  } else {
    return(paste0("x is equal to ", x, "!"))
  }
}
check_pos(-3)
```

```
## [1] FALSE
```

```r
store_output <- check_pos(0)
store_output
```

```
## [1] "x is equal to 0!"
```

---

## Make a table

We'll use `str_c` from the `stringr` package for this function.

```r
library(stringr)
my_summary <- function(input, percentiles = c(.05, .5, .95)) {
  if (!is.numeric(input) | !is.numeric(percentiles)) {
    stop("The input and percentiles must be numeric!")
  }
  if (max(percentiles) > 1 | min(percentiles) < 0) {
    stop("Percentiles must all be in [0, 1]")
  }
  # Convert percentiles to character percent, append " Percentile" to each
  labels <- str_c(percentiles * 100, " Percentile")
  output <- quantile(input, probs = percentiles)
  names(output) <- labels
  return(output)
}
```

---

## Make a table

```r
x <- rnorm(100)
my_summary(x)
```

```
##  5 Percentile 50 Percentile 95 Percentile 
##   -1.22236488    0.06183487    1.22655423 
```

```r
my_summary(x, percentiles = c(.07, .5, .63, .91))
```

```
##  7 Percentile 50 Percentile 63 Percentile 91 Percentile 
##   -1.13785677    0.06183487    0.36358152    1.16185072 
```

```r
my_summary(c("string1", "string2"))
```

```
## Error in my_summary(c("string1", "string2")): The input and percentiles must be numeric!
```

```r
my_summary(x, percentiles = c(-7, .5, 1.3))
```

```
## Error in my_summary(x, percentiles = c(-7, 0.5, 1.3)): Percentiles must all be in [0, 1]
```

---

## Function with iteration

```r
my_sum <- function(x) {
  total <- 0
  # seq_along(x) is safer than 1:length(x), which gives c(1, 0) for empty x
  for (i in seq_along(x)) {
    total <- total + x[i]
  }
  return(total)
}
my_sum(1:5)
```

```
## [1] 15
```

---
layout: false
class: inverse

.sectionhead[Style guide!]

---
layout: true

# Style guide!

---

.middler[Once again, we will be using a mix of the [Tidyverse style guide](https://style.tidyverse.org/) and the [Google style guide](https://google.github.io/styleguide/Rguide.html).]

---

## Function Names

Strive to have function names based on verbs. Otherwise, standard variable name style guidelines apply!

```r
# Good
add_row()
permute()

# Bad
row_adder()
permutation()
```

---

## Spacing

Place a space before and after `()` when used with `if`, `for`, or `while`.

```r
# Good
if (condition) {
  x + 2
}

# Bad
if(condition){
  x + 2
}
```

---

## Spacing

Place a space after `()` used for function arguments.

```r
# Good
if (debug) {
  show(x)
}

# Bad
if(debug){
  show(x)
}
```

---

## Code Blocks

* `{` should be the last character on the line. Related code (e.g., an `if` clause, a function declaration, a trailing comma, ...) must be on the same line as the opening brace. It should be preceded by a single space.
* The contents within code blocks should be indented by two spaces from where the block started
* `}` should be the first character on the line.

---

## Code Blocks

```r
# Good
if (y < 0) {
  message("y is negative")
}

if (y == 0) {
  if (x > 0) {
    log(x)
  } else {
    message("x is negative or zero")
  }
} else {
  y^x
}
```

---

## Code Blocks

```r
# Bad
if (y<0){
message("Y is negative")
}

if (y == 0)
{
    if (x > 0) {
      log(x)
    } else {
  message("x is negative or zero")
    }
} else { y ^ x }
```

---

## In-line Statements

In general, it's ok to drop the curly braces for very simple statements that fit on one line. However, function calls that affect control flow (`return`, `stop`, etc.)
should always go in their own `{}` block:

```r
# Good
y <- 10
x <- if (y < 20) "Too low" else "Too high"

if (y < 0) {
  stop("Y is negative")
}

find_abs <- function(x) {
  if (x > 0) {
    return(x)
  }
  x * -1
}
```

---

## In-line Statements

In general, it's ok to drop the curly braces for very simple statements that fit on one line. However, function calls that affect control flow (`return`, `stop`, etc.) should always go in their own `{}` block:

```r
# Bad
if (y < 0) stop("Y is negative")

if (y < 0)
  stop("Y is negative")

find_abs <- function(x) {
  if (x > 0) return(x)
  x * -1
}
```

---

## Long lines in functions

If a function definition runs over multiple lines, indent the second line to where the definition starts.

```r
# Good
long_function_name <- function(a = "a long argument",
                               b = "another argument",
                               c = "another long argument") {
  # As usual code is indented by two spaces.
}

# Bad
long_function_name <- function(a = "a long argument",
  b = "another argument",
  c = "another long argument") {
  # Here it's hard to spot where the definition ends and the
  # code begins
}
```

---

## `return`

Strictly speaking, `return` is not necessary in a function definition. A function returns the value of the last expression it evaluates. The following function definitions will output the same results!

```r
Add_Values <- function(x, y) {
  return(x + y)
}

Add_Values <- function(x, y) {
  x + y
}
```

Note that our two guides disagree on which of these is preferable. Personally, I always make my `return` statements explicit, so I prefer the former.

---

## Commenting functions

For now, when commenting functions, include (at least) 3 lines of comments:

* a comment describing the purpose of a function
* a comment describing each input
* a comment describing the output

The function body should be commented as usual!
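As one more illustration, here is the same three-comment template applied to the `my_sum` function from the earlier examples. This is a sketch: the input check and the use of `seq_along()` are choices made for this example, not requirements from the style guides.

```r
# Function: my_sum, adds up the elements of a vector with a for loop
# Input: x, a numeric vector (may be empty)
# Output: a single numeric, the sum of the elements of x
my_sum <- function(x) {
  # check that x is numeric
  if (!is.numeric(x)) {
    stop("x must be numeric!")
  }
  # start the running total at zero
  total <- 0
  # seq_along(x) is empty when x is empty, so the loop is skipped
  for (i in seq_along(x)) {
    total <- total + x[i]
  }
  return(total)
}

my_sum(1:5)
my_sum(numeric(0))
```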
---

```r
# Good ----
# Function: square_plus_2, squares a number and then adds 2
# Input: x, must be numeric
# Output: numeric equal to x^2 + 2
square_plus_2 <- function(x) {
  # check that x is numeric
  if (!is.numeric(x)) {
    stop("x must be numeric!")
  } else {
    # if numeric, then square and add 2
    y <- x^2 + 2
    return(y)
  }
}

# Bad ----
# Function for problem 2c
square_plus_2 <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric!")
  } else {
    y <- x^2 + 2
    return(y)
  }
}
```

---
layout: false

# Summary

* Use `if` and `else` to set conditions
* Use `for` and `while` to write loops
* Functions include input parameters, a body of code, and an output
* Functions are essential for a good workflow!

---
class: inverse

.sectionhead[Part 4: Packages]

---
layout: true

# Packages

---

## What is an R package?

* Packages bundle together code, data, and documentation in an easy to share way.
* They come with functions that others have written for you to make your life easier, and greatly improve the power of R!
* Packages are the reason we are learning about R in this course.
* Packages can range from graphical software, to web scraping tools, statistical models for spatio-temporal data<sup>1</sup>, microbial data analysis tools<sup>2</sup>, and more!

.footnote[[1] This is a shameless self plug. [2] This is also a shameless self plug.]

---

## Where are packages?

* The most popular package repository is the Comprehensive R Archive Network, or [CRAN](https://cran.r-project.org/)
* As of making this slide, it includes over 16,000 packages
* Other popular repositories include [Bioconductor](https://www.bioconductor.org/) and [GitHub](https://github.com/)

---

## How do I install packages?

If a package is available on CRAN, like most packages we will use for this course, you can install it using `install.packages()`:

```r
install.packages("PACKAGE_NAME_IN_QUOTES")
```

You can also install by clicking *Install* in the *Packages* tab through RStudio.
For the most part, after you install a package, it is saved on your computer until you update R, and you will not need to re-install it. Thus, you should **never** include a call to `install.packages()` in any `.R` or `.Rmd` file!

---

## How do I use a package?

After a package is installed, you can load it into your current R session using `library()`:

```r
library(PACKAGE_NAME)
# or
library("PACKAGE_NAME")
```

Note that unlike `install.packages()`, you do not need to include the package name in quotes.

---

## How do I use a package?

Loading a package must be done with each new R session, so you should put calls to `library()` in your `.R` and `.Rmd` files.

Usually, I do that in the opening code chunk. If it is a `.Rmd`, I set the parameter `include = FALSE` to hide the messages and code, because they are usually unnecessary to the reader of my HTML.

```{r, include = FALSE}
library(ggplot2)
```

---
layout: false
class: inverse

.sectionhead[Part 5: Data]

---
layout: true

# Lists

---

**Lists**, like vectors and matrices, are a class of objects in R. Lists are special because they can store multiple different types of data.

```r
my_list <- list("some_numbers" = 1:5,
                "some_characters" = c("a", "b", "c"),
                "a_matrix" = diag(2))
my_list
```

```
## $some_numbers
## [1] 1 2 3 4 5
## 
## $some_characters
## [1] "a" "b" "c"
## 
## $a_matrix
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
```

Make sure to name the items within a list using `=`, not `<-`!
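---

## Why `=` and not `<-`?

A small sketch of what goes wrong with `<-` inside `list()`. The names `good_list`, `bad_list`, and `other_numbers` are made up for this illustration.

```r
# Using `=` names the list item and leaves the workspace untouched
good_list <- list(some_numbers = 1:5)
names(good_list)

# Using `<-` creates a variable called other_numbers in the workspace
# and stores its value in the list *without* a name
bad_list <- list(other_numbers <- 1:5)
names(bad_list)
other_numbers
```

Here `names(bad_list)` is `NULL`, so the item can only be reached by position, for example with `bad_list[[1]]`.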
---

## Accessing List Elements

There are three ways to access an item within a list

* double brackets `[[]]` with its name in quotes
* double brackets `[[]]` with its index as a number
* dollar sign `$` followed by its name without quotes

---

* double brackets `[[]]` with its name in quotes
* double brackets `[[]]` with its index as a number
* dollar sign `$` followed by its name without quotes

```r
my_list[["some_numbers"]]
```

```
## [1] 1 2 3 4 5
```

```r
my_list[[1]]
```

```
## [1] 1 2 3 4 5
```

```r
my_list$some_numbers
```

```
## [1] 1 2 3 4 5
```

---

## Why double brackets?

If you use a single bracket to index, like we do with matrices and vectors, you will get back a list with a single element.

```r
my_list[1]
```

```
## $some_numbers
## [1] 1 2 3 4 5
```

```r
my_list[[1]]
```

```
## [1] 1 2 3 4 5
```

Note that this means you can only return a single item in a list using double brackets or the dollar sign! (Why?)

---

## Why double brackets?

This is a subtle but important difference!

```r
my_list[1] + 1
```

```
## Error in my_list[1] + 1: non-numeric argument to binary operator
```

```r
my_list[[1]] + 1
```

```
## [1] 2 3 4 5 6
```

---

## Subsetting a list

You can subset a list similarly to vectors and matrices using single brackets.

```r
my_list[1:2]
```

```
## $some_numbers
## [1] 1 2 3 4 5
## 
## $some_characters
## [1] "a" "b" "c"
```

```r
my_list[-2]
```

```
## $some_numbers
## [1] 1 2 3 4 5
## 
## $a_matrix
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
```

---

## Adding to a list

We can use the same tools we used to access list elements to add to a list. However, if we use double brackets, we must put the new name in quotes. Otherwise, R will look for an object with that name, which does not yet exist.
```r
my_list$a_boolean <- FALSE
my_list[["a_list"]] <- list("recursive" = TRUE)
```

---

```r
my_list
```

```
## $some_numbers
## [1] 1 2 3 4 5
## 
## $some_characters
## [1] "a" "b" "c"
## 
## $a_matrix
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
## 
## $a_boolean
## [1] FALSE
## 
## $a_list
## $a_list$recursive
## [1] TRUE
```

---

## Names of List Items

Call `names()` to get a vector of list item names.

```r
names(my_list)
```

```
## [1] "some_numbers"    "some_characters" "a_matrix"        "a_boolean"      
## [5] "a_list"
```

---

## Why bother?

* Lists give us **key-value pairs**, also known as **dictionaries** or **associative arrays**
* This means we can look up items in a list by name, rather than location
* For example, if we know we are looking for `output` within a list, we can always search for it, regardless of how the list was created or what else it contains

---
layout: false
layout: true

# Data Frames

---

A **data frame** in R is essentially a special type of list, where each item is a vector of equal length.

Typically, we say that data has `\(n\)` rows (one for each **observation**) and `\(p\)` columns (one for each **variable**)

Unlike a matrix, columns can have different types.

However, many column functions still apply! (such as `colSums`, `summary`, etc.)

---

## Creating a data frame

An easy way to create a data frame is to use the function `data.frame()`.

Like lists, make sure you define the names using `=` and not `<-`!
```r
my_data <- data.frame("var1" = 1:3,
                      "var2" = c("a", "b", "c"),
                      "var3" = c(TRUE, FALSE, TRUE))
my_data
```

```
##   var1 var2  var3
## 1    1    a  TRUE
## 2    2    b FALSE
## 3    3    c  TRUE
```

---

## Creating a data frame

If you import or create numeric data as a `matrix`, you can also convert it easily using `as.data.frame()`

```r
my_matrix <- matrix(1:9, nrow = 3, ncol = 3)
as.data.frame(my_matrix)
```

```
##   V1 V2 V3
## 1  1  4  7
## 2  2  5  8
## 3  3  6  9
```

---

## Subsetting data frames

We can subset data frames using most of the tools we've learned about subsetting so far. We can use keys or indices.

```r
my_data$var1
```

```
## [1] 1 2 3
```

```r
my_data["var1"]
```

```
##   var1
## 1    1
## 2    2
## 3    3
```

```r
my_data[["var1"]]
```

```
## [1] 1 2 3
```

---

## Subsetting data frames

```r
my_data[1]
```

```
##   var1
## 1    1
## 2    2
## 3    3
```

```r
my_data[[1]]
```

```
## [1] 1 2 3
```

```r
my_data[, 1]
```

```
## [1] 1 2 3
```

```r
my_data[1, ]
```

```
##   var1 var2 var3
## 1    1    a TRUE
```

---

## Adding to a data frame

We can add to a data frame using `rbind()` and `cbind()`, but be careful with type mismatches! We can also add columns using the column index methods.

```r
# These all do the same thing
my_data <- cbind(my_data, "var4" = c(3, 2, 1))
my_data$var4 <- c(3, 2, 1)
my_data[, "var4"] <- c(3, 2, 1)
my_data[["var4"]] <- c(3, 2, 1)
my_data
```

```
##   var1 var2  var3 var4
## 1    1    a  TRUE    3
## 2    2    b FALSE    2
## 3    3    c  TRUE    1
```

---

## Adding to a data frame

```r
rbind(my_data, c(1, 2, 3, 4))
```

```
##   var1 var2 var3 var4
## 1    1    a    1    3
## 2    2    b    0    2
## 3    3    c    1    1
## 4    1    2    3    4
```

```r
rbind(my_data, list(4, "d", FALSE, 0))
```

```
##   var1 var2  var3 var4
## 1    1    a  TRUE    3
## 2    2    b FALSE    2
## 3    3    c  TRUE    1
## 4    4    d FALSE    0
```

---

## Investigating a data frame

We can use `str()` to see the structure of a data frame (or any other object!)

```r
my_data2 <- rbind(my_data, c(1, 2, 3, 4))
str(my_data2)
```

```
## 'data.frame': 4 obs. of  4 variables:
##  $ var1: num  1 2 3 1
##  $ var2: chr  "a" "b" "c" "2"
##  $ var3: num  1 0 1 3
##  $ var4: num  3 2 1 4
```

```r
my_data2 <- rbind(my_data, list(4, "d", FALSE, 0))
str(my_data2)
```

```
## 'data.frame': 4 obs. of  4 variables:
##  $ var1: num  1 2 3 4
##  $ var2: chr  "a" "b" "c" "d"
##  $ var3: logi  TRUE FALSE TRUE FALSE
##  $ var4: num  3 2 1 0
```

---

Most data frames will have column names describing the variables. They can also include rownames, which we can add using `rownames()`.

```r
rownames(my_data2) <- c("Obs1", "Obs2", "Obs3", "Obs4")
my_data2
```

```
##      var1 var2  var3 var4
## Obs1    1    a  TRUE    3
## Obs2    2    b FALSE    2
## Obs3    3    c  TRUE    1
## Obs4    4    d FALSE    0
```

---
layout: false

# Tibbles

**Tibbles** are a special Tidyverse data frame from the `tibble` package. You can convert data frames to tibbles using `as_tibble()`, or you can create them similarly to data frames using `tibble()`.

The biggest benefit of tibbles is that they display more nicely in your R console: the output is automatically truncated and each column is labeled with its variable type.

Tidyverse has (rightfully) decided rownames are obsolete, and so tibbles do not include rownames by default. However, we can include our rownames as a variable using the parameter `rownames` in `as_tibble()`.

```r
library(tibble)
my_tibble <- as_tibble(my_data2, rownames = "Observation")
my_tibble
```

```
## # A tibble: 4 x 5
##   Observation  var1 var2  var3   var4
##   <chr>       <dbl> <chr> <lgl> <dbl>
## 1 Obs1            1 a     TRUE      3
## 2 Obs2            2 b     FALSE     2
## 3 Obs3            3 c     TRUE      1
## 4 Obs4            4 d     FALSE     0
```

---
layout: true

# Tidy Data Principles

---

There are three rules required for data to be considered tidy

* Each variable must have its own column
* Each observation must have its own row
* Each value must have its own cell

---

Seems simple, but can sometimes be tricky! What's untidy about the following data?
<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> Hospital </th> <th style="text-align:right;"> Diseased </th> <th style="text-align:right;"> Healthy </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 14 </td> </tr>
  <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 18 </td> </tr>
  <tr> <td style="text-align:left;"> C </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 13 </td> </tr>
  <tr> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 16 </td> </tr>
</tbody>
</table>

--

* **Observations:** the number of individuals at a given hospital and of a given disease status
* **Variables:** the hospital, the disease status, the counts
* **Values:** Hospital A, Hospital B, Hospital C, Hospital D, individual count values, *Disease Status Healthy*, *Disease Status Diseased*

---

Problem: column headers are values, not variables! How can we tidy it up?
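One way to do this reshaping by hand, sketched in base R. The object names `untidy` and `tidy` are made up for this illustration; in practice, `pivot_longer()` from the `tidyr` package is a common tool for the same job.

```r
# The untidy table: disease status is hidden in the column headers
untidy <- data.frame(
  Hospital = c("A", "B", "C", "D"),
  Diseased = c(10, 15, 12, 5),
  Healthy = c(14, 18, 13, 16)
)

# Stack the two count columns into one Count column, and record
# which column each count came from in a new Status column
tidy <- data.frame(
  Hospital = rep(untidy$Hospital, times = 2),
  Status = rep(c("Diseased", "Healthy"), each = nrow(untidy)),
  Count = c(untidy$Diseased, untidy$Healthy)
)

# Sort by hospital, then status, and reset the row labels
tidy <- tidy[order(tidy$Hospital, tidy$Status), ]
rownames(tidy) <- NULL
tidy
```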
--

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> Hospital </th> <th style="text-align:left;"> Status </th> <th style="text-align:right;"> Count </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> Diseased </td> <td style="text-align:right;"> 10 </td> </tr>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> Healthy </td> <td style="text-align:right;"> 14 </td> </tr>
  <tr> <td style="text-align:left;"> B </td> <td style="text-align:left;"> Diseased </td> <td style="text-align:right;"> 15 </td> </tr>
  <tr> <td style="text-align:left;"> B </td> <td style="text-align:left;"> Healthy </td> <td style="text-align:right;"> 18 </td> </tr>
  <tr> <td style="text-align:left;"> C </td> <td style="text-align:left;"> Diseased </td> <td style="text-align:right;"> 12 </td> </tr>
  <tr> <td style="text-align:left;"> C </td> <td style="text-align:left;"> Healthy </td> <td style="text-align:right;"> 13 </td> </tr>
  <tr> <td style="text-align:left;"> D </td> <td style="text-align:left;"> Diseased </td> <td style="text-align:right;"> 5 </td> </tr>
  <tr> <td style="text-align:left;"> D </td> <td style="text-align:left;"> Healthy </td> <td style="text-align:right;"> 16 </td> </tr>
</tbody>
</table>

---

Another example:

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> Country </th> <th style="text-align:right;"> Year </th> <th style="text-align:right;"> m1624 </th> <th style="text-align:right;"> m2534 </th> <th style="text-align:right;"> f1624 </th> <th style="text-align:right;"> f2534 </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:right;"> 49 </td> <td style="text-align:right;"> 55 </td> <td style="text-align:right;"> 47 </td> <td style="text-align:right;"> 41 </td> </tr>
  <tr> <td style="text-align:left;"> B
</td> <td style="text-align:right;"> 2018 </td> <td style="text-align:right;"> 34 </td> <td style="text-align:right;"> 33 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 43 </td> </tr>
</tbody>
</table>

--

* **Observations:** the number of individuals in a given country, in a given year, of a given gender, and in a given age group
* **Variables:** Country, year, gender, age group, counts
* **Values:** Country A, Country B, Year 2018, Gender "m", Gender "f", Age Group "1624", Age Group "2534", individual counts

---

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> Country </th> <th style="text-align:right;"> Year </th> <th style="text-align:left;"> Gender </th> <th style="text-align:left;"> Age_Group </th> <th style="text-align:right;"> Counts </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 16-24 </td> <td style="text-align:right;"> 49 </td> </tr>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 25-34 </td> <td style="text-align:right;"> 55 </td> </tr>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:left;"> f </td> <td style="text-align:left;"> 16-24 </td> <td style="text-align:right;"> 47 </td> </tr>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:left;"> f </td> <td style="text-align:left;"> 25-34 </td> <td style="text-align:right;"> 41 </td> </tr>
  <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 16-24 </td> <td style="text-align:right;"> 34 </td> </tr>
  <tr> <td style="text-align:left;"> B </td> <td
style="text-align:right;"> 2018 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 25-34 </td> <td style="text-align:right;"> 33 </td> </tr>
  <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:left;"> f </td> <td style="text-align:left;"> 16-24 </td> <td style="text-align:right;"> 50 </td> </tr>
  <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:left;"> f </td> <td style="text-align:left;"> 25-34 </td> <td style="text-align:right;"> 43 </td> </tr>
</tbody>
</table>

---

## How to tidy data?

1. Identify the observations, variables, and values
2. Ensure that each observation has its own row
    * Watch out for individual observations spread over multiple tables/Excel files/etc., or multiple types of observations within a single table (this would result in many empty cells)
3. Ensure that each variable has its own column
    * Watch out for variables spread over two columns, multiple variables within a single column, variables as rows
4. Ensure that each value has its own cell
    * Watch out for values as column headers

---

## Why tidy data?

* Easier to read data
* Easier to analyze and plot using standard software (required for `ggplot2`)
* Easier to understand what the data represents
* Fewer issues with missing values

---

## A final reference

Hadley Wickham is the ultimate resource on tidy data principles.

[Here is a fantastic reference going through all these principles in more detail and with more examples.](https://vita.had.co.nz/papers/tidy-data.pdf)

---
layout: false
class: inverse

.sectionhead[Part 6: Managing Data]

---
layout: true

# Working Directory

---

## Seeing your working directory

A **working directory** is the filepath R uses to save and look for data.
You can check your current working directory using `getwd()`

```r
getwd()
```

```
## [1] "/Volumes/GoogleDrive/My Drive/Git Repos/STAT302/docs/LectureSlides/lectureslides2_fundamentals"
```

This location is where R will look by default!

---

## Changing your working directory

You can change your working directory using `setwd()`.

```r
setwd("/Users/Bryan/Desktop/STAT302/")
```

You can use the shorthand `..` to reference a parent directory relative to where you are now.

```r
setwd("..")
getwd()
```

```
## [1] "/Volumes/GoogleDrive/My Drive/Git Repos/STAT302/docs/LectureSlides"
```

We can also reference the current directory using the shorthand `.`.

```r
setwd("./lectureslides2_fundamentals")
```

```r
getwd()
```

```
## [1] "/Volumes/GoogleDrive/My Drive/Git Repos/STAT302/docs/LectureSlides/lectureslides2_fundamentals"
```

---

## Working directories and R Markdown

Do not change your working directory inside R Markdown files! By default, an R Markdown file uses the folder it is saved in as its working directory. Changing this can (will) mess up your analysis, and make your work less reproducible.

---

## Saving Data

You can save single R objects as `.rds` files using `saveRDS()`, multiple R objects as `.RData` or `.rda` files using `save()`, and your entire workspace as `.RData` using `save.image()`.

```r
object1 <- 1:5
object2 <- c("a", "b", "c")
# save only object1
saveRDS(object1, file = "object1_only.rds")
# save object1 and object2
save(object1, object2, file = "both_objects.RData")
# save my entire workspace
save.image(file = "entire_workspace.RData")
```

In general, I recommend using `.RData` for multiple objects, and I recommend against using `save.image()`, basically ever.

`save.image()` should never be a part of your workflow. Personally, I only use it if I need to quickly close R and want to come back to exactly where I was later. (For example, a coffee shop I was working at closed). I will always delete the file later so it does not mess with my workflow.
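---

## Saving Data

A small round trip, sketched under the assumption that we write to a temporary folder (via `tempdir()`) rather than the working directory. The loading functions `readRDS()` and `load()` are covered on the Loading Data slide; the names `rds_path`, `rdata_path`, and `restored` are made up for this example.

```r
# Build filepaths inside a temporary folder
rds_path <- file.path(tempdir(), "object1_only.rds")
rdata_path <- file.path(tempdir(), "both_objects.RData")

object1 <- 1:5
object2 <- c("a", "b", "c")
saveRDS(object1, file = rds_path)
save(object1, object2, file = rdata_path)

# Remove the objects to mimic a fresh session
rm(object1, object2)

# readRDS() returns the object, so assign it to a name of your choice
restored <- readRDS(rds_path)

# load() restores objects under the names they were saved with
load(rdata_path)
```

This difference is one reason `.rds` is convenient for single objects: you decide the name when you read the file back in.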
---

## Loading Data

You can load `.rds` files using `readRDS()`, and `.RData` and `.rda` files using `load()`.

```r
# load only object1; readRDS() returns the object, so assign it to a name
object1 <- readRDS("object1_only.rds")
# load object1 and object2
load("both_objects.RData")
# load my entire workspace
load("entire_workspace.RData")
```

---

## Notes on Saving and Loading R Data

The values in quotes are all filepaths, and by default, R will search for these files in your current working directory.

You can change where R searches for files by adjusting this filepath. For example, if you save your data in a `Data` subfolder within your working directory, you might try

```r
load("./Data/my_data.RData")
```

---

## Other types of data

Often, you will read and write files as **c**omma **s**eparated **v**alues, or `.csv`. You can do this by navigating *File > Import Dataset* in the menu bar, but generally I recommend doing it in code using the `readr` package. You will need to do it in code if loading data is part of your workflow, such as when it is required for an R Markdown writeup.

```r
library(readr)
# read a .csv file in a "Data" subfolder and store it as a data frame
my_data <- read_csv("./Data/file.csv")
# save a data frame as a .csv file in a "Data" subfolder
write_csv(my_data, "./Data/data_output.csv")
```

`readr` can also handle many more types of data!

See more details about `readr` using the fantastic cheat sheet available [here.](https://rstudio.com/resources/cheatsheets/)

---

## Working Directories Summary

* Working directories are the default filepaths R uses to save and load files
* When working in a `.Rmd`, your default filepath is wherever the `.Rmd` is stored, and you should leave it there
* You can change your working directory with `setwd()`
* You can reference your current working directory using `.` and the parent directory of your current working directory using `..`

For larger analysis projects, I recommend using R projects to automatically manage your working directory for you!
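---

## Working Directories: a small demo

The summary points above can be sketched as a short console session. This is only an illustration, meant for the console rather than an `.Rmd` file: the folder `my_analysis` and the file `example.csv` are hypothetical and are created inside a temporary directory, and base R's `write.csv()` and `read.csv()` stand in for the `readr` functions so that no packages are needed.

```r
# Remember where we started so we can go back at the end
old_wd <- getwd()

# Hypothetical project folder with a "Data" subfolder
project_dir <- file.path(tempdir(), "my_analysis")
dir.create(file.path(project_dir, "Data"),
           recursive = TRUE, showWarnings = FALSE)
setwd(project_dir)

# "." is the working directory, so this path points inside my_analysis
write.csv(data.frame(x = 1:3), "./Data/example.csv", row.names = FALSE)

# Move down into Data, then back up to the parent with ".."
setwd("./Data")
inside_data <- basename(getwd())
setwd("..")
back_in_project <- basename(getwd())

# The same relative path works again from the project folder
example_data <- read.csv("./Data/example.csv")

# Restore the original working directory
setwd(old_wd)
```

Because every path here is relative, the same code would run unchanged if `my_analysis` were moved somewhere else, as long as the working directory is the project folder.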
---
layout: false
layout: true

# Projects

---

Good file organization requires you to keep all your input data, R scripts, output data and results, and figures together. You can do this using **Projects**.

You can create a project by going to *File > New Project*. If you want your project in a folder you have already created, select *Existing Directory*. If you want RStudio to automatically make you a new folder with a project, select *New Directory*. Then select *Empty Project* to create a standard project. This will create a `.Rproj` file on your computer.

When working with a project, save and manage your work as usual. When you close and re-open R, *do so by double-clicking on your `.Rproj` file!* This will automatically open everything as you left it, except your environment will be fresh, which helps with reproducibility.

---

## Benefits of Projects

* Automatically manages your working directory, even if you move the project file
* Remembers your working directory and command history, and all the files you were working on are still open
* Helps with reproducibility: you can share R project files, and the project will load on other computers exactly as it does on yours
* Helps keep your separate analyses separate. For example, you won't need to worry if you defined a variable `x` in multiple different analyses
* Easy to integrate with version control such as git (more on this later!)