class: center, top, title-slide

# STAT 302, Lecture Slides 4

## Regular Expressions and Strings

### Bryan Martin

---

# Outline

1. Regular Expressions
2. Strings
3. Factors
4. Dates and Times

.middler[**Goal:** Learn how to efficiently work with strings, factors, and dates!]

---

```r
library(tidyverse)
```

```
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
```

```
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
```

```
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()     masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag()        masks stats::lag()
```

---
class: inverse

.sectionhead[Part 1. Regular Expressions]

---
layout: true

# Regular Expressions

---

Many, many thanks to the developers of `stringr`, who developed these excellent demonstrations of regular expressions.

---

Recall that strings in R are characters grouped together within double quotes `""` or single quotes `''`.

However, there are characters that cannot be typed directly into an R string. For example, what if your string contains a double quote, or a line break?

A line break in an R string is written with the escape sequence `\n`. For example:

```r
cat("Line 1\nLine 2")
```

```
## Line 1
## Line 2
```

However, the backslash `\` is a special character, both in R strings and in regular expressions. Thus, to put a literal `\` into an R string, we must precede the backslash with its own backslash! For example, typing `"\\"` in R creates a string containing the single character `\`.

Because we write regular expressions as R strings, the same rule applies to them: if we want to use a new line in a regular expression, we need to type `"\\n"`, giving us the regular expression `\n`.

Confusing? Don't worry, we will go through many examples.
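The escaping rule above can be checked directly in the console. This is only a minimal sketch: the variable names `two_chars` and `one_char` are made up for illustration, and it assumes `stringr` is installed.

```r
library(stringr)

# Typing "\\n" creates a string with TWO characters: a backslash and the letter n.
two_chars <- "\\n"
str_length(two_chars)
writeLines(two_chars)   # shows the raw contents of the string: \n

# Typing "\n" creates a string with ONE character: an actual line break.
one_char <- "\n"
str_length(one_char)

# As a pattern, "\\n" is read by the regex engine as \n and matches a line break.
str_detect("Line 1\nLine 2", "\\n")
```

`writeLines()` is used here instead of `print()` because `print()` shows the escaped form of a string, while `writeLines()` shows the characters it actually contains.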
---

```r
# Helper: extract every match of the pattern rx from a test string
# and paste the matches together into a single string
see <- function(rx) {
  str_extract_all("abc ABC 123\t.!?\\(){}\n", rx) %>%
    unlist() %>%
    str_c(collapse = "")
}
print("abc ABC 123\t.!?\\(){}\n")
```

```
## [1] "abc ABC 123\t.!?\\(){}\n"
```

```r
see("a")
```

```
## [1] "a"
```

---

```r
see(".")
```

```
## [1] "abc ABC 123\t.!?\\(){}"
```

What happened here? `.` is a special character in a regular expression: it matches any character except a new line. If we want to search for a literal `.`, we must escape it:

```r
see("\\.")
```

```
## [1] "."
```

```r
see("?")
```

```
## Error in stri_extract_all_regex(string, pattern, simplify = simplify, : Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX, context=`?`)
```

`?` is also a special character, so on its own it is not a valid pattern. To search for that symbol, we escape it too:

```r
see("\\?")
```

```
## [1] "?"
```

---

The same goes for other special characters.

```r
see("\\}")
```

```
## [1] "}"
```

```r
see("\\(")
```

```
## [1] "("
```

What if we want to search for the backslash itself?

```r
see("\\\\")
```

```
## [1] "\\"
```

---

We can search for special commands, such as...
* New line `\n`

```r
see("\\n")
```

```
## [1] "\n"
```

* Tab `\t`

```r
see("\\t")
```

```
## [1] "\t"
```

* Any whitespace `\s`

```r
see("\\s")
```

```
## [1] " \t\n"
```

---

* Any digit `\d`

```r
see("\\d")
```

```
## [1] "123"
```

* Any word character `\w`

```r
see("\\w")
```

```
## [1] "abcABC123"
```

* Any word boundary `\b`

```r
see("\\b")
```

```
## [1] ""
```

(note this doesn't return any characters, because a boundary is a position between characters rather than a character itself, but try it with `str_view()` in your console)

---

We can search for groups of characters such as

* digits `[:digit:]`

```r
see("[:digit:]")
```

```
## [1] "123"
```

* letters `[:alpha:]`

```r
see("[:alpha:]")
```

```
## [1] "abcABC"
```

* letters and numbers `[:alnum:]`

```r
see("[:alnum:]")
```

```
## [1] "abcABC123"
```

---

* lowercase letters `[:lower:]`

```r
see("[:lower:]")
```

```
## [1] "abc"
```

* uppercase letters `[:upper:]`

```r
see("[:upper:]")
```

```
## [1] "ABC"
```

* punctuation `[:punct:]`

```r
see("[:punct:]")
```

```
## [1] ".!?\\(){}"
```

---

* letters, numbers, and punctuation `[:graph:]`

```r
see("[:graph:]")
```

```
## [1] "abcABC123.!?\\(){}"
```

* space characters `[:space:]`

```r
see("[:space:]")
```

```
## [1] " \t\n"
```

* space and tab (not new line) `[:blank:]`

```r
see("[:blank:]")
```

```
## [1] " \t"
```

---

## Alternates

```r
alt <- function(rx) {
  str_extract_all("abcde", rx) %>%
    unlist() %>%
    str_c(collapse = "")
}
```

* or `|`

```r
alt("a|e")
```

```
## [1] "ae"
```

```r
alt("ab|d")
```

```
## [1] "abd"
```

---

## Alternates

* one of `[bae]`

```r
alt("[bae]")
```

```
## [1] "abe"
```

* anything but `[^bae]`

```r
alt("[^bae]")
```

```
## [1] "cd"
```

* range of values `[a-c]`

```r
alt("[a-c]")
```

```
## [1] "abc"
```

---
layout: false
class: inverse

.sectionhead[Part 2: Strings]

---

# Strings

## What are strings?

A **character** is a symbol from a written language, such as a letter, a digit, a punctuation mark, or otherwise.

A **string** is a collection of characters grouped together (such as a word).
Even when we do quantitative work, lots of important and interesting data often comes in the form of strings, and knowing how to work with them is essential for data scientists and statisticians.

---

.middler[<img src="stringr.png" alt="" height="350"/>]

---
layout: false
class: inverse

.sectionhead[String Lengths]

---

# <TT>str_length()</TT>: number of characters

```r
str_length("a")
```

```
## [1] 1
```

```r
str_length("abc")
```

```
## [1] 3
```

```r
str_length(c("a", "ab", "abc"))
```

```
## [1] 1 2 3
```

---

# <TT>str_trim()</TT>: trim whitespace on ends

```r
str_trim("cats and dogs")
```

```
## [1] "cats and dogs"
```

```r
str_trim(" cats and dogs")
```

```
## [1] "cats and dogs"
```

```r
str_trim("cats and dogs ")
```

```
## [1] "cats and dogs"
```

```r
str_trim(" cats and dogs ")
```

```
## [1] "cats and dogs"
```

```r
str_trim(c("cats", " dogs", "cows ", " chickens "))
```

```
## [1] "cats" "dogs" "cows" "chickens"
```

---
class: inverse

.sectionhead[Subsetting Strings]

---

# <TT>str_sub()</TT>: substring by indices

* Given one positive value: starting index
* Given one negative value: starting index from end
* Given two values: starting and ending index (can be positive or negative)

```r
strings <- c("strawberry", "banana", "blueberry", "apple", "blackberry", "lemon")
str_sub(strings, 3)
```

```
## [1] "rawberry" "nana" "ueberry" "ple" "ackberry" "mon"
```

```r
str_sub(strings, 1)
```

```
## [1] "strawberry" "banana" "blueberry" "apple" "blackberry"
## [6] "lemon"
```

```r
str_sub(strings, -2)
```

```
## [1] "ry" "na" "ry" "le" "ry" "on"
```

```r
str_sub(strings, -5)
```

```
## [1] "berry" "anana" "berry" "apple" "berry" "lemon"
```

---

# <TT>str_sub()</TT>: substring by indices

* Given one positive value: starting index
* Given one negative value: starting index from end
* Given two values: starting and ending index (can be positive or negative)

```r
strings
```

```
## [1] "strawberry" "banana" "blueberry" "apple" "blackberry"
## [6] "lemon"
```

```r
str_sub(strings, 1, 3)
```

```
## [1] "str" "ban" "blu" "app" "bla" "lem"
```

```r
str_sub(strings, 2, 6)
```

```
## [1] "trawb" "anana" "luebe" "pple" "lackb" "emon"
```

```r
str_sub(strings, 3, -4)
```

```
## [1] "rawbe" "n" "uebe" "" "ackbe" ""
```

---

# <TT>str_subset()</TT>: subset by pattern

```r
strings
```

```
## [1] "strawberry" "banana" "blueberry" "apple" "blackberry"
## [6] "lemon"
```

```r
str_subset(strings, "a")
```

```
## [1] "strawberry" "banana" "apple" "blackberry"
```

```r
str_subset(strings, "berry")
```

```
## [1] "strawberry" "blueberry" "blackberry"
```

```r
str_subset(strings, "apple")
```

```
## [1] "apple"
```

```r
str_subset(strings, "appel")
```

```
## character(0)
```

---

# <TT>str_extract()</TT>: extract by pattern

```r
strings
```

```
## [1] "strawberry" "banana" "blueberry" "apple" "blackberry"
## [6] "lemon"
```

```r
str_extract(strings, "a")
```

```
## [1] "a" "a" NA "a" "a" NA
```

```r
str_extract(strings, "berry")
```

```
## [1] "berry" NA "berry" NA "berry" NA
```

```r
str_extract(strings, "apple")
```

```
## [1] NA NA NA "apple" NA NA
```

```r
str_extract(strings, "[aeiou]")
```

```
## [1] "a" "a" "u" "a" "a" "e"
```

---
class: inverse

.sectionhead[Matching]

---

# <TT>str_detect()</TT>: Booleans for matching

```r
strings
```

```
## [1] "strawberry" "banana" "blueberry" "apple" "blackberry"
## [6] "lemon"
```

```r
str_detect(strings, "a")
```

```
## [1] TRUE TRUE FALSE TRUE TRUE FALSE
```

```r
str_detect(strings, "berry")
```

```
## [1] TRUE FALSE TRUE FALSE TRUE FALSE
```

```r
str_detect(strings, "[aeiou]")
```

```
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
```

---

# <TT>str_which()</TT>: index for matching

Note: this returns the index of the matching string, not the index of the matching character within the string!
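To make that distinction concrete, here is a small sketch on a made-up vector `fruits` (not part of the original examples). It contrasts element indices from `str_which()` with within-string positions from `str_locate()`, which is covered on the following slides.

```r
library(stringr)

# Hypothetical example vector, used only for this illustration
fruits <- c("kiwi", "mango", "banana")

# Which ELEMENTS of the vector contain "an"? (elements 2 and 3)
idx <- str_which(fruits, "an")
idx

# str_which() gives the same answer as which() applied to str_detect()
same <- which(str_detect(fruits, "an"))
same

# Where WITHIN each string does the first "an" start? (NA if no match)
pos <- str_locate(fruits, "an")[, "start"]
pos
```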
```r
strings
```

```
## [1] "strawberry" "banana" "blueberry" "apple" "blackberry"
## [6] "lemon"
```

```r
str_which(strings, "a")
```

```
## [1] 1 2 4 5
```

```r
str_which(strings, "berry")
```

```
## [1] 1 3 5
```

```r
str_which(strings, "[aeiou]")
```

```
## [1] 1 2 3 4 5 6
```

---

# <TT>str_locate()</TT>: position for matching

Note: this returns the start and end position of the *first* match within each string!

```r
strings
```

```
## [1] "strawberry" "banana" "blueberry" "apple" "blackberry"
## [6] "lemon"
```

```r
str_locate(strings, "a")
```

```
##      start end
## [1,]     4   4
## [2,]     2   2
## [3,]    NA  NA
## [4,]     1   1
## [5,]     3   3
## [6,]    NA  NA
```

---

# <TT>str_locate()</TT>: position for matching

Note: this returns the start and end position of the *first* match within each string!

```r
str_locate(strings, "berry")
```

```
##      start end
## [1,]     6  10
## [2,]    NA  NA
## [3,]     5   9
## [4,]    NA  NA
## [5,]     6  10
## [6,]    NA  NA
```

```r
str_locate(strings, "[aeiou]")
```

```
##      start end
## [1,]     4   4
## [2,]     2   2
## [3,]     3   3
## [4,]     1   1
## [5,]     3   3
## [6,]     2   2
```

---

# <TT>str_count()</TT>: count matches

```r
strings
```

```
## [1] "strawberry" "banana" "blueberry" "apple" "blackberry"
## [6] "lemon"
```

```r
str_count(strings, "a")
```

```
## [1] 1 3 0 1 1 0
```

```r
str_count(strings, "berry")
```

```
## [1] 1 0 1 0 1 0
```

```r
str_count(strings, "[aeiou]")
```

```
## [1] 2 3 3 2 2 2
```

---
class: inverse

.sectionhead[Joining and Splitting]

---

# <TT>str_c()</TT>: join multiple strings

Use `sep = ` to set the separating string

```r
str_c(c("a", "b", "c"), c("1", "2", "3"))
```

```
## [1] "a1" "b2" "c3"
```

```r
str_c(c("a", "b", "c"), c("1", "2", "3"), sep = "_")
```

```
## [1] "a_1" "b_2" "c_3"
```

```r
str_c(c("a", "b", "c"), c("1", "2", "3"), sep = "!@#$")
```

```
## [1] "a!@#$1" "b!@#$2" "c!@#$3"
```

---

# <TT>str_c()</TT>: collapse a string vector

Use `collapse = ` to set the combining string

```r
str_c(c("a", "b", "c"), collapse = "")
```

```
## [1] "abc"
```

```r
str_c(c("a", "b", "c"), collapse = "_")
```

```
## [1] "a_b_c"
```

```r
str_c(c("a", "b", "c"), c("1", "2", "3"), collapse = "")
```

```
## [1] "a1b2c3"
```

---

# <TT>str_split_fixed()</TT>: split string

`str_split_fixed(string, pattern, n)`, where `n` is the maximum number of pieces after splitting. Use `Inf` for all possible splits.

```r
str_split_fixed(c("a", "a b", "a b c"), " ", 2)
```

```
##      [,1] [,2]
## [1,] "a"  ""
## [2,] "a"  "b"
## [3,] "a"  "b c"
```

```r
str_split_fixed(c("a", "a b", "a b c"), " ", Inf)
```

```
##      [,1] [,2] [,3]
## [1,] "a"  ""   ""
## [2,] "a"  "b"  ""
## [3,] "a"  "b"  "c"
```

---

# <TT>str_split_fixed()</TT>: split string

`str_split_fixed(string, pattern, n)`, where `n` is the maximum number of pieces after splitting.

```r
strings
```

```
## [1] "strawberry" "banana" "blueberry" "apple" "blackberry"
## [6] "lemon"
```

```r
str_split_fixed(strings, "a", Inf)
```

```
##      [,1]        [,2]      [,3] [,4]
## [1,] "str"       "wberry"  ""   ""
## [2,] "b"         "n"       "n"  ""
## [3,] "blueberry" ""        ""   ""
## [4,] ""          "pple"    ""   ""
## [5,] "bl"        "ckberry" ""   ""
## [6,] "lemon"     ""        ""   ""
```

---
class: inverse

.sectionhead[Mutate Strings]

---

# <TT>str_replace()</TT>: replace first match

`str_replace(string, pattern, replacement)`

```r
strings
```

```
## [1] "strawberry" "banana" "blueberry" "apple" "blackberry"
## [6] "lemon"
```

```r
str_replace(strings, "a", "A")
```

```
## [1] "strAwberry" "bAnana" "blueberry" "Apple" "blAckberry"
## [6] "lemon"
```

```r
str_replace(strings, "berry", "123")
```

```
## [1] "straw123" "banana" "blue123" "apple" "black123" "lemon"
```

```r
str_replace(strings, "[aeiou]", "y")
```

```
## [1] "strywberry" "bynana" "blyeberry" "ypple" "blyckberry"
## [6] "lymon"
```

---

# <TT>str_replace_all()</TT>: replace matches

`str_replace_all(string, pattern, replacement)`

```r
strings
```

```
## [1] "strawberry" "banana" "blueberry" "apple" "blackberry"
## [6] "lemon"
```

```r
str_replace_all(strings, "a", "A")
```

```
## [1] "strAwberry" "bAnAnA" "blueberry" "Apple" "blAckberry"
## [6] "lemon"
```

```r
str_replace_all(strings, "berry", "123")
```

```
## [1] "straw123" "banana" "blue123" "apple" "black123" "lemon"
```

```r
str_replace_all(strings, "[aeiou]", "y")
```

```
## [1] "strywbyrry" "bynyny" "blyybyrry" "ypply" "blyckbyrry"
## [6] "lymyn"
```

---

# Changing case

* `str_to_lower()` make lowercase

```r
str_to_lower(c("A STRING", "A sTrInG", "A String", "a string", "A STRING!!1"))
```

```
## [1] "a string" "a string" "a string" "a string" "a string!!1"
```

* `str_to_upper()` make uppercase

```r
str_to_upper(c("A STRING", "A sTrInG", "A String", "a string", "A STRING!!1"))
```

```
## [1] "A STRING" "A STRING" "A STRING" "A STRING" "A STRING!!1"
```

* `str_to_title()` make title case

```r
str_to_title(c("A STRING", "A sTrInG", "A String", "a string", "A STRING!!1"))
```

```
## [1] "A String" "A String" "A String" "A String" "A String!!1"
```

---
class: inverse

.sectionhead[Order Strings]

---

# <TT>str_order()</TT>: get sorting vector

Options: `decreasing`, `na_last`, `numeric`

```r
strings
```

```
## [1] "strawberry" "banana" "blueberry" "apple" "blackberry"
## [6] "lemon"
```

```r
str_order(strings)
```

```
## [1] 4 2 5 3 6 1
```

```r
strings[str_order(strings)]
```

```
## [1] "apple" "banana" "blackberry" "blueberry" "lemon"
## [6] "strawberry"
```

```r
strings[str_order(strings, decreasing = TRUE)]
```

```
## [1] "strawberry" "lemon" "blueberry" "blackberry" "banana"
## [6] "apple"
```

---

# <TT>str_sort()</TT>: sort string vector

Options: `decreasing`, `na_last`, `numeric`

```r
strings
```

```
## [1] "strawberry" "banana" "blueberry" "apple" "blackberry"
## [6] "lemon"
```

```r
str_sort(strings)
```

```
## [1] "apple" "banana" "blackberry" "blueberry" "lemon"
## [6] "strawberry"
```

```r
str_sort(strings, decreasing = TRUE)
```

```
## [1] "strawberry" "lemon" "blueberry" "blackberry" "banana"
## [6] "apple"
```

---

# <TT>str_sort()</TT>: sort string vector

Options: `decreasing`, `na_last`, `numeric`

```r
nums <- c("1", "2", "3", NA, "11", "120", "010")
str_sort(nums)
```

```
## [1] "010" "1" "11" "120" "2" "3" NA
```

```r
str_sort(nums, na_last = FALSE)
```

```
## [1] NA "010" "1" "11" "120" "2" "3"
```

```r
str_sort(nums, numeric = TRUE)
```

```
## [1] "1" "2" "3" "010" "11" "120" NA
```

---

# <TT>stringr</TT> cheatsheet

* Manage lengths: `str_length()`, `str_trim()`
* Subsetting: `str_sub()`, `str_subset()`, `str_extract()`
* Matching: `str_detect()`, `str_which()`, `str_locate()`, `str_count()`
* Joining and Splitting: `str_c()`, `str_split_fixed()`
* Mutate: `str_replace()`, `str_replace_all()`, `str_to_lower()`, `str_to_upper()`, `str_to_title()`

.pushdown[.center[[And more! Click me for a cheat sheet!](https://rstudio.com/resources/cheatsheets/) <img src="stringr.png" alt="" height="150"/>]]

---
class: inverse

.sectionhead[Part 3. Factors]

---

.middler[<img src="forcats.png" alt="" height="350"/>]

---

# Factors

Recall... **factors** are categorical data stored using an integer representation.

This can be an efficient way to store character vectors, because each distinct string is only stored once. Because of this, functions that create data frames (but not tibbles!) in R have often defaulted to converting strings to factors (this was the default behavior of `data.frame()` and `read.csv()` before R 4.0.0).
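The integer representation can be inspected with base R alone. A minimal sketch, using a made-up vector of sizes (the name `sizes` is only for illustration):

```r
# Hypothetical example data, used only for this illustration
sizes <- factor(c("low", "high", "low", "medium"))

# Each distinct string is stored once, as a level
# (levels are sorted alphabetically by default)
levels(sizes)

# The data themselves are stored as integer codes pointing into the levels
codes <- as.integer(sizes)
codes

# Looking the codes up in the levels recovers the original strings
levels(sizes)[codes]
```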
---

# <TT>factor()</TT>: create a factor

```r
(f1 <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c")))
```

```
## [1] a b c a
## Levels: a b c
```

```r
factor(c("a", "b", "c", "a"), levels = c("a", "b", "d"))
```

```
## [1] a    b    <NA> a
## Levels: a b d
```

```r
(f2 <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c", "d")))
```

```
## [1] a b c a
## Levels: a b c d
```

---

# <TT>factor()</TT>: create a factor

```r
f1[5] <- "d"
```

```
## Warning in `[<-.factor`(`*tmp*`, 5, value = "d"): invalid factor level, NA
## generated
```

```r
f1
```

```
## [1] a    b    c    a    <NA>
## Levels: a b c
```

```r
f2[5] <- "d"
f2
```

```
## [1] a b c a d
## Levels: a b c d
```

---

# <TT>fct_count()</TT>: count levels

```r
f1
```

```
## [1] a    b    c    a    <NA>
## Levels: a b c
```

```r
fct_count(f1)
```

```
## # A tibble: 4 x 2
##   f         n
##   <fct> <int>
## 1 a         2
## 2 b         1
## 3 c         1
## 4 <NA>      1
```

---

# <TT>fct_count()</TT>: count levels

```r
f2
```

```
## [1] a b c a d
## Levels: a b c d
```

```r
fct_count(f2)
```

```
## # A tibble: 4 x 2
##   f         n
##   <fct> <int>
## 1 a         2
## 2 b         1
## 3 c         1
## 4 d         1
```

---

# <TT>fct_unique()</TT>: unique levels

```r
f1
```

```
## [1] a    b    c    a    <NA>
## Levels: a b c
```

```r
fct_unique(f1)
```

```
## [1] a b c
## Levels: a b c
```

```r
f2
```

```
## [1] a b c a d
## Levels: a b c d
```

```r
fct_unique(f2)
```

```
## [1] a b c d
## Levels: a b c d
```

---

# <TT>fct_c()</TT>: combine factors

This can be useful if all the levels were not included initially!
```r
f_small_1 <- factor(c("b", "a"), levels = c("a", "b"))
f_small_2 <- factor(c("a", "c"), levels = c("a", "c"))
fct_c(f_small_1, f_small_2)
```

```
## [1] b a a c
## Levels: a b c
```

Compare to base `c()`, which (in the R version used here) returns the underlying integer codes:

```r
c(f_small_1, f_small_2)
```

```
## [1] 2 1 1 2
```

---

# <TT>fct_relevel()</TT>: manually relevel

```r
f2
```

```
## [1] a b c a d
## Levels: a b c d
```

```r
fct_relevel(f2, c("b", "d", "a", "c"))
```

```
## [1] a b c a d
## Levels: b d a c
```

```r
fct_relevel(f2, c("b", "d", "a"))
```

```
## [1] a b c a d
## Levels: b d a c
```

---

# <TT>fct_relevel()</TT>: manually relevel

```r
f2
```

```
## [1] a b c a d
## Levels: a b c d
```

```r
as.numeric(f2)
```

```
## [1] 1 2 3 1 4
```

```r
fct_relevel(f2, c("b", "d", "a", "c")) %>% as.numeric
```

```
## [1] 3 1 4 3 2
```

---

# <TT>fct_drop()</TT>: drop unused levels

By default, drops all unused levels. Alternatively, supply levels to drop.

```r
f3 <- factor(c("a", "b", "b", "a"), levels = c("a", "b", "c", "d"))
fct_drop(f3)
```

```
## [1] a b b a
## Levels: a b
```

```r
fct_drop(f3, only = "d")
```

```
## [1] a b b a
## Levels: a b c
```

---

# <TT>fct_expand()</TT>: add levels

Supply the new levels to add. They are appended after the existing levels.
```r
f3 <- factor(c("a", "b", "b", "a"), levels = c("a", "b"))
fct_expand(f3, "c")
```

```
## [1] a b b a
## Levels: a b c
```

```r
fct_expand(f3, "c", "d")
```

```
## [1] a b b a
## Levels: a b c d
```

---

# <TT>fct_recode()</TT>: recode levels

```r
f2
```

```
## [1] a b c a d
## Levels: a b c d
```

```r
fct_recode(f2, x = "a")
```

```
## [1] x b c x d
## Levels: x b c d
```

```r
fct_recode(f2, x = "a", y = "b", z = "c", w = "d")
```

```
## [1] x y z x w
## Levels: x y z w
```

---

# <TT>fct_collapse()</TT>: collapse levels

```r
f2
```

```
## [1] a b c a d
## Levels: a b c d
```

```r
fct_collapse(f2, x = c("a", "b"))
```

```
## [1] x x c x d
## Levels: x c d
```

---

# <TT>fct_other()</TT>: replace w/ "Other"

```r
f2
```

```
## [1] a b c a d
## Levels: a b c d
```

```r
fct_other(f2, keep = "a")
```

```
## [1] a     Other Other a     Other
## Levels: a Other
```

```r
fct_other(f2, keep = c("a", "b"))
```

```
## [1] a     b     Other a     Other
## Levels: a b Other
```

---

# <TT>forcats</TT> cheatsheet

* Create a factor: `factor(..., levels = ...)`
* Count levels: `fct_count()`
* Unique levels: `fct_unique()`
* Combine factor vectors: `fct_c()`
* Relevel: `fct_relevel()`
* Drop levels: `fct_drop()`
* Add levels: `fct_expand()`
* Recode levels: `fct_recode()`
* Collapse levels: `fct_collapse()`
* "Other" level: `fct_other()`

.center[[And more! Click me for a cheat sheet!](https://rstudio.com/resources/cheatsheets/) <img src="forcats.png" alt="" height="150"/>]

---
class: inverse

.sectionhead[Part 4: Dates and Times]

---

.middler[<img src="lubridate.png" alt="" height="350"/>]

---

```r
library(lubridate)
```

```
## 
## Attaching package: 'lubridate'
```

```
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
```

---
layout: true

# Parsing Date-times

---

**Dates** and **date-times** are special classes of objects in R.
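As a quick illustration of what "special classes" means, the sketch below parses the same day once as a date and once as a date-time, then inspects the class of each result. The variable names `d` and `dt` are made up for this example, and it assumes `lubridate` is installed.

```r
library(lubridate)

# A date: stored with class "Date"
d <- ymd("2020-01-29")
class(d)

# A date-time: stored with class "POSIXct" (lubridate parses to UTC by default)
dt <- ymd_hms("2020-01-29 15:29:59")
class(dt)

# A plain string that merely looks like a date is still just a character vector
class("2020-01-29")
```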
`lubridate` does a fantastic job of taking a variety of inputs and converting them into a standardized format. Its parsing functions are named after the order of the components in the input, using for dates:

* **y** for year
* **m** for month
* **d** for day
* **q** for quarter

and for times:

* **h** for hour
* **m** for minute
* **s** for second

These letters combine into more functions, accepting more input formats, than we are able to show, but we'll go through several examples.

---

Ordering can be changed arbitrarily.

```r
mdy("01-29-2020")
```

```
## [1] "2020-01-29"
```

```r
dmy("29-01-2020")
```

```
## [1] "2020-01-29"
```

```r
ymd("2020-01-29")
```

```
## [1] "2020-01-29"
```

```r
ydm("2020-29-01")
```

```
## [1] "2020-01-29"
```

---

It accepts a variety of input formats.

```r
mdy("Jan 29, 2020")
```

```
## [1] "2020-01-29"
```

```r
dmy("29th of January, 2020")
```

```
## [1] "2020-01-29"
```

```r
mdy("01/29/20")
```

```
## [1] "2020-01-29"
```

```r
ymd("20200129")
```

```
## [1] "2020-01-29"
```

```r
ymd("2020-01-29")
```

```
## [1] "2020-01-29"
```

---

We can add times, and even quarters.

```r
yq("2020: Q1")
```

```
## [1] "2020-01-01"
```

```r
yq("2020 Quarter 1")
```

```
## [1] "2020-01-01"
```

```r
dmy_h("29 Jan 2020 at 2pm")
```

```
## [1] "2020-01-29 14:00:00 UTC"
```

```r
mdy_hms("Jan 29th 2020, 4:10:43")
```

```
## [1] "2020-01-29 04:10:43 UTC"
```

---
layout: false
layout: true

# Extracting Date-time Components

---

When we have an object in date-time form, we can easily extract information.
```r
(x <- ymd_hms("2020-01-29, 3:29:59 pm", tz = "US/Pacific"))
```

```
## [1] "2020-01-29 15:29:59 PST"
```

```r
date(x)
```

```
## [1] "2020-01-29"
```

```r
year(x)
```

```
## [1] 2020
```

```r
month(x)
```

```
## [1] 1
```

```r
day(x)
```

```
## [1] 29
```

---

```r
hour(x)
```

```
## [1] 15
```

```r
minute(x)
```

```
## [1] 29
```

```r
second(x)
```

```
## [1] 59
```

```r
tz(x)
```

```
## [1] "US/Pacific"
```

---

```r
wday(x) # day of week
```

```
## [1] 4
```

```r
wday(x, label = TRUE)
```

```
## [1] Wed
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
```

```r
week(x) # week of year
```

```
## [1] 5
```

```r
quarter(x) # quarter of year
```

```
## [1] 1
```

---

```r
dst(x) # is it Daylight Saving Time?
```

```
## [1] FALSE
```

```r
leap_year(x) # is it a leap year?
```

```
## [1] TRUE
```

```r
am(x)
```

```
## [1] FALSE
```

```r
pm(x)
```

```
## [1] TRUE
```

---

We can also edit date-time objects.

```r
x
```

```
## [1] "2020-01-29 15:29:59 PST"
```

```r
hour(x) <- 13
year(x) <- 2021
x
```

```
## [1] "2021-01-29 13:29:59 PST"
```

---
layout: false

# Tell R when you have date-times!

When working with date-time data, it is important to tell R that you are working with date-times by parsing them with `lubridate`! If you do not, you may get an error that looks like this:

```r
x <- "01/29/2020"
day(x)
```

```
## Error in as.POSIXlt.character(x, tz = tz(x)): character string is not in a standard unambiguous format
```

```r
y <- mdy(x)
day(y)
```

```
## [1] 29
```

---

# <TT>lubridate</TT> cheatsheet

* Dates: `y` year, `m` month, `d` day, `q` quarter
* Times: `h` hour, `m` minute, `s` second
* Extracting components: `date()`, `year()`, `month()`, `day()`, `hour()`, `minute()`, `second()`

You can do much more than we covered here, such as intervals, arithmetic, durations, rounding, and periods!

.center[[Click me for a cheat sheet!](https://rstudio.com/resources/cheatsheets/) <img src="lubridate.png" alt="" height="150"/>]