---
title: "A First Course in Data Science and Machine Learning in R"
author: "Professor Adam Kapelner, Queens College, CUNY"
date: "Jan 28, 2022"
---

This work is a textbook developed from teaching Math 342W (data science fundamentals) at Queens College between 2018 and 2022. This work is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).

Acknowledgement: I would like to thank Professor Andreas Buja at Wharton as I cribbed lots of his notes (see http://stat.wharton.upenn.edu/~buja/STAT-961/) as well as Yonatan Rosenblatt of Ben Gurion University (see http://www.john-ros.com/Rcourse). There are countless others who helped along this journey. I would like to thank Queens College, CUNY (QC, my employer); Leila Walker (Assistant Professor and Digital Scholarship Librarian at QC, who also runs the OER fellowship which I completed in the Spring of 2022); Alan Sultan (my boss and chair of the Mathematics Department at QC), who pushed me to create new curriculum for the school's nascent data science program; and many more.

## Basic R

What is R?

* An open-source, free standard for statistical and numerical programming
* Interpreted, scripted and high-level (like Python and not like e.g. C++)
* Has a tremendous developer community and industry community
* I joke that I'm an R developer for a living since I publish research in statistical methodology which comes with R packages
* History: came from Bell Labs. C --> S --> R

Upsides?

* Very good for "one-shot" scripting - very common in data science
* New stats algorithms and methodologies appear here first! Waaay before Python.

Downsides?

* Slow (but this can be fixed by inlining other languages). This is mostly due to the fact that it is an interpreted and dynamically-typed language.
This means that the code is first "interpreted" and then converted to machine code and then executed (see https://subscription.packtpub.com/book/application_development/9781783989263/1/ch01lvl1sec09/r-is-interpreted-on-the-fly). This has advantages though.
* Bad threading / parallelization support (ditto)
* Bad support for hashmaps and other inconveniences

What is RStudio?

* An integrated development environment (IDE). It is not the only IDE for R, but it is the most popular at the time of this writing. It consists of the file window (what you're reading) and the console window (below) and a few other convenient windows which will be explained later.

File types:

* The standard R script has a .R extension
* This file, if you note, has an .Rmd extension meaning "R Markdown". Markdown is a way of typesetting simple documents with basic code directives. You can read up on this by yourself. "R Markdown" is a document type that combines R code within markdown documents where the R code is in "chunks". You can get started with this cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf

Let's see some R! Since R is interpreted, you can run commands directly in its console by using ctrl-enter line-by-line or ctrl-shift-enter for the entire "chunk":

```{r}
5
```

and

```{r}
7
8
9
```

These are not the most exciting commands. The above is called a "chunk" and it tells the R Markdown compiler that there is R code inside those tick marks. Here's some R code commented:

```{r}
5 #This will produce the number 5
```

Everything beyond the "#" is a comment and is ignored during interpretation. Let's start by using R as a calculator

```{r}
5 + 5
5 - 5
5 / 5
5 * 5
5^5 #or 5 ** 5 (not as common)
5 / 3
100 * (1 + 0.03)^5
5 %% 3
5 * 5 %% 3
(5 * 5) %% 3 #note order of operations - use parentheses
```

You can have multiple statements per line. The semicolon ends a statement like in C++ but it is usually omitted because the end of the line implies it.
```{r}
5 + 5; 5 - 5
```

Multiple statements per line is not generally done unless you are making a point. You can also extend expressions over many lines if you have a dangling operator at the end of the line e.g. this plus sign:

```{r}
5 +
5 +
5
```

Avoid this unless the line gets so long it's difficult to read. Note the "+" at the end of the line so the interpreter knows to keep looking to the next line. Let's take a look at how to represent numbers using scientific notation

```{r}
3,125 #commas are illegal and note that RStudio finds basic errors for us
3.125e3 #scientific notation
3.125E3
```

Every language has a means of invoking functions. Here are some included famous math functions. R uses parentheses to invoke functions. The quantities within the parentheses are known as "parameters" or "arguments" to the function.

```{r}
sqrt(2)
log(2)
log(2, 2)
log(2, base = 2) #This is preferred: specify the argument by name after the first argument.
abs(-1)
floor(1.2)
ceiling(1.2)
round(1.2345)
round(1.2345, 2)
log10(1000)
sin(pi) #pi is a hard-wired numerical constant, the only one in the language (I think)
#note the numerical error
acos(1)
exp(1)
```

We can control the number of digits via

```{r}
options(digits = 3)
3.123456789E-6
options(digits = 7) #the default
3.123456789E-6
```

There are lots of options. Look at the help file for a function by using the question mark:

```{r}
?options
```

Let's talk about basic data types:

```{r}
TRUE #logical
FALSE
T #please don't use this constant in your programs; use TRUE.
F #please don't use this constant in your programs; use FALSE
class(TRUE)
1 #numeric, think of double type in C++ or Java
class(1)
1L
class(1L) #what's the L doing? It's forcing a certain type
```

There are more formal ways to "force" AKA "cast":

```{r}
class(as.integer(1))
class(as.numeric(1L)) #what's the L doing? It's forcing a certain type
1.L #what's this? Not an error but a warning
1e3 #scientific notation convenience
1E3
class(1E3)
"A"
class("A")
as.integer(TRUE)
as.numeric(TRUE)
as.integer(FALSE)
as.numeric(FALSE)
as.logical(0)
as.logical(1)
as.logical(0.1) #anything else besides zero...
as.character(1)
as.character(0.1)
as.logical("a") #more on this "character" later
as.logical("FALSE") #more on this "character" later
as.logical("T") #more on this "character" later
```

All of the above are actually 1-dim vectors. We will explore this later. To create vectors, we use the "c" function which is short for "combine". This function takes any number of arguments

```{r}
c(1, 1, 2, 3, 5, 8)
```

We can also use shorthand "ladder" notation:

```{r}
1 : 3
1 : 100
-3 : 3
3 : -3
-(1 : 5)
```

The "seq" function (short for "sequence") is key for this too:

```{r}
seq(1, 9) #short for 1 : 9
seq(1, 9, by = 2)
seq(1, 9, length.out = 3)
seq(1, 9, length.out = 4)
```

and so is the "rep" function (short for "replicate"):

```{r}
rep(1, 10)
rep(c(1, 2, 3), 10)
rep(seq(1, 2, by = 0.1), 10)
```

Every language has a means to assign a piece of data to variables:

```{r}
x = 1 #I use this... only because I like using one less character and I like my code to read like other popular languages
x
x <- 1 #but this is actually preferred according to the R spec
2 -> x #even though this is legal, please don't do this!!
x
T = "my string"
x_the_number = 1 #underscored variables preferred
xTheNumber = 1 #camel case is a different style
x.the.number = 1 #some people like this but I don't because it is confusing with other languages e.g. Java
```

And a means to see what variables have been assigned and remove these variables:

```{r}
ls() #who's out there
rm(x)
x
rm(x_the_number, xTheNumber, x.the.number)
rm(list = ls()) #clear the workspace completely!
```
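As a quick recap of the vector-building and coercion tools above, here is a small sketch combining `c`, `seq` and `rep` (nothing new here; `x` is just a scratch variable for this illustration):

```{r}
x = c(rep(1, 3), seq(2, 10, by = 2)) #a run of three 1's followed by the even numbers up to 10
x
length(x) #8 elements total
as.character(x) #coercion works on the whole vector at once
rm(x) #and clean up the workspace as above
```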
Printing them out is important:

```{r}
x = 1 #the workspace was just cleared, so assign x again before printing it
print(x) #gives you the output numbered by the number of outputs
cat(x) #just prints and it is preferred
cat(x); cat(x) #it prints without a line break
cat(x, "\n") #you can have many arguments to cat and it will concatenate them
cat("x is", x, "\n"); cat("x is", x, "\n")
```

There is a means to use logical comparisons:

```{r}
x = 1
x == 1
x == 0
x != 1
x != 0
x > 1
x < 1
x >= 1
x <= 1
x = TRUE; y = FALSE
x & y #AND
x | y #OR
```

Previously, we used vector types without even knowing. Let's talk about some simple operations on vectors:

```{r}
x = seq(1, 100)
length(x) #how long is this vector
head(x) #the first elements
tail(x) #the last elements
head(x, 11) #the first 11 elements
tail(x, 13) #the last 13 elements
x[1] #square brackets are the "getter" - it gets the value at that index. R has one-based indexing while C, C++, Python, Java have zero-based indexing. Get used to it. SORRY!!!
x[1] = 1984 #square brackets also serve as the "setter" - it sets the value at that index
x[1]
x[6]
x[length(x)] #unfortunately no "-1" last index convenience...
x[c(2, 77)]
x[2 : 77]
x[seq(1, 100, by = 2)] #odd numbers
x[-1] #everything but 1
x[-(1 : 10)] #everything but 1 - 10
x[-seq(2, 100, by = 2)] #another way to see odd numbers
x = c(rep(1, 10), rep(2, 10), rep(3, 10))
unique(x) #only unique elements (in order in which they appear)
x[1 : 10] = 10 : 1 #you can set in vectors as well
x[1 : 10]
x = c(1 : 10, 3 : 40)
```

and comparisons on vectors

```{r}
x = 1 : 5
y = 1 : 5
x == y
all.equal(x, y)
```

Most operations in R are vector operations and these are preferred because they are optimized:

```{r}
x = 1 : 5
x + 2
x * 2
x^2
```

A note about logical vectors:

```{r}
x = c(TRUE, TRUE)
y = c(TRUE, FALSE)
z = c(FALSE, FALSE)
!x
x & y
x | y
x && y #scalar AND which only examines the first element of each vector - confusing - recommend not to use this
x || y #scalar OR which only examines the first element of each vector - confusing - recommend not to use this
as.numeric(x)
sum(x) #does the numeric conversion for you
prod(x) #does the numeric conversion for you
any(x) #convenient function
any(z)
all(x)
all(y) #sometimes a useful function
xor(TRUE, FALSE)
xor(TRUE, TRUE)
xor(FALSE, FALSE)
```

Sampling is very important:

```{r}
x = 1 : 100
sample(x, 3)
sample(x, 101) #bombs - cannot take a sample larger than the population without replacement
sample(x, 3, replace = TRUE)
sort(sample(x, 101, replace = TRUE))
sample(x) #default is length of the vector i.e. just shuffles
```

There are many "illegal" computational values in R: NaN, Inf, -Inf:

```{r}
1 / 0 #unlike C, Java -- no error... handles this natively
1 / 0 == Inf #this works
0 / 0 #unlike C, Java -- no error... handles this natively
x = 0 / 0
x == NaN #beware!!
is.nan(x)
-1 / 0
1 + 1 / 0
1 + Inf
1 / Inf
1 / (-Inf)
1 / NaN
log(0)
log(-1)
sqrt(-1)
```

There are a couple more "illegalish" values that are non-computational: NA, NULL, empty vectors

```{r}
NA #this is the value of missing
x = NA
x == NA #beware!!
is.na(x)
x = NULL #special reserved primitive for data that is "undefined"
x == NULL #strange... and beware
is.null(x)
#let's look at that strange thing
x = 1 : 3
x[1]
x[0] #a vector with zero elements -- a "null vector" of sorts
length(x[0])
c() #remarkably... not what we expect...
x[NA]
x[NaN]
x[Inf]
x[4]
```

Now let's look at data types again. These types are the data types we learned in class.

```{r}
x = c(1, 2, 3, 4, 5)
class(x)
x = seq(1, 5)
class(x) #integer and numeric are approximately the same for purposes of this class
x = sample(rep(c(0, 1), 50))
x #binary variable?
x = c("a", "b", "c", "d")
x
class(x)
x = rep(x, 5)
x
class(x)
x = factor(x)
x
?factor
levels = c("low", "medium", "high")
x_char = sample(rep(levels, 10))
x = factor(x_char)
x #nominal categorical variable
x = factor(x_char, levels = levels)
x
x = factor(x_char, levels = levels, ordered = TRUE)
x #ordinal categorical variable
as.numeric(x) #coerce this variable to a number... result makes sense
```

Data which is of class factor can be used in mathematical models that we build just like numeric. It will, by default, be reduced to binary variables.

## Realizations from r.v.'s

Let's now do some probability a la Math 241. Let's realize a few iid random variable models. Let's say $X_1, \ldots, X_{10}$ iid ...

* binomial
* geometric
* Poisson
* standard uniform --- all numbers between 0 and 1 equally likely
* exponential
* normal with mean 5 and standard deviation 2

```{r}
n = 10 #good programming practice to declare shared data
x = rbinom(n, size = 5, prob = 0.1)
x
dbinom(1, size = 5, p = 0.1)
pbinom(3, size = 5, p = 0.1)
qbinom(.75, size = 3, p = 0.1)
x #returns as a vector
?rbinom
x = rgeom(n, prob = 0.1)
x
?rgeom
x = rpois(n, lambda = 2)
x
?rpois
x = runif(n)
x
?runif
x = rexp(n, rate = 1)
x
?rexp
x = rnorm(n, mean = 5, sd = 2)
x
?rnorm
```

Randomness is an illusion in a computer. This is a whole topic in and of itself. But for our purposes, imagine random numbers being generated from a long but fixed list of numbers and the starting point is random. But you can make the randomness deterministic by setting the starting point. Then it's just the same numbers in the same order! This starting point is called a "seed" and it's an integer.
```{r}
set.seed(1984)
rnorm(10)
set.seed(1984) #reset to that point in the list
rnorm(10) #same exact numbers!!!
set.seed(1984.1)
rnorm(10) #same - the seed needs to be a different integer
set.seed(-1984)
rnorm(10) #different
set.seed(1983)
rnorm(10) #different
#there is no get.seed() function; the current generator state is stored in the variable .Random.seed
```

When writing programs that depend on randomness, you should set the seed at the top so the program is deterministic. Then it is easy to find bugs. When you switch the program to production, you just comment out the set-seed line. I will be doing this fairly often during demos for the rest of the semester. Sorting is also a key skill in R:

```{r}
n = 50
x = rbinom(n, size = 20, prob = 0.2)
x
sort(x)
sort(x, decreasing = TRUE)
```

## Matrices

Now that we finished vectors, let's learn about the matrix object. Let's use the dimensions of the learning problem in class.

```{r}
n = 100 #number of historical objects: the people
p = 3 #number of features about each
random_junk = round(runif(n * p), 2)
random_junk
X = matrix(random_junk, nrow = n, ncol = p)
X
class(X)
X[1, 1] #the square brackets are still the getter but since there's two dimensions there's two indices
X[1, ] #first row
class(X[1, ]) #note: not matrix anymore
X[1, , drop = FALSE]
class(X[1, , drop = FALSE]) #still matrix... this matters sometimes and we will see an example of it
X[, 1] #first column
class(X[, 1]) #note: not matrix anymore
X[, 1, drop = FALSE]
class(X[, 1, drop = FALSE])
#other ways of subsetting matrices
X[1 : 4, ]
X[, 2 : 3]
X[, c(1, 3)]
X[seq(1, n, by = 2), ]
X[1, 1] = 1984 #the square brackets are still the setter
X[1, 1] = "1984" #warning: this will wind up casting all values in the matrix to character
head(X)
```

Sometimes handy but not recommended is getting / setting with only one index.

```{r}
head(X)
X[1] #first entry i.e. the 1,1 entry
X[2] #second entry i.e. the 2,1 entry (column-major indexing, see https://cran.r-project.org/web/packages/reticulate/vignettes/arrays.html)
X[n * p] #last entry i.e. the n,p entry
X[n * p + 1] #beyond the last entry returns missing / illegal
X[-1] #returns everything but the first entry and it automatically vectorizes
X[1 : 100] #returns the first 100 i.e. the first column
X[X > 0.5] = NA #replace all values no matter what row / col they exist in
X
```

A few handy functions for numerical matrices:

```{r}
X = matrix(random_junk, nrow = n, ncol = p) #reset
rowSums(X)
colSums(X)
rowMeans(X)
colMeans(X)
```

## Programming! Conditional statements, loops, functions

We will now learn how to program in R.

* Conditional Statements (and ways to print to screen too).

```{r}
x = 3
if (x > 0){
  cat("The value of x is", x, "\n") #concatenate and print to screen
} else {
  print("The value of x is"); print(x) #each gets its own line
}
if (x > 0){
  ##
} else { ##NOTE: ELSE MUST GO ON SAME LINE AS THE IF'S CLOSE BRACE!!! This is not Java or C++!
  ##
}
x = -5
if (x > 0){
  cat("The value of x is", x, "\n") #concatenate and print to screen
} else if (x < -3) { ##NOTE: ELSE MUST GO ON SAME LINE AS THE IF'S CLOSE BRACE!!!
  print(paste("The value of x is", x, collapse = " ")) #paste with collapse is like a string join
} else {
  print("The value of x is"); print(x) #each gets its own line
}
```

There is a one-line convenience function called ifelse:

```{r}
x = -10 : 10
b = -10 : 10
a = ifelse(x > 0, b, -1)
a
x = -5 : 5
a = ifelse(x > 0, 1, -1)
a
```

There is also a switch in R.
It is _not_ an official keyword (despite RStudio's endorsement), but it is useful for non-vector character expressions:

```{r}
ch = "two"
# ch = c("one", "two")
switch(ch,
  "one" = 1,
  "two" = 2,
  "three" = 3,
  -99
)
```

* Looping

A for loop contains three ingredients: (a) a looping variable joined with keyword "in" to a (b) vector or list (which we will discuss later) and executes over an (c) expression recommended to be enclosed in curly braces:

```{r}
for (i in 1 : 10){
  j = i
  print(j)
}
```

Note that both the looping variable i and the variable j within the looped expression still exist:

```{r}
i
j
```

Be careful of scoping in R. You may want to "clean up your namespace" whenever you can.

```{r}
for (i in 1 : 10){
  j = i
  print(j)
}
rm(i, j)
```

There is no need to save the values of the temp variables i, j going forward. They would vanish in C++ / Java when the for loop ended as they go out of scope. You can iterate over a vector as well, Python-style or Java-generic-style:

```{r}
vec = seq(1, 10)
for (i in vec){
  print(i)
}
```

And the vector does not have to be numeric:

```{r}
vec = c("a", "b", "c")
for (ch in vec){
  print(ch)
}
```

There is also the while loop but not a "do-while" loop

```{r}
x = 1
while (x <= 10){
  print(x)
  x = x + 1 #note: there is no "++" incrementing in R
}
```

This can also be done using a break:

```{r}
x = 1
while (TRUE){
  print(x)
  x = x + 1 #note: there is no "++" incrementing in R
  if (x > 10){
    break
  }
}
```

The "repeat" keyword is the same as the while-true. It is slightly more esoteric but still good to know:

```{r}
x = 1
repeat {
  print(x)
  x = x + 1 #note: there is no "++" incrementing in R
  if (x > 10){
    break
  }
}
```

The final control keyword tells the loop to skip the current iteration but continue to the next:

```{r}
xs = rep(NA, 10)
xs[3] = 3
xs[7] = 7
xs
tot = 0
for (x in xs){
  if (is.na(x)){
    next #move to next element in the list in the for loop
  }
  tot = tot + x
}
tot
```

Note: "real" R programmers avoid loops since they are slow.
Instead they use the "apply" function (but not "sapply" or "lapply" which are not faster). We will cover these in the future. The latest trend is to use in-line C++ code (we will get to this).

## More data types

Before we get back to modeling, it is worth knowing a couple more data structures in R. These are not "data [science] types", these are "[computer science] data types" that are used frequently in data science applications. The first are "lists" which are "ordered hashmaps" or "ordered dictionaries" or "hash tables". If you don't know what this is, you should read about it online as it should have been covered in an intro CS class.

```{r}
dict = list()
dict$a = "first"
dict$b = "second"
dict$c = "third"
dict
length(dict)
names(dict) #keys
dict_unlisted = unlist(dict) #values
dict_unlisted
class(dict_unlisted) #i.e. a vector
#three ways to access values by key / ordered location
dict$a
dict[["a"]]
dict[[1]] #value of first entered key
class(dict[[1]])
dict[[1 : 2]] #bombs
#now let's try to access a value for a non-existent key / ordered location
dict$q
dict[["q"]]
dict[[4]] #bombs
#convenient means to subset the list
dict[1]
class(dict[1])
dict[1 : 2]
dict[1 : 4] #this is the reason this type of access is not recommended
dict = list("first", "second", "third") #no key => value... what happens?
dict #default keys are the numbers 1, 2, ...
dict[[1]]
dict = list("a" = "first", "b" = "second", "c" = "third") #key => value
dict
dict = list(a = "first", b = "second", c = "third") #key => value
dict
```

Lists conveniently allow all sorts of data types (as values only).
```{r}
varied_dict = list()
varied_dict$a = "first"
varied_dict$b = 2
varied_dict$c = 1 : 7
varied_dict$d = matrix(NA, nrow = 2, ncol = 2)
varied_dict[["some function"]] = function(x){x^2} #this key is not recommended
varied_dict
varied_dict$`some function` #note the tick marks (sometimes seen) needed due to spaces in key name
length(varied_dict)
names(varied_dict)
```

They have lots of uses in data science applications. We will likely see them in class and if not, you'll definitely see them in the real world. Note that data.frame objects are implemented as lists, as are many other common R objects. Unfortunately, lists can only truly accept characters as keys. If you really need more flexibility here, we will need a library (coming soon).

We will now discuss arrays i.e. multidimensional vectors

```{r}
x = array(1 : 5, 5)
x
X = array(1 : 25, dim = c(5, 5))
X
X = array(1 : 125, dim = c(5, 5, 5))
X
X[1, , ]
X[, 1, ]
X[, , 1]
X[1, 1, 1]
```

These can be associative arrays too and operate like a hash of vectors across arbitrary dimensions:

```{r}
X = array(1 : 125, dim = c(5, 5, 5), dimnames = list(
  c("A", "B", "C", "D", "E"),
  c("I", "II", "III", "IV", "V"),
  c("blue", "red", "green", "yellow", "orange")
))
X
X["A", , ]
X[, "III", ]
X[, , "orange"]
X["C", , "orange"]
X["C", "IV", "orange"]
```

* Functions

```{r}
my_function = function(x){
  x
}
##You may be used to:
# int my_function(int x){
#   //something
#   return x;
# }
```

* Functions are objects in R. This is why you are actually assigning the function to an object.
* We don't need to declare type since R is not "statically typed" (higher-level languages usually are not statically typed). Objects can be coerced into different types on the fly (R is "dynamically typed").
* No need for a "return" statement as the last line is the data that is returned. It is considered bad style to use "return" in R.
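To underscore the last-expression-is-the-return-value convention, here is a minimal sketch (the function name `square_plus_one` is invented for this illustration):

```{r}
square_plus_one = function(x){
  y = x^2 #an intermediate assignment... this is not what gets returned
  y + 1   #the last evaluated expression is the return value - no "return" keyword needed
}
square_plus_one(3) #10
```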
Let's make sure this works:

```{r}
my_function(3)
my_function("asd")
my_function(x = 3) #you can specify that 3 is the value for argument "x"
my_function(y = 3) #illegal argument
(function(x){x + 1})(3) #anonymous function or "lambda" (see https://en.wikipedia.org/wiki/Anonymous_function)
```

R is somewhat user-friendly as it allows for default argument values, making those arguments optional to specify when calling the function:

```{r}
my_function = function(x = 1, y = 2, z = 3, p = 4, q = 5, r = 6){
  (x + y + z) / (p + q + r)
}
my_function() #default execution
my_function(p = 0) #one optional argument specified, others defaulted
my_function(y = -2, q = 0) #two optional arguments specified, others defaulted
my_function = function(x = 1, y, z = 3, p = 4, q = 5, r = 6){
  (x + y + z) / (p + q + r)
}
my_function() #no dice
my_function(1, 0) #required argument specified
my_function(y = 0, q = 7) #required argument specified and one optional argument
rm(my_function) #can be deleted since it's an object
```

There are also common functional programming functions.

* Reduce uses a binary function to successively combine the elements of a given vector and a possibly given initial value.
* Filter extracts the elements of a vector for which a predicate (logical) function gives true.
* Find and Position give the first or last such element and its position in the vector, respectively.
* Map applies a function to the corresponding elements of given vectors.

If you like this, there are many packages that extend this and organize it nicely e.g. purrr (we will get to packages next class).

```{r}
x = c(1, 2, 3, 4, 5)
Reduce(sum, x)
Filter(function(x){x <= 3}, x)
Find(function(x){x > 4}, x)
unlist(Map(function(x){x + 100}, x)) #what happened here?? Map will return a list (for flexibility)
```

## First modeling exercise

Before we model, let's fabricate the training data! Let's try to build a data matrix similar to the one in the class example.
Let's imagine $n = 100$ and $x_1$ is salary, $x_2$ is a dummy variable for missing a loan payment in their credit history, $x_3$ is an ordinal variable for crime type coded 0, 1, 2 or 3. We can "make up" a dataset using the sampling we just learned.

```{r}
n = 100 #number of historical objects: the people
p = 3 #number of features about each
X = matrix(NA, nrow = n, ncol = p)
X
```

Some more useful matrix functions:

```{r}
nrow(X)
ncol(X)
dim(X)
length(X) #countless bugs!
c(X)
```

Why should we fill up this matrix with NA's? No technical reason; it is done for a practical reason. Every value is currently "missing" until it's filled in i.e. it will let you know if you didn't fill in any of the values. We can also name rows and columns. Each row is a historical person and each column is a feature about that person.

```{r}
colnames(X) = c(
  "salary",
  "has_past_unpaid_loan",
  "past_crime_severity"
)
colnames(X) #setter and a getter
fake_first_names = c(
  "Sophia", "Emma", "Olivia", "Ava", "Mia", "Isabella", "Riley",
  "Aria", "Zoe", "Charlotte", "Lily", "Layla", "Amelia", "Emily",
  "Madelyn", "Aubrey", "Adalyn", "Madison", "Chloe", "Harper",
  "Abigail", "Aaliyah", "Avery", "Evelyn", "Kaylee", "Ella", "Ellie",
  "Scarlett", "Arianna", "Hailey", "Nora", "Addison", "Brooklyn",
  "Hannah", "Mila", "Leah", "Elizabeth", "Sarah", "Eliana", "Mackenzie",
  "Peyton", "Maria", "Grace", "Adeline", "Elena", "Anna", "Victoria",
  "Camilla", "Lillian", "Natalie", "Jackson", "Aiden", "Lucas",
  "Liam", "Noah", "Ethan", "Mason", "Caden", "Oliver", "Elijah",
  "Grayson", "Jacob", "Michael", "Benjamin", "Carter", "James",
  "Jayden", "Logan", "Alexander", "Caleb", "Ryan", "Luke", "Daniel",
  "Jack", "William", "Owen", "Gabriel", "Matthew", "Connor", "Jayce",
  "Isaac", "Sebastian", "Henry", "Muhammad", "Cameron", "Wyatt",
  "Dylan", "Nathan", "Nicholas", "Julian", "Eli", "Levi", "Isaiah",
  "Landon", "David", "Christian", "Andrew", "Brayden", "John",
  "Lincoln"
)
rownames(X) = fake_first_names
rownames(X) #setter and getter
X
```

Let's
pretend "salary" is normally distributed with mean \$50,000 and standard deviation \$20,000. Let's make up some data

```{r}
X[, 1] = round(rnorm(n, 50000, 20000))
X
#another way to set this feature:
X[, "salary"] = round(rnorm(n, 50000, 20000))
X
```

A quick sidebar about vectors within matrices with row or column names:

```{r}
salaries = X[, 1]
salaries #it's a vector with names
names(salaries) #access to its names
names(salaries)[3] = "Adam"
salaries
salaries[3]
salaries["Adam"]
sort(salaries)
#how do we sort salaries by name?
salaries[order(names(salaries))]
?order #it's like sort, but it returns indices
rm(salaries)
```

Are the salary values independent? Yes, or at least we assume so... Hopefully we will get to models in this semester where they are not independent. We will eventually do visualization, but first let's take a look at a summary of this data:

```{r}
summary(X[, "salary"])
```

There are other base functions to know:

```{r}
mean(X[, "salary"]) #mean should be "average"!!
sum(X[, "salary"]) / length(X[, "salary"])
var(X[, "salary"])
sd(X[, "salary"])
median(X[, "salary"])
min(X[, "salary"])
max(X[, "salary"])
IQR(X[, "salary"])
```

There are also the convenient quantile and inverse quantile functions

```{r}
quantile(X[, "salary"], probs = 0.5)
quantile(X[, "salary"], probs = c(.1, .9))
inverse_quantile_obj = ecdf(X[, "salary"]) #the "empirical" CDF
inverse_quantile_obj(50000)
inverse_quantile_obj(70000)
inverse_quantile_obj(0)
inverse_quantile_obj(-10000)
inverse_quantile_obj(200000)
```

Let's pretend "has_past_unpaid_loan" is Bernoulli distributed with probability 0.2

```{r}
X[, "has_past_unpaid_loan"] = rbinom(n, size = 1, prob = 0.2)
X
```

Is this a reasonable fabrication of this dataset? No... since salary and not paying back a loan are dependent r.v.'s. But... we will ignore this for now. It would be nice to see a summary of values. Would median and mean be appropriate here? No.
For categorical variables, you should "table" them:

```{r}
table(X[, "has_past_unpaid_loan"])
```

Also, 50\% of people have no crime, 40\% have an infraction, 8\% a misdemeanor and 2\% a felony. Let's try to add this to the matrix. We first need to simulate this. Here's how:

```{r}
X[, "past_crime_severity"] = sample(
  c("no crime", "infraction", "misdemeanor", "felony"),
  size = n,
  replace = TRUE,
  prob = c(.50, .40, .08, .02)
)
X
```

Oh no - what happened?? Our matrix went all characters... The matrix type cannot handle numeric and categorical variables simultaneously! It would be nice to keep factor or character information in a matrix but this is not the spec. Enter the key data type, the "data.frame" - this is the object that is used for modeling in the R ecosystem. It is essentially an upgraded matrix.

```{r}
X = data.frame(
  salary = round(rnorm(n, 50000, 20000)),
  has_past_unpaid_loan = rbinom(n, size = 1, prob = 0.2),
  past_crime_severity = sample(
    c("no crime", "infraction", "misdemeanor", "felony"),
    size = n,
    replace = TRUE,
    prob = c(.50, .40, .08, .02)
  )
)
rownames(X) = fake_first_names
X
```

RStudio gives us a nicer rendering of the information. You can open it up in a separate tab via:

```{r}
View(X)
```

and you can view summaries of each feature and data type of each feature via

```{r}
summary(X)
str(X)
```

Again, summary doesn't work for "has_past_unpaid_loan". We should convert it to factor and try again. Note the "$" operator which is now valid for data.frame objects.

```{r}
X$has_past_unpaid_loan = factor(X$has_past_unpaid_loan, labels = c("Never", ">=1"))
head(X) #make sure that worked
summary(X) #much better now!
str(X)
```

Now that we have two categorical variables, we can do a "cross tab":

```{r}
table(X$has_past_unpaid_loan)
table(X$past_crime_severity)
table(X$has_past_unpaid_loan, X$past_crime_severity) / 100
#to avoid needing the "X$" over and over, use the convenience "with"
with(X,
  table(has_past_unpaid_loan, past_crime_severity)
)
```

In our training set D, we are missing one final variable, the response! Let's add it and say that 90\% of people are creditworthy i.e. they paid back their loan:

```{r}
X$paid_back_loan = factor(rbinom(n, size = 1, prob = 0.9), labels = c("No", "Yes"))
head(X) #make sure that worked
summary(X) #much better now!
```

Conceptually - why does this make no sense at all??? y is independent of X --- what happens then? No function f can ever have any predictive / explanatory power! This is just a silly example to show you the data types. We will work with real data soon. Don't worry.

## Libraries in R

So far we've only made use of "base R". This is the functionality included in a vanilla installation of R. R has a huge worldwide community of contributors. Some contribute to newer versions of base R, but most create open-source "R packages" or "libraries". Many libraries come preinstalled. For instance, the MASS library, which stands for "Modern Applied Statistics with S" (a famous textbook on S/R). We can call a function from the MASS library via the following:

```{r}
MASS::as.fractions(0.99)
MASS::as.fractions(pi)
```

Note we cannot just execute

```{r}
as.fractions(pi)
```

We made use of the scope operator "::" to access a namespace beyond the usual "global namespace" which we've been used to. Parenthetically, you can use ":::" to access the private / internal functions and variables. Anyone who understands object-oriented programming with public interfaces / APIs would cringe at this!!! If we are using the MASS library a lot, using the scope operator may get annoying.
So similar to the "with" command, we can call

```{r}
library(MASS)
```

which loads all public methods (aka "exported" functions) into the public namespace. Now, after the library invocation we can do the following and treat it as a normal function:

```{r}
as.fractions(pi)
```

Is this always a good idea? No... every time you call library it "dirties" the namespace by putting all the functions there and overwriting functions that were there previously. This is bad because you are more likely to get namespace conflicts. For instance, let's say package kapelner had a weird sample function. This would be clear:

```{r}
v = rnorm(100)
kapelner::sample(v)
sample(v)
```

The first line is using the special sample function and the second is using base R's sample. But if I do this:

```{r}
library(kapelner)
#...
#...
###10,000 lines of code in which you forget about the fact that the kapelner library is loaded
#...
#...
sample(v)
```

You may think you're getting base R's sample function, but you're not, and now you're in bug-city! You would have to do the following to be explicit:

```{r}
library(kapelner)
sample(v)
base::sample(v)
```

This is not a recommended thing to do. It's also not recommended for package developers to name functions the same as common base R functions. But this doesn't stop them!

Back to real packages... the content for the MASS package was sitting on the hard drive since it comes with R. But what if you want to use a package that does not come with R? We'll have to install the package. Just like pip for Python and RubyGems for Ruby, R has a package management system built in. For example, here's a useful package for time series / finance stuff:

```{r}
install.packages("tseries")
```

Note that it knew where to go online - it went to a CRAN mirror. CRAN is the official repository for R packages. Now that it's installed, step 2 is to load it into the namespace so we can more seamlessly access its functionality.

```{r}
library(tseries)
```

That was a welcome message. This library is really cool e.g.
{r} ibm_stock_history = get.hist.quote(instrument = "IBM", start = "2018-01-01", end = "2018-02-01") ibm_stock_history  Is this a data frame? {r} class(ibm_stock_history)  Nope - they made their own data type. If we had a unit on "writing your own R packages", I would explain how this is done but alas there is no time... Let's say you're sharing your code with someone and one of your lines is loading a library e.g. {r} library(a_library_my_computer_does_not_have_installed_yet)  And my computer doesn't have this library. Then we need to stop what we're doing and install it. This could be annoying. Here is a convenience: use the pacman package, which installs if necessary: {r} if (!require("pacman")){install.packages("pacman")} #installs pacman if necessary but does not load it! pacman::p_load(devtools) #ensures that devtools gets installed and loaded into the workspace but pacman does not (very tidy)!  It is typical to then have a few lines declaring all packages on top of your R/Rmd script file. Here is an example header from one of my projects. {r} #if (!require("pacman")){install.packages("pacman")} #pacman::p_load(knitr, randomForest, dplyr, tidyverse, doParallel, xtable, pracma, yaml)  We will be seeing this in pretty much all our demos in the future. It is very rare to be coding in R without making use of packages beyond base R. I'm going to require the use of pacman for HW / projects, etc. It just makes code easier to share, run, etc. The devtools package is important for modern R usage. It allows downloading R packages directly from source that are not even on CRAN. This allows you to get "bleeding edge" features. For example: {r} install_github("yihui/knitr")  However, this doesn't always work! {r} install_github("hadley/ggplot2")  Why can this fail? Because the computer you're running this on is not set up for compiling C++. Admittedly, Macs usually succeed here and Windows usually fails here.
To make it succeed you need to install a separate program beyond R called Rtools. This is one of the big advantages of using Linux or Mac over Windows - Windows is just more buggy when it comes to "real coding" and it gets in the way when you're trying to get stuff done. Linux is absolutely the best here, and because Linux is usually the production environment anyway, it may make sense to use it for all your assignments and coding. Note, you can use the pacman library for this type of installation too. So your header becomes: {r} if (!require("pacman")){install.packages("pacman")} pacman::p_load(devtools) pacman::p_load_gh("hadley/ggplot2")  ## Convenient Mapping Function for Lists with the purrr package We first load the library. {r} pacman::p_load(purrr)  We will see later that the library purrr is part of a collection of libraries called the tidyverse. Now imagine you have a collection of objects in a list. For example, let's let the objects be matrices with different sizes: {r} my_matrix_list = list() my_matrix_list[["first"]] = matrix(rnorm(9), nrow = 3) my_matrix_list[["second"]] = matrix(rnorm(12), nrow = 2) my_matrix_list[["third"]] = matrix(rnorm(8), nrow = 4) my_matrix_list  And you want to operate on each of those objects and return a list. Let's say I want to get back the dimensions, or the first rows, or the average values and return the same keys: {r} my_dims_list = modify(my_matrix_list, ~ dim(.x)) my_dims_list my_first_rows_list = modify(my_matrix_list, ~ .x[1, ]) my_first_rows_list my_avgs_list = modify(my_matrix_list, ~ mean(.x)) my_avgs_list  This is a very convenient function known as "mapping" in functional programming. It saves a few lines of code e.g.
the first modify would be: {r} my_dims_list = list() #make new list to store keys --> dimensions of original matrices for (key in names(my_matrix_list)){ #iterate over all list by key .x = my_matrix_list[[key]] #get value at the key for this iteration my_dims_list[[key]] = dim(.x) #run function on value and save it to new list } my_dims_list  The above takes 5 lines and is repeated again and again in real code; with the modify function it all takes one line. The modify function uses funky syntax which is not standard R. And it doesn't have to be; packages are allowed to extend the language and use symbols to create their own little mini-language. The .x above is a dummy variable for the value in the iteration in the imagined for loop (like in my rewritten boilerplate code above). The "~" tilde symbol we will be seeing in base R later on in class but in a completely different context. Here it just means "run the following function". Modify is just one of the functions in the purrr package. See the following cheatsheet for more convenient functions: https://github.com/rstudio/cheatsheets/blob/master/purrr.pdf. ## Loading datasets from R packages Since R is a language built for data and statistics, it has a ton of interesting data sets by default and even more that are contained in packages. There is really just one command to know: {r} rm(list = ls()) data(iris) #load the iris dataset (as a data frame). This dataset is included in the package "datasets" which is autoloaded by default class(iris) ?iris #3 things I always do immediately upon getting a dataset head(iris) str(iris) summary(iris)  Here is another very famous dataset {r} MASS::Boston #this just references the object but does not load it into the environment data(Boston) #error since package MASS is not loaded by default data(Boston, package = "MASS") #package argument not needed if the package is loaded head(Boston)  Most data sets are given some descriptive name like "loandata" or "cars".
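The three-command inspection routine above (head, str, summary) works on any of them. As a quick sketch, here it is on the built-in mtcars dataset (any other dataset from the autoloaded datasets package would do just as well):

```r
data(mtcars)    #ships with the autoloaded "datasets" package, no install needed
class(mtcars)   #a plain data frame
head(mtcars)    #peek at the first 6 rows
str(mtcars)     #32 observations, all columns numeric
summary(mtcars) #quartiles and means per column
```
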
R has so many datasets. Here they all are by package installed: {r} data(package = .packages(all.available = TRUE))  ## Continue discussion concerning data frames and the modeling from class We quickly recreate our data frame from last class: {r} n = 100 X = data.frame( salary = round(rnorm(n, 50000, 20000)), has_past_unpaid_loan = rbinom(n, size = 1, prob = 0.2), past_crime_severity = sample( c("no crime", "infraction", "misdemeanor", "felony"), size = n, replace = TRUE, prob = c(.50, .40, .08, .02) ) ) row.names(X) = c( "Sophia", "Emma", "Olivia", "Ava", "Mia", "Isabella", "Riley", "Aria", "Zoe", "Charlotte", "Lily", "Layla", "Amelia", "Emily", "Madelyn", "Aubrey", "Adalyn", "Madison", "Chloe", "Harper", "Abigail", "Aaliyah", "Avery", "Evelyn", "Kaylee", "Ella", "Ellie", "Scarlett", "Arianna", "Hailey", "Nora", "Addison", "Brooklyn", "Hannah", "Mila", "Leah", "Elizabeth", "Sarah", "Eliana", "Mackenzie", "Peyton", "Maria", "Grace", "Adeline", "Elena", "Anna", "Victoria", "Camilla", "Lillian", "Natalie", "Jackson", "Aiden", "Lucas", "Liam", "Noah", "Ethan", "Mason", "Caden", "Oliver", "Elijah", "Grayson", "Jacob", "Michael", "Benjamin", "Carter", "James", "Jayden", "Logan", "Alexander", "Caleb", "Ryan", "Luke", "Daniel", "Jack", "William", "Owen", "Gabriel", "Matthew", "Connor", "Jayce", "Isaac", "Sebastian", "Henry", "Muhammad", "Cameron", "Wyatt", "Dylan", "Nathan", "Nicholas", "Julian", "Eli", "Levi", "Isaiah", "Landon", "David", "Christian", "Andrew", "Brayden", "John", "Lincoln" ) X  Remember our cross tab? Now we can get fancier using our new library skills. Any Stata fans out there? {r} pacman::p_load(gmodels) CrossTable(X$has_past_unpaid_loan, X$past_crime_severity, chisq = TRUE)  And add a new variable, the response, to the data frame: {r} X$paid_back_loan = factor(rbinom(n, size = 1, prob = 0.9), labels = c("No", "Yes"))  Note that our matrix is now no longer just $X$; it includes $y$.
I could make a renamed copy, but I want to show off dropping this column and create a new object that's both features and response column-binded together: {r} y = X$paid_back_loan X$paid_back_loan = NULL #drop column Xy = cbind(X, y) #an aside: what do you think the "rbind" function does? head(Xy) #make sure that worked summary(Xy) #much better now! #Note: Xy = X; rm(X) would've been easier  I prefer calling the full training set $\{X, y\}$ a data frame called $Xy$. The object $X$ is now extraneous, so we should clean up our workspace now. {r} rm(list = setdiff(ls(), "Xy"))  ## The Null Model {r} #There's no standard R function for sample mode!!! sample_mode = function(data){ mode_name = names(sort(-table(data)))[1] if (class(data) == "factor"){ factor(mode_name, levels = levels(data)) } else if (class(data) == "numeric"){ as.numeric(mode_name) } else if (class(data) == "integer"){ as.integer(mode_name) } else { mode_name } } g0 = function(){ sample_mode(Xy$y) #return mode regardless of x } g0()  ## The Threshold Model Let's compute the threshold model and see what happens. Here's an inefficient but quite pedagogical way to do this: {r} n = nrow(Xy) num_errors_by_parameter = matrix(NA, nrow = n, ncol = 2) colnames(num_errors_by_parameter) = c("threshold_param", "num_errors") y_logical = Xy$y == "Yes" for (i in 1 : n){ threshold = Xy$salary[i] num_errors = sum((Xy$salary > threshold) != y_logical) num_errors_by_parameter[i, ] = c(threshold, num_errors) } num_errors_by_parameter #look at all thresholds in order num_errors_by_parameter[order(num_errors_by_parameter[, "num_errors"]), ] #now grab the smallest num errors best_row = order(num_errors_by_parameter[, "num_errors"])[1] x_star = c(num_errors_by_parameter[best_row, "threshold_param"], use.names = FALSE) x_star  Let's program g, the model that is shipped as the prediction function for future $x^*$: {r} g = function(x){ ifelse(x > x_star, 1, 0) } g(15000)  ## The Perceptron Time for some new data first...
we are bored of the fabricated creditworthiness data. {r} rm(list = ls()) Xy = na.omit(MASS::biopsy) #The "breast cancer" data ?MASS::biopsy head(Xy) X = Xy[, 2 : 10] #V1, V2, ..., V9 head(X) y_binary = as.numeric(Xy$class == "malignant") table(y_binary)  First question. Let $\mathcal{H}$ be the set $\{0, 1\}$ meaning $g = 0$ or $g = 1$. What are the error rates then on $\mathbb{D}$? {r} #If always 0, all the 1's are errors 239 / (444 + 239) #If always 1, all the 0's are errors 444 / (444 + 239) g0 = function(){ sample_mode(y_binary) #return mode regardless of x's (rerun the sample_mode definition from above since the rm wiped it) } g0()  If your $g$ can't beat that, either your features $x_1, \ldots, x_p$ are terrible, and/or $\mathcal{H}$ was a terrible choice and/or $\mathcal{A}$ can't pull its weight. Okay... back to the "perceptron learning algorithm". Let's do so for one dimension - just "V1" in the breast cancer data. You will do an example with more features for the lab. {r} # y_binary = ifelse(y_binary == 1, 0, 1) MAX_ITER = 1000 w_vec = rep(0, 2) #initialize a 2-dim vector X1 = as.matrix(cbind(1, X[, 1, drop = FALSE])) for (iter in 1 : MAX_ITER){ for (i in 1 : nrow(X1)){ x_i = X1[i, ] yhat_i = ifelse(sum(x_i * w_vec) > 0, 1, 0) y_i = y_binary[i] w_vec[1] = w_vec[1] + (y_i - yhat_i) * x_i[1] w_vec[2] = w_vec[2] + (y_i - yhat_i) * x_i[2] } } w_vec  What is our error rate? {r} yhat = ifelse(X1 %*% w_vec > 0, 1, 0) sum(y_binary != yhat) / length(y_binary)  Looks like the perceptron fit to just the first feature beat the null model (at least on the data in $\mathbb{D}$). Is this expected? Yes, if the first feature is at all predictive of y. ## Errors and Warnings You can write better functions if you make use of errors and warnings. Java forces you to catch likely errors via the "throws" designation for a method but there is no such requirement in R. * Errors are unrecoverable, they halt execution i.e. red lights * Warnings (under usual execution) do not halt execution, but they display a message, i.e.
yellow lights Here's how they work: {r} my_vector_sum = function(xs){ if (!(class(xs) %in% c("numeric", "integer"))){ #short for class(xs) == "numeric" | class(xs) == "integer" stop("You need to pass in a vector of numbers not a vector of type \"", class(xs), "\".\n") #throw error! # warning("Your vector of type \"", class(xs), "\" will be coerced to numbers.\n") #throw warning! # xs = as.numeric(as.factor(xs)) } tot = 0 for (x in xs){ tot = tot + x } tot } my_vector_sum(c(1, 2, 3)) my_vector_sum(c("a", "b", "c"))  There is a try-catch as well: {r} xs = c("a", "b", "c") tot = my_vector_sum(xs) tot = tryCatch( { my_vector_sum(xs) }, error = function(e){ print("xs was non-numeric... coercing xs to numeric...") my_vector_sum(as.numeric(as.factor(xs))) } ) tot  The recommended thing to do of course is to query whether it is non-numeric within the function my_vector_sum and cast it there. Possibly create an argument toggling this behavior on/off with a default of off. ## Matrix operations in R R can do all the standard matrix operations. Let's go through them quickly. First initialize two example matrices: {r} A = matrix(rep(1, 4), nrow = 2) A B = array(seq(1, 4), dim = c(2, 2)) B I = diag(2) #create an identity matrix of size 2x2 I  Now we show off some operations: {r} A * B #element-wise multiplication A %*% B #matrix multiplication B %*% I t(B) #transpose solve(B) solve(A) #BOOM - why? tryCatch( { solve(A) }, error = function(e){ print("matrix not invertible, doing Moore-Penrose generalized pseudoinverse instead...") MASS::ginv(A) } ) #how would you wrap this and handle it in the real world? solve(I) #rank(A) = 1 #no such function... but... there are tons of add-on libraries for matrix computations e.g. pacman::p_load(Matrix) #load the Matrix library rankMatrix(B) rankMatrix(A) rankMatrix(I)  Note that vectors and matrices are not the same: {r} v = c(1, 2, 3) #3-d vector t(v) #converts to a 1x3 matrix...
unsure why t(t(v)) v %*% v #seems to default to dot product t(v) %*% t(t(v)) #dot product I = diag(3) I %*% v #seems to default correctly! I %*% t(v) #actually nonconformable  ## Support Vector Machines (SVM) You will code a basic SVM for homework. Here we use the e1071 library. {r} pacman::p_load(e1071)  We make a simple dataset first. {r} Xy_simple = data.frame( response = factor(c(0, 0, 0, 1, 1, 1)), #nominal first_feature = c(1, 1, 2, 3, 3, 4), #continuous second_feature = c(1, 2, 1, 3, 4, 3) #continuous )  We haven't spoken about visualization yet, but it is important we do some of it now. First we load the visualization library we're going to use: {r} pacman::p_load(ggplot2)  And let's plot the data: {r} simple_viz_obj = ggplot(Xy_simple, aes(x = first_feature, y = second_feature, color = response)) + geom_point(size = 5) simple_viz_obj  Now we fit a linear SVM. Since the data is linearly separable, we can make $\lambda = 0$. Since the package doesn't allow zero, we make it trivially small. {r} Xy_simple_feature_matrix = as.matrix(Xy_simple[, 2 : 3]) lambda = 1e-9 n = nrow(Xy_simple_feature_matrix) svm_model = svm(Xy_simple_feature_matrix, Xy_simple$response, kernel = "linear", cost = (2 * n * lambda)^-1, scale = FALSE)  The model object can be queried to find the "support vectors" i.e. the observations that lie on the wedge. Let's visualize them too. {r} svm_model$index Xy_simple$is_support_vector = rep("no", n) Xy_simple$is_support_vector[svm_model$index] = "yes" simple_viz_obj = ggplot(Xy_simple, aes(x = first_feature, y = second_feature, color = response, shape = is_support_vector)) + geom_point(size = 5) simple_viz_obj  Now we calculate the weight vector. This is technical and not covered in the class yet (or maybe never): {r} w_vec_simple_svm = c( svm_model$rho, #the b term -t(svm_model$coefs) %*% Xy_simple_feature_matrix[svm_model$index, ] # the other terms ) w_vec_simple_svm w_norm = sqrt(sum(w_vec_simple_svm^2)) w_norm  We can also plot it.
We have to convert from Hesse normal form back into slope-intercept form. Note that $b$ is the first entry of the w_vec_simple_svm vector {r} simple_svm_line = geom_abline( intercept = -w_vec_simple_svm[1] / w_vec_simple_svm[3], slope = -w_vec_simple_svm[2] / w_vec_simple_svm[3], color = "purple") simple_viz_obj + simple_svm_line  We can also plot the wedge by plotting the top line (where b is augmented by 1) and the bottom line (where b is diminished by 1). {r} simple_svm_top_line = geom_abline( intercept = -(w_vec_simple_svm[1] + 1) / w_vec_simple_svm[3], slope = -w_vec_simple_svm[2] / w_vec_simple_svm[3], color = "yellow") simple_svm_bottom_line = geom_abline( intercept = -(w_vec_simple_svm[1] - 1) / w_vec_simple_svm[3], slope = -w_vec_simple_svm[2] / w_vec_simple_svm[3], color = "yellow") simple_viz_obj + simple_svm_line + simple_svm_top_line + simple_svm_bottom_line  To understand the hyperparameter, let's introduce another data point so the training data is no longer linearly separable. {r} Xy_simple = rbind(Xy_simple, c(0, 3.2, 3.2))  and plot it: {r} simple_viz_obj = ggplot(Xy_simple, aes(x = first_feature, y = second_feature, color = response)) + geom_point(size = 5) simple_viz_obj  Let's try the SVM at different $\lambda$ values.
{r} lambda = 0.001 #try also 1.3 Xy_simple_feature_matrix = as.matrix(Xy_simple[, 2 : 3]) n = nrow(Xy_simple_feature_matrix) svm_model = svm(Xy_simple_feature_matrix, Xy_simple$response, kernel = "linear", cost = (2 * n * lambda)^-1, scale = FALSE) summary(svm_model) w_vec_simple_svm = c( svm_model$rho, #the b term -t(svm_model$coefs) %*% Xy_simple_feature_matrix[svm_model$index, ] # the other terms ) sqrt(sum(w_vec_simple_svm^2)) Xy_simple$is_support_vector = rep("no", n) Xy_simple$is_support_vector[svm_model$index] = "yes" simple_viz_obj = ggplot(Xy_simple, aes(x = first_feature, y = second_feature, color = response, shape = is_support_vector)) + geom_point(size = 5) simple_svm_line = geom_abline( intercept = -w_vec_simple_svm[1] / w_vec_simple_svm[3], slope = -w_vec_simple_svm[2] / w_vec_simple_svm[3], color = "purple") simple_svm_top_line = geom_abline( intercept = -(w_vec_simple_svm[1] + 1) / w_vec_simple_svm[3], slope = -w_vec_simple_svm[2] / w_vec_simple_svm[3], color = "yellow") simple_svm_bottom_line = geom_abline( intercept = -(w_vec_simple_svm[1] - 1) / w_vec_simple_svm[3], slope = -w_vec_simple_svm[2] / w_vec_simple_svm[3], color = "yellow") simple_viz_obj + simple_svm_line + simple_svm_top_line + simple_svm_bottom_line  What lesson did we learn here? This hyperparameter really matters! (The method's documentation notes: "Parameters of SVM-models usually must be tuned to yield sensible results!") And it is not clear to me what the "wedge" is anymore when there isn't separability, nor what the support vectors are, nor how the cost parameter relates to the lambda from the notes. What is clear is that we need to figure out a way to select the "right" hyperparameter value automatically. So far neither the perceptron nor the SVM is a flawless algorithm for binary classification. ## Nearest Neighbor algorithm Load up the breast cancer data set again.
{r} Xy = na.omit(MASS::biopsy) #The "breast cancer" data with all observations with missing values dropped X = Xy[, 2 : 10] #V1, V2, ..., V9 y_binary = as.numeric(Xy$class == "malignant")  Let's say we want to build a nearest neighbor model with the first covariate only. We are then looking for the label (response) of the closest x_1. Here is a simple function that does it: {r} nn_function = function(x_star){ y_binary[which.min((X[, 1] - x_star)^2)] } nn_function(7.8) nn_function(5.2)  Why is this silly for this dataset? {r} str(X)  The features are not truly continuous. Would it make sense in higher dimensions? Your homework... Has this been coded before? Definitely... {r} pacman::p_load(class) ?knn  We can fit a knn model *and* predict in one shot via: {r} y_hat = knn(X, c(4, 2, 1, 1, 2, 1, 2, 1, 1), y_binary, k = 1) y_hat  Why is building the model and predicting in one shot natural for knn? Now for an interesting exercise that will set up future classes: {r} y_hat = knn(X, X, y_binary, k = 1) y_hat all.equal(y_hat, factor(y_binary))  No errors! Can this be a good model? No... it is too good to be true: "something" must be wrong! This is the first example of "overfitting". We will explore this later in depth (it is one of the core concepts of this course). Let's see $k > 1$: {r} y_hat = knn(X, X, y_binary, k = 10) y_hat all.equal(y_hat, factor(y_binary))  Why would there be a difference now between predictions and the actual data? {r} rm(list = ls())  ## Simple Linear Regression (p = 1) To understand what the algorithm is doing - best linear fit by minimizing the squared errors - we can draw a picture. First let's make up some very simple training data $\mathbb{D}$.
{r} set.seed(1984) n = 20 x = runif(n) beta_0 = 3 beta_1 = -2 h_star = beta_0 + beta_1 * x epsilons = rnorm(n, mean = 0, sd = 0.33) y = h_star + epsilons  And let's plot the data: {r} pacman::p_load(ggplot2) simple_df = data.frame(x = x, y = y) simple_viz_obj = ggplot(simple_df, aes(x, y)) + geom_point(size = 2) simple_viz_obj  And its true $h^*$ line: {r} true_hstar_line = geom_abline(intercept = beta_0, slope = beta_1, color = "green") simple_viz_obj + true_hstar_line  Now let's calculate the simple least squares coefficients: {r} r = cor(x, y) s_x = sd(x) s_y = sd(y) ybar = mean(y) xbar = mean(x) b_1 = r * s_y / s_x b_0 = ybar - b_1 * xbar b_0 b_1  Note how $b_0$ and $b_1$ are not exactly the same as $\beta_0$ and $\beta_1$. Why? And we can plot it: {r} simple_ls_regression_line = geom_abline(intercept = b_0, slope = b_1, color = "red") simple_viz_obj + simple_ls_regression_line + true_hstar_line  Review of the modeling framework: The difference between the green line and red line is the "estimation error". The difference between the green line and the points is a combination of error due to ignorance and error due to misspecification of $f$ as a straight line. In most real-world applications, estimation error is usually small relative to the other two. In the era of "big data", $n$ is usually big so estimation error is pretty small.
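We can illustrate that last claim by rerunning the simulation above at increasing sample sizes and watching the estimation error in the slope shrink (a quick sketch; the exact values will vary with the seed):

```r
set.seed(1984)
beta_0 = 3
beta_1 = -2
for (n in c(20, 200, 20000)){
  x = runif(n)
  y = beta_0 + beta_1 * x + rnorm(n, mean = 0, sd = 0.33)
  b_1 = cor(x, y) * sd(y) / sd(x) #the least squares slope
  cat("n =", n, " b_1 =", round(b_1, 4), " |b_1 - beta_1| =", round(abs(b_1 - beta_1), 4), "\n")
}
```

With $n$ in the tens of thousands, $b_1$ lands within a few hundredths of $\beta_1 = -2$.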
Recall that the noise (epsilons) are the difference between the data and the green line: {r} simple_df$hstar = beta_0 + beta_1 * simple_df$x simple_viz_obj = ggplot(simple_df, aes(x, y)) + geom_point(size = 2) epsilon_line_segments = geom_segment(aes(xend = x, yend = hstar), position = position_nudge(x = 0.002)) simple_viz_obj + epsilon_line_segments + true_hstar_line  And that the residuals (e's) are the difference between the measurements of the response in the actual data and the red line (the fitted line): {r} simple_df$gs = b_0 + b_1 * simple_df$x simple_viz_obj = ggplot(simple_df, aes(x, y)) + geom_point(size = 2) e_line_segments = geom_segment(aes(xend = x, yend = gs), color = "purple") simple_viz_obj + simple_ls_regression_line + e_line_segments  Examining both at the same time: {r} simple_viz_obj + simple_ls_regression_line + true_hstar_line + e_line_segments + epsilon_line_segments  ## Assessing quality of a simple linear regression Regenerate the data from last week: {r} set.seed(1984) n = 20 x = runif(n) beta_0 = 3 beta_1 = -2 h_star = beta_0 + beta_1 * x epsilons = rnorm(n, mean = 0, sd = 0.33) y = h_star + epsilons pacman::p_load(ggplot2) simple_df = data.frame(x = x, y = y) simple_viz_obj = ggplot(simple_df, aes(x, y)) + geom_point(size = 2) simple_viz_obj  Note: our $\mathcal{A}$ was ordinary least squares. What follows below is a method of assessing model fit quality not only for the least squares line, or any linear fit, but any regression fit. {r} simple_df$yhat = b_0 + b_1 * simple_df$x simple_df$e = y - simple_df$yhat sse = sum(simple_df$e^2) mse = sse / (n - 2) rmse = sqrt(mse) sse mse rmse s_sq_y = var(y) s_sq_e = var(simple_df$e) rsq = (s_sq_y - s_sq_e) / s_sq_y rsq #calculated in a different, but equivalent way sse_0 = (n - 1) * s_sq_y (sse_0 - sse) / sse_0  Let's take a look at $R^2$ visually. We compute null residuals (the $e_0$'s) and model residuals (the $e$'s) and plot them.
{r} simple_df$e_0 = y - mean(y) ggplot(simple_df) + geom_histogram(aes(x = e), fill = "darkgreen", alpha = 0.3) + geom_histogram(aes(x = e_0), fill = "red", alpha = 0.3) + theme(legend.position = "none") ggplot(simple_df) + stat_density(aes(x = e), fill = "darkgreen", alpha = 0.3) + stat_density(aes(x = e_0), fill = "red", alpha = 0.3) + theme(legend.position = "none")  Note residuals always have sample average = 0 (modulo numerical error): {r} mean(simple_df$e_0) mean(simple_df$e)  We will prove this fact later in class. Let's take a look at predictions by truth: {r} ggplot(simple_df, aes(x = yhat, y = y)) + geom_point() + xlim(0, max(simple_df$yhat, y)) + ylim(0, max(simple_df$yhat, y)) + xlab("yhat") + coord_fixed() + geom_abline(intercept = 0, slope = 1, color = "orange")  Linear regression is pretty popular so there's obviously support for this in R. Before we talk about this, we need to discuss another object type in R. It is called the "formula" object. Here's an example: {r} simple_model_formula = as.formula("y ~ x") simple_model_formula  You can use a convenience: {r} simple_model_formula = y ~ x simple_model_formula  How did this work? R interprets this as a formula because it sees the tilde inside (you have to dig pretty deep into the R language to understand the tilde operator but feel free to ignore it for now). By default the formula object, when executed, prints out the string you supplied. But obviously it is not just a string. This object contains instructions to model y with x. This may seem opaque now but you'll see how this works soon. Getting back to support for the default linear model: the popular function that implements least squares linear regression is loaded into R automatically in the package stats. Here is a list of packages that are loaded into R by default (save ggplot2 and RStudio's additions): {r} search()  The function lm runs least squares.
Let's see it: {r} simple_linear_model = lm(simple_model_formula) simple_linear_model  You can skip a step by putting the formula in-line (this is how it's usually done): {r} simple_linear_model = lm(y ~ x) simple_linear_model class(simple_linear_model)  What did this do? It fit the simple linear model $y = w_0 + w_1 x$ by specifying in the formula that $y$ should be modeled with $x$. By default it prints out $b_0$ and $b_1$. You can store the vector via: {r} b = coef(simple_linear_model) b names(b) class(b) #i.e. a vector of numbers of dimension 2 where each entry is named  You can query the linear model about its fit as well: {r} names(summary(simple_linear_model)) summary(simple_linear_model)$r.squared #the R^2 summary(simple_linear_model)$sigma #the RMSE  Cleanup... {r} rm(list = ls())  ## Simple Linear Regression with an example data set Load up the famous Boston Housing data {r} ?MASS::Boston Xy = MASS::Boston head(Xy)  We would like to see how each feature relates to the response, medv. This is a quick and dirty way to do it: {r} for (feature in setdiff(colnames(Xy), "medv")){ plot(ggplot(Xy, aes(x = Xy[, feature], y = medv)) + geom_point() + xlab(feature)) }  Let's try to explain medv using the feature rm in a simple linear regression (least squares) model. {r} x = Xy$rm y = Xy$medv r = cor(x, y) s_x = sd(x) s_y = sd(y) ybar = mean(y) xbar = mean(x) b_1 = r * s_y / s_x b_0 = ybar - b_1 * xbar b_0 b_1  and we can plot this line atop the data: {r} simple_viz_obj = ggplot(Xy, aes(x = rm, y = medv)) + geom_point() simple_ls_regression_line = geom_abline(intercept = b_0, slope = b_1, color = "red") simple_viz_obj + simple_ls_regression_line  And how well did we do? {r} yhat = b_0 + b_1 * x #this is the g(x^*) function! e = y - yhat sse = sum(e^2) mse = sse / length(y) rmse = sqrt(mse) sse mse rmse s_sq_y = var(y) s_sq_e = var(e) rsq = (s_sq_y - s_sq_e) / s_sq_y rsq  SSE is not a super useful number alone. MSE is not super useful alone. RMSE is... what does it mean?
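One way to give the RMSE meaning is the empirical rule heuristic we use later in these notes: if the residuals are roughly bell-shaped, about 95\% of them should land within $\pm 2 \cdot$ RMSE. A sketch checking that on this very fit (refitting here so the block stands alone):

```r
#refit medv ~ rm on the Boston data so this block is self-contained
x = MASS::Boston$rm
y = MASS::Boston$medv
b_1 = cor(x, y) * sd(y) / sd(x)
b_0 = mean(y) - b_1 * mean(x)
e = y - (b_0 + b_1 * x)           #the residuals
rmse = sqrt(sum(e^2) / length(y)) #the same RMSE computed above
mean(abs(e) < 2 * rmse)           #fraction of residuals within +/- 2 RMSE
```

The fraction comes out close to (but not exactly) 95\% since these residuals are not exactly bell-shaped.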
What does $R^2$ mean? {r} Xy$null_residuals = y - mean(y) Xy$residuals = e ggplot(Xy) + stat_density(aes(x = residuals), fill = "darkgreen", alpha = 0.6, adjust = 0.5) + stat_density(aes(x = null_residuals), fill = "red", alpha = 0.6, adjust = 0.5) + theme(legend.position = "none")  This is not a great model. Why? Three sources of error... what do you think are the biggest sources of error? {r} rm(list = ls())  Let's do this again using R's lm function. Here we can leverage the data frame object by using the names of the variables in our model formula: {r} mod = lm(medv ~ rm, data = MASS::Boston)  I read this as "build a linear model where we explain median household value using the average number of rooms in the Boston housing dataset". One line! We can of course ping the model for everything we've been talking about: {r} coef(mod) summary(mod)$r.squared summary(mod)$sigma  And ggplot has amazing integration. Here it runs the model internally and gives smoothed confidence bands (we did not discuss this): {r} ggplot(MASS::Boston, aes(rm, medv)) + geom_point() + geom_smooth(method = 'lm')  {r} rm(list = ls())  Let's take a look at another dataset. {r} ?MASS::Cars93 cars = MASS::Cars93 cars  Usually, we are trying to build a model for Price. Let's see how Horsepower is related to price: {r} pacman::p_load(ggplot2) ggplot(cars, aes(Horsepower, Price)) + geom_point() + geom_smooth(method = 'lm')  {r} simple_linear_model = lm(Price ~ Horsepower, data = cars) coef(simple_linear_model) summary(simple_linear_model)$r.squared summary(simple_linear_model)$sigma  62\% is a pretty good $R^2$! But the RMSE is about \$6,000. Using the empirical rule heuristic, that means you can only predict within $\pm \$12,000$ around 95\% of the time. Not so good! ## Predictions with linear models in R After the model is fit, you may want to predict with it using the $g$ function. Of course R can do this: {r} predict(simple_linear_model, data.frame(Horsepower = 200)) #i.e.
yhat = g(200) predict(simple_linear_model, data.frame(Horsepower = c(200, 300, 500))) #i.e. the yhat vector = [g(200), g(300), g(500)]  ## Simple regression with nominal variables Let's take a look at a factor variable as the predictor. We'll use Origin in this example: {r} table(cars$Origin)  We can plot how this looks: {r} ggplot(cars, aes(Origin, Price)) + geom_point() + geom_smooth(method = 'lm')  Note that ggplot cannot handle fitting a line with a factor variable. However... {r} simple_linear_model = lm(Price ~ Origin, data = cars) coef(simple_linear_model) summary(simple_linear_model)$r.squared summary(simple_linear_model)$sigma  What happened here? The lm function can handle factors. It picks a level to be the reference category (in this case, it's Origin = USA) and the fitted slope $b_1$ is the difference between non-USA and USA. Does it make sense that the slope is positive? Yes - foreign cars charge transport fees and there are a lot of luxury foreign cars. Why is $R^2$ so low?? Remember the null model is one $\bar{y}$; this model is just two $\bar{y}$'s. How much variance can you possibly expect to explain with just two possible prediction values?? Now take a look at RMSE. It's about \$10,000. Before, it was about \$6,000 when the $R^2$ was 62\%. Yes, the RMSE has gone up in this case. $R^2$ is not the proportion of standard error explained; it's the proportion of variance explained, and squared error is very different from its square root. Let's cast this predictor as numeric and run it again just to make sure it will be the same. We first code a new dummy variable: {r} cars$origin_is_not_usa = ifelse(cars$Origin == "non-USA", 1, 0)  and then we model using this new dummy variable: {r} simple_linear_model = lm(Price ~ origin_is_not_usa, data = cars) coef(simple_linear_model) summary(simple_linear_model)$r.squared summary(simple_linear_model)$sigma  Note the reference category is USA and the "non-USA" coefficient indicates the difference in sample averages.
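We can verify that claim directly: with a single binary dummy, least squares reduces to the two group sample means - the intercept is the reference group's mean and the slope is the difference in means. A quick sketch:

```r
cars = MASS::Cars93
mean_usa = mean(cars$Price[cars$Origin == "USA"])
mean_non_usa = mean(cars$Price[cars$Origin == "non-USA"])
coef(lm(Price ~ Origin, data = cars)) #b_0 and b_1 from lm
c(b_0 = mean_usa, b_1 = mean_non_usa - mean_usa) #the same numbers, computed by hand
```
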
Note that now ggplot can handle the line: {r} ggplot(cars, aes(origin_is_not_usa, Price)) + geom_point() + geom_smooth(method = 'lm')  Let's code the dummy variable differently to take a look at the equivalent regression, but this time with the reference category as non-USA: {r} cars$origin_is_usa = ifelse(cars$Origin == "USA", 1, 0)  {r} simple_linear_model = lm(Price ~ origin_is_usa, data = cars) coef(simple_linear_model) summary(simple_linear_model)$r.squared summary(simple_linear_model)$sigma  The coefficients here are like "the opposite" in some sense of what we just saw. And of course $R^2$ and RMSE are equivalent - it's the same linear model with the same information, just coded differently. What if the intercept is left out? You can remove the intercept from the formula by adding "0 + ..." which means "no intercept but ..." {r} simple_linear_model = lm(Price ~ 0 + Origin, data = cars) coef(simple_linear_model) summary(simple_linear_model)$r.squared summary(simple_linear_model)$sigma  Why did $R^2$ seem to change so much?? The lm method is calculating $R^2$ differently here since there is no intercept. The null model it is comparing to is $g_0 = 0$ and not our $g_0 = \bar{y}$. This is a small point about the implementation in R and I won't ever test you on this trivia. Let's do an example of a categorical nominal variable with $L > 2$ levels. {r} summary(cars) table(cars$Type) mean(cars[cars$Type == "Compact", "Price"]) mean(cars[cars$Type == "Large", "Price"]) mean(cars[cars$Type == "Midsize", "Price"]) mean(cars[cars$Type == "Small", "Price"]) mean(cars[cars$Type == "Sporty", "Price"]) mean(cars[cars$Type == "Van", "Price"]) lm(Price ~ 0 + Type, cars) lm(Price ~ Type, cars)  What is the reference level in this variable? Can we change it? {r} levels(cars$Type) cars$Type = relevel(cars$Type, ref = "Van") lm(Price ~ Type, cars) lm(Price ~ 0 + Type, cars)  ## Correlation and Covariance Let's load up the Boston Housing data again.
{r} boston = MASS::Boston str(boston)  Let us take a look at some covariances and correlations with the response, the median home value. {r} pacman::p_load(ggplot2) cov(boston$rm, boston$medv) cor(boston$rm, boston$medv) ggplot(boston, aes(rm, medv)) + geom_point() + geom_smooth(method = 'lm') cov(boston$indus, boston$medv) cor(boston$indus, boston$medv) ggplot(boston, aes(indus, medv)) + geom_point() + geom_smooth(method = 'lm')  Ever wonder why it's called $R^2$? {r} summary(lm(medv ~ rm, boston))$r.squared cor(boston$rm, boston$medv)^2 summary(lm(medv ~ indus, boston))$r.squared cor(boston$indus, boston$medv)^2  ## Multivariate linear regression We want to run a multivariate linear regression $\mathcal{H}$ employing the least squares $\mathcal{A}$ manually using our derived linear algebra. Let us first pull out $\mathbb{D}$ as $y$ and $X$. Let's ensure we augment $X$ to include the 1 vector in front. We need this for the intercept in the $w$ vector in our spec, $\mathcal{H}$. {r} y = MASS::Boston$medv X = cbind(1, MASS::Boston[, 1 : 13]) head(X)  Can we find $X^\top X$? {r} XtX = t(X) %*% X #this errors: matrix algebra does not work on a data frame  The data frame is great, but unfortunately R does not allow us to use matrix algebra on it. So let's create a matrix. Note: there are no factor variables with more than two levels. chas is a binary variable and that's okay. If there were factors with more than two levels, the following would not work. We will explore this later. {r} X = as.matrix(cbind(1, MASS::Boston[, 1 : 13])) head(X)  So $p = 13$ and $p + 1 = 14$. Let's make each predictor name nice just for aesthetic value: {r} colnames(X) colnames(X)[1] = "(intercept)" #this is the standard way lm denotes it (which we will compare to later) colnames(X)  Can we find $X^\top X$ now? {r} XtX = t(X) %*% X dim(XtX)  Is it full rank? {r} XtXinv = solve(XtX)  It worked. This means $X$ is full rank i.e. there is no linear duplication of information over the 13 + 1 predictors.
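To see what a rank deficiency would look like, here is a small sketch (my own illustration, not part of the original notes): duplicate any column and the inversion fails because there now *is* linear duplication of information.

```r
#sketch: duplicating a column makes X rank-deficient,
#so t(X) %*% X is singular and solve() throws an error
X = as.matrix(cbind(1, MASS::Boston[, 1 : 13]))
X_dup = cbind(X, X[, 2]) #append an exact copy of the first predictor (crim)
solve_result = try(solve(t(X_dup) %*% X_dup), silent = TRUE)
class(solve_result) #a "try-error": the inverse does not exist
```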
In case we're in doubt: {r} pacman::p_load(Matrix) rankMatrix(X)[[1]] rankMatrix(t(X))[[1]] rankMatrix(XtX)[[1]] rankMatrix(XtXinv)[[1]]  Let's calculate the LS solution then: {r} b = XtXinv %*% t(X) %*% y b  Interpretation: if crim "increases" by 1, $\hat{y}$ increases by... etc etc. How would crim increase? Big philosophical topic which we are punting on (for now). If all predictors are 0, then $y$ would be predicted to be the intercept. Strange concept... not usually important. What would $g$ look like? {r} g_predict_function = function(x_star){ x_star %*% b } g_predict_function(X[7, ]) y[7] #good prediction!  Pretty simple... and x_star could be a matrix of n_star * (p + 1) - where n_star is however many new observations you wish to predict. We can compute all predictions: {r} yhat = X %*% b head(yhat, 10)  Can you tell this is projected onto a 14-dimensional space from a 506-dimensional space? Not really... but it is... We can calculate the residuals: {r} e = y - yhat head(e, 10)  What is RMSE? {r} SSE = t(e) %*% e MSE = 1 / (nrow(X) - ncol(X)) * SSE RMSE = sqrt(MSE) SSE MSE RMSE  Interpret the RMSE... We can calculate $R^2$ two ways: {r} s_sq_y = var(y) s_sq_e = var(e) Rsq = (s_sq_y - s_sq_e) / s_sq_y Rsq n = length(e) SST = (n - 1) * s_sq_y Rsq = 1 - SSE / SST Rsq  Let's look at the distributions of $y$ and $e$ to get an idea about $R^2$ as we did before: {r} pacman::p_load(ggplot2) ggplot(data.frame(null_residuals = y - mean(y), residuals = e)) + stat_density(aes(x = residuals), fill = "darkgreen", alpha = 0.3) + stat_density(aes(x = null_residuals), fill = "red", alpha = 0.3) + theme(legend.position = "none")  What does this tell you about $R^2$? Now, of course, R has its own function to do all this. We've already seen them! To run a multivariate least squares linear model, {r} mult_lin_mod = lm(medv ~ ., MASS::Boston)  No need to (a) create a matrix from the data frame (b) append a 1's column (c) do the linear algebra. It's all done for you.
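As a sanity check (my own sketch, not part of the original notes), the hand-computed $b$ should agree with lm's coefficients up to floating point error:

```r
#sanity check: the linear-algebra solution matches lm's coefficients
y = MASS::Boston$medv
X = as.matrix(cbind(1, MASS::Boston[, 1 : 13]))
b = solve(t(X) %*% X) %*% t(X) %*% y
b_lm = coef(lm(medv ~ ., MASS::Boston))
max(abs(c(b) - b_lm)) #tiny - on the order of numerical error
```

The two are not bit-identical because lm solves via QR decomposition rather than inverting $X^\top X$, which we will see shortly.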
What is this formula medv ~ .? Previously we've seen medv ~ rm to indicate "fit phenomenon medv using predictor rm". Here, it's "fit phenomenon medv using all available predictors in the data frame". This is a very powerful formula! Let's extract the estimates $b$ as well as $R^2$ and RMSE: {r} coef(mult_lin_mod) summary(mult_lin_mod)$r.squared summary(mult_lin_mod)$sigma  Does R offer a simple way to do $g$? Sure... {r} x_star = MASS::Boston[7, ] y_hat_star = predict(mult_lin_mod, newdata = x_star) y_hat_star y[7]  If you care about the internals of what R is doing, it retraces our steps perfectly. It first creates the "model matrix" which we called the "design matrix" and denoted $X$: {r} Xmm = model.matrix(medv ~ ., MASS::Boston) head(Xmm) head(X) #same  Then it uses an internal function to compute the linear algebra: {r} raw_mod = lm.fit(Xmm, y) raw_mod$coefficients  We will soon see the internals of the lm.fit algorithm when we do the linear algebra of QR decomposition. ## OLS using categorical predictors Note that historically this is called "Analysis of Variance" or "ANOVA" for short. But there is no difference to the computer, it still crunches the same matrices. Let's get the cars data again: {r} cars = MASS::Cars93 str(cars)  Let's try to model Type, a factor with 6 levels. {r} table(cars$Type)  What will $\hat{y}$ look like? Should be the $\bar{y}$'s for each level. What is $p + 1$? 6. Let's see: {r} anova_mod = lm(Price ~ Type, cars) coef(anova_mod) summary(anova_mod)$r.squared  The one categorical variable got blown up into 5 features. How to interpret? First we need to know the "reference category" i.e. which level is missing in the list. We can see from cross-referencing the coefficient names with the table of the raw feature that the reference category is Compact. So what is the prediction for the Compact type? The intercept. What is the prediction for the Large type? Intercept + Large, etc. What actually happened to get the OLS estimates?
Let's see the model matrix: {r} Xmm = model.matrix(Price ~ Type, cars) head(Xmm, 20) table(rowSums(Xmm))  The predictor Type got "dummified" (remember we spoke about this in lecture 1 or 2). There are now 5 dummy variables each representing one of the levels and the reference level is omitted because it is accounted for in the intercept. Let's make sure this is exactly what's going on. {r} y = cars$Price Xt = t(Xmm) XtX = Xt %*% Xmm XtXinv = solve(XtX) b = XtXinv %*% Xt %*% y b yhat = Xmm %*% b e = y - yhat Rsq = (var(y) - var(e)) / var(y) Rsq sqrt(sum(e^2) / (nrow(cars) - 6))  And of course the coefficients and $R^2$ are identical to the output from lm. If we want to do a more "pure ANOVA", we can get rid of the intercept and see the $\bar{y}$'s immediately. This is handled in R's formula designation by adding a zero: {r} anova_mod = lm(Price ~ 0 + Type, cars) coef(anova_mod)  Is this correct? {r} mean(cars$Price[cars$Type == "Compact"]) mean(cars$Price[cars$Type == "Large"]) mean(cars$Price[cars$Type == "Midsize"]) mean(cars$Price[cars$Type == "Small"]) mean(cars$Price[cars$Type == "Sporty"]) mean(cars$Price[cars$Type == "Van"])  What does $R^2$ look like? {r} summary(anova_mod)$r.squared  Remember this from last time? What happened? The $R^2$ calculation in lm is not accurate without the intercept. Keep this in mind. What does the design matrix (model matrix) look like? We can use the model.matrix function to generate the columns of $X$ from the data frame. The argument is the formula we wish to generate the model matrix for. Since model matrices don't require a response, the formula can be one-sided: {r} Xmm = model.matrix(~ 0 + Type, cars) head(Xmm, 20) table(rowSums(Xmm))  Very similar. Regressions without an intercept are not recommended. Here's why. What if we were doing two factors? I want a linear model with both Type and AirBags: {r} table(cars$AirBags)  AirBags is another nominal categorical variable, this time with three levels. We invoke the model as follows.
{r} anova_mod = lm(Price ~ Type + AirBags, cars) coef(anova_mod) summary(anova_mod)$r.squared summary(anova_mod)$sigma  What are the interpretations now? What is the "reference level"? It's actually two levels in one: Type = Compact and AirBags = Driver \& Passenger. A deeper question: can we read off Type = Midsize and AirBags = None? No... this is a modeling "enhancement" we will discuss in a few lectures from now. If we model it without an intercept, {r} anova_mod = lm(Price ~ 0 + AirBags + Type, cars) coef(anova_mod)  we only get $\bar{y}$'s for the first factor predictor crossed with the reference category of the second. So above, TypeCompact refers to the average of Type = Compact and AirBags = Driver \& Passenger. Now let's create a linear model using one categorical predictor and one continuous predictor. The combination is called, for historical reasons, "Analysis of Covariance" or "ANCOVA" for short. Let's use Type and Horsepower: {r} ancova_mod = lm(Price ~ Type + Horsepower, cars) coef(ancova_mod) summary(ancova_mod)$r.squared summary(ancova_mod)$sigma  Interpretation of estimated coefficients? Why did $R^2$ increase? (We will be explaining this in detail in the next unit). What's going on in the design / model matrix? {r} head(model.matrix(Price ~ Type + Horsepower, cars))  Same as the model matrix with just Type. Since Horsepower is continuous, it doesn't get dummified into more features. What if we went back to the Type regression, left out the intercept, dummified and added the intercept back in? {r} Xmm = model.matrix(Price ~ 0 + Type, cars) Xmm = cbind(1, Xmm) head(Xmm, 20)  Are the columns linearly independent? No... so when we try to get the hat matrix, {r} Xmm %*% solve(t(Xmm) %*% Xmm) %*% t(Xmm) #this errors - t(Xmm) %*% Xmm is singular  You can't invert a non-invertible matrix!! What does R do when using the linear model function: {r} coef(lm(cars$Price ~ 0 + Xmm))  SOMEHOW: it doesn't complain since it handles the non-invertibility internally (we don't know how yet...) but we do see that it's busted.
Look at the coefficients! One is missing! What is it doing?? It's arbitrarily dropping one of the collinear columns (just as recommended). ## Multivariate linear regression with the Hat Matrix First let's do the null model to examine what the null hat matrix looks like. In this exercise, we will see that $g_0 = \bar{y}$ is really the OLS solution in the case of no features, only an intercept i.e. $b_0 = \bar{y}$. {r} y = MASS::Boston$medv mean(y)  We can use lm. {r} mod = lm(y ~ 1) coef(mod) mod$fitted.values  Let's do a simple example of projection. Let's project $y$ onto the intercept column, the column of all 1's. What do you think will happen? {r} ones = rep(1, length(y)) H = ones %*% t(ones) / sum(ones^2) H[1 : 5, 1 : 5] #in fact unique(c(H)) 1 / 506  The whole matrix is just one single value for each element! What is this value? It's 1 / 506 where 506 is $n$. So what's going to happen? {r} y_proj_ones = H %*% y head(y_proj_ones)  Projection onto the space of all ones yields the null model ($g = \bar{y}$). It's the same as the model of response = intercept + error, i.e. $y = \mu_y + \epsilon$. The OLS intercept estimate is clearly $\bar{y}$. {r} y = MASS::Boston[, 14] X = as.matrix(cbind(1, MASS::Boston[, 1 : 13])) Xt = t(X) XtXinv = solve(Xt %*% X) b = XtXinv %*% t(X) %*% y b  We can compute all predictions: {r} H = X %*% XtXinv %*% t(X) dim(H) yhat = H %*% y head(yhat)  Can you tell this is projected onto a 14-dimensional space from a 506-dimensional space? Not really... but it is... Now let's project over and over... {r} head(H %*% H %*% H %*% H %*% H %*% H %*% H %*% H %*% H %*% y)  Same thing! Once you project, you're there. That's the idempotency of $H$. Let's make sure that it really does represent the column space of $X$. Let's try to project different columns of $X$: {r} head(X[, 1, drop = FALSE]) head(H %*% X[, 1, drop = FALSE]) head(X[, 2, drop = FALSE]) head(H %*% X[, 2, drop = FALSE]) head(X[, 3, drop = FALSE]) head(H %*% X[, 3, drop = FALSE]) #why not exactly the same?? Numerical error...
head(X[, 1, drop = FALSE] * 3 + X[, 2, drop = FALSE] * 17) head(H %*% (X[, 1, drop = FALSE] * 3 + X[, 2, drop = FALSE] * 17)) #etc....  We can calculate the residuals: {r} e = y - yhat head(e) I = diag(nrow(X)) e = (I - H) %*% y head(e)  Let's do that projection over and over onto the complement of the column space of $X$: {r} head((I - H) %*% (I - H) %*% (I - H) %*% (I - H) %*% (I - H) %*% (I - H) %*% y)  # QR Decomposition Let's go back to the Boston data and regenerate all our quantities: {r} y = MASS::Boston$medv ybar = mean(y) SST = sum((y - ybar)^2) SST X = as.matrix(cbind(1, MASS::Boston[, 1 : 13])) n = nrow(X) p_plus_one = ncol(X) Xt = t(X) XtXinv = solve(Xt %*% X) b = XtXinv %*% Xt %*% y b yhat = X %*% b head(yhat) # e = y - yhat # SSE = sum(e^2) SSR = sum((yhat - ybar)^2) SSR Rsq = SSR / SST Rsq  Now let's do the QR decomposition and see if the projections work. {r} nrow(X) p_plus_one qrX = qr(X) class(qrX) Q = qr.Q(qrX) R = qr.R(qrX) dim(Q) dim(R) Matrix::rankMatrix(Q) Matrix::rankMatrix(R) head(Q[, 1], 50) head(Q[, 2], 50) 1 / sqrt(nrow(X)) sum(Q[, 1]^2) #normalized? sum(Q[, 2]^2) #normalized? Q[, 1] %*% Q[, 2] #orthogonal? Q[, 7] %*% Q[, 13] #orthogonal? Qt = t(Q) yhat_via_Q = Q %*% Qt %*% y head(yhat) head(yhat_via_Q) testthat::expect_equal(c(yhat), c(yhat_via_Q)) #needed to vectorize to make dimensions equal  Can we get the $b$ vector from the $Q$ matrix? {r} solve(R) %*% Qt %*% y b_Q = Qt %*% y b_Q head(Q %*% b_Q) head(X %*% b)  Nope - this is not the same! Why not? The entries of b_Q are coordinates in the basis of $Q$'s columns, not in the original basis of $X$'s columns (you need to multiply by $R^{-1}$ to get back to $b$, as in the first line of the chunk above). But the $Q$ coordinates are useful: each dimension gives one piece of SSR and thus one piece of $R^2$ i.e. $SSR = SSR_1 + \ldots + SSR_p$ and $R^2 = R^2_1 + \ldots + R^2_p$. Our definition of SSR removed $\bar{y}$ i.e. the contribution of the intercept, so we will do so here too. That is the first column of $Q$. Now we add up all the features besides the intercept: {r} partial_SSRs = array(NA, p_plus_one) for (j in 2 : p_plus_one){ qj = Q[, j, drop = FALSE] yhat_j = qj %*% t(qj) %*% y #the projection onto the jth dimension of Q partial_SSRs[j] = sum(yhat_j^2) } round(partial_SSRs) SSR sum(partial_SSRs, na.rm = TRUE) SST partial_Rsqs = partial_SSRs / SST round(partial_Rsqs, 2) sum(partial_Rsqs, na.rm = TRUE)  Some dimensions in this subspace matter more than others. We can do approximately the same regression with fewer than $p$ features. Let's try this: {r} partial_Rsqs_sorted = sort(partial_Rsqs, decreasing = TRUE) partial_Rsqs_sorted_cumul = cumsum(partial_Rsqs_sorted) partial_Rsqs_sorted_cumul #sort Q by Rsq Qsorted = Q[, order(partial_Rsqs, na.last = FALSE, decreasing = TRUE)] #let's take the first 8 Qreduced = Qsorted[, 1 : 8] mod = lm(y ~ Qreduced) summary(mod)$r.squared  Why was the first column of Qsorted dropped?
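One possible answer, demonstrated below (my own sketch, not part of the original notes): since the first column of $X$ is the all-1's column, the first column of $Q$ is the normalized constant vector $\pm 1/\sqrt{n}$. That makes it perfectly collinear with the intercept that lm adds on its own, so lm drops it.

```r
#the first column of Q spans the same space as X's first column (all 1's),
#so it is a constant vector with entries +/- 1 / sqrt(n)
X = as.matrix(cbind(1, MASS::Boston[, 1 : 13]))
Q = qr.Q(qr(X))
head(Q[, 1])                       #constant
length(unique(round(Q[, 1], 10)))  #one unique value
1 / sqrt(nrow(X))                  #same magnitude
```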
## Bonus Material: Eigendecomposition in R and the Projection Matrix {r} B = array(seq(1, 4), dim = c(2, 2)) B eigen_decomp = eigen(B) V = eigen_decomp$vectors V v1 = V[, 1, drop = FALSE] v2 = V[, 2, drop = FALSE] lambdas = eigen_decomp$values lambdas lambda1 = lambdas[1] lambda2 = lambdas[2] B %*% v1 lambda1 * v1 B %*% v2 lambda2 * v2 B %*% v2 == lambda2 * v2 #why not? #diagonalization V %*% diag(lambdas) %*% solve(V)  Let's take a look at a projection matrix: {r} X = model.matrix(medv ~ ., MASS::Boston) head(X) H = X %*% solve(t(X) %*% X) %*% t(X) H[1:5,1:5] dim(H) eigen_decomp = eigen(H) lambdas = eigen_decomp$values head(lambdas, 50) lambdas = as.numeric(eigen_decomp$values) #coerce to real numbers (good enough approximation) head(lambdas, 50) lambdas_rounded = round(lambdas, 10) head(lambdas_rounded, 50) table(lambdas_rounded) sum(lambdas_rounded) #i.e. the rank(H) = # cols in X  It turns out all eigenvalues of a projection matrix are either 0 or 1. Why is this? It makes sense since $H$ fixes every vector already in the column space of $X$: $H \mathbf{1} = \mathbf{1}$, $H x_{\cdot 1} = x_{\cdot 1}$, $H x_{\cdot 2} = x_{\cdot 2}$, \ldots, $H x_{\cdot p} = x_{\cdot p}$. These directions are eigenvectors of $H$ with eigenvalue 1, and there are $p + 1$ of them. Further, {r} sum(diag(H))  The trace (sum of the diagonal entries) is also $p + 1$ i.e. $rank[X]$. This is true since the trace of a matrix is always equal to the sum of its eigenvalues: $tr(H) = tr(V D V^{-1}) = tr(V^{-1} V D) = tr(D) = \sum_i \lambda_i$, which in this case is $rank[X]$. What are the eigenvectors? Only the first ncol(X) = 14 reconstruct the space. Are they the same as the X columns? {r} V = eigen_decomp$vectors V = matrix(as.numeric(V), ncol = ncol(V)) dim(V) V[1 : 5, 1 : ncol(X)] V_X_angles = matrix(NA, ncol(X), ncol(X)) for (i in 1 : ncol(X)){ for (j in 1 : ncol(X)){ if (j > i){ V_X_angles[i, j] = acos(V[, i] %*% X[, j] / sqrt(sum(V[, i]^2) * sum(X[, j]^2))) * 180 / pi } } } V_X_angles  No... and they don't need to be. They just need to represent the same space i.e.
their columns need to span the same 14-dimensional subspace. You can check this by just comparing the orthogonal projection matrix, which is unique for each subspace. {r} V_X = V[, 1 : ncol(X)] H_V = V_X %*% solve(t(V_X) %*% V_X) %*% t(V_X) dim(H_V) testthat::expect_equal(c(H_V), c(H))  And they're the same, as expected. # Correlation zero means orthogonality Let's generate some fake data. In this example we'll have one predictor which will be orthogonal to the centered response. We enforce the response to be centered by including a column of 1's in the QR decomposition (every other column of $Q$ is orthogonal to it, hence centered): {r} n = 100; p = 2 Q = qr.Q(qr(cbind(1, matrix(rnorm(n * p), nrow = n)))) y = Q[, p + 1] x = Q[, 2]  Let's make sure it's orthogonal: {r} x %*% y  If they're orthogonal and y is mean-centered, what is the correlation? {r} cor(x, y)  If the correlation is 0, what is $b_1$, the slope? It has to be zero. Thus $b_0$ has to be $\bar{y}$. Since $y$ was also orthogonalized to the vector of 1's, it's centered and hence has average = 0. So both intercept and slope are 0. What is $b$? {r} mod = lm(y ~ x) coef(mod)  What is $R^2$? Since $x$ and $y$ are orthogonal... a projection onto the colspace of $X$ gets annihilated. {r} summary(mod)$r.squared  # The Source of Overfitting: random correlations are *non-zero*! {r} set.seed(1984) n = 100 x = rnorm(n) x = x - mean(x) y = rnorm(n) y = y - mean(y) x y  In this setup, $x$ and $y$ are centered Gaussian random vectors. Are they orthogonal? {r} x %*% y theta_in_rad = acos(x %*% y / sqrt(sum(x^2) * sum(y^2))) theta_in_rad theta_in_rad * 180 / pi abs(90 - theta_in_rad * 180 / pi)  Nope... what about correlated? {r} cor(x, y) cor(x, y)^2  They are *nearly* uncorrelated but they still have some correlation. How is this possible? There is "random chance" AKA "chance capitalization"! What about the best fitting line? {r} mod = lm(y ~ x) coef(mod)  Slope is about 0.09 which is small but non-zero. What is $R^2$? Since $x$ and $y$ are nearly orthogonal... a projection onto the colspace of $X$ gets nearly annihilated.
{r} summary(mod)$r.squared  but not entirely. Lesson learned: random noise can be correlated with the response $y$ and give you the illusion of fit! # The monotonicity of SSR (or $R^2$) with more features As $p$ increases, $R^2$ goes up. Here's a nice exercise: {r} n = 200 y = rnorm(n) Rsqs = array(NA, n) #we know that Rsq = 0 for the null model (i.e. just regressing on the intercept) Rsqs[1] = 0 #create a matrix with the correct number of rows but no columns X = matrix(NA, nrow = n, ncol = 0) X = cbind(1, X) #for every new p, tack on a new random continuous predictor: for (p_plus_one in 2 : n){ X = cbind(X, rnorm(n)) Rsqs[p_plus_one] = summary(lm(y ~ X))$r.squared } all(diff(Rsqs) > 0) #additional Rsq per new feature is positive mean(diff(Rsqs)) #added Rsq per new feature mean(diff(Rsqs)) * (n - 1) #check to ensure it's 100%  Now let's plot it and see what happens: {r} pacman::p_load(ggplot2) base = ggplot(data.frame(p_plus_one = 1 : n, Rsq = Rsqs)) base + geom_line(aes(x = p_plus_one, y = Rsq))  With each additional predictor, what happens to $R^2$? {r} pacman::p_load(latex2exp) base + geom_line(aes(x = p_plus_one, y = c(0, diff(Rsq)))) + xlab("p + 1") + ylab(TeX("$\\Delta R^2$")) + geom_hline(yintercept = mean(diff(Rsqs)), col = "blue")  How can this possibly be?? The $x$'s are not related to $y$ whatsoever!! Chance capitalization prevails. Each additional predictor picks up another dimension to add to the column space of $X$. Eventually, the projection explains *all* the variance. If $n = p + 1$, that matrix is square and of full rank, hence $\hat{y} = y$ and all residuals $e = 0$ since it is merely solving $n$ linearly independent equations. So here's an idea. To get a perfect fit, just augment your design matrix with $n - (p + 1)$ random vectors and you get $R^2 = 100\%$!! There must be something wrong with this!! Even if $p$ is large and $5, ][1, ] xstar$price predict(mod, xstar) exp(predict(log_linear_mod, xstar))  That's a pretty bad residual!
How about a log-log model? {r} log_log_linear_mod = lm(log(price) ~ log(carat), diamonds) b = coef(log_log_linear_mod) b  Let's see what it looks like: {r} ggplot(diamonds, aes(x = log(carat), y = log(price))) + geom_point() + geom_abline(intercept = b[1], slope = b[2], col = "green")  Well look at that! That's a nice looking model. (Note that the slope coefficients in log-log models, i.e. $b_2$ here, are called "elasticity" in Econ 382 as it measures how the relative change in x affects the relative change in y). How are our metrics? {r} summary(log_log_linear_mod)$r.squared summary(log_log_linear_mod)$sigma  Let's see apples-to-apples to the natural y model. {r} log_y_hat = log_log_linear_mod$fitted.values y_hat = exp(log_y_hat) e = diamonds$price - y_hat SSE = sum(e^2) SST = sum((diamonds$price - mean(diamonds$price))^2) Rsq = 1 - sum(e^2) / SST Rsq RMSE = sqrt(SSE / (nrow(diamonds) - 2)) RMSE  This is on par with the vanilla OLS model, but still doesn't "beat it". There was no guarantee that we would "beat it" even though we used procedures that are reasonable and popular! My belief is that we really should be wary of the maximum price and employ that in the model. Maybe we'll do this in a lab exercise? Let's repeat this entire exercise using the length of the diamond. The length of the diamond feature is confusingly named "x" in the dataset. It is an "x" but it's also the diamond's "x"!!! {r} ggplot(diamonds, aes(x = x, y = price)) + geom_point()  Besides the non-linear relationship, what else do you see? Mistakes in the dataset! Can a real diamond have zero length?? No. But this is the real world - there are mistakes all the time. Let's kill these rows! How many are we dealing with here? {r} nrow(diamonds[diamonds$x == 0, ])  {r} diamonds = diamonds[diamonds$x != 0, ]  What's the deal with the x variable now? {r} skimr::skim(diamonds$x)  How good does a best guess linear relationship do?
{r} mod = lm(price ~ x, diamonds) b = coef(mod) b summary(mod)$r.squared summary(mod)$sigma  Let's see the best fit line $g(x)$ visually: {r} ggplot(diamonds, aes(x = x, y = price)) + geom_point() + geom_abline(intercept = b[1], slope = b[2], col = "green")  Again we got some bad extrapolation going on which we can't fix using a purely linear modeling strategy. Let's log-linearize it and see how we do. {r} log_linear_mod = lm(log(price) ~ x, diamonds) b = coef(log_linear_mod) ggplot(diamonds, aes(x = x, y = log(price))) + geom_point() + geom_abline(intercept = b[1], slope = b[2], col = "green")  How did we do? Ensure it's apples-to-apples. {r} log_y_hat = log_linear_mod$fitted.values y_hat = exp(log_y_hat) e = diamonds$price - y_hat SSE = sum(e^2) SST = sum((diamonds$price - mean(diamonds$price))^2) Rsq = 1 - sum(e^2) / SST Rsq RMSE = sqrt(SSE / (nrow(diamonds) - 2)) RMSE  Still not better. Log-log? {r} log_log_linear_mod = lm(log(price) ~ log(x), diamonds) b = coef(log_log_linear_mod) ggplot(diamonds, aes(x = log(x), y = log(price))) + geom_point() + geom_abline(intercept = b[1], slope = b[2], col = "green")  How did we do? {r} log_y_hat = log_log_linear_mod$fitted.values y_hat = exp(log_y_hat) e = diamonds$price - y_hat SSE = sum(e^2) SST = sum((diamonds$price - mean(diamonds$price))^2) Rsq = 1 - sum(e^2) / SST Rsq RMSE = sqrt(SSE / (nrow(diamonds) - 2)) RMSE  We did it. We found a log transformation that seems to give higher predictive power than the vanilla linear model on the raw response and raw feature. This brings up the whole idea of "model selection". We went hunting for models until we found one that's better. We will hopefully do model selection today... Transforming y is a big decision as it changes the response metric!
The rule of thumb is that it is easier to model a response metric that has less extreme values (especially when using linear models) as the extreme values have a big impact on slope coefficients and can distort the best fit line due to the least squares minimization (hence the popularity of logging the response). Let's see if we get anywhere with this using all the features in this model. {r} lm_y = lm(price ~ ., diamonds) lm_ln_y = lm(log(price) ~ ., diamonds) summary(lm_y)$r.squared summary(lm_y)$sigma #now for the log-linear model y_hat = exp(lm_ln_y$fitted.values) e = diamonds$price - y_hat SSE = sum(e^2) SST = sum((diamonds$price - mean(diamonds$price))^2) Rsq = 1 - sum(e^2) / SST Rsq RMSE = sqrt(SSE / (nrow(diamonds) - 2)) RMSE  This is pretty convincing evidence that this transformation does a better job (at least in our linear modeling context). Let's look at one prediction: {r} predict(lm_y, diamonds[12345, ]) exp(predict(lm_ln_y, diamonds[12345, ])) diamonds$price[12345]  Again, we should be careful: when we use $g$ after logging, we have to exponentiate the result (middle line above). Small point: this exponentiation is known to create bias because $E[Y] \neq \exp(E[\ln(Y)])$ (for those who took 368 - remember Jensen's inequality?). For the purposes of this class, this can be ignored since we are evaluating g on its own merits and we're doing so honestly. If you like this stuff, there are a whole bunch of transformations out there that are even cooler than the natural log. # Linear Models with Feature Interactions Let's go back to modeling price with weight. Let us add a third variable to this plot, color, a metric about the "yellowness" of the diamond. This is an ordinal categorical variable ranging from D (most clear i.e. best) to J (most yellow in this dataset i.e. worst).
{r} pacman::p_load(ggplot2) base = ggplot(diamonds, aes(x = carat, y = price)) base + geom_point(aes(col = color)) + scale_color_brewer(type = "div")  We can split the data on color to see it more clearly: {r} base + geom_point() + facet_wrap(~ color, ncol = 3) + aes(color = color) + scale_color_brewer(type = "div")  What do we see here? It looks like the slope of the price vs. carat linear model is slightly affected by color. For instance, the "D" color diamonds' price increases much faster in weight than the "E" color diamonds' price does, etc. Why do you think this is? We can picture two of these linear models below by fitting two submodels, one for D and one for J: {r} mod_D = lm(price ~ carat, diamonds[diamonds$color == "D", ]) b_D = coef(mod_D) mod_J = lm(price ~ carat, diamonds[diamonds$color == "J", ]) b_J = coef(mod_J) b_D b_J  Let's see it on the plot: {r} base + geom_point(aes(col = color)) + scale_color_brewer(type = "div") + geom_abline(intercept = b_D[1], slope = b_D[2], col = "blue", lwd = 2) + geom_abline(intercept = b_J[1], slope = b_J[2], col = "red", lwd = 2)  This indicates a separate intercept and carat-slope for each color. How is this done? Interacting carat and color. The formula notation has the * operator for this. It is multiplication in formula land after all! {r} mod = lm(price ~ color, diamonds) coef(mod) mod = lm(price ~ carat * color, diamonds) coef(mod) #beware: strange naming convention on the interaction terms for an ordered factor diamonds$color = factor(diamonds$color, ordered = FALSE) mod = lm(price ~ carat * color, diamonds) coef(mod) #much better...  The reference category is color D. This means every other color should start lower and have a lower slope. This is about what we see above. How much of a better model is this than a straight linear model?
{r} mod_vanilla = lm(price ~ carat + color, diamonds) summary(mod_vanilla)$r.squared summary(mod_vanilla)$sigma summary(mod)$r.squared summary(mod)$sigma  You can get more predictive accuracy out of this. We added degrees of freedom - is this gain real? Yes. With these few extra features and $n = 54,000$, there is no chance this gain came from overfitting noise. Add 10,000 garbage features, and yes, there will be overfitting. Let's take a look at carat with another variable, depth, a continuous predictor. High depth indicates diamonds are skinny and tall; low depth indicates diamonds are flat like a pancake. {r} ggplot(diamonds, aes(x = carat, y = price)) + geom_point(aes(col = depth), lwd = 1, alpha = 0.5) + scale_colour_gradientn(colours = rainbow(5))  It seems people like flatter diamonds and are willing to pay more per carat. Let's see this in the regression: {r} mod = lm(price ~ carat + depth, diamonds) coef(mod) summary(mod)$r.squared summary(mod)$sigma mod = lm(price ~ carat * depth, diamonds) coef(mod) summary(mod)$r.squared summary(mod)$sigma  If carat increases by one unit, how much does price increase by? In the interaction model, it depends on the value of depth - that's what the interaction term means. How about cut? {r} ggplot(diamonds, aes(x = carat, y = price)) + geom_point(aes(col = cut), lwd = 0.5) + scale_color_brewer(type = "div")  Likely something here. {r} mod = lm(price ~ carat, diamonds) coef(mod) summary(mod)$r.squared summary(mod)$sigma mod = lm(price ~ carat + cut, diamonds) summary(mod)$r.squared summary(mod)$sigma mod = lm(price ~ carat * cut, diamonds) coef(mod) summary(mod)$r.squared summary(mod)$sigma  Yes. Can we include all these interactions? {r} mod = lm(price ~ carat + color + depth + cut, diamonds) summary(mod)$r.squared summary(mod)$sigma mod = lm(price ~ carat * (color + depth + cut), diamonds) summary(mod)$r.squared summary(mod)$sigma coef(mod)  A decent gain once again. What does the design matrix look like there? What is $p$?
{r} diamonds$cut = factor(diamonds$cut, ordered = FALSE) Xmm = model.matrix(price ~ carat * (color + depth + cut), diamonds) head(Xmm)  Can we take a look at interactions of two categorical variables? {r} plot1 = ggplot(diamonds, aes(x = cut, y = color)) + geom_jitter(aes(col = price), lwd = 0.5) + scale_colour_gradientn(colours = rainbow(5)) plot1  Cool animation possible. May not work because it needs a ton of packages... {r} pacman::p_load_gh("dgrtwo/gganimate") plot1 + transition_time(price)  Not so clear what's going on here. Let's see what the regressions say: {r} mod = lm(price ~ color + cut, diamonds) summary(mod)$r.squared summary(mod)$sigma mod = lm(price ~ color * cut, diamonds) coef(mod) summary(mod)$r.squared summary(mod)$sigma  Not too much gain. # More advanced modeling {r} rm(list = ls()) pacman::p_load(ggplot2) diamonds$cut = factor(diamonds$cut, ordered = FALSE) diamonds$color = factor(diamonds$color, ordered = FALSE) diamonds$clarity = factor(diamonds$clarity, ordered = FALSE) #drop some obviously nonsense observations diamonds = diamonds[diamonds$carat <= 2 & diamonds$x != 0 & diamonds$y != 0 & diamonds$z != 0 & diamonds$depth != 0 & diamonds$table != 0,] diamonds$ln_price = log(diamonds$price) diamonds$ln_carat = log(diamonds$carat) diamonds$ln_x = log(diamonds$x) diamonds$ln_y = log(diamonds$y) diamonds$ln_z = log(diamonds$z) diamonds$ln_depth = log(diamonds$depth) diamonds$ln_table = log(diamonds$table) n = nrow(diamonds) set.seed(1984) diamonds = diamonds[sample(1 : n), ]  Note: I will now model price, not ln_price, as ln_price yields residuals on an entirely different scale (orders of magnitude smaller), and fitting a good ln_price model will take more time.
{r} #All model formulas for reuse later model_formulas = list( A = price ~ ln_carat, B = price ~ ln_carat * clarity, C = price ~ ln_carat * (clarity + cut + color), D = price ~ (ln_carat + ln_x + ln_y + ln_z + ln_depth + ln_table) * (clarity + cut + color), E = price ~ (ln_carat + ln_x + ln_y + ln_z + ln_depth + ln_table + carat + x + y + z + depth + table) * (clarity + cut + color) ) #Model A mod = lm(model_formulas[["A"]], diamonds) summary(mod)$sigma mod$rank #i.e. degrees of freedom = # vectors in colsp[X] to project onto #Model B mod = lm(model_formulas[["B"]], diamonds) summary(mod)$sigma mod$rank #i.e. degrees of freedom = # vectors in colsp[X] to project onto #Model C mod = lm(model_formulas[["C"]], diamonds) summary(mod)$sigma mod$rank #i.e. degrees of freedom = # vectors in colsp[X] to project onto #Model D mod = lm(model_formulas[["D"]], diamonds) summary(mod)$sigma mod$rank #i.e. degrees of freedom = # vectors in colsp[X] to project onto #Model E mod = lm(model_formulas[["E"]], diamonds) summary(mod)$sigma mod$rank #i.e. degrees of freedom = # vectors in colsp[X] to project onto  Big win on E... the reason, I think, is that now we have a flexible curve for each continuous feature and that curve changes with all the categorical features. Create model (F) which is the same as before except also include third-degree polynomials of the continuous features interacted with the categorical features and gauge performance against (E). By this time you're getting good with R's formula syntax! {r} #Model F model_formulas[["F"]] = price ~ (ln_carat + ln_x + ln_y + ln_z + ln_depth + ln_table + poly(carat, 3) + poly(x, 3) + poly(y, 3) + poly(z, 3) + poly(depth, 3) + poly(table, 3)) * (cut + color + clarity) mod = lm(model_formulas[["F"]], diamonds) summary(mod)$sigma mod$rank  Can you think of any other way to expand the candidate set $\mathcal{H}$? Discuss.
We now have a very flexible curve for each continuous feature and that curve is fit individually for each level of all categories. We could indeed go further by allowing the levels among cut, color and clarity to interact and further allowing the continuous features to interact. There is another strategy to increase the complexity of $\mathcal{H}$ without using function transforms of the features and interactions... we will have to see after winter break... We should probably assess oos performance now. Sample 4,000 diamonds and use these to create a training set of 3,600 random diamonds and a test set of 400 random diamonds. Define K and do this splitting: {r} K = 10 n_sub = 4000 set.seed(1984) n_test = 1 / K * n_sub n_train = n_sub - n_test test_indicies = sample(1 : n, n_test) train_indicies = sample(setdiff(1 : n, test_indicies), n_train) all_other_indicies = setdiff(1 : n, c(test_indicies, train_indicies))  Compute in-sample and out-of-sample performance for models A-F. Use s_e as the metric (standard error of the residuals). Create a list with keys A, B, ..., F to store these metrics. Remember the performance here will be worse than before since previously we used nearly 52,000 diamonds to build a model and now it's only 3,600. {r} oos_se = list() all_models_train = list() for (model_idx in LETTERS[1 : 6]){ all_models_train[[model_idx]] = lm(model_formulas[[model_idx]], diamonds[train_indicies, ]) summary(all_models_train[[model_idx]])$sigma oos_se[[model_idx]] = sd(diamonds$price[test_indicies] - predict(all_models_train[[model_idx]], diamonds[test_indicies, ])) } oos_se  You computed oos metrics only on n_* = 400 diamonds. What problem(s) do you expect in these oos metrics? They are variable. And something is wrong with F! Possibly Runge's phenomenon? To do the K-fold cross validation we need to get the splits right and crossing is hard. I've developed code for this already. Run this code.
{r} set.seed(1984) temp = rnorm(n_sub) folds_vec = cut(temp, breaks = quantile(temp, seq(0, 1, length.out = K + 1)), include.lowest = TRUE, labels = FALSE) head(folds_vec, 100)  Comment on what it does and how to use it to do a K-fold CV: This code tells you which index is inside of which fold's test set. Do the K-fold cross validation for all models and compute the overall s_e and s_s_e. {r} oos_se = list() oos_s_se = list() for (model_idx in LETTERS[1 : 6]){ e_vec_k = list() #for each one for (k in 1 : K){ test_indicies_k = which(folds_vec == k) train_indicies_k = which(folds_vec != k) mod = lm(model_formulas[[model_idx]], diamonds[train_indicies_k, ]) e_vec_k[[k]] = sd(diamonds$price[test_indicies_k] - predict(mod, diamonds[test_indicies_k, ])) } oos_se[[model_idx]] = mean(unlist(e_vec_k)) #note: not exactly the overall sd, but close enough oos_s_se[[model_idx]] = sd(unlist(e_vec_k)) } res = rbind(unlist(oos_se), unlist(oos_s_se)) rownames(res) = c("avg", "sd") res  Does K-fold CV help reduce variance in the oos s_e? Discuss. Yes, because it is an average, so its standard error is approximately s_e / sqrt(K). It's only approximate since this is the formula for the std err of independent samples and the K oos samples are not exactly independent. Model F seems more variable than all of them. Why? Possibly Runge's phenomenon? Imagine using the entire rest of the dataset besides the 4,000 training observations divvied up into slices of 400. Measure the oos error for each slice and also plot it.
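The claim that averaging K estimates shrinks the standard error by roughly sqrt(K) is easy to check with a quick simulation. This is a sketch under the idealized assumption of independent estimates (which, as noted above, K-fold estimates aren't exactly):

```r
set.seed(1984)
K = 10
#5,000 single estimates drawn with sd = 1 vs 5,000 averages of K such estimates
single_estimates = rnorm(5000)
averaged_estimates = replicate(5000, mean(rnorm(K)))
sd(single_estimates)   #about 1
sd(averaged_estimates) #about 1 / sqrt(10) ~ 0.32
```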
{r} n_step = 1 / K * n_sub oos_se = list() ses = list() starting_ks = seq(from = 1, to = (length(all_other_indicies) - n_step), by = n_step) for (model_idx in LETTERS[1 : 6]){ se_k = list() #for each one for (k in 1 : length(starting_ks)){ diamonds_k = diamonds[all_other_indicies[starting_ks[k] : (starting_ks[k] + n_step - 1)], ] se_k[[k]] = sd(diamonds_k$price - predict(all_models_train[[model_idx]], diamonds_k)) } ses[[model_idx]] = se_k oos_se[[model_idx]] = unlist(se_k) } pacman::p_load(reshape2) ggplot(reshape2::melt(oos_se)) + geom_boxplot(aes(x = L1, y = value)) + xlab("model") ggplot(reshape2::melt(oos_se)) + geom_boxplot(aes(x = L1, y = value)) + xlab("model") + ylim(0, 5000)  What do we learn from this? Model F is very risky. What exactly is going wrong? Part of the agony of data science is the "debugging", just like when debugging software. But here the errors can be in the modeling too! {r} max(unlist(ses[["F"]])) k_BAD = which.max(unlist(ses[["F"]])) diamonds_BAD = diamonds[all_other_indicies[starting_ks[k_BAD] : (starting_ks[k_BAD] + n_step - 1)], ] diamonds_BAD e_BAD = diamonds_BAD$price - predict(all_models_train[["F"]], diamonds_BAD) #explicitly model F tail(sort(abs(e_BAD))) diamonds_BAD[which.max(abs(e_BAD)), ] summary(diamonds[train_indicies, ])  Is it extrapolation? Yes... y = 58.9 and in a cubic function, that would be astronomical. The real mistake is that y = 58.9 is impossible. The data wasn't cleaned. We will have an exercise in that. But this happens all the time and you don't want to be using polynomial functions for this reason since the new data may extrapolate very poorly. Luckily we won't be dealing with these much longer since there are less risky ways of creating a transformed basis of Xraw. ## The K tradeoff K determines how large the training set is relative to the test set when you're doing honest validation for an algorithm. For now, let's not use K-fold CV, but only examine one split at a time.
Consider this simulated dataset with 50 observations: {r} n = 50 xmin = 0 xmax = 4 sigma_e_ignorance = 0.8 # set.seed(1) set.seed(1984) x = runif(n, xmin, xmax) y = 2 + 3 * x^2 + rnorm(n, 0, sigma_e_ignorance) #f(x) + delta Xy = data.frame(x = x, y = y) pacman::p_load(ggplot2) data_plot = ggplot(Xy) + aes(x = x, y = y) + geom_point() data_plot  Note how $f(x)$ is quadratic and there is random noise which is "ignorance error". The random noise will be part of generalization error and can never go away. If we use OLS with no derived features, then we can at most get $h^*(x)$. Let's see what $h^*(x) = \beta_0 + \beta_1 x$ truly is. To do this, we imagine we see an absolute ton of data and run OLS on it. {r} n_hidden = 1e6 x_hidden = seq(from = xmin, to = xmax, length.out = n_hidden) y_hidden = 2 + 3 * x_hidden^2 + rnorm(n_hidden, 0, sigma_e_ignorance) h_star_mod = lm(y_hidden ~ x_hidden) coef(h_star_mod)  The fact that $\beta = [-6~~12]^\top$ can actually be derived with calculus: minimize $\int_0^4 ((2 + 3x^2) - (b_0 + b_1 x))^2 dx$ and solve for $b_0$ and $b_1$ explicitly. Plotting that over $\mathbb{D}$ we obtain {r} data_plot + geom_abline(intercept = coef(h_star_mod)[1], slope = coef(h_star_mod)[2], color = "green")  That is the best we're going to get. However, $g_{final}$ falls far short of it due to estimation error since n is only 50: {r} g_final_mod = lm(y ~ x) coef(g_final_mod) data_plot + geom_abline(intercept = coef(h_star_mod)[1], slope = coef(h_star_mod)[2], color = "green") + geom_abline(intercept = coef(g_final_mod)[1], slope = coef(g_final_mod)[2], color = "red")  The actual error of g_final can be estimated by imagining tons of future observations: {r} y_hat_g_final = predict(g_final_mod, data.frame(x = x_hidden)) gen_error_true = sd(y_hidden - y_hat_g_final) gen_error_true  The model $g$ can vary quite a bit as we subsample $\mathbb{D}$ which is what happens when you do train-test splits. It varies a lot because there is large misspecification error.
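The calculus mentioned above works out as follows (treating x as uniform on [0, 4] and ignoring the mean-zero noise):

```latex
\min_{b_0, b_1} \int_0^4 \left((2 + 3x^2) - (b_0 + b_1 x)\right)^2 \,dx
% differentiate under the integral sign and set each partial to zero:
\frac{\partial}{\partial b_0}: \;\; \int_0^4 \left(2 + 3x^2 - b_0 - b_1 x\right) dx = 72 - 4 b_0 - 8 b_1 = 0
\frac{\partial}{\partial b_1}: \;\; \int_0^4 x\left(2 + 3x^2 - b_0 - b_1 x\right) dx = 208 - 8 b_0 - \tfrac{64}{3} b_1 = 0
% solving the two linear equations yields
b_0 = -6, \qquad b_1 = 12
```

which matches the coefficients the million-observation OLS recovers (up to simulation noise).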
If the model was correctly specified, the results of everything that follows will be less impressive. But in the real world, is your model ever correctly specified? Is $f \in \mathcal{H}$? NO. So this is more realistic. Now let's let K be small. Let K = 2, meaning an even 50-50 split of train and test. {r} K = 2 prop_train = (K - 1) / K n_train = round(prop_train * n) set.seed(123) index_train = sample(1 : n, n_train, replace = FALSE) index_test = setdiff(1 : n, index_train) pacman::p_load(testthat) expect_equal(sort(c(index_test, index_train)), 1:n) x_train = x[index_train] y_train = y[index_train] Xytrain = data.frame(x = x_train, y = y_train) x_test = x[index_test] y_test = y[index_test] g_mod = lm(y ~ ., Xytrain) y_hat_g = predict(g_mod, data.frame(x = x_test)) g_s_e_K_2 = sd(y_test - y_hat_g) g_s_e_K_2 gen_error_true  Although I cooked the books by setting the seed, this realization makes sense. If K = 2, I build the model g with half the data that g_final gets. Less data to train on => higher generalization error. How about if K is large? Let's say $K = n / 2$ meaning n_train = 48 and n_test = 2. {r} K = n / 2 prop_train = (K - 1) / K n_train = round(prop_train * n) set.seed(123) index_train = sample(1 : n, n_train, replace = FALSE) index_test = setdiff(1 : n, index_train) pacman::p_load(testthat) expect_equal(sort(c(index_test, index_train)), 1:n) x_train = x[index_train] y_train = y[index_train] Dtrain = data.frame(x = x_train, y = y_train) x_test = x[index_test] y_test = y[index_test] g_mod = lm(y ~ ., Dtrain) y_hat_g = predict(g_mod, data.frame(x = x_test)) g_s_e_K_n_over_2 = sd(y_test - y_hat_g) g_s_e_K_2 g_s_e_K_n_over_2 gen_error_true  Although I cooked the books again by setting the seed, this also makes sense. More data to train on => less error, but still more error than using all the data. In reality, there is massive variance over specific splits! Let's run the simulation with these two K values many times. While we're at it, let's do all K's!
Well, what are all the valid K's? If you want to keep the fold sizes the same, any divisor of n except the trivial 1 since n = 1 * n. A K = 1 would mean there's no split!!! How to find divisors? Of course there's a package for this. {r} pacman::p_load(numbers) setdiff(divisors(n), 1)  But should we also include the trivial n? Yes, K = n is indeed a valid divisor. And this type of CV is called "leave one out cross validation" (LOOCV). Now we compute the errors over K: {r} Nsim_per_K = 5000 Kuniqs = setdiff(divisors(n), 1) num_Kuniqs = length(Kuniqs) Ks = rep(Kuniqs, Nsim_per_K) results = data.frame(s_e = rep(NA, Nsim_per_K * num_Kuniqs), K = rep(NA, Nsim_per_K * num_Kuniqs)) for (i in 1 : length(Ks)){ K = Ks[i] prop_train = (K - 1) / K n_train = round(prop_train * n) index_train = sample(1 : n, n_train, replace = FALSE) index_test = setdiff(1 : n, index_train) expect_equal(sort(c(index_test, index_train)), 1:n) x_train = x[index_train] y_train = y[index_train] Xytrain = data.frame(x = x_train, y = y_train) x_test = x[index_test] y_test = y[index_test] g_mod = lm(y ~ ., Xytrain) y_hat_g = predict(g_mod, data.frame(x = x_test)) g_s_e = sum(abs(y_test - y_hat_g)) / length(y_test) #mean absolute oos residual results[i, ] = c(g_s_e, K) } save(results, file = "K_results.RData")  Let's take the average error over each simulated split and also its variability: {r} load("K_results.RData") #don't worry about the following code... we will learn dplyr later... pacman::p_load(dplyr) results_summary = results %>% group_by(K) %>% summarize(avg_s_e = mean(s_e), s_s_e = sd(s_e), s_avg_s_e = sd(s_e) / sqrt(Nsim_per_K)) results_summary  Now let's see what the distributions look like to visualize the means and variances.
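If you'd rather not pull in a package, divisors can also be found with one line of base R. A minimal sketch:

```r
n = 50
#a divisor k satisfies n mod k == 0; check every candidate from 1 to n
divisors_of_n = which(n %% (1 : n) == 0)
setdiff(divisors_of_n, 1) #the valid K's: 2 5 10 25 50
```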
{r} sim_plot = ggplot(results) + aes(x = s_e) + geom_density(aes(fill = factor(K)), alpha = 0.3) + xlim(0, NA) + geom_vline(data = results_summary, aes(xintercept = avg_s_e, color = factor(K)), size = 2) sim_plot  The main takeaways are: (1) the std err of the generalization error estimate is much lower for low K than high K. With high K, the test set is small, meaning the estimate has high variance; with low K, the test set is large, meaning you can measure it with low variance. (2) the average generalization error estimate is lower for high K than low K. With high K, the training set is large, meaning $g$ is closer to g_final and thus has higher expected accuracy; with low K, the training set is small, meaning $g$ is further from g_final and thus has lower expected accuracy. Thus, the tradeoff is bias vs. variance. There are many similar tradeoffs in statistics. We will see one later when we do machine learning. Are these estimates accurate for what we really care about? No... what we really care about is the generalization error of g_final, which we picture below: {r} sim_plot + geom_vline(xintercept = gen_error_true, col = "black", size = 2)  Remember, g_final's error should be lower than both averages since it uses all the data. But we see above it's higher! So what happened? Simple... we are mixing apples and oranges. We calculated the real generalization error by looking at one million future observations. We calculated the red and blue distributions by looking at our data only, which is a random realization of many such datasets! Thus, our generalization errors are biased based on the specific n observations in D we received. We will see that K-fold helps a bit with this. But there is nothing we can do about it beyond that (besides collect more observations). If you get a weird sample, you get a weird sample! How would we be able to generate the picture we really want to see? We would run this simulation over many datasets and average. That would be a giant simulation.
To show that this is the case, go back and change the seed in the first chunk and rerun. You'll see a different black vertical bar. What is the main takeaway? K matters because it induces a tradeoff. It shouldn't be too large or too small (as we believe at the moment). And, generalization error estimation is very variable in low n. To see this shrink, go back and increase n. Let's do the demo again. This time, we generate all samples from the data generating process. {r} rm(list = ls()) pacman::p_load(numbers, testthat, ggplot2) n = 50 xmin = 0 xmax = 4 sigma_e_ignorance = 0.8 set.seed(1984) Ntest = 1e6 Kuniqs = setdiff(divisors(n), 1) Nsim = 5e6 results = data.frame(s_e = numeric(), K = character()) for (nsim in 1 : Nsim){ x = runif(n, xmin, xmax) y = 2 + 3 * x^2 + rnorm(n, 0, sigma_e_ignorance) #f(x) + delta #do all the oos validation for (i in 1 : length(Kuniqs)){ K = Kuniqs[i] prop_train = (K - 1) / K n_train = round(prop_train * n) index_train = sample(1 : n, n_train, replace = FALSE) index_test = setdiff(1 : n, index_train) expect_equal(sort(c(index_test, index_train)), 1:n) x_train = x[index_train] y_train = y[index_train] Xytrain = data.frame(x = x_train, y = y_train) x_test = x[index_test] y_test = y[index_test] g_mod = lm(y ~ ., Xytrain) y_hat_g = predict(g_mod, data.frame(x = x_test)) g_s_e = sum(abs(y_test - y_hat_g)) / length(y_test) results = rbind(results, data.frame(s_e = g_s_e, K = K)) } #now estimate generalization error on g_final g_final_mod = lm(y ~ x) x_star = seq(from = xmin, to = xmax, length.out = Ntest) y_star = 2 + 3 * x_star^2 + rnorm(Ntest, 0, sigma_e_ignorance) y_hat_g_final_star = predict(g_final_mod, data.frame(x = x_star)) gen_error_true = sum(abs(y_star - y_hat_g_final_star)) / Ntest results = rbind(results, data.frame(s_e = gen_error_true, K = "true")) save(results, file = "K_dgp_results.RData") }  Let's take the average error over each simulated split and also its variability: {r} load("K_dgp_results.RData") #don't worry about the following
code... we will learn dplyr later... pacman::p_load(dplyr) results_summary = results %>% group_by(K) %>% summarize(avg_s_e = mean(s_e), s_s_e = sd(s_e), s_avg_s_e = sd(s_e) / sqrt(Nsim)) %>% arrange(as.numeric(K)) results_summary  Now let's see what the distributions look like to visualize the means and variances. {r} sim_plot = ggplot(results) + aes(x = s_e) + geom_density(aes(fill = factor(K)), alpha = 0.3) + xlim(0, NA) + geom_vline(data = results_summary, aes(xintercept = avg_s_e, color = factor(K)), size = 1) sim_plot  # K-fold CV variance reduction Let's run this same demo again except with K-fold CV within the runs. {r} rm(list = ls()) pacman::p_load(numbers, testthat, ggplot2) n = 50 xmin = 0 xmax = 4 sigma_e_ignorance = 0.8 set.seed(1984) Ntest = 1e6 Kuniqs = setdiff(divisors(n), 1) Nsim = 3e6 #we will add to the previous experiment's results load("K_dgp_results.RData") for (nsim in 1 : Nsim){ x = runif(n, xmin, xmax) y = 2 + 3 * x^2 + rnorm(n, 0, sigma_e_ignorance) #f(x) + delta #do all the oos validation for (i in 1 : length(Kuniqs)){ K = Kuniqs[i] temp = rnorm(n) k_fold_idx = cut(temp, breaks = quantile(temp, seq(0, 1, length.out = K + 1)), include.lowest = TRUE, labels = FALSE) oos_residuals = array(NA, n) for (k in 1 : K){ index_test = which(k_fold_idx == k) index_train = setdiff(1 : n, index_test) x_train = x[index_train] y_train = y[index_train] Xytrain = data.frame(x = x_train, y = y_train) x_test = x[index_test] y_test = y[index_test] g_mod = lm(y ~ ., Xytrain) y_hat_g = predict(g_mod, data.frame(x = x_test)) oos_residuals[index_test] = y_test - y_hat_g } g_k_fold_cv_s_e = sum(abs(oos_residuals)) / n results = rbind(results, data.frame(s_e = g_k_fold_cv_s_e, K = paste0(K, "_cv"))) } #now estimate generalization error on g_final g_final_mod = lm(y ~ x) x_star = seq(from = xmin, to = xmax, length.out = Ntest) y_star = 2 + 3 * x_star^2 + rnorm(Ntest, 0, sigma_e_ignorance) y_hat_g_final_star = predict(g_final_mod, data.frame(x = x_star)) gen_error_true =
sum(abs(y_star - y_hat_g_final_star)) / Ntest results = rbind(results, data.frame(s_e = gen_error_true, K = "true")) } save(results, file = "K_dgp_results_K_fold.RData")  Let's take the average error over each simulated split and also its variability: {r} load("K_dgp_results_K_fold.RData") #don't worry about the following code... we will learn dplyr later... pacman::p_load(dplyr) results_summary = results %>% group_by(K) %>% summarize(avg_s_e = mean(s_e), s_s_e = sd(s_e), s_avg_s_e = sd(s_e) / sqrt(Nsim)) %>% arrange(as.numeric(K)) results_summary  Now let's see what the distributions look like for K = 2 {r} sim_plot = ggplot(results) + aes(x = s_e) + geom_density(aes(fill = factor(K)), alpha = 0.3) + xlim(0, NA) + geom_vline(data = results_summary, aes(xintercept = avg_s_e, color = factor(K)), size = 1) sim_plot  ## Piping Take a look at this one-liner: {r} set.seed(1984) mean(head(round(sample(rnorm(1000), 100), digits = 2)))  This is hard to read. Of course we can make it easier by using line breaks e.g. {r} mean( head( round( sample( rnorm(1000), 100 ), digits = 2 ) ) )  But it doesn't make it much easier to read. And it probably makes it harder to write. Enter an idea taken from unix / linux. Output of one function is input to next function. It is the inverse of the usual "order of operations". Let's see how this works. We first load the piping library: {r} pacman::p_load(magrittr)  The package is named after René Magritte, the Belgian surrealist artist, because he wrote ["Ceci n'est pas une pipe"](https://en.wikipedia.org/wiki/The_Treachery_of_Images) on a painting of a pipe. In pipe format this would look like: {r} set.seed(1984) rnorm(1000) %>% #the pipe operator sample(100) %>% round(digits = 2) %>% #the first argument is passed in automatically. head %>% mean  That's it! There's nothing more to it other than a gain in readability.
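As of R 4.1 there is also a native pipe `|>` built into the language which does much the same for simple chains (note it requires parentheses on function calls, unlike magrittr's bare `head` above):

```r
#the magrittr chain rewritten with base R's native pipe (R >= 4.1)
set.seed(1984)
rnorm(1000) |>
  sample(100) |>
  round(digits = 2) |>
  head() |>
  mean()
```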
Here's a cute joke based on this idea: https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/efficient-data-r/figures/basertidyverse.png What if we wanted to do something like mean(rnorm(1000) + 1)? This rnorm(1000) %>% +1 %>% mean doesn't work, I imagine, because the basic arithmetic operators can't be parsed normally inside a pipe. So they invented special pipe functions for this: {r} rnorm(1000) %>% add(1) %>% mean  There are other exceptions to the rule too which you'll figure out if you adopt the pipe. Unfortunately... the world at large hasn't completely accepted this as a way to write R. So feel free to use it for yourself. But be careful when using this style with others. There are places where everyone uses the pipes (we will see this when we get to dplyr). Also note the code you write with pipes will be slower than the normal syntax. # The Grammar of graphics and ggplot First load the package and the dataset of interest as a dataframe: {r} pacman::p_load(ggplot2, quantreg) cars = MASS::Cars93 #dataframe  ggplot is based on the "Grammar of Graphics", a concept invented by the Statistician / Computer Scientist Leland Wilkinson who worked on SPSS and Tableau, and now works at H2O, software that analyzes big data. The reference of interest is [here](http://papers.rgrossman.com/proc-094.pdf). He drew on ideas from John Tukey (one of the great statisticians of the previous generation) while he was at Bell Labs, Andreas Buja (one of my professors at Penn) and Jerome Friedman (the professor that taught my data mining course when I was in college at Stanford). It is a language that allows us to describe the components of a graphic. Previously, graphics were done in one shot and it was clunky. ggplot is a library written by Hadley Wickham based on this concept. Wickham is probably the most famous person in statistical computing today. He has commit rights in R and is one of the architects of RStudio.
He calls the grammar of graphics "graphical poems". Here are the basic components: * an underlying data frame * an "aesthetic" that maps visualization axes in the eventual plot(s) to variables in the data frame * a "layer" which is composed of - a geometric object - a statistical transformation - a position adjustment * a "scale" for each aesthetic * a "coordinate" system for each aesthetic * optional "facets" (more graphics) * optional "labels" for the title, axes title, points, etc. Don't worry - everything has "smart defaults" in Wickham's implementation so you don't have to worry about most things. We will explore some of the features below. Here's a good [cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf). ggplot is layered where each component is an object. The objects are added to each other since the "+" operator is overloaded to accept these additions. This is nice because each component can be saved and reused. The following initializes the graphics: {r} ggplot(cars)  Nothing happened - except the plot window went gray (the smart default). This is already rendering a graphic, but since it hasn't been given any of the required information, it has nothing to display. Next we create an aesthetic indicating a one-way plot (one variable only). {r} ggplot(cars) + aes(Price)  Notice how it can understand the variable name as an object name. Since we've given it an aesthetic object, it now knows which variable is the x axis (default). It already knows the ranges of the variable (a smart default) and a default scale and coordinate system (smart defaults). Usually this is done in one step by passing the aesthetic object into the ggplot call: {r} ggplot(cars, aes(Price))  Now we need to pick a layer by specifying a geometry. This is a type of plot.
Since price is continuous, let's pick the "histogram" using the geom_histogram function: {r} ggplot(cars, aes(Price)) + geom_histogram()  This can be customized: {r} ggplot(cars, aes(Price)) + geom_histogram(binwidth = 1, col = "darkred", fill = "blue", alpha = 0.4)  Want to save it for your LaTeX? {r} ggsave("plot.png") system("open plot.png") ggsave("plot.pdf") system("open plot.pdf")  Here are some other options besides the histogram: {r} ggplot(cars, aes(Price)) + geom_dotplot() ggplot(cars, aes(Price)) + geom_area(stat = "bin", binwidth = 2) ggplot(cars, aes(Price)) + geom_freqpoly() ggplot(cars, aes(Price)) + geom_density(fill = "green", alpha = 0.1) summary(cars)  Can we compare price based on different conditions? Yes, we can subset the data and use color and alpha: {r} ggplot(cars, aes(Price)) + geom_density(data = subset(cars, Man.trans.avail == "Yes"), col = "grey", fill = "darkgreen", alpha = 0.4) + geom_density(data = subset(cars, Man.trans.avail == "No"), col = "grey", fill = "red", alpha = 0.4)  Sidebar: why are cars that have manual transmissions available cheaper? We can look at this also using a histogram of the conditional distributions: {r} ggplot(cars, aes(Price)) + geom_histogram(data = subset(cars, Man.trans.avail == "Yes"), binwidth = 1, col = "grey", fill = "darkgreen", alpha = 0.4) + geom_histogram(data = subset(cars, Man.trans.avail == "No"), binwidth = 1, col = "grey", fill = "red", alpha = 0.4)  What if the variable is not continuous e.g. Cylinders? We can use a bar graph / bar plot. {r} ggplot(cars, aes(Cylinders)) + geom_bar()  This is essentially frequency by level of the categorical variable. Now let's move on to looking at one variable versus another variable. For example price by engine power: {r} ggplot(cars, aes(x = Horsepower, y = Price))  Since we've given it an aesthetic object, it now knows which variable is the x axis and which variable is the y axis.
It already knows the ranges of the variables (a smart default) and a default scale and coordinate system (smart defaults). Just as before, now we need to pick a layer by specifying a geometry. This is a type of plot. Let's pick the "scatterplot" using the geom_point function: {r} ggplot(cars, aes(x = Horsepower, y = Price)) + geom_point()  Now we have a nice scatterplot. This function uses the inherited data and the inherited aesthetic. Since this "geometry" is a "layer", we can pass in options to the layer. {r} base_and_aesthetics = ggplot(cars, aes(x = Horsepower, y = Price)) base_and_aesthetics + geom_point(col = "red", fill = "green", shape = 23, size = 3, alpha = 0.3)  Let's handle names of axes, title and ranges: {r} base_and_aesthetics_with_titles = base_and_aesthetics + ggtitle("Average Car Price vs. Engine Power", subtitle = "in the Cars93 dataset") + ylab("Price (in $1000's)") base_and_aesthetics_with_titles + geom_point() + xlim(0, 400) + ylim(0, 50)  Let's transform the variables: {r} base_and_aesthetics_with_titles + geom_point() + scale_x_continuous(trans = "log2")  Each unit increase on the x-axis now represents a doubling of x (although the whole scale only spans about 3 units). But look at how the grid didn't keep up. Let's fix this: {r} base_and_aesthetics_with_titles + geom_point() + scale_x_continuous(trans = "log2", breaks = round(seq(0, max(cars$Horsepower), length.out = 6)))  We can do the same to the y axis: {r} base_and_aesthetics_with_titles + geom_point() + scale_y_continuous(trans = "log10")+ scale_x_continuous(trans = "log10")  Let's look at some more geometries. {r} base_and_aesthetics_with_titles + geom_point() + geom_smooth()  Here, I've added two geometries on the same aesthetic! This attempts to explain the relationship $f(x)$ using smoothing. Let's go for more. {r} base_and_aesthetics_with_titles + geom_point() + geom_smooth() + geom_rug()  This allows us to also see the marginal distributions of the two variables.
{r} base_and_aesthetics_with_titles + geom_point() + geom_smooth() + geom_quantile(col = "red") + geom_rug()  geom_quantile fits quantile regression lines (by default the 0.25, 0.5 and 0.75 quantiles). We have *not* covered any statistics in this class yet (ironic!) ... so ignore how geom_smooth's uncertainty band is generated. Can we display more than two dimensions? Yes. Let's indicate a third dimension with shape (only works with factors). {r} base_and_aesthetics_with_titles = base_and_aesthetics_with_titles + ggtitle("Average Car Price by Power and Transmission") base_and_aesthetics_with_titles + geom_point(aes(shape = Man.trans.avail)) + geom_smooth() + geom_rug()  Can we display more than three dimensions? Yes. Let's indicate a fourth dimension with color. {r} base_and_aesthetics_with_titles = base_and_aesthetics_with_titles + ggtitle("Average Car Price by Power, Transmission & Drivetrain") base_and_aesthetics_with_titles + geom_point(aes(shape = Man.trans.avail, col = DriveTrain)) + geom_smooth() + geom_rug()  Can we go to a fifth dimension? Maybe? {r} base_and_aesthetics_with_titles = base_and_aesthetics_with_titles + ggtitle("Average Car Price by Power, Transmission & Drivetrain") base_and_aesthetics_with_titles + geom_point(aes(shape = Man.trans.avail, col = DriveTrain, size = Weight), alpha = 0.5) + #size? geom_smooth() + geom_rug()  A sixth? We can use text labels adjacent to the scatterplot's points (swapping size for alpha to encode Weight). {r} base_and_aesthetics_with_titles = base_and_aesthetics_with_titles + ggtitle("Average Car Price by Power, Transmission, Drivetrain, Weight & #Cylinders") base_and_aesthetics_with_titles + geom_point(aes(shape = Man.trans.avail, col = DriveTrain, alpha = Weight)) + #size? geom_text(aes(label = Cylinders), vjust = 1.5, col = "darkgrey", lineheight = 0.3, size = 3) + geom_smooth() + geom_rug()  Getting difficult to see what's going on.
Let's move away from the scatterplot to just density estimation: {r} base_and_aesthetics_with_titles = base_and_aesthetics_with_titles + ggtitle("Average Car Price by Power") #reset the title base_and_aesthetics_with_titles + geom_density2d()  Other alternatives: {r} base_and_aesthetics_with_titles + geom_bin2d(binwidth = c(8, 3)) pacman::p_load(hexbin) base_and_aesthetics_with_titles + geom_hex()  This is like a two-dimensional histogram where the bar / hexagon heights are seen with color. What if the x-axis is categorical, for example Cylinders versus price? Typical is the "box and whiskers" plot: {r} ggplot(cars, aes(x = Cylinders, y = Price)) + geom_boxplot()  Clear relationship! How about multiple subplots based on the subsetting we did in the histograms? This is called "faceting". Here are two bivariate visualizations laid horizontally: {r} ggplot(cars, aes(x = Horsepower, y = Price)) + geom_point() + geom_smooth() + facet_grid(. ~ Man.trans.avail)  Or alternatively, vertically: {r} ggplot(cars, aes(x = Horsepower, y = Price)) + geom_point() + geom_smooth() + facet_grid(Man.trans.avail ~ .)  And we can even double-subset: {r} ggplot(cars, aes(x = Horsepower, y = Price)) + geom_point() + facet_grid(Man.trans.avail ~ Origin)  And we can even triple-subset or more: {r} cars$MedWeight = ifelse(cars$Weight > median(cars$Weight), ">MedWeight", "<=MedWeight")  # Use Case (I) {r} cancer = ... %>% select(-ID) %>% #drop the useless ID column na.omit #drop all rows that are missing task = TaskClassif$new(id = "cancer", backend = cancer, target = "class")  We now create the learner. By default, SVM is not included, so we need to load an extension package. We ensure the SVM is linear like the one we studied in class (we don't have time to do nonlinear SVM's but it is basically an attempt to make the candidate space richer). {r} pacman::p_load(mlr3learners, e1071) learner = lrn("classif.svm") learner$param_set$values = list(kernel = "linear") learner$param_set$values$type = "C-classification" #unsure why I need this...
Now we create the inner loop where we try many different values of the hyperparameter via a grid search. This grid search functionality has been further decomposed in the new mlr3 package into a subpackage called mlr3tuning. {r} pacman::p_load(mlr3tuning) resampling = rsmp("holdout") search_space = ps(cost = p_dbl(lower = 0.0001, upper = 1)) M = 30 Kinner = 5 terminator = trm("evals", n_evals = Kinner) tuner = tnr("grid_search", resolution = M) measure = msr("classif.ce") #misclassification error at = AutoTuner$new(learner, resampling, measure, terminator, tuner, search_space)  Now we create the outer loop and execute: {r} Kouter = 3 resampling_outer = rsmp("cv", folds = Kouter) rr = resample(task = task, learner = at, resampling = resampling_outer)  Now we look at the results a bunch of different ways: {r} rr$score() rr$aggregate() rr$prediction()$confusion  That gives us the estimate of future performance. Now we create the model: {r} at$train(task) at$tuning_result  This gives us the final hyperparameter value. Then the final model can be trained: {r} learner$param_set$values = c(learner$param_set$values, list(cost = at$tuning_result$cost)) learner$train(task) g_final = learner$predict_newdata #this creates g_final internally and you can predict on it for x-vec* via: g_final(cancer[1, ])  # Use Case (II) Forward Stepwise Model Construction There are many types of such stepwise models. Here we will look at Forward Stepwise Linear models. "Forward" meaning we start with a low complexity model and end with a high complexity model, "Stepwise" meaning we do so iteratively, with each step consisting of one additional degree of freedom, i.e. one incremental increase in complexity, and "Linear" meaning that the model is linear. By default we use OLS. We will be using the diamonds data again as an example.
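Before diving into the diamonds demo, the greedy forward stepwise procedure can be sketched in a few lines of base R. This is a minimal sketch on simulated data (the matrix `X`, response `y` and helper `forward_stepwise` below are illustrative inventions, not part of the diamonds demo): at each step, add the not-yet-included column of the design matrix that minimizes the in-sample residual standard error.

```r
# A minimal sketch of greedy forward stepwise selection on simulated data.
set.seed(1)
n = 200
X = cbind(1, matrix(rnorm(n * 5), n, 5)) #col 1 = intercept, cols 2-6 = candidate features
y = 3 * X[, 3] + rnorm(n)                #only column 3 actually matters
forward_stepwise = function(X, y, steps){
  selected = c()
  for (s in 1 : steps){
    candidates = setdiff(1 : ncol(X), selected)
    ses = sapply(candidates, function(j){
      #refit with each candidate added and record the residual standard error
      sd(lm.fit(X[, c(selected, j), drop = FALSE], y)$residuals)
    })
    selected = c(selected, candidates[which.min(ses)]) #greedily keep the best
  }
  selected
}
forward_stepwise(X, y, steps = 3) #the first column picked should be 3
```

The real loop below does exactly this (plus oos tracking) on the diamonds model matrix.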
Let's make sure we have unordered factors to avoid issues later: {r} pacman::p_load(tidyverse, magrittr) diamonds$cut = factor(diamonds$cut, ordered = FALSE) diamonds$color = factor(diamonds$color, ordered = FALSE) diamonds$clarity = factor(diamonds$clarity, ordered = FALSE)  What we're doing will be highly computational, so let's take a random sample of the diamonds in $\mathbb{D}$ for training and testing: {r} Nsamp = 1300 set.seed(1984) train_indices = sample(1 : nrow(diamonds), Nsamp) diamonds_train = diamonds[train_indices, ] test_indices = sample(setdiff(1 : nrow(diamonds), train_indices), Nsamp) diamonds_test = diamonds[test_indices, ]  Let's build a model with all second-order interactions e.g. all things that look like depth x table x clarity or depth^2 x color or depth^3. {r} mod = lm(price ~ . * . * ., diamonds_train)  How many variables is this? And what does it look like? {r} length(coef(mod)) coef(mod)[1000 : 1100]  For features that are non-binary, it's p_non_binary^3 features. Binary features are more complicated because it's each level in feature A times each level in feature B. There are no squared or cube terms for binary features (since they're all the same i.e. $\{0,1\}^d = \{0,1\}$). Remember we likely overfit just using first order interactions? We'll certainly overfit using first-order interactions AND second order interactions. {r} summary(mod)$r.squared sd(summary(mod)$residuals)  Is that believable? Well... let's try it on another 1,300 observations we didn't see... {r} y_hat_test = predict(mod, diamonds_test) y_test = diamonds_test$price e_test = y_test - y_hat_test 1 - sum((e_test)^2) / sum((y_test - mean(y_test))^2) sd(e_test)  VERY negative oos $R^2$ --- why? What should that say about the relationship between $s_e$ and $s_y$? {r} sd(y_test) sd(e_test) / sd(y_test)  This is not only "overfitting"; it is an absolute trainwreck! This means you can do better using the null model (average of y) instead of this model.
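To see how oos $R^2$ goes negative in a sandbox, here is a minimal sketch on hypothetical simulated data (not the diamonds): fit pure-noise features with $p$ close to $n$ and the in-sample $R^2$ looks great while predictions on fresh data are worse than just using $\bar{y}$.

```r
# Minimal sketch (simulated data): overfitting drives oos R^2 negative.
set.seed(42)
n = 30; p = 25
Xy = data.frame(matrix(rnorm(n * p), n, p), y = rnorm(n)) #y is pure noise
mod = lm(y ~ ., Xy)
rsq_in = summary(mod)$r.squared #very high in-sample despite y being noise
Xy_test = data.frame(matrix(rnorm(n * p), n, p), y = rnorm(n))
e_test = Xy_test$y - predict(mod, Xy_test)
rsq_oos = 1 - sum(e_test^2) / sum((Xy_test$y - mean(Xy_test$y))^2)
c(rsq_in, rsq_oos) #oos R^2 is (almost surely) negative: s_e > s_y out of sample
```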
So let us employ stepwise to get a "good" model. We need our basis predictors to start with. How about the linear components of . * . * . --- there's nothing intrinsically wrong with that - it's probably a good basis for $f(x)$. Let's create the model matrix for both train and test: {r} Xmm_train = model.matrix(price ~ . * . * ., diamonds_train) y_train = diamonds_train$price p_plus_one = ncol(Xmm_train) p_plus_one Xmm_test = model.matrix(price ~ . * . * ., diamonds_test)  Now let's go through one by one and add the best one based on $s_e$ gain, i.e. the best new dimension to add to project as much of the vector $y$ as possible onto the column space. {r} predictor_by_iteration = c() #keep a growing list of predictors by iteration in_sample_ses_by_iteration = c() #keep a growing list of se's by iteration oos_ses_by_iteration = c() #keep a growing list of se's by iteration i = 1 repeat { #get all predictors left to try all_ses = array(NA, p_plus_one) #record all possibilities for (j_try in 1 : p_plus_one){ if (j_try %in% predictor_by_iteration){ next } Xmm_sub = Xmm_train[, c(predictor_by_iteration, j_try), drop = FALSE] all_ses[j_try] = sd(lm.fit(Xmm_sub, y_train)$residuals) #lm.fit so much faster than lm! } j_star = which.min(all_ses) predictor_by_iteration = c(predictor_by_iteration, j_star) in_sample_ses_by_iteration = c(in_sample_ses_by_iteration, all_ses[j_star]) #now let's look at oos Xmm_sub = Xmm_train[, predictor_by_iteration, drop = FALSE] mod = lm.fit(Xmm_sub, y_train) y_hat_test = Xmm_test[, predictor_by_iteration, drop = FALSE] %*% mod$coefficients oos_se = sd(y_test - y_hat_test) oos_ses_by_iteration = c(oos_ses_by_iteration, oos_se) cat("i = ", i, "in sample: se = ", all_ses[j_star], "oos_se", oos_se, "\n predictor added:", colnames(Xmm_train)[j_star], "\n") i = i + 1 if (i > Nsamp || i > p_plus_one){ break #why??
} }  Now let's look at our complexity curve: {r} simulation_results = data.frame( iteration = 1 : length(in_sample_ses_by_iteration), in_sample_ses_by_iteration = in_sample_ses_by_iteration, oos_ses_by_iteration = oos_ses_by_iteration ) pacman::p_load(latex2exp) ggplot(simulation_results) + geom_line(aes(x = iteration, y = in_sample_ses_by_iteration), col = "red") + geom_line(aes(x = iteration, y = oos_ses_by_iteration), col = "blue") + ylim(0, max(c(simulation_results$in_sample_ses_by_iteration, simulation_results$oos_ses_by_iteration))) + ylab(TeX("$s_e$"))  We can kind of see what the optimal model is above. If we want an exact procedure, we'd probably fit a separate smoothing regression to the oos results and analytically find the arg-minimum, $j^*$. That number will then be fed into the model matrix to create the right feature set and the final model will be produced with all the data. Or we can just stop as soon as oos error goes up. You can also obviously do CV within each iteration to stabilize this further (lab exercise). {r} p_opt = which.min(oos_ses_by_iteration) colnames(Xmm_train)[predictor_by_iteration[1 : p_opt]]  What is the "optimal model"? Can we honestly assess future performance now? No... why? Our test set was really our select set and we don't have a third test set (lab exercise). Inner and outer folding can be done too as we discussed. # C++ and R R goes back to 1995 when it was adapted from S (written in 1976 by John Chambers at Bell Labs) with minor modifications. The core of base R is written in C and Fortran. These two languages are among the fastest languages around (how to measure "fastest" is a huge debate). Thus, base R is very fast. For instance, the sort function is as fast as C/Fortran since it immediately calls compiled C/Fortran routines. However, R code itself that you write is "interpreted", which means it is not compiled ahead of time; it has to be translated to machine instructions on the fly, making it very slow.
Prior to v3.4 (April, 2017) it was even slower since the code wasn't JIT compiled. All this "real CS" stuff you can learn in another class. One notable place to observe this slowness relative to other languages is in looping. For example: {r} SIZE = 1e6 v = 1 : SIZE  Take for example a simple function that computes square roots on each element {r} sqrt_vector = function(v){ v_new = array(NA, length(v)) for (i in 1 : length(v)){ v_new[i] = sqrt(v[i]) } v_new }  How fast does this run? Let's use a cool package called microbenchmark that allows us to do an operation many times and see how long it takes each time to get an average: {r} pacman::p_load(microbenchmark) microbenchmark( sqrt_vector(v), times = 10 )  Does the apply family help? Note that apply itself requires a matrix; on a plain vector we use sapply, which applies the function element-by-element: {r} microbenchmark( sapply(v, sqrt), times = 10 )  So it doesn't help... it hurts A LOT. Why? sapply still invokes the R function once per element. Be careful with the apply family! How much faster in C++ should this be? Enter the Rcpp package - a way to compile little bits (or lotta bits) of C++ on the fly. {r} pacman::p_load(Rcpp)  Let's write this for loop function to sqrt-ize in C++. We then compile it and then save it into our namespace to be called like a regular function. Note that we use C++ classes that are not part of standard C++ e.g. "NumericVector". Rcpp comes built in with classes that are interoperable with R. It's not hard to learn, just takes a small dive into the documentation. {r} cppFunction(' NumericVector sqrt_vector_cpp(NumericVector v) { int n = v.size(); NumericVector v_new(n); for (int i = 0; i < n; i++) { //indices from 0...n-1 not 1...n! v_new[i] = sqrt(v[i]); } return v_new; } ')  What do these two functions look like? {r} sqrt_vector sqrt_vector_cpp  The first one shows the R code and then says it is bytecode-compiled which means there are speedups used in R (go to an advanced CS class) but we will see these speedups aren't so speedy!
The other just says we .Call some C++ function at a certain address (pointer) with the argument to be inputted. What is the gain in runtime? {r} microbenchmark( sqrt_vector_cpp(v), times = 10 )  WOW. 10x!!! Can't beat that with a stick... Let's do a not-so-contrived example... Matrix distance... Let's compute the distances of all pairs of rows in a dataset. I will try to code the R as efficiently as possible by using vector subtraction so there are only two for loops. The C++ function will have an additional loop to iterate over the features in the observations. {r} #a subset of the diamonds data SIZE = 1000 X_diamonds = as.matrix(ggplot2::diamonds[1 : SIZE, c("carat", "depth", "table", "x", "y", "z")]) compute_distance_matrix = function(X){ n = nrow(X) D = matrix(NA, n, n) for (i_1 in 1 : (n - 1)){ for (i_2 in (i_1 + 1) : n){ D[i_1, i_2] = sqrt(sum((X[i_1, ] - X[i_2, ])^2)) } } D } cppFunction(' NumericMatrix compute_distance_matrix_cpp(NumericMatrix X) { int n = X.nrow(); int p = X.ncol(); NumericMatrix D(n, n); std::fill(D.begin(), D.end(), NA_REAL); for (int i_1 = 0; i_1 < (n - 1); i_1++){ //Rcout << "computing for row #: " << (i_1 + 1) << "\\n"; for (int i_2 = i_1 + 1; i_2 < n; i_2++){ double sqd_diff = 0; for (int j = 0; j < p; j++){ sqd_diff += pow(X(i_1, j) - X(i_2, j), 2); //by default the cmath library in std is loaded } D(i_1, i_2) = sqrt(sqd_diff); //by default the cmath library in std is loaded } } return D; } ')  {r} microbenchmark( {D = compute_distance_matrix(X_diamonds)}, times = 10 ) round(D[1 : 5, 1 : 5], 2)  Slow... {r} microbenchmark( {D = compute_distance_matrix_cpp(X_diamonds)}, times = 10 ) round(D[1 : 5, 1 : 5], 2)  Absolutely lightning... ~200x faster on my laptop than R's runtime. Writing functions as strings that compile is annoying. It is better to have separate files. For instance...
{r} sourceCpp("distance_matrix.cpp")  Here is a list of the data structures in Rcpp: https://teuder.github.io/rcpp4everyone_en/070_data_types.html#vector-and-matrix Another place where C++ pays the rent is recursion. Here is a quicksort implementation in R taken from somewhere on the internet (note: it mishandles ties, but our rnorm draws below are all distinct). {r} quicksort_R <- function(arr) { # Pick a number at random. mid <- sample(arr, 1) # Place-holders for left and right values. left <- c() right <- c() # Move all the smaller values to the left, bigger values to the right. lapply(arr[arr != mid], function(d) { if (d < mid) { left <<- c(left, d) } else { right <<- c(right, d) } }) if (length(left) > 1) { left <- quicksort_R(left) } if (length(right) > 1) { right <- quicksort_R(right) } # Finally, return the sorted values. c(left, mid, right) }  Let's create a random array to test these sorts on: {r} n = 10000 x = rnorm(n)  Let's profile the pure R sort function: {r} microbenchmark( x_sorted_pure_R = quicksort_R(x), times = 10 )  Let's profile R's sort function. {r} microbenchmark( x_sorted_base_R = sort(x), times = 10 )  Let's just ensure our method worked... {r} x_sorted_pure_R = quicksort_R(x) x_sorted_base_R = sort(x) pacman::p_load(testthat) expect_equal(x_sorted_pure_R, x_sorted_base_R)  Base R's sort is basically infinitely faster. Let's make our own C++ implementation. {r} sourceCpp("quicksort.cpp")  and profile it: {r} microbenchmark( x_sorted_cpp = quicksort_cpp(x), times = 10 )  Let's just ensure this method worked... {r} pacman::p_load(testthat) expect_equal(x_sorted_cpp, x_sorted_base_R)  Why is our C++ slower than sort? Because sort is also in C++ or Fortran and it's likely been optimized and reoptimized up the wazoo for decades. Also, Rcpp's data structures may be slower than base R's data structures. There may be some speed lost to translating to NumericVector from double[] or something like that. Can you call R from Rcpp?
You bet: {r} cppFunction(' NumericVector rnorm_cpp_R(int n, double mean, double sd){ // get a pointer to R\'s rnorm() function Function f("rnorm"); // Next code is interpreted as rnorm(n, mean, sd) return f(n, Named("sd")=sd, _["mean"]=mean); } ') rnorm_cpp_R(5, 1, .01)  A few math functions are implemented for you already: {r} evalCpp('R::qnorm(0.5, 0, 1, 1, 0)') evalCpp('R::qnorm(0.5, 0, 1)') #BOOM  Further, there are many common functions that are already wrapped for you via "Rcpp-sugar" which was the Rcpp author's attempt to make Rcpp a whole lot easier, see [here](http://dirk.eddelbuettel.com/code/rcpp/Rcpp-sugar.pdf). {r} evalCpp('rnorm(10, 100, 3)')  If you want blazing fast linear algebra, check out package RcppArmadillo which is a wrapper around Armadillo (namespace is "arma" in the code), an optimized, Apache-licensed linear algebra library in C++. Here is an example taken from [here](https://scholar.princeton.edu/sites/default/files/q-aps/files/slides_day4_am.pdf). It involves solving for b-vec in a standard OLS. {r} pacman::p_load(RcppArmadillo) cppFunction(' arma::mat ols_cpp(arma::mat X, arma::mat y){ arma::mat Xt = X.t(); return solve(Xt * X, Xt * y); } ', depends = "RcppArmadillo") n = 500 Xy = data.frame(int = rep(1, n), x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), y = rnorm(n)) X = as.matrix(Xy[, 1 : 4]) y = as.matrix(Xy[, 5]) #does the function work? expect_equal(as.numeric(ols_cpp(X, y)), as.numeric(solve(t(X) %*% X) %*% t(X) %*% y))  Now how fast is it? {r} microbenchmark( R_via_lm = lm(y ~ 0 + ., data = Xy), R_matrix_multiplication = solve(t(X) %*% X) %*% t(X) %*% y, cpp_with_armadillo = ols_cpp(X, y), times = 100 )  About 4x faster than R's optimized linear algebra routines. Supposedly it can go even faster if you enable parallelization within Armadillo. I couldn't get that demo to work... Note lm is slow because it does all sorts of other stuff besides computing b-vec e.g.
builds the model matrix, computes Rsq, computes residuals, does statistical testing, etc... Here are the places where Rcpp is recommended to be used (from https://teuder.github.io/rcpp4everyone_en/010_Rcpp_merit.html) * Loop operations in which later iterations depend on previous iterations. * Accessing each element of a vector/matrix. * Recurrent function calls within loops. * Changing the size of vectors dynamically. * Operations that need advanced data structures and algorithms (we don't do this in this class). # Java and R We just did C++ with R. Is there a bridge to Java? Yes (and there are bridges to many other languages too). Java and R can speak to each other through proper configuration of the rJava package. You need to have a full JDK of Java installed on your computer and have its binary executables in the proper path. This demo will be in Java JDK 8 (released in 2014 and not officially supported after 2020) since I haven't tested on the more modern Java JDK's yet. We first install rJava if necessary: {r} if (!pacman::p_isinstalled(rJava)){ pacman::p_load(pkgbuild) if (pkgbuild::check_build_tools()){ install.packages("rJava", type = "source") } install.packages("rJava") }  Now we load the package. Before we do, we set the JVM to have 4G of RAM. After we load it, we initialize the JVM. This should print out nothing or "0" to indicate success. {r} options(java.parameters = "-Xmx4000m") pacman::p_load(rJava) .jinit() #this initializes the JVM in the background and if this runs with no issues nor output, you probably have rJava installed and connected to the JDK properly.  Just like the whole Rcpp demo, we can do a whole demo with rJava, but we won't.
Here's just an example of creating a Java object and running a method on it: {r} java_double = .jnew("java/lang/Double", 3.1415) java_double class(java_double) .jclass(java_double) #call an instance method .jcall(java_double, "I", "intValue") #java_double.intValue(); #call a static method J("java/lang/String", "valueOf", java_double)  A note on rJava vs Rcpp. * If you're doing quick and dirty fast functions for loops and recursion, do it in Rcpp since there is lower overhead of programming. * If you are programming more full-featured software, go with rJava. * Also, if you need full-featured parallelized execution and threading control e.g. thread pooling and the ease of debugging, my personal opinion is that rJava is easier to get working with fewer dependencies. My experience is that the Rcpp threading libraries just aren't there yet and neither are OpenMP directives within Rcpp. * Further, the JVM is fully asynchronous which means it runs completely independently of R. What this means is that you can execute something in Java, Java can "thread it off" and return you to the R prompt with a pointer to the object that houses its execution. You can then query the object. We will see demos of this. # Regression Trees Let's fit a regression tree. We will use the development package YARF which I've been hacking on now for a few years. The package internals are written in Java, which is why we needed the rJava setup above.
Since YARF is not on CRAN, we install the package from my github including its dependency (if necessary) and then load it: {r} if (!pacman::p_isinstalled(YARF)){ pacman::p_install_gh("kapelner/YARF/YARFJARs", ref = "dev") pacman::p_install_gh("kapelner/YARF/YARF", ref = "dev", force = TRUE) } options(java.parameters = "-Xmx4000m") pacman::p_load(YARF)  The data we will be fitting with the regression tree is a sine curve plus noise: {r} pacman::p_load(tidyverse, magrittr) n = 500 x_max = 4 * pi x = runif(n, 0, x_max) sigma = 0.05 y = ifelse(x < 2 * pi, 0.5 * sin(x), 0) + rnorm(n, 0, sigma) ggplot(data.frame(x = x, y = y), aes(x, y)) + geom_point(lwd = 0.6)  Now we fit a regression tree to this model. Never mind the calculate_oob_error argument for now; it will become clear in a few classes why FALSE is NOT the default. {r} tree_mod = YARFCART(data.frame(x = x), y, calculate_oob_error = FALSE)  How "big" is this tree model? {r} get_tree_num_nodes_leaves_max_depths(tree_mod)  What are the "main" splits? {r} illustrate_trees(tree_mod, max_depth = 4, open_file = TRUE)  What does $g(x)$ look like? {r} Nres = 1000 x_predict = data.frame(x = seq(0, x_max, length.out = Nres)) g = predict(tree_mod, x_predict) ggplot(data.frame(x = x, y = y), aes(x, y)) + geom_point(lwd = 0.6) + geom_point(aes(x, y), data.frame(x = x_predict, y = g), col = "blue")  Obviously overfit - but not that bad... let's try lowering the complexity by stopping the tree construction at a higher nodesize. {r} tree_mod = YARFCART(data.frame(x = x), y, nodesize = 50, calculate_oob_error = FALSE) yhat = predict(tree_mod, x_predict) ggplot(data.frame(x = x, y = y), aes(x, y)) + geom_point(lwd = 0.6) + geom_point(aes(x, y), data.frame(x = x_predict, y = yhat), col = "blue")  Less overfitting now but now it's clearly underfit! We can play with the nodesize. Or we can use the model selection algorithm to pick the model (the nodesize). Let's ensure nodesize = 1 gives us perfect overfitting.
{r} tree_mod = YARFCART(data.frame(x = x), y, nodesize = 1, calculate_oob_error = FALSE) yhat = predict(tree_mod, data.frame(x = x)) ggplot(data.frame(x = x, y = y), aes(x, y)) + geom_point(lwd = 0.6) + geom_point(aes(x, y), data.frame(x = x, y = yhat), col = "blue") mean((y - yhat)^2)  Are we sure we have a leaf node for each observation? {r} get_tree_num_nodes_leaves_max_depths(tree_mod)  Let's ensure nodesize > n gives us perfect underfitting. {r} tree_mod = YARFCART(data.frame(x = x), y, nodesize = n + 1, calculate_oob_error = FALSE) yhat = predict(tree_mod, data.frame(x = x)) ggplot(data.frame(x = x, y = y), aes(x, y)) + geom_point(lwd = 0.6) + geom_point(aes(x, y), data.frame(x = x, y = yhat), col = "blue") unique(yhat) mean(y)  Are we sure we have one root node? {r} get_tree_num_nodes_leaves_max_depths(tree_mod)  Yes. Let's try this using oos validation and trace out a performance curve to find the optimal hyperparameter. {r} K = 4 set.seed(1984) select_idx = sample(1 : n, round(1 / K * n)) x_select = x[select_idx] y_select = y[select_idx] train_idx = setdiff(1 : n, select_idx) x_train = x[train_idx] y_train = y[train_idx] n_train = length(train_idx) max_nodesize_to_try = 70 in_sample_errors = array(NA, max_nodesize_to_try) oos_errors = array(NA, max_nodesize_to_try) for (i in 1 : max_nodesize_to_try){ tree_mod = YARFCART(data.frame(x = x_train), y_train, nodesize = i, calculate_oob_error = FALSE) yhat = predict(tree_mod, data.frame(x = x_train)) in_sample_errors[i] = sd(y_train - yhat) yhat_select = predict(tree_mod, data.frame(x = x_select)) oos_errors[i] = sd(y_select - yhat_select) } ggplot(data.frame(nodesize = 1: max_nodesize_to_try, in_sample_errors = in_sample_errors, oos_errors = oos_errors)) + geom_point(aes(nodesize, in_sample_errors), col = "red") + geom_point(aes(nodesize, oos_errors), col = "blue")  For some reason, we do not see serious overfitting. 
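If YARF proves hard to install, the same nodesize experiment can be sketched with the recommended package rpart that ships with R. This is an assumption-laden sketch: rpart's minbucket plays the role of YARF's nodesize, and we set cp = 0 and minsplit = 2 to disable its other stopping rules.

```r
# Sketch of the nodesize complexity-curve idea using rpart in place of YARF.
# Assumption: minbucket ~ nodesize; cp = 0 and minsplit = 2 turn off other stopping rules.
library(rpart)
set.seed(1984)
n = 500
x = runif(n, 0, 4 * pi)
y = ifelse(x < 2 * pi, 0.5 * sin(x), 0) + rnorm(n, 0, 0.05)
D = data.frame(x = x, y = y)
select_idx = sample(1 : n, round(n / 4)) #hold out a select set
nodesizes = c(1, 5, 10, 25, 50)
oos_errors = sapply(nodesizes, function(ns){
  mod = rpart(y ~ x, D[-select_idx, ],
              control = rpart.control(minbucket = ns, minsplit = 2, cp = 0))
  sd(D$y[select_idx] - predict(mod, D[select_idx, ])) #oos s_e on the select set
})
round(oos_errors, 3) #trace the oos performance over candidate nodesizes
```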
# Regression Trees with Real Data Now let's look at a regression tree model predicting medv in the Boston Housing data. We first load the data and do a training-test split: {r} set.seed(1984) pacman::p_load(MASS) data(Boston) test_prop = 0.1 train_indices = sample(1 : nrow(Boston), round((1 - test_prop) * nrow(Boston))) Boston_train = Boston[train_indices, ] y_train = Boston_train$medv X_train = Boston_train X_train$medv = NULL n_train = nrow(X_train)  And fit a tree model. The default value of the hyperparameter, the nodesize, is $N_0 = 5$. {r} options(java.parameters = "-Xmx4000m") pacman::p_load(YARF) tree_mod = YARFCART(X_train, y_train, calculate_oob_error = FALSE)  What does the in-sample fit look like? {r} y_hat_train = predict(tree_mod, X_train) e = y_train - y_hat_train sd(e) 1 - sd(e) / sd(y_train)  Recall the linear model: {r} linear_mod = lm(medv ~ ., Boston_train) sd(y_train - linear_mod$fitted.values) summary(linear_mod)$r.squared  The tree seems to win in-sample. Why? Is this a "fair" comparison? Before we address this, let's illustrate the tree. {r} illustrate_trees(tree_mod, max_depth = 4, open_file = TRUE) get_tree_num_nodes_leaves_max_depths(tree_mod)  Let's make the comparison fair by seeing what happens oos. {r} test_indices = setdiff(1 : nrow(Boston), train_indices) Boston_test = Boston[test_indices, ] y_test = Boston_test$medv X_test = Boston_test X_test$medv = NULL  For the tree: {r} y_hat_test_tree = predict(tree_mod, X_test) e = y_test - y_hat_test_tree sd(e) 1 - sd(e) / sd(y_test)  For the linear model: {r} y_hat_test_linear = predict(linear_mod, Boston_test) e = y_test - y_hat_test_linear sd(e) 1 - sd(e) / sd(y_test)  The take-home message here is that the tree beats the linear model in future predictive performance but the only way to be truly convinced of this is to do the split over and over to get a sense of the average over the massive variability (like the previous demo) or to do CV to reduce the error of the estimate.
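That CV idea can be sketched directly (a sketch only: rpart stands in for YARF so it is self-contained, and the K = 5 and the helper `cv_oos_se` are choices of this sketch, not from class):

```r
# K-fold CV sketch comparing a regression tree (rpart standing in for YARF)
# to the linear model on the Boston data.
library(MASS)  #provides the Boston data frame
library(rpart)
set.seed(1984)
K = 5
folds = sample(rep(1 : K, length.out = nrow(Boston))) #random fold assignment
cv_oos_se = function(fit_fun){ #fit_fun: training data.frame -> fitted model
  mean(sapply(1 : K, function(k){
    mod = fit_fun(Boston[folds != k, ])
    sd(Boston$medv[folds == k] - predict(mod, Boston[folds == k, ])) #oos s_e on fold k
  }))
}
cv_oos_se(function(d) rpart(medv ~ ., d)) #tree
cv_oos_se(function(d) lm(medv ~ ., d))    #linear model
```

Averaging the oos $s_e$ over the K folds reduces the variability of the single-split comparison above.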
Why does the regression tree beat the linear model? Let's see what's going on in the tree. {r} get_tree_num_nodes_leaves_max_depths(tree_mod)  About how many observations are in each leaf? {r} nrow(Boston_train) / get_tree_num_nodes_leaves_max_depths(tree_mod)$num_leaves  That's a very flexible model. Let's see overfitting in action. Let's set nodesize to be one. {r} tree_mod = YARFCART(X_train, y_train, nodesize = 1, calculate_oob_error = FALSE) get_tree_num_nodes_leaves_max_depths(tree_mod) nrow(Boston_train) / get_tree_num_nodes_leaves_max_depths(tree_mod)$num_leaves  Why is it not exactly 1 on average? I think it's because there are duplicate y values, and a node whose observations all share the same y cannot be split usefully: {r} data.table::uniqueN(y_train) length(y_train)  Regardless of this point, this model is essentially giving each observation its own y-hat, its own personal guess, which will be its own personal y. Just like linear modeling when $n = p + 1$ and nearest neighbors when $K = 1$. Let's see how bad the overfitting is: {r} y_hat_train = predict(tree_mod, X_train) e = y_train - y_hat_train sd(e) 1 - sd(e) / sd(y_train)  This is the expected behavior in perfect fitting. {r} y_hat_test_tree = predict(tree_mod, X_test) e = y_test - y_hat_test_tree sd(e) 1 - sd(e) / sd(y_test)  It overfits but amazingly it doesn't get clobbered completely! And its results are on-par with the non-overfit linear model probably because it made up for the overfitting by reducing misspecification error. Trees are truly amazing!
# Classification Trees Let's get the cancer biopsy data: {r} rm(list = ls()) pacman::p_load(YARF, tidyverse, magrittr) data(biopsy, package = "MASS") biopsy %<>% na.omit %>% dplyr::select(-ID) #for some reason the "select" function is scoping elsewhere without this explicit directive colnames(biopsy) = c( "clump_thickness", "cell_size_uniformity", "cell_shape_uniformity", "marginal_adhesion", "epithelial_cell_size", "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitoses", "class" )  Let's do a training-test split to keep things honest: {r} test_prop = 0.1 train_indices = sample(1 : nrow(biopsy), round((1 - test_prop) * nrow(biopsy))) biopsy_train = biopsy[train_indices, ] y_train = biopsy_train$class X_train = biopsy_train X_train$class = NULL n_train = nrow(X_train) test_indices = setdiff(1 : nrow(biopsy), train_indices) biopsy_test = biopsy[test_indices, ] y_test = biopsy_test$class X_test = biopsy_test X_test$class = NULL  Let's fit a tree: {r} tree_mod = YARFCART(X_train, y_train, calculate_oob_error = FALSE) get_tree_num_nodes_leaves_max_depths(tree_mod) nrow(biopsy_train) / get_tree_num_nodes_leaves_max_depths(tree_mod)$num_leaves  Why would the average observations per node be larger than the nodesize, which is 1? {r} illustrate_trees(tree_mod, max_depth = 5, length_in_px_per_half_split = 30, open_file = TRUE)  How are we doing in-sample? {r} y_hat_train = predict(tree_mod, X_train) mean(y_train != y_hat_train)  Why is this? {r} ?YARFCART  That's the default for the classification tree algorithm. Don't be lazy: you probably should use a $\mathbb{D}_{\text{select}}$ and optimize over nodesize!!! Out of sample? {r} y_hat_test = predict(tree_mod, X_test) mean(y_test != y_hat_test)  It appears we overfit. But this is still pretty good! Now let's take a look at the linear SVM model. Let's use the default cost (we should really CV over the cost parameter if we weren't lazy).
{r} pacman::p_load(e1071) svm_model = svm(X_train, y_train, kernel = "linear") svm_model  A couple of points: * It reached max iterations trying to minimize the hinge loss. Seems like there are computational issues here. * Note that we are relying on the default $\lambda$ hyperparameter value for the hinge loss. On the homework, you will answer the question we never answered: how should the value of the hyperparameter be chosen? Regardless, how did it do in-sample? {r} y_hat_train = predict(svm_model, X_train) mean(y_train != y_hat_train)  Out of sample? {r} y_hat_test = predict(svm_model, X_test) mean(y_test != y_hat_test)  Maybe the problem truly was linearly separable? Meaning, you don't get any added benefit from the tree if there are no interactions or non-linearities. Let's try a harder dataset. First, get a bunch of datasets from the UCI repository: {r} rm(list = ls()) pacman::p_load_gh("coatless/ucidata") data(adult) adult = na.omit(adult) #kill any observations with missingness ?adult  Let's use samples of 2,000 to run experiments: {r} set.seed(1984) test_size = 2000 train_indices = sample(1 : nrow(adult), test_size) adult_train = adult[train_indices, ] y_train = adult_train$income X_train = adult_train X_train$income = NULL n_train = nrow(X_train) test_indices = sample(setdiff(1 : nrow(adult), train_indices), test_size) adult_test = adult[test_indices, ] y_test = adult_test$income X_test = adult_test X_test$income = NULL  Make a tree: {r} tree_mod = YARFCART(X_train, y_train, calculate_oob_error = FALSE) get_tree_num_nodes_leaves_max_depths(tree_mod) nrow(adult_train) / get_tree_num_nodes_leaves_max_depths(tree_mod)$num_leaves illustrate_trees(tree_mod, max_depth = 5, length_in_px_per_half_split = 30, open_file = TRUE)  In-sample? {r} y_hat_train = predict(tree_mod, X_train) mean(y_train != y_hat_train)  Out of sample? {r} y_hat_test = predict(tree_mod, X_test) mean(y_test != y_hat_test)  The warning was legit this time. What's it saying? Looks like we overfit quite a bit!
That's what nodesize of 1 does! Why is it the default? People found that even though it overfits, you still get good performance (as we've seen even with regression). I doubt people still use CART in production models that require best possible accuracy these days since this issue of overfitting was fixed with bagging (we will get to this soon) and Random Forests (we'll get to that too). Let's see how the linear SVM does. Warning: this takes a while to compute: {r} svm_model = svm(model.matrix(~ ., X_train), y_train, kernel = "linear")  In-sample? {r} y_hat_train = predict(svm_model, model.matrix(~ ., X_train)) mean(y_train != y_hat_train)  Out of sample? {r} y_hat_test = predict(svm_model, model.matrix(~ ., X_test)) mean(y_test != y_hat_test)  It seems (at least when I ran it at home), the linear SVM does much worse. Likely there are a lot of interactions in this dataset that the linear SVM must ignore because its $\mathcal{H}$ candidate set is so limited! Note: the SVM train error is approximately equal to the SVM test error. Why? That's a usual scenario during underfitting in high-$n$ situations. There is no estimation error - only misspecification error and error due to ignorance. And those are the same among the training and test set. # Bias-Variance Decomposition of Generalization Error Let's try to fit a quadratic $f$ with a linear model and examine the bias-variance tradeoff.
{r} rm(list = ls()) xmin = 0 xmax = 5 n_train = 20 n_test = 1000 sigma = 1 ###### this squared is the irreducible error component through all of the bias-variance decomp formulas f = function(x){x^2} #the number of datasets generated (separate universes) and thus the number of models fitted in different universes Nsim = 1000 training_gs = matrix(NA, nrow = Nsim, ncol = 2) x_trains = matrix(NA, nrow = Nsim, ncol = n_train) y_trains = matrix(NA, nrow = Nsim, ncol = n_train) all_oos_residuals = matrix(NA, nrow = Nsim, ncol = n_test) for (nsim in 1 : Nsim){ #simulate dataset$\mathbb{D}$x_train = runif(n_train, xmin, xmax) delta_train = rnorm(n_train, 0, sigma) #Assumption I: mean zero and Assumption II: homoskedastic y_train = f(x_train) + delta_train x_trains[nsim, ] = x_train y_trains[nsim, ] = y_train #fit a model g | x's, delta's and save it g_model = lm(y_train ~ ., data.frame(x = x_train)) training_gs[nsim, ] = coef(g_model) #generate oos dataset according to the same data generating process (DGP) x_test = runif(n_test, xmin, xmax) delta_test = rnorm(n_test, 0, sigma) y_test = f(x_test) + delta_test #predict oos using the model and save the oos residuals y_hat_test = predict(g_model, data.frame(x = x_test)) all_oos_residuals[nsim, ] = y_test - y_hat_test }  Take a look at the irreducible error for one dataset: {r} pacman::p_load(ggplot2) resolution = 10000 #how many xstars to test x = seq(xmin, xmax, length.out = resolution) f_x_df = data.frame(x = x, f = f(x)) ggplot(f_x_df, aes(x, f)) + geom_line(col = "green") + geom_point(aes(x, y), data = data.frame(x = x_trains[1, ], y = y_trains[1, ]))  There is no way to fit those deviations from the green line using known information. That is the "irreducible error" (from ignorance). There is dataset to dataset variation (different universes) in these deltas. 
Graph a few: {r} ggplot(f_x_df, aes(x, f)) + geom_line(col = "green") + geom_point(aes(x, y), data = data.frame(x = x_trains[1, ], y = y_trains[1, ]), col = "blue") + geom_point(aes(x, y), data = data.frame(x = x_trains[2, ], y = y_trains[2, ]), col = "darkgreen") + geom_point(aes(x, y), data = data.frame(x = x_trains[3, ], y = y_trains[3, ]), col = "red")  The blue dataset is one possible $\mathbb{D}$, the green dataset is another possible $\mathbb{D}$ and the red dataset is another possible $\mathbb{D}$. This illustrates "dataset-to-dataset variability". Take a look at the mse, which averages over (1) all datasets $\mathbb{D}$ and (2) all xstars in the set [0,5]. This should equal the sum of three terms: (a) the irreducible error plus (b) the bias squared plus (c) the variance. {r} mse = mean(c(all_oos_residuals)^2) mse  Let's visualize the bias: {r} g_average = colMeans(training_gs) ggplot(f_x_df, aes(x, f)) + geom_line(col = "green") + geom_abline(intercept = g_average[1], slope = g_average[2], col = "red") + ylim(-3, 25)  What is the average bias of $g$? {r} x = seq(xmin, xmax, length.out = resolution) g_avg_x = g_average[1] + g_average[2] * x biases = f(x) - g_avg_x expe_bias_g_sq = mean(biases^2) expe_bias_g_sq  What is the variance?
Let's look at all lines: {r} plot_obj = ggplot() + xlim(xmin, xmax) + ylim(xmin^2, xmax^2) for (nsim in 1 : min(Nsim, 100)){ #otherwise takes too long plot_obj = plot_obj + geom_abline(intercept = training_gs[nsim, 1], slope = training_gs[nsim, 2], col = "blue") } plot_obj + geom_abline(intercept = g_average[1], slope = g_average[2], col = "red", lwd = 2) # geom_line(data = f_x_df, aes(x, f), col = "green", size = 1)  Now calculate this average variance: {r} x = seq(xmin, xmax, length.out = resolution) expe_g_x = g_average[1] + g_average[2] * x var_x_s = array(NA, Nsim) for (nsim in 1 : Nsim){ g_x = training_gs[nsim, 1] + training_gs[nsim, 2] * x var_x_s[nsim] = mean((g_x - expe_g_x)^2) } expe_var_g = mean(var_x_s) expe_var_g  Now check the equivalence {r} mse sigma^2 expe_bias_g_sq expe_var_g sigma^2 + expe_bias_g_sq + expe_var_g  This is not exactly equal due to numerical error. If we increase resolution and Nsim, it will be equal. Let's try the whole thing again using a quadratic regression! {r} training_gs = matrix(NA, nrow = Nsim, ncol = 3) all_residuals = matrix(NA, nrow = Nsim, ncol = n_test) for (nsim in 1 : Nsim){ #simulate dataset$\mathbb{D}$x_train = runif(n_train, xmin, xmax) delta_train = rnorm(n_train, 0, sigma) y_train = f(x_train) + delta_train #fit a model g | x's, delta's and save it g_model = lm(y_train ~ poly(x, 2, raw = TRUE), data.frame(x = x_train)) training_gs[nsim, ] = coef(g_model) #generate oos dataset and save residuals on oos data x_test = runif(n_test, xmin, xmax) delta_test = rnorm(n_test, 0, sigma) y_test = f(x_test) + delta_test y_hat_test = predict(g_model, data.frame(x = x_test)) all_residuals[nsim, ] = y_test - y_hat_test }  Take a look at the mse: {r} mse = mean(c(all_residuals)^2) mse  Much lower! Why? Bias went down. 
Let's visualize the bias {r} g_average = colMeans(training_gs) x = seq(xmin, xmax, length.out = resolution) ggplot(f_x_df, aes(x, f)) + geom_line(col = "green") + stat_function(fun = function(x){g_average[1] + g_average[2] * x + g_average[3] * x^2}, col = "red")  Not much! What is the average bias of$g$? {r} x = seq(xmin, xmax, length.out = resolution) g_avg_x = g_average[1] + g_average[2] * x + g_average[3] * x^2 f_x = x^2 biases = f_x - g_avg_x expe_bias_g_sq = mean(biases^2) expe_bias_g_sq  What is the variance? Let's look at all lines: {r} plot_obj = ggplot(data.frame(x = x)) + xlim(xmin, xmax) + ylim(xmin^2, xmax^2) for (nsim in 1 : min(Nsim, 50)){ #otherwise takes too long plot_obj = plot_obj + geom_line(data = data.frame(x = x, y = training_gs[nsim, 1] + training_gs[nsim, 2] * x + training_gs[nsim, 3] * x^2), mapping = aes(x, y), col = "blue") } plot_obj + stat_function(fun = function(x){g_average[1] + g_average[2] * x + g_average[3] * x^2}, col = "red", lwd = 2) #+ #geom_line(data = f_x_df, aes(x, f), col = "green", size = 0.5)  Now calculate this average variance: {r} x = seq(xmin, xmax, length.out = resolution) expe_g_x = g_average[1] + g_average[2] * x + g_average[3] * x^2 var_x_s = array(NA, Nsim) for (nsim in 1 : Nsim){ g_x = training_gs[nsim, 1] + training_gs[nsim, 2] * x + training_gs[nsim, 3] * x^2 var_x_s[nsim] = mean((g_x - expe_g_x)^2) } expe_var_g = mean(var_x_s) expe_var_g  Now check the equivalence {r} mse sigma^2 expe_bias_g_sq expe_var_g sigma^2 + expe_bias_g_sq + expe_var_g  Try it again with quintic polynomials! 
{r} training_gs = matrix(NA, nrow = Nsim, ncol = 6) all_residuals = matrix(NA, nrow = Nsim, ncol = n_test) for (nsim in 1 : Nsim){ #simulate dataset$\mathbb{D}$x_train = runif(n_train, xmin, xmax) delta_train = rnorm(n_train, 0, sigma) y_train = f(x_train) + delta_train #fit a model g | x's, delta's and save it g_model = lm(y_train ~ poly(x, 5, raw = TRUE), data.frame(x = x_train)) training_gs[nsim, ] = coef(g_model) #generate oos dataset and save residuals on oos data x_test = runif(n_test, xmin, xmax) delta_test = rnorm(n_test, 0, sigma) y_test = f(x_test) + delta_test y_hat_test = predict(g_model, data.frame(x = x_test)) all_residuals[nsim, ] = y_test - y_hat_test }  Take a look at the mse: {r} mse = mean(c(all_residuals)^2) mse  Much higher! Why? Variance went up! Let's visualize the bias {r} g_average = colMeans(training_gs) f = function(x){x^2} x = seq(xmin, xmax, length.out = resolution) ggplot(f_x_df, aes(x, f)) + geom_line(col = "darkgreen") + stat_function(fun = function(x){g_average[1] + g_average[2] * x + g_average[3] * x^2 + g_average[4] * x^3 + g_average[5] * x^4 + g_average[6] * x^5}, col = "red")  Not much! Now compute the average bias squared of$g$: {r} x = seq(xmin, xmax, length.out = resolution) g_avg_x = g_average[1] + g_average[2] * x + g_average[3] * x^2 + g_average[4] * x^3 + g_average[5] * x^4 + g_average[6] * x^5 biases = f(x) - g_avg_x expe_bias_g_sq = mean(biases^2) expe_bias_g_sq  This appears to have increased over last time by a nominal amount ... but it's only because we're not running the regression infinite times. Remember this "expectation" is only an average. What is the variance? 
Let's look at all lines: {r} plot_obj = ggplot(data.frame(x = x)) + xlim(xmin, xmax) for (nsim in 1 : min(Nsim, 30)){ #otherwise takes too long plot_obj = plot_obj + geom_line(data = data.frame(x = x, y = training_gs[nsim, 1] + training_gs[nsim, 2] * x + training_gs[nsim, 3] * x^2 + training_gs[nsim, 4] * x^3 + training_gs[nsim, 5] * x^4 + training_gs[nsim, 6] * x^5), mapping = aes(x, y), col = "blue") } plot_obj + stat_function(fun = function(x){g_average[1] + g_average[2] * x + g_average[3] * x^2 + g_average[4] * x^3 + g_average[5] * x^4 + g_average[6] * x^5}, col = "red", lwd = 2) + ylim(0, 25) + geom_line(data = f_x_df, aes(x, f), col = "green", size = 0.5)  It looks awful!!! Now actually compute the average variance numerically: {r} x = seq(xmin, xmax, length.out = resolution) expe_g_x = g_average[1] + g_average[2] * x + g_average[3] * x^2 + g_average[4] * x^3 + g_average[5] * x^4 + g_average[6] * x^5 var_x_s = array(NA, Nsim) for (nsim in 1 : Nsim){ g_x = training_gs[nsim, 1] + training_gs[nsim, 2] * x + training_gs[nsim, 3] * x^2 + training_gs[nsim, 4] * x^3 + training_gs[nsim, 5] * x^4 + training_gs[nsim, 6] * x^5 var_x_s[nsim] = mean((g_x - expe_g_x)^2) } expe_var_g = mean(var_x_s) expe_var_g  Any more complexity than you need allows for overfitting! Overfitted models induce variance-error from the perspective of multiple universes. Now check the equivalence: {r} mse sigma^2 expe_bias_g_sq expe_var_g sigma^2 + expe_bias_g_sq + expe_var_g  # Bias - Variance Decomposition of MSE in Regression Trees Let's return to the simulated sine curve data which we used to introduce regression trees. {r} rm(list = ls()) n_train = 100 n_test = 500 xmin = 0 xmax = 10 sigma = 0.2 Nsim = 250  And load the tree package: {r} options(java.parameters = "-Xmx4000m") pacman::p_load(YARF, tidyverse, magrittr)  Now let's generate lots of different datasets and fit many tree models. Note there's a new argument calculate_oob_error = FALSE. This is here for speed only. 
Ignore this for now as we will go over what this means later in detail. {r} training_gs = list() #storing entire objects - need a hash Nsim = 200 #trees take longer to build than OLS models sigma = 1 all_residuals = matrix(NA, nrow = Nsim, ncol = n_test) #go back to the sine curve example from class when we introduced the tree model f = function(x){5 * sin(x)} for (nsim in 1 : Nsim){ cat ("nsim", nsim, "of", Nsim, "\n") #simulate dataset$\mathbb{D}$x_train = runif(n_train, xmin, xmax) delta_train = rnorm(n_train, 0, sigma) y_train = f(x_train) + delta_train # ggplot(data.frame(x = x, y = y), aes(x, y)) + geom_point(lwd=0.6) #fit a model g | x's, delta's and save it g_model = YARFCART(data.frame(x = x_train), y_train, calculate_oob_error = FALSE, verbose = FALSE) training_gs[[nsim]] = g_model #generate oos dataset and save residuals on oos data x_test = runif(n_test, xmin, xmax) delta_test = rnorm(n_test, 0, sigma) #mean zero, variance sigsq always (not dependent on value of x) y_test = f(x_test) + delta_test y_hat_test = predict(g_model, data.frame(x = x_test)) all_residuals[nsim, ] = y_test - y_hat_test }  Let's look at the last of the data sets to remind ourselves of the problem setting: {r} ggplot(data.frame(x = x_train, y = y_train)) + geom_point(aes(x, y))  What does the storage of all the models look like? {r} head(training_gs, 2)  Take a look at the mse: {r} mse = mean(c(all_residuals)^2) mse  Let's visualize the bias {r} resolution = 1000 #generate x and the truth x = seq(xmin, xmax, length.out = resolution) #now estimate the expectation of g by averaging all the different models g_avg_x = array(0, resolution) for (nsim in 1 : Nsim){ g_nsim = training_gs[[nsim]] g_avg_x = g_avg_x + predict(g_nsim, data.frame(x = x)) } g_avg_x = g_avg_x / Nsim #average of all models #now plot ggplot(data.frame(x = x, f = f(x), expe_g = g_avg_x)) + geom_line(aes(x, expe_g), col = "red", lwd = 2) + geom_line(aes(x, f), col = "green", lwd = 1)  Not much! 
Now actually compute the average bias squared of $g$: {r} biases = f(x) - g_avg_x expe_bias_g_sq = mean(biases^2) expe_bias_g_sq  This is small - why?? It's because trees are so expressive and have such model complexity that they can nail almost any true $f$ function! That means the MSE, apart from the irreducible noise, is coming from the variance. Let's look at the variance vs f(x): {r} plot_obj = ggplot(data.frame(x = x)) + xlim(xmin, xmax) #+ ylim(xmin^2, xmax^2) num_trees_to_visualize = 30 for (nsim in 1 : min(Nsim, num_trees_to_visualize)){ g_nsim = training_gs[[nsim]] g_x = predict(g_nsim, data.frame(x = x)) plot_obj = plot_obj + geom_line(data = data.frame(x = x, y = g_x), aes(x = x, y = y), col = "blue") } plot_obj + geom_line(data = data.frame(x = x, expe_g = g_avg_x), mapping = aes(x, expe_g), col = "red", lwd = 2) + geom_line(aes(x, f(x)), col = "green", lwd = 1)  It looks awful!!! Why? Now actually compute the average variance numerically: {r} x = seq(xmin, xmax, length.out = resolution) var_x_s = array(NA, min(Nsim, 50)) for (nsim in 1 : min(Nsim, 50)){ #otherwise takes too long g_nsim = training_gs[[nsim]] g_x = predict(g_nsim, data.frame(x = x)) var_x_s[nsim] = mean((g_x - g_avg_x)^2) } expe_var_g = mean(var_x_s) expe_var_g  Any more complexity than you need allows for overfitting! Now check the equivalence: {r} mse sigma^2 expe_bias_g_sq expe_var_g sigma^2 + expe_bias_g_sq + expe_var_g  # Model Averaging via Bootstrap Aggregation ("Bagging") Now let's generate one dataset (only ONE)! {r} x_train = runif(n_train, xmin, xmax) delta_train = rnorm(n_train, 0, sigma) y_train = f(x_train) + delta_train ggplot(data.frame(x = x_train, y = y_train)) + geom_point(aes(x, y))  Spend a moment to appreciate that this is all we're going to have. There is a lot of irreducible noise that is going to prevent us from finding the sine curve.
Now we create many bootstrap samples to create similar but different datasets and fit many tree models: {r} bootstrapped_gs = list() #storing entire objects - need a hash / list object num_trees = 250 #M = num_trees for (t in 1 : num_trees){ #fit a model g | x's, delta's and save it bootstrap_indices_t = sample(1 : n_train, replace = TRUE) g_model = YARFCART(data.frame(x = x_train[bootstrap_indices_t]), y_train[bootstrap_indices_t], calculate_oob_error = FALSE) #this is confusing! But will become clear later... bootstrapped_gs[[t]] = g_model }  Now let's aggregate all models constructed with bootstrap samples together by averaging and see what it looks like relative to the real $f$: {r} #generate x and the truth x = seq(xmin, xmax, length.out = resolution) #create the bagged model predictions g_bagged = array(0, resolution) for (t in 1 : num_trees){ g_t = bootstrapped_gs[[t]] g_bagged = g_bagged + predict(g_t, data.frame(x = x)) } g_bagged = g_bagged / num_trees #average of all models #now plot ggplot(data.frame(x = x, f = f(x), g_bagged = g_bagged)) + geom_line(aes(x, f), col = "green") + geom_line(aes(x, g_bagged), col = "blue")  That's pretty good considering the plot of the raw data from above. We just compared one tree to a bagged tree (i.e. an average of many trees where each tree was fit on a bootstrap sample of the original data). Let's see if the bagged tree is truly better than one tree. The way to do this is to generate lots of datasets and estimate the mse (with its bias and variance components). We fit the bag of trees before manually with a for loop. But of course there is a convenience function to do this: YARFBAG. And then we run the same simulation as we did above with one tree.
{r} Nsim = 250 training_gs = list() #storing entire objects - need a hash all_residuals = matrix(NA, nrow = Nsim, ncol = n_test) for (nsim in 1 : Nsim){ #simulate dataset $\mathbb{D}$ x_train = runif(n_train, xmin, xmax) delta_train = rnorm(n_train, 0, sigma) y_train = f(x_train) + delta_train # ggplot(data.frame(x = x, y = y), aes(x, y)) + geom_point(lwd=0.6) #fit a bagged tree model g | x's, delta's and save it g_model = YARFBAG(data.frame(x = x_train), y_train, num_trees = num_trees, calculate_oob_error = FALSE) #automatically does the bootstrap internally training_gs[[nsim]] = g_model #generate oos dataset and save residuals on oos data x_test = runif(n_test, xmin, xmax) delta_test = rnorm(n_test, 0, sigma) #mean zero, variance sigsq always (not dependent on value of x) y_test = f(x_test) + delta_test y_hat_test = predict(g_model, data.frame(x = x_test)) all_residuals[nsim, ] = y_test - y_hat_test }  What do all the bagged models look like? {r} head(training_gs, 2)  Take a look at the mse: {r} mse = mean(c(all_residuals)^2) mse  Let's visualize the bias: {r} resolution = 1000 #generate x and the truth x = seq(xmin, xmax, length.out = resolution) f_x = f(x) #now estimate the expectation of g by averaging all the different models g_avg_x = array(0, resolution) for (nsim in 1 : Nsim){ g_nsim = training_gs[[nsim]] g_avg_x = g_avg_x + predict(g_nsim, data.frame(x = x)) } g_avg_x = g_avg_x / Nsim #average of all models #now plot ggplot(data.frame(x = x, f = f(x), expe_g = g_avg_x)) + geom_line(aes(x, expe_g), col = "red", lwd = 2) + geom_line(aes(x, f), col = "darkgreen", lwd = 1)  This is not a shock since if a single tree was unbiased then an average of trees will be unbiased all the more so. Now actually compute the average bias squared of $g$: {r} biases = f(x) - g_avg_x expe_bias_g_sq = mean(biases^2) expe_bias_g_sq  Almost nothing. Now... variance?
{r} plot_obj = ggplot(data.frame(x = x)) + xlim(xmin, xmax) #+ ylim(xmin^2, xmax^2) for (nsim in 1 : min(Nsim, 30)){ #otherwise takes too long g_nsim = training_gs[[nsim]] g_x = predict(g_nsim, data.frame(x = x)) plot_obj = plot_obj + geom_line(data = data.frame(x = x, y = g_x), aes(x = x, y = y), col = "blue") } plot_obj + geom_line(data = data.frame(x = x, expe_g = g_avg_x), mapping = aes(x, expe_g), col = "red", lwd = 1.5)  There is variance, but less than previously when only one tree was used. Now actually compute the average variance numerically: {r} x = seq(xmin, xmax, length.out = resolution) var_x_s = array(NA, min(Nsim, 50)) for (nsim in 1 : min(Nsim, 50)){ #otherwise takes too long g_nsim = training_gs[[nsim]] g_x = predict(g_nsim, data.frame(x = x)) var_x_s[nsim] = mean((g_x - g_avg_x)^2) } expe_var_g_bagged = mean(var_x_s) expe_var_g_bagged  Any more complexity than you need allows for overfitting! Now check the equivalence: {r} mse sigma^2 expe_bias_g_sq expe_var_g_bagged sigma^2 + expe_bias_g_sq + expe_var_g_bagged  We have a better algorithm! Why? Because there is a reduction in correlation between the bootstrapped models: {r} rho = expe_var_g_bagged / expe_var_g rho  That's a gain of ~50% in the MSE component that we have control over! And... it's for FREE. Well, not exactly: you are giving up on the tree's explanatory ability - the ultimate tradeoff in machine learning. This tradeoff is no joke! It is a huge area of research and a huge debate of civic importance. Let's watch some of Stephen Wolfram's testimony in the US Senate in Feb, 2021. https://www.youtube.com/watch?v=SuaP8izUPzQ&t=424s https://www.youtube.com/watch?v=SuaP8izUPzQ&t=588s # Bagged Trees vs. a Linear Model: Behold the power of machine learning!! Let's look at the Boston housing data first.
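The rho computed above can be read through the standard identity for the variance of an average of $M$ identically distributed predictions with common variance $\sigma^2_g$ and pairwise correlation $\rho$ (a sketch in generic notation, not the course's exact derivation):

$$\mathbb{V}\mathrm{ar}\left[\frac{1}{M}\sum_{t=1}^{M} g_t(x)\right] = \rho\,\sigma^2_g + \frac{1-\rho}{M}\,\sigma^2_g \longrightarrow \rho\,\sigma^2_g \quad \text{as } M \rightarrow \infty$$

With many trees, the only variance remaining is the correlated part $\rho\,\sigma^2_g$, which is why decorrelating the trees further (the idea behind Random Forests, coming up) can shrink it even more.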
{r} rm(list = ls()) boston = MASS::Boston prop_test = 0.1 test_indices = sample(1 : nrow(boston), round(prop_test * nrow(boston))) boston_test = boston[test_indices, ] y_test = boston_test$medv X_test = boston_test X_test$medv = NULL train_indices = setdiff(1 : nrow(boston), test_indices) boston_train = boston[train_indices, ] y_train = boston_train$medv X_train = boston_train X_train$medv = NULL mod_lin = lm(y_train ~ ., X_train) y_hat_test_lin = predict(mod_lin, X_test) oos_rmse_lin = sd(y_test - y_hat_test_lin) oos_rmse_lin num_trees = 500 mod_bag = YARFBAG(X_train, y_train, num_trees = num_trees, calculate_oob_error = FALSE) y_hat_test_bag = predict(mod_bag, X_test) oos_rmse_bag = sd(y_test - y_hat_test_bag) oos_rmse_bag cat("oos standard error reduction:", (1 - oos_rmse_bag / oos_rmse_lin) * 100, "%\n")  Whole lot of room to improve! That extra 26% or so of unexplained variance is split between bias and irreducible error due to unknown information. The bagged trees can get rid of the bias and minimize variance while doing so. And it did a great job! Now the diamonds data: {r} n_train = 1000 training_indices = sample(1 : nrow(diamonds), n_train) diamonds_train = diamonds[training_indices, ] y_train = diamonds_train$price X_train = diamonds_train X_train$price = NULL test_indices = setdiff(1 : nrow(diamonds), training_indices) diamonds_test = diamonds[test_indices, ] y_test = diamonds_test$price X_test = diamonds_test X_test$price = NULL mod_lin = lm(y_train ~ ., X_train) y_hat_test_lin = predict(mod_lin, X_test) oos_rmse_lin = sd(y_test - y_hat_test_lin) oos_rmse_lin mod_bag = YARFBAG(X_train, y_train, num_trees = num_trees, calculate_oob_error = FALSE) y_hat_test_bag = predict(mod_bag, X_test) oos_rmse_bag = sd(y_test - y_hat_test_bag) oos_rmse_bag cat("oos standard error reduction:", (1 - oos_rmse_bag / oos_rmse_lin) * 100, "%\n")  Shocking improvement... and it's even more shocking that... 
{r} summary(mod_lin)$r.squared  There was seemingly not a whole lot of room to improve! But there is when you measure error in RMSE. That extra 8% of unexplained variance means (1) a whole lot of RMSE and (2) that whole lot of RMSE is split between bias and irreducible error due to unknown information. But it did the best it could. It is likely what's remaining is due to information we don't measure. # Validation in Bagging? We are using the "bootstrap" to get the trees. Can we do model validation in the same step? The answer is yes. For every tree, there was a bootstrap sample of the training set used to build the tree. But there are observations in $\mathbb{D}$ that are not in the bootstrap sample! About 1/3 on average are left out i.e. "out of bag (oob)". Over many trees, the different oob subsets together cover the full data set. So you actually get validation on the whole dataset, somewhat like K-fold cross validation. Supposedly this validation is similar to K=2 in terms of performance. It is what everyone seems to use. Let's load the data and packages from last class and plot the data: {r} rm(list = ls()) n_train = 100 n_test = 500 xmin = 0 xmax = 10 sigma = 0.09 num_trees = 500 x_train = runif(n_train, xmin, xmax) delta_train = rnorm(n_train, 0, sigma) y_train = sin(x_train) + delta_train ggplot(data.frame(x = x_train, y = y_train)) + geom_point(aes(x, y))  Let's look at one bagged tree model and compute OOB errors after construction. Here we just drop the calculate_oob_error argument as the default is TRUE. We save the model and print it out: {r} bagged_tree_mod = YARFBAG(data.frame(x = x_train), y_train, num_trees = num_trees) bagged_tree_mod  How did this work?
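The "about 1/3" figure can be checked with a quick base-R simulation (a standalone sketch, not YARF internals): the chance a given observation is absent from one bootstrap sample of size n is (1 - 1/n)^n, which approaches 1/e, about 0.368, as n grows.

```r
#estimate the probability that observation #1 is "out of bag", i.e. absent
#from a single bootstrap sample of size n drawn with replacement
n = 100
set.seed(1984)
prop_oob = mean(replicate(5000, !(1 %in% sample(1 : n, replace = TRUE))))
prop_oob      #simulated estimate
(1 - 1 / n)^n #exact probability for this n
exp(-1)       #limit as n grows, approximately 0.368
```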
Let's look at the oob sets of indices: {r} cat("bootstrap indices:") sort(bagged_tree_mod$bootstrap_indices[[1]]) cat("oob:") sort(setdiff(1 : n_train, bagged_tree_mod$bootstrap_indices[[1]])) cat("bootstrap indices:") sort(bagged_tree_mod$bootstrap_indices[[2]]) cat("oob:") sort(setdiff(1 : n_train, bagged_tree_mod$bootstrap_indices[[2]])) bagged_tree_mod$y_oob n_oob = sapply(1 : n_train, function(i){sum(unlist(lapply(bagged_tree_mod$bootstrap_indices, function(set){!(i %in% set)})))}) round(n_oob / num_trees, 2) mean(n_oob / num_trees) 1 / exp(1)  Each tree made predictions on its oob set; these predictions were averaged by observation across trees, and then those per-observation averages were averaged. Let's compare this to manual training-test splits. Let's look at the Boston housing data first. {r} pacman::p_load(data.table) boston = MASS::Boston %>% data.table seed = 4 set.seed(seed) prop_test = 0.5 test_indices = sample(1 : nrow(boston), round(prop_test * nrow(boston))) boston_test = boston[test_indices, ] y_test = boston_test$medv X_test = boston_test X_test$medv = NULL train_indices = setdiff(1 : nrow(boston), test_indices) boston_train = boston[train_indices, ] y_train = boston_train$medv X_train = boston_train X_train$medv = NULL num_trees = 500 #build on training and validate on test mod_bag_train = YARFBAG(X_train, y_train, num_trees = num_trees, calculate_oob_error = FALSE, seed = seed) y_hat_test_bag = predict(mod_bag_train, X_test) s_e_bag = sd(y_test - y_hat_test_bag) s_e_bag #build and validate on all the data at once! mod_bag_all = YARFBAG(boston[, !"medv"], boston$medv, num_trees = num_trees, seed = seed) mod_bag_all$rmse_oob  There is a lot of variation, but theoretically, they should be about the same. What do we have now? We now have model selection done within training. And training and validation are done in a single step! No more costly K-fold CV with 3 splits! # Random Forests But can it get any better? YES.
As you saw, the variance term can be shrunk further the more decorrelated the trees become. We do this now by introducing randomness into the splits: at each split, only a random subset of the features is eligible to be split on. The trees are then grown as normal. Then we model-average many trees via bagging. And that's random forests! Quick demo with the diamonds and oob validation: {r} rm(list = ls()) seed = 1984 set.seed(seed) n_samp = 2000 #note: trees are perfect for ordinal variables! Since it only needs an ordering to do the splits #and thus it is immune to different encodings so, just use the standard 1, 2, 3, ... #thus there is no need here to cast the ordinal variables cut, color, clarity to nominal variables diamonds_samp = diamonds %>% sample_n(n_samp) y = diamonds_samp$price X = diamonds_samp %>% dplyr::select(-price) num_trees = 1000 mod_bag = YARFBAG(X, y, num_trees = num_trees, seed = seed) mod_bag #now for apples-apples, build the RF on the same bootstrap data for each tree mod_rf = YARF(X, y, num_trees = num_trees, seed = seed, bootstrap_indices = mod_bag$bootstrap_indices) mod_rf # pacman::p_load(randomForest) # mod_rf = randomForest(X, y, num_trees = num_trees) # mod_rf cat("gain: ", (mod_bag$rmse_oob - mod_rf$rmse_oob) / mod_bag$rmse_oob * 100, "%\n")  This was a fail. RF is sensitive to the number of features sampled (the hyperparameter mtry). Let's try again. {r} #there are 9 total features mod_rf = YARF(X, y, num_trees = num_trees, seed = seed, bootstrap_indices = mod_bag$bootstrap_indices, mtry = 8) mod_rf cat("gain: ", (mod_bag$rmse_oob - mod_rf$rmse_oob) / mod_bag$rmse_oob * 100, "%\n")  This is not so impressive. But it is for free. If mtry is small, the gain in the rho-multiple on the variance term may be so small that it doesn't outweigh the increase in bias. Thus we underfit a little bit and it's better to stay with just bagging.
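To make the "randomness in the splits" concrete, here is a hypothetical base-R sketch (not YARF's actual internals): at every split, instead of searching all p features for the best split, the tree draws a fresh random subset of mtry candidate features and searches only among those.

```r
#hypothetical illustration of per-split feature subsampling (mtry);
#a real implementation would redo this draw inside the tree-growing recursion
p = 9    #total number of features, as in the diamonds example above
mtry = 3 #number of candidate features considered at each split
set.seed(1984)
candidate_features_split_1 = sample(1 : p, mtry)
candidate_features_split_2 = sample(1 : p, mtry) #a different draw at each split
candidate_features_split_1
candidate_features_split_2
```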
Here it is on the boston housing data: {r} rm(list = ls()) y = MASS::Boston$medv X = MASS::Boston X$medv = NULL seed = 1984 num_trees = 1000 mod_bag = YARFBAG(X, y, num_trees = num_trees, seed = seed) mod_bag mod_rf = YARF(X, y, num_trees = num_trees, seed = seed, bootstrap_indices = mod_bag$bootstrap_indices, mtry = 10) mod_rf cat("oob rmse gain:", round((mod_bag$rmse_oob - mod_rf$rmse_oob) / mod_bag$rmse_oob * 100, 3), "%\n")  Slightly more impressive. But again it is for free. What about for classification models? {r} rm(list = ls()) pacman::p_load_gh("coatless/ucidata") data(adult) adult %<>% na.omit #kill any observations with missingness n_samp = 1000 seed = 1984 set.seed(seed) adult_samp = adult %>% sample_n(n_samp) y = adult_samp$income X = adult_samp %>% dplyr::select(-income) num_trees = 500 mod_bag = YARFBAG(X, y, num_trees = num_trees, seed = seed) #default mtry for classification models is floor(sqrt(p)) mod_rf = YARF(X, y, num_trees = num_trees, seed = seed, bootstrap_indices = mod_bag$bootstrap_indices) mod_bag$misclassification_error mod_rf$misclassification_error cat("gain: ", (mod_bag$misclassification_error - mod_rf$misclassification_error) / mod_bag$misclassification_error * 100, "%\n")  Very nice... and can likely be improved with optimizing mtry. 
And on letters: {r} rm(list = ls()) pacman::p_load(mlbench, skimr) data(LetterRecognition) LetterRecognition = na.omit(LetterRecognition) #kill any observations with missingness #skim(LetterRecognition) ?LetterRecognition n_samp = 2000 seed = 1984 set.seed(seed) letters_samp = LetterRecognition %>% sample_n(n_samp) y = letters_samp$lettr X = letters_samp %>% dplyr::select(-lettr) num_trees = 500 mod_bag = YARFBAG(X, y, num_trees = num_trees, seed = seed) #default mtry for classification models is floor(sqrt(p)) mod_rf = YARF(X, y, num_trees = num_trees, seed = seed, bootstrap_indices = mod_bag$bootstrap_indices) mod_bag$misclassification_error mod_rf$misclassification_error cat("gain: ", (mod_bag$misclassification_error - mod_rf$misclassification_error) / mod_bag$misclassification_error * 100, "%\n")  Very nice... and can likely be improved by optimizing mtry. There are very real gains for RF over bagging and these are most pronounced in classification models. ## Data "Munging" with Dplyr and data.table "Data munging", sometimes referred to as "data wrangling", is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations. -[Wikipedia](https://en.wikipedia.org/wiki/Data_wrangling). Half of what a data scientist does is cleaning data, visualizing data and wrangling it. In the process you learn all about your dataset and you're on a higher level when it comes time to build prediction models. The packages dplyr and data.table offer many convenient functions to manipulate, clean, and otherwise wrangle data.
Note: all the wrangling we're going to see *can* be done with base R (see previous notes on the data.frame object) but it would be *very very very very* annoying and *very very very very* slow. I will quickly compare and contrast dplyr and data.table before you see them inside actual code. * dplyr works really nicely with the piping chain as you "begin" the manipulation with the dataset and then iteratively pipe in step 1, step 2, etc. until you wind up with the end product you would like. This makes dplyr very readable but very verbose - lots of lines of code. * On the flip side, data.table essentially wrote a new data wrangling language so it has a steeper learning curve, but it's very compact - very few lines of code. * data.table is blazing fast and kills dplyr in performance and I'm pretty sure it even beats Python in performance (someone please check this). So in the era of "big data", I think this is the winner even though it is much harder to learn. * I believe dplyr is more popular in the real world and thus has more cachet to put on your CV. But this is constantly in flux! For all labs and the final project, you are encouraged to pick the one you want to use and go with it. For the exams, I will write code in both (if need be) to not penalize / reward a student who picked one over the other. Here is a nice [translation guide](https://atrebas.github.io/post/2019-03-03-datatable-dplyr/) between dplyr and data.table. We will be learning them in tandem. I could've split this into two units but I decided against it because (1) it is good to see the same functionality side-by-side and (2) this is really just one concept. {r} pacman::p_load(tidyverse, magrittr) #tidyverse is shorthand for dplyr, ggplot2, tidyr, readr and a bunch of other packages recommended for the "full" dplyr experience. I'm using magrittr for special pipe operations later. pacman::p_load(data.table)  First, recall what pipe format means! We're going to need to know this well for what's coming next...
{r} set.seed(1984) mean(head(round(sample(rnorm(1000), 100), digits = 2))) set.seed(1984) rnorm(1000) %>% #the pipe operator sample(100) %>% round(digits = 2) %>% #the first argument is passed in automatically. head %>% mean  Note that data.table is automatically multithreaded. Read [here](https://www.rdocumentation.org/packages/data.table/versions/1.12.8/topics/setDTthreads). {r} getDTthreads()  We first instantiate the upgraded data.frame objects in both libraries: {r} diamonds_tbl = as_tibble(diamonds) #not necessary to cast because dplyr does the conversion automatically after using any dplyr function diamonds_dt = data.table(diamonds) #absolutely necessary  What happens during the data frame conversion? {r} class(diamonds_tbl) class(diamonds_dt)  Note how these are implemented as class extensions of R's data.frame so as to allow for backward compatibility and not break the API. They have nicer ways of showing the data: {r} diamonds_tbl #run this in the console, not inside the chunk diamonds_dt #run this in the console, not inside the chunk  Beginning with the simplest munging tasks, subsetting rows: {r} diamonds_tbl %>% slice(1 : 5) diamonds_dt[1 : 5]  And subsetting columns: {r} diamonds_tbl %>% select(cut, carat, price) #these three only in this order diamonds_dt[, .(cut, carat, price)] diamonds_tbl %>% select(carat, price, cut) #these three only in another order diamonds_dt[, .(carat, price, cut)] diamonds_tbl %>% select(-x) #drop this feature diamonds_dt[, !"x"] #diamonds_dt[, x := NULL] #mutating function (overwrites the data frame) diamonds_tbl %>% select(-c(x, y, z)) #drop these features diamonds_tbl %>% select(-x, -y, -z) #drop these features diamonds_dt[, !c("x", "y", "z")]  How about renaming a column? {r} diamonds_tbl %>% rename(weight = carat, price_USD = price) diamonds_dt_copy = copy(diamonds_dt) setnames(diamonds_dt_copy, old = c("carat", "price"), new = c("weight", "price_USD")) #the setnames function is mutating, i.e.
it modifies the data.table object, so I made a copy as to not alter the table for the rest of the demo diamonds_dt_copy rm(diamonds_dt_copy)  If you want to rearrange the columns... {r} #In dplyr you pretend to select a subset and then ask for everything else: diamonds_tbl %>% select(carat, price, cut, everything()) #these three in this order first then everything else # diamonds_tbl %>% # select(-carat, everything()) #move carat last (first drop it, and then add it back in with everything) diamonds_dt_copy = copy(diamonds_dt) setcolorder(diamonds_dt_copy, c("carat", "price", "cut")) #as before, the setcolorder function is mutating, i.e. it modifies the data.table object, so I made a copy as to not alter the table for the rest of the demo diamonds_dt_copy rm(diamonds_dt_copy)  Sorting the rows by column(s): {r} diamonds_tbl %>% arrange(carat) #default is ascending i.e. lowest first diamonds_dt[order(carat)] diamonds_dt_copy = copy(diamonds_dt) setorder(diamonds_dt_copy, carat) #as before, the setorder function is mutating, i.e. it modifies the data.table object, so I made a copy as to not alter the table for the rest of the demo diamonds_dt_copy rm(diamonds_dt_copy) diamonds_tbl %>% arrange(desc(carat)) #switch to descending, i.e. 
highest first diamonds_dt[order(-carat)] #and you can do this with setorder too diamonds_tbl %>% arrange(desc(color), clarity, cut, desc(carat)) #multiple sorts - very powerful diamonds_dt[order(-color, clarity, cut, -carat)] #and you can do this with setorder too  The filter method subsets the data based on conditions: {r} diamonds_tbl %>% filter(cut == "Ideal") diamonds_dt[cut == "Ideal"] diamonds_tbl %>% filter(cut == "Ideal") %>% filter(depth < 65) %>% filter(x * y * z > 20) diamonds_tbl %>% filter(cut == "Ideal" & depth < 65 & x * y * z > 20) diamonds_dt[cut == "Ideal" & depth < 65 & x * y * z > 20] diamonds_tbl %>% filter((cut == "Ideal" | cut == "Premium") & depth < 65 & x * y * z > 20) diamonds_dt[(cut == "Ideal" | cut == "Premium") & depth < 65 & x * y * z > 20] diamonds_tbl %>% filter(cut %in% c("Ideal", "Premium") & depth < 65 & x * y * z > 20) diamonds_dt[cut %in% c("Ideal", "Premium") & depth < 65 & x * y * z > 20]  How about removing all rows that are the same? {r} diamonds_tbl diamonds_tbl %>% distinct unique(diamonds_dt) #nice function from data.table: uniqueN(diamonds$carat) #273 < 53940 i.e. there's only a few weight measurements that are possible... let's only keep one from each unique carat value diamonds_tbl %>% distinct(carat, .keep_all = TRUE) #keeps the first row for each unique weight measurement unique(diamonds_dt, by = "carat")  Sampling is easy {r} diamonds_tbl %>% sample_n(7) diamonds_dt[sample(.N, 7)] #.N is a cool function: it is short for nrow(dt object) diamonds_tbl %>% sample_frac(1e-3) diamonds_dt[sample(.N, .N * 1e-3)] #.N is a cool function: it is short for nrow(dt object)  Now for some real fun stuff. Let's create new features with the mutate function. 
{r} diamonds_tbl %>% mutate(volume = x * y * z) #adds a new column keeping the old ones (this was an exam problem in a previous year) diamonds_dt2 = copy(diamonds_dt) diamonds_dt2[, volume := x * y * z] diamonds_dt2 diamonds_tbl %>% mutate(price_per_carat = price / carat) %>% arrange(desc(price_per_carat)) diamonds_dt2[, price_per_carat := price / carat] diamonds_dt2[order(-price_per_carat)] rm(diamonds_dt2)  Or rewrite old ones. {r} diamonds_tbl %>% mutate(cut = substr(cut, 1, 1)) diamonds_dt2 = copy(diamonds_dt) diamonds_dt2[, cut := substr(cut, 1, 1)] diamonds_dt2 diamonds_tbl %>% mutate(carat = factor(carat)) diamonds_dt2[, carat := factor(carat)] diamonds_dt2 rm(diamonds_dt2)  Here are some more ways to create new variables. Translating to data.table is trivial: {r} diamonds_tbl %>% mutate(carat = factor(ntile(carat, 5))) diamonds_tbl %>% mutate(carat = percent_rank(carat)) diamonds_tbl %>% mutate(lag_price = lag(price)) #if this data was a time series diamonds_tbl %>% mutate(cumul_price = cumsum(price)) #%>% tail  How about if you want to create a column and drop all other columns in the process? {r} diamonds_tbl %>% transmute(volume = x * y * z) #adds a new column dropping the old ones diamonds_dt[, .(volume = x * y * z)]  There are many ways to reshape a dataset. We will see two now and a few functions later when it becomes important. For instance: we can collapse columns together using the unite function from package tidyr (which is loaded when you load the tidyverse). We will have a short unit on more exciting and useful reshapings later ("long" to "wide" and vice-versa). As far as I know, data.table has a less elegant way of doing this... unless someone has a better idea?
{r} diamonds_tbl2 = diamonds_tbl %>% unite(dimensions, x, y, z, sep = " x ") diamonds_tbl2 diamonds_dt2 = copy(diamonds_dt) diamonds_dt2[, dimensions := paste(x, y, z, sep = " x ")] #mutating diamonds_dt2 = diamonds_dt2[, !c("x", "y", "z")] diamonds_dt2  We can reverse this operation: {r} diamonds_tbl2 %>% separate(dimensions, c("x", "y", "z"), sep = " x ") rm(diamonds_tbl2) diamonds_dt2[, c("x", "y", "z") := tstrsplit(dimensions, " x ", fixed = TRUE)] #tstrsplit splits each string and transposes the result into one vector per new column (plain strsplit returns one list element per row, which is the wrong shape for :=) diamonds_dt2[, -"dimensions"] rm(diamonds_dt2)  There are tons of other packages to do clever things. For instance, here's one that does dummies. Let's convert the color feature to dummies. Again slightly less readable or elegant in data.table: {r} pacman::p_load(sjmisc, snakecase) diamonds_tbl %>% to_dummy(color, suffix = "label") %>% #this creates all the dummies bind_cols(diamonds_tbl) %>% #now we have to add all the original data back in select(-matches("_"), everything()) %>% #this puts the dummies last select(-color) #finally we can drop color cbind( diamonds_dt[, -"color"], to_dummy(diamonds_dt[, .(color)], suffix = "label") )  What if you want to create a new variable based on functions only run on subsets of the data? This is called "grouping". Grouping only makes sense for categorical variables. (If you group on a continuous variable, then chances are you'll have $n$ different groups because you'll have $n$ unique values). For instance: {r} diamonds_tbl %>% group_by(color) diamonds_dt[,, by = color]  Nothing happened... these were directives to do things a bit differently with the addition of other logic. So after you group, you can now run operations on each group like they're their own sub-data frame. Usually, you want to *summarize* data by group. This means you take the entire sub-data frame and run one metric on it and return only those metrics (i.e. shrink $n$ rows to $L$ rows). This sounds more complicated than it is and it is where data wrangling really gets fun.
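Before running this on the diamonds data, here is a minimal toy sketch of the summarize-by-group idea on a tiny hand-made data frame (the values are invented for illustration):

```r
library(dplyr)

#three rows in two groups: summarizing shrinks n = 3 rows down to L = 2 rows, one per group
toy = data.frame(grp = c("a", "a", "b"), val = c(1, 3, 10))
toy %>%
  group_by(grp) %>%
  summarize(avg_val = mean(val))
#returns two rows: grp "a" with avg_val 2 and grp "b" with avg_val 10
```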
Here are a few examples: {r} diamonds_tbl %>% group_by(color) %>% summarize(avg_price = mean(price)) diamonds_dt[, .(avg_price = mean(price)), by = color][order(color)] #chaining / piping [...][...][...] etc #where did all the other rows and columns go??? diamonds_tbl %>% group_by(color) %>% summarize(avg_price = mean(price), sd_price = sd(price), count = n()) diamonds_dt[, .(avg_price = mean(price), sd_price = sd(price), count = .N), by = color][order(color)] diamonds_tbl %>% group_by(color) %>% summarize(min_price = min(price), med_price = median(price), max_price = max(price)) diamonds_dt[, .(min_price = min(price), med_price = median(price), max_price = max(price)), by = color][order(color)]  Sometimes you want to do fancier things like actually run operations on the whole sub-data frame using mutate. If the function is a single metric, then that metric is then duplicated across the whole sub data frame. {r} diamonds_tbl %>% group_by(color) %>% mutate(avg_price_for_color = mean(price)) #creates a new feature based on running the feature only within group diamonds_dt2 = copy(diamonds_dt) diamonds_dt2[, avg_price_for_color := mean(price), by = color] diamonds_dt2 rm(diamonds_dt2)  So that's kind of like duplicating a summary stat. Here's something more fun: actually creating a new vector: {r} diamonds_tbl %>% group_by(color) %>% mutate(price_rank_within_color = dense_rank(price)) #creates a new feature based on running the feature only within group diamonds_dt2 = copy(diamonds_dt) diamonds_dt2[, price_rank_within_color := frankv(price, ties.method = "dense"), by = color] diamonds_dt2 rm(diamonds_dt2)  What if we want to get the first row in each category? {r} diamonds_tbl %>% group_by(color) %>% slice(1) diamonds_dt[, .SD[1], by = color][order(color)]  The .SD variable is short for "sub dataframe" and it's a stand-in for the pieces of the dataframe for each color as it loops over the colors. So .SD[1] will be first row in the sub dataframe. 
The reason why the results come out in a different row order is that data.table reorders rows based on internal optimizations. We'll see some of this later. I'm also unsure why it moved the color column to the front. What about first and last? {r} diamonds_tbl %>% group_by(color) %>% slice(1, n()) diamonds_dt[, .SD[c(1, .N)], by = color]  How about the diamond with the highest price by color? {r} diamonds_tbl %>% group_by(color) %>% arrange(price) %>% slice(n()) diamonds_dt[, .SD[which.max(price)], by = color]  We've seen data.table's preference for mutating functions. Here is a pipe command from package magrittr that makes the functions mutating. {r} diamonds_tbl2 = diamonds_tbl diamonds_tbl2 = diamonds_tbl2 %>% select(-x, -y, -z) %>% filter(carat < 0.5) %>% arrange(carat, cut, color) diamonds_tbl2 diamonds_tbl2 = diamonds_tbl diamonds_tbl2 %<>% #pipe and overwrite (short for what's above) select(-x, -y, -z) %>% filter(carat < 0.5) %>% arrange(carat, cut, color) diamonds_tbl2 rm(diamonds_tbl2)  This is as far as we will go with data wrangling right now. Let's benchmark a few core features of both packages. To do so, let's create a dataframe that's very big: {r} pacman::p_load(microbenchmark) Nbig = 2e6 diamonds_tbl_big = diamonds_tbl %>% sample_n(Nbig, replace = TRUE) diamonds_dt_big = data.table(diamonds_tbl_big) #just to make sure we have the same data diamonds_big = data.frame(diamonds_tbl_big) #ensure that it is a base R object  How about we write this dataframe to the hard drive as a CSV? {r} microbenchmark( base_R = write.csv(diamonds_big, "diamonds_big.csv"), tidyverse = write_csv(diamonds_tbl_big, "diamonds_big.csv"), data.table = fwrite(diamonds_dt_big, "diamonds_big.csv"), times = 1 )  How about we read this dataframe from the hard drive as a CSV? {r} microbenchmark( base_R = read.csv("diamonds_big.csv"), tidyverse = read_csv("diamonds_big.csv"), data.table = fread("diamonds_big.csv"), times = 1 )  What about for creating new variables?
{r} microbenchmark( base_R = {diamonds_big$log_price = log(diamonds_big$price)}, tidyverse = {diamonds_tbl_big %<>% mutate(log_price = log(price))}, data.table = diamonds_dt_big[, log_price := log(price)], times = 10 )  About the same. How about grouping and summarizing? No easy one-liner in base R. So we just compare the two packages: {r} microbenchmark( tidyverse = {diamonds_tbl_big %>% group_by(color) %>% summarize(avg_price = mean(price))}, data.table = diamonds_dt_big[, .(avg_price = mean(price)), by = color], times = 10 )  How about sorting? {r} microbenchmark( base_R = diamonds_big[order(diamonds_big$price), ], tidyverse = {diamonds_tbl_big %>% arrange(price)}, data.table = diamonds_dt_big[order(price)], times = 10 )  How about filtering? {r} microbenchmark( base_R = diamonds_big[diamonds_big$price < 1000, ], tidyverse = {diamonds_tbl_big %>% filter(price < 1000)}, data.table = diamonds_dt_big[price < 1000], times = 10 )  Let's do this again but first "key" the price column which is what you would do if you are doing lots of searches. {r} setkey(diamonds_dt_big, price) microbenchmark( base_R = diamonds_big[diamonds_big$price < 1000, ], tidyverse = {diamonds_tbl_big %>% filter(price < 1000)}, data.table = diamonds_dt_big[price < 1000], times = 30 )  # Wide and Long Dataframe Formats Another one of the core data munging skills is to transform a data frame from wide to long (most common) and from long back to wide (less common). Let's learn this by way of example.
{r} pacman::p_load(data.table, tidyverse, magrittr) summary(storms) head(storms)  Let's first create a few variables that are of interest: {r} storms %<>% mutate(wind_pct_avg = wind / mean(wind, na.rm = TRUE) * 100) %>% mutate(pressure_pct_avg = pressure / mean(pressure, na.rm = TRUE) * 100) %>% mutate(ts_diameter_pct_avg = ts_diameter / mean(ts_diameter, na.rm = TRUE) * 100) %>% mutate(hu_diameter_pct_avg = hu_diameter / mean(hu_diameter, na.rm = TRUE) * 100) ggplot(storms) + aes(wind_pct_avg) + geom_histogram()  Now let's take a look at these four variables we created for a storm we all remember and create a time period variable. I'll also instantiate a data.table object for later: {r} sandy_wide_tbl = storms %>% filter(name == "Sandy") %>% select(wind_pct_avg, pressure_pct_avg, ts_diameter_pct_avg, hu_diameter_pct_avg) %>% #we only care about our variables mutate(period = 1 : n()) %>% select(period, everything()) #reorder sandy_wide_dt = data.table(sandy_wide_tbl) sandy_wide_dt  This is called a "repeated measures" dataset or a "time series" and it is one of the most common data frame types. Unfortunately, we didn't have enough class time to do a unit on time series. It really deserves its own class! Regardless, it would be nice to be able to visualize the four variables we just created by time period. We can do this below: {r} ggplot(sandy_wide_tbl) + aes(x = period) + geom_line(aes(y = wind_pct_avg), col = "red") + geom_line(aes(y = pressure_pct_avg), col = "green") + geom_line(aes(y = ts_diameter_pct_avg), col = "blue") + geom_line(aes(y = hu_diameter_pct_avg), col = "grey") + #make legend code ylab("% over average")  Notice how that was a lot of lines of code which aren't so maintainable and we don't have a legend. Legends are built automatically in ggplot2 when we set color to a variable. This means we somehow have to let the four variables we care about be their own categorical variable.
First note that the dataframe we have is in what's called "wide format" or "unstacked" meaning each row is an observation and the columns are its features. This is exactly the format of dataframe that we've been studying in this class. This is the format we humans prefer to read, it is the format for many important analyses and it is the format for modeling. However, to get what we want above involves "reshaping" our dataframe into another canonical form, one that is easier for machines to read, a format called "long format" or "narrow" or "stacked" which looks like this:

| Period | Value | variable |
| ----------- | ----------- | -------------|
| 1 | 56.08 | wind_pct_avg |
| 2 | 65.43 | wind_pct_avg |

etc.

Sometimes this format is required for certain analyses, so we should get used to "pivoting" between the two formats. We first go from wide to long. To do so, we identify the "id variables" which get their own row per category and the measurement variables which each get their own entire subdataframe. {r} sandy_long_tbl = pivot_longer( sandy_wide_tbl, cols = -period, #measurement variables: all columns except period, and period is then the ID variable names_to = "metric", #default is "name" values_to = "val" #default is "value" ) sandy_long_dt = melt( sandy_wide_dt, id.vars = "period", measure.vars = c("wind_pct_avg", "pressure_pct_avg", "ts_diameter_pct_avg", "hu_diameter_pct_avg"), variable.name = "metric", value.name = "val" ) sandy_long_tbl sandy_long_dt  Same output but note the difference in sorting: tidyverse sorts on the id variables first and data.table sorts on the measurement variables first, i.e. stacking (rbinding) the subdataframes.
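To see the pivot mechanics on something tiny first, here is a minimal sketch on a hand-made wide data frame (the columns and values are invented for illustration):

```r
library(tidyr)

toy_wide = data.frame(period = 1 : 2, wind = c(56.08, 65.43), pressure = c(99.1, 98.7))
#each (period, measurement) combination becomes its own row: 2 rows x 2 measurement columns = 4 long rows
pivot_longer(toy_wide, cols = -period, names_to = "metric", values_to = "val")
```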
Now that it's in long format, the visualization code becomes very simple: {r} ggplot(sandy_long_dt) + geom_line(aes(x = period, y = val, color = metric)) + ylab("% over average")  Now we go from long to wide: {r} sandy_wide_tbl2 = pivot_wider( sandy_long_tbl, id_cols = period, names_from = metric, values_from = val ) sandy_wide_dt2 = dcast( sandy_long_dt, period ~ metric, #lhs is id and rhs is measurement variables value.var = "val" #the function can guess "val" has to be the cell values so it's not needed ) sandy_wide_tbl2 sandy_wide_dt2  Who's faster? {r} pacman::p_load(microbenchmark) microbenchmark( wide_to_long_tidy = pivot_longer( sandy_wide_tbl, cols = -period, names_to = "metric", values_to = "val" ), wide_to_long_dt = melt( sandy_wide_dt, id.vars = "period", measure.vars = c("wind_pct_avg", "pressure_pct_avg", "ts_diameter_pct_avg", "hu_diameter_pct_avg"), variable.name = "metric", value.name = "val" ), long_to_wide_tidy = pivot_wider( sandy_long_tbl, id_cols = period, names_from = metric, values_from = val ), long_to_wide_dt = dcast( sandy_long_dt, period ~ metric, value.var = "val" ), times = 50 )  Looks like data.table::melt is 60x faster than tidyverse's pivot and data.table::dcast is 2x faster than tidyverse's pivot. # Joins Another one of the core data munging skills is joining data frames together. In the real world, databases consist of multiple dataframes called "tables" and design matrices are built by gathering data from among many tables. To illustrate this, we load two datasets from the package nycflights13, one dataset about weather and one about airports: {r} pacman::p_load(nycflights13, data.table, tidyverse, magrittr) data(weather) summary(weather) data(airports) summary(airports) head(weather) head(airports)  Note how the weather and airports datasets contain a common feature: the name of the airport. It is called faa in airports and origin in weather.
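Before wrangling the real tables, here is a minimal toy sketch of how joins match rows on a common key (the tiny tables and their values are invented for illustration):

```r
library(dplyr)

toy_weather = data.frame(faa = c("JFK", "LGA", "JFK"), temp = c(50, 55, 41))
toy_coords = data.frame(faa = c("JFK", "EWR"), lat = c(40.64, 40.69))
left_join(toy_weather, toy_coords, by = "faa")  #keeps all 3 weather rows; the LGA row gets NA for lat
inner_join(toy_weather, toy_coords, by = "faa") #keeps only the 2 JFK rows that match in both tables
```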
First we rename the column in weather to match the column in airports: {r} weather %<>% rename(faa = origin)  We also pare down the datasets so we can see the joins more clearly: {r} airports %<>% select(faa, lat, lon) weather %<>% select(faa, time_hour, temp, humid, wind_speed, pressure, wind_gust) head(airports) head(weather) airports_dt = data.table(airports) weather_dt = data.table(weather)  Some features just aren't measured that often e.g. wind_gust. Let's do some joins. First "left". This is likely the most common because it's usually how we conceptualize what we're doing in our heads. {r} airports_and_weather = left_join(airports, weather, by = "faa") airports_and_weather %>% sample_n(500) airports_and_weather_dt = merge(airports_dt, weather_dt, by = "faa", all.x = TRUE) airports_and_weather_dt = merge(airports_dt, weather_dt, all.x = TRUE) #note this works too since it knows faa is the only column in common but not recommended since specifying "by" is more clear airports_and_weather_dt[sample(1 : .N, 500)]  Now "right": {r} airports_and_weather = right_join(airports, weather, by = "faa") airports_and_weather %>% sample_n(500) airports_and_weather_dt = merge(airports_dt, weather_dt, by = "faa", all.y = TRUE) airports_and_weather_dt = merge(airports_dt, weather_dt, all.y = TRUE) airports_and_weather_dt[sample(1 : .N, 500)]  And "inner", keeping only the rows that match in both tables: {r} airports_and_weather = inner_join(airports, weather, by = "faa") airports_and_weather %>% sample_n(500) airports_and_weather_dt = merge(airports_dt, weather_dt, by = "faa") airports_and_weather_dt = merge(airports_dt, weather_dt) airports_and_weather_dt[sample(1 : .N, 500)]  And full, keeping all the rows.
We use a subset to show how this works: {r} airports_without_EWR = airports %>% filter(faa != "EWR") airports_without_EWR_dt = data.table(airports_without_EWR) airports_without_EWR_and_weather = full_join(airports_without_EWR, weather, by = "faa") airports_without_EWR_and_weather %>% sample_n(500) airports_without_EWR_and_weather_dt = merge(airports_without_EWR_dt, weather_dt, by = "faa", all = TRUE) airports_without_EWR_and_weather_dt = merge(airports_without_EWR_dt, weather_dt, all = TRUE) airports_without_EWR_and_weather_dt[sample(.N, 500)]  There are also semi_join and anti_join which do the opposite of joining. In my experience, these use cases are limited. # Logistic Regression for Binary Response Let's clean up and load the cancer dataset, remove missing data, remove the ID column and add more appropriate feature names: {r} biopsy = MASS::biopsy biopsy$ID = NULL biopsy = na.omit(biopsy) colnames(biopsy) = c( #should've done this a while ago!!! "clump_thickness", "cell_size_uniformity", "cell_shape_uniformity", "marginal_adhesion", "epithelial_cell_size", "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitoses", "class" ) head(biopsy$class)  We can either estimate the probability of the biopsy tissue being benign (this would mean y = 1 is the benign category level) or estimate the probability of the biopsy tissue being malignant (this would mean y = 1 is the malignant category level). Let's go with the latter. To make the encoding explicitly 0/1, we can cast the factor to numeric or we can rely on R's default factor representation i.e. that the first level is 0 and the second level is 1. Here, we can use this default without reordering since the levels above show that benign is first and thus = 0 and malignant is second and thus = 1 (via coincidence of alphabetical order).
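One quick way to convince yourself of this default encoding (this relies on the biopsy data frame built above):

```r
#factor levels default to alphabetical order, so benign is the first level and malignant the second
levels(biopsy$class)
#casting to numeric yields 1 and 2, so subtracting 1 gives the 0/1 encoding with malignant = 1
table(as.numeric(biopsy$class) - 1, biopsy$class)
```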
Now let's split into training and test for experiments: {r} set.seed(1984) K = 5 test_prop = 1 / K train_indices = sample(1 : nrow(biopsy), round((1 - test_prop) * nrow(biopsy))) biopsy_train = biopsy[train_indices, ] y_train = biopsy_train$class X_train = biopsy_train X_train$class = NULL test_indices = setdiff(1 : nrow(biopsy), train_indices) biopsy_test = biopsy[test_indices, ] y_test = biopsy_test$class X_test = biopsy_test X_test$class = NULL  Let's fit a linear logistic regression model. We use the function glm which looks a lot like lm except we have to set the family parameter to be "binomial" which means we are using the independent Bernoulli and within the binomial family, we are using the "logit" link. There are other types of family models we won't get a chance to study e.g. Poisson, negative binomial for count models. {r} logistic_mod = glm(class ~ ., biopsy_train, family = binomial(link = "logit"))  That was fast! There was actually a lot of optimization in that line. Let's look at the $b$ vector that was fit: {r} coef(logistic_mod)  Interpretation? If clump thickness increases by one unit the log odds of malignancy increases by 0.597... All of the coefficients are positive which means if any of the covariates increase... And let's take a look at the fitted values: {r} head(predict(logistic_mod, biopsy_train))  What's that? Those are the values of the linear predictor, i.e. on the link scale (predict's default for glm objects). In this case, they are the log-odds of being malignant. If you can read log odds, you'll see ... has a small chance of being malignant and ... has a high probability of being malignant. It's not that hard to read log odds... What if we want probabilities?
We can tell the predict function for glm to give them to us explicitly: {r} head(predict(logistic_mod, biopsy_train, type = "response"))  Let's take a look at all the in-sample probability estimates: {r} p_hats_train = predict(logistic_mod, biopsy_train, type = "response") pacman::p_load(ggplot2) ggplot(data.frame(p_hats_train = p_hats_train, y_train = y_train)) + geom_histogram(aes(x = p_hats_train, fill = y_train), alpha = 0.5)  It's very sure of itself! Let's see $y$ by $\hat{p}$ another way: {r} ggplot(data.frame(p_hats_train = p_hats_train, y_train = factor(y_train))) + geom_boxplot(aes(x = y_train, y = p_hats_train))  Made only a few mistakes here and there in the training set! How about the test set? {r} p_hats_test = predict(logistic_mod, biopsy_test, type = "response") ggplot(data.frame(p_hats_test = p_hats_test, y_test = y_test)) + geom_histogram(aes(x = p_hats_test, fill = y_test), alpha = 0.5) ggplot(data.frame(p_hats_test = p_hats_test, y_test = factor(y_test))) + geom_boxplot(aes(x = y_test, y = p_hats_test))  Looks pretty good! We now will talk about error metrics for probabilistic estimation models. That will give us a way to validate this model and provide an estimate of future performance. What is the in-sample average Brier score? {r} mean(-(y_train - p_hats_train)^2)  Yup, we can't do arithmetic operations on factors. So now we have to go ahead and cast. {r} y_train_binary = ifelse(y_train == "malignant", 1, 0) mean(-(y_train_binary - p_hats_train)^2)  This is a very good Brier score! Again, most of the probabilities were spot on. And the oos Brier score? {r} y_test_binary = ifelse(y_test == "malignant", 1, 0) mean(-(y_test_binary - p_hats_test)^2)  Not as good but still very good! What is the in-sample log score? {r} mean(y_train_binary * log(p_hats_train) + (1 - y_train_binary) * log(1 - p_hats_train))  This isn't bad (if you get intuition in reading them). And oos?
{r} mean(y_test_binary * log(p_hats_test) + (1 - y_test_binary) * log(1 - p_hats_test))  Not as good but still very good! If we wanted to be more careful, we can use K-fold CV to get a less variable oos metric. Maybe we'll do that in a lab? # Probit and Cloglog probability estimation These are different generalized linear models but fit using the same code. All we need to do is change the link argument. For a probit regression we just do: {r} probit_mod = glm(class ~ ., biopsy_train, family = binomial(link = "probit"))  This is complaining about numerical underflow or overflow. If you get a z-score that's really large in magnitude, then it says the probability is 1 (if the z-score is positive) or 0 (if the z-score is negative). {r} coef(probit_mod)  As we saw before, all coefficients for the covariates are positive. What's the interpretation of b for bare_nuclei? Let's take a look at all the in-sample probability estimates: {r} p_hats_train = predict(probit_mod, biopsy_train, type = "response") pacman::p_load(ggplot2) ggplot(data.frame(p_hats_train = p_hats_train, y_train = y_train)) + geom_histogram(aes(x = p_hats_train, fill = y_train), alpha = 0.5)  This is basically the same. How about out of sample? {r} p_hats_test = predict(probit_mod, biopsy_test, type = "response") ggplot(data.frame(p_hats_test = p_hats_test, y_test = y_test)) + geom_histogram(aes(x = p_hats_test, fill = factor(y_test)), alpha = 0.5)  Also basically the same-looking. To get an apples-to-apples comparison with logistic regression let's calculate the Brier and log scoring metrics: {r} mean(-(y_train_binary - p_hats_train)^2) mean(-(y_test_binary - p_hats_test)^2) mean(y_train_binary * log(p_hats_train) + (1 - y_train_binary) * log(1 - p_hats_train)) mean(y_test_binary * log(p_hats_test) + (1 - y_test_binary) * log(1 - p_hats_test))  It appears the logistic regression is better oos.
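For reference, the three binomial link functions in this unit map the linear predictor, call it $z$, into a probability via different S-shaped curves:

$$p = \frac{e^{z}}{1 + e^{z}} \;\;\text{(logit)}, \qquad p = \Phi(z) \;\;\text{(probit)}, \qquad p = 1 - e^{-e^{z}} \;\;\text{(cloglog)}$$

where $\Phi$ is the standard normal CDF. This is why each link constitutes a completely different model: the same linear predictor gets squashed into $[0, 1]$ differently.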
Let's do complementary log-log too: {r} cloglog_mod = glm(class ~ ., biopsy_train, family = binomial(link = "cloglog")) coef(cloglog_mod)  Same signs on coefficients. Interpretation? Difficult... Let's see how it does compared to the logistic and probit models. {r} p_hats_train = predict(cloglog_mod, biopsy_train, type = "response") p_hats_test = predict(cloglog_mod, biopsy_test, type = "response") mean(-(y_train_binary - p_hats_train)^2) mean(-(y_test_binary - p_hats_test)^2) mean(y_train_binary * log(p_hats_train) + (1 - y_train_binary) * log(1 - p_hats_train)) mean(y_test_binary * log(p_hats_test) + (1 - y_test_binary) * log(1 - p_hats_test))  Much worse than either! Logistic regression is usually the default. But just because it's the default and most popular and just because it won here doesn't mean it will always win!! Using probit or any other link function constitutes a completely different model. You can use the "model selection procedure" we discussed before to pick the best one. {r} rm(list = ls())  Let's try a harder project... load up the adult dataset where the response is 1 if the person makes more than \$50K per year and 0 if they make less than \$50K per year in 1994 dollars. That's the equivalent of almost \$90K/yr today (see https://www.in2013dollars.com/us/inflation/1994?amount=50000).
{r} pacman::p_load_gh("coatless/ucidata") data(adult) adult = na.omit(adult) #kill any observations with missingness str(adult) ?adult adult$fnlwgt = NULL adult$occupation = NULL adult$native_country = NULL  Let's use samples of 5,000 to run experiments: {r} train_size = 5000 train_indices = sample(1 : nrow(adult), train_size) adult_train = adult[train_indices, ] y_train = adult_train$income X_train = adult_train X_train$income = NULL test_size = 5000 test_indices = sample(setdiff(1 : nrow(adult), train_indices), test_size) adult_test = adult[test_indices, ] y_test = adult_test$income X_test = adult_test X_test$income = NULL  Let's fit a logistic regression model to the training data: {r} logistic_mod = glm(income ~ ., adult_train, family = "binomial") #shortcut for binomial(link = "logit")  Numeric errors already! Let's see what the model looks like: {r} coef(logistic_mod) length(coef(logistic_mod))  There may be NA's above due to numeric errors. Usually happens if there is linear dependence (or near linear dependence). Interpretation? Let's take a look at the fitted probability estimates: {r} head(predict(logistic_mod, adult_train, type = "response"))  Let's take a look at all the in-sample probability estimates: {r} p_hats_train = predict(logistic_mod, adult_train, type = "response") pacman::p_load(ggplot2) ggplot(data.frame(p_hats_train = p_hats_train, y_train = y_train)) + geom_histogram(aes(x = p_hats_train, fill = factor(y_train)), alpha = 0.5)  Much more humble!! It's not a very confident model since this task is much harder! In fact it's never confident about the large incomes and usually confident about the small incomes. Let's see $y$ by $\hat{p}$: {r} ggplot(data.frame(p_hats_train = p_hats_train, y_train = factor(y_train))) + geom_boxplot(aes(x = y_train, y = p_hats_train))  Making lots of mistakes! Note that the x-axis is the native category label since we never coded as 0, 1. The default is that the first label is 0 and the second is 1.
The labels are defaulted to alphabetical order. What is the in-sample average Brier score? {r} mean(-(y_train - p_hats_train)^2)  Can't use factors here. Need to code the response as 0/1: {r} y_train_numeric = ifelse(y_train == ">50K", 1, 0) #the positive class here is ">50K", not "malignant" as in the biopsy data mean(-(y_train_numeric - p_hats_train)^2)  This is worse than the previous dataset but not terrible. The null model gives what? {r} mean(-(y_train_numeric - rep(mean(y_train_numeric), length(y_train_numeric)))^2)  So relative to the null model, this is a decent Brier score! But this was in sample! Let's see what happens out of sample... {r} p_hats_test = predict(logistic_mod, adult_test, type = "response") ggplot(data.frame(p_hats_test = p_hats_test, y_test = y_test)) + geom_histogram(aes(x = p_hats_test, fill = factor(y_test)), alpha = 0.5)  Looks similar to training. And the Brier score? {r} y_test_numeric = as.numeric(y_test) - 1 mean(-(y_test_numeric - p_hats_test)^2)  The oos performance is about the same as the in-sample performance so we probably didn't overfit. Brier scores only make sense if you know how to read Brier scores. It's kind of like learning a new language. However, everyone understands classification errors! # Using Probability Estimation to do Classification First repeat quickly (a) load the adult data (b) do a training / test split and (c) build the logistic model.
{r} rm(list = ls()) pacman::p_load_gh("coatless/ucidata") data(adult) adult = na.omit(adult) #kill any observations with missingness set.seed(1) train_size = 5000 train_indices = sample(1 : nrow(adult), train_size) adult_train = adult[train_indices, ] y_train = adult_train$income X_train = adult_train X_train$income = NULL test_size = 5000 test_indices = sample(setdiff(1 : nrow(adult), train_indices), test_size) adult_test = adult[test_indices, ] y_test = adult_test$income X_test = adult_test X_test$income = NULL logistic_mod = glm(income ~ ., adult_train, family = "binomial") p_hats_train = predict(logistic_mod, adult_train, type = "response") p_hats_test = predict(logistic_mod, adult_test, type = "response")  Let's establish a rule: if the probability estimate is greater than or equal to 50%, let's classify the observation as positive, otherwise negative. {r} y_hats_train = factor(ifelse(p_hats_train >= 0.5, ">50K", "<=50K"))  How did this "classifier" do in-sample? {r} mean(y_hats_train != y_train) in_sample_conf_table = table(y_train, y_hats_train) in_sample_conf_table  And the performance stats: {r} n = sum(in_sample_conf_table) fp = in_sample_conf_table[1, 2] fn = in_sample_conf_table[2, 1] tp = in_sample_conf_table[2, 2] tn = in_sample_conf_table[1, 1] num_pred_pos = sum(in_sample_conf_table[, 2]) num_pred_neg = sum(in_sample_conf_table[, 1]) num_pos = sum(in_sample_conf_table[2, ]) num_neg = sum(in_sample_conf_table[1, ]) precision = tp / num_pred_pos cat("precision", round(precision * 100, 2), "%\n") recall = tp / num_pos cat("recall", round(recall * 100, 2), "%\n") false_discovery_rate = 1 - precision cat("false_discovery_rate", round(false_discovery_rate * 100, 2), "%\n") false_omission_rate = fn / num_pred_neg cat("false_omission_rate", round(false_omission_rate * 100, 2), "%\n")  That was in-sample which may be overfit. How about oos?
{r} y_hats_test = factor(ifelse(p_hats_test >= 0.5, ">50K", "<=50K")) mean(y_hats_test != y_test) oos_conf_table = table(y_test, y_hats_test) oos_conf_table  A tad bit worse. Here are estimates of the future performance for each class: {r} n = sum(oos_conf_table) fp = oos_conf_table[1, 2] fn = oos_conf_table[2, 1] tp = oos_conf_table[2, 2] tn = oos_conf_table[1, 1] num_pred_pos = sum(oos_conf_table[, 2]) num_pred_neg = sum(oos_conf_table[, 1]) num_pos = sum(oos_conf_table[2, ]) num_neg = sum(oos_conf_table[1, ]) precision = tp / num_pred_pos cat("precision", round(precision * 100, 2), "%\n") recall = tp / num_pos cat("recall", round(recall * 100, 2), "%\n") false_discovery_rate = 1 - precision cat("false_discovery_rate", round(false_discovery_rate * 100, 2), "%\n") false_omission_rate = fn / num_pred_neg cat("false_omission_rate", round(false_omission_rate * 100, 2), "%\n")  Worse than in-sample (expected). But still could be "good enough" depending on your definition of "good enough". However... this whole classifier hinged on the decision of the prob-threshold = 50%! What if we change this default threshold?? # Asymmetric Cost Classifiers Let's establish a *new* rule: if the probability estimate is greater than or equal to 90%, let's classify the observation as positive, otherwise negative. {r} y_hats_test = factor(ifelse(p_hats_test >= 0.9, ">50K", "<=50K")) mean(y_hats_test != y_test) oos_conf_table = table(y_test, y_hats_test) oos_conf_table  Of course the misclassification error went up! But now look at the confusion table! The second column represents all $\hat{y} = 1$ and there are not too many of them! Why? You've made it *much* harder to classify something as positive.
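A tiny sketch of why raising the threshold shrinks the predicted-positive column (the probability estimates here are invented for illustration):

```r
p = c(0.2, 0.55, 0.7, 0.95)
sum(p >= 0.5) #3 observations classified positive
sum(p >= 0.9) #only 1 classified positive; the other two become (possibly false) negatives
```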
Here are the new performance metrics now:

```{r}
n = sum(oos_conf_table)
fp = oos_conf_table[1, 2]
fn = oos_conf_table[2, 1]
tp = oos_conf_table[2, 2]
tn = oos_conf_table[1, 1]
num_pred_pos = sum(oos_conf_table[, 2])
num_pred_neg = sum(oos_conf_table[, 1])
num_pos = sum(oos_conf_table[2, ])
num_neg = sum(oos_conf_table[1, ])
precision = tp / num_pred_pos
cat("precision", round(precision * 100, 2), "%\n")
recall = tp / num_pos
cat("recall", round(recall * 100, 2), "%\n")
false_discovery_rate = 1 - precision
cat("false_discovery_rate", round(false_discovery_rate * 100, 2), "%\n")
false_omission_rate = fn / num_pred_neg
cat("false_omission_rate", round(false_omission_rate * 100, 2), "%\n")
```

We don't make many false discoveries but we make a lot of false omissions! It's a tradeoff...

# Receiver Operating Characteristic (ROC) Curve

The entire family of classifiers is indexed by the probability threshold inside the indicator function that creates the classification decision. Why not look at the entire range of possible classification models? We do this with a function. We will go through it slowly and explain each piece:

```{r}
#' Computes performance metrics for a binary probabilistic classifier
#'
#' Each row of the result will represent one of the many models and its elements record the performance of that model so we can (1) pick a "best" model at the end and (2) overall understand the performance of the probability estimates a la the Brier scores, etc.
#'
#' @param p_hats  The probability estimates for n predictions
#' @param y_true  The true observed responses
#' @param res     The resolution to use for the grid of threshold values (defaults to 1e-3)
#'
#' @return        The matrix of all performance results
compute_metrics_prob_classifier = function(p_hats, y_true, res = 0.001){
  #we first make the grid of all prob thresholds
  p_thresholds = seq(0 + res, 1 - res, by = res) #values of 0 or 1 are trivial

  #now we create a matrix which will house all of our results
  performance_metrics = matrix(NA, nrow = length(p_thresholds), ncol = 12)
  colnames(performance_metrics) = c(
    "p_th",
    "TN", "FP", "FN", "TP",
    "miscl_err", "precision", "recall", "FDR", "FPR", "FOR", "miss_rate"
  )

  #now we iterate through each p_th and calculate all metrics about the classifier and save
  n = length(y_true)
  for (i in 1 : length(p_thresholds)){
    p_th = p_thresholds[i]
    y_hats = factor(ifelse(p_hats >= p_th, ">50K", "<=50K"))
    confusion_table = table(
      factor(y_true, levels = c("<=50K", ">50K")),
      factor(y_hats, levels = c("<=50K", ">50K"))
    )
    fp = confusion_table[1, 2]
    fn = confusion_table[2, 1]
    tp = confusion_table[2, 2]
    tn = confusion_table[1, 1]
    npp = sum(confusion_table[, 2])
    npn = sum(confusion_table[, 1])
    np = sum(confusion_table[2, ])
    nn = sum(confusion_table[1, ])
    performance_metrics[i, ] = c(
      p_th,
      tn, fp, fn, tp,
      (fp + fn) / n,
      tp / npp, #precision
      tp / np,  #recall
      fp / npp, #false discovery rate (FDR)
      fp / nn,  #false positive rate (FPR)
      fn / npn, #false omission rate (FOR)
      fn / np   #miss rate
    )
  }

  #finally return the matrix
  performance_metrics
}
```

Now let's generate performance results for the in-sample data:

```{r}
pacman::p_load(data.table, magrittr)
performance_metrics_in_sample = compute_metrics_prob_classifier(p_hats_train, y_train) %>% data.table
performance_metrics_in_sample
```

Now let's plot the ROC curve:

```{r}
pacman::p_load(ggplot2)
ggplot(performance_metrics_in_sample) +
  geom_line(aes(x = FPR, y = recall)) +
  geom_abline(intercept = 0, slope = 1, col = "orange") +
  coord_fixed() + xlim(0, 1) + ylim(0, 1)
```

Now calculate the area under the curve (AUC), which is used to evaluate the probabilistic classifier (just like the Brier score), using a trapezoid area function:

```{r}
pacman::p_load(pracma)
-trapz(performance_metrics_in_sample$FPR, performance_metrics_in_sample$recall)
```

This is not bad at all! (The negation is needed because the FPR values run from high to low as the threshold increases. Also, note that I should add the points $\left<0, 0\right>$ and $\left<1, 1\right>$ before computing the area, but I didn't...) How do we do out of sample?

```{r}
performance_metrics_oos = compute_metrics_prob_classifier(p_hats_test, y_test) %>% data.table
performance_metrics_oos
```

And graph the ROC:

```{r}
#first we do our own melting of two data frames together to make it long format
performance_metrics_in_and_oos = rbind(
  cbind(performance_metrics_in_sample, data.table(sample = "in")),
  cbind(performance_metrics_oos, data.table(sample = "out"))
)
ggplot(performance_metrics_in_and_oos) +
  geom_line(aes(x = FPR, y = recall, col = sample)) +
  geom_abline(intercept = 0, slope = 1, col = "orange") +
  coord_fixed() + xlim(0, 1) + ylim(0, 1)
```

```{r}
-trapz(performance_metrics_oos$FPR, performance_metrics_oos$recall)
```

Not bad at all - only a tad worse! In the real world it's usually a lot worse. We are lucky we have n = 5,000 in both the train and test sets.

# Detection Error Tradeoff (DET) Curve

```{r}
table(y_test) / length(y_test)
table(y_train) / length(y_train)

ggplot(performance_metrics_in_and_oos) +
  geom_line(aes(x = FDR, y = miss_rate, col = sample)) +
  coord_fixed() + xlim(0, 1) + ylim(0, 1)
```

# Using AUC to Compare Probabilistic Classification Models

What would the effect be of less information on the same training set size? Imagine we didn't know the features: occupation, education, education_num, relationship, marital_status. How would we do relative to the above? Worse!
```{r}
pacman::p_load(data.table, tidyverse, magrittr)
if (!pacman::p_isinstalled(ucidata)){
  pacman::p_load_gh("coatless/ucidata")
} else {
  pacman::p_load(ucidata)
}
data(adult)
adult = na.omit(adult) #kill any observations with missingness
set.seed(1)
train_size = 5000
train_indices = sample(1 : nrow(adult), train_size)
adult_train = adult[train_indices, ]
y_train = adult_train$income
X_train = adult_train
X_train$income = NULL
test_size = 5000
test_indices = sample(setdiff(1 : nrow(adult), train_indices), test_size)
adult_test = adult[test_indices, ]
y_test = adult_test$income
X_test = adult_test
X_test$income = NULL

logistic_mod_full = glm(income ~ ., adult_train, family = "binomial")
p_hats_test = predict(logistic_mod_full, adult_test, type = "response")
performance_metrics_oos_full_mod = compute_metrics_prob_classifier(p_hats_test, y_test) %>% data.table

logistic_mod_red = glm(income ~ . - occupation - education - education_num - relationship - marital_status, adult_train, family = "binomial")
p_hats_test = predict(logistic_mod_red, adult_test, type = "response")
performance_metrics_oos_reduced_mod = compute_metrics_prob_classifier(p_hats_test, y_test) %>% data.table

ggplot(rbind(
    performance_metrics_oos_full_mod[, model := "full"],
    performance_metrics_oos_reduced_mod[, model := "reduced"]
  )) +
  geom_point(aes(x = FPR, y = recall, shape = model, col = p_th), size = 1) +
  geom_abline(intercept = 0, slope = 1) +
  coord_fixed() + xlim(0, 1) + ylim(0, 1) +
  scale_colour_gradientn(colours = rainbow(5))
```

We can see clearly that the reduced model's AUC is worse. In fact, the full model dominates the reduced model at every FPR / TPR pair:

```{r}
pacman::p_load(pracma)
-trapz(performance_metrics_oos_reduced_mod$FPR, performance_metrics_oos_reduced_mod$recall)
-trapz(performance_metrics_oos_full_mod$FPR, performance_metrics_oos_full_mod$recall)
```

As we lose information that is related to the true causal inputs, we lose predictive ability.
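The `trapz` calls above just apply the trapezoidal rule. A hand-rolled sketch on a few hypothetical ROC points (with the $\left<0, 0\right>$ and $\left<1, 1\right>$ endpoints included, and sorted so FPR is increasing), assuming nothing beyond base R:

```r
#hypothetical ROC points, already sorted by increasing FPR
fpr    = c(0, 0.1, 0.4, 1)
recall = c(0, 0.6, 0.9, 1)
#each trapezoid's area is its width times its average height
auc = sum(diff(fpr) * (head(recall, -1) + tail(recall, -1)) / 2)
auc #0.825
```

Since the ROC points here run left to right, no negation is needed; `-trapz` above only flips the sign that arises from traversing the FPR values from high to low.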
The same story plays out across this entire data science class: error due to ignorance increases as we lose informative features! And it is certainly no different in probabilistic classifiers. Here's the same story with the DET curve:

```{r}
ggplot(rbind(
    performance_metrics_oos_full_mod[, model := "full"],
    performance_metrics_oos_reduced_mod[, model := "reduced"]
  )) +
  geom_point(aes(x = FDR, y = miss_rate, shape = model, col = p_th), size = 1) +
  coord_fixed() + xlim(0, 1) + ylim(0, 1) +
  scale_colour_gradientn(colours = rainbow(5))
```

# Choosing a Decision Threshold Based on Asymmetric Costs and Rewards

The ROC and DET curves gave you a glimpse into all the possible classification models derived from a probability estimation model. Each point on those curves is a separate $g(x)$ with its own performance metrics. How do you pick one?

Let's create rewards and costs. Imagine we are trying to predict income because we want to sell people an expensive item, e.g. a car. We want to advertise our cars via a nice packet in the mail. The packet costs \$5. If we send a packet to someone who really does make $>50K$/yr then we are expected to make \$1000. So we have rewards and costs below:

```{r}
r_tp = 0
c_fp = -5
c_fn = -1000
r_tn = 0
```

Let's return to the linear logistic model with all features. Let's calculate the overall oos average reward per observation (per person) for each possible $p_{th}$:

```{r}
n = nrow(adult_test)
performance_metrics_oos_full_mod$avg_cost =
  (r_tp * performance_metrics_oos_full_mod$TP +
  c_fp * performance_metrics_oos_full_mod$FP +
  c_fn * performance_metrics_oos_full_mod$FN +
  r_tn * performance_metrics_oos_full_mod$TN) / n
```

Let's plot average reward (reward per person) by threshold:

```{r}
ggplot(performance_metrics_oos_full_mod) +
  geom_line(aes(x = p_th, y = avg_cost)) #+
  # xlim(0, 0.05) + ylim(-5, 0)
```

Obviously, the best decision is $p_{th} \approx 0$, which means you classify almost everything as a positive. This makes sense because the mailing is so cheap.
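The reward-maximizing threshold computation can be packaged as a small reusable helper. A sketch (the function name and the toy confusion counts below are invented for illustration, not taken from the adult data):

```r
#sketch: pick the threshold that maximizes average reward per person, given
#confusion counts at each candidate threshold
best_threshold = function(p_th, TP, FP, FN, TN, r_tp, c_fp, c_fn, r_tn){
  n = TP + FP + FN + TN #total count; the same at every threshold
  avg_reward = (r_tp * TP + c_fp * FP + c_fn * FN + r_tn * TN) / n
  p_th[which.max(avg_reward)]
}

p_th = c(0.1, 0.5, 0.9) #higher threshold => fewer predicted positives
TP = c(90, 70, 30); FP = c(500, 150, 10)
FN = c(10, 30, 70); TN = c(400, 750, 890)
best_threshold(p_th, TP, FP, FN, TN, r_tp = 0, c_fp = -5, c_fn = -1000, r_tn = 0) #0.1: cheap mailings favor a low threshold
```

With a much costlier mailing (e.g. `c_fp = -200`), the same toy counts favor the middle threshold instead.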
What are the performance characteristics of the optimal model?

```{r}
i_star = which.max(performance_metrics_oos_full_mod$avg_cost)
performance_metrics_oos_full_mod[i_star, ]
# performance_metrics_oos_full_mod[, .(p_th, avg_cost)]
```

The more interesting problem is where the cost of advertising is higher:

```{r}
r_tp = 0
c_fp = -200
c_fn = -1000
r_tn = 0

performance_metrics_oos_full_mod$avg_cost =
  (r_tp * performance_metrics_oos_full_mod$TP +
  c_fp * performance_metrics_oos_full_mod$FP +
  c_fn * performance_metrics_oos_full_mod$FN +
  r_tn * performance_metrics_oos_full_mod$TN) / n

ggplot(performance_metrics_oos_full_mod) +
  geom_point(aes(x = p_th, y = avg_cost), lwd = 0.01)
```

What are the performance characteristics of the optimal model?

```{r}
i_star = which.max(performance_metrics_oos_full_mod$avg_cost)
performance_metrics_oos_full_mod[i_star, ]
```

If $g_{pr}$ is closer to $f_{pr}$, what happens? All the threshold-derived classification models get better and you are guaranteed to make more money since you have a better discriminating eye. There is also a way to make asymmetric models with trees. We may do this later if we have time.

# Classification Trees and Confusion Tables

Let's load up the adult dataset where the response is 1 if the person makes more than \$50K per year and 0 if they make less than \$50K per year.

```{r}
pacman::p_load(YARF, magrittr) #YARF for the tree models below, magrittr for %<>%
pacman::p_load_gh("coatless/ucidata")
data(adult)
adult %<>% na.omit #kill any observations with missingness
```

Let's use samples of 2,000 to run experiments:

```{r}
train_size = 2000
train_indices = sample(1 : nrow(adult), train_size)
adult_train = adult[train_indices, ]
y_train = adult_train$income
X_train = adult_train
X_train$income = NULL
test_indices = sample(setdiff(1 : nrow(adult), train_indices), train_size)
adult_test = adult[test_indices, ]
y_test = adult_test$income
X_test = adult_test
X_test$income = NULL
```

Make a tree:

```{r}
tree_mod = YARFCART(X_train, y_train)
tree_mod
```

How "big" is this tree model?
```{r}
get_tree_num_nodes_leaves_max_depths(tree_mod)
```

What are the "main" splits?

```{r}
illustrate_trees(tree_mod, max_depth = 4, open_file = TRUE)
```

Compute in-sample and out of sample fits:

```{r}
y_hat_train = predict(tree_mod, X_train)
y_hat_test = predict(tree_mod, X_test)
```

Let's look at the confusion table in-sample:

```{r}
table(y_train, y_hat_train)
```

There are no errors here! Thus, precision and recall are both 100%. This makes sense because classification trees overfit. Let's do the same oos:

```{r}
oos_conf_table = table(y_test, y_hat_test)
oos_conf_table
```

We didn't do as well (of course). Let's calculate some performance metrics. We assume ">50K" is the "positive" category and "<=50K" is the "negative" category. Note that this choice is arbitrary and everything would just be switched if we did it the other way.

```{r}
n = sum(oos_conf_table)
n
fp = oos_conf_table[1, 2]
fn = oos_conf_table[2, 1]
tp = oos_conf_table[2, 2]
tn = oos_conf_table[1, 1]
num_pred_pos = sum(oos_conf_table[, 2])
num_pred_neg = sum(oos_conf_table[, 1])
num_pos = sum(oos_conf_table[2, ])
num_neg = sum(oos_conf_table[1, ])
acc = (tp + tn) / n
acc
misclassification_error = 1 - acc
misclassification_error
precision = tp / num_pred_pos
precision
recall = tp / num_pos
recall
false_discovery_rate = 1 - precision
false_discovery_rate
false_omission_rate = fn / num_pred_neg
false_omission_rate
```

Let's see how this works on a dataset whose goal is classification with more than 2 levels. Note: this is only possible now with trees!
```{r}
rm(list = ls())
pacman::p_load(mlbench, skimr)
data(LetterRecognition)
LetterRecognition = na.omit(LetterRecognition) #kill any observations with missingness
skim(LetterRecognition)
?LetterRecognition
```

Now we split the data:

```{r}
test_samp = 500
train_indices = sample(1 : nrow(LetterRecognition), test_samp)
ltr_train = LetterRecognition[train_indices, ]
y_train = ltr_train$lettr
X_train = ltr_train
X_train$lettr = NULL
test_indices = sample(setdiff(1 : nrow(LetterRecognition), train_indices), test_samp)
ltr_test = LetterRecognition[test_indices, ]
y_test = ltr_test$lettr
X_test = ltr_test
X_test$lettr = NULL
```

And fit a tree model and its in-sample and oos fits:

```{r}
tree_mod = YARFCART(X_train, y_train)
y_hat_train = predict(tree_mod, X_train)
y_hat_test = predict(tree_mod, X_test)
```

Take a look at the in-sample confusion matrix:

```{r}
table(y_train, y_hat_train)
```

Perfecto... as expected...

Now the oos confusion matrix:

```{r}
oos_confusion_table = table(y_test, y_hat_test)
oos_confusion_table
```

Hard to read. Let's make it easier to read by blanking out the entries less than 2 and marking the diagonal with dots:

```{r}
oos_confusion_table[oos_confusion_table < 2] = ""
diag(oos_confusion_table) = "."
oos_confusion_table
mean(y_test != y_hat_test)
```

What is it using to determine the letter?

```{r}
illustrate_trees(tree_mod, max_depth = 3, open_file = TRUE)
```

Where did these features come from?? Deep learning helps to create the features from the raw pixel data. Wish I had a whole next semester to discuss this...
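A multiclass confusion table also yields per-class versions of recall and precision in a couple of base-R lines. A sketch on a hypothetical 3-letter toy (invented labels, not the full 26-letter data):

```r
y     = factor(c("A", "A", "A", "B", "B", "C", "C", "C", "C", "C")) #true letters
y_hat = factor(c("A", "A", "B", "B", "B", "C", "C", "C", "A", "C")) #predicted letters
conf = table(y, y_hat)
recall_by_class    = diag(conf) / rowSums(conf) #fraction of each true class recovered
precision_by_class = diag(conf) / colSums(conf) #fraction of each predicted class correct
round(recall_by_class, 2)
```

Averaging these per-class numbers (macro-averaging) is one common way to summarize multiclass performance in a single figure.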
Random Forests:

```{r}
pacman::p_load_gh("coatless/ucidata") #reload adult since we cleared the workspace above
data(adult)
adult = na.omit(adult) #kill any observations with missingness

num_trees = 500
train_size = 2000
training_indices = sample(1 : nrow(adult), train_size)
adult_train = adult[training_indices, ]
y_train = adult_train$income
X_train = adult_train
X_train$income = NULL
mod_bag = YARFBAG(X_train, y_train, num_trees = num_trees, calculate_oob_error = FALSE)
mod_rf = YARF(X_train, y_train, num_trees = num_trees, calculate_oob_error = FALSE)
```

And test:

```{r}
test_indices = sample(setdiff(1 : nrow(adult), training_indices), 25000)
adult_test = adult[test_indices, ]
y_test = adult_test$income
X_test = adult_test
X_test$income = NULL
y_hat_test_bag = predict(mod_bag, X_test)
y_hat_test_rf = predict(mod_rf, X_test)
oos_conf_table_bag = table(y_test, y_hat_test_bag)
oos_conf_table_rf = table(y_test, y_hat_test_rf)
oos_conf_table_bag
oos_conf_table_rf
miscl_err_bag = mean(y_test != y_hat_test_bag)
miscl_err_rf = mean(y_test != y_hat_test_rf)
miscl_err_bag
miscl_err_rf
cat("gain: ", (miscl_err_bag - miscl_err_rf) / miscl_err_bag * 100, "%\n")
```

And on letters:

```{r}
test_samp = 2000
train_indices = sample(1 : nrow(LetterRecognition), test_samp)
ltr_train = LetterRecognition[train_indices, ]
y_train = ltr_train$lettr
X_train = ltr_train
X_train$lettr = NULL
test_indices = sample(setdiff(1 : nrow(LetterRecognition), train_indices), test_samp)
ltr_test = LetterRecognition[test_indices, ]
y_test = ltr_test$lettr
X_test = ltr_test
X_test$lettr = NULL
```

And fit the bagged-tree and RF models and compute their oos fits:

```{r}
mod_bag = YARFBAG(X_train, y_train, num_trees = num_trees, calculate_oob_error = FALSE)
mod_rf = YARF(X_train, y_train, num_trees = num_trees, calculate_oob_error = FALSE)
mod_bag
mod_rf
y_hat_test_bag = predict(mod_bag, X_test)
y_hat_test_rf = predict(mod_rf, X_test)
oos_conf_table_bag = table(y_test, y_hat_test_bag)
oos_conf_table_rf = table(y_test, y_hat_test_rf)
oos_conf_table_bag
oos_conf_table_rf
miscl_err_bag = mean(y_test != y_hat_test_bag)
miscl_err_rf = mean(y_test != y_hat_test_rf)
miscl_err_bag
miscl_err_rf
cat("gain: ", (miscl_err_bag - miscl_err_rf) / miscl_err_bag * 100, "%\n")
```

Very real gains for classification.

# Missingness

Take a look at a housing dataset from Australia: https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home?select=melb_data.csv#

```{r}
pacman::p_load(tidyverse, magrittr, data.table, skimr, R.utils)
apts = fread("melb_data.csv.bz2")
skim(apts)
```

We drop all character variables first just for expedience in the demo. If you were building a prediction model, you would scour them carefully to see if there is any signal in them you can use, and then mathematize them to metrics if so.

```{r}
apts %<>% select_if(is.numeric) %>% select(Price, everything())
```

Imagine we were trying to predict Price. So let's section our dataset:

```{r}
y = apts$Price
X = apts %>% select(-Price)
rm(apts)
```

Let's first create a matrix with $p$ columns that represents missingness:

```{r}
M = as_tibble(apply(is.na(X), 2, as.numeric))
colnames(M) = paste("is_missing_", colnames(X), sep = "")
M %<>% select_if(function(x){sum(x) > 0})
head(M)
skim(M)
```

Some of these missing indicators might be collinear because they share all the rows they are missing on. Let's filter those out if they exist:

```{r}
M = as_tibble(t(unique(t(M))))
skim(M)
```

Without imputing and without using missingness as a predictor in its own right, let's see what we get with a basic linear model now:

```{r}
lin_mod_listwise_deletion = lm(y ~ ., X)
summary(lin_mod_listwise_deletion)
```

Not bad ... but this is only on the data that has full records! There are 6,750 observations dropped!

Now let's impute using the missForest package. We cannot fit RF models to the entire dataset (it's 13,580 observations) so we will sample 2,000 observations for each of the trees. This is a typical strategy when fitting RF. It definitely reduces variance but increases bias. But we don't have a choice since we don't want to wait forever.
```{r}
pacman::p_load(missForest)
Ximp = missForest(data.frame(X), sampsize = rep(2000, ncol(X)))$ximp
skim(Ximp)
```

Now we consider our imputed dataset as the design matrix.

```{r}
linear_mod_impute = lm(y ~ ., Ximp)
summary(linear_mod_impute)
```

We can do even better if we use all the information, i.e. including the missingness. We take our imputed dataset and combine it with our missingness indicators for a new design matrix.

```{r}
Ximp_and_missing_dummies = data.frame(cbind(Ximp, M))
linear_mod_impute_and_missing_dummies = lm(y ~ ., Ximp_and_missing_dummies)
summary(linear_mod_impute_and_missing_dummies)
```

Not much gain, but there seems to be something.

Are these two better models than the original model that was built with listwise deletion of observations with missingness?? Are they even comparable? It is hard to compare the models since the first was built with only non-missing observations, which may be easy to predict on, and the second was built with the observations that contained missingness. Those extra 6,750 are likely more difficult to predict on. So this is complicated...

Maybe one apples-to-apples comparison is to replace all the missingness in the original dataset with something naive, e.g. the column average, and then see who does better. This at least keeps the same observations.

```{r}
X %<>% mutate(Rooms = as.numeric(Rooms))
Xnaive = X %>%
  replace_na(as.list(colMeans(X, na.rm = TRUE)))
linear_mod_naive_without_missing_dummies = lm(y ~ ., Xnaive)
summary(linear_mod_naive_without_missing_dummies)
```

There is a clear gain to imputing and using the is_missing dummy features to reduce delta, the error due to ignorance (55.3% vs 52.4% $R^2$'s). Note: this is just an illustration of best practice. It didn't necessarily have to "work".
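Both ingredients above - the missingness-indicator dummies and the naive column-mean fill - can be sketched in a few lines of base R on a toy data frame (the values below are invented for illustration, not the Melbourne data):

```r
X = data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = 1:3)

#missingness indicators, keeping only columns that have any NAs
M = as.data.frame(lapply(X, function(col) as.numeric(is.na(col))))
colnames(M) = paste0("is_missing_", colnames(X))
M = M[, colSums(M) > 0, drop = FALSE]

#naive imputation: replace each NA with its column mean
Xnaive = as.data.frame(lapply(X, function(col){
  col[is.na(col)] = mean(col, na.rm = TRUE)
  col
}))
M
Xnaive
```

Column `c` has no NAs, so it drops out of `M`; column `a`'s NA becomes the mean of 1 and 3.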
# Spurious Correlation

Take a look at the following real data:

```{r}
rm(list = ls())
pacman::p_load(tidyverse, magrittr, data.table)
spurious = data.frame(
  yearly_divorce_rate_maine_per_1000 = c(5, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1),
  yearly_US_consumption_margarine_per_capita = c(8.2, 7, 6.5, 5.3, 5.2, 4, 4.6, 4.5, 4.2, 3.7)
)
with(spurious, cor(yearly_divorce_rate_maine_per_1000, yearly_US_consumption_margarine_per_capita))
```

And visually,

```{r}
ggplot(spurious, aes(x = yearly_divorce_rate_maine_per_1000, y = yearly_US_consumption_margarine_per_capita)) +
  geom_point() + geom_smooth()
```

How did this happen? I looked at many, many different datasets until I found something impressive! Well, we can imagine doing the same thing. Let's look at a million datasets and find the dataset most correlated with the yearly consumption of margarine per capita:

```{r}
y = spurious$yearly_US_consumption_margarine_per_capita
n = length(y)
n_sim = 1e6
best_abs_corr = 0
best_random_xs = NULL
for (i in 1 : n_sim){
  x = rnorm(n)
  random_abs_corr = abs(cor(x, y))
  if (random_abs_corr > best_abs_corr){
    best_abs_corr = random_abs_corr
    best_random_xs = x
  }
}
spurious$best_random_xs = best_random_xs
best_abs_corr
```

And visually,

```{r}
ggplot(spurious, aes(x = best_random_xs, y = yearly_US_consumption_margarine_per_capita)) +
  geom_point() + geom_smooth() +
  ggtitle(paste("Spurious Correlation has |r| = ", round(best_abs_corr, 3)))
```

So what's the narrative here? If you look through a gajillion random features that have no causal connection with the phenomenon $y$, you will eventually find something that "clicks". It's the same as "chance capitalization" except taken to an extreme. Here are a whole bunch of them: https://www.tylervigen.com/spurious-correlations

However, these correlations will all vanish if you keep collecting data. Anything that is built upon falsehood will crumble!

# An Example of Correlation without Causation

When does correlation really not imply causation?
We now mean real correlation, not spurious correlation. This correlation will persist as the sample size increases.

From class, we spoke about the phenomenon y = "num car accidents" with observed feature x = "num umbrellas sold" but common cause z = "rain amount". It is clear that umbrella sales have *no causal* relationship with car accidents. But they *are correlated* because they are linked by a common cause. Here's the data example that makes this clear. The data generating process as specified by the causal diagram looks as follows:

```{r}
rm(list = ls())
set.seed(1)
n = 300
sigma = 0.3
umbrella_example_data = data.frame(
  z_rainfall = runif(n, 0, 6) #here's the common cause - rainfall
)
umbrella_example_data$x_umbrella_sales = umbrella_example_data$z_rainfall^2 + rnorm(n, 0, sigma) #x is a variable that is driven by z with noise of sd sigma
umbrella_example_data$y_car_accidents = umbrella_example_data$z_rainfall + rnorm(n, 0, sigma) #y is a variable driven by z with noise of sd sigma
```

So we only see $x$ and $y$. Here's what it appears as:

```{r}
pacman::p_load(tidyverse, data.table, magrittr)
ggplot(umbrella_example_data) +
  aes(x = x_umbrella_sales, y = y_car_accidents) +
  geom_point() +
  geom_smooth(method = "lm")
```

and the model looks like:

```{r}
mod = lm(y_car_accidents ~ x_umbrella_sales, umbrella_example_data)
summary(mod)
```

So what's the interpretation of the coefficient for $x$? ... What you can't say is that $x$ is a causal contributor to $y$! You may want to say it, but you can't!

Now let's build a model of $y$ linear in both $x$ and $z$. What happens?

```{r}
mod = lm(y_car_accidents ~ x_umbrella_sales + z_rainfall, umbrella_example_data)
summary(mod)
```

The effect of $x$ is gone!! Why? If you keep $z$ - the sole true causal factor in $y$ - constant, then manipulating $x$ won't matter anymore! Why is this? Well, you can look at how x affects y in local areas of z, for instance.
```{r}
z_max = 0.2; z_min = 0.1
z_small_indices = umbrella_example_data$z_rainfall < quantile(umbrella_example_data$z_rainfall, z_max) &
  umbrella_example_data$z_rainfall > quantile(umbrella_example_data$z_rainfall, z_min)

local_plot = ggplot(umbrella_example_data[z_small_indices, ]) +
  aes(x = x_umbrella_sales, y = y_car_accidents) +
  geom_point()
local_plot
local_plot +
  geom_smooth(method = "lm")
```

If you force the common cause (lurking variable) to be an approximate constant, then you won't see any effect of x on y.

# Ridge Regression

Let's take a look at the Boston housing data and add many useless features.

```{r}
options(java.parameters = "-Xmx4000m")
pacman::p_load(data.table, tidyverse, magrittr, YARF)
boston = MASS::Boston %>% data.table
```

Now add a whole bunch of extra features:

```{r}
p_extra = 1000
set.seed(1)
boston = cbind(boston, matrix(rnorm(nrow(boston) * p_extra), ncol = p_extra))
dim(boston)
```

Clearly $p + 1 > n$ so OLS will not work. Let's try ridge with $\lambda = 10$. Let's see the ridge estimate:

```{r}
X = cbind(1, as.matrix(boston[, !"medv"]))
y = boston[, medv]
lambda = 10
b_ridge = solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y
head(b_ridge, 30)
```

Clearly this works as an algorithm where OLS wouldn't.

Note: I left this out of the demos and out of class... you should standardize your features before ridge, otherwise features become unfairly squished or unfairly dominant relative to others. Each should have the same weight.

But let's see how it performs relative to OLS. To do so, we'll use the same setup but not add quite as many junk features so we can compare to OLS:

```{r}
boston = MASS::Boston %>% data.table
p_extra = 350
set.seed(1)
boston = cbind(boston, matrix(rnorm(nrow(boston) * p_extra), ncol = p_extra))
dim(boston)
```

Now we'll split into train-test so we can see which does better.
```{r}
prop_test = 0.2
test_indices = sample(1 : nrow(boston), round(prop_test * nrow(boston)))
boston_test = boston[test_indices, ]
y_test = boston_test$medv
X_test = cbind(1, as.matrix(boston_test[, !"medv"]))
train_indices = setdiff(1 : nrow(boston), test_indices)
boston_train = boston[train_indices, ]
y_train = boston_train$medv
X_train = cbind(1, as.matrix(boston_train[, !"medv"]))
```

Let's use a big lambda since we have intuition that most of the features are junk:

```{r}
lambda = 10
```

And we'll fit both models:

```{r}
b_ols = solve(t(X_train) %*% X_train) %*% t(X_train) %*% y_train
b_ridge = solve(t(X_train) %*% X_train + lambda * diag(ncol(X_train))) %*% t(X_train) %*% y_train
abs(b_ols) %>% head(30)
abs(b_ridge) %>% head(30)
```

And look at oos performance:

```{r}
y_hat_ols = X_test %*% b_ols
y_hat_ridge = X_test %*% b_ridge
rmse_ols = sd(y_test - y_hat_ols)
rmse_ridge = sd(y_test - y_hat_ridge)
rmse_ols
rmse_ridge
cat("ridge advantage over OLS:", round((rmse_ols - rmse_ridge) / rmse_ols * 100, 1), "%")
```

Why did it do better than OLS? Because penalized regression is a good idea if you know many of your features are junk. But only if you know many of your features are junk.

Of course by using CV, we can optimize the lambda value to give ridge even better performance. The glmnet package does that for us automatically:

```{r}
pacman::p_load(glmnet)
ridge_mod_optimal_lambda = cv.glmnet(X_train, y_train, alpha = 0, lambda = 10^seq(-3, 3, by = 0.1))
y_hat_optimal_ridge = predict(ridge_mod_optimal_lambda, X_test)
rmse_optimal_ridge = sd(y_test - y_hat_optimal_ridge)
rmse_optimal_ridge
cat("optimal lambda:", ridge_mod_optimal_lambda$lambda.min, "\n")
cat("optimal ridge advantage over OLS:", round((rmse_ols - rmse_optimal_ridge) / rmse_ols * 100, 1), "%\n")
```

Of course you can use mlr as well, but glmnet is probably more optimized.

# Lasso (and bonus: variable selection)

Let's do this same problem using the lasso.
There is no closed form solution since the design matrix is not orthogonal (i.e. there's some multicollinearity), so we will use the numerical optimization found in the glmnet package. While we're at it, we might as well use CV to find the best lambda.

```{r}
lasso_mod_optimal_lambda = cv.glmnet(X_train, y_train, alpha = 1, lambda = 10^seq(-3, 3, by = 0.1))
y_hat_optimal_lasso = predict(lasso_mod_optimal_lambda, X_test)
rmse_optimal_lasso = sd(y_test - y_hat_optimal_lasso)
rmse_optimal_lasso
cat("optimal lambda:", lasso_mod_optimal_lambda$lambda.min, "\n")
cat("optimal lasso advantage over OLS:", round((rmse_ols - rmse_optimal_lasso) / rmse_ols * 100, 1), "%\n")
```

Wow - it did better than ridge in predictive accuracy. The lambda values are completely incomparable across the two since the L1 and L2 penalties are categorically different. What do the estimates look like?

```{r}
head(coef(lasso_mod_optimal_lambda), 30)
```

That "." means 0 in a sparse matrix. We never studied these, but they are very memory efficient. Which ones are non-zero?

```{r}
b_lasso = coef(lasso_mod_optimal_lambda)[, 1]
b_lasso[b_lasso != 0]
```

Wow - it deleted all 364 coefficients except for 4: the intercept, rm, ptratio and lstat!!! If you remember, in the regression tree these were the most important features (the highest level splits). That is killer variable selection!

The coefficient values are also approximately the OLS estimates:

```{r}
lm(medv ~ rm + ptratio + lstat, MASS::Boston) #i.e. a regression on the original data with no junk
```

It's amazing that it really deleted ALL the junk, left the most predictive variables of the original 13 features, and its estimates of those four are pretty on target.

# Elastic Net

We can use mlr to CV over alpha, but here we can't.
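Before picking an alpha, recall what it controls: glmnet's elastic net penalty is a convex mix of the lasso (L1) and ridge (L2) penalties, $\lambda \left( \alpha \|b\|_1 + \frac{1 - \alpha}{2} \|b\|_2^2 \right)$. A minimal base-R sketch of the penalty value itself (the function name and coefficient vector are invented for illustration):

```r
#elastic net penalty as parameterized in glmnet:
#lambda * (alpha * L1-norm + (1 - alpha) / 2 * squared L2-norm)
elastic_net_penalty = function(b, lambda, alpha){
  lambda * (alpha * sum(abs(b)) + (1 - alpha) / 2 * sum(b^2))
}
b = c(2, -1, 0)
elastic_net_penalty(b, lambda = 1, alpha = 1)   #pure lasso: sum|b| = 3
elastic_net_penalty(b, lambda = 1, alpha = 0)   #pure ridge: sum(b^2)/2 = 2.5
elastic_net_penalty(b, lambda = 1, alpha = 0.2) #a ridge-leaning mix: 2.6
```

The L1 term is what produces exact zeros in the solution; as alpha shrinks toward 0 its share of the penalty shrinks too, which is why the hard shrinkage to zero weakens.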
So let's set $\alpha = 0.2$, a mix leaning toward the ridge penalty:

```{r}
elastic_net_mod_optimal_lambda = cv.glmnet(X_train, y_train, alpha = 0.2, lambda = 10^seq(-3, 3, by = 0.1))
y_hat_optimal_elastic_net = predict(elastic_net_mod_optimal_lambda, X_test)
rmse_optimal_elastic_net = sd(y_test - y_hat_optimal_elastic_net)
rmse_optimal_elastic_net
cat("optimal elastic_net advantage over OLS:", round((rmse_ols - rmse_optimal_elastic_net) / rmse_ols * 100, 1), "%\n")
cat("optimal lambda:", elastic_net_mod_optimal_lambda$lambda.min, "\n")
```

Slightly better than the lasso. I imagine if we optimized $\alpha$ we could do even better. Elastic nets can also give easy variable selection:

```{r}
head(coef(elastic_net_mod_optimal_lambda), 30)
```

Here we "found" one more variable. That makes sense - as alpha decreases, the ridge penalty becomes more pronounced and it's harder to shrink exactly to zero. I am unsure about the $\alpha$ value that stops the hard shrinkage to zero. Good project to think about!

# RF with many features

How about RF?

```{r}
rf_mod = YARF(data.frame(X_train), y_train, num_trees = 500, calculate_oob_error = FALSE)
rmse_rf = sd(y_test - predict(rf_mod, data.frame(X_test)))
cat("RF advantage over OLS:", round((rmse_ols - rmse_rf) / rmse_ols * 100, 1), "%\n")
```

It takes a very long time to build - why? Amazingly, RF does very well. Why? How is it able to not get confused by the junk features? It might be because the real features have a slight SSE edge. I think RF will do poorly if p > n. Maybe a lab exercise?

How about just the RF on the lasso-picked variables? We can delete the intercept since RF doesn't need it.
```{r}
variables_selected = names(b_lasso[b_lasso != 0])
variables_selected = variables_selected[-1] #drop the intercept
X_train_sub = data.frame(X_train)[, variables_selected]
X_test_sub = data.frame(X_test)[, variables_selected]
rf_mod = YARF(X_train_sub, y_train, num_trees = 500, mtry = 2, calculate_oob_error = FALSE)
rmse_rf = sd(y_test - predict(rf_mod, X_test_sub))
cat("RF var selected advantage over OLS:", round((rmse_ols - rmse_rf) / rmse_ols * 100, 1), "%\n")
```

Why is that better than lasso? Because lasso is linear and in RF you get a bit of juice from the non-linearities and interactions. Why is it very slightly better than RF on the full data set? Because variable selection is a good "pre-step" to do sometimes. This is why in the real world there's usually a "pipeline" that cleans the data, then variable-selects, then fits a model, then validates.

# Asymmetric Cost Models in Trees and RF

Let's load up the adult dataset where the response is 1 if the person makes more than \$50K per year and 0 if they make less than \$50K per year.

```{r}
rm(list = ls())
options(java.parameters = "-Xmx4000m")
pacman::p_load(YARF, tidyverse, magrittr) #tidyverse for select, magrittr for %<>%
pacman::p_load_gh("coatless/ucidata")
data(adult)
adult %<>% na.omit #kill any observations with missingness
```

Let's use samples of 2,000 to run experiments:

```{r}
train_size = 2000
train_indices = sample(1 : nrow(adult), train_size)
adult_train = adult[train_indices, ]
y_train = adult_train$income
X_train = adult_train %>% select(-income)
test_indices = setdiff(1 : nrow(adult), train_indices)
adult_test = adult[test_indices, ]
y_test = adult_test$income
X_test = adult_test %>% select(-income)
```

What do the $y$'s look like?

```{r}
table(y_train)
```

Very imbalanced. This would, off the bat, make y = 0 (i.e. "<=50K") the default prediction.
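One way to push a bagged classifier around this imbalance is to reweight each bootstrap sample - duplicate one class and thin the other - which is the trick used below with YARF. A base-R sketch of how the class proportions shift under such resampling (hypothetical 80/20 labels, not the adult data):

```r
set.seed(1)
y = factor(c(rep("<=50K", 80), rep(">50K", 20))) #hypothetical imbalanced labels
idx_0 = which(y == "<=50K")
idx_1 = which(y == ">50K")
#oversample the negatives 2x and undersample the positives 0.5x
boot = c(
  sample(idx_0, round(2.0 * length(idx_0)), replace = TRUE),
  sample(idx_1, round(0.5 * length(idx_1)), replace = TRUE)
)
table(y[boot]) #now 160 negatives vs 10 positives
```

Each tree then trains on data where positives are rarer still, so the ensemble classifies positive less often, trading FDR down for FOR up.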
Now make a regular RF and look at the oob confusion table and FDR and FOR: {r} num_trees = 500 yarf_mod = YARF(X_train, y_train, num_trees = num_trees, calculate_oob_error = FALSE) y_hat_test = predict(yarf_mod, X_test) oos_confusion = table(y_test, y_hat_test) oos_confusion cat("FDR =", oos_confusion[1, 2] / sum(oos_confusion[, 2]), "\n") cat("FOR =", oos_confusion[2, 1] / sum(oos_confusion[, 1]), "\n")  High FDR rate and low FOR rate. Let's try to change this and reduce the FDR by oversampling 0's. {r} idx_0 = which(y_train == "<=50K") n_0 = length(idx_0) idx_1 = which(y_train == ">50K") n_1 = length(idx_1) bootstrap_indices = list() for (m in 1 : num_trees){ bootstrap_indices[[m]] = c( #note n_0' + n_1' doesn't equal n. You can make it so with one more line of code... sample(idx_0, round(2.0 * n_0), replace = TRUE), sample(idx_1, round(0.5 * n_1), replace = TRUE) ) } yarf_mod_asymmetric = YARF(X_train, y_train, bootstrap_indices = bootstrap_indices, calculate_oob_error = FALSE) y_hat_test = predict(yarf_mod_asymmetric, X_test) oos_confusion = table(y_test, y_hat_test) oos_confusion cat("FDR =", oos_confusion[1, 2] / sum(oos_confusion[, 2]), "\n") cat("FOR =", oos_confusion[2, 1] / sum(oos_confusion[, 1]), "\n") ` Thanks for being with us until the end! Hope you learned a lot of the fundamentals of data science, machine learning and statistical computing!