The first thing we need to do is load the packages that we installed. Run the code by clicking the green arrow on the right.

```{r}
library(educationdata)
library(dplyr)
library(ggplot2)
```

Next we want to use the `get_education_data()` function to get the data that the Urban Institute wants to share with us. Type `?get_education_data` to see the documentation for this function. Notice that it gives this information.

```
get_education_data(level = NULL, source = NULL, topic = NULL, by = NULL,
  filters = NULL, add_labels = FALSE, csv = FALSE)
```

In R a function is essentially an R program that takes a set of inputs and then returns some kind of result. In the code chunk below you will see a set of functions. If you run the functions you will get a result.

```{r}
sum(5, 12, 18)
sqrt(36)
# mean() averages only its first argument, so the values go in one vector with c()
mean(c(15, 32, 11, 17))
```

We can also store the result of a function in an object and then use the object later. `<-` is the assignment operator in R. Run this code chunk.

```{r}
my_sum <- sum(5, 12, 18)
my_sqrt <- sqrt(36)
my_mean <- mean(c(15, 32, 11, 17))
```

Now run this code chunk.

```{r}
my_sum
my_sqrt
my_mean
```

Let's try an example based on the one on the GitHub page for the educationdata package. This requests the Common Core of Data (CCD) directory for school districts for the year 2015.

```{r}
ccd <- get_education_data(level = 'school-districts',
                          source = 'ccd',
                          topic = 'directory',
                          filters = list(year = 2015))
```

Notice that the data loads in pages. That is so that the movement of so much data is less likely to fail, and if there is a problem it can pick up at a given page number.

Now let's look at the names of the variables in the ccd data set using the `names()` function.

```{r}
names(ccd)
```

To get the student-teacher ratio for a district we need the number of students and the number of teachers. For students we will use enrollment. For teachers we will use teachers_total_fte.
(FTE is education language for full-time equivalent, so two half-time teachers are equivalent to one full-time teacher.)

Before we calculate our ratio we need to check the quality of the data. One way to do that is with the `summary()` function. In R you refer to specific columns in a data frame with the `$` notation: `dataframe_name$column_name`. Our data frame is called ccd and our two columns are teachers_total_fte and enrollment. The code for enrollment is shown. Add the code for the other variable.

```{r}
summary(ccd$enrollment)
```

What problem do you spot?

Some of our data do not seem to be valid. You can't have negative students or teachers. We need to change those to NA, which is what R uses to indicate missing values. We will use some R code to take any district with fewer than one total teacher, which includes every negative value, and set it to NA. Do the same for enrollment. Notice the use of the assignment operator.

```{r}
ccd$teachers_total_fte[ccd$teachers_total_fte < 1] <- NA
summary(ccd$teachers_total_fte)
```

The documentation for the CCD data indicates that only units that have agency_type of 1 or 2 are "regular public schools." Let's filter the data set so that we only include those.

```{r}
ccd <- filter(ccd, agency_type == 1 | agency_type == 2)
```

The `|` symbol in the code above means "or".

Now we will add a new variable to the data frame by dividing the enrollment column by the teachers_total_fte column. Let's call it student_teacher_ratio.

```{r}
ccd$student_teacher_ratio <- ccd$enrollment / ccd$teachers_total_fte
summary(ccd$student_teacher_ratio)
```

Some of the extreme numbers will be worth exploring further. They could also represent problematic data.

This time let's get the SAIPE data. Get the data from level: "school-districts", source: "saipe" and year 2016.

```{r}
saipe <- get_education_data(level = 'school-districts',
                            source = 'saipe',
                            filters = list(year = 2016))
names(saipe)
```

There are several variables in the SAIPE data.
The one we are interested in is called `est_population_5_17_poverty_pct`. The challenge is that we need to get the saipe data and the ccd data into one data set so that we can look at the relationship between the student-teacher ratio and the poverty rate. How can we do that?

Fortunately the two data sets have some variables in common. Specifically, the leaid variable is in both data sets and is different for each of the rows. We can test that. The statement below asks whether the number of unique values of leaid in the ccd data is the same as the number of rows. Do the same thing for the saipe data.

```{r}
length(unique(ccd$leaid)) == nrow(ccd)
```

If both statements are true, leaid is a unique key for each row and we can use it to match rows in the two data sets without any complications. When we have a key that is unique in both data sets, an inner join gives us a data set with only those rows which are in both of the data sets being joined.

```{r}
combined <- dplyr::inner_join(ccd, saipe, by = "leaid")
```

Now we have one data set with all of the variables. How many rows of data are in the three data sets (ccd, saipe, combined)? Use the first line as a model.

```{r}
nrow(combined)
```

Why does combined have the smallest number?
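
One way to explore that question is with a tiny made-up example. The sketch below uses two hypothetical data frames, toy_ccd and toy_saipe, whose values are invented for illustration and are not real CCD or SAIPE data. It repeats the unique-key check and the inner join from above on a scale where you can read every row.

```{r}
library(dplyr)

# Hypothetical toy data (invented values, not real CCD or SAIPE records)
toy_ccd <- data.frame(leaid = c("A", "B", "C", "D"),
                      enrollment = c(100, 250, 80, 40))
toy_saipe <- data.frame(leaid = c("B", "C", "E"),
                        poverty_pct = c(0.12, 0.30, 0.08))

# Check that leaid is a unique key in each toy data frame
length(unique(toy_ccd$leaid)) == nrow(toy_ccd)
length(unique(toy_saipe$leaid)) == nrow(toy_saipe)

# Keep only the rows whose leaid appears in both data frames
toy_combined <- inner_join(toy_ccd, toy_saipe, by = "leaid")

# Compare the three row counts, then look at which rows survived
nrow(toy_ccd)
nrow(toy_saipe)
nrow(toy_combined)
toy_combined
```

Look at which leaid values appear in toy_combined and which ones were dropped from each side, then think about what that implies for the real ccd and saipe data.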