The first thing we need to do is load the packages that we installed.
Run the code by clicking the green arrow on the right.
```{r}
library(educationdata)
library(dplyr)
library(ggplot2)
```
Next we want to use the get_education_data() function to get the data that the Urban
Institute wants to share with us.
Type ?get_education_data in the console to see the documentation for this function.
Notice that the help page lists the function's arguments and their default values.
```
get_education_data(level = NULL, source = NULL, topic = NULL, by = NULL,
filters = NULL, add_labels = FALSE, csv = FALSE)
```
In R a function is essentially an R program that takes a set of inputs and then returns
some kind of result. In the code chunk below you will see a set of functions.
If you run the functions you will get a result.
```{r}
sum(5, 12, 18)
sqrt(36)
# mean() averages a single vector, so combine the numbers with c() first
mean(c(15, 32, 11, 17))
```
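One detail is worth knowing here: sum() adds up every number you give it, but mean() only
averages its first argument, so the numbers have to be combined into one vector with c().
A minimal sketch of the difference (the object names are made up for this illustration):
```{r}
# c() combines separate numbers into one vector
toy_scores <- c(15, 32, 11, 17)
# Averages all four numbers
mean_of_vector <- mean(toy_scores)
# Only the first number is treated as the data to average
mean_of_loose <- mean(15, 32, 11, 17)
mean_of_vector
mean_of_loose
```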
We can also store the results of a function in an object and then be able to use the
object later. <- is the assignment operator in R. Run this code chunk.
```{r}
my_sum <- sum(5, 12, 18)
my_sqrt <- sqrt(36)
my_mean <- mean(c(15, 32, 11, 17))
```
Now run this code chunk.
```{r}
my_sum
my_sqrt
my_mean
```
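Stored objects can also be used as inputs to later calculations. A small sketch, using
made-up object names so that nothing above is overwritten:
```{r}
# Store a result, then reuse it
toy_total <- sum(5, 12, 18)
toy_average <- toy_total / 3
toy_average
# An object can be passed to another function, here with a named argument
round(sqrt(toy_total), digits = 2)
```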
Let's try an example based on the one from the GitHub page for the educationdata package.
This requests school district directory data from the Common Core of Data (CCD)
for the year 2015.
```{r}
ccd <- get_education_data(level = 'school-districts', source = 'ccd',
topic = 'directory',
filters = list(year = 2015))
```
Notice that the data loads in pages. That way the transfer of so much data is
less likely to fail, and if there is a problem it can pick up at a given page number.
Now let's look at the names of the variables in the ccd data set using
the names() function.
```{r}
names(ccd)
```
To get the student-teacher ratio for a district we would need to have
the number of students and the number of teachers.
For students we will use enrollment.
For teachers we will use teachers_total_fte. (FTE is education language for full-time
equivalent, so two half-time teachers are equivalent to one full-time teacher.)
Before we calculate our ratio we need to check the quality of the data.
One way to do that is with the summary() function. In R you refer to specific columns
in a data frame with the $ notation: dataframe_name$column_name.
Our data frame is called ccd and our two columns are teachers_total_fte and enrollment.
The code for enrollment is shown. Add the code for the other variable.
```{r}
summary(ccd$enrollment)
```
What problem do you spot?
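Some of our data do not seem to be valid. You can't have negative students or teachers.
We need to change those to NA, which is what R uses to indicate missing values.
We will use some R code to take any district with an invalid number for total
teachers and set it to NA. Note that the code tests for values below 1 rather than
below 0, so it also catches districts reporting zero teachers (or less than one
full-time equivalent), which we could not sensibly divide by later.
Do the same for enrollment.
Notice the use of the assignment operator.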
```{r}
ccd$teachers_total_fte[ccd$teachers_total_fte < 1] <- NA
summary(ccd$teachers_total_fte)
```
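The replacement above works through logical indexing: the comparison inside the square
brackets produces TRUE or FALSE for every row, and only the TRUE positions are overwritten.
A small sketch with a made-up vector (toy_teachers is not part of the CCD data):
```{r}
# Made-up teacher counts, including values that cannot be real
toy_teachers <- c(25.5, -1, 40, -2, 0)
# The comparison returns TRUE wherever the value is below 1
toy_teachers < 1
# Overwrite only those positions with NA
toy_teachers[toy_teachers < 1] <- NA
toy_teachers
# Once NAs are present, mean() needs na.rm = TRUE to skip them
mean(toy_teachers, na.rm = TRUE)
```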
The documentation for the CCD data indicates that only units that have an
agency_type of 1 or 2 are "regular public schools." Let's filter
the data set so that we only include those.
```{r}
ccd <- filter(ccd, agency_type == 1 | agency_type == 2)
```
The | symbol in the code above means "or".
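As a sketch of how | behaves, the made-up data frame below keeps every row where either
condition is TRUE. The %in% operator is another way to write the same test when one column
is checked against several values.
```{r}
library(dplyr)
# Made-up agencies; only types 1 and 2 should survive the filter
toy_agencies <- data.frame(
  name = c("A", "B", "C", "D"),
  agency_type = c(1, 2, 3, 7)
)
# Keep rows where agency_type is 1 or 2
kept_with_or <- filter(toy_agencies, agency_type == 1 | agency_type == 2)
# The same test written with %in%
kept_with_in <- filter(toy_agencies, agency_type %in% c(1, 2))
kept_with_or
kept_with_in
```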
Now we will add a new variable to the data frame by dividing the enrollment column
by the teachers_total_fte column. Let's call it student_teacher_ratio.
```{r}
ccd$student_teacher_ratio <- ccd$enrollment/ccd$teachers_total_fte
summary(ccd$student_teacher_ratio)
```
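Part of the reason for checking the data first is the way R handles division. A sketch
with made-up numbers: dividing by NA gives NA, and dividing by zero gives Inf rather
than an error.
```{r}
# Made-up enrollment and teacher counts
toy_enrollment <- c(500, 300, 120)
toy_fte <- c(25, NA, 0)
# Division is done element by element
toy_ratio <- toy_enrollment / toy_fte
toy_ratio
# summary() counts the NAs separately from the other values
summary(toy_ratio)
```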
Some of the extreme numbers will be worth exploring further. They could
also represent problematic data.
This time let's get the SAIPE data. Get the data from level: "school-districts",
source: "saipe" and year 2016.
```{r}
saipe <- get_education_data(level = 'school-districts', source = 'saipe',
                            filters = list(year = 2016))
names(saipe)
```
There are several variables in the SAIPE data. The one we are interested in is called
`est_population_5_17_poverty_pct`. The challenge is that we need to get the
saipe data and the ccd data into one data set so that we can look at the relationship
between the student-teacher ratio and the poverty rate.
How can we do that? Fortunately the two data sets have some variables in common.
Specifically, the leaid variable is in both data sets and should be different for each of
the rows. We can test that. The statement below asks whether the number of unique values
of leaid in the ccd data is the same as the number of rows.
Run the same check on the saipe data.
```{r}
length(unique(ccd$leaid)) == nrow(ccd)
```
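To see what this check reports when a key is not unique, here is a sketch with made-up
IDs. The base R function anyDuplicated() is another way to ask the same question.
```{r}
# Made-up IDs; the third one repeats the second
toy_ids <- c("0100005", "0100006", "0100006", "0100007")
unique(toy_ids)
# FALSE here, because there are fewer unique values than values
length(unique(toy_ids)) == length(toy_ids)
# Returns the position of the first repeated value, or 0 if there is none
anyDuplicated(toy_ids)
```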
If both statements are true leaid is a unique key for each row and
we can use it to match rows in the two data sets without any
complications.
When we have a key that is unique in both data sets,
an inner join gives us a data set with only those rows which are in both
of the data sets being joined.
```{r}
combined <- dplyr::inner_join(ccd, saipe, by = "leaid")
```
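A small sketch of the same kind of join on two made-up data frames (the toy_ objects and
their values are invented for illustration) shows which rows an inner join keeps:
```{r}
library(dplyr)
# Two made-up tables that share only some leaid values
toy_left <- data.frame(leaid = c("01", "02", "03"),
                       enrollment = c(500, 300, 120))
toy_right <- data.frame(leaid = c("02", "03", "04"),
                        poverty_pct = c(0.21, 0.12, 0.30))
# Only rows whose leaid appears in both tables are kept
toy_joined <- inner_join(toy_left, toy_right, by = "leaid")
toy_joined
```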
Now we have one data set with all of the variables.
How many rows of data are in the three data sets (ccd, saipe,
combined)? Use the first line as a model.
```{r}
nrow(combined)
```
Why does combined have the smallest number of rows?