If you have not previously done the education_analysis_base analysis, do that now. If you do not see the data frame called combined in the Environment tab, also re-run the code below. (This is only the data creation code. If you already have the data you do not need to rerun it, but do not remove it.)

```{r echo=FALSE, message=FALSE}
library(educationdata)
# District directory data from the CCD for 2015
ccd <- get_education_data(level = 'school-districts', source = 'ccd', topic = 'directory', filters = list(year = 2015))
# Treat teacher counts below 1 FTE as missing
ccd$teachers_total_fte[ccd$teachers_total_fte < 1] <- NA
# Keep only agency types 1 and 2
ccd <- dplyr::filter(ccd, agency_type == 1 | agency_type == 2)
ccd$student_teacher_ratio <- ccd$enrollment/ccd$teachers_total_fte
# District poverty estimates from SAIPE for 2016
saipe <- get_education_data(level = 'school-districts', source = 'saipe', filters = list(year = 2016))
# Keep districts that appear in both data sets, matched on leaid
combined <- dplyr::inner_join(ccd, saipe, by = "leaid")
```

Whether or not you have done the education_analysis_base analysis, reload the libraries. R does not reload a package that is already loaded, so calling library() again for the same packages has no real effect on memory usage or performance.

```{r warning=FALSE, message=FALSE}
library(educationdata)
library(dplyr)
library(ggplot2)
```

In this lab we want to look at how similar or different the relationship between variables is in the different states.

How many states do you think there will be in the data set?

The variable identifying the state in the combined data set is called state_mailing. We refer to it as combined$state_mailing, which indicates that we are using the state_mailing variable from the combined data frame.

To find out the actual answer to this question we will combine two R functions: unique() and length(). The data frame combined has 13025 rows, and we do not want to look at all of those rows to see how many states there are. The unique() function creates a vector with one copy of each value of the variable state_mailing.
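To see how unique() and length() work together before applying them to the real data, here is a small sketch on a made-up vector. The name toy_states and its values are invented for illustration and are not part of the combined data.

```{r}
# Hypothetical example vector (not taken from the combined data frame)
toy_states <- c("NY", "CA", "NY", "TX", "CA", "NY")

# unique() keeps one copy of each value, in order of first appearance
toy_unique <- unique(toy_states)
toy_unique

# length() counts the elements of the resulting vector
length(toy_unique)
```

The same two steps apply to combined$state_mailing, only with many more rows.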
In the code below we use the unique() function and then assign the results to a new variable called state_names. Then we print state_names by giving its name.

```{r}
state_names <- unique(combined$state_mailing)
state_names
```

You could count the number of states displayed by hand, but to find out how many states there are using code you would use the length() function on state_names. Put the code for this below.

```{r}
length(state_names)
```

How many states are there? Are you surprised? Why is the number not 50?

When working with data it is always important to understand what the data actually contains before going ahead with any analysis.

Each row in the combined data frame represents one school district. Let's look at how many school districts are in each of the states. The code given below uses the table() function and assigns the results to n_districts. Add the code to print n_districts and run the code.

```{r}
n_districts <- table(combined$state_mailing)
```

Looking at the table, what is the smallest number of districts in a state? What is the largest?

Now use the R functions min() and max() to get the same information. The code chunk below gives the code for getting the minimum. Add the code for getting the maximum.

```{r}
min(n_districts)
```

Looking at the table, which places have the minimum number of districts?

While it is pretty easy to find the minimum and maximum by looking at the table, what about something like the mean or median? Use the R functions mean() and median() in the code chunk below to find those statistics for n_districts.

```{r}
```

Which is higher, the mean or the median?

We want to look at the relationship between the rate of child poverty and the student-teacher ratio for each of the states and compare them to each other. We are going to use the ggplot system to create "small multiples", which will provide a separate graph for each state. Let's look at the code block below.
In ggplot, graphics are built by connecting specific elements using + signs. The first line gives the name of the data set, combined. The next two lines define the x and y axes; in this case x is est_population_5_17_poverty_pct and y is student_teacher_ratio. The next line, geom_point(), specifies a scatterplot and provides some instructions about the size of the points, the color, and alpha, which helps to make graphs with many points more readable by making places with overlapping points darker and single points lighter. facet_wrap() is what creates the small multiples; it specifies that we will create a graph for each value of state_mailing, that is, one per state, and that we will use 4 columns. Finally, ggtitle() gives the title. Run your graph.

```{r }
ggplot(combined) +
  aes(x = est_population_5_17_poverty_pct) +
  aes(y = student_teacher_ratio) +
  geom_point(alpha = 0.3, size = .05, color = "black") +
  facet_wrap(~state_mailing, ncol = 4, scales = "free") +
  ggtitle("My graph")
```

What happened? We have two problems: some missing data, and the multiples are way too small.

The missing data problem arises because student_teacher_ratio has some missing values. Let's remove all the rows that have a missing student_teacher_ratio from the data by using a neat bit of code, !is.na(). The ! means not. The function is.na() asks whether the variable inside the parentheses is missing. Putting them together, we get the rows that are not missing.

```{r}
combined <- filter(combined, !is.na(student_teacher_ratio))
```

The next problem is the size issue. Since there are 52 states in the data, we can split them into 4 groups of 13 states. We can use the state_names variable to help us do this. The first line of the code chunk below filters the combined data set to just the rows for the first 13 state names, using the special %in% operator. Also notice that the chunk option fig.height=6 makes the figure 6 inches tall.

Run the code once, then change the title and color of the points.
You can also experiment with the number of columns and the size of the points.

```{r fig.height=6}
subset <- filter(combined, state_mailing %in% state_names[1:13])
ggplot(subset) +
  aes(x = est_population_5_17_poverty_pct) +
  aes(y = student_teacher_ratio) +
  geom_point(alpha = 0.3, size = .05, color = "black") +
  facet_wrap(~state_mailing, ncol = 4, scales = "free") +
  ggtitle("My graph")
```

Copy the code chunk for the graph three times. Change the states that are included in each graph by editing the first line (for example, for the second graph use 14:26).

Look at your graphs and write about your results:

Does the relationship between the variables look the same for all of the states?

Do you notice any states where there does not seem to be a relationship (where the student-teacher ratio is basically the same across all the child poverty levels)? Which ones?

Do you notice any states where there seems to be a negative relationship (where the student-teacher ratio is lower when the child poverty rate is higher)? Which ones?

Do you notice any states where there seems to be a positive relationship (where the student-teacher ratio gets higher as the child poverty rate gets higher)? Which ones?

Why do you think there might be differences in this among the different states? Can you think of anything the states with similar patterns have in common?
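If the two filtering idioms used in this lab, !is.na() and %in% with an index range, are still unclear, it may help to try them on a small made-up data frame. The sketch below uses only base R, and the names toy, toy_names, not_missing, and in_first_two are invented for illustration. filter() keeps the rows for which conditions like these are TRUE.

```{r}
# Hypothetical data: four districts in three made-up states
toy <- data.frame(
  state = c("AA", "BB", "BB", "CC"),
  ratio = c(15.2, NA, 17.8, 14.1)
)
toy_names <- unique(toy$state)

# TRUE where ratio is present, FALSE where it is missing
not_missing <- !is.na(toy$ratio)
not_missing

# TRUE where the state is one of the first two names
in_first_two <- toy$state %in% toy_names[1:2]
in_first_two

# Rows for which both conditions are TRUE
toy[not_missing & in_first_two, ]
```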