library(tidyverse)
Let’s look at how to use tidyr (part of the tidyverse) to reshape a data frame into the form we need for the tasks we want to carry out.
To explore this, we will start with some personal expenditure data from the United States. This data set consists of United States personal expenditures (in billions of dollars) in the categories food and tobacco, household operation, medical and health, personal care, and private education, for the years 1940, 1945, 1950, 1955 and 1960. Have a look at this data with the View() function.
It is actually quite small, so we can have a look here:
USPersonalExpenditure
## 1940 1945 1950 1955 1960
## Food and Tobacco 22.200 44.500 59.60 73.2 86.80
## Household Operation 10.500 15.500 29.00 36.5 46.20
## Medical and Health 3.530 5.760 9.71 14.0 21.10
## Personal Care 1.040 1.980 2.45 3.4 5.40
## Private Education 0.341 0.974 1.80 2.6 3.64
We can see that each row is a category of spending, and each column is one of the years in the data set. There is something else you may have noticed.
class(USPersonalExpenditure)
## [1] "matrix"
The class of the dataset is actually a matrix, not a data frame (from R 4.0.0 onwards, class() reports "matrix" "array" for a matrix). We can convert this to a data frame using the as.data.frame() function. You saw this yesterday, when we were converting the Twitter results to a data frame.
USPersonalExpenditure_df <- as.data.frame(USPersonalExpenditure)
Now you can check that the USPersonalExpenditure_df object is a data frame.
class(USPersonalExpenditure_df)
## [1] "data.frame"
Much better.
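Before moving on, it can help to check where the category labels and the years actually live in the converted object. The sketch below uses only base R and rebuilds the data frame from the built-in USPersonalExpenditure matrix so that it runs on its own.

```r
# Rebuild the data frame from the built-in matrix (same conversion as above)
USPersonalExpenditure_df <- as.data.frame(USPersonalExpenditure)

# The spending categories are stored as row names, not as a column
rownames(USPersonalExpenditure_df)

# The years are stored as column names
names(USPersonalExpenditure_df)
```

Neither the categories nor the years are ordinary variables yet, which matters as soon as we want to refer to them in an analysis.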
Now let’s suppose we want to answer the question: Is there a difference between years in spending?
Our first issue is that the spending categories are stored as row names, and are not captured in a variable such as “domain” of spending. If you ever come across this, and you need to turn your row names into a new column (variable) so that you can refer to them, you can use the following code:
USPersonalExpenditure_df <- tibble::rownames_to_column(USPersonalExpenditure_df, "Domain")
Now we have a new variable, called “Domain”, which holds the domain in which the money was spent. We can now use it as a variable. Yay! Let’s take another look at the data frame to check what it looks like now:
USPersonalExpenditure_df
## Domain 1940 1945 1950 1955 1960
## 1 Food and Tobacco 22.200 44.500 59.60 73.2 86.80
## 2 Household Operation 10.500 15.500 29.00 36.5 46.20
## 3 Medical and Health 3.530 5.760 9.71 14.0 21.10
## 4 Personal Care 1.040 1.980 2.45 3.4 5.40
## 5 Private Education 0.341 0.974 1.80 2.6 3.64
OK, so back to our question: we want to see whether there is a significant difference in overall spending between years. To answer this, we need to be able to group by year, which should be a factor variable on its own, rather than a bunch of column headings, right?
Think about how you would write the pseudo code for testing the difference in spending between groups. If you used an ANOVA test, for example, you would need two variables: the numeric variable (spending) and a grouping variable (year). The function in R for an ANOVA test is aov(). So your pseudo code would look like this:
# aov() takes a formula: numeric response ~ grouping variable
result <- aov(Spending ~ Year, data = USPersonalExpenditure_df)
We currently don’t have either of these variables! While we have the information in this data set, we do not have the actual variable.
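A quick way to confirm this is to ask R whether columns with those names exist. This is a minimal sketch: it rebuilds the data frame from the built-in matrix so it runs on its own, and the name has_vars is made up for illustration.

```r
library(tibble)

# Rebuild the data frame as in the steps above
USPersonalExpenditure_df <- rownames_to_column(
  as.data.frame(USPersonalExpenditure), "Domain"
)

# Which columns do we actually have?
names(USPersonalExpenditure_df)

# Are the two variables the ANOVA needs present? (FALSE means missing)
has_vars <- c("Spending", "Year") %in% names(USPersonalExpenditure_df)
has_vars
```

Both checks come back FALSE: the spending values are spread across five year columns, and the years exist only as column names.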
Take a moment to think about how you would do this in the environment you are used to working in. Would you just do this manually? Well, what if your data was much larger? What if you had to do this with many data sets?
If you are thinking about what we need to do to the data, it helps to visualise what the data looks like now, and what you want it to look like. Is it wide? Is it narrow? Once you have visualised that, the tidyr cheatsheet can help you identify the best function.
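As a rough sketch of the wide-to-narrow idea, and not necessarily the function this tutorial goes on to use, the example below assumes tidyr 1.0.0 or later (for pivot_longer()). The object names wide and long are made up for illustration, as are the column names "Year" and "Spending".

```r
library(tibble)
library(tidyr)

# Wide shape: one row per domain, one column per year
wide <- rownames_to_column(as.data.frame(USPersonalExpenditure), "Domain")

# Narrow (long) shape: one row per domain-year combination
long <- pivot_longer(
  wide,
  cols = -Domain,          # reshape every column except Domain
  names_to = "Year",       # old column names become values of Year
  values_to = "Spending"   # old cell values become values of Spending
)

# Year arrives as character; make it a grouping factor
long$Year <- factor(long$Year)

dim(wide)  # 5 rows, 6 columns
dim(long)  # 25 rows, 3 columns
```

In the narrow shape, Year and Spending are real variables, which is the form a grouped comparison such as an ANOVA expects.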
Here is a very skilful drawing of what we are trying to achieve: