So now that you are familiar with the R syntax, know how to import data, and know how to get that data into the shape and form that you need for your analysis, we can begin to perform some basic operations to explore your data.
In your normal workflow, we are conscious that lot of your data exploration will come from visualisation. We cover visualisation tomorrow though, so we will stick to your basic stats in this session. We will cover
We just want to note that this tutrial is not meant to be a primer in stats, rather just an illustration of how to do these stats in R. If you are not familiar with all of these techniques there are a variety of statistics resources out there.
library(tidyverse)
We will be working through the gapminder data set again. To load this data set into your environment, load the package:
library(gapminder)
But we also want to do some data manipulation by filtering and creating a new variable we will use later.
gapminder_2007 <- gapminder %>% filter(year == 2007) %>% mutate(gdp_factor = if_else(gdpPercap >
mean(gdpPercap), "high gdp", "low gdp"), below_avg_life_exp = if_else(lifeExp <
mean(lifeExp), "low life expectancy", "high life expectancy"))
gapminder_2007$gdp_factor <- as.factor(gapminder_2007$gdp_factor)
gapminder_2007$below_avg_life_exp <- as.factor(gapminder_2007$below_avg_life_exp)
summary(gapminder_2007$gdpPercap)
Min. 1st Qu. Median Mean 3rd Qu. Max.
277.6 1624.8 6124.4 11680.1 18008.8 49357.2
sd(gapminder_2007$gdpPercap)
[1] 12859.94
summary(gapminder_2007$gdp_factor)
high gdp low gdp
47 95
For the whole dataframe
summary(gapminder_2007)
country continent year lifeExp
Afghanistan: 1 Africa :52 Min. :2007 Min. :39.61
Albania : 1 Americas:25 1st Qu.:2007 1st Qu.:57.16
Algeria : 1 Asia :33 Median :2007 Median :71.94
Angola : 1 Europe :30 Mean :2007 Mean :67.01
Argentina : 1 Oceania : 2 3rd Qu.:2007 3rd Qu.:76.41
Australia : 1 Max. :2007 Max. :82.60
(Other) :136
pop gdpPercap gdp_factor
Min. :1.996e+05 Min. : 277.6 high gdp:47
1st Qu.:4.508e+06 1st Qu.: 1624.8 low gdp :95
Median :1.052e+07 Median : 6124.4
Mean :4.402e+07 Mean :11680.1
3rd Qu.:3.121e+07 3rd Qu.:18008.8
Max. :1.319e+09 Max. :49357.2
below_avg_life_exp
high life expectancy:85
low life expectancy :57
R can perform correlation with the cor()
function. Built-in to the base distribution of the program are three routines; for Pearson, Kendal and Spearman Rank correlations.
The pseudo code to get the correlation coefficient would be:
cor(var1, var2, method = "method")
The default method is “pearson” so you may omit this if that is what you want. If you type “kendall” or “spearman” then you will get the appropriate correlation coefficient.
Correlation coefficients - The default correlation returns the pearson correlation coefficient: cor(var1, var2)
- If you specify “spearman” you will get the spearman correlation coefficient: cor(var1, var2, method = "spearman")
- If you use a datset instead of separate variables you will return a matrix of all the pairwize correlation coefficients: cor(dataset, method = "pearson")
Getting a correlation coefficient is generally only half the story; you will want to know if the relationship is significant. The cor()
function in R can be extended to provide the significance testing required. The function is cor.test()
To run a correlation test, the pseudo code would look like this:
cor.test(var1, var2, method = "method")
The default method is “pearson” so you may omit this if that is what you want. If you type “kendall” or “spearman” then you will get the appropriate significance test.
As usual with R it is a good idea to assign a variable name to your result in case you want to perfom additional operations.
Correlation Significance tests
cor.p = cor.test(var1, var2)
cor.s = cor.test(var1, var2, method = "spearman")
cor.s
So if we want to test the correlation between life expectancy and GDP per capita, using a pearson correlation, then we would do the following:
gapminder_corr <- cor.test(gapminder_2007$lifeExp, gapminder_2007$gdpPercap)
gapminder_corr
Pearson's product-moment correlation
data: gapminder_2007$lifeExp and gapminder_2007$gdpPercap
t = 10.933, df = 140, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5786217 0.7585843
sample estimates:
cor
0.6786624
You can also plot the correlation coefficient on top of a scatterplot of the variables:
ggplot(gapminder_2007, aes(x = lifeExp, y = gdpPercap)) + geom_point() + geom_smooth(colour = "red",
fill = "lightgreen", method = "lm")
Finally, you can get a correlation matrix, to assess the relationship between all your variables in your data set. For this, we need to select only the quantitative variables in our data set. We should be experts at such subsetting by now, so let’s do this by selecting the quantitative variables:
quant_gap <- gapminder_2007 %>% select(lifeExp, pop, gdpPercap)
Now print a correlation table of all the quantitative variables:
res <- cor(quant_gap)
round(res, 2)
lifeExp pop gdpPercap
lifeExp 1.00 0.05 0.68
pop 0.05 1.00 -0.06
gdpPercap 0.68 -0.06 1.00
You can also use the corrplot
package to visualise this:
install.packages("corrplot")
library(corrplot)
corrplot 0.84 loaded
corrplot(cor(quant_gap), type = "upper", order = "hclust", tl.col = "black",
tl.srt = 45)
The function corrplot()
takes the correlation matrix as the first argument. The second argument (type="upper")
is used to display only the upper triangular of the correlation matrix.
So you want to look at a cross-tab between two variables. This is as easy as using the table()
function, and passing it the two variables as arguments. For example, if we want to look at the relationship between our two dichotomous variables we created about high and low gdp and also below average life expectancy or not, we can just type:
table(gapminder_2007$gdp_factor, gapminder_2007$below_avg_life_exp)
high life expectancy low life expectancy
high gdp 44 3
low gdp 41 54
Excellent, now we can run a chi square test with the… chisq.test()
function. Inside the brackets you have to pass a frequency table. Like the one we made below. This can be something you saved as an object, or you can just recreate the table inside the function.
Very straightforward. Now what we want to do is save this to an object:
life_exp_v_gdp <- chisq.test(table(gapminder_2007$gdp_factor, gapminder_2007$below_avg_life_exp))
Now, we can call all sort of information from this chi square test object:
Your test statistic:
life_exp_v_gdp$statistic
X-squared
31.25234
Your degrees of freedom:
life_exp_v_gdp$parameter
df
1
Your p value:
life_exp_v_gdp$p.value
[1] 2.265738e-08
Your expected counts for the cells:
life_exp_v_gdp$expected
high life expectancy low life expectancy
high gdp 28.1338 18.8662
low gdp 56.8662 38.1338
Your residuals:
life_exp_v_gdp$residuals
high life expectancy low life expectancy
high gdp 2.991291 -3.652840
low gdp -2.104000 2.569318
Your standardized residuals:
life_exp_v_gdp$stdres
high life expectancy low life expectancy
high gdp 5.772285 -5.772285
low gdp -5.772285 5.772285
However if you want to see it all together, you can use the CrossTable()
function from the gmodels
package:
install.packages("gmodels")
You can even specify for your table to follow the format you might be familiar with from SPSS or SAS. Use ?CrossTable()
to see all the parameters you can specify with this function:
library(gmodels)
CrossTable(gapminder_2007$gdp_factor, gapminder_2007$below_avg_life_exp, prop.t = FALSE,
prop.r = FALSE, prop.c = FALSE, expected = TRUE, resid = TRUE, sresid = TRUE,
format = "SPSS")
Cell Contents
|-------------------------|
| Count |
| Expected Values |
| Chi-square contribution |
| Residual |
| Std Residual |
|-------------------------|
Total Observations in Table: 142
| gapminder_2007$below_avg_life_exp
gapminder_2007$gdp_factor | high life expectancy | low life expectancy | Row Total |
--------------------------|----------------------|----------------------|----------------------|
high gdp | 44 | 3 | 47 |
| 28.134 | 18.866 | |
| 8.948 | 13.343 | |
| 15.866 | -15.866 | |
| 2.991 | -3.653 | |
--------------------------|----------------------|----------------------|----------------------|
low gdp | 41 | 54 | 95 |
| 56.866 | 38.134 | |
| 4.427 | 6.601 | |
| -15.866 | 15.866 | |
| -2.104 | 2.569 | |
--------------------------|----------------------|----------------------|----------------------|
Column Total | 85 | 57 | 142 |
--------------------------|----------------------|----------------------|----------------------|
Statistics for All Table Factors
Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 = 33.31927 d.f. = 1 p = 7.820385e-09
Pearson's Chi-squared test with Yates' continuity correction
------------------------------------------------------------
Chi^2 = 31.25234 d.f. = 1 p = 2.265738e-08
Minimum expected frequency: 18.8662
Now since we are all stats-minded people, you may have noticed that we have a very low cell count in one of our cells in our contingency table. Since this can call into doubt our findings, it’s always helpful to use tests that account for these issues as well. In this case, the test would be a Fischer exact test, which you carry out in R with the fisher-test()
function. This function also just needs a contingency table passed to it as an argument:
fisher.test(table(gapminder_2007$gdp_factor, gapminder_2007$below_avg_life_exp))
Fisher's Exact Test for Count Data
data: table(gapminder_2007$gdp_factor, gapminder_2007$below_avg_life_exp)
p-value = 1.457e-09
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
5.479616 102.155242
sample estimates:
odds ratio
18.95314
To get the odds ratio from a 2 x 2 cross tab, you can use the oddsratio()
function from the vcd
package.
install.packages("vcd")
And again, all you need to pass it, is a contingency table, and an arument to specify that you want odds ratio, and not log odds
library(vcd)
Loading required package: grid
oddsratio(table(gapminder_2007$gdp_factor, gapminder_2007$below_avg_life_exp),
log = FALSE)
odds ratios for and
[1] 19.31707
You can see how it’s blank where it says “and”. If you passes it a table with column names, it would tell you what your odds ratio is being calculated for. But luckily with the code right there, we still see, and if we were to forge the values, we can always just re-print the table. In this case, we know now that the odds of having below-average life expectancy for low gdp countries are 19 times the odds for high gdp countries.
But we know that we had issues with the small cell count, and in general in stats we like to be explicit about our uncertainties, so let’s also include some confidence intervals here. So a final step is to do this with the confint()
function:
confint(oddsratio(table(gapminder_2007$gdp_factor, gapminder_2007$below_avg_life_exp),
log = FALSE))
2.5 %
high gdp:low gdp/high life expectancy:low life expectancy 5.601292
97.5 %
high gdp:low gdp/high life expectancy:low life expectancy 66.61844
Finally we want to look at the differences in a numeric value between groups. As mentioned, we don’t go into the stats, just how to run the stats in R, but we just wanted to mention that what we cover here, the t-test and the anova test, both have the assumption that you can run them if the numeric variable meets the assumption of normal distribution. If not, we would want to use non-parametric alternatives. Here is a quick guide on these, that will give you pointers on how to run these test. But here we will focus just on the t-test and the anova test.
For categorical variables with two possible values, we would use the t-test, if the numeric variable meets the assumption of normal distribution. By now you should be able to guess what functions are called in R! Any ideas for the t-test?
Well… the function is called t.test()
. The R function t.test()
can be used to perform both one and two sample t-tests on vectors of data. The function contains a variety of options and can be called as follows:
t.test(x ~ y, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE,
var.equal = FALSE, conf.level = 0.95)
Here x is a numeric vector of data values and y is a binary factor. The option mu provides a number indicating the true value of the mean (or difference in means if you are performing a two sample test) under the null hypothesis. The option alternative is a character string specifying the alternative hypothesis, and must be one of the following: “two.sided” (which is the default), “greater” or “less” depending on whether the alternative hypothesis is that the mean is different than, greater than or less than mu, respectively.
For example the following call:
t.test(x, alternative = "less", mu = 10)
performs a one sample t-test on the data contained in x where the null hypothesis is that =10 and the alternative is that <10. The option paired indicates whether or not you want a paired t-test (TRUE = yes and FALSE = no). If you leave this option out it defaults to FALSE. The option var.equal is a logical variable indicating whether or not to assume the two variances as being equal when performing a two-sample t-test. If TRUE then the pooled variance is used to estimate the variance otherwise the Welch (or Satterthwaite) approximation to the degrees of freedom is used. If you leave this option out it defaults to FALSE. Finally, the option conf.level determines the confidence level of the reported confidence interval for in the one-sample case and 1- 2 in the two-sample case.
So if we want to look at life expectancy between high and low gdp countries, using a t-test, we would:
t.test(gapminder_2007$lifeExp ~ gapminder_2007$gdp_factor, alternative = "two.sided",
paired = FALSE, conf.level = 0.95)
Welch Two Sample t-test
data: gapminder_2007$lifeExp by gapminder_2007$gdp_factor
t = 9.4568, df = 133.89, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
11.40431 17.43610
sample estimates:
mean in group high gdp mean in group low gdp
76.65474 62.23454
So now you can say that the difference in mean life expectancy between high and low gdp countries is statistically significant, and is between 11 - 17 years.
And what about the difference in life expectancy between continents? The function here is called aov()
and the setup is very similar to the t.test():
fit <- aov(x ~ y, data = mydataframe)
Where x refers to the numeric variable, and y to the categorical variable with multiple possible values. You also specify the data souce, so you only have to use the column names as the x and the y values. For other variations of ANOVA, eg two factor design, see this quick start guide
fit <- aov(lifeExp ~ continent, data = gapminder_2007)
summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
continent 4 13061 3265 59.71 <2e-16 ***
Residuals 137 7491 55
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You also want to carry out post-hoc testing:
TukeyHSD(fit)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = lifeExp ~ continent, data = gapminder_2007)
$continent
diff lwr upr p adj
Americas-Africa 18.802082 13.827042 23.777121 0.0000000
Asia-Africa 15.922446 11.372842 20.472051 0.0000000
Europe-Africa 22.842562 18.155858 27.529265 0.0000000
Oceania-Africa 25.913462 11.183451 40.643472 0.0000305
Asia-Americas -2.879635 -8.299767 2.540497 0.5844901
Europe-Americas 4.040480 -1.495233 9.576193 0.2630707
Oceania-Americas 7.111380 -7.910342 22.133102 0.6862848
Europe-Asia 6.920115 1.763372 12.076858 0.0027416
Oceania-Asia 9.991015 -4.895221 24.877251 0.3464970
Oceania-Europe 3.070900 -11.857807 17.999607 0.9793776
glm()
is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.
fit_2 <- glm(lifeExp ~ continent + gdpPercap, data = gapminder_2007)
summary(fit_2)
Call:
glm(formula = lifeExp ~ continent + gdpPercap, data = gapminder_2007)
Deviance Residuals:
Min 1Q Median 3Q Max
-22.9145 -2.8518 0.1407 2.8881 20.0479
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.374e+01 9.284e-01 57.881 < 2e-16 ***
continentAmericas 1.606e+01 1.662e+00 9.663 < 2e-16 ***
continentAsia 1.267e+01 1.557e+00 8.137 2.27e-13 ***
continentEurope 1.523e+01 1.956e+00 7.786 1.57e-12 ***
continentOceania 1.665e+01 4.974e+00 3.348 0.00105 **
gdpPercap 3.466e-04 5.674e-05 6.109 9.89e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 43.22029)
Null deviance: 20552 on 141 degrees of freedom
Residual deviance: 5878 on 136 degrees of freedom
AIC: 945.66
Number of Fisher Scoring iterations: 2
Important to specify what sort of general linear model you want to use. You can do this with the Family parameter. The default is gaussian, which runs an OLS regression which assumes a normal distribution of your dependent variable. If you have a categorical dichotomous variable as your dependent variable for example, this follows a binomial distribution and you want to run a logistic regression. To do this, you would set family to binomial like so:
fit_2 <- glm(dependent_variable ~ independent_variable_1 + independent_variable_2,
data = mydata, family = "binomial")
If your outcome variable follows a poisson distribution, and you need to run a poisson regression then set family parameter to poisson:
fit_2 <- glm(dependent_variable ~ independent_variable_1 + independent_variable_2,
data = mydata, family = "poisson")
R has various functions for model diagnostics.
For example the package car has a function residual plots
which produces residual plots from your model. So from our fit_2 above, we can see our residual plots by passing the regression to the function:
library(car)
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
residualPlots(fit_2)
Test stat Pr(>|t|)
continent NA NA
gdpPercap 578.662 0
When you run the residualPlots()
function R will also print two numerical tests.
First we have a curvature test for each of the plots by adding a quadratic term and testing the quadratic to be zero (more on this in a few sections). This is Tukey’s test for nonadditivity when plotting against fitted values. When this test is significant it may indicate a lack of fit for this particular predictor.
The Tukey test optimally should not be significant. We can see in the first of the three plots that the red line is a bit curved. It is this pattern that the printed tests are picking up.
We can further diagnose the model by printing marginal model plots using the marginalModelPlots()
function of the car
package.
marginalModelPlots(fit_2)
This will plot a scatterplot for each predictor variable against the response variable. This displays the conditional distribution of the response given each predictor, ignoring the other predictors. They are called marginal plots because they show the marginal relationship between the outcome and each predictor. It will also print a scatterplot of the response versus the fitted value displaying the conditional distribution of the outcome given the fit of the model. We observe here the curvature that was already identified by the previous plots (notice the blue line).
We can also use the marginalModelPlots()
function to assess the homogeneity of variance assumption using the following argument:
marginalModelPlots(fit_2, sd = TRUE)
This will print the estimated standard deviation lines to the graph. You would want this to be constant across the X axis.
We could plot the density estimate of the residuals:
plot(density(rstudent(fit_2)))
We can see that the right hand side tail is heavier than it should be.
We can also run a formal test for outliers using the outlierTest()
function from the car
package. This function locates the largest Studentised residuals in absolute value and computes a Bonferroni-corrected t test. If the probability value associated with this test is smaller than your alpha value, then you can conclude the observation you are dealing with is likely an outlier.
outlierTest(fit_2)
rstudent unadjusted p-value Bonferonni p
1 -3.721986 0.00019766 0.028068
One solution to this is to produce graphical displays of the relationship such as those that can be obtained with the effects
package.
library(effects)
Loading required package: carData
Attaching package: 'carData'
The following objects are masked from 'package:car':
Guyer, UN, Vocab
lattice theme set by effectsTheme()
See ?effectsTheme for details.
plot(allEffects(fit_2))
You can also create a nice looking forest plot, which looks good in presentations, using the plot_model()
function frmo the sjPlot package.
library(sjPlot)
plot_model(fit_2, show.values = TRUE)
It’s also possible to easily extract the data from this plot if you want to report your findings in a table. Try the following:
p1 <- plot_model(fit_2) # assign the plot to an object
results.df <- p1$data # pull out the data and assign it to a new object
View(results.df) # view the results in a new window