So now that you are familiar with the R syntax, know how to import data, and know how to get that data into the shape and form that you need for your analysis, we can begin to perform some basic operations to explore your data.

In your normal workflow, we are conscious that lot of your data exploration will come from visualisation. We cover visualisation tomorrow though, so we will stick to your basic stats in this session. We will cover

Univariate analysis (summary stats)
Bivariate analysis ( chi-square test, t-test, anova )

We just want to note that this tutrial is not meant to be a primer in stats, rather just an illustration of how to do these stats in R. If you are not familiar with all of these techniques there are a variety of statistics resources out there.

library(tidyverse)

The data

We will be working through the gapminder data set again. To load this data set into your environment, load the package:

library(gapminder)

But we also want to do some data manipulation by filtering and creating a new variable we will use later.

gapminder_2007 <- gapminder %>% filter(year == 2007) %>% mutate(gdp_factor = if_else(gdpPercap > 
    mean(gdpPercap), "high gdp", "low gdp"), below_avg_life_exp = if_else(lifeExp < 
    mean(lifeExp), "low life expectancy", "high life expectancy"))

gapminder_2007$gdp_factor <- as.factor(gapminder_2007$gdp_factor)
gapminder_2007$below_avg_life_exp <- as.factor(gapminder_2007$below_avg_life_exp)

Univariate stats

summary(gapminder_2007$gdpPercap)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  277.6  1624.8  6124.4 11680.1 18008.8 49357.2

sd(gapminder_2007$gdpPercap)

[1] 12859.94

summary(gapminder_2007$gdp_factor)

high gdp  low gdp 
      47       95

For the whole dataframe

summary(gapminder_2007)

        country       continent       year         lifeExp     
 Afghanistan:  1   Africa  :52   Min.   :2007   Min.   :39.61  
 Albania    :  1   Americas:25   1st Qu.:2007   1st Qu.:57.16  
 Algeria    :  1   Asia    :33   Median :2007   Median :71.94  
 Angola     :  1   Europe  :30   Mean   :2007   Mean   :67.01  
 Argentina  :  1   Oceania : 2   3rd Qu.:2007   3rd Qu.:76.41  
 Australia  :  1                 Max.   :2007   Max.   :82.60  
 (Other)    :136                                               
      pop              gdpPercap          gdp_factor
 Min.   :1.996e+05   Min.   :  277.6   high gdp:47  
 1st Qu.:4.508e+06   1st Qu.: 1624.8   low gdp :95  
 Median :1.052e+07   Median : 6124.4                
 Mean   :4.402e+07   Mean   :11680.1                
 3rd Qu.:3.121e+07   3rd Qu.:18008.8                
 Max.   :1.319e+09   Max.   :49357.2                
                                                    
            below_avg_life_exp
 high life expectancy:85      
 low life expectancy :57

Bivariate stats

Numeric - numeric

Correlation

R can perform correlation with the cor() function. Built-in to the base distribution of the program are three routines; for Pearson, Kendal and Spearman Rank correlations.

The pseudo code to get the correlation coefficient would be:

cor(var1, var2, method = "method")

The default method is “pearson” so you may omit this if that is what you want. If you type “kendall” or “spearman” then you will get the appropriate correlation coefficient.

Correlation coefficients - The default correlation returns the pearson correlation coefficient: cor(var1, var2) - If you specify “spearman” you will get the spearman correlation coefficient: cor(var1, var2, method = "spearman") - If you use a datset instead of separate variables you will return a matrix of all the pairwize correlation coefficients: cor(dataset, method = "pearson")

Getting a correlation coefficient is generally only half the story; you will want to know if the relationship is significant. The cor() function in R can be extended to provide the significance testing required. The function is cor.test()

To run a correlation test, the pseudo code would look like this:

cor.test(var1, var2, method = "method")

The default method is “pearson” so you may omit this if that is what you want. If you type “kendall” or “spearman” then you will get the appropriate significance test.

As usual with R it is a good idea to assign a variable name to your result in case you want to perfom additional operations.

Correlation Significance tests

The default method is “pearson” cor.p = cor.test(var1, var2)
If you specify “spearman” you will get the spearman correlation coefficient: cor.s = cor.test(var1, var2, method = "spearman")
To see a summary of your correlation test type the name of the variable e.g. cor.s

So if we want to test the correlation between life expectancy and GDP per capita, using a pearson correlation, then we would do the following:

gapminder_corr <- cor.test(gapminder_2007$lifeExp, gapminder_2007$gdpPercap)
gapminder_corr


    Pearson's product-moment correlation

data:  gapminder_2007$lifeExp and gapminder_2007$gdpPercap
t = 10.933, df = 140, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5786217 0.7585843
sample estimates:
      cor 
0.6786624

You can also plot the correlation coefficient on top of a scatterplot of the variables:

ggplot(gapminder_2007, aes(x = lifeExp, y = gdpPercap)) + geom_point() + geom_smooth(colour = "red", 
    fill = "lightgreen", method = "lm")

Finally, you can get a correlation matrix, to assess the relationship between all your variables in your data set. For this, we need to select only the quantitative variables in our data set. We should be experts at such subsetting by now, so let’s do this by selecting the quantitative variables:

quant_gap <- gapminder_2007 %>% select(lifeExp, pop, gdpPercap)

Now print a correlation table of all the quantitative variables:

res <- cor(quant_gap)
round(res, 2)

          lifeExp   pop gdpPercap
lifeExp      1.00  0.05      0.68
pop          0.05  1.00     -0.06
gdpPercap    0.68 -0.06      1.00

You can also use the corrplot package to visualise this:

install.packages("corrplot")

library(corrplot)

corrplot 0.84 loaded

corrplot(cor(quant_gap), type = "upper", order = "hclust", tl.col = "black", 
    tl.srt = 45)

The function corrplot() takes the correlation matrix as the first argument. The second argument (type="upper") is used to display only the upper triangular of the correlation matrix.

Categorical - categorical

Chi-square

So you want to look at a cross-tab between two variables. This is as easy as using the table() function, and passing it the two variables as arguments. For example, if we want to look at the relationship between our two dichotomous variables we created about high and low gdp and also below average life expectancy or not, we can just type:

table(gapminder_2007$gdp_factor, gapminder_2007$below_avg_life_exp)

          
           high life expectancy low life expectancy
  high gdp                   44                   3
  low gdp                    41                  54

Excellent, now we can run a chi square test with the… chisq.test() function. Inside the brackets you have to pass a frequency table. Like the one we made below. This can be something you saved as an object, or you can just recreate the table inside the function.

Very straightforward. Now what we want to do is save this to an object:

life_exp_v_gdp <- chisq.test(table(gapminder_2007$gdp_factor, gapminder_2007$below_avg_life_exp))

Now, we can call all sort of information from this chi square test object:

Your test statistic:

life_exp_v_gdp$statistic

X-squared 
 31.25234

Your degrees of freedom:

life_exp_v_gdp$parameter

df 
 1

Your p value:

life_exp_v_gdp$p.value

[1] 2.265738e-08

Your expected counts for the cells:

life_exp_v_gdp$expected

          
           high life expectancy low life expectancy
  high gdp              28.1338             18.8662
  low gdp               56.8662             38.1338

Your residuals:

life_exp_v_gdp$residuals

          
           high life expectancy low life expectancy
  high gdp             2.991291           -3.652840
  low gdp             -2.104000            2.569318

Your standardized residuals:

life_exp_v_gdp$stdres

          
           high life expectancy low life expectancy
  high gdp             5.772285           -5.772285
  low gdp             -5.772285            5.772285

However if you want to see it all together, you can use the CrossTable() function from the gmodels package:

install.packages("gmodels")

You can even specify for your table to follow the format you might be familiar with from SPSS or SAS. Use ?CrossTable() to see all the parameters you can specify with this function:

library(gmodels)
CrossTable(gapminder_2007$gdp_factor, gapminder_2007$below_avg_life_exp, prop.t = FALSE, 
    prop.r = FALSE, prop.c = FALSE, expected = TRUE, resid = TRUE, sresid = TRUE, 
    format = "SPSS")


   Cell Contents
|-------------------------|
|                   Count |
|         Expected Values |
| Chi-square contribution |
|                Residual |
|            Std Residual |
|-------------------------|

Total Observations in Table:  142 

                          | gapminder_2007$below_avg_life_exp 
gapminder_2007$gdp_factor | high life expectancy  |  low life expectancy  |            Row Total | 
--------------------------|----------------------|----------------------|----------------------|
                 high gdp |                  44  |                   3  |                  47  | 
                          |              28.134  |              18.866  |                      | 
                          |               8.948  |              13.343  |                      | 
                          |              15.866  |             -15.866  |                      | 
                          |               2.991  |              -3.653  |                      | 
--------------------------|----------------------|----------------------|----------------------|
                  low gdp |                  41  |                  54  |                  95  | 
                          |              56.866  |              38.134  |                      | 
                          |               4.427  |               6.601  |                      | 
                          |             -15.866  |              15.866  |                      | 
                          |              -2.104  |               2.569  |                      | 
--------------------------|----------------------|----------------------|----------------------|
             Column Total |                  85  |                  57  |                 142  | 
--------------------------|----------------------|----------------------|----------------------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  33.31927     d.f. =  1     p =  7.820385e-09 

Pearson's Chi-squared test with Yates' continuity correction 
------------------------------------------------------------
Chi^2 =  31.25234     d.f. =  1     p =  2.265738e-08 

 
       Minimum expected frequency: 18.8662

Fisher test

Now since we are all stats-minded people, you may have noticed that we have a very low cell count in one of our cells in our contingency table. Since this can call into doubt our findings, it’s always helpful to use tests that account for these issues as well. In this case, the test would be a Fischer exact test, which you carry out in R with the fisher-test() function. This function also just needs a contingency table passed to it as an argument:

fisher.test(table(gapminder_2007$gdp_factor, gapminder_2007$below_avg_life_exp))


    Fisher's Exact Test for Count Data

data:  table(gapminder_2007$gdp_factor, gapminder_2007$below_avg_life_exp)
p-value = 1.457e-09
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
   5.479616 102.155242
sample estimates:
odds ratio 
  18.95314

Odds ratio

To get the odds ratio from a 2 x 2 cross tab, you can use the oddsratio() function from the vcd package.

install.packages("vcd")

And again, all you need to pass it, is a contingency table, and an arument to specify that you want odds ratio, and not log odds

library(vcd)

Loading required package: grid

oddsratio(table(gapminder_2007$gdp_factor, gapminder_2007$below_avg_life_exp), 
    log = FALSE)

 odds ratios for  and  

[1] 19.31707

You can see how it’s blank where it says “and”. If you passes it a table with column names, it would tell you what your odds ratio is being calculated for. But luckily with the code right there, we still see, and if we were to forge the values, we can always just re-print the table. In this case, we know now that the odds of having below-average life expectancy for low gdp countries are 19 times the odds for high gdp countries.

But we know that we had issues with the small cell count, and in general in stats we like to be explicit about our uncertainties, so let’s also include some confidence intervals here. So a final step is to do this with the confint() function:

confint(oddsratio(table(gapminder_2007$gdp_factor, gapminder_2007$below_avg_life_exp), 
    log = FALSE))

                                                             2.5 %
high gdp:low gdp/high life expectancy:low life expectancy 5.601292
                                                            97.5 %
high gdp:low gdp/high life expectancy:low life expectancy 66.61844

Categorical - numeric

Finally we want to look at the differences in a numeric value between groups. As mentioned, we don’t go into the stats, just how to run the stats in R, but we just wanted to mention that what we cover here, the t-test and the anova test, both have the assumption that you can run them if the numeric variable meets the assumption of normal distribution. If not, we would want to use non-parametric alternatives. Here is a quick guide on these, that will give you pointers on how to run these test. But here we will focus just on the t-test and the anova test.

T-test

For categorical variables with two possible values, we would use the t-test, if the numeric variable meets the assumption of normal distribution. By now you should be able to guess what functions are called in R! Any ideas for the t-test?

Well… the function is called t.test(). The R function t.test() can be used to perform both one and two sample t-tests on vectors of data. The function contains a variety of options and can be called as follows:

t.test(x ~ y, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, 
    var.equal = FALSE, conf.level = 0.95)

Here x is a numeric vector of data values and y is a binary factor. The option mu provides a number indicating the true value of the mean (or difference in means if you are performing a two sample test) under the null hypothesis. The option alternative is a character string specifying the alternative hypothesis, and must be one of the following: “two.sided” (which is the default), “greater” or “less” depending on whether the alternative hypothesis is that the mean is different than, greater than or less than mu, respectively.

For example the following call:

t.test(x, alternative = "less", mu = 10)

performs a one sample t-test on the data contained in x where the null hypothesis is that =10 and the alternative is that <10. The option paired indicates whether or not you want a paired t-test (TRUE = yes and FALSE = no). If you leave this option out it defaults to FALSE. The option var.equal is a logical variable indicating whether or not to assume the two variances as being equal when performing a two-sample t-test. If TRUE then the pooled variance is used to estimate the variance otherwise the Welch (or Satterthwaite) approximation to the degrees of freedom is used. If you leave this option out it defaults to FALSE. Finally, the option conf.level determines the confidence level of the reported confidence interval for in the one-sample case and 1- 2 in the two-sample case.

So if we want to look at life expectancy between high and low gdp countries, using a t-test, we would:

t.test(gapminder_2007$lifeExp ~ gapminder_2007$gdp_factor, alternative = "two.sided", 
    paired = FALSE, conf.level = 0.95)


    Welch Two Sample t-test

data:  gapminder_2007$lifeExp by gapminder_2007$gdp_factor
t = 9.4568, df = 133.89, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 11.40431 17.43610
sample estimates:
mean in group high gdp  mean in group low gdp 
              76.65474               62.23454

So now you can say that the difference in mean life expectancy between high and low gdp countries is statistically significant, and is between 11 - 17 years.

ANOVA

And what about the difference in life expectancy between continents? The function here is called aov() and the setup is very similar to the t.test():

fit <- aov(x ~ y, data = mydataframe)

Where x refers to the numeric variable, and y to the categorical variable with multiple possible values. You also specify the data souce, so you only have to use the column names as the x and the y values. For other variations of ANOVA, eg two factor design, see this quick start guide

fit <- aov(lifeExp ~ continent, data = gapminder_2007)
summary(fit)

             Df Sum Sq Mean Sq F value Pr(>F)    
continent     4  13061    3265   59.71 <2e-16 ***
Residuals   137   7491      55                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You also want to carry out post-hoc testing:

TukeyHSD(fit)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = lifeExp ~ continent, data = gapminder_2007)

$continent
                      diff        lwr       upr     p adj
Americas-Africa  18.802082  13.827042 23.777121 0.0000000
Asia-Africa      15.922446  11.372842 20.472051 0.0000000
Europe-Africa    22.842562  18.155858 27.529265 0.0000000
Oceania-Africa   25.913462  11.183451 40.643472 0.0000305
Asia-Americas    -2.879635  -8.299767  2.540497 0.5844901
Europe-Americas   4.040480  -1.495233  9.576193 0.2630707
Oceania-Americas  7.111380  -7.910342 22.133102 0.6862848
Europe-Asia       6.920115   1.763372 12.076858 0.0027416
Oceania-Asia      9.991015  -4.895221 24.877251 0.3464970
Oceania-Europe    3.070900 -11.857807 17.999607 0.9793776

Multivariable stats

glm() is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.

fit_2 <- glm(lifeExp ~ continent + gdpPercap, data = gapminder_2007)
summary(fit_2)


Call:
glm(formula = lifeExp ~ continent + gdpPercap, data = gapminder_2007)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-22.9145   -2.8518    0.1407    2.8881   20.0479  

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       5.374e+01  9.284e-01  57.881  < 2e-16 ***
continentAmericas 1.606e+01  1.662e+00   9.663  < 2e-16 ***
continentAsia     1.267e+01  1.557e+00   8.137 2.27e-13 ***
continentEurope   1.523e+01  1.956e+00   7.786 1.57e-12 ***
continentOceania  1.665e+01  4.974e+00   3.348  0.00105 ** 
gdpPercap         3.466e-04  5.674e-05   6.109 9.89e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 43.22029)

    Null deviance: 20552  on 141  degrees of freedom
Residual deviance:  5878  on 136  degrees of freedom
AIC: 945.66

Number of Fisher Scoring iterations: 2

Important to specify what sort of general linear model you want to use. You can do this with the Family parameter. The default is gaussian, which runs an OLS regression which assumes a normal distribution of your dependent variable. If you have a categorical dichotomous variable as your dependent variable for example, this follows a binomial distribution and you want to run a logistic regression. To do this, you would set family to binomial like so:

fit_2 <- glm(dependent_variable ~ independent_variable_1 + independent_variable_2, 
    data = mydata, family = "binomial")

If your outcome variable follows a poisson distribution, and you need to run a poisson regression then set family parameter to poisson:

fit_2 <- glm(dependent_variable ~ independent_variable_1 + independent_variable_2, 
    data = mydata, family = "poisson")

Diagnostics

R has various functions for model diagnostics.

For example the package car has a function residual plots which produces residual plots from your model. So from our fit_2 above, we can see our residual plots by passing the regression to the function:

library(car)


Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

residualPlots(fit_2)

          Test stat Pr(>|t|)
continent        NA       NA
gdpPercap   578.662        0

When you run the residualPlots() function R will also print two numerical tests.

First we have a curvature test for each of the plots by adding a quadratic term and testing the quadratic to be zero (more on this in a few sections). This is Tukey’s test for nonadditivity when plotting against fitted values. When this test is significant it may indicate a lack of fit for this particular predictor.

The Tukey test optimally should not be significant. We can see in the first of the three plots that the red line is a bit curved. It is this pattern that the printed tests are picking up.

We can further diagnose the model by printing marginal model plots using the marginalModelPlots() function of the car package.

marginalModelPlots(fit_2)

This will plot a scatterplot for each predictor variable against the response variable. This displays the conditional distribution of the response given each predictor, ignoring the other predictors. They are called marginal plots because they show the marginal relationship between the outcome and each predictor. It will also print a scatterplot of the response versus the fitted value displaying the conditional distribution of the outcome given the fit of the model. We observe here the curvature that was already identified by the previous plots (notice the blue line).

We can also use the marginalModelPlots() function to assess the homogeneity of variance assumption using the following argument:

marginalModelPlots(fit_2, sd = TRUE)

This will print the estimated standard deviation lines to the graph. You would want this to be constant across the X axis.

Density estimate of the residuals

We could plot the density estimate of the residuals:

plot(density(rstudent(fit_2)))

We can see that the right hand side tail is heavier than it should be.

Outliers

We can also run a formal test for outliers using the outlierTest() function from the car package. This function locates the largest Studentised residuals in absolute value and computes a Bonferroni-corrected t test. If the probability value associated with this test is smaller than your alpha value, then you can conclude the observation you are dealing with is likely an outlier.

outlierTest(fit_2)

   rstudent unadjusted p-value Bonferonni p
1 -3.721986         0.00019766     0.028068

Interpreting coefficients

One solution to this is to produce graphical displays of the relationship such as those that can be obtained with the effects package.

library(effects)

Loading required package: carData


Attaching package: 'carData'

The following objects are masked from 'package:car':

    Guyer, UN, Vocab

lattice theme set by effectsTheme()
See ?effectsTheme for details.

plot(allEffects(fit_2))

You can also create a nice looking forest plot, which looks good in presentations, using the plot_model() function frmo the sjPlot package.

library(sjPlot)
plot_model(fit_2, show.values = TRUE)

It’s also possible to easily extract the data from this plot if you want to report your findings in a table. Try the following:

p1 <- plot_model(fit_2) # assign the plot to an object
results.df <- p1$data   # pull out the data and assign it to a new object
View(results.df)        # view the results in a new window

Analysing Data in R

Reka Solymosi, Sam Langton & & Emily Buehler

4 July 2019