class: center, middle, inverse, title-slide

# Getting Started in R
an introduction to data analysis and visualisation
## Modeling Data
### Réka Solymosi, Sam Langton & Emily Buehler
### 4 July 2019

---
class: inverse, center, middle

# Modeling Data

---

## Making inferences from data

---

## Univariate

### Summary of one variable

---

## Univariate

For all variables in dataframe

```r
summary(gapminder)
```

```
##         country        continent        year         lifeExp
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85
##  Australia  :  12                  Max.   :2007   Max.   :82.60
##  (Other)    :1632
##       pop              gdpPercap
##  Min.   :6.001e+04   Min.   :   241.2
##  1st Qu.:2.794e+06   1st Qu.:  1202.1
##  Median :7.024e+06   Median :  3531.8
##  Mean   :2.960e+07   Mean   :  7215.3
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5
##  Max.   :1.319e+09   Max.   :113523.1
##
```

---

## Univariate

For all variables in dataframe with skimr

```r
library(skimr)
```

---

```r
gapminder %>% skim()
```

```
## Skim summary statistics
##  n obs: 1704
##  n variables: 6
##
## -- Variable type:factor --------------------------------------------------------------
##   variable missing complete    n n_unique
##  continent       0     1704 1704        5
##    country       0     1704 1704      142
##                              top_counts ordered
##  Afr: 624, Asi: 396, Eur: 360, Ame: 300   FALSE
##      Afg: 12, Alb: 12, Alg: 12, Ang: 12   FALSE
##
## -- Variable type:integer -------------------------------------------------------------
##  variable missing complete    n   mean      sd    p0     p25    p50
##       pop       0     1704 1704  3e+07 1.1e+08 60011 2793664  7e+06
##      year       0     1704 1704 1979.5   17.27  1952 1965.75 1979.5
##      p75    p100     hist
##    2e+07 1.3e+09 ▇▁▁▁▁▁▁▁
##  1993.25    2007 ▇▃▇▃▃▇▃▇
##
## -- Variable type:numeric -------------------------------------------------------------
##   variable missing complete    n    mean      sd     p0     p25     p50
##  gdpPercap       0     1704 1704 7215.33 9857.45 241.17 1202.06 3531.85
##    lifeExp       0     1704 1704   59.47   12.92   23.6    48.2   60.71
##      p75      p100     hist
##  9325.46 113523.13 ▇▁▁▁▁▁▁▁
##    70.85      82.6 ▁▂▅▅▅▅▇▃
```

---

## Bivariate stats

### making comparisons

---

## Numeric - numeric

```r
cor(var1, var2, method = "pearson") # kendall, spearman
```

```r
cor.test(var1, var2, method = "pearson") # kendall, spearman
```

---

## Categorical - categorical

### Cross tabs

```r
chisq.test(table(var1, var2))
```

```r
fisher.test(table(var1, var2))
```

---

## Categorical - categorical

### Odds ratio

```r
library(vcd)
oddsratio(table(var1, var2))
```

---

## Categorical - numeric

```r
# t-test
t.test(var1 ~ var2)

# anova
aov(var1 ~ var2, data=mydataframe)
```

---

## Multivariable

```r
fit_1 <- glm(dependent_var ~ independent_var_1 + independent_var_2,
             data = mydata)
summary(fit_1)
```

---

## Multivariable

```r
fit_2 <- glm(lifeExp ~ continent + gdpPercap, data = gapminder)
summary(fit_2)
```

```
##
## Call:
## glm(formula = lifeExp ~ continent + gdpPercap, data = gapminder)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -49.241   -4.479    0.347    5.105   25.138
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)       4.789e+01  3.398e-01  140.93   <2e-16 ***
## continentAmericas 1.359e+01  6.008e-01   22.62   <2e-16 ***
## continentAsia     8.658e+00  5.555e-01   15.59   <2e-16 ***
## continentEurope   1.757e+01  6.257e-01   28.08   <2e-16 ***
## continentOceania  1.815e+01  1.787e+00   10.15   <2e-16 ***
## gdpPercap         4.453e-04  2.350e-05   18.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 70.39368)
##
##     Null deviance: 284148  on 1703  degrees of freedom
## Residual deviance: 119528  on 1698  degrees of freedom
## AIC: 12093
##
## Number of Fisher Scoring iterations: 2
```

---

## Multivariable

```r
library(car)
residualPlots(fit_2)
```

<img src="model_slides_files/figure-html/unnamed-chunk-13-1.png" height="450px" />

```
##           Test stat Pr(>|Test stat|)
## continent
## gdpPercap     19809        < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

## Multivariable

```r
marginalModelPlots(fit_2)
```

<img src="model_slides_files/figure-html/unnamed-chunk-14-1.png" height="350px" />

---

## Multivariable

```r
library(effects)
plot(allEffects(fit_2))
```

<img src="model_slides_files/figure-html/unnamed-chunk-15-1.png" height="350px" />

---

## Multivariable

```r
library(sjPlot)
plot_model(fit_2, show.values = TRUE)
```

<img src="model_slides_files/figure-html/unnamed-chunk-16-1.png" height="350px" />

---

## Multivariable

```r
p1 <- plot_model(fit_2)
```

```r
ggplot(p1$data) +
  theme_minimal() +
  geom_vline(xintercept = 1, linetype = "dashed") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, y = term), height = 0.2) +
  geom_point(aes(y = term, x = estimate), size = 3, col = "red") +
  labs(y = " ", x = "Estimate") +
  xlim(0,23) +
  scale_y_discrete(labels = c("GDP","Oceania","Europe","Asia","America"))
```

---

## Multivariable

<img src="model_slides_files/figure-html/unnamed-chunk-20-1.png" width="600px" style="display: block; margin: auto;" />
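
---

## Worked example

The bivariate and multivariable slides use placeholder names (`var1`, `var2`, `mydata`), so here is one way those calls might look on real columns. This is a minimal sketch, assuming the built-in `mtcars` data as a stand-in (it is not the data used in this deck). The object names `cars`, `ct`, `ft`, `tt`, `fit_glm` and `fit_lm` are chosen only for illustration.

```r
# Stand-in data: built-in mtcars (an assumption; the deck itself uses gapminder)
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))
cars$vs <- factor(cars$vs, labels = c("v-shaped", "straight"))

# Numeric - numeric: Pearson correlation with a significance test
ct <- cor.test(cars$mpg, cars$wt, method = "pearson")

# Categorical - categorical: Fisher's exact test on a 2 x 2 cross tab
ft <- fisher.test(table(cars$am, cars$vs))

# Categorical - numeric: t-test of mpg by transmission type
tt <- t.test(mpg ~ am, data = cars)

# Multivariable: glm() with its default gaussian family
# estimates the same coefficients as lm()
fit_glm <- glm(mpg ~ wt + am, data = cars)
fit_lm  <- lm(mpg ~ wt + am, data = cars)
all.equal(coef(fit_glm), coef(fit_lm))
```

Each test returns an object whose pieces can be pulled out directly, for example `ct$estimate` or `tt$p.value`. The final line is only a check that, with no `family` argument, `glm()` is fitting an ordinary linear model.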