Web scraping and tidy text



Key functions

  • tidytext::unnest_tokens() unnests values in text column into a word column
  • tidytext::get_sentiments() retrieve sentiment lexicons

Set up

You need to have the following R packages installed and loaded:

Scrape text from a web page

The example text derives from the commencement speech given by David Foster Wallace at Kenyon College, Ohio on 21 May 2005.

Download and parse the html web page

web_page <- read_html("https://www.theguardian.com/books/2008/sep/20/fiction")

Use the SelectorGadget to identify CSS selectors in the web page

SelectorGadget is a javascript bookmarklet that you use in your web browser to identify CSS selectors in a web page. Read the vignette("selectorgadget") for more information about using SelectorGadget.

Find the html nodes that match the CSS selector

web_nodes <- html_nodes(web_page, ".js-article__body p")

Extract the text from the html nodes

web_text <- html_text(web_nodes)

Or using %>%

web_text <- read_html("https://www.theguardian.com/books/2008/sep/20/fiction") %>% 
    html_nodes(".js-article__body p") %>% html_text()

Exercise

Extract the text of J. K. Rowling’s commencement speech at Harvard University on 5 June 2008 from the following web page: http://news.harvard.edu/gazette/story/2008/06/text-of-j-k-rowling-speech/

Answer:

example_web_text <- read_html("http://news.harvard.edu/gazette/story/2008/06/text-of-j-k-rowling-speech/") %>% 
    html_nodes(".article-body p") %>% html_text()

Prepare the text for tidying

Convert to a data frame

text_df <- data_frame(text = web_text)

Trim the text

text_df <- text_df %>% slice(1:(grep("this is water.\"", text)))

Remove any blank lines

text_df <- text_df %>% filter(nzchar(text))

Add paragraph numbers

text_df <- text_df %>% mutate(paragraph = row_number())

Or using %>%

text_df <- web_text %>% data_frame(text = .) %>% slice(1:(grep("this is water.\"", 
    text))) %>% mutate(paragraph = row_number())

Exercise

  1. Convert example_web_text to a data frame using data_frame()
  2. Remove the first line and line 5 (“Sign up for daily emails with the latest Harvard news.”) using slice(). (Hint: you can use a vector in slice())
  3. Add a paragraph number

Answer:

example_text_df <- example_web_text %>% data_frame(text = .) %>% slice(c(2:14, 
    16:n())) %>% mutate(paragraph = row_number())

Tidy the text using tidytext

Unnest values in text column into a word column

words <- text_df %>% unnest_tokens(word, text)

Remove ‘stop words’

words <- words %>% anti_join(stop_words, by = "word")

Analyse the text

Most common words

words %>% count(word, sort = TRUE)

Exercise

  1. Unnest the values in the text column of example_text_df into a word column
  2. Remove the ‘stop words’
  3. List the top 10 most common words

Answer:

example_words <- example_text_df %>% unnest_tokens(word, text) %>% anti_join(stop_words, 
    by = "word")
example_words %>% count(word, sort = TRUE)

Bigrams

bigram <- text_df %>%
  unnest_tokens(word, text, token = "ngrams", n = 2) %>% # tokenise into word pairs
  separate(word, c("word1", "word2"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word, # remove 'stop words'
         !word2 %in% stop_words$word) %>%
  unite(word, word1, word2, sep = " ")

Frequency of bigrams

bigram %>% count(word, sort = TRUE) %>% top_n(10) %>% mutate(word = reorder(word, 
    n)) %>% ggplot(aes(word, n)) + geom_col(fill = "grey", alpha = 0.8) + coord_flip() + 
    scale_y_continuous(expand = c(0, 0)) + labs(x = NULL, y = "Number of mentions", 
    title = "2-word combinations in Foster Wallace's commencement speech") + 
    theme_minimal()

Sentiment analysis

Frequency of positive words

words %>% inner_join(get_sentiments("bing"), by = "word") %>% filter(sentiment == 
    "positive") %>% count(word) %>% wordcloud2(size = 0.7, fontFamily = "RobotoCondensed-Regular", 
    color = rep(c("orange", "skyblue"), length.out = nrow(.)))


Frequency of negative words

words %>% inner_join(get_sentiments("bing"), by = "word") %>% filter(sentiment == 
    "negative") %>% count(word) %>% wordcloud2(size = 0.7, fontFamily = "RobotoCondensed-Regular", 
    color = rep(c("black", "grey"), length.out = nrow(.)))


Distribution of positive and negative words

words %>% inner_join(get_sentiments("bing"), by = "word") %>% count(word, sentiment, 
    sort = TRUE) %>% ungroup() %>% filter(n > 1) %>% mutate(n = ifelse(sentiment == 
    "negative", -n, n)) %>% mutate(word = reorder(word, n)) %>% ggplot(aes(word, 
    n, fill = sentiment)) + geom_col() + labs(x = NULL, y = NULL, fill = "Sentiment") + 
    coord_flip() + theme_minimal() + theme(legend.position = "bottom")


Resources

  • Silge, J. & Robinson, D. (2017). Text Mining with R: A Tidy Approach, O’Reilly Media. Available online via: http://tidytextmining.com
LS0tCnRpdGxlOiAiIgpvdXRwdXQ6CiAgaHRtbF9ub3RlYm9vazoKICAgIHRoZW1lOiBzaW1wbGV4CiAgICBoaWdobGlnaHQ6IHRleHRtYXRlCiAgICBjb2RlX2ZvbGRpbmc6IHNob3cKLS0tCgpgYGB7ciBzZXR1cCwgaW5jbHVkZT1GQUxTRX0Ka25pdHI6Om9wdHNfY2h1bmskc2V0KGVjaG89VFJVRSwgY2FjaGU9VFJVRSwgcHJvbXB0PUZBTFNFLCB0aWR5PVRSVUUsIGNvbW1lbnQ9TkEsIG1lc3NhZ2U9RkFMU0UsIHdhcm5pbmc9RkFMU0UpCgpsaWJyYXJ5KHRpZHl2ZXJzZSkgOyBsaWJyYXJ5KHJ2ZXN0KSA7IGxpYnJhcnkodGlkeXRleHQpIDsgbGlicmFyeSh3b3JkY2xvdWQyKQpgYGAKCgojIyBXZWIgc2NyYXBpbmcgYW5kIHRpZHkgdGV4dAoKPGJyPgoKKioqCgojIyMgS2V5IGZ1bmN0aW9ucwotIGB0aWR5dGV4dDo6dW5uZXN0X3Rva2VucygpYCB1bm5lc3RzIHZhbHVlcyBpbiB0ZXh0IGNvbHVtbiBpbnRvIGEgd29yZCBjb2x1bW4KLSBgdGlkeXRleHQ6OmdldF9zZW50aW1lbnRzKClgIHJldHJpZXZlIHNlbnRpbWVudCBsZXhpY29ucwoKKioqCgojIyMgU2V0IHVwCllvdSBuZWVkIHRvIGhhdmUgdGhlIGZvbGxvd2luZyBSIHBhY2thZ2VzIGluc3RhbGxlZCBhbmQgbG9hZGVkOgoKLSBbYHRpZHl2ZXJzZWBdKGh0dHBzOi8vY3Jhbi5yLXByb2plY3Qub3JnL3dlYi9wYWNrYWdlcy90aWR5cGFydC9pbmRleC5odG1sKQotIFtgcnZlc3RgXShodHRwczovL2NyYW4uci1wcm9qZWN0Lm9yZy93ZWIvcGFja2FnZXMvcnZlc3QvaW5kZXguaHRtbCkKLSBbYHRpZHl0ZXh0YF0oaHR0cHM6Ly9jcmFuLnItcHJvamVjdC5vcmcvd2ViL3BhY2thZ2VzL3RpZHl0ZXh0L2luZGV4Lmh0bWwpCi0gW2B3b3JkY2xvdWQyYF0oaHR0cHM6Ly9jcmFuLnItcHJvamVjdC5vcmcvd2ViL3BhY2thZ2VzL3dvcmRjbG91ZDIvaW5kZXguaHRtbCkgICAgCgojIyMgU2NyYXBlIHRleHQgZnJvbSBhIHdlYiBwYWdlClRoZSBleGFtcGxlIHRleHQgZGVyaXZlcyBmcm9tIHRoZSBjb21tZW5jZW1lbnQgc3BlZWNoIGdpdmVuIGJ5IFtEYXZpZCBGb3N0ZXIgV2FsbGFjZV0oaHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvRGF2aWRfRm9zdGVyX1dhbGxhY2UpIGF0IEtlbnlvbiBDb2xsZWdlLCBPaGlvIG9uIDIxIE1heSAyMDA1LiAgICAgIAoKIyMjIyBEb3dubG9hZCBhbmQgcGFyc2UgdGhlIGh0bWwgd2ViIHBhZ2UKYGBge3IsIGV2YWw9RkFMU0V9CndlYl9wYWdlIDwtIHJlYWRfaHRtbCgiaHR0cHM6Ly93d3cudGhlZ3VhcmRpYW4uY29tL2Jvb2tzLzIwMDgvc2VwLzIwL2ZpY3Rpb24iKQpgYGAKCiMjIyMgVXNlIHRoZSBbU2VsZWN0b3JHYWRnZXRdKGh0dHA6Ly9zZWxlY3RvcmdhZGdldC5jb20pIHRvIGlkZW50aWZ5IENTUyBzZWxlY3RvcnMgaW4gdGhlIHdlYiBwYWdlClNlbGVjdG9yR2FkZ2V0IGlzIGEgamF2YXNjcmlwdCBib29rbWFya2xldCB0aGF0IHlvdSB1c2UgaW4geW91ciB3ZWIgYnJvd3NlciB0byBpZGVudGlmeSBDU1Mgc2VsZWN0b3JzIGluIGEgd2ViIHBhZ2UuIFJlYWQgdGhlIGB2aWduZXR0ZSgic2VsZWN0b3JnYWRnZXQiKWAgZm9yIG1vcmUgaW5mb3JtYXRpb24gYWJvdXQgdXNpbmcgU2VsZWN0b3JHYWRnZXQuICAgICAKPGJyPgoKIyMjIyBGaW5kIHRoZSBodG1sIG5vZGVzIHRoYXQgbWF0Y2ggdGhlIENTUyBzZWxlY3RvcgpgYGB7ciwgZXZhbD1GQUxTRX0Kd2ViX25vZGVzIDwtIGh0bWxfbm9kZXMod2ViX3BhZ2UsICIuanMtYXJ0aWNsZV9fYm9keSBwIikKYGBgCgojIyMjIEV4dHJhY3QgdGhlIHRleHQgZnJvbSB0aGUgaHRtbCBub2RlcwpgYGB7ciwgZXZhbD1GQUxTRX0Kd2ViX3RleHQgPC0gaHRtbF90ZXh0KHdlYl9ub2RlcykKYGBgCgoqKk9yIHVzaW5nIGAlPiVgKioKYGBge3J9CndlYl90ZXh0IDwtIHJlYWRfaHRtbCgiaHR0cHM6Ly93d3cudGhlZ3VhcmRpYW4uY29tL2Jvb2tzLzIwMDgvc2VwLzIwL2ZpY3Rpb24iKSAlPiUgCiAgaHRtbF9ub2RlcygiLmpzLWFydGljbGVfX2JvZHkgcCIpICU+JSAKICBodG1sX3RleHQoKQpgYGAKCioqKgojIyMgKipFeGVyY2lzZSoqCkV4dHJhY3QgdGhlIHRleHQgb2YgW0ouIEsuIFJvd2xpbmcnc10oaHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvSi5fSy5fUm93bGluZykgY29tbWVuY2VtZW50IHNwZWVjaCBhdCBIYXJ2YXJkIFVuaXZlcnNpdHkgb24gNSBKdW5lIDIwMDggZnJvbSB0aGUgZm9sbG93aW5nIHdlYiBwYWdlOiBbaHR0cDovL25ld3MuaGFydmFyZC5lZHUvZ2F6ZXR0ZS9zdG9yeS8yMDA4LzA2L3RleHQtb2Ytai1rLXJvd2xpbmctc3BlZWNoL10oaHR0cDovL25ld3MuaGFydmFyZC5lZHUvZ2F6ZXR0ZS9zdG9yeS8yMDA4LzA2L3RleHQtb2Ytai1rLXJvd2xpbmctc3BlZWNoLykKCipBbnN3ZXI6KgpgYGB7ciwgZXZhbD1GQUxTRX0KZXhhbXBsZV93ZWJfdGV4dCA8LSByZWFkX2h0bWwoImh0dHA6Ly9uZXdzLmhhcnZhcmQuZWR1L2dhemV0dGUvc3RvcnkvMjAwOC8wNi90ZXh0LW9mLWotay1yb3dsaW5nLXNwZWVjaC8iKSAgJT4lIAogIGh0bWxfbm9kZXMoIi5hcnRpY2xlLWJvZHkgcCIpICU+JSAKICBodG1sX3RleHQoKQpgYGAKKioqCgojIyMgUHJlcGFyZSB0aGUgdGV4dCBmb3IgdGlkeWluZwojIyMjIENvbnZlcnQgdG8gYSBkYXRhIGZyYW1lCmBgYHtyLCBldmFsPUZBTFNFfQp0ZXh0X2RmIDwtIGRhdGFfZnJhbWUodGV4dCA9IHdlYl90ZXh0KQpgYGAKCiMjIyMgVHJpbSB0aGUgdGV4dApgYGB7ciwgZXZhbD1GQUxTRX0KdGV4dF9kZiA8LSB0ZXh0X2RmICU+JSAKICBzbGljZSgxOihncmVwKCJ0aGlzIGlzIHdhdGVyLlwiIiwgdGV4dCkpKQpgYGAKCiMjIyMgUmVtb3ZlIGFueSBibGFuayBsaW5lcwpgYGB7ciwgZXZhbD1GQUxTRX0KdGV4dF9kZiA8LSB0ZXh0X2RmICU+JSAKICBmaWx0ZXIobnpjaGFyKHRleHQpKQpgYGAKCiMjIyMgQWRkIHBhcmFncmFwaCBudW1iZXJzCmBgYHtyLCBldmFsPUZBTFNFfQp0ZXh0X2RmIDwtIHRleHRfZGYgJT4lIAogIG11dGF0ZShwYXJhZ3JhcGggPSByb3dfbnVtYmVyKCkpCmBgYAoKKipPciB1c2luZyBgJT4lYCoqCmBgYHtyfQp0ZXh0X2RmIDwtIHdlYl90ZXh0ICU+JSAKICBkYXRhX2ZyYW1lKHRleHQgPSAuKSAlPiUgCiAgc2xpY2UoMTooZ3JlcCgidGhpcyBpcyB3YXRlci5cIiIsIHRleHQpKSkgJT4lIAogIG11dGF0ZShwYXJhZ3JhcGggPSByb3dfbnVtYmVyKCkpCmBgYAoKKioqCiMjIyAqKkV4ZXJjaXNlKioKMS4gQ29udmVydCBgZXhhbXBsZV93ZWJfdGV4dGAgdG8gYSBkYXRhIGZyYW1lIHVzaW5nIGBkYXRhX2ZyYW1lYCgpCjIuIFJlbW92ZSB0aGUgZmlyc3QgbGluZSBhbmQgbGluZSA1ICgiU2lnbiB1cCBmb3IgZGFpbHkgZW1haWxzIHdpdGggdGhlIGxhdGVzdCBIYXJ2YXJkIG5ld3MuIikgdXNpbmcgYHNsaWNlKClgLiAoSGludDogeW91IGNhbiB1c2UgYSB2ZWN0b3IgaW4gYHNsaWNlKClgKQozLiBBZGQgYSBwYXJhZ3JhcGggbnVtYmVyCgoqQW5zd2VyOioKYGBge3IsIGV2YWw9RkFMU0V9CmV4YW1wbGVfdGV4dF9kZiA8LSBleGFtcGxlX3dlYl90ZXh0ICU+JSAKICBkYXRhX2ZyYW1lKHRleHQgPSAuKSAlPiUgCiAgc2xpY2UoYygyOjE0LCAxNjpuKCkpKSAlPiUKICBtdXRhdGUocGFyYWdyYXBoID0gcm93X251bWJlcigpKQpgYGAKCioqKgoKIyMjIFRpZHkgdGhlIHRleHQgdXNpbmcgW2B0aWR5dGV4dGBdKGh0dHBzOi8vY3Jhbi5yLXByb2plY3Qub3JnL3dlYi9wYWNrYWdlcy90aWR5dGV4dC9pbmRleC5odG1sKQojIyMjIFVubmVzdCB2YWx1ZXMgaW4gdGV4dCBjb2x1bW4gaW50byBhIHdvcmQgY29sdW1uCmBgYHtyfQp3b3JkcyA8LSB0ZXh0X2RmICU+JSAKICB1bm5lc3RfdG9rZW5zKHdvcmQsIHRleHQpCmBgYAoKIyMjIyBSZW1vdmUgJ3N0b3Agd29yZHMnCmBgYHtyfQp3b3JkcyA8LSB3b3JkcyAlPiUgCiAgYW50aV9qb2luKHN0b3Bfd29yZHMsIGJ5ID0gIndvcmQiKQpgYGAKCiMjIyBBbmFseXNlIHRoZSB0ZXh0CgojIyMjIE1vc3QgY29tbW9uIHdvcmRzCmBgYHtyfQp3b3JkcyAlPiUgCiAgY291bnQod29yZCwgc29ydCA9IFRSVUUpCmBgYAoKKioqCiMjIyAqKkV4ZXJjaXNlKioKMS4gVW5uZXN0IHRoZSB2YWx1ZXMgaW4gdGhlIHRleHQgY29sdW1uIG9mIGBleGFtcGxlX3RleHRfZGZgIGludG8gYSB3b3JkIGNvbHVtbgoyLiBSZW1vdmUgdGhlICdzdG9wIHdvcmRzJwozLiBMaXN0IHRoZSB0b3AgMTAgbW9zdCBjb21tb24gd29yZHMKCipBbnN3ZXI6KgpgYGB7ciwgZXZhbD1GQUxTRX0KZXhhbXBsZV93b3JkcyA8LSBleGFtcGxlX3RleHRfZGYgJT4lIAogIHVubmVzdF90b2tlbnMod29yZCwgdGV4dCkgJT4lIAogIGFudGlfam9pbihzdG9wX3dvcmRzLCBieSA9ICJ3b3JkIikKZXhhbXBsZV93b3JkcyAlPiUgCiAgY291bnQod29yZCwgc29ydCA9IFRSVUUpCmBgYAoKKioqCgojIyMjIEJpZ3JhbXMKYGBge3J9CmJpZ3JhbSA8LSB0ZXh0X2RmICU+JQogIHVubmVzdF90b2tlbnMod29yZCwgdGV4dCwgdG9rZW4gPSAibmdyYW1zIiwgbiA9IDIpICU+JSAjIHRva2VuaXNlIGludG8gd29yZCBwYWlycwogIHNlcGFyYXRlKHdvcmQsIGMoIndvcmQxIiwgIndvcmQyIiksIHNlcCA9ICIgIikgJT4lIAogIGZpbHRlcighd29yZDEgJWluJSBzdG9wX3dvcmRzJHdvcmQsICMgcmVtb3ZlICdzdG9wIHdvcmRzJwogICAgICAgICAhd29yZDIgJWluJSBzdG9wX3dvcmRzJHdvcmQpICU+JQogIHVuaXRlKHdvcmQsIHdvcmQxLCB3b3JkMiwgc2VwID0gIiAiKQpgYGAKCioqRnJlcXVlbmN5IG9mIGJpZ3JhbXMqKgpgYGB7cn0KYmlncmFtICU+JQogIGNvdW50KHdvcmQsIHNvcnQ9VFJVRSkgJT4lCiAgdG9wX24oMTApICU+JQogIG11dGF0ZSh3b3JkID0gcmVvcmRlcih3b3JkLCBuKSkgJT4lCiAgZ2dwbG90KGFlcyh3b3JkLCBuKSkgKwogIGdlb21fY29sKGZpbGwgPSAiZ3JleSIsIGFscGhhID0gMC44KSArCiAgY29vcmRfZmxpcCgpICsKICBzY2FsZV95X2NvbnRpbnVvdXMoZXhwYW5kID0gYygwLDApKSArCiAgbGFicyh4ID0gTlVMTCwgeSA9ICJOdW1iZXIgb2YgbWVudGlvbnMiLAogICAgICAgdGl0bGUgPSAiMi13b3JkIGNvbWJpbmF0aW9ucyBpbiBGb3N0ZXIgV2FsbGFjZSdzIGNvbW1lbmNlbWVudCBzcGVlY2giKSArCiAgdGhlbWVfbWluaW1hbCgpCmBgYAoKIyMjIyBTZW50aW1lbnQgYW5hbHlzaXMKCioqRnJlcXVlbmN5IG9mIHBvc2l0aXZlIHdvcmRzKioKYGBge3J9CndvcmRzICU+JQogIGlubmVyX2pvaW4oZ2V0X3NlbnRpbWVudHMoImJpbmciKSwgYnkgPSAid29yZCIpICU+JQogIGZpbHRlcihzZW50aW1lbnQgPT0gInBvc2l0aXZlIikgJT4lCiAgY291bnQod29yZCkgJT4lCiAgd29yZGNsb3VkMihzaXplID0gMC43LCBmb250RmFtaWx5ID0gIlJvYm90b0NvbmRlbnNlZC1SZWd1bGFyIiwgCiAgICAgICAgICAgICBjb2xvciA9IHJlcChjKCdvcmFuZ2UnLCAnc2t5Ymx1ZScpLCBsZW5ndGgub3V0PW5yb3coLikpKQpgYGAKCjxicj4KCioqRnJlcXVlbmN5IG9mIG5lZ2F0aXZlIHdvcmRzKioKYGBge3J9CndvcmRzICU+JQogIGlubmVyX2pvaW4oZ2V0X3NlbnRpbWVudHMoImJpbmciKSwgYnkgPSAid29yZCIpICU+JQogIGZpbHRlcihzZW50aW1lbnQgPT0gIm5lZ2F0aXZlIikgJT4lIAogIGNvdW50KHdvcmQpICU+JSAKICB3b3JkY2xvdWQyKHNpemUgPSAwLjcsIGZvbnRGYW1pbHkgPSAiUm9ib3RvQ29uZGVuc2VkLVJlZ3VsYXIiLCAKICAgICAgICAgICAgIGNvbG9yID0gcmVwKGMoJ2JsYWNrJywgJ2dyZXknKSwgbGVuZ3RoLm91dD1ucm93KC4pKSkKYGBgCjxicj4KCioqRGlzdHJpYnV0aW9uIG9mIHBvc2l0aXZlIGFuZCBuZWdhdGl2ZSB3b3JkcyoqCmBgYHtyfQp3b3JkcyAlPiUKICBpbm5lcl9qb2luKGdldF9zZW50aW1lbnRzKCJiaW5nIiksIGJ5ID0gIndvcmQiKSAlPiUKICBjb3VudCh3b3JkLCBzZW50aW1lbnQsIHNvcnQgPSBUUlVFKSAlPiUKICB1bmdyb3VwKCkgJT4lIAogIGZpbHRlcihuID4gMSkgJT4lCiAgbXV0YXRlKG4gPSBpZmVsc2Uoc2VudGltZW50ID09ICJuZWdhdGl2ZSIsIC1uLCBuKSkgJT4lCiAgbXV0YXRlKHdvcmQgPSByZW9yZGVyKHdvcmQsIG4pKSAlPiUKICBnZ3Bsb3QoYWVzKHdvcmQsIG4sIGZpbGwgPSBzZW50aW1lbnQpKSArCiAgZ2VvbV9jb2woKSArCiAgbGFicyh4ID0gTlVMTCwgeSA9IE5VTEwsIGZpbGwgPSAiU2VudGltZW50IikgKwogIGNvb3JkX2ZsaXAoKSArCiAgdGhlbWVfbWluaW1hbCgpICsKICB0aGVtZShsZWdlbmQucG9zaXRpb249ImJvdHRvbSIpCmBgYAoKKioqCgojIyMgUmVzb3VyY2VzCi0gU2lsZ2UsIEouICYgUm9iaW5zb24sIEQuICgyMDE3KS4gVGV4dCBNaW5pbmcgd2l0aCBSOiBBIFRpZHkgQXBwcm9hY2gsIE8nUmVpbGx5IE1lZGlhLiBBdmFpbGFibGUgb25saW5lIHZpYTogW2h0dHA6Ly90aWR5dGV4dG1pbmluZy5jb21dKGh0dHA6Ly90aWR5dGV4dG1pbmluZy5jb20pCg==