
Overview

In this practical you’ll learn how to do natural language processing in R. By the end of this practical you will know how to:

  1. Calculate word frequencies and create word clouds.
  2. Create a term-frequency matrix.
  3. Conduct sentiment analyses.

Datasets

File Rows Columns Description
game_of_thrones.RDS 41,379 5 Subtitles of all episodes of the first seven seasons of Game of Thrones
game_of_thrones_raw.zip - - Subtitles of all episodes of the first seven seasons of Game of Thrones as raw text files

Packages

Package Installation
tidyverse install.packages("tidyverse")
tidytext install.packages("tidytext")
wordcloud install.packages("wordcloud")
rvest install.packages("rvest")

Glossary

Combining & extracting strings

Function Description
paste(), paste0() Base functions for combining strings
str_c() stringr function for combining strings
str_extract_all() stringr function for extracting strings using regular expressions

Read functions

Function Description
read_file() Read an entire file into a single string
readRDS() Read from R’s RDS format
file(..., 'r'), readLines() Read from a file connection

Text analysis functions

Function Description
unnest_tokens() Split text into words (tokens)
bind_tf_idf() Compute tf_idf weighting
get_sentiments() Access sentiment data sets
inner_join() Join words with, e.g., sentiments
anti_join() Eliminate words, e.g., stop_words

Create word cloud

Function Description
wordcloud() Create word cloud

Tasks

A - Getting set up

  1. Open your baselrbootcamp R project. It should already have the folders 1_Data and 2_Code.

  2. Open a new R script and save it as a new file called nlp_practical.R in the 2_Code folder. At the top of the script, using comments, write your name and the date. Then, load all of the packages listed in the Packages section above with library() (a sketch follows below).
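
A minimal sketch of what the top of nlp_practical.R might look like (name and date are placeholders):

# Your Name
# Today's date

# load required packages
library(tidyverse)
library(tidytext)
library(wordcloud)
library(rvest)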

B - Read in processed text data

  1. Read in the processed Game of Thrones subtitles data in game_of_thrones.RDS using readRDS() and name the object got.
got <- readRDS(XX)
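
Assuming the file sits in your 1_Data folder, the filled-in call might look like this:

# read the processed subtitle data (path is an assumption based on your project setup)
got <- readRDS("1_Data/game_of_thrones.RDS")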

C - Word cloud

  1. Using the amazingly fast and convenient unnest_tokens() function, tokenize the subtitles, i.e., split the individual sentences into words. Try using the pipe %>% and name the object got_words.
got_words <- XX %>%
  unnest_tokens(XX, XX) 
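
A possible solution sketch, assuming the subtitle text is stored in a column called text (check names(got) first):

# tokenize: one row per word
got_words <- got %>%
  unnest_tokens(output = word, input = text)
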
  2. Print the original and the new data frame and compare their dimensions. How much has the object grown?

  3. Using dplyr’s count() function, count the words per season (see ?count) and name the resulting object got_counts.

# count words
got_counts <- got_words %>%
  count(XX, XX) %>%
  ungroup()
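
One way to fill the template, assuming got contains a season column:

# count how often each word occurs per season
got_counts <- got_words %>%
  count(season, word) %>%
  ungroup()
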
  4. Using wordcloud(), plot a word cloud of the Game of Thrones words, including only words that occur more than 50 times (use filter()).
# reduce to n > 50
got_counts_50 <- got_counts %>% 
  filter(n > 50)

# create word cloud
wordcloud(words = XX, freq = XX)
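
Filling the placeholders with the columns created above (note that with per-season counts the same word can appear once per season; aggregating over seasons first is an option):

# plot word frequencies as a word cloud
wordcloud(words = got_counts_50$word, freq = got_counts_50$n)
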
  5. You have probably noticed that the word cloud is dominated by some very unrevealing words. To obtain a clearer picture of what the show is about, recompute got_counts, this time removing so-called stop words (stop_words) from got_words with the anti_join() function before counting. See below.
# count non-stopword words
got_counts <- got_words %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  ungroup()
  6. Now that stop words have been removed, again select the words that occur more than 50 times and create a new word cloud (see the sketch below). You should get a much clearer picture now.
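
A sketch of this step, reusing the recomputed got_counts from the chunk above:

# keep frequent non-stopword words
got_counts_50 <- got_counts %>%
  filter(n > 50)

# create the new word cloud
wordcloud(words = got_counts_50$word, freq = got_counts_50$n)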

D - Term-document matrix

In this subsection, you will look at the term-document matrix and how it can be used to learn more about the content of individual seasons.

  1. Using top_n(), determine the 10 most important (aka most frequent) words for each season.
# determine the 10 most frequent words per season
most_frequent <- got_words %>%
  count(XX, XX) %>%
  group_by(XX) %>%
  top_n(XX) %>%
  ungroup()

# plot the results
most_frequent %>% 
  arrange(XX, desc(XX)) %>%
  ggplot(aes(x = XX, y = XX)) +
  geom_bar(stat="identity") +
  facet_wrap(~XX, scale = 'free') +
  coord_flip()
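
A possible fill-in for the first chunk (the choice of wt = n in top_n() is an assumption); in the plotting template, the placeholders then map to season and n in arrange(), word and n in aes(), and season in facet_wrap():

# the 10 most frequent words per season
most_frequent <- got_words %>%
  count(season, word) %>%
  group_by(season) %>%
  top_n(10, n) %>%
  ungroup()
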
  2. Again, if we don’t remove stop_words, the results are not terribly informative. Fix this using the steps above (C-5).

  3. This time, removing stop_words doesn’t quite do the trick, because there are other show-specific terms that occur frequently in every season. To account for this, apply the tf-idf transformation using bind_tf_idf() and re-plot the results.

# determine the top non-stopword words per season
# and apply tf-idf transformation
most_frequent <- got_words %>%
  anti_join(XX, by = XX) %>%
  count(season, word) %>%
  bind_tf_idf(term = XX, document = XX, n = XX) %>%
  group_by(XX) %>%
  top_n(XX, tf_idf) %>%
  ungroup()

# plot the results
most_frequent %>% 
  arrange(season, desc(tf_idf)) %>%
  ggplot(aes(x = word, y = tf_idf)) +
  geom_bar(stat="identity") + 
  facet_wrap(~season, scale = 'free') +
  coord_flip()
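
One way to fill the tf-idf template (selecting 10 words per season is an assumption):

# tf-idf weighted top words per season, stop words removed
most_frequent <- got_words %>%
  anti_join(stop_words, by = "word") %>%
  count(season, word) %>%
  bind_tf_idf(term = word, document = season, n = n) %>%
  group_by(season) %>%
  top_n(10, tf_idf) %>%
  ungroup()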

E - Sentiment

In this subsection, you will use sentiment analysis to learn something about the dramatic development of the show’s seasons.

  1. Using get_sentiments(), load the "afinn" sentiment data set and assign it to an object called afinn.
# get afinn sentiment data
afinn <- get_sentiments(lexicon = XX)
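
Filling the placeholder; note that the first call may prompt you to download the lexicon via the textdata package (a one-time, interactive step):

# load the AFINN lexicon (one sentiment value per word)
afinn <- get_sentiments(lexicon = "afinn")
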
  2. Using inner_join(), combine the got_words data frame with the afinn data frame and call the object got_afinn (see the sketch below). Notice how inner_join() automatically drops words that are not contained in the afinn data frame.
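
A sketch of the join; depending on your tidytext version, the numeric sentiment column is called value (newer) or score (older):

# keep only words that appear in the AFINN lexicon
got_afinn <- got_words %>%
  inner_join(afinn, by = "word")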

  3. Aggregate the sentiment score by episode index and plot the result. Which episode index usually has the worst sentiment?

# aggregate and plot by episode sentiment
got_afinn %>%
  group_by(XX) %>%
  summarize(sentiment = mean(XX)) %>%
  ggplot(aes(x = XX, y = XX)) + 
  geom_point() + 
  geom_smooth() + theme_light()
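
A possible fill-in, assuming the sentiment column is named value and the episode index column is named episode:

# average sentiment per episode index
got_afinn %>%
  group_by(episode) %>%
  summarize(sentiment = mean(value)) %>%  # use score instead of value on older tidytext versions
  ggplot(aes(x = episode, y = sentiment)) +
  geom_point() +
  geom_smooth() + theme_light()
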
  4. Plot the results separately by season (see the sketch below). Which episode has the lowest sentiment?
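
One way to split the plot by season is to also group by season and add a facet (same column-name assumptions as above):

# average sentiment per episode, one panel per season
got_afinn %>%
  group_by(season, episode) %>%
  summarize(sentiment = mean(value)) %>%
  ggplot(aes(x = episode, y = sentiment)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~season) +
  theme_light()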

X - Psycholinguistics

  1. A fundamental law of psycholinguistics is that there are few words that occur frequently and many that occur rarely. Specifically, the relationship between a word’s rank in the frequency distribution (with 1 indicating the rank with the highest frequency) and its actual frequency usually follows a power law roughly proportional to \(rank^{-\alpha}\), where \(\alpha\) is some value around 1. This relationship is known as (one of) Zipf’s law(s). Evaluate whether Game of Thrones also shows Zipf’s law. Create a new variable that contains the rank of the frequency counts using rank(-x) (remember, rank 1 must be associated with the highest frequency). Then plot the relationship between the log() of the rank and the log() of the frequency (see the sketch below). If the plot shows a roughly linear relationship, the data follow a power law (because we are plotting in log-log space).
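
A minimal sketch, assuming got_words is still in your workspace (for Zipf’s law you will usually want counts without stop-word removal):

# rank words by frequency (rank 1 = most frequent) and plot in log-log space
got_words %>%
  count(word) %>%
  mutate(rank = rank(-n)) %>%
  ggplot(aes(x = log(rank), y = log(n))) +
  geom_point()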

  2. Another law of psycholinguistics, which is related to information science, is that frequent words should be composed of fewer characters than infrequent words, as this minimizes the number of characters that need to be communicated and, thus, maximizes efficiency. Evaluate this relationship for Game of Thrones by adding a variable containing each word’s number of characters using nchar() and plotting its relationship to the words’ frequencies (see the sketch below). Does Game of Thrones communicate efficiently?
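
A sketch of the word-length analysis, built on the same counts:

# relate word length (in characters) to word frequency
got_words %>%
  count(word) %>%
  mutate(n_char = nchar(word)) %>%
  ggplot(aes(x = n_char, y = n)) +
  geom_point() +
  geom_smooth()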

Y - Process raw text

  1. Read in the subtitles for each episode of Game of Thrones using read_file() from the readr package. To do this, first extract the file names of all of the files using list.files(path, full.names = TRUE). One way to achieve this quickly is to first create a vector containing the subtitle folders’ paths using paste() and then use the lapply() function to iterate over the folder paths and extract the file names. The lapply() function, like the other apply() functions (these will be covered in the Programming with R session), iterates through an object (the first argument) and applies a function (the second argument) to each of its elements. Thus, you want to run a command similar to files <- lapply(folder_paths, list.files, full.names = TRUE). Note that any additional arguments are passed on to the function specified in the second argument. See the sketch below.
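
A sketch of the whole step; the folder names are assumptions, so adjust them to wherever you unzipped game_of_thrones_raw.zip:

# one folder path per season (folder names are assumed)
folder_paths <- paste0("1_Data/game_of_thrones_raw/season_", 1:7)

# file names of all subtitle files
files <- lapply(folder_paths, list.files, full.names = TRUE)

# read each episode's subtitles into one long string
got <- lapply(unlist(files), read_file)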

  2. Extract the spoken text from the subtitle files. Begin by inspecting the text: use str_sub() to print the first few hundred characters. Try to identify which characters precede the spoken lines and which follow them. Think about how to build a regular expression that captures the start and end points of a spoken line and that also handles the many lines containing not speech but comments. Evaluate the code below and try to understand why the regular expression looks the way it does. Then use it to extract the text.

# inspect
str_sub(got[[1]], 1, 1000)

# extract data
got = str_extract_all(got, '(?<=\n)[^(][<i>]*[:alpha:]+[:control:]*[:print:]+(?=\r*\n)')
  3. Extract the episode names from Wikipedia using the code below. Try to understand what the code does.
# define XPath locations of episode tables
paths = paste0('//*[@id="mw-content-text"]/div/table[',2:8,']')

# extract episode names
names = unlist(lapply(paths, function(x) {
  read_html('https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes') %>% 
    html_nodes(xpath = x) %>% 
    html_table() %>% 
    `[[`(1) %>%  
    `[[`(3) %>% 
    str_replace_all('"','')
  }))

  4. Combine the extracted text, the episode names, their index within the season, and the season’s index into a single tibble(). Use the code below and try to understand what it does.

# create tibbles
got = lapply(1:length(got), function(i){
  season = ceiling(i / 10)
  episode = i - ((season-1) * 10)
  tibble(season, episode,  title = names[i], text = got[[i]])
  })

# combine data frames
got = do.call(rbind, got)

Additional reading