In this practical you’ll learn how to do natural language processing in R. By the end of this practical you will know how to:
File | Rows | Columns | Description |
---|---|---|---|
game_of_thrones.RDS | 41,379 | 5 | Subtitles of all episodes of the first seven seasons of Game of Thrones |
game_of_thrones_raw.zip | - | - | Subtitles of all episodes of the first seven seasons of Game of Thrones as raw text files |
Package | Installation |
---|---|
tidyverse |
install.packages("tidyverse") |
tidytext |
install.packages("tidytext") |
wordcloud |
install.packages("wordcloud") |
rvest |
install.packages("rvest") |
Combining & extracting strings
Function | Description |
---|---|
paste() , paste0 |
Base function for combining strings |
str_c() |
stringr function for combining strings |
str_extract_all() |
stringr function for extracting strings using regular expressions |
Read functions
Function | Description |
---|---|
read_file() |
Read flat csv file |
readRDS() |
Read from R’s RDS format |
file(...,'r'), readLines |
Read from file connection |
Text analysis functions
Function | Description |
---|---|
unnest_tokens() |
Split text into words (tokens) |
bind_tf_idf() |
Compute tf_idf weighting |
get_sentiments |
Access sentiment data set |
inner_join() |
Join words with, e.g., sentiments |
anti_join() |
Eliminate words, e.g., stop_words |
Create word cloud
Function | Description |
---|---|
wordcloud() |
Create word cloud |
Open your baselrbootcamp
R project. It should already have the folders 1_Data
and 2_Code
.
Open a new R script and save it as a new file called nlp_practical.R
in the 2_Code
folder. At the top of the script, using comments, write your name and the date. Then, load all package(s) listed in the Packages section above with library()
.
game_of_thrones.RDS
using readRDS()
and name the object got
.got <- readRDS(XX)
unnest_tokens()
-function, tokenize the subtitles, i.e., split the individual sentences into words. Try using the pipe %>%
and name the object got_words
.got_words <- XX %>%
unnest_tokens(XX, XX)
Print the original and the new data frame and compare their dimensions. How much has the object grown?
Using dplyr
’s count()
-function, count words per season (see ?count
) and name the resulting object got_counts
.
# count words
got_counts <- got_words %>%
count(XX, XX) %>%
ungroup()
wordcloud()
, plot a word cloud of the words of game of thrones with words that occur more than 50 times (use filter()
).# reduce to n > 50
got_counts_50 <- got_counts %>%
filter(n > 50)
# create word cloud
wordcloud(words = XX, freq = XX)
stop_words
from got_counts
using the anti_join
function. See below.# count non-stopword words
got_counts <- got_words %>%
anti_join(stop_words, by = "word") %>%
count(word) %>%
ungroup()
In this subsection, you will be looking at the term document matrix and how it can be used to learn more about the content of individual seasons.
top_n()
, determine the 10 most important (aka most frequent) words for each season.# determine 3 most frequent words per season
most_frequent <- got_words %>%
count(XX, XX) %>%
group_by(XX) %>%
top_n(XX) %>%
ungroup()
# plot the results
most_frequent %>%
arrange(XX, desc(XX)) %>%
ggplot(aes(x = XX, y = XX)) +
geom_bar(stat="identity") +
facet_wrap(~XX, scale = 'free') +
coord_flip()
Again, if we don’t remove stop_words
the results are not terribly informative. Fix this using the steps above (B-5).
This time, removing stop_words
didn’t do the trick, because there are other show-specific terms that frequently occur in every season. To account for this, use the tf-idf
transformation using bind_tf_idf()
and re-plot the results.
# determine 3 most frequent non-stopword words per season
# and apply tf-idf transformation
most_frequent <- got_words %>%
anti_join(XX, by = XX) %>%
count(season, word) %>%
bind_tf_idf(term = XX, document = XX, n = XX) %>%
group_by(XX) %>%
top_n(XX, tf_idf) %>%
ungroup()
# plot the results
most_frequent %>%
arrange(season, desc(n)) %>%
ggplot(aes(x = word, y = n)) +
geom_bar(stat="identity") +
facet_wrap(~season, scale = 'free') +
coord_flip()
In this subsection, you will be using sentiment analysis and to learn something about the dramatic development of the show’s seasons.
get_sentiments()
, load the "afinn"
-sentiment data set and assign it to an object called affinn
.# get afinn sentiment data
afinn <- get_sentiments(lexicon = XX)
Using inner_join()
, combine the got_words
data frame with the afinn
data frame and call the object got_afinn
. Notice how inner_join()
takes care of words not available in the afinn
data frame.
Aggregate the sentiment score by episode
index and plot the result. Which episode index usually has the worst sentiment?
# aggregate and plot by episode sentiment
got_afinn %>%
group_by(XX) %>%
summarize(sentiment = mean(XX)) %>%
ggplot(aes(x = XX, y = XX)) +
geom_point() +
geom_smooth() + theme_light()
A fundamental law of Psycholinguistics is that there are few words that occur a frequently and many that occur rarely. Specifically, the relationship between the words rank in the frequency distribution (with 1 indicating the rank with the highest frequency) and the actual frequency usually follows a power law roughly proportional to \(rank^{-alpha}\) where \(alpha\) is some value around 1. This relationship is know as (one of) Zipf’s law(s). Evaluate whether Game of Thrones also shows Zipf’s law. Create a new variable that contains the rank of the frequency counts using rank(-x)
(remember rank 1 must be associated with the highest frequency). Then plot the relationship between the log()
of the rank and the log()
of the frequency. If the plot shows a roughly linear relationship then it follows a power law (because we are plotting in a log-log space).
Another law of Psycholinguistics, which is related to information science, is that frequent words should be composed of fewer characters than more infrequent words, as this would minimize the number of to be communicated characters and, thus, maximize efficiency. Evaluate this relationship for Game of Thrones by adding a variable containing the word’s number of characters using nchar()
and plotting its relationship to the words’ frequencies. Does Game of Thrones communicate efficiently?
Read in the subtitles for each episode of Game of Thrones using read_file()
from the readr
-package. To do this, first extract the file names of all of the files using list.files(path, full.names = TRUE)
. One way to achieve this quickly is by first creating a vector containing the subtitle’s folder paths using paste()
and then using the lapply()
-function to iterate over the folder paths to extract the filenames. The lapply()
-function such as any other apply()
-function (this will be covered in the Programming with R session) iterates through the object (the first argument) and for each element applies a function (the second argument). Thus, you want to run a command similar to files <- lapply(folder_paths, list.files, full.names = TRUE)
. Note that any third arguments will be passed on to the function specified in the second argument.
Extract the text from the subtitle files. Begin by inspecting the text. Use str_sub()
to print the first few hundred characters. Try to identify what characters precede the the spoken lines and which succeed. Think about how to build a regular expression that captures the end and stop points of the spoken line that also handles the many lines including not speech but comments. Evaluate the code below (find more info here). Try to understand why the regular expression looks that way. Use it to extract the text
# inspect
str_sub(got[[1]], 1, 1000)
# extract data
got = str_extract_all(got, '(?<=\n)[^(][<i>]*[:alpha:]+[:control:]*[:print:]+(?=\r*\n)')
# define XPath locations of episode tables
paths = paste0('//*[@id="mw-content-text"]/div/table[',2:8,']')
# extract episode names
names = unlist(lapply(paths, function(x) {
read_html('https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes') %>%
html_nodes(xpath = x) %>%
html_table() %>%
`[[`(1) %>%
`[[`(3) %>%
str_replace_all('"','')
}))
4.Combine the extracted text, the episode names, their index in the season, and the season’s index inside a single tibble()
. Use the code below and try to understand what the code does.
# create tibbles
got = lapply(1:length(got), function(i){
season = ceiling(i / 10)
episode = i - ((season-1) * 10)
tibble(season, episode, title = names[i], text = got[[i]])
})
# combine data frames
got = do.call(rbind, got)