Here is a link to the lecture slides for this session: LINK
In this practical you'll learn how to do natural language processing in R. By the end of this practical you will know how to read in raw text data, tokenize it into words, analyze word frequencies and tf-idf weights, and run a simple sentiment analysis.
Here are some tools:

Function | Description |
---|---|
`paste()`, `paste0()` | Base functions for combining strings |
`str_c()` | `stringr` function for combining strings |
`str_extract_all()` | `stringr` function for extracting strings using regular expressions |
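For a quick feel for these functions, here is a small example (the input strings are made up):

library(stringr)
paste0('Winter', ' is ', 'coming')              # "Winter is coming"
str_c('Winter', 'is', 'coming', sep = ' ')      # "Winter is coming"
str_extract_all('S01E01, S01E02', '[:digit:]+') # list: "01" "01" "01" "02"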
Here are the main read-in functions:

Function | Description |
---|---|
`read_file()` | Read a flat text file into a single string |
`readRDS()` | Read from R's RDS format |
`file(..., 'r')`, `readLines()` | Read from a file connection |
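And a minimal sketch of two read-in routes (the file path is hypothetical):

library(readr)
# whole file as a single string
txt <- read_file('data/example.txt')
# line by line via a file connection
con <- file('data/example.txt', 'r')
lines <- readLines(con)
close(con)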
Here are the main web-scraping functions (package `rvest`):

Function | Description |
---|---|
`read_html()` | Read HTML from the web |
`html_nodes()` | Access tagged elements within the HTML document using, e.g., XPath |
`html_table()` | Extract a table from an HTML document |
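A minimal `rvest` sketch, using the Wikipedia page that appears in the exercises below:

library(rvest)
page <- read_html('https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes')
# grab all tables on the page as a list of data frames
tbls <- page %>% html_nodes('table') %>% html_table(fill = TRUE)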
Here are the main `tidytext` (& `dplyr`) functions:

Function | Description |
---|---|
`unnest_tokens()` | Split text into words (tokens) |
`bind_tf_idf()` | Compute tf-idf weighting |
`get_sentiments()` | Access a sentiment data set |
`inner_join()` | Join words with, e.g., sentiments |
`anti_join()` | Eliminate words, e.g., stop_words |
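A toy example of how these functions chain together (the two-line 'document' is invented):

library(tidytext)
library(dplyr)
d <- tibble(doc = 1:2, text = c('The night is dark', 'Winter is coming'))
d %>%
  unnest_tokens(word, text) %>%        # one row per word, lower-cased
  anti_join(stop_words, by = 'word')   # drop common stop words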
This tutorial begins with an optional task for users who feel confident programming in R and want to tackle the often complicated but necessary step of bringing raw text data into the right shape for your analysis tools. As this involves programming elements that we have not covered yet, I generally recommend jumping ahead to Read in processed text data and returning to this section later if you have time.
1. Read in the subtitles for each episode of Game of Thrones using `read_file()` from the `readr` package. To do this, first extract the file names of all of the files using `list.files(path, full.names = TRUE)`. One way to achieve this quickly is to first create a vector containing the subtitles' folders using `paste()` and then to use the `lapply()` function to iterate over the folders. The `lapply()` function, like any other `apply()` function (covered in the Programming with R session), iterates over the object provided as its first argument and applies the function provided as its second. Thus, you want to run a command similar to `files <- lapply(folder_paths, list.files, full.names = TRUE)`. Note that any additional arguments are passed on to the function specified in the second argument; see the sketch below.
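For orientation, a minimal sketch of these steps; the folder layout (`data/subtitles/season_1`, ...) is hypothetical and must match where your files actually live:

library(readr)
# one folder per season (hypothetical paths)
folder_paths <- paste0('data/subtitles/season_', 1:7)
# list all subtitle files across the folders
files <- unlist(lapply(folder_paths, list.files, full.names = TRUE))
# read each file into a single string
got <- lapply(files, read_file)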
2. Extract the spoken lines from the subtitles. Begin by inspecting the text: use `str_sub()` to print the first few hundred characters. Try to identify which characters precede the spoken lines and which follow them. Think about how to build a regular expression that captures the start and end points of a spoken line and that also handles the many lines containing not speech but annotations. Evaluate the code below (find more info here). Try to understand why the regular expression looks the way it does, and use it to extract the text.
# inspect
str_sub(got[[1]], 1, 1000)
[1] "1\r\n00:01:55,418 --> 00:01:58,420\r\nEasy, boy.\r\n\r\n2\r\n00:02:55,047 --> 00:02:56,881\r\n<i>( Gasps</i>\r\n\r\n3\r\n00:03:11,162 --> 00:03:13,331\r\n<i>What do you expect?</i>\r\n<i>They're savages.</i>\r\n\r\n4\r\n00:03:13,365 --> 00:03:15,700\r\nOne lot steals a goat\r\nfrom another lot,\r\n\r\n5\r\n00:03:15,734 --> 00:03:18,737\r\nbefore you know it they're ripping\r\neach other to pieces.\r\n\r\n6\r\n00:03:18,771 --> 00:03:20,939\r\nI've never seen wildlings\r\ndo a thing like this.\r\n\r\n7\r\n00:03:20,974 --> 00:03:23,275\r\nI never seen a thing like this,\r\nnot ever in my life.\r\n\r\n8\r\n00:03:23,310 --> 00:03:26,179\r\nHow close did you get?\r\n\r\n9\r\n00:03:26,213 --> 00:03:29,148\r\n- Close as any man would.\r\n- We should head back to the wall.\r\n\r\n10\r\n00:03:31,685 --> 00:03:33,319\r\nDo the dead frighten you?\r\n\r\n11\r\n00:03:33,353 --> 00:03:35,788\r\nOur orders were\r\nto track the wildlings.\r\n\r\n12\r\n00:03:35,822 --> 00:03:38,657\r\nWe tracked them.\r\nThey won't trouble us no more.\r\n\r\n13\r\n00:03:38,691 --> 00:03:41,360\r\nYou don't think\r\nhe'll ask us how the"
# extract data: match lines that follow a newline, do not start with '('
# (sound annotations), may begin with an <i> tag, and run up to the
# end-of-line markers (\r\n)
got = str_extract_all(got, '(?<=\n)[^(][<i>]*[:alpha:]+[:control:]*[:print:]+(?=\r*\n)')
3. Extract the episode titles from Wikipedia's list of Game of Thrones episodes using `rvest`. Evaluate the code below and try to understand what it does.
# read the Wikipedia episode list once
wiki <- read_html('https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes')
# define XPath locations of the episode tables (one per season)
paths = paste0('//*[@id="mw-content-text"]/div/table[',2:8,']')
# extract episode names from the third column of each season's table
names = unlist(lapply(paths, function(x) {
  wiki %>%
    html_nodes(xpath = x) %>%
    html_table() %>%
    `[[`(1) %>%               # first matched table
    `[[`(3) %>%               # third column: episode titles
    str_replace_all('"','')   # strip surrounding quotes
  }))
4. Combine the extracted text, the episode names, their index in the season, and the season's index inside a single `tibble()`. Use the code below. Try to understand what the code does.
# create one tibble per episode
got = lapply(seq_along(got), function(i){
  season  = ceiling(i / 10)          # 10 episodes per season
  episode = i - ((season - 1) * 10)  # index within the season
  tibble(season, episode, title = names[i], text = got[[i]])
})
# combine data frames
got = do.call(rbind, got)
Read in the processed text data using `readRDS()` from today's data sets.

got <- readRDS('data/game_of_thrones.RDS')
Split the text into words using `unnest_tokens()` from the `tidytext` package (install and load it first). Try using the pipe `%>%`. Evaluate the effect on the object. How much bigger have its dimensions become?

# install
# install.packages('tidytext')
# load
library(tidytext)
# tokenize
got_words <- got %>%
unnest_tokens(word, text)
# print
got
# A tibble: 63,263 x 4
season episode title text
<dbl> <dbl> <chr> <chr>
1 1. 1. Winter Is Coming Easy, boy.
2 1. 1. Winter Is Coming <i>( Gasps</i>
3 1. 1. Winter Is Coming <i>What do you expect?</i>
4 1. 1. Winter Is Coming <i>They're savages.</i>
5 1. 1. Winter Is Coming One lot steals a goat
6 1. 1. Winter Is Coming from another lot,
7 1. 1. Winter Is Coming before you know it they're ripping
8 1. 1. Winter Is Coming each other to pieces.
9 1. 1. Winter Is Coming do a thing like this.
10 1. 1. Winter Is Coming not ever in my life.
# ... with 63,253 more rows
got_words
# A tibble: 252,945 x 4
season episode title word
<dbl> <dbl> <chr> <chr>
1 1. 1. Winter Is Coming easy
2 1. 1. Winter Is Coming boy
3 1. 1. Winter Is Coming i
4 1. 1. Winter Is Coming gasps
5 1. 1. Winter Is Coming i
6 1. 1. Winter Is Coming i
7 1. 1. Winter Is Coming what
8 1. 1. Winter Is Coming do
9 1. 1. Winter Is Coming you
10 1. 1. Winter Is Coming expect
# ... with 252,935 more rows
Count how often each word occurs per episode using the `count()` function from the `dplyr` package. Don't forget to `ungroup()`.

# count words
got_cnts <- got_words %>%
count(season, episode, title, word) %>%
ungroup()
Evaluate whether the word frequencies follow a power law (Zipf's law). Determine each word's frequency rank using `rank()`. Then plot the relationship between the `log()` of the rank and the `log()` of the frequency. If the plot looks roughly linear, then the relationship follows a power law (because we are plotting in log-log space).

# plot log-rank against log-frequency
got_cnts %>%
mutate(rank = rank(-n)) %>%
ggplot(aes(log(rank),log(n))) +
geom_point() +
geom_smooth()
`geom_smooth()` using method = 'gam'
Evaluate the words' lengths using `nchar()` and plot their relationship to the words' frequencies. Does Game of Thrones communicate efficiently?

# plot word length against frequency
got_cnts %>%
mutate(nchar = nchar(word)) %>%
ggplot(aes(nchar,log(n))) +
geom_point() +
geom_smooth()
`geom_smooth()` using method = 'gam'
In this subsection we will be looking at the term-document matrix. We will not actually be required to transform our data into a matrix, because of the way the tidyverse works. However, if you wanted to, you could cast the table of counts to a term-document or document-term matrix using `cast_tdm()` or `cast_dtm()`. This creates an object of class `TermDocumentMatrix` or `DocumentTermMatrix` and allows you to continue working with the tools included in the `tm` package.
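For instance, a minimal sketch of the cast (assuming `tm` is installed; `got_cnts` is the count table from above):

# cast the per-episode counts to a document-term matrix
got_dtm <- got_cnts %>%
  cast_dtm(document = title, term = word, value = n)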
Determine the top 3 most frequent words per episode (use `group_by()` with the `title` variable and `top_n()`).

# count words
most_frequent <- got_cnts %>%
group_by(title, season, episode) %>%
top_n(3) %>%
ungroup()
Selecting by n
# plot
most_frequent %>%
arrange(season, episode, desc(n)) %>%
mutate(word = as.factor(word),
title_no = paste0(season,'_',episode,'_',title)) %>%
ggplot(aes(word, n, fill = season)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "n") +
facet_wrap(~title_no, ncol = 6, scales = "free") +
coord_flip()
Many of these most frequent words are so-called stop words. A list of stop words is included in the `tidytext` package; the object is called `stop_words`. Remove them using `anti_join()`, determine again the top 3 most frequent words, and plot.

# count words
most_frequent <- got_cnts %>%
anti_join(stop_words) %>%
group_by(title, season, episode) %>%
top_n(3, n) %>%
ungroup()
Joining, by = "word"
# plot
most_frequent %>%
arrange(season, episode, desc(n)) %>%
mutate(word = as.factor(word),
title_no = paste0(season,'_',episode,'_',title)) %>%
ggplot(aes(word, n, fill = season)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "n") +
facet_wrap(~title_no, ncol = 6, scales = "free") +
coord_flip()
Compute the tf-idf weighting of the words using `bind_tf_idf()`. Repeat the same analysis for the newly created `tf_idf` variable.

# count words
most_frequent <- got_cnts %>%
anti_join(stop_words) %>%
bind_tf_idf(word, title, n) %>%
group_by(title, season, episode) %>%
top_n(3, tf_idf) %>%
ungroup()
Joining, by = "word"
# plot
most_frequent %>%
arrange(season, episode, desc(tf_idf)) %>%
mutate(word = as.factor(word),
title_no = paste0(season,'_',episode,'_',title)) %>%
ggplot(aes(word, tf_idf, fill = season)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~title_no, ncol = 6, scales = "free") +
coord_flip()
Finally, aggregate the tf-idf values across episodes within each season, determine each season's top 10 words, and plot.

# aggregate tf-idf per season
most_frequent <- got_cnts %>%
anti_join(stop_words) %>%
bind_tf_idf(word, title, n) %>%
group_by(season, word) %>%
summarize(tf_idf = sum(tf_idf)) %>%
group_by(season) %>%
top_n(10, tf_idf) %>%
ungroup()
Joining, by = "word"
# plot
most_frequent %>%
arrange(season, desc(tf_idf)) %>%
mutate(word = as.factor(word)) %>%
ggplot(aes(word, tf_idf, fill = season)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~season, ncol = 4, scales = "free") +
coord_flip()
get_sentiments("afinn")
and assign it to an object.# define afinn
afinn <- get_sentiments("afinn")
Read in the processed data once more, split the text into words using `unnest_tokens()`, and join them with the AFINN data set using `inner_join()`.

# read processed data
got <- readRDS('data/game_of_thrones.RDS')
# join afinn
got_afinn <- got %>%
unnest_tokens(word, text) %>%
inner_join(afinn)
Joining, by = "word"
Aggregate the sentiment scores per episode and plot them.

# aggregate and plot (note: grouping by episode pools the same episode index across seasons)
got_afinn %>%
group_by(episode) %>%
summarize(sentiment = mean(score)) %>%
ggplot(aes(x = episode, y = sentiment, col = sentiment)) +
geom_point() +
#facet_wrap(~season) +
scale_colour_gradientn(colours = c('red','green')) +
geom_smooth()
`geom_smooth()` using method = 'loess'
get_sentiments("bing")
and get_sentiments("nrc")
.
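As a starting point, a minimal sketch for the bing lexicon. Note that bing labels words as positive or negative rather than scoring them, so a sensible episode-level summary is, e.g., the share of positive words:

got %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments('bing'), by = 'word') %>%
  group_by(season, episode) %>%
  summarize(prop_positive = mean(sentiment == 'positive'))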