class: center, middle, inverse, title-slide # Natural language processsing ### The R Bootcamp
www.therbootcamp.com
@therbootcamp
### July 2018 --- layout: true <div class="my-footer"><span> <a href="https://therbootcamp.github.io/"><font color="#7E7E7E">Basel, July 2018</font></a>                                           <a href="https://therbootcamp.github.io/"><font color="#7E7E7E">www.therbootcamp.com</font></a> </span></div> --- # Definitions **Natural-language processing (NLP)** according to [Wikipedia](https://en.wikipedia.org/wiki/Natural-language_processing): > Natural-language processing (NLP) is an area of computer science and artificial intelligence concerned with the **interactions between computers and human (natural) languages**, in particular how to program computers to fruitfully **process large amounts of natural language data**. <br><br> **Natural language** according [Wikipedia](https://en.wikipedia.org/wiki/Natural_language): >In neuropsychology, linguistics, and the philosophy of language, a **natural language or ordinary language** is any language that has **evolved naturally in humans** through use and repetition without conscious planning or premeditation. Natural languages can take different forms, such as speech or signing. They are distinguished from constructed and formal languages such as those used to program computers or to study logic.[1] --- # Sources <p align="center"> <img scr="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/sources.png"> </p> <img src="https://www.popsci.com/sites/popsci.com/files/images/2017/11/books.jpg" alt="test" height = 200> <img src="http://scidle.com/wp-content/uploads/2016/03/wikipedia.jpg" alt="test" height = 200> <img src="https://cdn.wccftech.com/wp-content/uploads/2018/02/Twitter.jpeg" alt="test" height = 200> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/7c/Facebook_New_Logo_%282015%29.svg/2000px-Facebook_New_Logo_%282015%29.svg.png" height = 200> <img src="https://blog.richmond.edu/richmondlawcdo/files/2017/02/Email.jpg" height = 200> <img src="http://www.vitecinc.com/wp-content/uploads/2016/04/voice-recognition-640x480.jpg" height = 200 width = 206> --- # Turing test .pull-left5[ ><font size = 5><i>"A computer would deserve to be called intelligent if it could deceive a human into believing that it was human."</i><br>Alan Turing, 1950</font> <p align="center"> <img src="https://upload.wikimedia.org/wikipedia/commons/5/55/Turing_test_diagram.png" height="240px" vspace="10"></img><br> <a href="https://en.wikipedia.org/wiki/Alan_Turing">The imitation game</a> </p> ] .pull-right45[ <p align="center"> <img src="https://upload.wikimedia.org/wikipedia/commons/7/79/Alan_Turing_az_1930-as_%C3%A9vekben.jpg" height="400px"></img><br> Alan Turing, 1938, <a href="https://en.wikipedia.org/wiki/Alan_Turing">Wikipedia</a> </p> ] --- # Google Duplex / Assistant .pull-left5[ <iframe width="600" height="337" src="https://www.youtube.com/embed/bd1mEm2Fy08" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe> <high>Sundar Pichai @ Google IO, May 2018</high> ] .pull-right5[ <iframe width="600" height="337" src="https://www.youtube.com/embed/-qCanuYrR0g" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe> <high>Google Ad, June 2018</high> ] --- # Tools & Methods .pull-left5[ ### Basics [**Tokenizing**](https://en.wikipedia.org/wiki/Word_segmentation)<br> [Stemming](https://en.wikipedia.org/wiki/Stemming)<br> [Part-of-speech tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging)<br> [Parsing](https://en.wikipedia.org/wiki/Parsing)<br> etc. ### Semantics [Lexical semantics](https://en.wikipedia.org/wiki/Word_segmentation)<br> [Machine Translation](https://en.wikipedia.org/wiki/Machine_translation)<br> [Relationship extraction](https://en.wikipedia.org/wiki/Relationship_extraction)<br> [**Sentiment analysis**](https://en.wikipedia.org/wiki/Sentiment_analysis)<br> [**Topic analysis**](https://en.wikipedia.org/wiki/Topic_segmentation)<br> etc. ] .pull-right5[ ### Discourse [Automatic summarization](https://en.wikipedia.org/wiki/Automatic_summarization)<br> [Discourse analysis](https://en.wikipedia.org/wiki/Discourse_analysis)<br> etc. ### Speech [Speech recognition](https://en.wikipedia.org/wiki/Speech_recognition)<br> [Speech segmentation](https://en.wikipedia.org/wiki/Speech_segmentation)<br> [Relationship extraction](https://en.wikipedia.org/wiki/Relationship_extraction)<br> [Text-to-speech](https://en.wikipedia.org/wiki/Text-to-speech)<br> etc. ] <font size="2"> from <a href="https://en.wikipedia.org/wiki/Natural-language_processing">Wikipedia</a></font> --- # Encoding .pull-left55[ <font size="5">1960: ASCII</font> <img src="https://www.asciitable.com/index/asciifull.gif"> More info: [here](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) & [here](http://kunststube.net/encoding/) ] .pull-right4[ <font size="5">1991: Unicode</font> <a href="http://unicode.org/"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Unicode_logo.svg/512px-Unicode_logo.svg.png" width = 365></a> ] --- <br><br> | Character| Code Point | Encoding | Precision | Representation| |:------|:--------|:--------||:--------| | `A`| `U+0041` | ASCII | fixed 7 bit | 1000001 | | `A`| `U+0041` | UTF-8 | min 8 bit / 1 byte | 01000001 | | `A`| `U+0041` | UTF-16 | min 16 bit / 2 byte | 00000000 01000001 | | `A`| `U+0041` | UTF-32 | min 32 bit / 4 byte | 00000000 00000000 00000000 01000001 | | `あ`| `U+3042` | ASCII | fixed 7 bit | - | | `あ`| `U+3042` | UTF-8 | min 8 bit / 1 byte | 11100011 10000001 10000010 | | `あ`| `U+3042` | UTF-16 | min 16 bit / 2 byte | 00110000 01000010 | | `あ`| `U+3042` | UTF-32 | min 32 bit / 4 byte | 00000000 00000000 00110000 01000010 | | <img src="https://images.emojiterra.com/google/android-oreo/512px/1f600.png" height="20px"> | `U+1F600` | ASCII | fixed 7 bit | - | | <img src="https://images.emojiterra.com/google/android-oreo/512px/1f600.png" height="20px"> | `U+1F600` | UTF-8 | min 8 bit / 1 byte | 1111 0000 1001 1111 1001 1000 1000 0000 | | <img src="https://images.emojiterra.com/google/android-oreo/512px/1f600.png" height="20px"> | `U+1F600` | UTF-16 | min 16 bit / 2 byte | 1101 1000 0011 1101 1101 1110 0000 0000 | | <img src="https://images.emojiterra.com/google/android-oreo/512px/1f600.png" height="20px"> | `U+1F600` | UTF-32 | min 32 bit / 4 byte | - | --- # Regular expressions .pull-left45[ According to [Wikipedia](https://en.wikipedia.org/wiki/Regular_expression): >A regular expression, <high>regex or regexp</high> (sometimes called a rational expression) is, in theoretical computer science and formal language theory, <high>a sequence of characters that define a search pattern</high>. Usually this pattern is then used by <high>string searching algorithms</high> for "find" or "find and replace" operations on strings. ] .pull-right45[ <img src="https://upload.wikimedia.org/wikipedia/commons/2/23/The_river_effect_in_justified_text.jpg"> <p align="center">`(?<=\.) {2,}(?=[A-Z])`<p> ] --- # Regular expressions .pull-left65[ The `stringr` package provides a series of high performance regular expression functions based on the <a href="http://site.icu-project.org/home">`ICU`</a> C++ library. ```r str_*(string, pattern, ...) ``` | Function suffix | Use | |:---------------|:------------------------------| | `str_detect` | Test if pattern is present. | | `str_count` | Count number of pattern matches. | | `str_locate` | Find location of pattern. | | `str_extract` | Extract strings matching pattern. | | `str_replace` | Replace string matching pattern by other string. | | `str_split` | Split string around pattern. | ] .pull-right25[ <img src="https://www.rstudio.com/wp-content/uploads/2014/04/stringr.png" height=200> <a href="http://site.icu-project.org/home"><img src="http://www.unicode.org/announcements/ICU-logo.png" height=200></a> ] --- # Regular expressions: Segmentation ```r # text txt <- "Happy families are all alike; every unhappy family is unhappy in its own way." # Tokenize str_extract_all(txt, "[A-Za-z]+") ``` ``` ## [[1]] ## [1] "Happy" "families" "are" "all" "alike" "every" ## [7] "unhappy" "family" "is" "unhappy" "in" "its" ## [13] "own" "way" ``` ```r # Sentenize str_extract_all(txt, '[^[:space:]][^[.!?;]]*[.!?;]') ``` ``` ## [[1]] ## [1] "Happy families are all alike;" ## [2] "every unhappy family is unhappy in its own way." ``` --- # The Stark family: A Game of Thrones example <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/starks.png" height="450px"></img><br> </p> --- # Term-document matrix (TDM) Number of mention of <high>Stark family</high> members across the <high>first seasons</high> of [Game of Thrones](https://de.wikipedia.org/wiki/Game_of_Thrones) | | Winter Is Coming| The Kingsroad| Lord Snow| Cripples, Bastards, and Broken Things| The Wolf and the Lion| A Golden Crown| You Win or You Die| The Pointy End| Baelor| Fire and Blood| |:-------|----------------:|-------------:|---------:|-------------------------------------:|---------------------:|--------------:|------------------:|--------------:|------:|--------------:| |jon | 6| 3| 3| 6| 5| 1| 3| 3| 0| 0| |ned | 5| 9| 8| 2| 5| 0| 6| 2| 0| 1| |robb | 2| 1| 0| 1| 0| 5| 0| 5| 4| 5| |sansa | 2| 3| 3| 4| 0| 1| 0| 5| 1| 0| |arya | 1| 11| 1| 1| 4| 0| 0| 2| 0| 0| |bran | 3| 4| 3| 3| 2| 4| 1| 1| 0| 4| |rickon | 0| 1| 0| 0| 0| 0| 0| 0| 0| 1| |catelyn | 0| 1| 4| 1| 1| 1| 3| 2| 2| 0| --- .pull-left65[ # TDM: Applications <high>Search</high> Identify most appropriate document for search queries, e.g.,<br><br>"Who is Arya Stark" -> "Watch Episode The Kingsroad" <high>Semantic meaning</high> Use documents as variables to learn about relationships between terms. "What is Robb Stark" -> "Someone who is not Ned Stark" <high>Topic modeling</high> decompose matrix into topics (sets of words) Topic 1: Jon, Arya, Ned<br> Topic 2: Rickon, Arya, Bran <high>Spam detection</high> Use terms as variables to detect properties of documents, e.g., spam status or quality of episode. "Which is the worst episode?" ] .pull-right3[ <br><br><br> | | Winter Is Coming| The Kingsroad| Lord Snow| |:-------|----------------:|-------------:|---------:| |jon | 6| 3| 3| |ned | 5| 9| 8| |robb | 2| 1| 0| |sansa | 2| 3| 3| |arya | 1| 11| 1| |bran | 3| 4| 3| |rickon | 0| 1| 0| |catelyn | 0| 1| 4| ] --- # Term-document matrix (TDM) Number of mention of <high>Stark family</high> members across the <high>first seasons</high> of [Game of Thrones](https://de.wikipedia.org/wiki/Game_of_Thrones) | | Winter Is Coming| The Kingsroad| Lord Snow| Cripples, Bastards, and Broken Things| The Wolf and the Lion| A Golden Crown| You Win or You Die| The Pointy End| Baelor| Fire and Blood| |:-------|----------------:|-------------:|---------:|-------------------------------------:|---------------------:|--------------:|------------------:|--------------:|------:|--------------:| |jon | 6| 3| 3| 6| 5| 1| 3| 3| 0| 0| |ned | 5| 9| 8| 2| 5| 0| 6| 2| 0| 1| |robb | 2| 1| 0| 1| 0| 5| 0| 5| 4| 5| |sansa | 2| 3| 3| 4| 0| 1| 0| 5| 1| 0| |arya | 1| 11| 1| 1| 4| 0| 0| 2| 0| 0| |bran | 3| 4| 3| 3| 2| 4| 1| 1| 0| 4| |rickon | 0| 1| 0| 0| 0| 0| 0| 0| 0| 1| |catelyn | 0| 1| 4| 1| 1| 1| 3| 2| 2| 0| --- # Tf-idf: Term frequency - inverse document frequency Accounts (usually) for (a) differences in the number of <high>terms within document</high> (e.g., `\(f_{t,d}/\sum{f_{t,d}}\)`) and (b) the frequency of <high>terms between documents</high> (e.g., `\(log(N/n_{t})\)`). | | Winter Is Coming| The Kingsroad| Lord Snow| Cripples, Bastards, and Broken Things| The Wolf and the Lion| A Golden Crown| You Win or You Die| The Pointy End| Baelor| Fire and Blood| |:-------|----------------:|-------------:|---------:|-------------------------------------:|---------------------:|--------------:|------------------:|--------------:|------:|--------------:| |jon | 0.07| 0.02| 0.03| 0.07| 0.07| 0.02| 0.05| 0.03| 0.00| 0.00| |ned | 0.06| 0.06| 0.08| 0.02| 0.07| 0.00| 0.10| 0.02| 0.00| 0.02| |robb | 0.04| 0.01| 0.00| 0.02| 0.00| 0.15| 0.00| 0.09| 0.20| 0.16| |sansa | 0.04| 0.03| 0.05| 0.08| 0.00| 0.03| 0.00| 0.09| 0.05| 0.00| |arya | 0.03| 0.17| 0.02| 0.03| 0.12| 0.00| 0.00| 0.05| 0.00| 0.00| |bran | 0.02| 0.01| 0.01| 0.02| 0.01| 0.04| 0.01| 0.01| 0.00| 0.04| |rickon | 0.00| 0.05| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.15| |catelyn | 0.00| 0.01| 0.04| 0.01| 0.01| 0.02| 0.05| 0.02| 0.06| 0.00| --- .pull-left4[ # Sentiment analysis Aims to identify the sentiment, i.e., <high>affective value</high>, of natural language, based on... <high>Human judgment</high><br> Identify words that were judged by humans based on their sentiments and aggregate the values. <high>Heuristic rules</high><br> Complements the first approach with rules, e.g., negation. <high>Machine learning</high><br> Learns a mapping of terms and document sentiment based on a training set of judged and estimated document sentiments. ] .pull-right55[ <br><br> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/sentiment.png" width="550"> ] --- # Sentiment analysis .pull-left4[ <p align="center"> <high>Example</high> </p> |word | score| |:--------|--------:| |i | NA| |could | NA| |kill | -3| |for | NA| |a | NA| |lovely | 3| |piece | NA| |of | NA| |yummy | 3| |cake | NA| ] .pull_right5[ <br> Analysis of <high>Game of Thrones episode sentiment</high> as a function episode's index within a season. <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/sentiment_got.png" width="550"> </p> ] --- # Packages .pull-left45[ <high>Frameworks</high> | Package | Use | |:------------|:----------------| | `tm` | text mining framework for R. [Paper](http://www.jstatsoft.org/v25/i05/) | | `openNLP` |R interface to [OpenNLP](http://opennlp.sourceforge.net/) | | `tidytext` | text mining using tidy tools. | <br><high>Semantics</high> | Package | Use | |:----------|:----------------| | `topicmodels` | Latent Dirichlet Allocation and Correlated Topics Models | | `text2vec` | tools for text vectorization and word embeddings. | ] .pull-right45[ <high>Tools</high> | Package | Use | |:---------------------|:-------------------------------------------------| |`stringr` | regular expressions and basic text manipulation | |`hunspell` | spelling correction | <br><high>Data</high> | Package | Use | |:----------|:----------------| |`gutenbergr` | allows downloading books from the Project Gutenberg collection| |`rvest` | scrape the internet | |`twitteR` | access Twitter data | |`Rfacebook` | access Facebook data | ] --- # Practical <p><font size=6><b><a href="https://therbootcamp.github.io/BaselRBootcamp_2018July/_sessions/LanguageProcessing/LanguageProcessing_practical.html">Link to practical</a>