class: center, middle, inverse, title-slide

# Natural Language Processing
### The R Bootcamp Twitter: <a href='https://twitter.com/therbootcamp'>@therbootcamp</a>
### April 2018

---

# Definitions

**Natural-language processing (NLP)** according to [Wikipedia](https://en.wikipedia.org/wiki/Natural-language_processing):

> Natural-language processing (NLP) is an area of computer science and artificial intelligence concerned with the **interactions between computers and human (natural) languages**, in particular how to program computers to fruitfully **process large amounts of natural language data**.

**Natural language** according to [Wikipedia](https://en.wikipedia.org/wiki/Natural_language):

> In neuropsychology, linguistics, and the philosophy of language, a **natural language or ordinary language** is any language that has **evolved naturally in humans** through use and repetition without conscious planning or premeditation. Natural languages can take different forms, such as speech or signing. They are distinguished from constructed and formal languages such as those used to program computers or to study logic.

---

# Sources

---

# Use cases

.pull-left5[

### Basics
[**Tokenizing**](https://en.wikipedia.org/wiki/Word_segmentation) 
[Stemming](https://en.wikipedia.org/wiki/Stemming) 
[Part-of-speech tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging) 
[Parsing](https://en.wikipedia.org/wiki/Parsing) 
etc.

### Semantics
[Lexical semantics](https://en.wikipedia.org/wiki/Lexical_semantics)  
[Machine Translation](https://en.wikipedia.org/wiki/Machine_translation) 
[Relationship extraction](https://en.wikipedia.org/wiki/Relationship_extraction) 
[**Sentiment analysis**](https://en.wikipedia.org/wiki/Sentiment_analysis) 
[**Topic analysis**](https://en.wikipedia.org/wiki/Topic_segmentation) 
etc.

]

.pull-right5[

### Discourse
[Automatic summarization](https://en.wikipedia.org/wiki/Automatic_summarization) 
[Discourse analysis](https://en.wikipedia.org/wiki/Discourse_analysis) 
etc.

### Speech
[Speech recognition](https://en.wikipedia.org/wiki/Speech_recognition)  
[Speech segmentation](https://en.wikipedia.org/wiki/Speech_segmentation)  
[Text-to-speech](https://en.wikipedia.org/wiki/Text-to-speech)  
etc.

]

from <a href="https://en.wikipedia.org/wiki/Natural-language_processing">Wikipedia</a>

---

# Encoding

.pull-left55[
1963: ASCII
<img src="https://www.asciitable.com/index/asciifull.gif">

More info: [here](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) & [here](http://kunststube.net/encoding/)

]

.pull-right4[
1991: Unicode
<a href="http://unicode.org/"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Unicode_logo.svg/512px-Unicode_logo.svg.png" width = 365></a>

]

---

| Character | Code point | Encoding | Precision | Representation |
|:------|:--------|:--------|:--------|:--------|
| `A` | `U+0041` | ASCII | fixed 7 bit | 1000001 |
| `A` | `U+0041` | UTF-8 | min 8 bit / 1 byte | 01000001 |
| `A` | `U+0041` | UTF-16 | min 16 bit / 2 bytes | 00000000 01000001 |
| `A` | `U+0041` | UTF-32 | fixed 32 bit / 4 bytes | 00000000 00000000 00000000 01000001 |
| `あ` | `U+3042` | ASCII | fixed 7 bit | - |
| `あ` | `U+3042` | UTF-8 | min 8 bit / 1 byte | 11100011 10000001 10000010 |
| `あ` | `U+3042` | UTF-16 | min 16 bit / 2 bytes | 00110000 01000010 |
| `あ` | `U+3042` | UTF-32 | fixed 32 bit / 4 bytes | 00000000 00000000 00110000 01000010 |
| 😀 | `U+1F600` | ASCII | fixed 7 bit | - |
| 😀 | `U+1F600` | UTF-8 | min 8 bit / 1 byte | 11110000 10011111 10011000 10000000 |
| 😀 | `U+1F600` | UTF-16 | min 16 bit / 2 bytes | 11011000 00111101 11011110 00000000 |
| 😀 | `U+1F600` | UTF-32 | fixed 32 bit / 4 bytes | 00000000 00000001 11110110 00000000 |
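
The rows for `あ` can be checked in R. This is a minimal sketch using base R only; `to_bits()` is a hypothetical helper defined here, not a built-in function, and the UTF-16 line assumes the platform's `iconv` supports `UTF-16BE`.

```r
# Hypothetical helper: show each byte of a raw vector as 8 bits
to_bits <- function(bytes) {
  vapply(
    as.integer(bytes),
    function(b) paste(rev(as.integer(intToBits(b))[1:8]), collapse = ""),
    character(1)
  )
}

x <- "\u3042"  # the character U+3042, written as an escape

# Code point as an integer (0x3042 = 12354)
utf8ToInt(x)

# UTF-8 bytes of the string, shown as bits
to_bits(charToRaw(x))

# Re-encode as UTF-16 (big endian); assumes iconv supports this encoding
to_bits(iconv(x, from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]])
```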

---

# Regular expressions

.pull-left45[

According to [Wikipedia](https://en.wikipedia.org/wiki/Regular_expression):
>A regular expression, **regex or regexp** (sometimes called a rational expression) is, in theoretical computer science and formal language theory, **a sequence of characters that define a search pattern**. Usually this pattern is then used by **string searching algorithms** for **"find"** or **"find and replace"** operations on strings.

]

.pull-right45[

`(?<=\.) {2,}(?=[A-Z])`

]
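
The pattern above matches runs of two or more spaces that follow a period and precede a capital letter. A minimal sketch of how it could be used with `stringr`; the example text is made up for illustration.

```r
library(stringr)

# Hypothetical example text with extra spaces after some periods
txt <- "First sentence.   Second sentence.  Third one. Done."

# Lookbehind (?<=\.) and lookahead (?=[A-Z]) only check context;
# the match itself is just the run of spaces
pattern <- "(?<=\\.) {2,}(?=[A-Z])"

# Number of places with too many spaces
n_matches <- str_count(txt, pattern)

# Replace each run with a single space
cleaned <- str_replace_all(txt, pattern, " ")
cleaned
```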

---

# Regular expressions

.pull-left45[

```r
str_*(string, pattern, ...)
```

| Function suffix | Use |
|:------|:--------------|
|   `detect`  | Test if pattern is present. |
|   `count`   | Count number of pattern matches. |
|   `locate`  | Find location of pattern. |
|   `extract` | Extract strings matching pattern. |
|   `replace` | Replace string matching pattern by other string. |
|   `split`   | Split string around pattern. |

]

.pull-right25[

]
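
A short sketch of the functions in the table, applied to one sentence. The search patterns are chosen for illustration only.

```r
library(stringr)

txt <- "Happy families are all alike; every unhappy family is unhappy in its own way."

has_match <- str_detect(txt, "unhappy")        # is the pattern present?
n_match   <- str_count(txt, "unhappy")         # how many matches?
position  <- str_locate(txt, "unhappy")        # start and end of the first match
first_hit <- str_extract(txt, "famil[a-z]+")   # first string matching the pattern
replaced  <- str_replace(txt, "unhappy", "sad") # replace the first match
parts     <- str_split(txt, "; ")              # split around the pattern
```

Most of these functions also have an `_all` variant, e.g., `str_extract_all()`, that returns every match rather than the first.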

---

# Regular expressions: Segmentation

```r
library(stringr)

# Text
txt <- "Happy families are all alike; every unhappy family is unhappy in its own way."

# Tokenize: extract every run of letters
str_extract_all(txt, "[A-Za-z]+")
```

```
## [[1]]
##  [1] "Happy"    "families" "are"      "all"      "alike"    "every"   
##  [7] "unhappy"  "family"   "is"       "unhappy"  "in"       "its"     
## [13] "own"      "way"
```

```r
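# Split into sentences (here also at semicolons): start at a non-space
# character and run up to the next ., !, ? or ;
str_extract_all(txt, '[^[:space:]][^.!?;]*[.!?;]')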
```

```
## [[1]]
## [1] "Happy families are all alike;"                  
## [2] "every unhappy family is unhappy in its own way."
```

---

# Term-document matrix

Number of mentions of Stark family members per episode across the first seven seasons of [Game of Thrones](https://de.wikipedia.org/wiki/Game_of_Thrones) (columns shown: the ten episodes of season 1)

|        | Winter Is Coming| The Kingsroad| Lord Snow| Cripples, Bastards, and Broken Things| The Wolf and the Lion| A Golden Crown| You Win or You Die| The Pointy End| Baelor| Fire and Blood|
|:-------|----------------:|-------------:|---------:|-------------------------------------:|---------------------:|--------------:|------------------:|--------------:|------:|--------------:|
|jon     |                6|             3|         3|                                     6|                     5|              1|                  3|              3|      0|              0|
|ned     |                5|             9|         8|                                     2|                     5|              0|                  6|              2|      0|              1|
|robb    |                2|             1|         0|                                     1|                     0|              5|                  0|              5|      4|              5|
|sansa   |                2|             3|         3|                                     4|                     0|              1|                  0|              5|      1|              0|
|arya    |                1|            11|         1|                                     1|                     4|              0|                  0|              2|      0|              0|
|bran    |                3|             4|         3|                                     3|                     2|              4|                  1|              1|      0|              4|
|rickon  |                0|             1|         0|                                     0|                     0|              0|                  0|              0|      0|              1|
|catelyn |                0|             1|         4|                                     1|                     1|              1|                  3|              2|      2|              0|

---

# Term-document matrix: Tf-idf

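Tf-idf (term frequency-inverse document frequency) reweights the counts to account for (a) differences in the number of words per document and (b) the frequency of words across documents: terms that occur in many documents receive lower weights.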

|        | Winter Is Coming| The Kingsroad| Lord Snow| Cripples, Bastards, and Broken Things| The Wolf and the Lion| A Golden Crown| You Win or You Die| The Pointy End| Baelor| Fire and Blood|
|:-------|----------------:|-------------:|---------:|-------------------------------------:|---------------------:|--------------:|------------------:|--------------:|------:|--------------:|
|jon     |             0.07|          0.02|      0.03|                                  0.07|                  0.07|           0.02|               0.05|           0.03|   0.00|           0.00|
|ned     |             0.06|          0.06|      0.08|                                  0.02|                  0.07|           0.00|               0.10|           0.02|   0.00|           0.02|
|robb    |             0.04|          0.01|      0.00|                                  0.02|                  0.00|           0.15|               0.00|           0.09|   0.20|           0.16|
|sansa   |             0.04|          0.03|      0.05|                                  0.08|                  0.00|           0.03|               0.00|           0.09|   0.05|           0.00|
|arya    |             0.03|          0.17|      0.02|                                  0.03|                  0.12|           0.00|               0.00|           0.05|   0.00|           0.00|
|bran    |             0.02|          0.01|      0.01|                                  0.02|                  0.01|           0.04|               0.01|           0.01|   0.00|           0.04|
|rickon  |             0.00|          0.05|      0.00|                                  0.00|                  0.00|           0.00|               0.00|           0.00|   0.00|           0.15|
|catelyn |             0.00|          0.01|      0.04|                                  0.01|                  0.01|           0.02|               0.05|           0.02|   0.06|           0.00|
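
A minimal sketch of one common tf-idf variant: term frequency as the share of a term within its document, multiplied by the log of the inverse document frequency. It is not necessarily the variant used for the table above, the toy counts are hypothetical, and document length is approximated by the column sums of the toy matrix.

```r
# Hypothetical toy term-document matrix (terms in rows, documents in columns)
counts <- matrix(
  c(2, 0, 0,
    2, 1, 0,
    0, 1, 0,
    0, 1, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("jon", "ned", "sansa", "arya"), c("doc1", "doc2", "doc3"))
)

# Term frequency: share of each term within its document (column)
tf <- sweep(counts, 2, colSums(counts), "/")

# Inverse document frequency: log(number of documents / documents containing the term)
idf <- log(ncol(counts) / rowSums(counts > 0))

# tf-idf: each row (term) is weighted by its idf
tfidf <- tf * idf
round(tfidf, 2)
```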

---

# Term-document matrix: Uses

.pull-left5[

Basis for all sorts of NLP applications...

**Search** - identify the most appropriate documents for a search query

**Topic modeling** - decompose matrix into topics

**Spam detection** - use words as variables to train spam detection algorithms, e.g., Naive Bayes.

**Semantic meaning** - use documents as variables to learn about relationships between words.

etc.
]

.pull-right35[

|        | Winter Is Coming| The Kingsroad| Lord Snow|
|:-------|----------------:|-------------:|---------:|
|jon     |             0.07|          0.02|      0.03|
|ned     |             0.06|          0.06|      0.08|
|robb    |             0.04|          0.01|      0.00|
|sansa   |             0.04|          0.03|      0.05|
|arya    |             0.03|          0.17|      0.02|
|bran    |             0.02|          0.01|      0.01|
|rickon  |             0.00|          0.05|      0.00|
|catelyn |             0.00|          0.01|      0.04|

]
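
As an illustration of the search use case, documents can be ranked by cosine similarity to a query vector. A minimal sketch using the tf-idf excerpt from this slide; the query and the helper `cosine()` are hypothetical.

```r
# tf-idf excerpt from the slide (terms in rows, episodes in columns)
tfidf <- matrix(
  c(0.07, 0.02, 0.03,
    0.06, 0.06, 0.08,
    0.04, 0.01, 0.00,
    0.04, 0.03, 0.05,
    0.03, 0.17, 0.02,
    0.02, 0.01, 0.01,
    0.00, 0.05, 0.00,
    0.00, 0.01, 0.04),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("jon", "ned", "robb", "sansa", "arya", "bran", "rickon", "catelyn"),
    c("Winter Is Coming", "The Kingsroad", "Lord Snow")
  )
)

# Hypothetical query: a user searching for "arya"
query <- as.numeric(rownames(tfidf) == "arya")

# Cosine similarity between two vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Similarity of every document (column) to the query
scores <- apply(tfidf, 2, cosine, b = query)

# Documents ranked from most to least similar
sort(scores, decreasing = TRUE)
```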

---

.pull-left35[

# Sentiment analysis

Aims to identify the sentiment, i.e., **affective value**, of natural language, based on...

- **Human judgment**

- **Heuristic rules**

- **Machine learning**

]

.pull-right55[

]

---

# Sentiment analysis

.pull-left25[

|word   | score|
|:--------|--------:|
|i      |    NA|
|could  |    NA|
|kill   |    -3|
|for    |    NA|
|a      |    NA|
|lovely |     3|
|piece  |    NA|
|of     |    NA|
|yummy  |     3|
|cake   |    NA|

]

.pull-right65[

]
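
The table above can be reproduced with a simple lexicon lookup. A minimal sketch, assuming a tiny hand-made lexicon that holds only the three scores shown; a real sentiment lexicon is much larger.

```r
library(stringr)

# Hypothetical mini-lexicon with the three scores from the table
lexicon <- c(kill = -3, lovely = 3, yummy = 3)

txt <- "I could kill for a lovely piece of yummy cake"

# Tokenize, then look up each word; words missing from the lexicon give NA
words  <- str_extract_all(str_to_lower(txt), "[a-z]+")[[1]]
scores <- unname(lexicon[words])

data.frame(word = words, score = scores)

# Overall sentiment: sum of the available scores
total <- sum(scores, na.rm = TRUE)
total
```

The negative score for "kill" shows a limit of word-level lookup: the sentence is clearly positive, yet one of its three scored words pulls the total down.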

---

# Packages

.pull-left45[

#### Frameworks

| Package   | Use |
|:------------|:----------------|
| `tm`       | text mining framework for R. [Paper](http://www.jstatsoft.org/v25/i05/) |
| `openNLP`  |R interface to [OpenNLP](http://opennlp.sourceforge.net/) |
| `tidytext` | text mining using tidy tools. |

#### Semantics

| Package   | Use |
|:----------|:----------------|
| `topicmodels`  | Latent Dirichlet Allocation and Correlated Topics Models |
| `text2vec`  | tools for text vectorization and word embeddings. |

]

.pull-right45[

#### Tools

| Package   | Use                   |
|:---------------------|:-------------------------------------------------|
|`stringr`  | regular expressions and basic text manipulation |
|`hunspell`    | spelling correction |
|`SnowballC`    | word stemming |

#### Data

| Package   | Use |
|:----------|:----------------|
|`gutenbergr`  |  allows downloading books from the Project Gutenberg collection|
|`rvest`    | scrape web pages |
|`twitteR`    | access Twitter data  |
|`Rfacebook`    | access Facebook data  |

]

---

# Practical

<a href="https://therbootcamp.github.io/_sessions/D3S3_NaturalLanguageProcessing/NaturalLanguageProcessing.html">Link to practical</a>