class: center, middle, inverse, title-slide

# Features
### Machine Learning with R
### The R Bootcamp @ DHLab
### November 2020

---

layout: true

<div class="my-footer">
  <span style="text-align:center">
    <span>
      <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
    </span>
    <a href="https://therbootcamp.github.io/">
      <span style="padding-left:82px">
        <font color="#7E7E7E">
          www.therbootcamp.com
        </font>
      </span>
    </a>
    <a href="https://therbootcamp.github.io/">
      <font color="#7E7E7E">
        Machine Learning with R | November 2020
      </font>
    </a>
  </span>
</div>

---

.pull-left45[

# Feature issues

<ul>
  <li class="m1"><span><b>Too many features</b></span></li>
  <ul class="level">
    <li><span>Curse of <high>dimensionality</high></span></li>
    <li><span>Feature <high>importance</high></span></li>
  </ul><br>
  <li class="m2"><span><b>Wrong features</b></span></li>
  <ul class="level">
    <li><span>Feature <high>scaling</high></span></li>
    <li><span>Feature <high>correlation</high></span></li>
    <li><span>Feature <high>quality</high></span></li>
  </ul><br>
  <li class="m3"><span><b>Create new features</b></span></li>
  <ul class="level">
    <li><span>Feature <high>engineering</high></span></li>
  </ul>
</ul>

]

.pull-right45[

<br><br>
<p align="center">
  <img src="image/dumbdata.png" height=500px><br>
  <font style="font-size:10px">from <a href="https://xkcd.com/1838/">xkcd.com</a></font>
</p>

]

---

# Curse of dimensionality

.pull-left35[

<ul>
  <li class="m1"><span><b>Density</b></span></li>
  <ul class="level">
    <li><span>The number of cases needed to <high>cover the data space</high> grows exponentially with the number of features.</span></li>
  </ul><br>
  <li class="m2"><span><b>Redundancy</b></span></li>
  <ul class="level">
    <li><span>Redundancy between features grows with their number, implying increased <high>uncertainty</high> in estimation.</span></li>
  </ul><br>
  <li class="m3"><span><b>Efficiency</b></span></li>
  <ul class="level">
    <li><span>The number of <high>parameters</high> grows with the number of features, requiring more <high>computational resources</high>.</span></li>
  </ul>
</ul>

]

.pull-right6[

<br>
<p align="center">
  <img src="image/cod.png"><br>
  <font style="font-size:10px">from <a href="https://medium.freecodecamp.org/the-curse-of-dimensionality-how-we-can-save-big-data-from-itself-d9fa0f872335?gi=6e6735e00188">medium.freecodecamp.org</a></font>
</p>

]

---

# How to reduce dimensionality?

.pull-left45[

<ul>
  <li class="m1"><span><b>Manual selection</b></span></li>
  <ul class="level">
    <li><span>Reduce features <high>manually</high> based on statistical or intuitive considerations.</span></li>
  </ul><br>
  <li class="m2"><span><b>Automatic selection</b></span></li>
  <ul class="level">
    <li><span>Reduce variables <high>automatically</high> using suitable ML algorithms, e.g., <mono>random forests</mono> or <mono>lasso</mono>, or feature selection algorithms, e.g., <mono>recursive feature selection</mono>.</span></li>
  </ul><br>
  <li class="m3"><span><b>Automatic reduction</b></span></li>
  <ul class="level">
    <li><span>Compress variables using <high>dimensionality reduction algorithms</high>, such as principal component analysis (PCA). A code sketch of all three approaches follows on the next slide.</span></li>
  </ul>
</ul>

]

.pull-right5[

<p align="center">
  <img src="image/highd.jpeg" height=350>
  <font style="font-size:10px">from <a href="">Interstellar</a></font>
</p>

]
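---

# Reducing dimensionality in R: a sketch

A minimal sketch of the three approaches, assuming the deck's `basel` data with `income` as the criterion; the selected variable names and tuning values are illustrative, not prescriptive.

```r
library(tidyverse)
library(caret)

# Manual selection: keep only the features you have reason to trust
basel_small <- basel %>%
  select(income, age, food, alcohol)

# Automatic selection: the lasso shrinks unhelpful coefficients to zero
# (requires the glmnet package; lambda is an illustrative penalty)
income_lasso <- train(income ~ .,
                      method = "glmnet",
                      data = basel,
                      tuneGrid = expand.grid(alpha = 1,   # alpha = 1 -> lasso
                                             lambda = .1))

# Automatic reduction: compress predictors into principal components
basel_pca <- prcomp(basel %>% select(-income),
                    center = TRUE, scale. = TRUE)
summary(basel_pca)  # proportion of variance explained per component
```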
---

# Feature importance

.pull-left4[

<ul>
  <li class="m1"><span>Characterizes how much a <high>feature contributes</high> to the fitting/prediction performance.</span></li><br>
  <li class="m2"><span>The metric is <high>model specific</high>, but typically <high>normalized</high> to <mono>[0, 100]</mono>.</span></li><br>
  <li class="m3"><span><b>Strategies</b></span></li>
  <ul class="level">
    <li><span>Single variable prediction (e.g., using LOESS, ROC)</span></li>
    <li><span>Accuracy loss from scrambling</span></li>
    <li><span>Random forest importance</span></li>
    <li><span>etc.</span></li>
  </ul>
</ul>

]

.pull-right5[

```r
# plot variable importance for lm(income ~ .)
plot(varImp(income_lm))
```

<img src="Features_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

]

---

# `varImp()`

.pull-left45[

<ul>
  <li class="m1"><span>Automatically selects an <high>appropriate measure</high> of variable importance for a given algorithm.</span></li>
</ul>

```r
varImp(income_lm)
```

```
lm variable importance

          Overall
age       100.000
food       42.480
alcohol    23.682
happiness  13.909
tattoos     6.284
height      3.230
children    1.837
datause     1.486
```

]

.pull-right5[

```r
# plot variable importance for lm(income ~ .)
plot(varImp(income_lm))
```

<img src="Features_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

]

---

.pull-left35[

# Recursive feature selection

<ul>
  <li class="m1"><span><mono>rfe()</mono> uses <high>cross-validation</high> to select the best <i>n</i> features.</span></li><br>
  <li class="m2"><span>Algorithm</span></li>
  <ol>
    <li><span><high>Candidates</high>, e.g., <mono>n = [2, 3, 5, 10]</mono>.</span></li>
    <li><span><high>Resample</high> and split the data.</span></li>
    <li><span>Evaluate <high>performance</high> for the best <mono>n</mono> features.</span></li>
    <li><span>Select the best <mono>n</mono> on the basis of <high>aggregate performance</high>.</span></li>
  </ol>
</ul>

]

.pull-right55[

<br><br>

```r
# Run feature elimination
rfe(x = ..., y = ...,
    sizes = c(3, 4, 5, 10),  # feature set sizes
    rfeControl = rfeControl(functions = lmFuncs))
```

```
Recursive feature selection

Outer resampling method: Bootstrapped (25 reps)

Resampling performance over subset size:

 Variables  RMSE Rsquared   MAE RMSESD RsquaredSD  MAESD Selected
         3 0.386    0.855 0.303 0.0143    0.01099 0.0127
         4 0.382    0.858 0.299 0.0146    0.01063 0.0134
         5 0.382    0.858 0.299 0.0134    0.00987 0.0124        *
        10 0.383    0.858 0.299 0.0127    0.00959 0.0115
        14 0.382    0.858 0.299 0.0128    0.00970 0.0114

The top 5 variables (out of 5):
   age, food, alcohol, happiness, tattoos
```

]

---

# Principal component analysis

.pull-left45[

<ul>
  <li class="m1"><span>The <high>go-to algorithm</high> for dimensionality reduction.</span></li><br>
  <li class="m2"><span>A linear model (regression) represents the features in a <high>new, smaller feature space</high>.</span></li><br>
  <li class="m3"><span>The new feature space explains <high>maximal variance</high> of the original features.</span></li>
</ul>

]

.pull-right45[

<p align="center">
  <img src="image/pca.png" height=350>
  <font style="font-size:10px">from <a href="https://blog.umetrics.com/what-is-principal-component-analysis-pca-and-how-it-is-used">blog.umetrics.com</a></font>
</p>

]

---

# Using `PCA`

.pull-left45[

```r
# train model WITHOUT PCA preprocessing
model = train(income ~ .,
              method = 'lm',
              data = bas_train)

plot(varImp(model))
```

<img src="Features_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

]

.pull-right45[

```r
# train model WITH PCA preprocessing
model = train(income ~ .,
              method = 'lm',
              data = bas_train,
              preProc = c('pca'))

plot(varImp(model))
```

<img src="Features_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

<i>The next slide sketches how to inspect this PCA step directly.</i>

]
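---

# Looking inside the `PCA` step

A minimal sketch of running the PCA preprocessing by hand with caret's `preProcess()`, assuming the deck's `bas_train` data; `thresh = .95` (the share of variance to retain) is an illustrative setting.

```r
library(tidyverse)
library(caret)

# Estimate the PCA transformation on the predictors
pca_pp <- preProcess(bas_train %>% select(-income),
                     method = c("center", "scale", "pca"),
                     thresh = .95)  # keep components explaining 95% of the variance

pca_pp  # reports how many components were retained

# Apply the transformation to obtain the component scores
bas_train_pca <- predict(pca_pp, bas_train %>% select(-income))
head(bas_train_pca)
```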
src="Features_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> ] --- # Other, easy feature problems .pull-left45[ ### Multi-collinearity <ul> <li class="m1"><span><high>High feature correlations</high> mean that there is redundancy in the data, which can lead to less stable fits, uninterpretable variable importances, and worse predictions.</span></li> </ul> ```r # identify redundant variables findCorrelation(cor(basel)) ``` ``` [1] 5 ``` ```r # remove from data remove <- findCorrelation(cor(basel)) basel <- basel %>% select(-remove) ``` ] .pull-right45[ ### Low variance <ul> <li class="m2"><span>Low variance variables add parameters, but <high>can hardly contribute to prediction</high> and are, thus, also redundant.</span></li> </ul> ```r # identify low variance variables nearZeroVar(basel) ``` ``` integer(0) ``` <ul> <li class="m3"><span>Unequal variance <high>breaks regularization</high> (L1, L2) and renders estimates difficult to interpret..</span></li> </ul> ```r # standardize and center variables train(..., preProc("center", "scale")) ``` ] --- # Difficult feature problems <br> .pull-left35[ <ul> <li class="m1"><span><b>Trivial Features</b></span></li> <ul> <li><span>Successful prediction not necessarily implies that a meaningful pattern has been detected.</span></li> </ul><br> <li class="m2"><span><b>Missing features</b></span></li> <ul> <li><span>Some problems are hard, requiring the engineering of new features.</span></li><br> </ul> </ul> ] .pull-right55[ <br> <p align = "center"> <img src="image/here_to_help.png"><br> <font style="font-size:10px">from <a href="https://xkcd.com/1831/">xkcd.com</a></font> </p> ] --- # Trivial features .pull-left3[ <u><a href="https://www.gwern.net/Tanks">An urban myth?!</a></u> "The Army trained a program to differentiate American tanks from Russian tanks with 100% accuracy. Only later did analysts realize that the American tanks had been photographed on a sunny day and the Russian tanks had been photographed on a cloudy day. The computer had learned to detect brightness."<br><br> New York Times <a href="https://www.nytimes.com/2017/10/09/science/stanford-sexual-orientation-study.html" style="font-size:8px">[Full text]</a> ] .pull-right6[ <p align = "center"> <img src="image/tank.jpg"> <font style="font-size:10px">from <a href="https://en.wikipedia.org/wiki/British_heavy_tanks_of_World_War_I#/media/File:Mark_I_series_tank.jpg">wikipedia.org</a></font> </p> ] --- # (Always!) missing features .pull-left85[ <i>"…some machine learning projects succeed and some fail. What makes the difference? <high>Easily the most important factor is the features used</high>."</i> [Pedro Domingos](https://en.wikipedia.org/wiki/Pedro_Domingos) <br> <i>"The algorithms we used are very standard for Kagglers. […] <high>We spent most of our efforts in feature engineering</high>. [...] We were also very careful to discard features likely to expose us to the risk of over-fitting our model."</i> [Xavier Conort]() <br> <i>"Coming up with features is difficult, time-consuming, requires expert knowledge. 
<high>Applied machine learning is basically feature engineering</high>."</i> [Andrew Ng](https://en.wikipedia.org/wiki/Andrew_Ng) ] --- # Feature engineering .pull-left45[ <br> <i>“Feature engineering is the process of <high>transforming raw data</high> into features that <high>better represent the underlying problem</high> to the predictive models, resulting in improved model accuracy on unseen data.”</i> [Jason Brownlee]() <br> <i>"...while avoiding the <high>curse of dimensionality</high>."</i> [duw]() ] .pull-right45[ <p align = "center"> <img src="image/albert.jpeg"><br> <font style="font-size:10px">from <a href="http://www.open.edu/openlearncreate/mod/oucontent/view.php?id=80245§ion=1">open.edu</a></font> </p> ] --- # <mono>createDataPartition()</mono> .pull-left4[ <ul> <li class="m1"><span>Use <mono>createDataPartition()</mono> to split the <high>data set</high> in training and test.</span></li> </ul> <br> <table style="cellspacing:0; cellpadding:0; border:none;"> <col width="30%"> <col width="70%"> <tr> <td bgcolor="white"> <b>Argument</b> </td> <td bgcolor="white"> <b>Beschreibung</b> </td> </tr> <tr> <td bgcolor="white"> <mono>y</mono> </td> <td bgcolor="white"> The criterion. Important for a <high>balanced split</high> of the data. </td> </tr> <tr> <td bgcolor="white"> <mono>p</mono> </td> <td bgcolor="white"> The <high>proportion of data</high> of data assigned to training. Often <mono>.8</mono> or <mono>.5</mono>. </td> </tr> </table> ] .pull-right5[ ```r # Important for reproducible results set.seed(100) # Index for training index <- createDataPartition(y = basel$income, p = .8, list = FALSE) # Create training basel_train <- basel %>% slice(index) # Create test basel_test <- basel %>% slice(-index) ``` ] --- class: middle, center <h1><a href="https://therbootcamp.github.io/ML-DHLab/_sessions/Features/Features_practical.html">Practical</a></h1>