class: center, middle, inverse, title-slide

# Features

### Machine Learning with R<br>Basel R Bootcamp
### October 2019

---

layout: true

<div class="my-footer">
<span style="text-align:center">
<span>
  <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
</span>
<a href="https://therbootcamp.github.io/">
  <span style="padding-left:82px">
  <font color="#7E7E7E">
  www.therbootcamp.com
  </font>
  </span>
</a>
<a href="https://therbootcamp.github.io/">
  <font color="#7E7E7E">
  Machine Learning with R | October 2019
  </font>
</a>
</span>
</div>

---

.pull-left45[

# Feature issues

<br>

<b>Too many features</b>

- Curse of dimensionality
- Feature importance

<b>Wrong features</b>

- Feature scaling
- Feature correlation
- Feature quality

<b>Create new features</b>

- Feature engineering

]

.pull-right45[

<br><br>
<p align="center">
  <img src="image/dumbdata.png" height = 500px><br>
  <font style="font-size:10px">from <a href="https://xkcd.com/1838/">xkcd.com</a></font>
</p>

]

---

# Curse of dimensionality

.pull-left35[

As the number of features grows...

<high>Performance</high> - the amount of data needed to generalize accurately grows exponentially.

<high>Efficiency</high> - the number of computations grows (how much depends on the model).

<high>Redundancy</high> - the amount of redundancy among features grows (how much depends on the model).

→ <high>Small set of good predictors</high>

]

.pull-right6[

<br>
<p align="center">
  <img src="image/cod.png"><br>
  <font style="font-size:10px">from <a href="https://medium.freecodecamp.org/the-curse-of-dimensionality-how-we-can-save-big-data-from-itself-d9fa0f872335?gi=6e6735e00188">medium.freecodecamp.org</a></font>
</p>

]

---

# How to reduce dimensionality?

.pull-left45[

<b>3 ways</b>

1 - Reduce variables <high>manually</high> based on statistical or intuitive considerations.

2 - Reduce variables <high>automatically</high> using suitable ML algorithms, e.g., `random forests` or `lasso regression`, or feature selection algorithms, e.g., `recursive feature selection`.

3 - Compress variables using <high>dimensionality reduction algorithms</high>, such as `principal component analysis` (PCA).

]

.pull-right5[

<p align = "center">
  <img src="image/highd.jpeg" height=350>
  <font style="font-size:10px">from <a href="">Interstellar</a></font>
</p>

]

---

# Feature importance

.pull-left4[

<high>Feature importance</high> characterizes how much a feature contributes to fitting/prediction performance. The metric is <high>model specific</high>, but typically <high>normalized</high> to `[0, 100]`.

<u>Strategies</u>

- Single-variable prediction (e.g., using `LOESS`, `ROC`)
- Accuracy loss from scrambling
- `random forests` importance
- etc.

]

.pull-right5[

```r
# plot variable importance for lm(income ~ .)
plot(varImp(income_lm))
```

<img src="Features_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

]

---

# `varImp()`

.pull-left45[

`varImp()` <high>automatically selects an appropriate measure</high> of variable importance for a given algorithm.

```r
varImp(income_lm)
```

```
lm variable importance

          Overall
age       100.000
food       41.134
alcohol    25.247
happiness  11.582
tattoos     5.888
children    2.745
height      2.337
weight      0.983
```

]

.pull-right5[

```r
# plot variable importance for lm(income ~ .)
plot(varImp(income_lm))
```

<img src="Features_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

]
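---

# From `train()` to `varImp()`

The `income_lm` object above is a fitted `caret` model. A minimal, hypothetical sketch of the full workflow; the exact `train()` call on `bas_train` is an assumption based on the surrounding slides, not code from the practical:

```r
library(caret)

# fit a regression model predicting income from all other features
income_lm <- train(income ~ .,
                   method = "lm",
                   data = bas_train)

# model-specific importance; for lm, caret rescales the absolute
# t-statistic of each coefficient to [0, 100]
varImp(income_lm)

# raw, unscaled importance values
varImp(income_lm, scale = FALSE)

# dotplot of variable importance
plot(varImp(income_lm))
```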
---

.pull-left35[

# Recursive feature selection

`rfe()` finds the <high>best number of predictors `n`</high>, with `n` drawn from a specified candidate set `N`, e.g., `N = [2,3,5,10]`.

<u>Algorithm</u>

1. <high>Resample</high> and split the data<br2>
2. Identify the <high>best `n` predictors</high> and their prediction performance<br2>
3. <high>Aggregate performance</high> and select the best `n` along with the correspondingly best predictors

]

.pull-right55[

<br><br>

```r
# Run feature elimination
rfe(x = ..., y = ...,
    sizes = c(3, 4, 5, 10),  # feature set sizes
    rfeControl = rfeControl(functions = lmFuncs))
```

```
Recursive feature selection

Outer resampling method: Bootstrapped (25 reps)

Resampling performance over subset size:

 Variables  RMSE Rsquared   MAE RMSESD RsquaredSD   MAESD Selected
         3 0.371    0.861 0.292 0.0133     0.0139 0.01078
         4 0.367    0.863 0.288 0.0127     0.0126 0.00992        *
         5 0.368    0.863 0.289 0.0122     0.0124 0.00951
        10 0.371    0.861 0.292 0.0123     0.0121 0.00922
        14 0.371    0.861 0.291 0.0122     0.0120 0.00918

The top 4 variables (out of 4):
   age, food, alcohol, happiness
```

]

---

# Dimensionality reduction using `PCA`

.pull-left45[

The go-to algorithm for dimensionality reduction is <high>principal component analysis</high> (PCA).

PCA is an <high>unsupervised</high>, regression-based algorithm that re-represents the data in a <high>new feature space</high>.

The new features, a.k.a. principal components, are extracted greedily: each captures as much of the remaining variance as possible. <high>Skimming off the best components</high> therefore yields a small number of features that <high>preserve the information in the original features</high> as well as possible.

]

.pull-right45[

<p align = "center">
  <img src="image/pca.png" height=350>
  <font style="font-size:10px">from <a href="https://blog.umetrics.com/what-is-principal-component-analysis-pca-and-how-it-is-used">blog.umetrics.com</a></font>
</p>

]

---

# Using `PCA`

.pull-left45[

```r
# train model WITHOUT PCA preprocessing
model = train(income ~ .,
              method = 'lm',
              data = bas_train)

plot(varImp(model))
```

<img src="Features_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

]

.pull-right45[

```r
# train model WITH PCA preprocessing
model = train(income ~ .,
              method = 'lm',
              data = bas_train,
              preProc = c('pca'))

plot(varImp(model))
```

<img src="Features_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

]

---

# Other, easy feature problems

.pull-left45[

### Multi-collinearity

Multi-collinearity, i.e., <high>high feature correlations</high>, means that there is redundancy in the data, which can lead to <high>less stable fits</high>, <high>uninterpretable variable importances</high>, and <high>worse predictions</high>.

```r
# identify redundant variables
findCorrelation(cor(baselers))
```

```
[1] 5
```

```r
# remove from data
remove <- findCorrelation(cor(baselers))
baselers <- baselers %>% select(-remove)
```

]

.pull-right45[

### Unequal & low variance

Unequal variance <high>breaks regularization</high> (L1, L2) and renders estimates difficult to interpret.

```r
# standardize and center variables
train(..., preProc = c("center", "scale"))
```

Low-variance variables add parameters but <high>can hardly contribute to prediction</high> and are thus also redundant.

```r
# identify low variance variables
nearZeroVar(baselers)
```

```
integer(0)
```

]
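---

# Combining the checks with `preProcess()`

A hedged sketch of how the above checks could be combined into a single `caret` preprocessing recipe; applying it to the numeric `baselers` columns and using the default cutoffs are assumptions, not code from the practical:

```r
library(caret)

# estimate a preprocessing recipe on the training data:
# center & scale all numeric predictors, and drop near-zero-variance
# ("nzv") and highly correlated ("corr") predictors
pp <- preProcess(baselers,
                 method = c("center", "scale", "nzv", "corr"))

# apply the recipe to the data (or later to new data)
baselers_clean <- predict(pp, newdata = baselers)

# the same filters can be requested inside train()
# via preProc = c("center", "scale", "nzv", "corr")
```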
---

# Difficult feature problems

<br>

.pull-left25[

1 - <b>Trivial features</b>

Successful prediction does not necessarily imply that a meaningful pattern has been detected.

<br>

2 - <b>Missing features</b>

Some problems are hard, requiring the engineering of new features.

]

.pull-right65[

<br>
<p align = "center">
  <img src="image/here_to_help.png"><br>
  <font style="font-size:10px">from <a href="https://xkcd.com/1831/">xkcd.com</a></font>
</p>

]

---

# Trivial features

.pull-left3[

<u><a href="https://www.gwern.net/Tanks">An urban myth?!</a></u>

"The Army trained a program to differentiate American tanks from Russian tanks with 100% accuracy. Only later did analysts realize that the American tanks had been photographed on a sunny day and the Russian tanks had been photographed on a cloudy day. The computer had learned to detect brightness."<br><br>

New York Times <a href="https://www.nytimes.com/2017/10/09/science/stanford-sexual-orientation-study.html" style="font-size:8px">[Full text]</a>

]

.pull-right6[

<p align = "center">
  <img src="image/tank.jpg">
  <font style="font-size:10px">from <a href="https://en.wikipedia.org/wiki/British_heavy_tanks_of_World_War_I#/media/File:Mark_I_series_tank.jpg">wikipedia.org</a></font>
</p>

]

---

# Trivial features

In 2012, Nate Silver was praised for correctly predicting the outcome of the presidential election in all 50 states, after correctly predicting 49 states in 2008. <high>But how much of a challenge was that?</high>

.pull-left5[

<p align = "center">
  <img src="image/elect2008.png" height = 360px><br>
  <font style="font-size:10px">from <a href="https://www.vox.com/policy-and-politics/2016/11/8/13563106/election-map-historical-vote">vox.com</a></font>
</p>

]

.pull-right5[

<p align = "center">
  <img src="image/elect2012.png" height = 360px><br>
  <font style="font-size:10px">from <a href="https://www.vox.com/policy-and-politics/2016/11/8/13563106/election-map-historical-vote">vox.com</a></font>
</p>

]

---

# (Always!) missing features

<i>"…some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used."</i>

Pedro Domingos

<i>"The algorithms we used are very standard for Kagglers. […] We spent most of our efforts in feature engineering. [...] We were also very careful to discard features likely to expose us to the risk of over-fitting our model."</i>

Xavier Conort

<i>"Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering."</i>

Andrew Ng

---

# Feature engineering

.pull-left45[

<i>“Feature engineering is the process of <high>transforming raw data</high> into features that <high>better represent the underlying problem</high> to the predictive models, resulting in improved model accuracy on unseen data.”</i>

Jason Brownlee

<i>"...while avoiding the <high>curse of dimensionality</high>."</i>

duw

<u>Feature engineering involves</u>

- <b>Transformations</b>
- <b>Interactions</b>
- <b>New features</b>

→ a short code sketch follows on the last slide

]

.pull-right45[

<p align = "center">
  <img src="image/albert.jpeg"><br>
  <font style="font-size:10px">from <a href="http://www.open.edu/openlearncreate/mod/oucontent/view.php?id=80245&section=1">open.edu</a></font>
</p>

]

---

class: middle, center

<h1><a href="https://therbootcamp.github.io/ML_2019Oct/_sessions/Features/Features_practical.html">Practical</a></h1>
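---

# Feature engineering: a sketch

An illustrative, hypothetical example of the three ingredients listed on the feature engineering slide; the derived features (and the assumption that `height` is recorded in cm) are illustrative, not features used in the practical:

```r
library(tidyverse)
library(caret)

# transformations and new features
baselers <- baselers %>%
  drop_na() %>%                                        # keep complete cases for the sketch
  mutate(log_alcohol = log(alcohol + 1),               # transformation
         bmi         = weight / (height / 100)^2)      # new feature from two raw features

# interactions can be specified directly in the model formula
income_lm <- train(income ~ age * children + bmi + log_alcohol,
                   method = "lm",
                   data = baselers)
```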