Machine Learning with R | Basel R Bootcamp
Demonstrate your machine-learning-jedi-knight skills in this model competition. Predict gender
from a tweeter’s meta information and win lots of 🍫🍫🍫.
After his overpaid machine learning task force maneuvered itself into obsolescence by developing the perfect algorithm to predict the locations of future crime, Baschi Dürr, head of the justice and homeland security department of Basel-Stadt, needed a new project for the task force before Elisabeth Ackerman, president of the governing council, might withdraw his newly earned funding. And he knew he had to act quickly. To finance a new set of Wegner Swivel Chairs (what else!) for city hall’s conference rooms, Elisabeth Ackerman and Dr. Eva Herzog, head of the city’s financial department, had just announced that any expendable resources were to be invested in cryptocurrencies.

For lack of ideas, Baschi Dürr called up his American friends, who had previously given him such a great price on the American crime data set. He learned that they had recently begun to use machine learning for political education with impressive results (truly outstanding!). They explained that, using machine learning, one could uncover a person’s gender or personality and then send that person tailored political information. All too aware of the Swiss democracy’s need for information, Baschi Dürr ordered his machine learning task force to congregate at once so as not to waste any time: Switzerland was not going to miss out on this amazing opportunity. America first, Switzerland second.
(Names, characters, businesses, places, events, locales, and incidents are either the products of the author’s imagination or used in a fictitious manner. Any resemblance to actual persons, living or dead, or actual events is purely coincidental.)
Open your BaselRBootcamp R project. It should already have the folders 1_Data and 2_Code.
Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called Models_competition.R in the 2_Code folder.
Using library(), load the set of packages for this practical listed in the packages section above.
## NAME
## DATE
## Modeling competition
library(tidyverse)
library(caret)
Load the tweets data set and change any character variables to factors.
# Load tweet data
tweets <- read_csv(file = "1_Data/tweets_train.csv")
# change character to factor
tweets <- tweets %>% mutate_if(is.character, as.factor)
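Recent dplyr versions (1.0.0 and later) mark mutate_if() as superseded in favour of across(). The following is a minimal sketch of the equivalent conversion, run on a small made-up tibble rather than the real tweets data. The object name toy_tweets and its values are invented for illustration.

```r
library(dplyr)

# Toy stand-in for the tweets data; values are invented for illustration
toy_tweets <- tibble(
  gender      = c("female", "male", "male"),
  tweet_count = c(120, 3400, 58)
)

# Convert every character column to a factor, leave other columns untouched
toy_tweets <- toy_tweets %>%
  mutate(across(where(is.character), as.factor))

# Inspect the factor levels of the converted column
levels(toy_tweets$gender)
```

Either form should give the same result here, so the mutate_if() call used in this practical can stay as it is.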
The goal of the competition is to predict, with maximal Accuracy, whether a twitter user is 'female' or 'male'.
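As a reminder of what Accuracy measures, here is a small sketch in base R. The labels in truth and pred are made up for illustration and are not taken from the tweets data.

```r
# Hypothetical true and predicted labels for six users (invented values)
truth <- factor(c("female", "male", "female", "male", "male", "female"),
                levels = c("female", "male"))
pred  <- factor(c("female", "male", "male", "male", "female", "female"),
                levels = c("female", "male"))

# Cross-tabulate predictions against the truth
conf_table <- table(predicted = pred, actual = truth)

# Accuracy = number of correct predictions / number of all predictions
accuracy <- sum(diag(conf_table)) / sum(conf_table)
accuracy
```

Four of the six made-up predictions match the truth, so accuracy is 4/6 in this toy case.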
Entering the competition grants you a chance to win lots of 🍫🍫🍫.
To enter the competition, you can submit up to three caret train objects (each the result of the train() function) containing your candidate models.
To submit a model, first save it as an .RDS file named pseudonym_train.RDS, with pseudonym replaced by a pseudonym of your choice. See the code below.
# save train object as .RDS
saveRDS(my_train, '1_Data/mypseudonym_train.RDS')
Submit the .RDS file (containing your train object) via the following link:
Task type | Criterion | Performance measure | Submission link |
---|---|---|---|
Classification | tweets (gender) | Accuracy | Submit candidate |
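Before uploading, it may be worth checking that the saved file reads back as a train object. The sketch below uses a stand-in object and a temporary file path so that it runs anywhere. In practice my_train would be the result of train() and the path would point into the 1_Data folder. The names rds_path and reloaded are hypothetical.

```r
# Stand-in for a fitted model; in practice my_train is the result of train()
my_train <- structure(list(method = "glm"), class = "train")

# Hypothetical file location; the practical saves into the 1_Data folder
rds_path <- file.path(tempdir(), "mypseudonym_train.RDS")
saveRDS(my_train, rds_path)

# Read the file back and confirm it still holds a train object
reloaded <- readRDS(rds_path)
inherits(reloaded, "train")
```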
Use any weapon in your arsenal (or caret’s arsenal). Feel free to try different models, use different tuning parameter settings or preprocessing methods, and make use of all or some variables: whatever may lead to the highest prediction Accuracy. Consult the course materials for help.
In order for us to be able to evaluate and compare the models, you must refrain from any manipulation (or engineering) of features other than those accessible via the preProcess argument in the train() function.
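To illustrate how preprocessing and tuning can both be handled inside train(), here is a minimal sketch. It assumes simulated stand-in data (the toy data frame borrows three column names from the tweets data, but all values are invented), k-nearest neighbours as one arbitrary model choice, and 5-fold cross-validation. None of these choices is prescribed by the competition.

```r
library(caret)

set.seed(100)  # make the simulated data and resampling reproducible

# Simulated stand-in data; column names borrowed from the tweets data,
# values invented for illustration
n <- 200
toy <- data.frame(
  tweet_count = rpois(n, lambda = 500),
  name_nchar  = sample(3:15, n, replace = TRUE),
  descr_sent  = rnorm(n)
)
toy$gender <- factor(ifelse(toy$descr_sent + rnorm(n) > 0, "female", "male"))

# 5-fold cross-validation to estimate Accuracy for each candidate k
ctrl <- trainControl(method = "cv", number = 5)

# k-nearest neighbours; centering and scaling happen inside train()
my_train <- train(
  gender ~ .,
  data       = toy,
  method     = "knn",
  preProcess = c("center", "scale"),
  tuneGrid   = expand.grid(k = c(3, 5, 7)),
  trControl  = ctrl
)

# Cross-validated Accuracy per value of k, and the selected k
my_train$results[, c("k", "Accuracy")]
my_train$bestTune
```

Because centering and scaling are requested through preProcess, they are stored in the train object and re-applied to new data at prediction time, which is what keeps submitted models comparable.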
File | Rows | Columns |
---|---|---|
tweets | 2500 | 23 |
Note: The tweets data are a (heavily) pre-processed subset of an original data set from Kaggle.
Name | Meaning |
---|---|
gender | The criterion. Whether the person tweeting was "male" or "female". |
year_created | The year the person’s twitter account was created. |
hour_created | The hour of day (1:24h) the person’s twitter account was created. |
tweet_count | The number of tweets that the person has posted. |
retweet_count | The number of retweets that the person has posted. |
user_timezone | The person’s time zone relative to GMT. |
name_nchar | The number of characters in the person’s twitter name. |
name_male | 1 if the person’s twitter name contains one of the 1’000 most frequent male baby names in America, 0 if not. |
name_female | 1 if the person’s twitter name contains one of the 1’000 most frequent female baby names in America, 0 if not. |
descr_nchar | The number of characters in the person’s twitter account description. |
descr_male | 1 if the person’s twitter account description contains one of the 1’000 most frequent male baby names in America, 0 if not. |
descr_female | 1 if the person’s twitter account description contains one of the 1’000 most frequent female baby names in America, 0 if not. |
descr_sent | Average sentiment score (>0 = positive sentiment) of the person’s twitter account description. |
tweet_nchar | The number of characters in one randomly chosen tweet by the person. |
tweet_male | 1 if the randomly chosen tweet contains one of the 1’000 most frequent male baby names in America, 0 if not. |
tweet_female | 1 if the randomly chosen tweet contains one of the 1’000 most frequent female baby names in America, 0 if not. |
tweet_sent | Average sentiment score (>0 = positive sentiment) of the randomly chosen tweet. |
linkcol_red | Red value (1:255) in the link color according to the person’s twitter scheme. |
linkcol_green | Green value (1:255) in the link color according to the person’s twitter scheme. |
linkcol_blue | Blue value (1:255) in the link color according to the person’s twitter scheme. |
sidecol_red | Red value (1:255) in the side bar color according to the person’s twitter scheme. |
sidecol_green | Green value (1:255) in the side bar color according to the person’s twitter scheme. |
sidecol_blue | Blue value (1:255) in the side bar color according to the person’s twitter scheme. |