Overview

Demonstrate your machine-learning-jedi-knight skills in this model competition. Predict gender from a tweeter’s meta information and win lots of 🍫🍫🍫.

Background story

The members of the governing council of Basel-Stadt recently met at the comfortable yet luxurious home of Elisabeth Ackerman, the council’s president, to watch the latest science-fiction blockbuster, Minority Report. In this uplifting, utopian movie, Chief John Anderton, marvelously played by Tom Cruise, works together with friendly but slightly introverted super-beings who have the rare (but scientifically proven - see rigorous academic paper) ability to foresee the future, in order to arrest perpetrators not after they have committed a crime (boring!) but before.

Aside from spending a wonderful evening with his colleagues, Baschi Dürr, head of the justice and homeland security department and the one who had proposed the movie, was hoping to win over his fellow council members for a revolutionary program to tackle an unprecedented swell in non-violent crimes and murders in Basel-Stadt in recent months. Baschi Dürr knew, of course, that precogs, as the super-beings were called, belonged to the land of fairy tales, but during his visit to the recent useR! conference in Brisbane he had learned of the next best thing: machine learning. His goal was, thus, to hire an elite team of programmers and to equip them with a maximally informative data set of past crimes to predict where future crimes would occur. Knowing that Basel shares many similarities with the United States of America, Baschi Dürr had already spent a significant amount of money to acquire a data set of murders and non-violent crimes across American counties.

Now he only needed to convince his colleagues. And Fortuna was on his side. The movie and the few bottles of Napoleon French Brandy that he had bought at Coop earlier that day made such an impression that Elisabeth Ackerman decided to dissolve the investigative branch of the police and invest all of its budget in the new ML task force. The next morning, with a slight, manageable hangover, Baschi Dürr announced an international ML modeling competition, promising that whoever built the best model would head the new task force. After many hours of thinking about the problem and far fewer hours of thinking about the solution, the task force maneuvered itself into obsolescence, having developed the perfect algorithm to predict the locations of future crime.

(Names, characters, businesses, places, events, locales, and incidents are either the products of the author’s imagination or used in a fictitious manner. Any resemblance to actual persons, living or dead, or actual events is purely coincidental.)

Mission statement

After an overpaid machine learning task force maneuvered itself into obsolescence, having developed the perfect algorithm to predict the locations of future crime, Baschi Dürr, head of the justice and homeland security department of Basel-Stadt, needed a new application for his machine learning task force. He feared that Elisabeth Ackerman, president of the governing council, might otherwise withdraw his newly earned funding (you’ll find the story of how he obtained that funding in the “Story” tab). And he knew he had to act quickly. To finance a new set of Wegner Swivel Chairs (aren’t they pretty!) for city hall’s conference rooms, Elisabeth Ackerman and Dr. Eva Herzog, head of the city’s financial department, had just announced that any expendable resources would be invested in cryptocurrencies. In search of inspiration, Baschi Dürr called up his American friends, who had previously given him such a great price on the American crime data set. He learned that they had recently begun to use machine learning for political education with outstanding results (truly outstanding!). They explained to him that, using machine learning, one could uncover a person’s gender or personality and then send that person tailored political information. Knowing the information-hungry Swiss democracy, Baschi Dürr wasted no time after the call in convening his machine learning task force: Switzerland was not going to miss out on this amazing opportunity. America first, Switzerland second.

A - Preliminaries

  1. Open your BaselRBootcamp R project. It should already have the folders 1_Data and 2_Code.

  2. Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called Models_competition.R in the 2_Code folder.

  3. Using library() load the set of packages for this practical listed in the packages section above.

## NAME
## DATE
## Modeling competition

library(tidyverse)
library(caret)

  4. With the code below, load the tweets data set and change any character variables to factors.

# Load tweet data
tweets <- read_csv(file = "1_Data/tweets_train.csv")

# change character to factor
tweets <- tweets %>% mutate_if(is.character, as.factor)
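
Before modeling, it can help to confirm that the conversion worked and to see how balanced the criterion is. The sketch below uses a small made-up tibble (`toy`, an assumption for illustration, not the real tweets data) so that it runs without the CSV file; with the real data the same calls would be applied to `tweets`. The majority-class baseline is an added suggestion, not part of the original instructions.

```r
library(tidyverse)

# Hypothetical stand-in for the tweets data (values are invented)
toy <- tibble(
  gender      = c("female", "male", "female", "female", "male"),
  tweet_count = c(120, 45, 300, 12, 78)
)

# Same conversion as above: character columns become factors
toy <- toy %>% mutate_if(is.character, as.factor)

# Check that the criterion is now a factor
is.factor(toy$gender)

# Share of each class in the criterion
class_share <- prop.table(table(toy$gender))
class_share

# Accuracy obtained by always predicting the most frequent class;
# a useful model should beat this baseline
baseline_accuracy <- max(class_share)
baseline_accuracy
```

Comparing a model's cross-validated Accuracy with this baseline gives a quick sense of whether the model has learned anything beyond the class frequencies.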

B - Competition rules

  1. The goal of the competition is to predict with maximal Accuracy whether a twitter user is 'female' or 'male'.

  2. Entering the competition grants you a chance to win lots of 🍫🍫🍫.

  3. To enter the competition, you can submit up to three caret train objects (each the result of the train() function), each containing one candidate model.

  4. To submit the model, first save your model as an .RDS-file named pseudonym_train.RDS, with pseudonym replaced by a pseudonym of your choice. See the code below.

# Save train object as .RDS
saveRDS(my_train, '1_Data/mypseudonym_train.RDS')

  5. Submit your .RDS file (containing your train object) via the following link:

| Task type | Criterion | Performance measure | Submission link |
|---|---|---|---|
| Classification | tweets (gender) | Accuracy | Submit candidate |

  6. Use any weapon in your arsenal (or caret’s arsenal). Feel free to try different models, use different tuning parameter settings or preprocessing methods, and use all or only some of the variables: whatever leads to the highest prediction Accuracy. Make sure to look at the course materials for help.

  7. So that the models can be evaluated, refrain from any manipulation (or engineering) of features other than what is available through the preProcess argument of the train() function.
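
The rules above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a recommended model: the data frame `toy` is simulated so that the code runs without the CSV file, and the choices of `method = "glm"`, 5-fold cross-validation, and centering and scaling are made up for the example. Only the variable names gender, tweet_count and name_female come from the data description, and the file is written to a temporary directory rather than to 1_Data.

```r
library(caret)

set.seed(100)

# Simulated stand-in for the tweets data (values are invented)
n <- 200
toy <- data.frame(
  tweet_count = rpois(n, lambda = 50),
  name_female = rbinom(n, size = 1, prob = 0.5)
)

# Make gender depend on name_female so there is something to learn
p_female <- ifelse(toy$name_female == 1, 0.8, 0.3)
toy$gender <- factor(ifelse(runif(n) < p_female, "female", "male"))

# Estimate out-of-sample Accuracy with 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# All preprocessing goes through the preProcess argument of train()
my_train <- train(
  gender ~ .,
  data       = toy,
  method     = "glm",
  trControl  = ctrl,
  preProcess = c("center", "scale")
)

# Cross-validated Accuracy of this candidate
my_train$results$Accuracy

# Save the train object and check that it can be read back
# (for a real submission, save to '1_Data/mypseudonym_train.RDS' instead)
rds_path <- file.path(tempdir(), "mypseudonym_train.RDS")
saveRDS(my_train, rds_path)
restored <- readRDS(rds_path)
class(restored)
```

Because centering and scaling are stored inside the train object, they are reapplied automatically when the saved model is used with predict() on new data, which is one reason to keep all feature handling inside train().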

Datasets

| File | Rows | Columns |
|---|---|---|
| tweets | 2500 | 23 |

Note: The tweets data are a (heavily) pre-processed subset of this original data set from Kaggle.

Variable descriptions
| Name | Meaning |
|---|---|
| gender | The criterion. Whether the person tweeting was "male" or "female". |
| year_created | The year the person’s twitter account was created. |
| hour_created | The hour of day (1:24h) the person’s twitter account was created. |
| tweet_count | The number of tweets that the person has posted. |
| retweet_count | The number of retweets that the person has posted. |
| user_timezone | The person’s time zone relative to GMT. |
| name_nchar | The number of characters in the person’s twitter name. |
| name_male | 1 if the person’s twitter name contains one of the 1’000 most frequent male baby names in America, 0 if not. |
| name_female | 1 if the person’s twitter name contains one of the 1’000 most frequent female baby names in America, 0 if not. |
| descr_nchar | The number of characters in the person’s twitter account description. |
| descr_male | 1 if the person’s twitter account description contains one of the 1’000 most frequent male baby names in America, 0 if not. |
| descr_female | 1 if the person’s twitter account description contains one of the 1’000 most frequent female baby names in America, 0 if not. |
| descr_sent | Average sentiment score (>0 = positive sentiment) of the person’s twitter account description. |
| tweet_nchar | The number of characters in one randomly chosen tweet by the person. |
| tweet_male | 1 if the randomly chosen tweet contains one of the 1’000 most frequent male baby names in America, 0 if not. |
| tweet_female | 1 if the randomly chosen tweet contains one of the 1’000 most frequent female baby names in America, 0 if not. |
| tweet_sent | Average sentiment score (>0 = positive sentiment) of the randomly chosen tweet. |
| linkcol_red | Red value (1:255) in the link color according to the person’s twitter scheme. |
| linkcol_green | Green value (1:255) in the link color according to the person’s twitter scheme. |
| linkcol_blue | Blue value (1:255) in the link color according to the person’s twitter scheme. |
| sidecol_red | Red value (1:255) in the side bar color according to the person’s twitter scheme. |
| sidecol_green | Green value (1:255) in the side bar color according to the person’s twitter scheme. |
| sidecol_blue | Blue value (1:255) in the side bar color according to the person’s twitter scheme. |

Cheatsheet

from github.com/rstudio