Overview

Demonstrate your machine-learning skills in this model competition by predicting gender from a tweeter’s meta information.

The competition will end in…




Competition

A - Preliminaries

  1. Open your BaselRBootcamp R project. It should already have the folders 1_Data and 2_Code.

  2. Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called Models_competition.R in the 2_Code folder.

  3. Load the caret and tidyverse packages.

  4. With the code below, load the tweets data set and change any character variables to factors.

# Load tweet data
tweets <- read_csv(file = "1_Data/tweets_train.csv")

# change character to factor
tweets <- tweets %>% mutate_if(is.character, as.factor)
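
As a small illustration of what the conversion step does, the sketch below applies the same mutate_if() call to a made-up tibble. The object toy_tweets and its values are hypothetical and not part of the competition data. The point is that caret's train() treats a factor outcome as a classification problem, so gender should end up as a factor.

```r
library(dplyr)

# Hypothetical toy data, only for illustration (not the competition data)
toy_tweets <- tibble(
  gender      = c("female", "male", "female"),
  tweet_count = c(120, 45, 300)
)

# Same conversion as above: every character column becomes a factor
toy_tweets <- toy_tweets %>% mutate_if(is.character, as.factor)

# Check the resulting column classes and the factor levels
sapply(toy_tweets, class)
levels(toy_tweets$gender)
```

In current dplyr versions, mutate(across(where(is.character), as.factor)) is an equivalent spelling; mutate_if() still works.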

B - Competition rules

  1. The goal of the competition is to predict with maximal Accuracy whether a twitter user is 'female' or 'male'.

  2. To enter the competition, you can submit up to three caret train objects (the result of the train() function), each containing a candidate model.

  3. To submit a model, first save it as an .RDS file named MYPSEUDONYM_train.RDS using saveRDS(), with MYPSEUDONYM replaced by a pseudonym of your choice. See the code below.

# save train object as .RDS
saveRDS(my_train, "1_Data/MYPSEUDONYM_train.RDS")

  4. Submit your .RDS file(s) containing your train object(s) via mail:

     Task type: Classification
     Criterion: tweets (gender)
     Performance measure: Accuracy
     Submission link: Submit candidate

  5. Use any weapon in your arsenal (or caret’s arsenal). Feel free to try different models, tuning parameter settings, or preprocessing methods, and to use all or only some of the variables: whatever leads to the highest prediction Accuracy. Consult the course materials for help.

  6. So that I can evaluate and compare the models, do not manipulate (or engineer) features in any way other than through the preProcess argument of the train() function.
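
The rules above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not a recommended model: toy is hypothetical stand-in data so that the code runs on its own (use the tweets object from section A instead), and the choices of method = "glm", 5-fold cross-validation, and centering and scaling are purely illustrative. The file is written to a temporary folder here; for a real submission, save it to 1_Data/ as shown above.

```r
library(caret)

set.seed(100)

# Hypothetical stand-in data; replace with the tweets data from section A
n <- 200
toy <- data.frame(
  gender      = factor(sample(c("female", "male"), n, replace = TRUE)),
  tweet_count = rpois(n, lambda = 50),
  name_nchar  = sample(3:15, n, replace = TRUE)
)

# 5-fold cross-validation (an illustrative choice, not a requirement)
ctrl <- trainControl(method = "cv", number = 5)

# Feature transformations go through preProcess only, as the rules require
my_train <- train(
  form       = gender ~ .,
  data       = toy,
  method     = "glm",
  trControl  = ctrl,
  preProcess = c("center", "scale"),
  metric     = "Accuracy"
)

# Cross-validated Accuracy estimate of the fitted model
my_train$results$Accuracy

# Save the train object (temporary folder here; use 1_Data/ for a submission)
out_file <- file.path(tempdir(), "MYPSEUDONYM_train.RDS")
saveRDS(my_train, out_file)

# Reload to confirm that the saved file contains a train object
reloaded <- readRDS(out_file)
class(reloaded)
```

Because the toy outcome is random, the Accuracy printed here carries no meaning; with the real tweets data, the same value gives a cross-validated estimate for comparing candidate models before submitting.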

Datasets

File Rows Columns
tweets 2500 23

Note: The tweets data are a (heavily) pre-processed subset of this original data set from Kaggle.

Variable descriptions
Name Meaning
gender The criterion. Whether the person tweeting was "male" or "female".
year_created The year the person’s twitter account was created.
hour_created The hour of day (1:24h) the person’s twitter account was created.
tweet_count The number of tweets that the person has posted.
retweet_count The number of retweets that the person has posted.
user_timezone The person’s time zone relative to GMT.
name_nchar The number of characters in the person’s twitter name.
name_male 1 if the person’s twitter name contains one of the 1’000 most frequent male baby names in America, 0 if not.
name_female 1 if the person’s twitter name contains one of the 1’000 most frequent female baby names in America, 0 if not.
descr_nchar The number of characters in the person’s twitter account description.
descr_male 1 if the person’s twitter account description contains one of the 1’000 most frequent male baby names in America, 0 if not.
descr_female 1 if the person’s twitter account description contains one of the 1’000 most frequent female baby names in America, 0 if not.
descr_sent Average sentiment score (>0 = positive sentiment) of the person’s twitter account description.
tweet_nchar The number of characters in one randomly chosen tweet by the person.
tweet_male 1 if the randomly chosen tweet contains one of the 1’000 most frequent male baby names in America, 0 if not.
tweet_female 1 if the randomly chosen tweet contains one of the 1’000 most frequent female baby names in America, 0 if not.
tweet_sent Average sentiment score (>0 = positive sentiment) of the randomly chosen tweet.
linkcol_red Red value (1:255) in the link color according to the person’s twitter scheme.
linkcol_green Green value (1:255) in the link color according to the person’s twitter scheme.
linkcol_blue Blue value (1:255) in the link color according to the person’s twitter scheme.
sidecol_red Red value (1:255) in the side bar color according to the person’s twitter scheme.
sidecol_green Green value (1:255) in the side bar color according to the person’s twitter scheme.
sidecol_blue Blue value (1:255) in the side bar color according to the person’s twitter scheme.

Cheatsheet

[Figure: cheatsheet image, from github.com/rstudio]