My Bui (Mimi)

Data Engineer & DataOps

My LinkedIn
My GitHub

Ensemble Learning with RandomForest and XGBoost: speed dating dataset

Goals

1. Classification - matched or unmatched date: accuracy = 0.85618

2. Regression - how much a participant likes his/her date: RMSE = 0.70373

Data set description

This data was gathered from participants in experimental speed dating events from 2002-2004. During the events, the attendees would have a four-minute “first date” with every other participant of the opposite sex. At the end of their four minutes, participants were asked if they would like to see their date again. They were also asked to rate their date on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests. The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include: demographics, dating habits, self-perception across key attributes, beliefs on what others find valuable in a mate, and lifestyle information.

Attributes
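
The attribute list is long (the cleaned frame below has 65 columns); a minimal sketch, assuming the same speeddating.csv loaded in the next cell, just to print the column names:

import pandas as pd

# read only the header row to list the available attributes
cols = pd.read_csv('speeddating.csv', nrows=0).columns
print(len(cols), 'attributes')
print(list(cols))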

ML tasks

1. Classification: matched or unmatched

2. Regression: how much a participant likes his/her date

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
import numpy as np
data = pd.read_csv('speeddating.csv')
data.dropna(inplace=True)
# drop the free-text 'field' column (many messy values) and the row at index 1
data.drop(columns='field', index=1, inplace=True)
data
gender age age_o d_age race race_o samerace importance_same_race importance_same_religion pref_o_attractive ... interests_correlate expected_happy_with_sd_people expected_num_interested_in_me expected_num_matches like guess_prob_liked met decision decision_o match
0 female 21.0 27.0 6 'Asian/Pacific Islander/Asian-American' European/Caucasian-American 0 2.0 4.0 35.0 ... 0.14 3.0 2.0 4.0 7.0 6.0 0.0 1 0 0
3 female 21.0 23.0 2 'Asian/Pacific Islander/Asian-American' European/Caucasian-American 0 2.0 4.0 30.0 ... 0.61 3.0 2.0 4.0 7.0 6.0 0.0 1 1 1
4 female 21.0 24.0 3 'Asian/Pacific Islander/Asian-American' 'Latino/Hispanic American' 0 2.0 4.0 30.0 ... 0.21 3.0 2.0 4.0 6.0 6.0 0.0 1 1 1
5 female 21.0 25.0 4 'Asian/Pacific Islander/Asian-American' European/Caucasian-American 0 2.0 4.0 50.0 ... 0.25 3.0 2.0 4.0 6.0 5.0 0.0 0 1 0
6 female 21.0 30.0 9 'Asian/Pacific Islander/Asian-American' European/Caucasian-American 0 2.0 4.0 35.0 ... 0.34 3.0 2.0 4.0 6.0 5.0 0.0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1836 male 19.0 20.0 1 'Asian/Pacific Islander/Asian-American' Other 0 4.0 1.0 15.0 ... 0.35 5.0 0.0 0.0 5.0 1.0 0.0 0 0 0
1837 male 19.0 21.0 2 'Asian/Pacific Islander/Asian-American' European/Caucasian-American 0 4.0 1.0 15.0 ... 0.45 5.0 0.0 0.0 5.0 1.0 0.0 0 0 0
1838 male 19.0 20.0 1 'Asian/Pacific Islander/Asian-American' 'Black/African American' 0 4.0 1.0 20.0 ... 0.13 5.0 0.0 0.0 5.0 1.0 0.0 0 0 0
1840 male 19.0 21.0 2 'Asian/Pacific Islander/Asian-American' European/Caucasian-American 0 4.0 1.0 15.0 ... 0.54 5.0 0.0 0.0 5.0 1.0 0.0 0 0 0
1843 male 19.0 20.0 1 'Asian/Pacific Islander/Asian-American' 'Latino/Hispanic American' 0 4.0 1.0 10.0 ... 0.54 5.0 0.0 0.0 5.0 1.0 0.0 0 0 0

1047 rows × 65 columns

Data Pre-processing

num_data = data.select_dtypes(exclude='object').reset_index(drop=True)
o_data = data.select_dtypes(include='object')
num_data
age age_o d_age samerace importance_same_race importance_same_religion pref_o_attractive pref_o_sincere pref_o_intelligence pref_o_funny ... interests_correlate expected_happy_with_sd_people expected_num_interested_in_me expected_num_matches like guess_prob_liked met decision decision_o match
0 21.0 27.0 6 0 2.0 4.0 35.0 20.0 20.0 20.0 ... 0.14 3.0 2.0 4.0 7.0 6.0 0.0 1 0 0
1 21.0 23.0 2 0 2.0 4.0 30.0 5.0 15.0 40.0 ... 0.61 3.0 2.0 4.0 7.0 6.0 0.0 1 1 1
2 21.0 24.0 3 0 2.0 4.0 30.0 10.0 20.0 10.0 ... 0.21 3.0 2.0 4.0 6.0 6.0 0.0 1 1 1
3 21.0 25.0 4 0 2.0 4.0 50.0 0.0 30.0 10.0 ... 0.25 3.0 2.0 4.0 6.0 5.0 0.0 0 1 0
4 21.0 30.0 9 0 2.0 4.0 35.0 15.0 25.0 10.0 ... 0.34 3.0 2.0 4.0 6.0 5.0 0.0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1042 19.0 20.0 1 0 4.0 1.0 15.0 15.0 20.0 25.0 ... 0.35 5.0 0.0 0.0 5.0 1.0 0.0 0 0 0
1043 19.0 21.0 2 0 4.0 1.0 15.0 15.0 25.0 25.0 ... 0.45 5.0 0.0 0.0 5.0 1.0 0.0 0 0 0
1044 19.0 20.0 1 0 4.0 1.0 20.0 20.0 20.0 20.0 ... 0.13 5.0 0.0 0.0 5.0 1.0 0.0 0 0 0
1045 19.0 21.0 2 0 4.0 1.0 15.0 15.0 25.0 25.0 ... 0.54 5.0 0.0 0.0 5.0 1.0 0.0 0 0 0
1046 19.0 20.0 1 0 4.0 1.0 10.0 10.0 35.0 35.0 ... 0.54 5.0 0.0 0.0 5.0 1.0 0.0 0 0 0

1047 rows × 62 columns

o_data
gender race race_o
0 female 'Asian/Pacific Islander/Asian-American' European/Caucasian-American
3 female 'Asian/Pacific Islander/Asian-American' European/Caucasian-American
4 female 'Asian/Pacific Islander/Asian-American' 'Latino/Hispanic American'
5 female 'Asian/Pacific Islander/Asian-American' European/Caucasian-American
6 female 'Asian/Pacific Islander/Asian-American' European/Caucasian-American
... ... ... ...
1836 male 'Asian/Pacific Islander/Asian-American' Other
1837 male 'Asian/Pacific Islander/Asian-American' European/Caucasian-American
1838 male 'Asian/Pacific Islander/Asian-American' 'Black/African American'
1840 male 'Asian/Pacific Islander/Asian-American' European/Caucasian-American
1843 male 'Asian/Pacific Islander/Asian-American' 'Latino/Hispanic American'

1047 rows × 3 columns

# one-hot encode the three categorical columns (gender, race, race_o)
o_encoder = OneHotEncoder()
o_trans = o_encoder.fit_transform(o_data)
o_att = list()

# collect the category names, stripping the stray quotes in the race labels
for i in o_encoder.categories_:
    for n in i:
        o_att.append(str(n).replace("'", ''))

# the race_o categories repeat the race names: append an '_o'-suffixed copy of
# each and remove the duplicate, so every column name ends up unique
for i in o_att[7:]:
    o_att.append(i+'_o')
    o_att.remove(i)
o_att
['female',
 'male',
 'Asian/Pacific Islander/Asian-American',
 'Black/African American',
 'Latino/Hispanic American',
 'European/Caucasian-American',
 'Other',
 'Asian/Pacific Islander/Asian-American_o',
 'Black/African American_o',
 'Latino/Hispanic American_o',
 'European/Caucasian-American_o',
 'Other_o']
o_data = pd.DataFrame(o_trans.toarray(), columns=o_att)
trans_data = pd.concat([o_data, num_data], axis=1)
trans_data
female male Asian/Pacific Islander/Asian-American Black/African American Latino/Hispanic American European/Caucasian-American Other Asian/Pacific Islander/Asian-American_o Black/African American_o Latino/Hispanic American_o ... interests_correlate expected_happy_with_sd_people expected_num_interested_in_me expected_num_matches like guess_prob_liked met decision decision_o match
0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.14 3.0 2.0 4.0 7.0 6.0 0.0 1 0 0
1 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.61 3.0 2.0 4.0 7.0 6.0 0.0 1 1 1
2 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.21 3.0 2.0 4.0 6.0 6.0 0.0 1 1 1
3 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.25 3.0 2.0 4.0 6.0 5.0 0.0 0 1 0
4 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.34 3.0 2.0 4.0 6.0 5.0 0.0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1042 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.35 5.0 0.0 0.0 5.0 1.0 0.0 0 0 0
1043 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.45 5.0 0.0 0.0 5.0 1.0 0.0 0 0 0
1044 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.13 5.0 0.0 0.0 5.0 1.0 0.0 0 0 0
1045 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.54 5.0 0.0 0.0 5.0 1.0 0.0 0 0 0
1046 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.54 5.0 0.0 0.0 5.0 1.0 0.0 0 0 0

1047 rows × 74 columns
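
As a side note, the manual renaming above can be avoided: a minimal alternative sketch (assuming scikit-learn ≥ 1.0, where OneHotEncoder provides get_feature_names_out) that generates prefixed, unambiguous column names such as gender_female and race_o_Other directly:

# rebuild the categorical frame from the cleaned data and let the encoder name the columns
cat_cols = data.select_dtypes(include='object')
alt_encoder = OneHotEncoder()
alt_arr = alt_encoder.fit_transform(cat_cols).toarray()
alt_frame = pd.DataFrame(alt_arr, columns=alt_encoder.get_feature_names_out(cat_cols.columns))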

1. Classification: matched or unmatched

1.1. Bagging with RandomForestClassifier: an ensemble of 100 independently grown decision trees (the default n_estimators)

# features: all but the last 10 columns (these include like, met, decision, decision_o and match); target: 'match', the last column
X_train, X_test, y_train, y_test = train_test_split(trans_data.iloc[:, :-10], trans_data.iloc[:, -1], random_state=42)
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
from sklearn.metrics import accuracy_score

print('Baseline Accuracy score: {:.5f}'.format(accuracy_score(y_test, y_pred)))
Baseline Accuracy score: 0.82061
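
Accuracy alone does not show how each class (matched vs. unmatched) is handled; a small sketch of a per-class report, reusing the y_test and y_pred from above:

from sklearn.metrics import classification_report, confusion_matrix

# raw confusion matrix plus per-class precision/recall for the baseline forest
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['unmatched', 'matched']))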

Feature importance: being attractive and funny, and sharing the same interests, matter most for getting a match

features = pd.DataFrame(data=rf.feature_importances_, index=trans_data.columns[:-10], columns=['importance'])
features.sort_values('importance', ascending=False)[:10]
importance
attractive_o 0.060663
attractive_partner 0.057318
shared_interests_o 0.056460
funny_o 0.046871
shared_interests_partner 0.044348
funny_partner 0.042942
age_o 0.029814
pref_o_attractive 0.024023
pref_o_funny 0.023656
ambitous_o 0.022849
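
Impurity-based importances can favour features with many distinct values, so as a hedged cross-check the ranking can be recomputed with permutation importance on the held-out split (sklearn.inspection):

from sklearn.inspection import permutation_importance

# shuffle each feature on the test split and measure the drop in accuracy
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)
pd.Series(perm.importances_mean, index=X_test.columns).sort_values(ascending=False)[:10]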

1.2. Stacking with 2 layers of models

X, y = trans_data.iloc[:, :-10], trans_data.iloc[:, -1]
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import StackingClassifier

# level-0 base learners; their out-of-fold predictions become the input
# features of the level-1 meta-model
level_0 = [('lr', LogisticRegression()),
           ('knn', KNeighborsClassifier()),
           ('rf', RandomForestClassifier()),
           ('svm', SVC())]
level_1 = RandomForestClassifier()
stacking = StackingClassifier(estimators=level_0, final_estimator=level_1, cv=5)

# compare each base learner on its own against the stacked ensemble
models = {'lr': LogisticRegression(),
         'knn': KNeighborsClassifier(),
         'rf': RandomForestClassifier(),
         'svm': SVC(),
         'stacking': stacking}

def evaluate_model(model, X, y):
    # repeated 5-fold stratified CV for a more stable accuracy estimate
    cv = RepeatedStratifiedKFold(n_splits=5, random_state=42)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores

results = list()
names = list()

for name, model in models.items():
    scores = evaluate_model(model, X_train, y_train)
    results.append(scores)
    names.append(name)
    print((name, round(np.mean(scores), 5), round(np.std(scores), 5)))
('lr', 0.84408, 0.02193)
('knn', 0.81452, 0.01951)
('rf', 0.85618, 0.01361)
('svm', 0.83057, 0.00312)
('stacking', 0.84611, 0.02071)

Result: the stacked ensemble (accuracy 0.84611) does not improve on the plain RandomForestClassifier (accuracy 0.85618)
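
One hedged variation, not run here: StackingClassifier accepts passthrough=True, which hands the level-1 model the original features in addition to the level-0 predictions, and can be evaluated with the same helper:

# same stack, but the meta-model also sees the raw features
stacking_pt = StackingClassifier(estimators=level_0, final_estimator=RandomForestClassifier(),
                                 cv=5, passthrough=True)
scores_pt = evaluate_model(stacking_pt, X_train, y_train)
print(round(np.mean(scores_pt), 5), round(np.std(scores_pt), 5))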

2. Regression: how much a participant likes his/her date

2.1. Gradient Boosting with GradientBoostingRegressor: 100 decision trees fitted sequentially, each correcting the residual errors of the ensemble so far

# features: all but the last 3 columns (decision, decision_o, match); target: the 'like' rating (column -6)
X_train, X_test, y_train, y_test = train_test_split(trans_data.iloc[:, :-3], trans_data.iloc[:, -6], random_state=42)
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)

y_pred = gbr.predict(X_test)
y_pred.astype('int')
array([6, 6, 6, 7, 4, 6, 8, 6, 6, 7, 6, 6, 6, 6, 8, 4, 2, 5, 6, 7, 4, 8,
       6, 6, 7, 5, 6, 6, 7, 6, 6, 8, 6, 8, 6, 7, 6, 2, 6, 6, 6, 6, 6, 6,
       6, 6, 2, 7, 5, 8, 6, 3, 7, 6, 6, 5, 7, 6, 6, 6, 9, 9, 6, 6, 6, 5,
       4, 4, 6, 7, 7, 2, 9, 7, 6, 7, 4, 5, 6, 6, 3, 8, 6, 5, 8, 3, 4, 6,
       6, 6, 6, 6, 7, 6, 1, 7, 6, 6, 0, 5, 6, 8, 6, 7, 6, 5, 6, 6, 7, 2,
       5, 8, 6, 6, 8, 6, 4, 6, 6, 6, 7, 6, 6, 7, 1, 6, 6, 7, 6, 6, 3, 6,
       6, 6, 6, 2, 6, 6, 6, 7, 6, 7, 7, 7, 6, 6, 6, 5, 6, 9, 6, 6, 6, 6,
       6, 9, 6, 6, 7, 3, 5, 6, 6, 8, 5, 8, 7, 4, 6, 5, 6, 6, 9, 6, 2, 5,
       3, 2, 8, 6, 6, 6, 6, 6, 6, 2, 4, 6, 6, 2, 5, 7, 7, 6, 5, 2, 5, 7,
       6, 6, 6, 5, 7, 6, 6, 6, 6, 6, 6, 6, 5, 6, 9, 6, 6, 6, 6, 6, 7, 5,
       6, 7, 5, 7, 6, 1, 8, 4, 6, 6, 2, 5, 7, 5, 4, 6, 6, 6, 7, 5, 6, 9,
       6, 5, 6, 6, 6, 6, 6, 6, 9, 6, 3, 7, 6, 5, 6, 2, 6, 6, 6, 6])
from sklearn.metrics import mean_squared_error

print('Baseline RMSE: {:.5f}'.format(
                 mean_squared_error(y_test, y_pred.astype('int'), squared=False)))
Baseline RMSE: 0.70373
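
Because boosting adds trees one after another, the fitted gbr can show how the held-out error falls stage by stage; a small sketch using its staged_predict:

# RMSE of the growing ensemble after each of the 100 boosting stages
staged_rmse = [mean_squared_error(y_test, pred, squared=False)
               for pred in gbr.staged_predict(X_test)]
print('after 1 tree:   {:.5f}'.format(staged_rmse[0]))
print('after {} trees: {:.5f}'.format(len(staged_rmse), staged_rmse[-1]))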

2.2. Extreme Gradient Boosting with XGBoost and hyperparameter tuning with grid search

import xgboost as xgb

# wrap the splits in XGBoost's DMatrix format
train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)

# starting point: the library defaults for a squared-error regression
params = {
    'max_depth': 6,
    'min_child_weight': 1,
    'eta': 0.3,
    'subsample': 1,
    'colsample_bytree': 1,
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse'
}
cv_results = xgb.cv(
    params,
    train,
    num_boost_round=100,
    seed=42,
    nfold=5,
    metrics={'rmse'},
    early_stopping_rounds=10
)

cv_results
train-rmse-mean train-rmse-std test-rmse-mean test-rmse-std
0 4.211302 0.009553 4.211142 0.029887
1 2.963511 0.006647 2.962846 0.023045
2 2.086589 0.004590 2.086106 0.018931
3 1.469331 0.003155 1.469113 0.011535
4 1.035416 0.002208 1.035566 0.010592
5 0.729338 0.001536 0.729979 0.007580
6 0.514014 0.001038 0.515827 0.006285
7 0.362344 0.000730 0.364688 0.005393
8 0.255511 0.000529 0.259339 0.006231
9 0.180219 0.000364 0.185675 0.007906
10 0.127121 0.000253 0.135551 0.010946
11 0.089681 0.000176 0.101505 0.014863
12 0.063277 0.000124 0.078882 0.019360
13 0.044657 0.000088 0.064012 0.023895
14 0.031525 0.000064 0.054265 0.027993
15 0.022262 0.000047 0.047811 0.031418
16 0.015726 0.000035 0.043479 0.034129
17 0.011115 0.000027 0.040548 0.036189
18 0.007860 0.000023 0.038538 0.037721
19 0.005563 0.000020 0.037150 0.038841
20 0.003942 0.000020 0.036186 0.039650
21 0.002798 0.000020 0.035513 0.040233
22 0.001991 0.000020 0.035042 0.040650
23 0.001421 0.000022 0.034714 0.040945
24 0.001018 0.000022 0.034482 0.041156
25 0.000735 0.000022 0.034319 0.041305
26 0.000534 0.000023 0.034204 0.041410
27 0.000394 0.000023 0.034122 0.041485
28 0.000294 0.000023 0.034066 0.041536
29 0.000225 0.000022 0.034028 0.041573
30 0.000175 0.000021 0.034000 0.041598
31 0.000141 0.000018 0.033983 0.041614
32 0.000117 0.000016 0.033973 0.041624
33 0.000102 0.000013 0.033968 0.041629
34 0.000092 0.000010 0.033960 0.041636
35 0.000084 0.000008 0.033958 0.041637
36 0.000078 0.000006 0.033958 0.041640
37 0.000077 0.000006 0.033958 0.041640
38 0.000077 0.000006 0.033958 0.041641
39 0.000077 0.000006 0.033958 0.041641
40 0.000077 0.000006 0.033958 0.041641
41 0.000077 0.000006 0.033958 0.041641
42 0.000077 0.000006 0.033958 0.041641
# grid of (max_depth, min_child_weight) pairs to try
gridsearch_params = [
    (max_depth, min_child_weight)
    for max_depth in range(10,15)
    for min_child_weight in range(5,10)
]

min_rmse = float("Inf")
best_params = None

for max_depth, min_child_weight in gridsearch_params:
    print("CV with max_depth={}, min_child_weight={}".format(
                             max_depth,
                             min_child_weight))
    
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight
    
    cv_results = xgb.cv(
        params,
        train,
        num_boost_round=100,
        seed=42,
        nfold=5,
        metrics={'rmse'},
        early_stopping_rounds=10
    )
    
    # update best RMSE
    mean_rmse = cv_results['test-rmse-mean'].min()
    boost_rounds = cv_results['test-rmse-mean'].argmin()
    print("\tRMSE {:.5f} for {} rounds".format(mean_rmse, boost_rounds))
    if mean_rmse < min_rmse:
        min_rmse = mean_rmse
        best_params = (max_depth, min_child_weight)

print("Best params: {}, {}, RMSE: {:.5f}".format(best_params[0], best_params[1], min_rmse))
CV with max_depth=10, min_child_weight=5
	RMSE 0.03549 for 30 rounds
CV with max_depth=10, min_child_weight=6
	RMSE 0.03525 for 22 rounds
CV with max_depth=10, min_child_weight=7
	RMSE 0.03503 for 22 rounds
CV with max_depth=10, min_child_weight=8
	RMSE 0.05993 for 20 rounds
CV with max_depth=10, min_child_weight=9
	RMSE 0.10087 for 18 rounds
CV with max_depth=11, min_child_weight=5
	RMSE 0.03524 for 30 rounds
CV with max_depth=11, min_child_weight=6
	RMSE 0.03526 for 22 rounds
CV with max_depth=11, min_child_weight=7
	RMSE 0.03502 for 22 rounds
CV with max_depth=11, min_child_weight=8
	RMSE 0.06001 for 20 rounds
CV with max_depth=11, min_child_weight=9
	RMSE 0.10091 for 18 rounds
CV with max_depth=12, min_child_weight=5
	RMSE 0.03549 for 23 rounds
CV with max_depth=12, min_child_weight=6
	RMSE 0.03526 for 22 rounds
CV with max_depth=12, min_child_weight=7
	RMSE 0.03503 for 22 rounds
CV with max_depth=12, min_child_weight=8
	RMSE 0.05980 for 22 rounds
CV with max_depth=12, min_child_weight=9
	RMSE 0.10092 for 18 rounds
CV with max_depth=13, min_child_weight=5
	RMSE 0.03549 for 23 rounds
CV with max_depth=13, min_child_weight=6
	RMSE 0.03526 for 22 rounds
CV with max_depth=13, min_child_weight=7
	RMSE 0.03503 for 22 rounds
CV with max_depth=13, min_child_weight=8
	RMSE 0.05948 for 20 rounds
CV with max_depth=13, min_child_weight=9
	RMSE 0.10092 for 18 rounds
CV with max_depth=14, min_child_weight=5
	RMSE 0.03549 for 23 rounds
CV with max_depth=14, min_child_weight=6
	RMSE 0.03526 for 22 rounds
CV with max_depth=14, min_child_weight=7
	RMSE 0.03503 for 22 rounds
CV with max_depth=14, min_child_weight=8
	RMSE 0.05944 for 25 rounds
CV with max_depth=14, min_child_weight=9
	RMSE 0.10092 for 18 rounds
Best params: 11, 7, RMSE: 0.03502
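
For reference, the same search can also be expressed through xgboost's scikit-learn wrapper and GridSearchCV; a hedged sketch (no early stopping here, so it fixes 100 rounds and the numbers will not match xgb.cv exactly):

from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': list(range(10, 15)),
              'min_child_weight': list(range(5, 10))}
search = GridSearchCV(XGBRegressor(learning_rate=0.3, n_estimators=100, objective='reg:squarederror'),
                      param_grid, scoring='neg_root_mean_squared_error', cv=5)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
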
# retrain with the tuned max_depth and min_child_weight, watching the held-out set for early stopping
params['max_depth'] = best_params[0]
params['min_child_weight'] = best_params[1]
model = xgb.train(
    params,
    train,
    num_boost_round=35,
    evals=[(test, "Test")],
    early_stopping_rounds=10
)

print("Best RMSE: {:.5f} in {} rounds".format(model.best_score, model.best_iteration+1))
[0]	Test-rmse:4.27663
Will train until Test-rmse hasn't improved in 10 rounds.
[1]	Test-rmse:3.00449
[2]	Test-rmse:2.11192
[3]	Test-rmse:1.48495
[4]	Test-rmse:1.04245
[5]	Test-rmse:0.73402
[6]	Test-rmse:0.51477
[7]	Test-rmse:0.36311
[8]	Test-rmse:0.25605
[9]	Test-rmse:0.18107
[10]	Test-rmse:0.12900
[11]	Test-rmse:0.09295
[12]	Test-rmse:0.06874
[13]	Test-rmse:0.05330
[14]	Test-rmse:0.04382
[15]	Test-rmse:0.03871
[16]	Test-rmse:0.03604
[17]	Test-rmse:0.03500
[18]	Test-rmse:0.03441
[19]	Test-rmse:0.03440
[20]	Test-rmse:0.03474
[21]	Test-rmse:0.03488
[22]	Test-rmse:0.03531
[23]	Test-rmse:0.03549
[24]	Test-rmse:0.03594
[25]	Test-rmse:0.03614
[26]	Test-rmse:0.03662
[27]	Test-rmse:0.03685
[28]	Test-rmse:0.03735
[29]	Test-rmse:0.03757
Stopping. Best iteration:
[19]	Test-rmse:0.03440

Best RMSE: 0.03440 in 20 rounds

Result: the tuned XGBoost model reaches RMSE = 0.03440, far below the GradientBoostingRegressor baseline of RMSE = 0.70373
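
To score new data with the tuned booster, predictions can be limited to the trees kept by early stopping; a sketch assuming xgboost ≥ 1.4, where Booster.predict takes iteration_range:

# predict on the held-out DMatrix using only the best trees
best_pred = model.predict(test, iteration_range=(0, model.best_iteration + 1))
print('Hold-out RMSE: {:.5f}'.format(mean_squared_error(y_test, best_pred, squared=False)))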