This data was gathered from participants in experimental speed dating events from 2002-2004. During the events, the attendees would have a four-minute “first date” with every other participant of the opposite sex. At the end of their four minutes, participants were asked if they would like to see their date again. They were also asked to rate their date on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests. The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include: demographics, dating habits, self-perception across key attributes, beliefs on what others find valuable in a mate, and lifestyle information.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
import numpy as np
data = pd.read_csv('speeddating.csv')
# keep only complete rows; any row with a missing value is dropped
data.dropna(inplace=True)
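The `field` column holds each participant's free-text field of study, so spellings and quoting vary widely. A quick check one could run before dropping it (hypothetical, not part of the original run):

# peek at the raw values to see how inconsistent they are
data['field'].value_counts().head(10)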
# drop 'field' (free-text field of study) since it contains many messy values;
# note that index=1 also drops the row labelled 1, which may be unintended
data.drop(columns='field', index=1, inplace=True)
data
 | gender | age | age_o | d_age | race | race_o | samerace | importance_same_race | importance_same_religion | pref_o_attractive | ... | interests_correlate | expected_happy_with_sd_people | expected_num_interested_in_me | expected_num_matches | like | guess_prob_liked | met | decision | decision_o | match
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | female | 21.0 | 27.0 | 6 | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 2.0 | 4.0 | 35.0 | ... | 0.14 | 3.0 | 2.0 | 4.0 | 7.0 | 6.0 | 0.0 | 1 | 0 | 0 |
3 | female | 21.0 | 23.0 | 2 | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 2.0 | 4.0 | 30.0 | ... | 0.61 | 3.0 | 2.0 | 4.0 | 7.0 | 6.0 | 0.0 | 1 | 1 | 1 |
4 | female | 21.0 | 24.0 | 3 | 'Asian/Pacific Islander/Asian-American' | 'Latino/Hispanic American' | 0 | 2.0 | 4.0 | 30.0 | ... | 0.21 | 3.0 | 2.0 | 4.0 | 6.0 | 6.0 | 0.0 | 1 | 1 | 1 |
5 | female | 21.0 | 25.0 | 4 | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 2.0 | 4.0 | 50.0 | ... | 0.25 | 3.0 | 2.0 | 4.0 | 6.0 | 5.0 | 0.0 | 0 | 1 | 0 |
6 | female | 21.0 | 30.0 | 9 | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 2.0 | 4.0 | 35.0 | ... | 0.34 | 3.0 | 2.0 | 4.0 | 6.0 | 5.0 | 0.0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1836 | male | 19.0 | 20.0 | 1 | 'Asian/Pacific Islander/Asian-American' | Other | 0 | 4.0 | 1.0 | 15.0 | ... | 0.35 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1837 | male | 19.0 | 21.0 | 2 | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 4.0 | 1.0 | 15.0 | ... | 0.45 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1838 | male | 19.0 | 20.0 | 1 | 'Asian/Pacific Islander/Asian-American' | 'Black/African American' | 0 | 4.0 | 1.0 | 20.0 | ... | 0.13 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1840 | male | 19.0 | 21.0 | 2 | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 4.0 | 1.0 | 15.0 | ... | 0.54 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1843 | male | 19.0 | 20.0 | 1 | 'Asian/Pacific Islander/Asian-American' | 'Latino/Hispanic American' | 0 | 4.0 | 1.0 | 10.0 | ... | 0.54 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1047 rows × 65 columns
# separate numeric columns from object-dtype (categorical) columns
num_data = data.select_dtypes(exclude='object').reset_index(drop=True)
o_data = data.select_dtypes(include='object')
num_data
 | age | age_o | d_age | samerace | importance_same_race | importance_same_religion | pref_o_attractive | pref_o_sincere | pref_o_intelligence | pref_o_funny | ... | interests_correlate | expected_happy_with_sd_people | expected_num_interested_in_me | expected_num_matches | like | guess_prob_liked | met | decision | decision_o | match
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 21.0 | 27.0 | 6 | 0 | 2.0 | 4.0 | 35.0 | 20.0 | 20.0 | 20.0 | ... | 0.14 | 3.0 | 2.0 | 4.0 | 7.0 | 6.0 | 0.0 | 1 | 0 | 0 |
1 | 21.0 | 23.0 | 2 | 0 | 2.0 | 4.0 | 30.0 | 5.0 | 15.0 | 40.0 | ... | 0.61 | 3.0 | 2.0 | 4.0 | 7.0 | 6.0 | 0.0 | 1 | 1 | 1 |
2 | 21.0 | 24.0 | 3 | 0 | 2.0 | 4.0 | 30.0 | 10.0 | 20.0 | 10.0 | ... | 0.21 | 3.0 | 2.0 | 4.0 | 6.0 | 6.0 | 0.0 | 1 | 1 | 1 |
3 | 21.0 | 25.0 | 4 | 0 | 2.0 | 4.0 | 50.0 | 0.0 | 30.0 | 10.0 | ... | 0.25 | 3.0 | 2.0 | 4.0 | 6.0 | 5.0 | 0.0 | 0 | 1 | 0 |
4 | 21.0 | 30.0 | 9 | 0 | 2.0 | 4.0 | 35.0 | 15.0 | 25.0 | 10.0 | ... | 0.34 | 3.0 | 2.0 | 4.0 | 6.0 | 5.0 | 0.0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1042 | 19.0 | 20.0 | 1 | 0 | 4.0 | 1.0 | 15.0 | 15.0 | 20.0 | 25.0 | ... | 0.35 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1043 | 19.0 | 21.0 | 2 | 0 | 4.0 | 1.0 | 15.0 | 15.0 | 25.0 | 25.0 | ... | 0.45 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1044 | 19.0 | 20.0 | 1 | 0 | 4.0 | 1.0 | 20.0 | 20.0 | 20.0 | 20.0 | ... | 0.13 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1045 | 19.0 | 21.0 | 2 | 0 | 4.0 | 1.0 | 15.0 | 15.0 | 25.0 | 25.0 | ... | 0.54 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1046 | 19.0 | 20.0 | 1 | 0 | 4.0 | 1.0 | 10.0 | 10.0 | 35.0 | 35.0 | ... | 0.54 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1047 rows × 62 columns
o_data
 | gender | race | race_o
---|---|---|---
0 | female | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American |
3 | female | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American |
4 | female | 'Asian/Pacific Islander/Asian-American' | 'Latino/Hispanic American' |
5 | female | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American |
6 | female | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American |
... | ... | ... | ... |
1836 | male | 'Asian/Pacific Islander/Asian-American' | Other |
1837 | male | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American |
1838 | male | 'Asian/Pacific Islander/Asian-American' | 'Black/African American' |
1840 | male | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American |
1843 | male | 'Asian/Pacific Islander/Asian-American' | 'Latino/Hispanic American' |
1047 rows × 3 columns
o_encoder = OneHotEncoder()
o_trans = o_encoder.fit_transform(o_data)
# build readable column names: strip the stray quotes from the category
# labels, then suffix the partner's race categories (the last five) with '_o'
# (the first seven names cover gender and the subject's own race)
o_att = [str(cat).replace("'", '') for cats in o_encoder.categories_ for cat in cats]
o_att = o_att[:7] + [name + '_o' for name in o_att[7:]]
o_att
['female',
'male',
'Asian/Pacific Islander/Asian-American',
'Black/African American',
'Latino/Hispanic American',
'European/Caucasian-American',
'Other',
'Asian/Pacific Islander/Asian-American_o',
'Black/African American_o',
'Latino/Hispanic American_o',
'European/Caucasian-American_o',
'Other_o']
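On scikit-learn 1.0 and later, the encoder can generate these names itself; a minimal alternative sketch, assuming the same quote-stripping clean-up (the generated names come out prefixed, e.g. `gender_female`):

# alternative: let the encoder name the columns (scikit-learn >= 1.0)
auto_names = [n.replace("'", '') for n in o_encoder.get_feature_names_out(o_data.columns)]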
o_data = pd.DataFrame(o_trans.toarray(), columns=o_att)
# recombine the encoded categoricals with the numeric columns
trans_data = pd.concat([o_data, num_data], axis=1)
trans_data
 | female | male | Asian/Pacific Islander/Asian-American | Black/African American | Latino/Hispanic American | European/Caucasian-American | Other | Asian/Pacific Islander/Asian-American_o | Black/African American_o | Latino/Hispanic American_o | ... | interests_correlate | expected_happy_with_sd_people | expected_num_interested_in_me | expected_num_matches | like | guess_prob_liked | met | decision | decision_o | match
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.14 | 3.0 | 2.0 | 4.0 | 7.0 | 6.0 | 0.0 | 1 | 0 | 0 |
1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.61 | 3.0 | 2.0 | 4.0 | 7.0 | 6.0 | 0.0 | 1 | 1 | 1 |
2 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.21 | 3.0 | 2.0 | 4.0 | 6.0 | 6.0 | 0.0 | 1 | 1 | 1 |
3 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.25 | 3.0 | 2.0 | 4.0 | 6.0 | 5.0 | 0.0 | 0 | 1 | 0 |
4 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.34 | 3.0 | 2.0 | 4.0 | 6.0 | 5.0 | 0.0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1042 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.35 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1043 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.45 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1044 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.13 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1045 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.54 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1046 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.54 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1047 rows × 74 columns
# target: 'match' (the last column); features drop the final ten columns,
# which include the post-date ratings and both participants' decisions
X_train, X_test, y_train, y_test = train_test_split(trans_data.iloc[:, :-10], trans_data.iloc[:, -1], random_state=42)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
from sklearn.metrics import accuracy_score
print('Baseline Accuracy score: {:.5f}'.format(accuracy_score(y_test, y_pred)))
Baseline Accuracy score: 0.82061
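For context, it is worth comparing against the trivial baseline of always predicting the majority class; a quick hypothetical check on the same test split:

# accuracy of always predicting the more common class in y_test
majority_acc = max(y_test.mean(), 1 - y_test.mean())
print('Majority-class accuracy: {:.5f}'.format(majority_acc))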
features = pd.DataFrame(data=rf.feature_importances_, index=trans_data.columns[:-10], columns=['importance'])
features.sort_values('importance', ascending=False)[:10]
 | importance
---|---
attractive_o | 0.060663 |
attractive_partner | 0.057318 |
shared_interests_o | 0.056460 |
funny_o | 0.046871 |
shared_interests_partner | 0.044348 |
funny_partner | 0.042942 |
age_o | 0.029814 |
pref_o_attractive | 0.024023 |
pref_o_funny | 0.023656 |
ambitous_o | 0.022849 |
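Impurity-based importances can overstate features with many distinct values. A hedged cross-check is permutation importance on the held-out set (requires scikit-learn 0.22+; not part of the original run):

from sklearn.inspection import permutation_importance
# shuffle one feature at a time and measure the drop in test accuracy
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
pd.Series(perm.importances_mean, index=X_test.columns).sort_values(ascending=False).head(10)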
# full feature matrix and target (the cross-validation below is run on the training split)
X, y = trans_data.iloc[:, :-10], trans_data.iloc[:, -1]
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import StackingClassifier
# level-0 base learners whose out-of-fold predictions feed the meta-learner
level_0 = [('lr', LogisticRegression()),
           ('knn', KNeighborsClassifier()),
           ('rf', RandomForestClassifier()),
           ('svm', SVC())]
# level-1 meta-learner
level_1 = RandomForestClassifier()
stacking = StackingClassifier(estimators=level_0, final_estimator=level_1, cv=5)
models = {'lr': LogisticRegression(),
          'knn': KNeighborsClassifier(),
          'rf': RandomForestClassifier(),
          'svm': SVC(),
          'stacking': stacking}
def evaluate_model(model, X, y):
    # 5-fold stratified CV, repeated 10 times (RepeatedStratifiedKFold's default)
    cv = RepeatedStratifiedKFold(n_splits=5, random_state=42)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores
results = list()
names = list()
for name, model in models.items():
    scores = evaluate_model(model, X_train, y_train)
    results.append(scores)
    names.append(name)
    print((name, round(np.mean(scores), 5), round(np.std(scores), 5)))
('lr', 0.84408, 0.02193)
('knn', 0.81452, 0.01951)
('rf', 0.85618, 0.01361)
('svm', 0.83057, 0.00312)
('stacking', 0.84611, 0.02071)
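A box plot makes the spread of the repeated-CV scores easier to compare than the mean/std tuples alone; a small sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt
# one box per model over the 50 repeated-CV accuracy scores
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('accuracy')
plt.show()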
# regression target: the 'like' rating (column -6)
# caution: the feature slice [:, :-3] still includes 'like' itself, so the
# scores below benefit from target leakage (see the note after the baseline RMSE)
X_train, X_test, y_train, y_test = train_test_split(trans_data.iloc[:, :-3], trans_data.iloc[:, -6], random_state=42)
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)
# cast predictions to integer ratings (astype('int') truncates toward zero)
y_pred.astype('int')
array([6, 6, 6, 7, 4, 6, 8, 6, 6, 7, 6, 6, 6, 6, 8, 4, 2, 5, 6, 7, 4, 8,
6, 6, 7, 5, 6, 6, 7, 6, 6, 8, 6, 8, 6, 7, 6, 2, 6, 6, 6, 6, 6, 6,
6, 6, 2, 7, 5, 8, 6, 3, 7, 6, 6, 5, 7, 6, 6, 6, 9, 9, 6, 6, 6, 5,
4, 4, 6, 7, 7, 2, 9, 7, 6, 7, 4, 5, 6, 6, 3, 8, 6, 5, 8, 3, 4, 6,
6, 6, 6, 6, 7, 6, 1, 7, 6, 6, 0, 5, 6, 8, 6, 7, 6, 5, 6, 6, 7, 2,
5, 8, 6, 6, 8, 6, 4, 6, 6, 6, 7, 6, 6, 7, 1, 6, 6, 7, 6, 6, 3, 6,
6, 6, 6, 2, 6, 6, 6, 7, 6, 7, 7, 7, 6, 6, 6, 5, 6, 9, 6, 6, 6, 6,
6, 9, 6, 6, 7, 3, 5, 6, 6, 8, 5, 8, 7, 4, 6, 5, 6, 6, 9, 6, 2, 5,
3, 2, 8, 6, 6, 6, 6, 6, 6, 2, 4, 6, 6, 2, 5, 7, 7, 6, 5, 2, 5, 7,
6, 6, 6, 5, 7, 6, 6, 6, 6, 6, 6, 6, 5, 6, 9, 6, 6, 6, 6, 6, 7, 5,
6, 7, 5, 7, 6, 1, 8, 4, 6, 6, 2, 5, 7, 5, 4, 6, 6, 6, 7, 5, 6, 9,
6, 5, 6, 6, 6, 6, 6, 6, 9, 6, 3, 7, 6, 5, 6, 2, 6, 6, 6, 6])
from sklearn.metrics import mean_squared_error
print('Baseline RMSE: {:.5f}'.format(
    mean_squared_error(y_test, y_pred.astype('int'), squared=False)))
Baseline RMSE: 0.70373
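This RMSE should be read with caution: the feature slice `iloc[:, :-3]` above still contains the target column `like` (position -6), so the regressors can largely read the answer off their inputs. A leak-free setup would drop the outcome columns explicitly; a sketch of what that could look like (variable names hypothetical, not the run shown here):

# drop 'like' (the target) together with the other outcome columns
outcome_cols = ['like', 'guess_prob_liked', 'met', 'decision', 'decision_o', 'match']
X_nl, y_nl = trans_data.drop(columns=outcome_cols), trans_data['like']
X_train_nl, X_test_nl, y_train_nl, y_test_nl = train_test_split(X_nl, y_nl, random_state=42)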
import xgboost as xgb
# wrap the train/test splits in XGBoost's DMatrix format
train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)
params = {
    'max_depth': 6,
    'min_child_weight': 1,
    'eta': 0.3,
    'subsample': 1,
    'colsample_bytree': 1,
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse'
}
# 5-fold CV with early stopping to gauge a good number of boosting rounds
cv_results = xgb.cv(
    params,
    train,
    num_boost_round=100,
    seed=42,
    nfold=5,
    metrics={'rmse'},
    early_stopping_rounds=10
)
cv_results
 | train-rmse-mean | train-rmse-std | test-rmse-mean | test-rmse-std
---|---|---|---|---
0 | 4.211302 | 0.009553 | 4.211142 | 0.029887 |
1 | 2.963511 | 0.006647 | 2.962846 | 0.023045 |
2 | 2.086589 | 0.004590 | 2.086106 | 0.018931 |
3 | 1.469331 | 0.003155 | 1.469113 | 0.011535 |
4 | 1.035416 | 0.002208 | 1.035566 | 0.010592 |
5 | 0.729338 | 0.001536 | 0.729979 | 0.007580 |
6 | 0.514014 | 0.001038 | 0.515827 | 0.006285 |
7 | 0.362344 | 0.000730 | 0.364688 | 0.005393 |
8 | 0.255511 | 0.000529 | 0.259339 | 0.006231 |
9 | 0.180219 | 0.000364 | 0.185675 | 0.007906 |
10 | 0.127121 | 0.000253 | 0.135551 | 0.010946 |
11 | 0.089681 | 0.000176 | 0.101505 | 0.014863 |
12 | 0.063277 | 0.000124 | 0.078882 | 0.019360 |
13 | 0.044657 | 0.000088 | 0.064012 | 0.023895 |
14 | 0.031525 | 0.000064 | 0.054265 | 0.027993 |
15 | 0.022262 | 0.000047 | 0.047811 | 0.031418 |
16 | 0.015726 | 0.000035 | 0.043479 | 0.034129 |
17 | 0.011115 | 0.000027 | 0.040548 | 0.036189 |
18 | 0.007860 | 0.000023 | 0.038538 | 0.037721 |
19 | 0.005563 | 0.000020 | 0.037150 | 0.038841 |
20 | 0.003942 | 0.000020 | 0.036186 | 0.039650 |
21 | 0.002798 | 0.000020 | 0.035513 | 0.040233 |
22 | 0.001991 | 0.000020 | 0.035042 | 0.040650 |
23 | 0.001421 | 0.000022 | 0.034714 | 0.040945 |
24 | 0.001018 | 0.000022 | 0.034482 | 0.041156 |
25 | 0.000735 | 0.000022 | 0.034319 | 0.041305 |
26 | 0.000534 | 0.000023 | 0.034204 | 0.041410 |
27 | 0.000394 | 0.000023 | 0.034122 | 0.041485 |
28 | 0.000294 | 0.000023 | 0.034066 | 0.041536 |
29 | 0.000225 | 0.000022 | 0.034028 | 0.041573 |
30 | 0.000175 | 0.000021 | 0.034000 | 0.041598 |
31 | 0.000141 | 0.000018 | 0.033983 | 0.041614 |
32 | 0.000117 | 0.000016 | 0.033973 | 0.041624 |
33 | 0.000102 | 0.000013 | 0.033968 | 0.041629 |
34 | 0.000092 | 0.000010 | 0.033960 | 0.041636 |
35 | 0.000084 | 0.000008 | 0.033958 | 0.041637 |
36 | 0.000078 | 0.000006 | 0.033958 | 0.041640 |
37 | 0.000077 | 0.000006 | 0.033958 | 0.041640 |
38 | 0.000077 | 0.000006 | 0.033958 | 0.041641 |
39 | 0.000077 | 0.000006 | 0.033958 | 0.041641 |
40 | 0.000077 | 0.000006 | 0.033958 | 0.041641 |
41 | 0.000077 | 0.000006 | 0.033958 | 0.041641 |
42 | 0.000077 | 0.000006 | 0.033958 | 0.041641 |
# coarse grid over the tree-complexity parameters
gridsearch_params = [
    (max_depth, min_child_weight)
    for max_depth in range(10, 15)
    for min_child_weight in range(5, 10)
]
min_rmse = float("Inf")
best_params = None
for max_depth, min_child_weight in gridsearch_params:
    print("CV with max_depth={}, min_child_weight={}".format(
        max_depth,
        min_child_weight))
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight
    cv_results = xgb.cv(
        params,
        train,
        num_boost_round=100,
        seed=42,
        nfold=5,
        metrics={'rmse'},
        early_stopping_rounds=10
    )
    # best test RMSE for this pair, and the round at which it occurred
    mean_rmse = cv_results['test-rmse-mean'].min()
    boost_rounds = cv_results['test-rmse-mean'].argmin()
    print("\tRMSE {:.5f} for {} rounds".format(mean_rmse, boost_rounds))
    if mean_rmse < min_rmse:
        min_rmse = mean_rmse
        best_params = (max_depth, min_child_weight)
print("Best params: {}, {}, RMSE: {:.5f}".format(best_params[0], best_params[1], min_rmse))
CV with max_depth=10, min_child_weight=5
RMSE 0.03549 for 30 rounds
CV with max_depth=10, min_child_weight=6
RMSE 0.03525 for 22 rounds
CV with max_depth=10, min_child_weight=7
RMSE 0.03503 for 22 rounds
CV with max_depth=10, min_child_weight=8
RMSE 0.05993 for 20 rounds
CV with max_depth=10, min_child_weight=9
RMSE 0.10087 for 18 rounds
CV with max_depth=11, min_child_weight=5
RMSE 0.03524 for 30 rounds
CV with max_depth=11, min_child_weight=6
RMSE 0.03526 for 22 rounds
CV with max_depth=11, min_child_weight=7
RMSE 0.03502 for 22 rounds
CV with max_depth=11, min_child_weight=8
RMSE 0.06001 for 20 rounds
CV with max_depth=11, min_child_weight=9
RMSE 0.10091 for 18 rounds
CV with max_depth=12, min_child_weight=5
RMSE 0.03549 for 23 rounds
CV with max_depth=12, min_child_weight=6
RMSE 0.03526 for 22 rounds
CV with max_depth=12, min_child_weight=7
RMSE 0.03503 for 22 rounds
CV with max_depth=12, min_child_weight=8
RMSE 0.05980 for 22 rounds
CV with max_depth=12, min_child_weight=9
RMSE 0.10092 for 18 rounds
CV with max_depth=13, min_child_weight=5
RMSE 0.03549 for 23 rounds
CV with max_depth=13, min_child_weight=6
RMSE 0.03526 for 22 rounds
CV with max_depth=13, min_child_weight=7
RMSE 0.03503 for 22 rounds
CV with max_depth=13, min_child_weight=8
RMSE 0.05948 for 20 rounds
CV with max_depth=13, min_child_weight=9
RMSE 0.10092 for 18 rounds
CV with max_depth=14, min_child_weight=5
RMSE 0.03549 for 23 rounds
CV with max_depth=14, min_child_weight=6
RMSE 0.03526 for 22 rounds
CV with max_depth=14, min_child_weight=7
RMSE 0.03503 for 22 rounds
CV with max_depth=14, min_child_weight=8
RMSE 0.05944 for 25 rounds
CV with max_depth=14, min_child_weight=9
RMSE 0.10092 for 18 rounds
Best params: 11, 7, RMSE: 0.03502
# retrain with the best depth/child-weight pair, watching the held-out set
params['max_depth'] = best_params[0]
params['min_child_weight'] = best_params[1]
model = xgb.train(
    params,
    train,
    num_boost_round=35,
    evals=[(test, "Test")],
    early_stopping_rounds=10
)
print("Best RMSE: {:.5f} in {} rounds".format(model.best_score, model.best_iteration+1))
[0] Test-rmse:4.27663
Will train until Test-rmse hasn't improved in 10 rounds.
[1] Test-rmse:3.00449
[2] Test-rmse:2.11192
[3] Test-rmse:1.48495
[4] Test-rmse:1.04245
[5] Test-rmse:0.73402
[6] Test-rmse:0.51477
[7] Test-rmse:0.36311
[8] Test-rmse:0.25605
[9] Test-rmse:0.18107
[10] Test-rmse:0.12900
[11] Test-rmse:0.09295
[12] Test-rmse:0.06874
[13] Test-rmse:0.05330
[14] Test-rmse:0.04382
[15] Test-rmse:0.03871
[16] Test-rmse:0.03604
[17] Test-rmse:0.03500
[18] Test-rmse:0.03441
[19] Test-rmse:0.03440
[20] Test-rmse:0.03474
[21] Test-rmse:0.03488
[22] Test-rmse:0.03531
[23] Test-rmse:0.03549
[24] Test-rmse:0.03594
[25] Test-rmse:0.03614
[26] Test-rmse:0.03662
[27] Test-rmse:0.03685
[28] Test-rmse:0.03735
[29] Test-rmse:0.03757
Stopping. Best iteration:
[19] Test-rmse:0.03440
Best RMSE: 0.03440 in 20 rounds
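As a final hypothetical check (not in the original run), the early-stopped booster can score the held-out set using only its best trees; `iteration_range` requires xgboost 1.4+:

# predict with trees up to the best iteration and report the held-out RMSE
final_pred = model.predict(test, iteration_range=(0, model.best_iteration + 1))
print('Held-out RMSE: {:.5f}'.format(mean_squared_error(y_test, final_pred, squared=False)))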