This data was gathered from participants in experimental speed dating events from 2002-2004. During the events, the attendees would have a four-minute “first date” with every other participant of the opposite sex. At the end of their four minutes, participants were asked if they would like to see their date again. They were also asked to rate their date on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests. The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include: demographics, dating habits, self-perception across key attributes, beliefs on what others find valuable in a mate, and lifestyle information.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
import numpy as np
data = pd.read_csv('speeddating.csv')
# keep only complete rows; any row with a missing value is dropped
data.dropna(inplace=True)
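The `field` column holds each participant's free-text field of study, so spellings and quoting vary widely. A quick check one could run before dropping it (hypothetical, not part of the original run):

# peek at the raw values to see how inconsistent they are
data['field'].value_counts().head(10)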
# drop 'field' (free-text field of study) since it contains many messy values;
# note that index=1 also drops the row labelled 1, which may be unintended
data.drop(columns='field', index=1, inplace=True)
data
 | gender | age | age_o | d_age | race | race_o | samerace | importance_same_race | importance_same_religion | pref_o_attractive | ... | interests_correlate | expected_happy_with_sd_people | expected_num_interested_in_me | expected_num_matches | like | guess_prob_liked | met | decision | decision_o | match
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | female | 21.0 | 27.0 | 6 | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 2.0 | 4.0 | 35.0 | ... | 0.14 | 3.0 | 2.0 | 4.0 | 7.0 | 6.0 | 0.0 | 1 | 0 | 0 |
3 | female | 21.0 | 23.0 | 2 | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 2.0 | 4.0 | 30.0 | ... | 0.61 | 3.0 | 2.0 | 4.0 | 7.0 | 6.0 | 0.0 | 1 | 1 | 1 |
4 | female | 21.0 | 24.0 | 3 | 'Asian/Pacific Islander/Asian-American' | 'Latino/Hispanic American' | 0 | 2.0 | 4.0 | 30.0 | ... | 0.21 | 3.0 | 2.0 | 4.0 | 6.0 | 6.0 | 0.0 | 1 | 1 | 1 |
5 | female | 21.0 | 25.0 | 4 | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 2.0 | 4.0 | 50.0 | ... | 0.25 | 3.0 | 2.0 | 4.0 | 6.0 | 5.0 | 0.0 | 0 | 1 | 0 |
6 | female | 21.0 | 30.0 | 9 | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 2.0 | 4.0 | 35.0 | ... | 0.34 | 3.0 | 2.0 | 4.0 | 6.0 | 5.0 | 0.0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1836 | male | 19.0 | 20.0 | 1 | 'Asian/Pacific Islander/Asian-American' | Other | 0 | 4.0 | 1.0 | 15.0 | ... | 0.35 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1837 | male | 19.0 | 21.0 | 2 | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 4.0 | 1.0 | 15.0 | ... | 0.45 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1838 | male | 19.0 | 20.0 | 1 | 'Asian/Pacific Islander/Asian-American' | 'Black/African American' | 0 | 4.0 | 1.0 | 20.0 | ... | 0.13 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1840 | male | 19.0 | 21.0 | 2 | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American | 0 | 4.0 | 1.0 | 15.0 | ... | 0.54 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1843 | male | 19.0 | 20.0 | 1 | 'Asian/Pacific Islander/Asian-American' | 'Latino/Hispanic American' | 0 | 4.0 | 1.0 | 10.0 | ... | 0.54 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1047 rows × 65 columns
# separate numeric columns from object-dtype (categorical) columns
num_data = data.select_dtypes(exclude='object').reset_index(drop=True)
o_data = data.select_dtypes(include='object')
num_data
 | age | age_o | d_age | samerace | importance_same_race | importance_same_religion | pref_o_attractive | pref_o_sincere | pref_o_intelligence | pref_o_funny | ... | interests_correlate | expected_happy_with_sd_people | expected_num_interested_in_me | expected_num_matches | like | guess_prob_liked | met | decision | decision_o | match
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 21.0 | 27.0 | 6 | 0 | 2.0 | 4.0 | 35.0 | 20.0 | 20.0 | 20.0 | ... | 0.14 | 3.0 | 2.0 | 4.0 | 7.0 | 6.0 | 0.0 | 1 | 0 | 0 |
1 | 21.0 | 23.0 | 2 | 0 | 2.0 | 4.0 | 30.0 | 5.0 | 15.0 | 40.0 | ... | 0.61 | 3.0 | 2.0 | 4.0 | 7.0 | 6.0 | 0.0 | 1 | 1 | 1 |
2 | 21.0 | 24.0 | 3 | 0 | 2.0 | 4.0 | 30.0 | 10.0 | 20.0 | 10.0 | ... | 0.21 | 3.0 | 2.0 | 4.0 | 6.0 | 6.0 | 0.0 | 1 | 1 | 1 |
3 | 21.0 | 25.0 | 4 | 0 | 2.0 | 4.0 | 50.0 | 0.0 | 30.0 | 10.0 | ... | 0.25 | 3.0 | 2.0 | 4.0 | 6.0 | 5.0 | 0.0 | 0 | 1 | 0 |
4 | 21.0 | 30.0 | 9 | 0 | 2.0 | 4.0 | 35.0 | 15.0 | 25.0 | 10.0 | ... | 0.34 | 3.0 | 2.0 | 4.0 | 6.0 | 5.0 | 0.0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1042 | 19.0 | 20.0 | 1 | 0 | 4.0 | 1.0 | 15.0 | 15.0 | 20.0 | 25.0 | ... | 0.35 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1043 | 19.0 | 21.0 | 2 | 0 | 4.0 | 1.0 | 15.0 | 15.0 | 25.0 | 25.0 | ... | 0.45 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1044 | 19.0 | 20.0 | 1 | 0 | 4.0 | 1.0 | 20.0 | 20.0 | 20.0 | 20.0 | ... | 0.13 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1045 | 19.0 | 21.0 | 2 | 0 | 4.0 | 1.0 | 15.0 | 15.0 | 25.0 | 25.0 | ... | 0.54 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1046 | 19.0 | 20.0 | 1 | 0 | 4.0 | 1.0 | 10.0 | 10.0 | 35.0 | 35.0 | ... | 0.54 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1047 rows × 62 columns
o_data
 | gender | race | race_o
---|---|---|---
0 | female | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American |
3 | female | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American |
4 | female | 'Asian/Pacific Islander/Asian-American' | 'Latino/Hispanic American' |
5 | female | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American |
6 | female | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American |
... | ... | ... | ... |
1836 | male | 'Asian/Pacific Islander/Asian-American' | Other |
1837 | male | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American |
1838 | male | 'Asian/Pacific Islander/Asian-American' | 'Black/African American' |
1840 | male | 'Asian/Pacific Islander/Asian-American' | European/Caucasian-American |
1843 | male | 'Asian/Pacific Islander/Asian-American' | 'Latino/Hispanic American' |
1047 rows × 3 columns
o_encoder = OneHotEncoder()
o_trans = o_encoder.fit_transform(o_data)
# build readable column names: strip the stray quotes from the category
# labels, then suffix the partner's race categories (the last five) with '_o'
# (the first seven names cover gender and the subject's own race)
o_att = [str(cat).replace("'", '') for cats in o_encoder.categories_ for cat in cats]
o_att = o_att[:7] + [name + '_o' for name in o_att[7:]]
o_att
['female',
'male',
'Asian/Pacific Islander/Asian-American',
'Black/African American',
'Latino/Hispanic American',
'European/Caucasian-American',
'Other',
'Asian/Pacific Islander/Asian-American_o',
'Black/African American_o',
'Latino/Hispanic American_o',
'European/Caucasian-American_o',
'Other_o']
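On scikit-learn 1.0 and later, the encoder can generate these names itself; a minimal alternative sketch, assuming the same quote-stripping clean-up (the generated names come out prefixed, e.g. `gender_female`):

# alternative: let the encoder name the columns (scikit-learn >= 1.0)
auto_names = [n.replace("'", '') for n in o_encoder.get_feature_names_out(o_data.columns)]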
o_data = pd.DataFrame(o_trans.toarray(), columns=o_att)
# recombine the encoded categoricals with the numeric columns
trans_data = pd.concat([o_data, num_data], axis=1)
trans_data
 | female | male | Asian/Pacific Islander/Asian-American | Black/African American | Latino/Hispanic American | European/Caucasian-American | Other | Asian/Pacific Islander/Asian-American_o | Black/African American_o | Latino/Hispanic American_o | ... | interests_correlate | expected_happy_with_sd_people | expected_num_interested_in_me | expected_num_matches | like | guess_prob_liked | met | decision | decision_o | match
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.14 | 3.0 | 2.0 | 4.0 | 7.0 | 6.0 | 0.0 | 1 | 0 | 0 |
1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.61 | 3.0 | 2.0 | 4.0 | 7.0 | 6.0 | 0.0 | 1 | 1 | 1 |
2 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.21 | 3.0 | 2.0 | 4.0 | 6.0 | 6.0 | 0.0 | 1 | 1 | 1 |
3 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.25 | 3.0 | 2.0 | 4.0 | 6.0 | 5.0 | 0.0 | 0 | 1 | 0 |
4 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.34 | 3.0 | 2.0 | 4.0 | 6.0 | 5.0 | 0.0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1042 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.35 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1043 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.45 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1044 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.13 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1045 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.54 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1046 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.54 | 5.0 | 0.0 | 0.0 | 5.0 | 1.0 | 0.0 | 0 | 0 | 0 |
1047 rows × 74 columns
# target: 'match' (the last column); features drop the final ten columns,
# which include the post-date ratings and both participants' decisions
X_train, X_test, y_train, y_test = train_test_split(trans_data.iloc[:, :-10], trans_data.iloc[:, -1], random_state=42)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
from sklearn.metrics import accuracy_score
print('Baseline Accuracy score: {:.5f}'.format(accuracy_score(y_test, y_pred)))
Baseline Accuracy score: 0.82061
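For context, it is worth comparing against the trivial baseline of always predicting the majority class; a quick hypothetical check on the same test split:

# accuracy of always predicting the more common class in y_test
majority_acc = max(y_test.mean(), 1 - y_test.mean())
print('Majority-class accuracy: {:.5f}'.format(majority_acc))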
features = pd.DataFrame(data=rf.feature_importances_, index=trans_data.columns[:-10], columns=['importance'])
features.sort_values('importance', ascending=False)[:10]
 | importance
---|---
attractive_o | 0.060663 |
attractive_partner | 0.057318 |
shared_interests_o | 0.056460 |
funny_o | 0.046871 |
shared_interests_partner | 0.044348 |
funny_partner | 0.042942 |
age_o | 0.029814 |
pref_o_attractive | 0.024023 |
pref_o_funny | 0.023656 |
ambitous_o | 0.022849 |
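Impurity-based importances can overstate features with many distinct values. A hedged cross-check is permutation importance on the held-out set (requires scikit-learn 0.22+; not part of the original run):

from sklearn.inspection import permutation_importance
# shuffle one feature at a time and measure the drop in test accuracy
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
pd.Series(perm.importances_mean, index=X_test.columns).sort_values(ascending=False).head(10)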
# full feature matrix and target (the cross-validation below is run on the training split)
X, y = trans_data.iloc[:, :-10], trans_data.iloc[:, -1]
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import StackingClassifier
# level-0 base learners whose out-of-fold predictions feed the meta-learner
level_0 = [('lr', LogisticRegression()),
           ('knn', KNeighborsClassifier()),
           ('rf', RandomForestClassifier()),
           ('svm', SVC())]
# level-1 meta-learner
level_1 = RandomForestClassifier()
stacking = StackingClassifier(estimators=level_0, final_estimator=level_1, cv=5)
models = {'lr': LogisticRegression(),
          'knn': KNeighborsClassifier(),
          'rf': RandomForestClassifier(),
          'svm': SVC(),
          'stacking': stacking}
def evaluate_model(model, X, y):
    # 5-fold stratified CV, repeated 10 times (RepeatedStratifiedKFold's default)
    cv = RepeatedStratifiedKFold(n_splits=5, random_state=42)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores
results = list()
names = list()
for name, model in models.items():
    scores = evaluate_model(model, X_train, y_train)
    results.append(scores)
    names.append(name)
    print((name, round(np.mean(scores), 5), round(np.std(scores), 5)))
('lr', 0.84408, 0.02193)
('knn', 0.81452, 0.01951)
('rf', 0.85618, 0.01361)
('svm', 0.83057, 0.00312)
('stacking', 0.84611, 0.02071)
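A box plot makes the spread of the repeated-CV scores easier to compare than the mean/std tuples alone; a small sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt
# one box per model over the 50 repeated-CV accuracy scores
plt.boxplot(results, labels=names, showmeans=True)
plt.ylabel('accuracy')
plt.show()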
# regression target: the 'like' rating (column -6)
# caution: the feature slice [:, :-3] still includes 'like' itself, so the
# scores below benefit from target leakage (see the note after the baseline RMSE)
X_train, X_test, y_train, y_test = train_test_split(trans_data.iloc[:, :-3], trans_data.iloc[:, -6], random_state=42)
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)
# cast predictions to integer ratings (astype('int') truncates toward zero)
y_pred.astype('int')
array([6, 6, 6, 7, 4, 6, 8, 6, 6, 7, 6, 6, 6, 6, 8, 4, 2, 5, 6, 7, 4, 8,
6, 6, 7, 5, 6, 6, 7, 6, 6, 8, 6, 8, 6, 7, 6, 2, 6, 6, 6, 6, 6, 6,
6, 6, 2, 7, 5, 8, 6, 3, 7, 6, 6, 5, 7, 6, 6, 6, 9, 9, 6, 6, 6, 5,
4, 4, 6, 7, 7, 2, 9, 7, 6, 7, 4, 5, 6, 6, 3, 8, 6, 5, 8, 3, 4, 6,
6, 6, 6, 6, 7, 6, 1, 7, 6, 6, 0, 5, 6, 8, 6, 7, 6, 5, 6, 6, 7, 2,
5, 8, 6, 6, 8, 6, 4, 6, 6, 6, 7, 6, 6, 7, 1, 6, 6, 7, 6, 6, 3, 6,
6, 6, 6, 2, 6, 6, 6, 7, 6, 7, 7, 7, 6, 6, 6, 5, 6, 9, 6, 6, 6, 6,
6, 9, 6, 6, 7, 3, 5, 6, 6, 8, 5, 8, 7, 4, 6, 5, 6, 6, 9, 6, 2, 5,
3, 2, 8, 6, 6, 6, 6, 6, 6, 2, 4, 6, 6, 2, 5, 7, 7, 6, 5, 2, 5, 7,
6, 6, 6, 5, 7, 6, 6, 6, 6, 6, 6, 6, 5, 6, 9, 6, 6, 6, 6, 6, 7, 5,
6, 7, 5, 7, 6, 1, 8, 4, 6, 6, 2, 5, 7, 5, 4, 6, 6, 6, 7, 5, 6, 9,
6, 5, 6, 6, 6, 6, 6, 6, 9, 6, 3, 7, 6, 5, 6, 2, 6, 6, 6, 6])
from sklearn.metrics import mean_squared_error
print('Baseline RMSE: {:.5f}'.format(
    mean_squared_error(y_test, y_pred.astype('int'), squared=False)))
Baseline RMSE: 0.70373
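This RMSE should be read with caution: the feature slice `iloc[:, :-3]` above still contains the target column `like` (position -6), so the regressors can largely read the answer off their inputs. A leak-free setup would drop the outcome columns explicitly; a sketch of what that could look like (variable names hypothetical, not the run shown here):

# drop 'like' (the target) together with the other outcome columns
outcome_cols = ['like', 'guess_prob_liked', 'met', 'decision', 'decision_o', 'match']
X_nl, y_nl = trans_data.drop(columns=outcome_cols), trans_data['like']
X_train_nl, X_test_nl, y_train_nl, y_test_nl = train_test_split(X_nl, y_nl, random_state=42)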
import xgboost as xgb
# wrap the train/test splits in XGBoost's DMatrix format
train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)
params = {
    'max_depth': 6,
    'min_child_weight': 1,
    'eta': 0.3,
    'subsample': 1,
    'colsample_bytree': 1,
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse'
}
# 5-fold CV with early stopping to gauge a good number of boosting rounds
cv_results = xgb.cv(
    params,
    train,
    num_boost_round=100,
    seed=42,
    nfold=5,
    metrics={'rmse'},
    early_stopping_rounds=10
)
cv_results
 | train-rmse-mean | train-rmse-std | test-rmse-mean | test-rmse-std
---|---|---|---|---
0 | 4.211302 | 0.009553 | 4.211142 | 0.029887 |
1 | 2.963511 | 0.006647 | 2.962846 | 0.023045 |
2 | 2.086589 | 0.004590 | 2.086106 | 0.018931 |
3 | 1.469331 | 0.003155 | 1.469113 | 0.011535 |
4 | 1.035416 | 0.002208 | 1.035566 | 0.010592 |
5 | 0.729338 | 0.001536 | 0.729979 | 0.007580 |
6 | 0.514014 | 0.001038 | 0.515827 | 0.006285 |
7 | 0.362344 | 0.000730 | 0.364688 | 0.005393 |
8 | 0.255511 | 0.000529 | 0.259339 | 0.006231 |
9 | 0.180219 | 0.000364 | 0.185675 | 0.007906 |
10 | 0.127121 | 0.000253 | 0.135551 | 0.010946 |
11 | 0.089681 | 0.000176 | 0.101505 | 0.014863 |
12 | 0.063277 | 0.000124 | 0.078882 | 0.019360 |
13 | 0.044657 | 0.000088 | 0.064012 | 0.023895 |
14 | 0.031525 | 0.000064 | 0.054265 | 0.027993 |
15 | 0.022262 | 0.000047 | 0.047811 | 0.031418 |
16 | 0.015726 | 0.000035 | 0.043479 | 0.034129 |
17 | 0.011115 | 0.000027 | 0.040548 | 0.036189 |
18 | 0.007860 | 0.000023 | 0.038538 | 0.037721 |
19 | 0.005563 | 0.000020 | 0.037150 | 0.038841 |
20 | 0.003942 | 0.000020 | 0.036186 | 0.039650 |
21 | 0.002798 | 0.000020 | 0.035513 | 0.040233 |
22 | 0.001991 | 0.000020 | 0.035042 | 0.040650 |
23 | 0.001421 | 0.000022 | 0.034714 | 0.040945 |
24 | 0.001018 | 0.000022 | 0.034482 | 0.041156 |
25 | 0.000735 | 0.000022 | 0.034319 | 0.041305 |
26 | 0.000534 | 0.000023 | 0.034204 | 0.041410 |
27 | 0.000394 | 0.000023 | 0.034122 | 0.041485 |
28 | 0.000294 | 0.000023 | 0.034066 | 0.041536 |
29 | 0.000225 | 0.000022 | 0.034028 | 0.041573 |
30 | 0.000175 | 0.000021 | 0.034000 | 0.041598 |
31 | 0.000141 | 0.000018 | 0.033983 | 0.041614 |
32 | 0.000117 | 0.000016 | 0.033973 | 0.041624 |
33 | 0.000102 | 0.000013 | 0.033968 | 0.041629 |
34 | 0.000092 | 0.000010 | 0.033960 | 0.041636 |
35 | 0.000084 | 0.000008 | 0.033958 | 0.041637 |
36 | 0.000078 | 0.000006 | 0.033958 | 0.041640 |
37 | 0.000077 | 0.000006 | 0.033958 | 0.041640 |
38 | 0.000077 | 0.000006 | 0.033958 | 0.041641 |
39 | 0.000077 | 0.000006 | 0.033958 | 0.041641 |
40 | 0.000077 | 0.000006 | 0.033958 | 0.041641 |
41 | 0.000077 | 0.000006 | 0.033958 | 0.041641 |
42 | 0.000077 | 0.000006 | 0.033958 | 0.041641 |
# coarse grid over the tree-complexity parameters
gridsearch_params = [
    (max_depth, min_child_weight)
    for max_depth in range(10, 15)
    for min_child_weight in range(5, 10)
]
min_rmse = float("Inf")
best_params = None
for max_depth, min_child_weight in gridsearch_params:
    print("CV with max_depth={}, min_child_weight={}".format(
        max_depth,
        min_child_weight))
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight
    cv_results = xgb.cv(
        params,
        train,
        num_boost_round=100,
        seed=42,
        nfold=5,
        metrics={'rmse'},
        early_stopping_rounds=10
    )
    # best test RMSE for this pair, and the round at which it occurred
    mean_rmse = cv_results['test-rmse-mean'].min()
    boost_rounds = cv_results['test-rmse-mean'].argmin()
    print("\tRMSE {:.5f} for {} rounds".format(mean_rmse, boost_rounds))
    if mean_rmse < min_rmse:
        min_rmse = mean_rmse
        best_params = (max_depth, min_child_weight)
print("Best params: {}, {}, RMSE: {:.5f}".format(best_params[0], best_params[1], min_rmse))
CV with max_depth=10, min_child_weight=5
RMSE 0.03549 for 30 rounds
CV with max_depth=10, min_child_weight=6
RMSE 0.03525 for 22 rounds
CV with max_depth=10, min_child_weight=7
RMSE 0.03503 for 22 rounds
CV with max_depth=10, min_child_weight=8
RMSE 0.05993 for 20 rounds
CV with max_depth=10, min_child_weight=9
RMSE 0.10087 for 18 rounds
CV with max_depth=11, min_child_weight=5
RMSE 0.03524 for 30 rounds
CV with max_depth=11, min_child_weight=6
RMSE 0.03526 for 22 rounds
CV with max_depth=11, min_child_weight=7
RMSE 0.03502 for 22 rounds
CV with max_depth=11, min_child_weight=8
RMSE 0.06001 for 20 rounds
CV with max_depth=11, min_child_weight=9
RMSE 0.10091 for 18 rounds
CV with max_depth=12, min_child_weight=5
RMSE 0.03549 for 23 rounds
CV with max_depth=12, min_child_weight=6
RMSE 0.03526 for 22 rounds
CV with max_depth=12, min_child_weight=7
RMSE 0.03503 for 22 rounds
CV with max_depth=12, min_child_weight=8
RMSE 0.05980 for 22 rounds
CV with max_depth=12, min_child_weight=9
RMSE 0.10092 for 18 rounds
CV with max_depth=13, min_child_weight=5
RMSE 0.03549 for 23 rounds
CV with max_depth=13, min_child_weight=6
RMSE 0.03526 for 22 rounds
CV with max_depth=13, min_child_weight=7
RMSE 0.03503 for 22 rounds
CV with max_depth=13, min_child_weight=8
RMSE 0.05948 for 20 rounds
CV with max_depth=13, min_child_weight=9
RMSE 0.10092 for 18 rounds
CV with max_depth=14, min_child_weight=5
RMSE 0.03549 for 23 rounds
CV with max_depth=14, min_child_weight=6
RMSE 0.03526 for 22 rounds
CV with max_depth=14, min_child_weight=7
RMSE 0.03503 for 22 rounds
CV with max_depth=14, min_child_weight=8
RMSE 0.05944 for 25 rounds
CV with max_depth=14, min_child_weight=9
RMSE 0.10092 for 18 rounds
Best params: 11, 7, RMSE: 0.03502
# retrain with the best depth/child-weight pair, watching the held-out set
params['max_depth'] = best_params[0]
params['min_child_weight'] = best_params[1]
model = xgb.train(
    params,
    train,
    num_boost_round=35,
    evals=[(test, "Test")],
    early_stopping_rounds=10
)
print("Best RMSE: {:.5f} in {} rounds".format(model.best_score, model.best_iteration+1))
[0] Test-rmse:4.27663
Will train until Test-rmse hasn't improved in 10 rounds.
[1] Test-rmse:3.00449
[2] Test-rmse:2.11192
[3] Test-rmse:1.48495
[4] Test-rmse:1.04245
[5] Test-rmse:0.73402
[6] Test-rmse:0.51477
[7] Test-rmse:0.36311
[8] Test-rmse:0.25605
[9] Test-rmse:0.18107
[10] Test-rmse:0.12900
[11] Test-rmse:0.09295
[12] Test-rmse:0.06874
[13] Test-rmse:0.05330
[14] Test-rmse:0.04382
[15] Test-rmse:0.03871
[16] Test-rmse:0.03604
[17] Test-rmse:0.03500
[18] Test-rmse:0.03441
[19] Test-rmse:0.03440
[20] Test-rmse:0.03474
[21] Test-rmse:0.03488
[22] Test-rmse:0.03531
[23] Test-rmse:0.03549
[24] Test-rmse:0.03594
[25] Test-rmse:0.03614
[26] Test-rmse:0.03662
[27] Test-rmse:0.03685
[28] Test-rmse:0.03735
[29] Test-rmse:0.03757
Stopping. Best iteration:
[19] Test-rmse:0.03440
Best RMSE: 0.03440 in 20 rounds
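As a final hypothetical check (not in the original run), the early-stopped booster can score the held-out set using only its best trees; `iteration_range` requires xgboost 1.4+:

# predict with trees up to the best iteration and report the held-out RMSE
final_pred = model.predict(test, iteration_range=(0, model.best_iteration + 1))
print('Held-out RMSE: {:.5f}'.format(mean_squared_error(y_test, final_pred, squared=False)))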