An analysis of the Titanic dataset

12 minute read

Published: March 15, 2017

# Imports

# pandas
import pandas as pd
from pandas import Series,DataFrame

# numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
sns.set_style('whitegrid')
%matplotlib inline

# get titanic & test csv files as a DataFrame
titanic_df = pd.read_csv("train.csv")
test_df    = pd.read_csv("test.csv")

# preview the data
titanic_df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Passenger ID, name, and ticket will be dropped. Fare could be other candidate to be dropped (correlation with class), and perhaps cabin too. I’ll initially assume that cabin doesn’t help predict Survive, and let’s first see what happens with fare and class.

train = titanic_df.drop(['PassengerId','Name','Ticket','Cabin'], axis=1)
test    = test_df.drop(['PassengerId','Name','Ticket','Cabin'], axis=1)
#We also add a 1/0 variable for Sex
df_sex=pd.get_dummies(train['Sex'],drop_first=True)
train=train.join(df_sex)

df_sex_2=pd.get_dummies(test['Sex'],drop_first=True)
test=test.join(df_sex_2)
#Also df['Gender'] = df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)


#Dummies for Pclass too

df_pclass=pd.get_dummies(train['Pclass'],prefix='Class').astype(int)
train=train.join(df_pclass)

df_pclass_2=pd.get_dummies(test['Pclass'],prefix='Class').astype(int)
test=test.join(df_pclass_2)
train.head()

	Survived	Pclass	Sex	Age	SibSp	Fare	Embarked	male	Class_1	Class_3
0	0	3	male	22.0	1	7.2500	S	1	0	1
1	1	1	female	38.0	1	71.2833	C	0	1	0
2	1	3	female	26.0	0	7.9250	S	0	0	1
3	1	1	female	35.0	1	53.1000	S	0	1	0
4	0	3	male	35.0	0	8.0500	S	1	0	1

#Deal with age

avg_age_train=train['Age'].mean()
std_age_train=train['Age'].std()
nans_age_train=train['Age'].isnull().sum()

avg_age_test=test['Age'].mean()
std_age_test=test['Age'].std()
nans_age_test=test['Age'].isnull().sum()

#Generate random ages
rand_1 = np.random.randint(avg_age_train-std_age_train,avg_age_train+std_age_train,size=nans_age_train)
rand_2 = np.random.randint(avg_age_test-std_age_test,avg_age_test+std_age_test,size=nans_age_test)

#Fill NaNs
#train["Age"][np.isnan(train["Age"])] = rand_1
#test["Age"][np.isnan(test["Age"])] = rand_2
#Median better than mean to avoid outliers
train['Age'].fillna(train['Age'].median(), inplace=True)
test['Age'].fillna(test['Age'].median(), inplace=True)
#Check
np.all(~np.isnan(train["Age"]))

True

corrmat=train[['Survived','Class_1','Class_2','Class_3','SibSp','Parch','Fare','male','Age']].corr()
print(corrmat)
sns.heatmap(corrmat,vmax=.8,annot=True)

          Survived   Class_1   Class_2   Class_3     SibSp     Parch  \
Survived  1.000000  0.285904  0.093349 -0.322308 -0.035322  0.081629   
Class_1   0.285904  1.000000 -0.288585 -0.626738 -0.054582 -0.017633   
Class_2   0.093349 -0.288585  1.000000 -0.565210 -0.055932 -0.000734   
Class_3  -0.322308 -0.626738 -0.565210  1.000000  0.092548  0.015790   
SibSp    -0.035322 -0.054582 -0.055932  0.092548  1.000000  0.414838   
Parch     0.081629 -0.017633 -0.000734  0.015790  0.414838  1.000000   
Fare      0.257307  0.591711 -0.118557 -0.413333  0.159651  0.216225   
male     -0.543351 -0.098013 -0.064746  0.137143 -0.114631 -0.245489   
Age      -0.064910  0.323896  0.015831 -0.291955 -0.233296 -0.172482   

              Fare      male       Age  
Survived  0.257307 -0.543351 -0.064910  
Class_1   0.591711 -0.098013  0.323896  
Class_2  -0.118557 -0.064746  0.015831  
Class_3  -0.413333  0.137143 -0.291955  
SibSp     0.159651 -0.114631 -0.233296  
Parch     0.216225 -0.245489 -0.172482  
Fare      1.000000 -0.182333  0.096688  
male     -0.182333  1.000000  0.081163  
Age       0.096688  0.081163  1.000000  





<matplotlib.axes._subplots.AxesSubplot at 0x7f8d623df748>

png

There doesn’t seem to be NaNs or weird things, no need for cleaning. But we do need to fill some values for age. Also, I should check whether embarked does something. Let’s go with that first.

sns.countplot(x="Embarked", data=train);
plt.figure();
sns.barplot(x='Embarked',y="Survived", data=train);

png

But is this spurious due to higher variability in C and Q because there are less of them? Maybe. Perhaps it’d be worthwhile to keep these around, so let’s dummy them, and run a regression.

It seems to make sense to impute the missing values to S, as it is the most common, and near the average

train['Embarked']=train['Embarked'].fillna('S')
test['Embarked']=test['Embarked'].fillna('S')

df_em=pd.get_dummies(train['Embarked'],prefix='Embarked')

Let’s study the fare-class relationship

# First, I need to extract the fares and  group them by Pclass, then plot.
#pclass_fare=train[['Fare','Pclass']]
#print(pclass_fare.groupby(['Pclass']).describe())
#print(pclass_fare.groupby(['Pclass']).mean())
#sns.boxplot(x="Pclass", y="Fare", data=pclass_fare);

As expected, the correlation seems to be there. But how strong is the correlation?

Hm, perhaps better to keep it. $R^2$ is not that high

For fun, how are fares by Pclass distributed?

sns.distplot(train[train['Class_1']==1]['Fare'])
plt.figure()
sns.distplot(train[train['Class_2']==1]['Fare'])
sns.distplot(train[train['Class_3']==1]['Fare'])

/home/artir/anaconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j

<matplotlib.axes._subplots.AxesSubplot at 0x7f8d415d74e0>

png

Back to Embarked and survived:

# result = sm.ols(formula="Survived ~Embarked_Q+Embarked_C+Embarked_S ", data=train).fit()
# print(result.params)
# print (result.summary())
#R2 too small, disregard.

So finally, the choosen features will be Pclass, Fare, male, and age. Time to do some predictions.

# define training and testing sets
#There happens to be one missing element in Fare. So let's fix that
test["Fare"].fillna(test["Fare"].median(), inplace=True)
X_train = train[['Pclass','Fare','male','Age']]
Y_train = train[["Survived"]]
X_test  = test[['Pclass','Fare','male','Age']]

# Logistic Regression
# machine learning
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

logreg.fit(X_train, Y_train.values.ravel())

Y_pred = logreg.predict(X_test)
logreg.score(X_train, Y_train)

0.7912457912457912

Let’s add some more features

train['FamilySize'] = train['SibSp'] + train['Parch']
test['FamilySize'] = test['SibSp'] + test['Parch']

# If greater than zero, it'll be one. If not, zero
train['HaveFamily'] = 0
test['HaveFamily'] = 0
test['FamilySize'] = test['SibSp'] + test['Parch']
train['FamilySize'] = train['SibSp'] + train['Parch']

test.loc[test['FamilySize']>0,'HaveFamily']=1
train.loc[train['FamilySize']>0,'HaveFamily']=1

#Childs
train['child'] = 0
test['child'] = 0
test.loc[test['Age']<15,'child']=1
train.loc[train['Age']<15,'child']=1
predictors=['Parch','SibSp','Fare','male','Age','FamilySize','HaveFamily','Class_1','Class_2','Class_3','child']
X_train = train[predictors]

Y_train = train[["Survived"]]
X_test  = test[predictors]

corrmat=train[predictors].corr()
print(corrmat)

a=sns.heatmap(corrmat,annot=True)

               Parch     SibSp      Fare      male       Age  FamilySize  \
Parch       1.000000  0.414838  0.216225 -0.245489 -0.172482    0.783111   
SibSp       0.414838  1.000000  0.159651 -0.114631 -0.233296    0.890712   
Fare        0.216225  0.159651  1.000000 -0.182333  0.096688    0.217138   
male       -0.245489 -0.114631 -0.182333  1.000000  0.081163   -0.200988   
Age        -0.172482 -0.233296  0.096688  0.081163  1.000000   -0.245619   
FamilySize  0.783111  0.890712  0.217138 -0.200988 -0.245619    1.000000   
HaveFamily  0.583398  0.584471  0.271832 -0.303646 -0.171647    0.690922   
Class_1    -0.017633 -0.054582  0.591711 -0.098013  0.323896   -0.046114   
Class_2    -0.000734 -0.055932 -0.118557 -0.064746  0.015831   -0.038594   
Class_3     0.015790  0.092548 -0.413333  0.137143 -0.291955    0.071142   
child       0.361001  0.364654 -0.003117 -0.095692 -0.560466    0.429578   

            HaveFamily   Class_1   Class_2   Class_3     child  
Parch         0.583398 -0.017633 -0.000734  0.015790  0.361001  
SibSp         0.584471 -0.054582 -0.055932  0.092548  0.364654  
Fare          0.271832  0.591711 -0.118557 -0.413333 -0.003117  
male         -0.303646 -0.098013 -0.064746  0.137143 -0.095692  
Age          -0.171647  0.323896  0.015831 -0.291955 -0.560466  
FamilySize    0.690922 -0.046114 -0.038594  0.071142  0.429578  
HaveFamily    1.000000  0.113364  0.039070 -0.129472  0.349033  
Class_1       0.113364  1.000000 -0.288585 -0.626738 -0.128886  
Class_2       0.039070 -0.288585  1.000000 -0.565210  0.028373  
Class_3      -0.129472 -0.626738 -0.565210  1.000000  0.087957  
child         0.349033 -0.128886  0.028373  0.087957  1.000000  

png

logreg = LogisticRegression()

logreg.fit(X_train, Y_train.values.ravel())

Y_pred = logreg.predict(X_test)
logreg.score(X_train, Y_train)

0.80920314253647585

Including the family bits and Class increases the score. (!) Guess it wasn’t that good of an idea to disregard those at the beginning. Let’s now try a bunch of other methods.

# Import the random forest package
from sklearn.ensemble import RandomForestClassifier 
from sklearn.model_selection import learning_curve


# Create the random forest object which will include all the parameters
# for the fit
forest=RandomForestClassifier(max_features='sqrt',max_depth=8,n_estimators=240)

# Fit the training data to the Survived labels and create the decision trees
forest.fit(X_train,Y_train.values.ravel())

# Take the same decision trees and run it on the test data
Y_pred = forest.predict(X_test)
forest.score(X_train, Y_train)
# What features are important?
features = pd.DataFrame()
features['feature'] = X_train.columns
features['importance'] = forest.feature_importances_
features.sort(['importance'],ascending=False)

/home/artir/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:20: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)

	feature	importance
3	male	0.364375
2	Fare	0.201354
4	Age	0.154100
9	Class_3	0.075587
5	FamilySize	0.056493
7	Class_1	0.034916
1	SibSp	0.031640
10	child	0.028228
0	Parch	0.022727
8	Class_2	0.017080
6	HaveFamily	0.013501

from sklearn.model_selection import validation_curve
X, y = X_train,Y_train['Survived']
#X, y = train_new,Y_train['Survived']
def cross_val(model,X,y,pname):
    param_range = np.array([100,110,140,150])
    train_scores, test_scores = validation_curve(
        model, X, y, param_name=pname, param_range=param_range, scoring="accuracy", n_jobs=4)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    plt.title("Validation Curve with Random Forest")
    plt.xlabel("Number of estimators")
    plt.ylabel("Score")
    lw = 2
    plt.plot(param_range, train_scores_mean, label="Training score",
                 color="darkorange", lw=lw)
    plt.fill_between(param_range, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.2,
                     color="darkorange", lw=lw)
    plt.plot(param_range, test_scores_mean, label="Cross-validation score",
                 color="navy", lw=lw)
    plt.fill_between(param_range, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.2,
                     color="navy", lw=lw)
    plt.legend(loc="best")
    plt.show()
cross_val(RandomForestClassifier(),X,y,'n_estimators')

png

from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt



title = "Learning Curves (RandomForest)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
#cv = ShuffleSplit(n_splits=20, test_size=0.33, random_state=0)

#estimator=RandomForestClassifier(n_estimators=100)
#plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4)

Hmm, underfitting :/

The prediction seems good enough for now, let’s submit’…

… Or maybe not. First public submission got 0.65. Time to try another method?

About 0.775 with RandomForests. One final trial, with XGBoost

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
# fit model no training data
model = XGBClassifier()
model_params = {
    'learning_rate': [0.05, 0.1],
    'n_estimators': [100,150,200],
    'max_depth': [2, 3,7, 10],
}
cv = StratifiedKFold()
cv.get_n_splits(X_train,Y_train)
grid= GridSearchCV(model,model_params,scoring='roc_auc',cv=cv,verbose=2)
best_params={'n_estimators': 100, 'max_depth': 2, 'learning_rate': 0.05}
#grid.fit(X_train, Y_train.values.ravel())
model = XGBClassifier(n_estimators= 100, max_depth= 2, learning_rate= 0.05)
model.fit(X_train,Y_train.values.ravel())

# Take the same decision trees and run it on the test data
Y_pred = model.predict(X_test)
#Y_pred = [round(value) for value in Y_pred]

#cross_val(XGBClassifier(),X,y,'n_estimators')
#About 200 seems okay

title = "Learning Curves (RandomForest)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=10, test_size=0.33, random_state=0)
mpl.rcParams['figure.figsize'] = [8.0, 6.0]

plot_learning_curve(model, title, X, y, ylim=(0.7, 1.01), cv=cv)

<module 'matplotlib.pyplot' from '/home/artir/anaconda3/lib/python3.5/site-packages/matplotlib/pyplot.py'>

png

import matplotlib as mpl
sns.set()
cross_val(model,X,y,'n_estimators')

png

test.head()

	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked	male	Class_2	Class_3	FamilySize	HaveFamily
0	3	male	34.5	0	0	7.8292	Q	1	0	1	0	0
1	3	female	47.0	1	0	7.0000	S	0	0	1	1	1
2	2	male	62.0	0	0	9.6875	Q	1	1	0	0	0
3	3	male	27.0	0	0	8.6625	S	1	0	1	0	0
4	3	female	22.0	1	1	12.2875	S	0	0	1	2	1

submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })
submission.to_csv('titanic.csv', index=False)

import sys
print(sys.version)

3.5.2 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]

Lessons learned

Establishing a process for analysing the data is important, even if that process requires iteration
It really comes down to explore the data, generate features, then feed to the algorithm
Which algorithm? Apparently XGBoost is the best. Also, try neural nets with keras.
Correlation plots are interesting to heuristically decide what to keep/drop. In the end
Some plots are better than others, depending on the case

Twitter Facebook Google+ LinkedIn

José Luis Ricón

An analysis of the Titanic dataset

Lessons learned

MNIST convnet

A study of volatility timing for different assets

A study of volatility timing for different assets

Bitcoin volatility analysis

Analysis of Bitcoin Volatility

Gradient descent vs Newton Raphson

Gradient descent vs Newton Raphson