An analysis of the Titanic dataset

12 minute read

Published:

An analysis of the Titanic dataset

# Imports

# pandas
import pandas as pd
from pandas import Series,DataFrame

# numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
sns.set_style('whitegrid')
%matplotlib inline
# get titanic & test csv files as a DataFrame
titanic_df = pd.read_csv("train.csv")
test_df    = pd.read_csv("test.csv")

# preview the data
titanic_df.head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS

Passenger ID, name, and ticket will be dropped. Fare could be other candidate to be dropped (correlation with class), and perhaps cabin too. I’ll initially assume that cabin doesn’t help predict Survive, and let’s first see what happens with fare and class.

train = titanic_df.drop(['PassengerId','Name','Ticket','Cabin'], axis=1)
test    = test_df.drop(['PassengerId','Name','Ticket','Cabin'], axis=1)
#We also add a 1/0 variable for Sex
df_sex=pd.get_dummies(train['Sex'],drop_first=True)
train=train.join(df_sex)

df_sex_2=pd.get_dummies(test['Sex'],drop_first=True)
test=test.join(df_sex_2)
#Also df['Gender'] = df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)


#Dummies for Pclass too

df_pclass=pd.get_dummies(train['Pclass'],prefix='Class').astype(int)
train=train.join(df_pclass)

df_pclass_2=pd.get_dummies(test['Pclass'],prefix='Class').astype(int)
test=test.join(df_pclass_2)
train.head()
SurvivedPclassSexAgeSibSpParchFareEmbarkedmaleClass_1Class_2Class_3
003male22.0107.2500S1001
111female38.01071.2833C0100
213female26.0007.9250S0001
311female35.01053.1000S0100
403male35.0008.0500S1001
#Deal with age

avg_age_train=train['Age'].mean()
std_age_train=train['Age'].std()
nans_age_train=train['Age'].isnull().sum()

avg_age_test=test['Age'].mean()
std_age_test=test['Age'].std()
nans_age_test=test['Age'].isnull().sum()

#Generate random ages
rand_1 = np.random.randint(avg_age_train-std_age_train,avg_age_train+std_age_train,size=nans_age_train)
rand_2 = np.random.randint(avg_age_test-std_age_test,avg_age_test+std_age_test,size=nans_age_test)

#Fill NaNs
#train["Age"][np.isnan(train["Age"])] = rand_1
#test["Age"][np.isnan(test["Age"])] = rand_2
#Median better than mean to avoid outliers
train['Age'].fillna(train['Age'].median(), inplace=True)
test['Age'].fillna(test['Age'].median(), inplace=True)
#Check
np.all(~np.isnan(train["Age"]))
True
corrmat=train[['Survived','Class_1','Class_2','Class_3','SibSp','Parch','Fare','male','Age']].corr()
print(corrmat)
sns.heatmap(corrmat,vmax=.8,annot=True)
          Survived   Class_1   Class_2   Class_3     SibSp     Parch  \
Survived  1.000000  0.285904  0.093349 -0.322308 -0.035322  0.081629   
Class_1   0.285904  1.000000 -0.288585 -0.626738 -0.054582 -0.017633   
Class_2   0.093349 -0.288585  1.000000 -0.565210 -0.055932 -0.000734   
Class_3  -0.322308 -0.626738 -0.565210  1.000000  0.092548  0.015790   
SibSp    -0.035322 -0.054582 -0.055932  0.092548  1.000000  0.414838   
Parch     0.081629 -0.017633 -0.000734  0.015790  0.414838  1.000000   
Fare      0.257307  0.591711 -0.118557 -0.413333  0.159651  0.216225   
male     -0.543351 -0.098013 -0.064746  0.137143 -0.114631 -0.245489   
Age      -0.064910  0.323896  0.015831 -0.291955 -0.233296 -0.172482   

              Fare      male       Age  
Survived  0.257307 -0.543351 -0.064910  
Class_1   0.591711 -0.098013  0.323896  
Class_2  -0.118557 -0.064746  0.015831  
Class_3  -0.413333  0.137143 -0.291955  
SibSp     0.159651 -0.114631 -0.233296  
Parch     0.216225 -0.245489 -0.172482  
Fare      1.000000 -0.182333  0.096688  
male     -0.182333  1.000000  0.081163  
Age       0.096688  0.081163  1.000000  





<matplotlib.axes._subplots.AxesSubplot at 0x7f8d623df748>

png

There doesn’t seem to be NaNs or weird things, no need for cleaning. But we do need to fill some values for age. Also, I should check whether embarked does something. Let’s go with that first.

sns.countplot(x="Embarked", data=train);
plt.figure();
sns.barplot(x='Embarked',y="Survived", data=train);

png

png

But is this spurious due to higher variability in C and Q because there are less of them? Maybe. Perhaps it’d be worthwhile to keep these around, so let’s dummy them, and run a regression.

It seems to make sense to impute the missing values to S, as it is the most common, and near the average

train['Embarked']=train['Embarked'].fillna('S')
test['Embarked']=test['Embarked'].fillna('S')

df_em=pd.get_dummies(train['Embarked'],prefix='Embarked')

Let’s study the fare-class relationship

# First, I need to extract the fares and  group them by Pclass, then plot.
#pclass_fare=train[['Fare','Pclass']]
#print(pclass_fare.groupby(['Pclass']).describe())
#print(pclass_fare.groupby(['Pclass']).mean())
#sns.boxplot(x="Pclass", y="Fare", data=pclass_fare);

As expected, the correlation seems to be there. But how strong is the correlation?

Hm, perhaps better to keep it. $R^2$ is not that high

For fun, how are fares by Pclass distributed?

sns.distplot(train[train['Class_1']==1]['Fare'])
plt.figure()
sns.distplot(train[train['Class_2']==1]['Fare'])
sns.distplot(train[train['Class_3']==1]['Fare'])
/home/artir/anaconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j





<matplotlib.axes._subplots.AxesSubplot at 0x7f8d415d74e0>

png

png

Back to Embarked and survived:

# result = sm.ols(formula="Survived ~Embarked_Q+Embarked_C+Embarked_S ", data=train).fit()
# print(result.params)
# print (result.summary())
#R2 too small, disregard.

So finally, the choosen features will be Pclass, Fare, male, and age. Time to do some predictions.

# define training and testing sets
#There happens to be one missing element in Fare. So let's fix that
test["Fare"].fillna(test["Fare"].median(), inplace=True)
X_train = train[['Pclass','Fare','male','Age']]
Y_train = train[["Survived"]]
X_test  = test[['Pclass','Fare','male','Age']]
# Logistic Regression
# machine learning
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

logreg.fit(X_train, Y_train.values.ravel())

Y_pred = logreg.predict(X_test)
logreg.score(X_train, Y_train)
0.7912457912457912

Let’s add some more features

train['FamilySize'] = train['SibSp'] + train['Parch']
test['FamilySize'] = test['SibSp'] + test['Parch']

# If greater than zero, it'll be one. If not, zero
train['HaveFamily'] = 0
test['HaveFamily'] = 0
test['FamilySize'] = test['SibSp'] + test['Parch']
train['FamilySize'] = train['SibSp'] + train['Parch']

test.loc[test['FamilySize']>0,'HaveFamily']=1
train.loc[train['FamilySize']>0,'HaveFamily']=1

#Childs
train['child'] = 0
test['child'] = 0
test.loc[test['Age']<15,'child']=1
train.loc[train['Age']<15,'child']=1
predictors=['Parch','SibSp','Fare','male','Age','FamilySize','HaveFamily','Class_1','Class_2','Class_3','child']
X_train = train[predictors]

Y_train = train[["Survived"]]
X_test  = test[predictors]
corrmat=train[predictors].corr()
print(corrmat)

a=sns.heatmap(corrmat,annot=True)
               Parch     SibSp      Fare      male       Age  FamilySize  \
Parch       1.000000  0.414838  0.216225 -0.245489 -0.172482    0.783111   
SibSp       0.414838  1.000000  0.159651 -0.114631 -0.233296    0.890712   
Fare        0.216225  0.159651  1.000000 -0.182333  0.096688    0.217138   
male       -0.245489 -0.114631 -0.182333  1.000000  0.081163   -0.200988   
Age        -0.172482 -0.233296  0.096688  0.081163  1.000000   -0.245619   
FamilySize  0.783111  0.890712  0.217138 -0.200988 -0.245619    1.000000   
HaveFamily  0.583398  0.584471  0.271832 -0.303646 -0.171647    0.690922   
Class_1    -0.017633 -0.054582  0.591711 -0.098013  0.323896   -0.046114   
Class_2    -0.000734 -0.055932 -0.118557 -0.064746  0.015831   -0.038594   
Class_3     0.015790  0.092548 -0.413333  0.137143 -0.291955    0.071142   
child       0.361001  0.364654 -0.003117 -0.095692 -0.560466    0.429578   

            HaveFamily   Class_1   Class_2   Class_3     child  
Parch         0.583398 -0.017633 -0.000734  0.015790  0.361001  
SibSp         0.584471 -0.054582 -0.055932  0.092548  0.364654  
Fare          0.271832  0.591711 -0.118557 -0.413333 -0.003117  
male         -0.303646 -0.098013 -0.064746  0.137143 -0.095692  
Age          -0.171647  0.323896  0.015831 -0.291955 -0.560466  
FamilySize    0.690922 -0.046114 -0.038594  0.071142  0.429578  
HaveFamily    1.000000  0.113364  0.039070 -0.129472  0.349033  
Class_1       0.113364  1.000000 -0.288585 -0.626738 -0.128886  
Class_2       0.039070 -0.288585  1.000000 -0.565210  0.028373  
Class_3      -0.129472 -0.626738 -0.565210  1.000000  0.087957  
child         0.349033 -0.128886  0.028373  0.087957  1.000000  

png

logreg = LogisticRegression()

logreg.fit(X_train, Y_train.values.ravel())

Y_pred = logreg.predict(X_test)
logreg.score(X_train, Y_train)
0.80920314253647585

Including the family bits and Class increases the score. (!) Guess it wasn’t that good of an idea to disregard those at the beginning. Let’s now try a bunch of other methods.

# Import the random forest package
from sklearn.ensemble import RandomForestClassifier 
from sklearn.model_selection import learning_curve


# Create the random forest object which will include all the parameters
# for the fit
forest=RandomForestClassifier(max_features='sqrt',max_depth=8,n_estimators=240)

# Fit the training data to the Survived labels and create the decision trees
forest.fit(X_train,Y_train.values.ravel())

# Take the same decision trees and run it on the test data
Y_pred = forest.predict(X_test)
forest.score(X_train, Y_train)
# What features are important?
features = pd.DataFrame()
features['feature'] = X_train.columns
features['importance'] = forest.feature_importances_
features.sort(['importance'],ascending=False)
/home/artir/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:20: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
featureimportance
3male0.364375
2Fare0.201354
4Age0.154100
9Class_30.075587
5FamilySize0.056493
7Class_10.034916
1SibSp0.031640
10child0.028228
0Parch0.022727
8Class_20.017080
6HaveFamily0.013501
from sklearn.model_selection import validation_curve
X, y = X_train,Y_train['Survived']
#X, y = train_new,Y_train['Survived']
def cross_val(model,X,y,pname):
    param_range = np.array([100,110,140,150])
    train_scores, test_scores = validation_curve(
        model, X, y, param_name=pname, param_range=param_range, scoring="accuracy", n_jobs=4)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    plt.title("Validation Curve with Random Forest")
    plt.xlabel("Number of estimators")
    plt.ylabel("Score")
    lw = 2
    plt.plot(param_range, train_scores_mean, label="Training score",
                 color="darkorange", lw=lw)
    plt.fill_between(param_range, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.2,
                     color="darkorange", lw=lw)
    plt.plot(param_range, test_scores_mean, label="Cross-validation score",
                 color="navy", lw=lw)
    plt.fill_between(param_range, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.2,
                     color="navy", lw=lw)
    plt.legend(loc="best")
    plt.show()
cross_val(RandomForestClassifier(),X,y,'n_estimators')

png

from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt



title = "Learning Curves (RandomForest)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
#cv = ShuffleSplit(n_splits=20, test_size=0.33, random_state=0)

#estimator=RandomForestClassifier(n_estimators=100)
#plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4)

Hmm, underfitting :/

The prediction seems good enough for now, let’s submit’…

… Or maybe not. First public submission got 0.65. Time to try another method?

About 0.775 with RandomForests. One final trial, with XGBoost

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
# fit model no training data
model = XGBClassifier()
model_params = {
    'learning_rate': [0.05, 0.1],
    'n_estimators': [100,150,200],
    'max_depth': [2, 3,7, 10],
}
cv = StratifiedKFold()
cv.get_n_splits(X_train,Y_train)
grid= GridSearchCV(model,model_params,scoring='roc_auc',cv=cv,verbose=2)
best_params={'n_estimators': 100, 'max_depth': 2, 'learning_rate': 0.05}
#grid.fit(X_train, Y_train.values.ravel())
model = XGBClassifier(n_estimators= 100, max_depth= 2, learning_rate= 0.05)
model.fit(X_train,Y_train.values.ravel())

# Take the same decision trees and run it on the test data
Y_pred = model.predict(X_test)
#Y_pred = [round(value) for value in Y_pred]
#cross_val(XGBClassifier(),X,y,'n_estimators')
#About 200 seems okay
title = "Learning Curves (RandomForest)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=10, test_size=0.33, random_state=0)
mpl.rcParams['figure.figsize'] = [8.0, 6.0]

plot_learning_curve(model, title, X, y, ylim=(0.7, 1.01), cv=cv)
<module 'matplotlib.pyplot' from '/home/artir/anaconda3/lib/python3.5/site-packages/matplotlib/pyplot.py'>

png

import matplotlib as mpl
sns.set()
cross_val(model,X,y,'n_estimators')

png

test.head()
PclassSexAgeSibSpParchFareEmbarkedmaleClass_1Class_2Class_3FamilySizeHaveFamilychild
03male34.5007.8292Q1001000
13female47.0107.0000S0001110
22male62.0009.6875Q1010000
33male27.0008.6625S1001000
43female22.01112.2875S0001210
submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })
submission.to_csv('titanic.csv', index=False)
import sys
print(sys.version)
3.5.2 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]

Lessons learned

  1. Establishing a process for analysing the data is important, even if that process requires iteration
  2. It really comes down to explore the data, generate features, then feed to the algorithm
  3. Which algorithm? Apparently XGBoost is the best. Also, try neural nets with keras.
  4. Correlation plots are interesting to heuristically decide what to keep/drop. In the end
  5. Some plots are better than others, depending on the case