This is the "Investigate a Dataset" project, part of Introduction to Programming Nanodegree.
In this project, a given dataset (Titanic data from Kaggle.com) will be analyzed using the Python libraries NumPy, Pandas, and Matplotlib.
Before beginning it is already possible to create some questions:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from scipy import stats
import statsmodels.api as sm
%matplotlib inline
#Load data and check how data looks like
data = '/home/camila/Desktop/Titanic/titanic_data.csv'
df_titanic = pd.read_csv(data)
print(df_titanic.head())
#Check if any data is missing
print(df_titanic.isnull().sum())
It looks like a lot of information on age and cabin is missing. There are many ways to deal with missing data. One of the most straight forward methods is to simply replace the missing values with the mean. This method, of course, does not change the mean, but attenuates any correlations involving the variable(s) that are imputed.
<a href= 'https://en.wikipedia.org/wiki/Imputation_(statistics)'> Source</a>#Let's replace missing age values by the mean:
mean_age = df_titanic['Age'].mean()
df_titanic['Age'] = df_titanic['Age'].fillna(mean_age)
print(df_titanic.isnull().sum())
#Now, check if any data is duplicated
print(df_titanic.duplicated().any())
df_titanic.describe()
It is possible to come to some observations:
#How many exactly survived? Answer: 342 people.
print(df_titanic['Survived'].value_counts().reset_index())
#How many were traveling for free? And how many paid the most? Answer: 15 and 3 people, in this order.
fare = df_titanic['Fare'].value_counts().reset_index()
fare_sorted = fare.sort_values(by=['index'], ascending=True)
print(fare_sorted.head())
print(fare_sorted.tail())
#Let's check how the variable Fare is distributed:
data_fare = df_titanic['Fare']
fig = plt.figure()
ax = fig.add_subplot(111)
ax.boxplot(data_fare)
ax.set_xlabel('Fare')
ax.set_ylabel('Fare price')
ax.set_title('Fare boxplot')
plt.show()
It is easy to see that there is an 'outlier'.
#Continuying, fare prices divided by classes:
first_class = df_titanic.loc[ df_titanic['Pclass'] == 1]
second_class = df_titanic.loc[ df_titanic['Pclass'] ==2]
third_class = df_titanic.loc[ df_titanic['Pclass'] == 3]
data_classes = (first_class['Fare'], second_class['Fare'], third_class['Fare'])
fig = plt.figure()
ax = fig.add_subplot(111)
ax.boxplot(data_classes)
ax.set_xlabel('Classes')
ax.set_ylabel('Fare price')
ax.set_title('Fare boxplot divided by classes')
plt.show()
It looks like some people paid an unfairly expensive ticket price. However, it is not only a matter of "abusive pricing": outliers may mislead our data interpretation.
#So let's deal with the 'fare outlier'... This took me a while to realize, but turns out the 'outlier' is
#actually more than one cell. So, we have to find a way to determine these cells and eliminate/replace it
#accordingly.
max_fare = df_titanic['Fare'].max()
print(max_fare)
max_fare_people = df_titanic.loc[ df_titanic['Fare'] == max_fare]
print (max_fare_people)
#Delete outlier rows:
df_titanic_wtout_out = df_titanic.drop(df_titanic.index[[258, 679, 737]])
# Deleting rows causes a mess with the indexes, so it is good to reset them. Moreover,
#I created a new variable to store the new shorter dataframe, just in case we need the complete one again.
new_df_titanic = df_titanic_wtout_out.reset_index(drop=True)
#New description of 'Fares' :
new_df_titanic['Fare'].describe()
#Moving to another variable, what was the proportion of women/men? Answer:314/577
print (df_titanic['Sex'].value_counts())
#What about the classes? How is this population distributed?
#Answer: The vast majority was in the third class (491 passengers), followed by the first class (216) and
#then second class(184).
pclass = df_titanic['Pclass'].value_counts().reset_index()
print (pclass)
#Now let's take a look at family status. First, parents and children on board:
parents_children = df_titanic['Parch'].value_counts().reset_index()
siblings = df_titanic['SibSp'].value_counts().reset_index()
print (parents_children)
#Now, let's check the number of siblings:
print (siblings)
#It looks like most had no kids nor spouses travelling together with them.
#Let's visualize:
explode = (0.05, 0, 0, 0, 0, 0, 0)
fig,(ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
ax1.pie(siblings['SibSp'], explode=explode, labels=siblings.index)
ax1.set_title('Number of siblings on board')
ax2.pie(parents_children['Parch'], explode=explode, labels=parents_children.index)
ax2.set_title('Number of parents/children on board')
plt.show()
Finally, a comment on the ports because I had no idea what this was all about. Southampton, is a place on the south coast of England and it was the location of the Titanic’s departure. Cherbourg in France was the first of two stops to pick up passengers. Queenstown on the South Coast of Southern Ireland was the last passenger pick-up of the Titanic. Now here is an interesting piece of information: "at Cherbourg,twenty-four passengers who had booked passage only cross-channel from Southampton left aboard the tenders to be conveyed to shore" and "at Queenstown eight people left the ship and 123 joined for the onward journey". Given these statements, I am not sure about people who survived because they left the ship before she headed to the US.
<a href = 'https://en.wikipedia.org/wiki/RMS_Titanic'> Source</a>#About the ports
ports = df_titanic['Embarked'].value_counts().reset_index()
print (ports)
#First let's see if there is any correlation between the parametric variables, specially regarding the survival
#column.
correlation = (df_titanic.corr())
print(correlation)
#Correlation matrix may help us visualize
labels = ['PassId', 'Surv', 'Class', 'Age', 'SibSp', 'Parch','Fare']
data_corr = correlation
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(data_corr, interpolation='nearest', cmap=cm.coolwarm, vmin= -1, vmax= 1)
fig.colorbar(cax)
ax.set_xticklabels(['']+labels)
ax.set_yticklabels(['']+labels)
ax.set_title('Heatmap of correlation coefficients')
plt.show()
As the table and matrix show, there is a weak negative correlation between survival and class. This means that the higher the value of 'class' (i.e. 3 or 2) the lesser chance of survival had the passenger. And of course there is an obvious but yet beautiful negative correlation between fare and class.
For the nominal variables (for example: sex, embarked, cabin, etc), we have to resort to other tests. The Cabin variable looks not so promising because much information is missing. However, the variable 'sex' can be of great value, in order to test if there is some correlation, I will use Fisher's test, the null hypothesis being that there is no difference between men and women.
#Create function to generate 2x2 table, between 2 variables:
def gen_cross_tab( var1, var2):
cross_tab = pd.crosstab( var1, var2)
return cross_tab
sex_survive = gen_cross_tab( new_df_titanic.Sex, new_df_titanic.Survived)
print (sex_survive)
#Now let's do a simple percentage of survival for both genders, just for curiosity.
percent_sex_survive = pd.crosstab(new_df_titanic.Sex, new_df_titanic.Survived).apply(lambda r: r/r.sum(), axis=1)
print (percent_sex_survive)
#Bar plot of percentages:
percent_sex_survive.plot(kind='bar', stacked=True)
plt.title('Survival vs Sex')
plt.xlabel('Pass. Sex')
plt.ylabel('Percent')
plt.legend(['Not survived', 'Survived'])
plt.show()
#It is very clear the chances were much better for women. Nonetheless,
#let's apply the Fisher's exact test to make it even clearer:
oddsratio, pvalue = stats.fisher_exact([[232, 81], [107, 468]])
print (oddsratio, pvalue)
Conclusion: the null hypothesis is refuted (p < 0.05), women had a better chance of survival (OR = 12.3, this means a woman had 12 times the odds of survival of a men).
#Talking about null hypothesis, let's test if the survived distribution we observed for the variable 'class'
#is due to chance (null hypothesis).
#Create crosstab, between 'class' and 'survival':
class_survive = gen_cross_tab(new_df_titanic['Pclass'], new_df_titanic['Survived'])
print(class_survive)
#Bar plot of survival rate (%) divided by classes:
pct_class_survive = pd.crosstab(new_df_titanic.Pclass, new_df_titanic.Survived).apply(lambda r: r/r.sum(), axis=1)
pct_class_survive.plot(kind='bar', stacked=True)
plt.title('Survival vs Class')
plt.xlabel('Pass. Class')
plt.ylabel('Percent')
plt.legend(['Not survived', 'Survived'])
plt.show()
#Chi-square independence test (null hypothesis: the differences are due to chance):
from scipy.stats import chi2_contingency
class_surv_obs = ([[133, 80], [87, 97], [119,372]])
chi2, p, dof, expected = chi2_contingency(class_surv_obs)
print(p)
Again, the null hypothesis was refuted, and in the Titanic, as a lot of times in life, it was better to be rich in the first class.
#Now let's try the variable 'embarked':
#Create crosstab, between 'Port embarked' and 'survival'
df_embrkd_survive = gen_cross_tab(new_df_titanic['Embarked'], new_df_titanic['Survived'])
print(df_embrkd_survive)
#Chi-squared independence test:
from scipy.stats import chi2_contingency
df_embrkd_survive = ([[90, 75], [30, 47], [217,427]])
chi2, p, dof, expected = chi2_contingency(df_embrkd_survive)
print (p)
It looks like again the null hypothesis can be discarded. But, to be very honest, I was not expecting this result, maybe some more analysis is needed? I suspect there could be a confusion factor in play.
#Create table embarked vs class:
embrkd_class = gen_cross_tab(new_df_titanic['Embarked'], new_df_titanic['Pclass'])
print(embrkd_class)
#Plot of class division, by port of embarkment:
pct_embrkd_class = pd.crosstab(new_df_titanic.Embarked,new_df_titanic.Pclass).apply(lambda r: r/r.sum(), axis=1)
pct_embrkd_class.plot(kind='bar', stacked=True)
plt.title('Embarked vs Class')
plt.xlabel('Port of embarkment')
plt.ylabel('Number of people')
plt.legend(['1st Class', '2nd Class', '3rd Class'])
plt.show()
So, there is in fact a distribution difference, that could be responsible for the p we have previously found. Maybe it would be nice to stratificate or create a logistic regression model to check the possibility of this correlation be due to a confounding factor (classes in this case).
#First we have to create dummy variables for the classes, 'get_dummies' does the one hot encoding:
def gen_dummy_var(series):
dummy_var = pd.get_dummies(series)
return dummy_var
dummy_class = gen_dummy_var(new_df_titanic['Pclass'])
print(dummy_class.head())
#Then we add the dummy variables to our dataset. The first class will be left out to avoid the so called
#'dummy variable trap'. Source: https://en.wikipedia.org/wiki/Dummy_variable_(statistics)
# Also, fare and age variables were added to the analysis, just for curiosity.
columns = ['Survived', 'Fare', 'Age']
data = new_df_titanic[columns].join(dummy_class.ix[:,'2':])
print(data.head())
#Then repeat the process for the ports:
dummy_emb = gen_dummy_var(new_df_titanic['Embarked'])
print(dummy_emb.head())
#Due to the 'dummy variable trap', I did not create a function to join the dummy variables to the dataframe.
data = data.join(dummy_emb.ix[:,'Q':])
print(data.head())
#Then,once more, repeat the process for the genders:
dummy_sex = gen_dummy_var(new_df_titanic['Sex'])
print(dummy_sex.head())
data = data.join(dummy_sex.ix[:,'female'])
print(data.head())
#Now the logit part:
train_cols = data.columns[1:]
logit = sm.Logit(new_df_titanic['Survived'], data[train_cols])
result = logit.fit()
print (result.summary())
#It is easier to assess the OR:
print(np.exp(result.params))
As we have previously learned, the chances of survival were much higher for women. What else can we learn about them?
# Split the dataframe, selecting only women:
female_titanic = new_df_titanic.loc[new_df_titanic['Sex'] == 'female']
female_titanic.describe()
As we can see, the mean age is not far from the original mean (28 y.), the oldest was 96 y.o. and the youngest less than 1 y.o.
From our previous results we know that women and first class passengers had a better chance of survival. Let's see how the women in the different classes survived:
#Let's see how the women in the different classes survived:
fem_class_survive = gen_cross_tab(female_titanic.Pclass,female_titanic.Survived)
print(fem_class_survive)
#Clearly the survival rate was much better for the first class. In the bar plot this should be even
#easier to spot:
fem_class_survive.plot(kind='bar')
plt.title('Survival in the different classes')
plt.xlabel('Pass. Class')
plt.ylabel('Number of people')
plt.legend(['Not survived', 'Survived'])
plt.show()
Clearly the first and second classes were better off.
Maybe it is possible to estimate how many women were married, by filtering their titles in the name cells. I got this idea from Kunal Jain's excellent website
#Create function to extract the first names:
def name_extractor(name):
last, first = name.split(',')
first_splited = first.split('.')
return first_splited[0]
#Apply function:
salutations = new_df_titanic['Name'].apply(name_extractor)
salutations.value_counts()
There was a Countess in the Titanic! Quick Google search reveals she could be the Countess of Rothes. Now, adding all the 'Mrs', 'Ms', 'Mme', it would return aproximately the number of married women (127 in total). Considering it was 1912, I am not expecting any of the 'Dr' to be a woman, sadly.
We know:
#No luck searching for names
print(df_titanic[df_titanic.Name == 'Rose DeWitt Bukater'])
print(df_titanic[df_titanic.Name == 'Jack Dawson'])
#Maybe James Cameron was protecting their identities, let's try profiles:
seventeen = female_titanic[female_titanic.Age == 17.0]
first_class = seventeen[seventeen.Pclass == 1]
southhampton = first_class[first_class.Embarked == "S"]
print(southhampton)
#Interesting, it looks like our Rose is married. Let's try Jack:
men = df_titanic[df_titanic.Sex == "male"]
third_class = men[men.Pclass == 1]
print(len(third_class.index))
#... And we have 122 candidates for Jack!
Conclusion: no Rose nor Jack were found :(