Titanic survival prediction

library(tidyr)  # provides drop_na()

df <- read.csv("train.csv")
df <- drop_na(df, Age)  # keep only passengers with a known age
df$Survived <- factor(df$Survived)
df$Sex <- factor(df$Sex)
df$Pclass <- factor(df$Pclass)
summary(df)
  PassengerId    Survived Pclass      Name               Sex     
 Min.   :  1.0   0:424    1:186   Length:714         female:261  
 1st Qu.:222.2   1:290    2:173   Class :character   male  :453  
 Median :445.0            3:355   Mode  :character               
 Mean   :448.6                                                   
 3rd Qu.:677.8                                                   
 Max.   :891.0                                                   
      Age            SibSp            Parch           Ticket         
 Min.   : 0.42   Min.   :0.0000   Min.   :0.0000   Length:714        
 1st Qu.:20.12   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
 Median :28.00   Median :0.0000   Median :0.0000   Mode  :character  
 Mean   :29.70   Mean   :0.5126   Mean   :0.4314                     
 3rd Qu.:38.00   3rd Qu.:1.0000   3rd Qu.:1.0000                     
 Max.   :80.00   Max.   :5.0000   Max.   :6.0000                     
      Fare           Cabin             Embarked        
 Min.   :  0.00   Length:714         Length:714        
 1st Qu.:  8.05   Class :character   Class :character  
 Median : 15.74   Mode  :character   Mode  :character  
 Mean   : 34.69                                        
 3rd Qu.: 33.38                                        
 Max.   :512.33                                        

Comparing survival rate of different classes

plot(df$Survived, df$Pclass)

Among Class 3 passengers, those who did not survive clearly outnumber those who survived.

plot(df$Survived, df$Sex)

The difference between male and female passengers is even larger: males were much less likely to survive.
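The class and sex comparisons above can also be expressed as numbers rather than read off the plots. The following is a minimal sketch: `rate_by` is a hypothetical helper (not part of the original analysis), and `toy` is a tiny made-up data frame used only so the example runs on its own. With the real data, `df` would be passed in instead.

```r
# Hypothetical helper: row-wise proportions of each Survived level within a group.
rate_by <- function(data, group) {
  prop.table(table(data[[group]], data$Survived), margin = 1)
}

# Made-up illustration data, NOT the Titanic data set.
toy <- data.frame(
  Survived = factor(c(0, 0, 0, 1, 1, 0, 1, 1)),
  Pclass   = factor(c(3, 3, 3, 3, 1, 1, 1, 2)),
  Sex      = factor(c("male", "male", "male", "female",
                      "male", "female", "female", "female"))
)

by_class <- rate_by(toy, "Pclass")  # each row sums to 1
by_sex   <- rate_by(toy, "Sex")

print(by_class)
print(by_sex)
```

Each row of the result sums to 1, so the column for `1` is the survival rate within that group.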

plot(df$Survived, df$Age)

For age, there is no clearly visible difference between those who survived and those who did not, so we run a t-test to compare the two group means.

survived_index <- which(df$Survived==1)

test <- t.test(df$Age[survived_index], df$Age[-survived_index])
test$p.value
[1] 0.04118965

The p-value of the t-test is 0.0411897, so at the 5% level we reject the null hypothesis that the mean age is the same for those who survived and those who did not.

The mean age of those who did not survive is around 30, which is higher than that of the survivors (around 28).
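The group means quoted above are not printed by the t-test code, so here is a small sketch of one way to get them alongside the test. It uses the formula interface of `t.test`, which avoids negative indexing with `which()` (that indexing returns an empty vector if no row matches). By default `t.test` performs Welch's test, which does not assume equal variances. The `toy` data frame is made up for illustration only; with the real data, `df` would be used instead.

```r
# Made-up illustration data, NOT the Titanic data set.
toy <- data.frame(
  Survived = factor(c(0, 0, 0, 0, 1, 1, 1, 1)),
  Age      = c(30, 45, 22, 35, 28, 4, 36, 40)
)

# Mean age per outcome group (names are the factor levels "0" and "1").
group_means <- tapply(toy$Age, toy$Survived, mean)
print(group_means)

# Welch two-sample t-test of Age between the two Survived groups.
welch <- t.test(Age ~ Survived, data = toy)
print(welch$p.value)
```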

todo model
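The model code itself is not shown here, only its output. As a rough sketch of how such a fit and evaluation could look, the block below uses the formula that appears in the output (`Survived ~ Sex + Pclass + Age`). Everything else is an assumption: the data are simulated from an arbitrary logistic model, and the 70/30 split, the seed, the 0.5 cut-off, and the names `toy`, `test_data`, `prob`, `pred` and `cm` are chosen for illustration only.

```r
set.seed(1)  # assumed seed, for reproducibility of this sketch only
n <- 400

# Simulated illustration data, NOT the Titanic data set.
toy <- data.frame(
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = factor(sample(1:3, n, replace = TRUE)),
  Age    = round(runif(n, 1, 70))
)
eta <- 2 - 2 * (toy$Sex == "male") -
  0.8 * (as.integer(toy$Pclass) - 1) - 0.02 * toy$Age
toy$Survived <- factor(rbinom(n, 1, plogis(eta)))

# Assumed 70/30 train/test split.
train_idx  <- sample(seq_len(n), size = floor(0.7 * n))
train_data <- toy[train_idx, ]
test_data  <- toy[-train_idx, ]

# Logistic regression with the same formula as in the output.
model <- glm(Survived ~ Sex + Pclass + Age, family = binomial,
             data = train_data)

# Predicted survival probabilities, thresholded at an assumed 0.5.
prob <- predict(model, newdata = test_data, type = "response")
pred <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are actual outcomes.
cm <- table(pred, actual = test_data$Survived)
print(cm)
```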


Call:
glm(formula = Survived ~ Sex + Pclass + Age, family = binomial, 
    data = train_data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.637472   0.469343   7.750 9.18e-15 ***
Sexmale     -2.498501   0.249152 -10.028  < 2e-16 ***
Pclass2     -1.348374   0.330159  -4.084 4.43e-05 ***
Pclass3     -2.702099   0.341989  -7.901 2.76e-15 ***
Age         -0.029290   0.008898  -3.292 0.000995 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.73  on 496  degrees of freedom
Residual deviance: 449.95  on 492  degrees of freedom
  (128 observations deleted due to missingness)
AIC: 459.95

Number of Fisher Scoring iterations: 5
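The coefficients above are on the log-odds scale, which is hard to read directly. As a sketch, the estimates can be copied from the summary and exponentiated to get odds ratios, and the linear predictor can be passed through the logistic function to get a predicted probability. The example passenger (male, third class, age 22) is chosen only for illustration, and the variable names are not from the original analysis.

```r
# Estimates copied from the summary output above.
coefs <- c(`(Intercept)` = 3.637472,
           Sexmale       = -2.498501,
           Pclass2       = -1.348374,
           Pclass3       = -2.702099,
           Age           = -0.029290)

# Odds ratios relative to the baseline (female, first class);
# for Age, the ratio is per additional year.
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# Predicted survival probability for an illustrative passenger:
# male, third class, 22 years old.
eta <- coefs[["(Intercept)"]] + coefs[["Sexmale"]] +
  coefs[["Pclass3"]] + coefs[["Age"]] * 22
p_survive <- plogis(eta)
print(round(p_survive, 3))
```

An odds ratio below 1 means lower odds of survival than the baseline, so all four predictors here reduce the odds.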
    
pred   0   1
   0 144  41
   1  20  61
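The confusion matrix above can be summarised with a few standard rates. This sketch re-enters the four counts by hand and assumes that rows are predictions and columns are the actual outcomes (the column label is not shown in the output).

```r
# Counts copied from the table above; columns assumed to be actual outcomes.
cm <- matrix(c(144, 20, 41, 61), nrow = 2,
             dimnames = list(pred = c("0", "1"), actual = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)           # correct / all
sensitivity <- cm["1", "1"] / sum(cm[, "1"])     # survivors correctly found
specificity <- cm["0", "0"] / sum(cm[, "0"])     # non-survivors correctly found

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```

Under that assumption the accuracy is 205/266, about 0.771, with sensitivity about 0.598 and specificity about 0.878, so the model identifies non-survivors more reliably than survivors.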
  PassengerId Survived Pclass
1           1        0      3
2           2        1      1
3           3        1      3
5           5        0      3
6           6        0      3
8           8        0      3
                                                 Name    Sex Age SibSp Parch
1                             Braund, Mr. Owen Harris   male  22     1     0
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
3                              Heikkinen, Miss. Laina female  26     0     0
5                            Allen, Mr. William Henry   male  35     0     0
6                                    Moran, Mr. James   male  NA     0     0
8                      Palsson, Master. Gosta Leonard   male   2     3     1
            Ticket    Fare Cabin Embarked
1        A/5 21171  7.2500              S
2         PC 17599 71.2833   C85        C
3 STON/O2. 3101282  7.9250              S
5           373450  8.0500              S
6           330877  8.4583              Q
8           349909 21.0750              S

Call:
glm(formula = Survived ~ Pclass + Sex + Age, family = "binomial", 
    data = train_data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.637472   0.469343   7.750 9.18e-15 ***
Pclass2     -1.348374   0.330159  -4.084 4.43e-05 ***
Pclass3     -2.702099   0.341989  -7.901 2.76e-15 ***
Sexmale     -2.498501   0.249152 -10.028  < 2e-16 ***
Age         -0.029290   0.008898  -3.292 0.000995 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.73  on 496  degrees of freedom
Residual deviance: 449.95  on 492  degrees of freedom
  (128 observations deleted due to missingness)
AIC: 459.95

Number of Fisher Scoring iterations: 5