library(tidyr)  # provides drop_na()

df <- read.csv("train.csv")
df <- drop_na(df, Age)  # keep only rows with a known Age
df$Survived <- factor(df$Survived)
df$Sex <- factor(df$Sex)
df$Pclass <- factor(df$Pclass)

Titanic survival prediction
summary(df)
  PassengerId    Survived Pclass      Name               Sex
 Min.   :  1.0   0:424    1:186   Length:714         female:261
 1st Qu.:222.2   1:290    2:173   Class :character   male  :453
 Median :445.0            3:355   Mode  :character
 Mean   :448.6
 3rd Qu.:677.8
 Max.   :891.0
      Age            SibSp            Parch           Ticket
 Min.   : 0.42   Min.   :0.0000   Min.   :0.0000   Length:714
 1st Qu.:20.12   1st Qu.:0.0000   1st Qu.:0.0000   Class :character
 Median :28.00   Median :0.0000   Median :0.0000   Mode  :character
 Mean   :29.70   Mean   :0.5126   Mean   :0.4314
 3rd Qu.:38.00   3rd Qu.:1.0000   3rd Qu.:1.0000
 Max.   :80.00   Max.   :5.0000   Max.   :6.0000
      Fare           Cabin             Embarked
 Min.   :  0.00   Length:714         Length:714
 1st Qu.:  8.05   Class :character   Class :character
 Median : 15.74   Mode  :character   Mode  :character
 Mean   : 34.69
 3rd Qu.: 33.38
 Max.   :512.33
Comparing survival rate of different classes
#plot(df$Survived, as.numeric(df$Pclass))
Among third-class passengers, far more did not survive than survived.
plot(df$Survived, df$Sex)
There is an even larger difference between male and female passengers: males were much less likely to survive.
plot(df$Survived, df$Age)
For age, there is no clearly visible difference between those who survived and those who did not, so we run a t-test to compare the mean ages of the two groups.
survived_index <- which(df$Survived==1)
test <- t.test(df$Age[survived_index], df$Age[-survived_index])
test$p.value
[1] 0.04118965
The p-value of the t-test is 0.0411897, so at the 5% level we reject the null hypothesis that the mean age is the same for those who survived and those who did not.
The mean age of those who did not survive is around 30, which is higher than that of the survivors (around 28).
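The same comparison can also be written with the formula interface of t.test(), which avoids the manual index. This is a minimal sketch, not the code used for the result above: the `toy` data frame and its ages are made up for illustration only.

```r
# Made-up ages for illustration only; these are NOT the Titanic values
toy <- data.frame(
  Survived = factor(c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)),
  Age      = c(34, 29, 41, 25, 38, 22, 30, 19, 27, 35)
)

# Welch two-sample t-test of Age by survival group
# (t.test() uses var.equal = FALSE by default)
res <- t.test(Age ~ Survived, data = toy)

# Group means, to see which group is older
group_means <- tapply(toy$Age, toy$Survived, mean)

print(res$p.value)
print(group_means)
```

Printing the group means next to the p-value makes it harder to misread the direction of the difference.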
todo model
Call:
glm(formula = Survived ~ Sex + Pclass + Age, family = binomial,
    data = train_data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.637472   0.469343   7.750 9.18e-15 ***
Sexmale     -2.498501   0.249152 -10.028  < 2e-16 ***
Pclass2     -1.348374   0.330159  -4.084 4.43e-05 ***
Pclass3     -2.702099   0.341989  -7.901 2.76e-15 ***
Age         -0.029290   0.008898  -3.292 0.000995 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.73  on 496  degrees of freedom
Residual deviance: 449.95  on 492  degrees of freedom
  (128 observations deleted due to missingness)
AIC: 459.95

Number of Fisher Scoring iterations: 5
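The coefficients above are on the log-odds scale, which is hard to read directly. The sketch below copies the reported estimates by hand and converts them to odds ratios. It then computes the predicted survival probability for one hypothetical passenger, a 22-year-old male in third class; that passenger is an assumption chosen for illustration, not a row from the analysis.

```r
# Coefficients copied from the summary output above
coefs <- c(
  "(Intercept)" = 3.637472,
  Sexmale       = -2.498501,
  Pclass2       = -1.348374,
  Pclass3       = -2.702099,
  Age           = -0.029290
)

# Odds ratios: multiplicative change in the odds of survival
# for a one-unit change in each predictor (intercept excluded)
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# Hypothetical passenger: 22-year-old male in third class
log_odds <- coefs[["(Intercept)"]] + coefs[["Sexmale"]] +
  coefs[["Pclass3"]] + coefs[["Age"]] * 22

# plogis() is the inverse logit: 1 / (1 + exp(-x))
p_survive <- plogis(log_odds)
print(p_survive)
```

All odds ratios are below 1, matching the negative signs in the table: being male, travelling in a lower class, and being older are each associated with lower odds of survival in this model.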
pred   0   1
   0 144  41
   1  20  61
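A confusion matrix like the one above is easier to judge once it is reduced to a single accuracy figure. The sketch below re-enters the four counts by hand. The column label `actual` is an assumption, since the output only names the `pred` dimension, but the accuracy does not depend on which dimension is which.

```r
# Counts copied from the confusion matrix above
# (matrix() fills column by column: first column 144, 20; second column 41, 61)
conf <- matrix(
  c(144, 20, 41, 61),
  nrow = 2,
  dimnames = list(pred = c("0", "1"), actual = c("0", "1"))
)

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- sum(diag(conf)) / sum(conf)
print(accuracy)
```

With these counts, 205 of 266 cases fall on the diagonal, an accuracy of about 0.77.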
  PassengerId Survived Pclass
1           1        0      3
2           2        1      1
3           3        1      3
5           5        0      3
6           6        0      3
8           8        0      3
                                                 Name    Sex Age SibSp Parch
1                             Braund, Mr. Owen Harris   male  22     1     0
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
3                              Heikkinen, Miss. Laina female  26     0     0
5                            Allen, Mr. William Henry   male  35     0     0
6                                    Moran, Mr. James   male  NA     0     0
8                      Palsson, Master. Gosta Leonard   male   2     3     1
            Ticket    Fare Cabin Embarked
1        A/5 21171  7.2500              S
2         PC 17599 71.2833   C85        C
3 STON/O2. 3101282  7.9250              S
5           373450  8.0500              S
6           330877  8.4583              Q
8           349909 21.0750              S
Call:
glm(formula = Survived ~ Pclass + Sex + Age, family = "binomial",
    data = train_data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.637472   0.469343   7.750 9.18e-15 ***
Pclass2     -1.348374   0.330159  -4.084 4.43e-05 ***
Pclass3     -2.702099   0.341989  -7.901 2.76e-15 ***
Sexmale     -2.498501   0.249152 -10.028  < 2e-16 ***
Age         -0.029290   0.008898  -3.292 0.000995 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.73  on 496  degrees of freedom
Residual deviance: 449.95  on 492  degrees of freedom
  (128 observations deleted due to missingness)
AIC: 459.95

Number of Fisher Scoring iterations: 5
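The code that produced the fit and the confusion matrix is not shown here, so the following is only a sketch of how such a pipeline might look. Everything except the glm() formula is an assumption: the `toy` data are simulated and are not the Titanic passengers, and the 70/30 split and the 0.5 cutoff are illustrative choices.

```r
# Simulated stand-in data; NOT the real Titanic passengers
set.seed(1)
n <- 400
toy <- data.frame(
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = factor(sample(1:3, n, replace = TRUE)),
  Age    = round(runif(n, 1, 70))
)

# Simulate survival from an arbitrary logistic relationship
lin <- 2 - 2 * (toy$Sex == "male") -
  0.8 * (as.integer(toy$Pclass) - 1) - 0.02 * toy$Age
toy$Survived <- factor(rbinom(n, 1, plogis(lin)))

# Assumed 70/30 train/test split
train_idx  <- sample(seq_len(n), size = floor(0.7 * n))
train_data <- toy[train_idx, ]
test_data  <- toy[-train_idx, ]

# Same model formula as in the summary above
model <- glm(Survived ~ Sex + Pclass + Age, family = binomial,
             data = train_data)

# Predicted survival probabilities on the held-out rows
prob <- predict(model, newdata = test_data, type = "response")
pred <- as.integer(prob > 0.5)  # assumed 0.5 cutoff

# Cross-tabulate predictions against the observed outcome
conf <- table(pred = pred, actual = test_data$Survived)
print(conf)
```

Evaluating on rows that were not used for fitting keeps the confusion matrix from overstating how well the model generalises.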