The sinking of the Titanic is one of the most famous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One reason the shipwreck caused such loss of life was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some element of luck, some groups were more likely to survive than others, such as women, children, and the upper class. In this case study we will use machine learning to predict which passengers survived the tragedy.

Dataset link: https://pan.baidu.com/s/1bVnIM5JVZjib1znZIDn10g (extraction code: 1htm).

Read the titanic_train dataset:

    import pandas

    # Read the training dataset
    titanic = pandas.read_csv('titanic_train.csv')
    titanic.head(10)

View the first 10 rows of the dataset.

Feature descriptions

Feature      Explanation
PassengerId  Passenger ID; has no effect on the outcome
Survived     1 means survived, 0 means did not survive
Pclass       Cabin class; wealthier passengers had higher classes, so it affects the outcome
Name         Passenger name; assumed for now to have no effect on the outcome
Sex          Gender; women and children first, so it certainly affects the outcome
Age          Age; this column clearly matters as well
SibSp        Number of siblings/spouses aboard; also affects the outcome
Parch        Number of parents/children aboard; this matters too
Ticket       Ticket number; appears to have little effect
Fare         Ticket fare; like cabin class, it cannot be ignored
Cabin        Cabin number; probably has little effect
Embarked     Port of embarkation; passengers boarding at different ports may have different backgrounds

We can see that the Age column contains missing values (NaN). Generally there are two ways to handle missing data: fill in the missing values, or drop the feature entirely. Since Age usually has a fairly large influence on the outcome, we fill in the missing values here, using the median of the column.

    # Fill missing values in the Age column with the median
    titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
    print(titanic.describe())

View the dataset's summary statistics after the fill.


Machine learning algorithms generally cannot handle string-valued features directly. Since we want to classify the "0" and "1" values in the Survived column, we need to convert the "Sex" column into numeric values, replacing "male" with 0 and "female" with 1.

    # print(titanic['Sex'].unique())

    # Replace all occurrences of 'male' with the number 0,
    # and 'female' with the number 1
    titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
    titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
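As a side note, the same mapping can be written in a single call with Series.map; a small sketch, meant as an alternative to the two .loc lines above rather than something to run in addition to them:

    # Alternative: map both values at once (assumes the column contains
    # only 'male' and 'female'; any other value would become NaN)
    titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})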

We also apply the same treatment to the "Embarked" column:

    # print(titanic['Embarked'].unique())

    # Embarked: port of embarkation, with three text values (S/C/Q), which is
    # inconvenient for analysis, so we map it to numeric values; it also has
    # 2 missing values, which we fill with the column's mode, 'S'
    titanic['Embarked'] = titanic['Embarked'].fillna('S')
    titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
    titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
    titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
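If you prefer not to hard-code 'S', the mode can be computed from the data itself; a small sketch, as an alternative to the fillna('S') line above (it must run before the S/C/Q values are mapped to numbers):

    # Fill missing ports with the most frequent value in the column
    most_common_port = titanic['Embarked'].mode()[0]
    titanic['Embarked'] = titanic['Embarked'].fillna(most_common_port)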

First, let's use linear regression to perform the classification:

    # Import the linear regression class
    # (be careful not to import from the wrong module)
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold

    # The columns we'll use to predict the target
    predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

    # Initialize our algorithm class
    alg = LinearRegression()

    # Generate cross-validation folds for the Titanic dataset. kf.split returns
    # the row indices corresponding to the train and test folds.
    # The old signature kf = KFold(titanic.shape[0], n_folds=3, random_state=1)
    # has been deprecated.

    # Split the samples into 3 equal parts for 3-fold cross-validation.
    # Note: random_state only takes effect when shuffle=True; recent versions
    # of scikit-learn raise an error if it is set while shuffle=False.
    kf = KFold(n_splits=3, shuffle=False)

    # Note: do not write kf.split(titanic.shape[0]) here; it raises:
    # Singleton array array(891) cannot be considered a valid collection.

    predictions = []
    # Cross-validation: split into training and validation folds
    for train, test in kf.split(titanic):
        # The predictors we're using to train the algorithm.
        # Note how we only take the rows in the train folds.
        train_predictors = titanic[predictors].iloc[train, :]
        # The target we're using to train the algorithm.
        train_target = titanic['Survived'].iloc[train]
        # Train the algorithm using the predictors and target.
        alg.fit(train_predictors, train_target)
        # We can now make predictions on the test fold.
        test_predictions = alg.predict(titanic[predictors].iloc[test, :])
        predictions.append(test_predictions)

Check the accuracy of the linear regression:

    import numpy as np

    # The predictions are in three separate numpy arrays. Concatenate them
    # into one. We concatenate them on axis 0, as they only have one axis.
    predictions = np.concatenate(predictions, axis=0)

    # Map predictions to outcomes (the only possible outcomes are 1 and 0)
    # so we can compute an accuracy
    predictions[predictions > .5] = 1
    predictions[predictions <= .5] = 0

    # Note: this line differs slightly from the original source code
    accuracy = sum(predictions == titanic['Survived']) / len(predictions)

    # Accuracy on the validation folds
    print(accuracy)

The resulting accuracy is:

    0.7833894500561167
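For reference, the same figure can be obtained with scikit-learn's built-in metric; a minimal sketch, assuming "predictions" is the thresholded array computed above:

    from sklearn.metrics import accuracy_score

    # Equivalent to the manual sum(...)/len(...) computation above
    print(accuracy_score(titanic['Survived'], predictions))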

For a binary classification problem, this accuracy is not very good, so next we try logistic regression.

    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression

    # Note: recent scikit-learn versions may emit convergence warnings here;
    # if so, increasing max_iter is a common fix.
    alg = LogisticRegression(random_state=1)
    # Compute the accuracy score for all the cross-validation folds
    # (much simpler than what we did before!)
    scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
    # Take the mean of the scores (because we have one for each fold)
    print(scores.mean())

The resulting accuracy is shown below; we can see the result is a bit better.

    0.8047138047138048

All of the results above come from classifying the validation folds produced by cross-validation; for the actual result, the classification should be run on the separate test dataset.

Read the test dataset, fill in its missing values, and apply the same numeric mappings as above:

    titanic_test = pandas.read_csv("test.csv")
    # Fill Age with the training set's median, and Fare with the test set's median
    titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())
    titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())
    titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
    titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
    titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")

    titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
    titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
    titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2
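The post prepares the test set but does not show the final prediction step. Below is a minimal sketch of what it could look like, using the logistic regression model from earlier; the "submission.csv" filename and the DataFrame layout are my own assumptions, not from the original:

    from sklearn.linear_model import LogisticRegression

    # Fit on the full training data and predict on the prepared test set
    alg = LogisticRegression(random_state=1)
    alg.fit(titanic[predictors], titanic["Survived"])
    test_predictions = alg.predict(titanic_test[predictors])

    # Assemble a Kaggle-style submission file (hypothetical filename)
    submission = pandas.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": test_predictions,
    })
    submission.to_csv("submission.csv", index=False)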

From the above, algorithms like linear regression and logistic regression do not seem quite good enough, so this time we try a random forest (in general, random forests perform somewhat better than linear or logistic regression). Pay attention to how the random forest parameters change below.

    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import KFold
    from sklearn.ensemble import RandomForestClassifier

    predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

    # Initialize our algorithm with the default parameters:
    # n_estimators is the number of trees we want to build,
    # min_samples_split is the minimum number of rows we need to make a split,
    # min_samples_leaf is the minimum number of samples we can have at the
    # place where a tree branch ends (the leaves of the tree)
    alg = RandomForestClassifier(random_state=1,
                                 n_estimators=10,
                                 min_samples_split=2,
                                 min_samples_leaf=1)
    # Compute the accuracy score for all the cross-validation folds
    # (much simpler than what we did before!)
    kf = KFold(n_splits=3, shuffle=False)
    scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

    # Take the mean of the scores (because we have one for each fold)
    print(scores.mean())

The accuracy is:

    0.7856341189674523

The accuracy is still not great. In machine learning, parameter tuning is very important; models are usually optimized by adjusting their parameters. This time we tune the random forest's parameters.

    # More trees, and more conservative split/leaf sizes to reduce overfitting
    alg = RandomForestClassifier(random_state=1,
                                 n_estimators=100,
                                 min_samples_split=4,
                                 min_samples_leaf=2)
    # Compute the accuracy score for all the cross-validation folds
    # (much simpler than what we did before!)
    kf = KFold(n_splits=3, shuffle=False)
    scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

    # Take the mean of the scores (because we have one for each fold)
    print(scores.mean())

The resulting accuracy is:

    0.8148148148148148
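Manual tuning like this can also be automated with a grid search. A minimal sketch; the parameter grid below is my own illustrative choice, not from the original post:

    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier

    # Search a small grid of random forest hyperparameters with 3-fold CV
    param_grid = {
        "n_estimators": [50, 100, 200],
        "min_samples_split": [2, 4, 8],
        "min_samples_leaf": [1, 2, 4],
    }
    grid = GridSearchCV(RandomForestClassifier(random_state=1),
                        param_grid, cv=3, scoring="accuracy")
    grid.fit(titanic[predictors], titanic["Survived"])
    print(grid.best_params_)
    print(grid.best_score_)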

To be continued...

Copyright notice: this is an original article by xiaoyh, released under the CC 4.0 BY-SA license. Please include a link to the original source and this notice when reposting.
Original link: https://www.cnblogs.com/xiaoyh/p/11321780.html