For the complete collection of machine learning algorithms, see fenghaootong-github.

Titanic

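Predict which passengers survived.

Dataset

Features:

  • Survived: whether the passenger survived (0 = no, 1 = yes)
  • Pclass: social class (1 = upper class, 2 = middle class, 3 = lower class)
  • Name: the passenger's name
  • Sex: the passenger's sex
  • Age: the passenger's age (may contain NaN)
  • SibSp: number of siblings and spouses the passenger had aboard
  • Parch: number of parents and children the passenger had aboard
  • Ticket: the passenger's ticket number
  • Fare: the fare the passenger paid for the ticket
  • Cabin: the passenger's cabin number (may contain NaN)
  • Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)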

Loading the data

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('../DATA/Titanic.csv', header=0)
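
Since Age and Cabin are flagged as possibly containing NaN, it can help to count missing values per column right after loading. The sketch below uses a small made-up frame (`toy`, not the real Titanic.csv) so that it runs on its own; on the real data the same call would be `df.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded data: values are made up for illustration.
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
    'Fare':  [7.25, 71.28, 7.92, 53.10],
})

# Count missing entries per column; a non-zero count means the column
# needs imputing or dropping before it can be fed to a model.
missing = toy.isnull().sum()
print(missing)
```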

Data preparation

  • Keep only three predictor variables
  • Fill in the missing Age values
  • Convert Pclass into three dummy variables
  • Convert Sex into a 0/1 variable
subdf = df[['Pclass', 'Sex', 'Age']]
y = df.Survived
# Fill missing ages with the mean age (sklearn's SimpleImputer, formerly Imputer, also works)
age = subdf['Age'].fillna(value=subdf.Age.mean())
# One dummy column per class (sklearn's OneHotEncoder also works)
pclass = pd.get_dummies(subdf['Pclass'], prefix='Pclass')
# 1 for male, 0 for female
sex = (subdf['Sex'] == 'male').astype('int')
X = pd.concat([pclass, age, sex], axis=1)
X.head()
   Pclass_1  Pclass_2  Pclass_3   Age  Sex
0         0         0         1  22.0    1
1         1         0         0  38.0    0
2         0         0         1  26.0    0
3         1         0         0  35.0    0
4         0         0         1  35.0    1
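
The comments above mention that scikit-learn's imputer and OneHotEncoder would work as well. A minimal sketch of that variant follows, assuming a recent scikit-learn where the imputer lives at `sklearn.impute.SimpleImputer`; the frame `toy` is made up for illustration and is not the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows standing in for df[['Pclass', 'Sex', 'Age']].
toy = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 2],
    'Sex':    ['male', 'female', 'female', 'female', 'male'],
    'Age':    [22.0, 38.0, np.nan, 35.0, np.nan],
})

# Replace missing ages with the column mean (same idea as fillna(mean)).
age = SimpleImputer(strategy='mean').fit_transform(toy[['Age']])

# One-hot encode Pclass; the default output is sparse, so convert to dense.
encoder = OneHotEncoder()
pclass = encoder.fit_transform(toy[['Pclass']]).toarray()

# Encode sex as 1 for male, 0 for female, shaped as a column vector.
sex = (toy['Sex'] == 'male').astype(int).to_numpy().reshape(-1, 1)

# Column order: Pclass_1, Pclass_2, Pclass_3, Age, Sex.
X_toy = np.hstack([pclass, age, sex])
print(X_toy.shape)
```

Unlike `pd.get_dummies`, a fitted encoder remembers its categories, so the same object can later transform new data consistently.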

Building the model

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
clf = clf.fit(X_train, y_train)
print("Accuracy: {:.2f}".format(clf.score(X_test, y_test)))
Accuracy: 0.83
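
The tree above is grown with `criterion='entropy'`, which scores candidate splits by information gain. As a rough illustration of that quantity (pure Python, with made-up labels rather than the Titanic data), the sketch below computes the entropy of a node and the gain from one candidate split. The helper names `entropy` and `information_gain` are hypothetical and are not scikit-learn functions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Made-up survival labels for eight passengers, split by a hypothetical sex test.
female = [1, 1, 1, 0]
male = [0, 0, 0, 1]
parent = female + male

gain = information_gain(parent, female, male)
print(round(entropy(parent), 3))
print(round(gain, 3))
```

A split that separates the classes more cleanly leaves purer children and therefore yields a larger gain.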
# Check which feature is most important
clf.feature_importances_
array([ 0.08398076,  0.        ,  0.23320717,  0.10534824,  0.57746383])
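
The array is easier to read once each value is paired with its column. The sketch below hard-codes the column names from the `X.head()` output and the importances printed above; in the notebook itself the equivalent would be `pd.Series(clf.feature_importances_, index=X.columns)`.

```python
import pandas as pd

# Column names and importances copied from the outputs shown earlier.
columns = ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Age', 'Sex']
importances = [0.08398076, 0.0, 0.23320717, 0.10534824, 0.57746383]

# Label each importance and sort from most to least important.
ranked = pd.Series(importances, index=columns).sort_values(ascending=False)
print(ranked)
```

With these numbers, Sex carries the most weight (about 0.58), followed by Pclass_3 and Age, while Pclass_2 contributes nothing to this tree.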

Cross-validation

from sklearn.model_selection import cross_val_score
scores1 = cross_val_score(clf, X, y, cv=10)
scores1
array([ 0.82222222,  0.82222222,  0.7752809 ,  0.87640449,  0.82022472,
        0.76404494,  0.7752809 ,  0.76404494,  0.83146067,  0.78409091])
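
The ten fold scores are easier to compare with other models when summarized by their mean and spread. A small sketch using the values printed above:

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above.
scores = np.array([0.82222222, 0.82222222, 0.7752809, 0.87640449, 0.82022472,
                   0.76404494, 0.7752809, 0.76404494, 0.83146067, 0.78409091])

mean = scores.mean()
std = scores.std()  # population standard deviation across the 10 folds

print("mean = {:.3f}, std = {:.3f}".format(mean, std))
```

The mean is about 0.804, with individual folds ranging from roughly 0.76 to 0.88, so a gap of about one percentage point between two models is small relative to the fold-to-fold spread.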
from sklearn import metrics
def measure_performance(X,y,clf, show_accuracy=True, 
                        show_classification_report=True, 
                        show_confusion_matrix=True):
    y_pred=clf.predict(X)   
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y,y_pred)),"\n")

    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y,y_pred),"\n")

    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y,y_pred),"\n")

measure_performance(X_test,y_test,clf, show_classification_report=True, show_confusion_matrix=True)
Accuracy:0.834 

Classification report
             precision    recall  f1-score   support

          0       0.85      0.88      0.86       134
          1       0.81      0.76      0.79        89

avg / total       0.83      0.83      0.83       223


Confusion matrix
[[118  16]
 [ 21  68]] 
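
The figures in the classification report can be recovered from the confusion matrix by hand. As a check, the sketch below recomputes accuracy, precision, recall, and F1 for class 1 (survived) from the four counts printed above, using scikit-learn's layout where rows are true labels and columns are predicted labels.

```python
# Counts copied from the confusion matrix above:
# rows = true class (0, 1), columns = predicted class (0, 1).
tn, fp = 118, 16
fn, tp = 21, 68

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of those predicted to survive, the share who did
recall = tp / (tp + fn)      # of those who survived, the share that was found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy={:.3f} precision={:.2f} recall={:.2f} f1={:.2f}".format(
    accuracy, precision, recall, f1))
# prints: accuracy=0.834 precision=0.81 recall=0.76 f1=0.79
```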

Comparison with a random forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
clf2 = RandomForestClassifier(n_estimators=1000, random_state=33)
clf2 = clf2.fit(X_train, y_train)
scores2 = cross_val_score(clf2, X, y, cv=10)
clf2.feature_importances_
scores2.mean(), scores1.mean()
(0.81262938372488946, 0.80352769265690616)
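Copyright notice: This is an original article by htfeng, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Article link: https://www.cnblogs.com/htfeng/p/9931754.html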