Decision Tree Algorithm Example
For the complete machine learning algorithm series, see fenghaootong-github.
Titanic
Predict which passengers survived the disaster.
Dataset
Data features:
- Survived: whether the passenger survived (0 = no, 1 = yes)
- Pclass: passenger class, a proxy for socio-economic status (1 = upper, 2 = middle, 3 = lower)
- Name: the passenger's name
- Sex: the passenger's sex
- Age: the passenger's age (may contain NaN)
- SibSp: number of siblings and spouses the passenger had aboard
- Parch: number of parents and children the passenger had aboard
- Ticket: the passenger's ticket number
- Fare: the fare the passenger paid
- Cabin: the passenger's cabin number (may contain NaN)
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Importing the data
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('../DATA/Titanic.csv', header=0)
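Before filling anything in, it helps to see where the NaNs actually are; a quick check on the frame loaded above:

# Count missing values per column: Age and Cabin are the columns
# flagged as possibly NaN in the feature list
print(df.isnull().sum())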
Data wrangling
- Keep only three predictor variables
- Fill in the missing Age values
- Convert the Pclass variable into three dummy variables
- Convert Sex into a 0/1 variable
subdf = df[['Pclass','Sex','Age']]
y = df.Survived
# sklearn's Imputer would work here too
age = subdf['Age'].fillna(value=subdf.Age.mean())
# sklearn's OneHotEncoder would work here too
pclass = pd.get_dummies(subdf['Pclass'], prefix='Pclass')
sex = (subdf['Sex']=='male').astype('int')
X = pd.concat([pclass, age, sex], axis=1)
X.head()
|   | Pclass_1 | Pclass_2 | Pclass_3 | Age  | Sex |
|---|----------|----------|----------|------|-----|
| 0 | 0        | 0        | 1        | 22.0 | 1   |
| 1 | 1        | 0        | 0        | 38.0 | 0   |
| 2 | 0        | 0        | 1        | 26.0 | 0   |
| 3 | 1        | 0        | 0        | 35.0 | 0   |
| 4 | 0        | 0        | 1        | 35.0 | 1   |
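As the comments in the wrangling code note, sklearn's own transformers can do the same job; a minimal sketch, assuming a modern sklearn (0.20+) where Imputer has become SimpleImputer:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Mean-impute Age, mirroring the fillna(mean) step above
age_imputed = SimpleImputer(strategy='mean').fit_transform(subdf[['Age']])

# One-hot encode Pclass, mirroring pd.get_dummies; .toarray() densifies
# the sparse matrix that OneHotEncoder returns by default
pclass_onehot = OneHotEncoder().fit_transform(subdf[['Pclass']]).toarray()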
Building the model
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
clf = clf.fit(X_train, y_train)
print("Accuracy: {:.2f}".format(clf.score(X_test, y_test)))
Accuracy: 0.83
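With only three features and max_depth=3, the fitted tree is small enough to read directly; a sketch using sklearn's Graphviz export (the filename and class labels here are just examples):

# Write the fitted tree in Graphviz .dot format for inspection
with open('titanic_tree.dot', 'w') as f:
    tree.export_graphviz(clf, out_file=f,
                         feature_names=list(X.columns),
                         class_names=['died', 'survived'])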
# Check which feature matters most
clf.feature_importances_
array([ 0.08398076, 0. , 0.23320717, 0.10534824, 0.57746383])
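The raw array is easier to read when paired with the column names; it shows Sex dominating, followed by Pclass_3:

# Print each feature next to its importance score
for name, score in zip(X.columns, clf.feature_importances_):
    print('{:10s} {:.3f}'.format(name, score))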
Cross-validation
from sklearn.model_selection import cross_val_score  # replaces the removed cross_validation module
scores1 = cross_val_score(clf, X, y, cv=10)
scores1
array([ 0.82222222, 0.82222222, 0.7752809 , 0.87640449, 0.82022472,
0.76404494, 0.7752809 , 0.76404494, 0.83146067, 0.78409091])
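A single summary number is easier to compare across models than ten fold scores:

# Mean and spread of the 10-fold scores
print('CV accuracy: {:.3f} +/- {:.3f}'.format(scores1.mean(), scores1.std()))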
from sklearn import metrics

def measure_performance(X, y, clf, show_accuracy=True,
                        show_classification_report=True,
                        show_confusion_matrix=True):
    y_pred = clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)), "\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y, y_pred), "\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y, y_pred), "\n")

measure_performance(X_test, y_test, clf, show_classification_report=True, show_confusion_matrix=True)
Accuracy:0.834

Classification report
             precision    recall  f1-score   support

          0       0.85      0.88      0.86       134
          1       0.81      0.76      0.79        89

avg / total       0.83      0.83      0.83       223

Confusion matrix
[[118  16]
 [ 21  68]]
Comparison with random forest
from sklearn.ensemble import RandomForestClassifier
clf2 = RandomForestClassifier(n_estimators=1000, random_state=33)
clf2 = clf2.fit(X_train, y_train)
scores2 = cross_val_score(clf2, X, y, cv=10)
clf2.feature_importances_
scores2.mean(), scores1.mean()
(0.81262938372488946, 0.80352769265690616)
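clf2.feature_importances_ was computed above but never printed; putting both models' importances side by side makes the comparison concrete:

# Per-feature importances: single tree vs. random forest
for name, t, rf in zip(X.columns, clf.feature_importances_,
                       clf2.feature_importances_):
    print('{:10s} tree={:.3f} forest={:.3f}'.format(name, t, rf))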
Copyright notice: this article is an original work by htfeng, released under the CC 4.0 BY-SA license; please include a link to the original source and this notice when reposting.