Data Mining in Practice (26): Algorithm Fundamentals (4): Decision Tree Algorithms

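1 Information Entropy and Conditional Entropy (Prerequisites)

1.1 The Relationship Between Entropy and Decision Trees

Introduction to decision trees: decision tree learning uses a top-down recursive approach. The basic idea is to use information entropy as the measure and build the tree along which entropy falls fastest, until the entropy at a leaf node reaches zero, at which point every instance in that leaf belongs to the same class.

The key step in a decision tree learning algorithm is choosing which attribute to split on in the current state. Depending on the objective function used, there are three main algorithms for building a decision tree:

ID3: Iterative Dichotomiser 3

C4.5

CART: Classification And Regression Tree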

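Put simply, entropy is the degree of disorder inside a system. A general market that sells a bit of everything is highly disordered, while a boutique that sells a single brand is far more uniform.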
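The market-versus-boutique intuition can be checked numerically with Shannon entropy, H = -Σ p_i log2(p_i). The following is a minimal sketch; the `entropy` helper and the two toy inventories are illustrative assumptions, not part of the original post.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical inventories: a market with four brands in equal amounts,
# and a boutique that carries a single brand.
market = ["A", "B", "C", "D"] * 5
boutique = ["A"] * 20

print(entropy(market))    # 2.0 bits: four equally likely brands
print(entropy(boutique))  # 0.0 bits: no uncertainty at all
```

The more evenly the items are spread over categories, the higher the entropy; a pure collection has entropy zero, which is the state a decision tree tries to reach at its leaves.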

1.2 ID3 / Information Gain
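ID3 chooses, at each node, the attribute with the largest information gain, that is, the entropy of the labels minus the conditional entropy of the labels given the attribute. Below is a simplified sketch of that top-down recursion, assuming categorical attributes only and no pruning. The function names and the tiny weather dataset are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """H(labels) minus the conditional entropy H(labels | attr)."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def id3(rows, labels, attrs):
    """Return a class label (leaf) or a nested dict {attr: {value: subtree}}."""
    if len(set(labels)) == 1:      # pure node: entropy is zero
        return labels[0]
    if not attrs:                  # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        tree[best][value] = id3(sub_rows, sub_labels, remaining)
    return tree


# Hypothetical toy data (not from the original post).
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "overcast", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "play", "play"]

print(id3(rows, labels, ["outlook", "windy"]))
```

On this data `outlook` has the higher information gain, so it becomes the root, and only the `rainy` branch needs a second split on `windy`.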

1.3 Code Experiment

import pandas as pd
from sklearn import preprocessing  # preprocessing utilities
from sklearn import tree  # decision tree models
from sklearn.datasets import load_iris  # the iris dataset

iris = load_iris()  # load the iris dataset

iris_feature_name = iris.feature_names  # sepal length, sepal width, petal length, petal width
iris_features = iris.data  # 150 rows x 4 columns
iris_target_name = iris.target_names  # setosa, versicolor, virginica
iris_target = iris.target  # setosa -> 0, versicolor -> 1, virginica -> 2

iris_feature_name  # sepal length, sepal width, petal length, petal width

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

iris_features[:5, :]  # continuous values

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

iris_target_name  # target labels

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

iris_features.shape

(150, 4)

# Build the model
clf = tree.DecisionTreeClassifier(max_depth=4)
clf = clf.fit(iris_features, iris_target)

iris_target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

clf

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

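import pydotplus  # renders the visualization by calling Graphviz (available through Anaconda)
from IPython.display import Image, display

You can treat the following as a template for drawing a decision tree.

dot_data = tree.export_graphviz(clf,  # the fitted decision tree
                                out_file=None,  # None: return the DOT source as a string instead of writing a file
                                feature_names=iris_feature_name,  # feature names
                                class_names=iris_target_name,  # class names
                                filled=True,  # color each node by its majority class
                                rounded=True  # draw nodes as rounded rectangles
                               )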
graph = pydotplus.graph_from_dot_data(dot_data)
display(Image(graph.create_png()))

2 CART and the Gini Index

2.1 CART / Classification Trees

2.2 CART / Regression Trees

Split points (how continuous-valued attributes are handled)
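For a continuous attribute, a common approach is to sort the observed values, take the midpoint between each pair of adjacent distinct values as a candidate threshold, and keep the candidate with the lowest weighted impurity. The sketch below does this with Gini impurity for a classification split. The helper names and the sample measurements are hypothetical, and real implementations are considerably more optimized.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between two equal values
        threshold = (lo + hi) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical petal lengths in cm (illustrative values only).
values = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]
labels = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

print(best_threshold(values, labels))  # (3.0, 0.0): both sides are pure
```

A regression tree follows the same search but would score each candidate by the variance (squared error) of the target on each side rather than by Gini impurity.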

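3 Decision Trees in Practice

3.1 Classifying Employee Income

# Toolkit for data processing and analysis
import pandas as pd
# Toolkit for data preprocessing / feature engineering
from sklearn import preprocessing
# Decision tree modeling package
from sklearn import tree

# 2. Load the data
adult_data = pd.read_csv('./data/DecisionTree.csv')

# Look at the first 10 rows to get a feel for the data
adult_data.head(10)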

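#   workclass      education    marital status   occupation   relationship    race    gender   country   income
#   government     bachelor's   never married                 not in family   white   male     US
#   freelance      bachelor's   married                       husband         black   female   Cuba
#   self-employed  doctorate    divorced
#                  11th grade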

adult_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 9 columns):
workclass         32561 non-null object
education         32561 non-null object
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
gender            32561 non-null object
native-country    32561 non-null object
income            32561 non-null object
dtypes: object(9)
memory usage: 2.2+ MB

adult_data.shape

(32561, 9)

adult_data.columns
# workclass, education, marital status, occupation, relationship, race, gender, country, income

Index(['workclass', 'education', 'marital-status', 'occupation',
       'relationship', 'race', 'gender', 'native-country', 'income'],
      dtype='object')

# 3. Separate the features (attributes) from the target
feature_columns = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'native-country']
label_column = ['income']

features = adult_data[feature_columns]  # feature matrix
label = adult_data[label_column]  # label (target) matrix

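features.head(2)  # show the features

# Illustration only: df stands for any DataFrame with a categorical column
# such as `color`; it is not defined in this post.
# pd.get_dummies(df)        # one-hot encode every categorical column
# pd.get_dummies(df.color)  # one-hot encode a single column

# 4. Feature processing / feature engineering / one-hot encoding / dummy variables
features = pd.get_dummies(features)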

features.head(2)
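To make clear what one-hot encoding does to the categorical columns, here is a plain-Python sketch of the idea. It is not how pandas implements `get_dummies`; the `one_hot` helper and the three sample rows are hypothetical, and the `column_value` naming only mirrors the default naming style of `get_dummies`.

```python
def one_hot(rows, columns):
    """One-hot encode the given categorical columns of a list of dict rows."""
    # Collect the sorted distinct categories of every column.
    categories = {c: sorted({row[c] for row in rows}) for c in columns}
    encoded = []
    for row in rows:
        out = {}
        for c in columns:
            for value in categories[c]:
                # One 0/1 indicator per (column, category) pair.
                out[f"{c}_{value}"] = 1 if row[c] == value else 0
        encoded.append(out)
    return encoded


# Hypothetical sample rows (illustrative values only).
rows = [
    {"workclass": "State-gov", "gender": "Male"},
    {"workclass": "Private", "gender": "Female"},
    {"workclass": "Private", "gender": "Male"},
]

for encoded_row in one_hot(rows, ["workclass", "gender"]):
    print(encoded_row)
```

Each categorical column is replaced by one indicator column per category, which is why the feature matrix becomes much wider after encoding.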

#4. Build the model
# Initialize a decision tree classifier
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=4)  # limiting max_depth acts like regularization
# Fit the classifier to the data
clf = clf.fit(features.values, label.values)

clf  # the decision tree classifier

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

clf.predict(features.values)

array([' <=50K', ' <=50K', ' <=50K', ..., ' <=50K', ' <=50K', ' >50K'],
      dtype=object)

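#5. Visualize the decision tree

import pydotplus

from IPython.display import display, Image

dot_data = tree.export_graphviz(clf,  # the fitted decision tree
                                out_file=None,  # None: return the DOT source as a string instead of writing a file
                                feature_names=features.columns,  # names of the one-hot encoded features
                                class_names=['<=50k', '>50k'],  # class names
                                filled=True,  # color each node by its majority class
                                rounded=True  # draw nodes as rounded rectangles
                               )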

graph = pydotplus.graph_from_dot_data(dot_data)
display(Image(graph.create_png()))

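# The root node splits on whether marital status is "married"; its left child splits on a relationship category (described in the original as "father acknowledging the child"), and its right child splits on occupation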

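3.2 Loss Function

4 Summary

4.1 Definition of a Decision Tree

4.2 Splitting in Decision Trees

4.3 Pruning Decision Trees

5 Interview Questions

5.1 What is a decision tree?

5.2 Compared with other models, what are its advantages?

5.3 Discuss your understanding of information gain and gain ratio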
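One way to approach this question: information gain tends to favor attributes with many distinct values, because splitting on something like a row ID makes every branch trivially pure. The gain ratio used by C4.5 divides the information gain by the entropy of the attribute itself (the split information), which penalizes such attributes. The sketch below illustrates this on made-up data; all names and values are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    total = len(values)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(values).values())


def info_gain(feature, labels):
    """Entropy of the labels minus their conditional entropy given the feature."""
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain normalized by the split information of the feature."""
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0  # a constant feature cannot split anything
    return info_gain(feature, labels) / split_info


# Hypothetical data: `weather` is informative, `row_id` is a unique identifier.
labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "yes"]
weather = ["sunny"] * 4 + ["rainy"] * 4
row_id = list(range(8))

print("info gain :", info_gain(weather, labels), info_gain(row_id, labels))
print("gain ratio:", gain_ratio(weather, labels), gain_ratio(row_id, labels))
```

Here information gain ranks the useless `row_id` above `weather`, while gain ratio reverses the ranking.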

Article link: https://www.cnblogs.com/qiu-hua/p/14397853.html
