2020/05/31
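The Road from R to Python (15)
Quite some time has passed since the last installment. The previous post was before Golden Week, and after that I got hooked on the machine learning library PyCaret, so this series was on hiatus for a while. Now it is back! This time I will work through "14. Decision Trees (Classification) (2)" in both Python and R, using 『データサイエンス教本』 as the reference textbook.
The data is the mushroom dataset, which has 23 columns in total: 22 explanatory variables and 1 target variable (whether a mushroom is edible (e) or poisonous (p)). The analysis uses the following 4 explanatory variables, all of them categorical.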
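Explanatory variables
gill-color : gill color (12 categories)
gill-attachment : how the gills are attached (4 categories, of which 2 appear in the data)
odor : odor (9 categories)
cap-color : cap color (10 categories)
Target variable
classes : class (2 categories: edible / poisonous)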
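Since all four explanatory variables are categorical, they have to be expanded into 0/1 indicator (dummy) columns before a tree model can use them. The following is a minimal sketch of that expansion in plain Python, not the pandas routine used in this article; the helper name `one_hot` and the three sample rows are made up for illustration. With the category counts listed above, the full data yields 12 + 2 + 9 + 10 = 33 columns.

```python
def one_hot(rows, columns):
    """Expand categorical columns into 0/1 indicator columns.

    rows    : list of dicts mapping column name -> category code
    columns : names of the categorical columns to expand
    Returns (header, matrix); header entries are named "<column>_<category>".
    """
    pairs = []
    for col in columns:
        # Sort the observed categories so the column order is deterministic.
        for cat in sorted({row[col] for row in rows}):
            pairs.append((col, cat))
    matrix = [[1 if row[col] == cat else 0 for col, cat in pairs] for row in rows]
    header = ["{}_{}".format(col, cat) for col, cat in pairs]
    return header, matrix


# Hypothetical sample rows (illustration only, not the full dataset).
sample = [
    {"odor": "p", "cap-color": "n"},
    {"odor": "a", "cap-color": "y"},
    {"odor": "l", "cap-color": "w"},
]
header, matrix = one_hot(sample, ["odor", "cap-color"])
print(header)
for line in matrix:
    print(line)

# Column count expected for the four variables used in this article.
n_columns = 12 + 2 + 9 + 10
print(n_columns)
```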
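The reference textbook explains the theory behind classification with decision trees (entropy and information gain), but I will skip that part and cover only the data preprocessing and the execution of the decision tree.
First, the Python code.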
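In the preprocessing, line 25 of the listing (the pd.get_dummies call) expands the 4 categorical explanatory variables into 33 dummy variables. On line 29, the target variable (flg) is set to 1 for a poisonous mushroom and 0 for an edible one.
# Decision Tree (Classification)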
import pandas as pd
import requests
import io
import numpy as np
import matplotlib.pyplot as plt
#--------------------
# data preprocessing
#--------------------
# data read
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'
res = requests.get(url).content
mushroom = pd.read_csv(io.StringIO(res.decode('utf-8')), header=None)
print(mushroom.head())
# data label
mushroom.columns = ['classes','cap-shape','cap-surface','cap-color','bruises','odor','gill-attachment', 'gill-spacing', 'gill-size','gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring','stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']
print(mushroom.head())
print('data stype:{}'.format(mushroom.shape))
print('missing number:{}'.format(mushroom.isnull().sum().sum()))
# dummy variable
mushroom_dummy = pd.get_dummies(mushroom[['gill-color', 'gill-attachment', 'odor', 'cap-color']])
print(mushroom_dummy.head())
# objective variable
mushroom_dummy['flg'] = mushroom['classes'].map(lambda x:1 if x == 'p' else 0)
print(mushroom_dummy['flg'])
#------------------------------
# data modeling and evaluation
#------------------------------
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# data split
X = mushroom_dummy.drop('flg', axis=1)
y = mushroom_dummy['flg']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
#print(X_train.shape)
#print(X_test.shape)
# decision tree model
model = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=0) #5 -> 3
model.fit(X_train, y_train)
# Result: accuracy and cross-tabulation (predictions are passed first, so rows = predicted, columns = actual)
print('Accuracy rate(train): {:.3f}'.format(model.score(X_train, y_train)))
print(confusion_matrix(model.predict(X_train), y_train))
print('Accuracy rate(test) : {:.3f}'.format(model.score(X_test, y_test)))
print(confusion_matrix(model.predict(X_test), y_test))
# Graph
from sklearn import tree
import pydotplus
from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn
from IPython.display import Image
dot_data = StringIO()
tree.export_graphviz(model, out_file=dot_data)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf('DT_result.pdf')
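For the decision tree itself, I split the data into train : test = 7 : 3, built the model on the training data, and predicted on the test data. I also evaluated the model's accuracy (score) on both the training and the test data.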
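The 7 : 3 hold-out procedure described above can be sketched without any library. This is only an illustration of the idea, not scikit-learn's `train_test_split` implementation: the function names `holdout_split` and `accuracy` are hypothetical, and the shuffle will not reproduce the exact rows selected in the article. The test size is rounded up here, which for 8124 rows gives split sizes of 5686 and 2438.

```python
import math
import random


def holdout_split(n_rows, test_size=0.3, seed=0):
    """Return (train_indices, test_indices) for a shuffled hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_test = math.ceil(n_rows * test_size)    # test share, rounded up
    return indices[n_test:], indices[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


train_idx, test_idx = holdout_split(8124, test_size=0.3, seed=0)
print(len(train_idx), len(test_idx))

# Toy labels (made up) to show how the score is computed.
toy_score = accuracy([1, 0, 0, 1], [1, 0, 1, 1])
print(toy_score)
```

Fixing the seed plays the same role as `random_state=0` in the listing: the same rows land in each subset on every run.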
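The output of the Python code is shown below. The final decision tree graph from line 65 of the listing is written to DT_result.pdf.
# Line 16: print(mushroom.head())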
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
0 p x s n t p f c n k e e s s w w p w o p k s u
1 e x s y t a f c b k e c s s w w p w o p n n g
2 e b s w t l f c b n e c s s w w p w o p n n m
3 p x y w t p f c n n e e s s w w p w o p k s u
4 e x s g f n f w b k t e s s w w p w o e n a g
# Line 20: print(mushroom.head()) after labeling the columns
classes cap-shape cap-surface cap-color bruises odor ... veil-color ring-number ring-type spore-print-color population habitat
0 p x s n t p ... w o p k s u
1 e x s y t a ... w o p n n g
2 e b s w t l ... w o p n n m
3 p x y w t p ... w o p k s u
4 e x s g f n ... w o e n a g
[5 rows x 23 columns]
# Line 21: data shape
data stype:(8124, 23)
# Line 22: number of missing values
missing number:0
# Line 26: print(mushroom_dummy.head())
gill-color_b gill-color_e gill-color_g gill-color_h gill-color_k ... cap-color_p cap-color_r cap-color_u cap-color_w cap-color_y
0 0 0 0 0 1 ... 0 0 0 0 0
1 0 0 0 0 1 ... 0 0 0 0 1
2 0 0 0 0 0 ... 0 0 0 1 0
3 0 0 0 0 0 ... 0 0 0 1 0
4 0 0 0 0 1 ... 0 0 0 0 0
[5 rows x 33 columns]
# Line 30: print(mushroom_dummy['flg'])
0 1
1 0
2 0
3 1
4 0
..
8119 0
8120 0
8121 0
8122 1
8123 0
Name: flg, Length: 8124, dtype: int64
# Line 51: accuracy on the training data
Accuracy rate(train): 0.991
# Line 52: confusion matrix for the training data
[[2936 52]
[ 0 2698]]
# Line 53: accuracy on the test data
Accuracy rate(test) : 0.992
# Line 54: confusion matrix for the test data
[[1272 20]
[ 0 1146]]
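As a check on the figures above, the accuracy can be recomputed from each confusion matrix. The sketch below assumes the layout that results from the listing, where the predictions are passed to `confusion_matrix` first, so rows are predicted classes and columns are actual classes (0 = edible, 1 = poisonous); the helper name `summarize` is hypothetical.

```python
def summarize(cm):
    """cm[i][j] = number of samples predicted as class i whose actual class is j."""
    total = sum(sum(row) for row in cm)
    correct = cm[0][0] + cm[1][1]
    missed_poisonous = cm[0][1]   # actually poisonous, predicted edible
    return correct / total, missed_poisonous


# Matrices copied from the output above.
train_cm = [[2936, 52], [0, 2698]]
test_cm = [[1272, 20], [0, 1146]]

for name, cm in (("train", train_cm), ("test", test_cm)):
    acc, missed = summarize(cm)
    print("{}: accuracy={:.3f}, poisonous predicted as edible={}".format(name, acc, missed))
```

Both values round to the reported 0.991 and 0.992. All of the errors are poisonous mushrooms predicted as edible; no edible mushroom was predicted as poisonous.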

The R code and its output are as follows.
# Decision Tree (Classification)
library(dummies)
#--------------------
# data preprocessing
#--------------------
# data read
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'
mushroom = read.csv(url, header = FALSE)
print(head(mushroom))
# data label
colnames(mushroom) = c('classes','cap.shape','cap.surface','cap.color','bruises','odor','gill.attachment', 'gill.spacing', 'gill.size','gill.color', 'stalk.shape', 'stalk.root', 'stalk.surface.above.ring','stalk.surface.below.ring', 'stalk.color.above.ring', 'stalk.color.below.ring', 'veil.type', 'veil.color', 'ring.number', 'ring.type', 'spore.print.color', 'population', 'habitat')
print(head(mushroom))
print(paste("data style: ", dim(mushroom)[1], ",", dim(mushroom)[2]))
print(paste("missing number: ", sum(is.na(mushroom))))
# dummy variable
gill.color_dummy = dummy(mushroom$'gill.color', sep = ".")
gill.attachment_dummy = dummy(mushroom$'gill.attachment', sep=".")
odor_dummy = dummy(mushroom$'odor', sep=".")
cap.color_dummy = dummy(mushroom$'cap.color', sep=".")
# objective variable
flg_dummy = ifelse(mushroom$classes == 'p', 1, 0)
print(head(mushroom$classes))
print(head(flg_dummy))
mushroom_dummy = data.frame(cbind(gill.color_dummy, gill.attachment_dummy, odor_dummy, cap.color_dummy, flg_dummy))
#------------------------------
# data modeling and evaluation
#------------------------------
library(rpart)
library(rpart.plot)
# data split
X = mushroom_dummy[, colnames(mushroom_dummy) != "flg_dummy"]
y = mushroom_dummy[, c("flg_dummy")]
train.rate = 0.7 # training data rate
train.index = sample(nrow(mushroom_dummy),nrow(mushroom_dummy) * train.rate)
df_Train = mushroom_dummy[train.index ,]
df_Test = mushroom_dummy[-train.index ,]
cat("train=", nrow(df_Train), "test=", nrow(df_Test), "\n")
# decision tree model
rp = rpart(flg_dummy~., data = df_Train, method = 'class')
summary(rp)
# Predict
pred.rpart.train = predict(rp, df_Train, type = "class")
pred.rpart.test = predict(rp, df_Test, type = "class")
# Result : cross tabulation table
result.train = table(pred.rpart.train, df_Train$flg_dummy)
print(result.train)
result.test = table(pred.rpart.test, df_Test$flg_dummy)
print(result.test)
# Calculate the accuracy
accuracy_prediction_train = sum(diag(result.train)) / sum(result.train)
print(accuracy_prediction_train)
accuracy_prediction_test = sum(diag(result.test)) / sum(result.test)
print(accuracy_prediction_test)
# Graph
rpart.plot(rp , type = 4, extra = 1, digits = 3)
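The final decision tree graph is drawn by line 68 of the listing (rpart.plot); the console output follows.
# Line 11: print(head(mushroom))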
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23
1 p x s n t p f c n k e e s s w w p w o p k s u
2 e x s y t a f c b k e c s s w w p w o p n n g
3 e b s w t l f c b n e c s s w w p w o p n n m
4 p x y w t p f c n n e e s s w w p w o p k s u
5 e x s g f n f w b k t e s s w w p w o e n a g
6 e x y y t a f c b n e c s s w w p w o p k n g
# Line 15: print(head(mushroom)) after labeling the columns
classes cap.shape cap.surface cap.color bruises odor gill.attachment gill.spacing gill.size gill.color
1 p x s n t p f c n k
2 e x s y t a f c b k
3 e b s w t l f c b n
4 p x y w t p f c n n
5 e x s g f n f w b k
6 e x y y t a f c b n
stalk.shape stalk.root stalk.surface.above.ring stalk.surface.below.ring stalk.color.above.ring
1 e e s s w
2 e c s s w
3 e c s s w
4 e e s s w
5 t e s s w
6 e c s s w
stalk.color.below.ring veil.type veil.color ring.number ring.type spore.print.color population habitat
1 w p w o p k s u
2 w p w o p n n g
3 w p w o p n n m
4 w p w o p k s u
5 w p w o e n a g
6 w p w o p k n g
# Line 16: data dimensions
[1] "data style: 8124 , 23"
# Line 17: number of missing values
[1] "missing number: 0"
# Line 27: print(head(mushroom$classes))
[1] p e e p e e
Levels: e p
# Line 28: print(head(flg_dummy))
[1] 1 0 0 1 0 0
# Line 45: sizes of the training and test data
train= 5686 test= 2438
# Line 49: summary(rp)
Call:
rpart(formula = flg_dummy ~ ., data = df_Train, method = "class")
n= 5686
CP nsplit rel error xerror xstd
1 0.7538518 0 1.00000000 1.00000000 0.013819092
2 0.1085840 1 0.24614820 0.24614820 0.008924162
3 0.1060161 2 0.13756420 0.14636831 0.007065808
4 0.0100000 3 0.03154806 0.03154806 0.003376090
Variable importance
odor.n odor.f odor.l odor.a gill.color.b gill.color.u gill.color.w
42 13 13 11 7 5 5
gill.color.n
4
Node number 1: 5686 observations, complexity param=0.7538518
predicted class=0 expected loss=0.4794231 P(node) =1
class counts: 2960 2726
probabilities: 0.521 0.479
left son=2 (2461 obs) right son=3 (3225 obs)
Primary splits:
odor.n < 0.5 to the right, improve=1714.4280, (0 missing)
odor.f < 0.5 to the left, improve=1123.4170, (0 missing)
gill.color.b < 0.5 to the left, improve= 812.2295, (0 missing)
gill.color.n < 0.5 to the right, improve= 239.1147, (0 missing)
odor.s < 0.5 to the left, improve= 232.5788, (0 missing)
Surrogate splits:
odor.f < 0.5 to the left, agree=0.700, adj=0.307, (0 split)
gill.color.b < 0.5 to the left, agree=0.641, adj=0.171, (0 split)
gill.color.u < 0.5 to the right, agree=0.617, adj=0.116, (0 split)
gill.color.w < 0.5 to the right, agree=0.616, adj=0.113, (0 split)
gill.color.n < 0.5 to the right, agree=0.610, adj=0.098, (0 split)
Node number 2: 2461 observations
predicted class=0 expected loss=0.03494514 P(node) =0.4328174
class counts: 2375 86
probabilities: 0.965 0.035
Node number 3: 3225 observations, complexity param=0.108584
predicted class=1 expected loss=0.1813953 P(node) =0.5671826
class counts: 585 2640
probabilities: 0.181 0.819
left son=6 (296 obs) right son=7 (2929 obs)
Primary splits:
odor.a < 0.5 to the right, improve=436.7978, (0 missing)
odor.l < 0.5 to the right, improve=425.4514, (0 missing)
odor.f < 0.5 to the left, improve=188.9691, (0 missing)
gill.color.n < 0.5 to the right, improve=139.1077, (0 missing)
gill.color.b < 0.5 to the left, improve=123.4467, (0 missing)
Node number 6: 296 observations
predicted class=0 expected loss=0 P(node) =0.05205769
class counts: 296 0
probabilities: 1.000 0.000
Node number 7: 2929 observations, complexity param=0.1060161
predicted class=1 expected loss=0.09866849 P(node) =0.5151249
class counts: 289 2640
probabilities: 0.099 0.901
left son=14 (289 obs) right son=15 (2640 obs)
Primary splits:
odor.l < 0.5 to the right, improve=520.96960, (0 missing)
odor.f < 0.5 to the left, improve= 61.43912, (0 missing)
gill.color.n < 0.5 to the right, improve= 60.45658, (0 missing)
cap.color.w < 0.5 to the right, improve= 50.10237, (0 missing)
gill.color.w < 0.5 to the right, improve= 42.62831, (0 missing)
Surrogate splits:
gill.color.n < 0.5 to the right, agree=0.903, adj=0.021, (0 split)
gill.color.k < 0.5 to the right, agree=0.903, adj=0.017, (0 split)
Node number 14: 289 observations
predicted class=0 expected loss=0 P(node) =0.05082659
class counts: 289 0
probabilities: 1.000 0.000
Node number 15: 2640 observations
predicted class=1 expected loss=0 P(node) =0.4642983
class counts: 0 2640
probabilities: 0.000 1.000
# Line 57: cross-tabulation for the training data
pred.rpart.train 0 1
0 2960 86
1 0 2640
# Line 59: cross-tabulation for the test data
pred.rpart.test 0 1
0 1248 34
1 0 1156
# Line 63: accuracy on the training data
[1] 0.9848751
# Line 65: accuracy on the test data
[1] 0.9860541
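Reading the rpart summary above, the fitted tree uses only the odor dummies: rows with odor.n = 1 go to node 2 (predicted edible), the remaining rows are split on odor.a (node 6, edible) and then on odor.l (node 14, edible), and everything else ends in node 15 (poisonous). A minimal sketch of that reading as a rule follows; the function name `predict_by_odor` is hypothetical, and the leaf counts are copied from the summary.

```python
EDIBLE_ODORS = {"n", "a", "l"}   # odor codes that lead to an "edible" leaf


def predict_by_odor(odor):
    """Return 0 (edible) or 1 (poisonous) following the rpart splits on odor."""
    return 0 if odor in EDIBLE_ODORS else 1


# Training-set leaves from the summary:
# (label, predicted class, edible count, poisonous count)
leaves = [
    ("node 2: odor n", 0, 2375, 86),
    ("node 6: odor a", 0, 296, 0),
    ("node 14: odor l", 0, 289, 0),
    ("node 15: other odors", 1, 0, 2640),
]
total = sum(e + p for _, _, e, p in leaves)
correct = sum(e if pred == 0 else p for _, pred, e, p in leaves)

print(predict_by_odor("a"), predict_by_odor("f"))
print(total, correct, correct / total)
```

The ratio agrees with the training accuracy printed above, and the 86 misclassified rows are the poisonous mushrooms in node 2.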

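"The Road from R to Python" back issues
(1) Introduction
(2) 0. Execution environment (working environment)
(3) 1. Using R from Python 2. Using Python from R
(4) 3. Data frames
(5) 4. ggplot
(6) 5. Matrices
(7) 6. Basic statistics
(8) 7. Regression analysis (simple regression)
(9) 8. Regression analysis (multiple regression)
(10) 9. Regression analysis (logistic regression 1)
(11) 10. Regression analysis (logistic regression 2)
(12) 11. Regression analysis (ridge and lasso regression)
(13) 12. Regression analysis (polynomial regression)
(14) 13. Decision trees (classification) (1)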