2020/02/23
From R to Python (12)
This time I work through "11. Regression analysis (ridge and lasso regression)" in both Python and R. As before, the material follows 『東京大学のデータサイエンティスト育成講座』. Ridge and lasso regression minimize the objective shown below, which adds a second term (a regularization term that suppresses overfitting) to the first term (the squared error between the target variable and the predicted values). When q = 1 the method is called lasso regression, and when q = 2 ridge regression. M is the number of variables, w are the weighting coefficients, and λ is the "regularization parameter".
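Written out explicitly (this is the standard form of the penalized least-squares objective; the book's exact notation may differ slightly):

% Penalized least-squares objective: ridge for q = 2, lasso for q = 1
E(\mathbf{w}) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{M} \lvert w_j \rvert^{q}

where \hat{y}_i is the model's prediction for the i-th sample.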
When running ridge or lasso regression, the usual practice is to perform cross-validation (CV) to find the λ that minimizes the objective above and then fit the model with it. The book did not include cross-validation, so I added it. However, the settings are left at their defaults, so it does not seem to be working properly (some trial and error by the user may be needed? A tuning sketch follows the Python results below). The R code later in this post handled the computation fine.

import requests
import io
import numpy as np
import pandas as pd
#--------------------
# data preprocessing
#--------------------
# data file download
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data'
res = requests.get(url).content # bytes type data
# data read
auto = pd.read_csv(io.StringIO(res.decode('utf-8')),header=None)
# data label
auto.columns = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
print('car data format:{}'.format(auto.shape))
print(auto.head())
# count ?-data
auto = auto[['price', 'horsepower', 'width', 'height']]
print(auto.isin(['?']).sum())
# replace ?-data with NaN and delete NaN
auto = auto.replace('?', np.nan).dropna()
# data type check
print('data type check(before)\n{}\n'.format(auto.dtypes))
# data type conversion
auto = auto.assign(price=pd.to_numeric(auto.price))
auto = auto.assign(horsepower=pd.to_numeric(auto.horsepower))
print('data type check(after)\n{}\n'.format(auto.dtypes))
# Correlation
print(auto.corr())
#------------------------------
# data modeling and evaluation
#------------------------------
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
X = auto.drop('price',axis=1)
y = auto['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
# modeling
linear = LinearRegression()
lasso = Lasso(random_state=0)
ridge = Ridge(random_state=0)
lasso_cv = LassoCV(random_state=0)
ridge_cv = RidgeCV()
for model in [linear, lasso, ridge, lasso_cv, ridge_cv]:
    model.fit(X_train, y_train)
    print('{}(train):{:.6f}'.format(model.__class__.__name__, model.score(X_train, y_train)))
    print('{}(test):{:.6f}'.format(model.__class__.__name__, model.score(X_test, y_test)))
    print('{}(intercept):'.format(model.__class__.__name__), model.intercept_)
    print('{}(slope):'.format(model.__class__.__name__), model.coef_)
The output of the Python code is shown below. Discussion of the plain multiple regression, covered in the previous installment, is omitted.
LinearRegression(train):0.733358
LinearRegression(test):0.737069
LinearRegression(intercept): -128409.04630338575
LinearRegression(slope): [ 81.6510781 1829.17450553 229.51007731]
Lasso(train):0.733358
Lasso(test):0.737107
Lasso(intercept): -128383.27846080146
Lasso(slope): [ 81.66816264 1828.79749583 229.45963922]
Ridge(train):0.733355
Ridge(test):0.737768
Ridge(intercept): -127977.72972868057
Ridge(slope): [ 82.05219555 1820.23362733 231.67489868]
LassoCV(train):0.733108
LassoCV(test):0.742949
LassoCV(intercept): -123801.85160600257
LassoCV(slope): [ 84.37307241 1764.39408866 217.89863746]
RidgeCV(train):0.733100
RidgeCV(test):0.743506
RidgeCV(intercept): -124272.13077537864
RidgeCV(slope): [ 85.47424186 1743.93061057 249.69193454]

Since this is hard to read at a glance, here it is as a table:

Model            | R^2 (train) | R^2 (test) | intercept  | horsepower | width   | height
LinearRegression | 0.733358    | 0.737069   | -128409.05 | 81.65      | 1829.17 | 229.51
Lasso            | 0.733358    | 0.737107   | -128383.28 | 81.67      | 1828.80 | 229.46
Ridge            | 0.733355    | 0.737768   | -127977.73 | 82.05      | 1820.23 | 231.67
LassoCV          | 0.733108    | 0.742949   | -123801.85 | 84.37      | 1764.39 | 217.90
RidgeCV          | 0.733100    | 0.743506   | -124272.13 | 85.47      | 1743.93 | 249.69
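As noted above, LassoCV and RidgeCV were run with their default settings. Here is a minimal sketch of tuning them explicitly, reusing X_train, y_train, X_test, and y_test from the code above; the alpha grid, fold count, and iteration cap are my own illustrative choices, not values from the book:

import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# Candidate regularization strengths (the lambda in the formula above), log-spaced
alphas = np.logspace(-2, 3, 100)

# 10-fold CV over an explicit grid instead of the library defaults
lasso_cv = LassoCV(alphas=alphas, cv=10, max_iter=100000, random_state=0)
ridge_cv = RidgeCV(alphas=alphas, cv=10)

lasso_cv.fit(X_train, y_train)
ridge_cv.fit(X_train, y_train)

print('LassoCV alpha:', lasso_cv.alpha_)   # selected regularization strength
print('RidgeCV alpha:', ridge_cv.alpha_)
print('LassoCV (test):', lasso_cv.score(X_test, y_test))
print('RidgeCV (test):', ridge_cv.score(X_test, y_test))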
Next, the R code. Ridge and lasso regression in R were done with the glmnet function from the glmnet package. When calling glmnet, setting the parameter alpha to 0 gives ridge regression and setting it to 1 gives lasso regression; for 0 < alpha < 1 you get "Elastic Net", a blend of ridge and lasso (a Python counterpart is sketched after the R results below). The first half, up to the correlation check, is the same as the code from the earlier multiple-regression installment.
#--------------------
# data preprocessing
#--------------------
# data read
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
auto = read.csv(url, header = FALSE)
# data label
colnames(auto) = c("symboling","normalized-losses","make","fuel-type","aspiration","num-of-doors","body-style","drive-wheels","engine-location","wheel-base","length","width","height","curb-weight","engine-type","num-of-cylinders","engine-size","fuel-system","bore","stroke","compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price")
dim(auto)
head(auto)
# count ?-data
auto = auto[, c("price","horsepower","width","height")]
length(auto$price[which(auto$price=="?")])
length(auto$horsepower[which(auto$horsepower=="?")])
length(auto$width[which(auto$width=="?")])
length(auto$height[which(auto$height=="?")])
# replace ?-data with NaN and delete NaN
# data type check(before)
sapply(auto, class)
auto$price[which(auto$price=="?")] = NA
auto$horsepower[which(auto$horsepower=="?")] = NA
auto$width[which(auto$width=="?")] = NA
auto$height[which(auto$height=="?")] = NA
auto = na.omit(auto)
# data type conversion
auto$price = as.numeric(as.character(auto$price))
auto$horsepower = as.numeric(as.character(auto$horsepower))
# data type check(after)
sapply(auto, class)
# Correlation
cor(auto)
#------------------------------
# data modeling and evaluation
#------------------------------
library(caret)
library(glmnet)
X = auto[, c("horsepower","width","height")]
y = auto[, c("price")]
set.seed(0)
inTrain <- createDataPartition(y, p = 0.5,list = F)
df_Train <- auto[inTrain,]
df_Test <- auto[-inTrain,]
# modeling (Multi-regression)
auto.lm<-train(price ~ horsepower + width + height, data = df_Train, method = 'lm')
print(summary(auto.lm))
# prediction
auto.pred = predict(auto.lm, newdata= df_Test)
# Coefficient of determination
# Train data
a = sum((df_Train$price - predict(auto.lm, newdata = df_Train))^2)
b = sum((df_Train$price - mean(df_Train$price))^2)
print(paste("Multi-Regression(train): ", 1 - a/b))
# Test data
a = sum((df_Test$price - predict(auto.lm, newdata = df_Test))^2)
b = sum((df_Test$price - mean(df_Test$price))^2)
print(paste("Multi-Regression(test): ", 1 - a/b))
#--------------------------
# Ridge & Lasso Regression
#--------------------------
xx = as.matrix(df_Train[,2:4])
yy = as.matrix(df_Train[,1])
xx1 = as.matrix(df_Test[,2:4])
yy1 = as.matrix(df_Test[,1])
# Ridge Regression
Ridge.fit = glmnet(x=xx, y=yy, family = "gaussian", alpha = 0)
plot(Ridge.fit, xvar = "lambda", label=TRUE)
Ridge.fit.CV1 = cv.glmnet(x=xx, y=yy, family="gaussian", alpha=0) # cross validation
plot(Ridge.fit.CV1)
print(paste("lambda Min: ", Ridge.fit.CV1$lambda.min, "log lambda Min: ", log(Ridge.fit.CV1$lambda.min)))
abline(v=log(Ridge.fit.CV1$lambda.min),lty=3)
print(coef(Ridge.fit.CV1, s="lambda.min"))
# Coefficient of determination
# Train data
Ridge.fit.CV1.train = predict(Ridge.fit.CV1, s="lambda.min", newx = xx)
a = sum((yy- Ridge.fit.CV1.train)^2)
b = sum((yy - mean(yy))^2)
print(paste("Ridge-Regression(train): ", 1 - a/b))
# Test data
Ridge.fit.CV1.pred = predict(Ridge.fit.CV1, s="lambda.min", newx = xx1)
a = sum((yy1 - Ridge.fit.CV1.pred)^2)
b = sum((yy1 - mean(yy1))^2)
print(paste("Ridge-Regression(test): ", 1 - a/b))
# Lasso Regression
Lasso.fit = glmnet(x=xx, y=yy, family = "gaussian", alpha = 1)
plot(Lasso.fit, xvar = "lambda", label=TRUE)
Lasso.fit.CV1 = cv.glmnet(x=xx, y=yy, family="gaussian", alpha=1) # cross validation
plot(Lasso.fit.CV1)
print(paste("lambda Min: ", Lasso.fit.CV1$lambda.min, "log lambda Min: ", log(Lasso.fit.CV1$lambda.min)))
abline(v=log(Lasso.fit.CV1$lambda.min),lty=3)
print(coef(Lasso.fit.CV1, s="lambda.min"))
# Coefficient of determination
# Train data
Lasso.fit.CV1.train = predict(Lasso.fit.CV1, s="lambda.min", newx = xx)
a = sum((yy- Lasso.fit.CV1.train)^2)
b = sum((yy - mean(yy))^2)
print(paste("Lasso-Regression(train): ", 1 - a/b))
# Test data
Lasso.fit.CV1.pred = predict(Lasso.fit.CV1, s="lambda.min", newx = xx1)
a = sum((yy1 - Lasso.fit.CV1.pred)^2)
b = sum((yy1 - mean(yy1))^2)
print(paste("Lasso-Regression(test): ", 1 - a/b))

The R results are shown below. For ridge and lasso, CV was run to find the λ that minimizes the error, together with the corresponding regression coefficients.

Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
   Min     1Q Median     3Q    Max
 -6772  -2542   -220   1686  15489
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -119751.97   16976.10  -7.054 2.58e-10 ***
horsepower      116.34      15.69   7.417 4.56e-11 ***
width          1705.94     303.95   5.613 1.89e-07 ***
height          159.60     186.17   0.857    0.393
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4077 on 97 degrees of freedom
Multiple R-squared: 0.7541, Adjusted R-squared: 0.7465
F-statistic: 99.17 on 3 and 97 DF, p-value: < 2.2e-16
[1] "Multi-Regression(train): 0.75411846216974"
[1] "Multi-Regression(test): 0.771159651819088"
[1] "lambda Min: 643.869960222362 log lambda Min: 6.46749678059418"
4 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -115873.9444
horsepower 109.5919
width 1650.6246
height 168.2076
[1] "Ridge-Regression(train): 0.752515077280814"
[1] "Ridge-Regression(test): 0.773514153053235"
[1] "lambda Min: 35.1699620177062 log lambda Min: 3.56019236652245"
4 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -118887.4435
horsepower 115.4275
width 1705.2886
height 146.0971
[1] "Lasso-Regression(train): 0.754082093937949"
[1] "Lasso-Regression(test): 0.771449632230262"

This too is hard to read at a glance, so here it is as a table:

Model            | R^2 (train) | R^2 (test) | intercept  | horsepower | width   | height | lambda.min
Multi-Regression | 0.754118    | 0.771160   | -119751.97 | 116.34     | 1705.94 | 159.60 | -
Ridge            | 0.752515    | 0.773514   | -115873.94 | 109.59     | 1650.62 | 168.21 | 643.87
Lasso            | 0.754082    | 0.771450   | -118887.44 | 115.43     | 1705.29 | 146.10 | 35.17
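For reference on the Elastic Net mentioned above: glmnet's mixing parameter alpha corresponds to l1_ratio in scikit-learn's ElasticNetCV. A minimal sketch, reusing X_train, y_train, X_test, and y_test from the Python code earlier; the l1_ratio grid, fold count, and iteration cap are my own illustrative choices:

from sklearn.linear_model import ElasticNetCV

# l1_ratio plays the role of glmnet's alpha: near 0 is ridge-like, 1 is the lasso
enet_cv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=10,
                       max_iter=100000, random_state=0)
enet_cv.fit(X_train, y_train)
print('best l1_ratio:', enet_cv.l1_ratio_)   # selected mixing parameter
print('best alpha:', enet_cv.alpha_)         # selected regularization strength
print('ElasticNetCV (test):', enet_cv.score(X_test, y_test))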

"From R to Python" back numbers
(1) Introduction
(2) 0. Execution environment (working environment)
(3) 1. Using R from Python / 2. Using Python from R
(4) 3. Data frames
(5) 4. ggplot
(6) 5. Matrices
(7) 6. Basic statistics
(8) 7. Regression analysis (simple regression)
(9) 8. Regression analysis (multiple regression)
(10) 9. Regression analysis (logistic regression 1)
(11) 10. Regression analysis (logistic regression 2)