2020/03/01
This time I work through "12. Regression analysis (polynomial regression)" in both Python and R, using 『データサイエンス教本』 (Data Science Textbook) as the reference.
Polynomial regression is expressed by the equation below. The fitted curve is nonlinear in the explanatory variable x, but the model is linear in its coefficients, which is why ordinary least squares still applies.

y = b0 + b1*x + b2*x^2 + ... + bn*x^n

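Because the model is linear in its coefficients, fitting it amounts to ordinary least squares on a design matrix whose columns are powers of x (a Vandermonde matrix). A minimal NumPy sketch with synthetic data (not the cars dataset) that recovers the coefficients of an exact quadratic:

```python
import numpy as np

# Synthetic data generated from an exact quadratic y = 1 + 2x + 3x^2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2*x + 3*x**2

# Design matrix with columns [1, x, x^2] (a Vandermonde matrix)
X = np.vander(x, N=3, increasing=True)

# Ordinary least squares: minimize ||X b - y||^2
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 6))  # [1. 2. 3.]
```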
First, the Python code. I load the R dataset "cars" and fit it with ordinary least squares, OLS (Ordinary Least Squares), from the statsmodels library. The response variable is dist and the explanatory variable is speed, fitted at degree 1 (simple regression), 2, and 3.
# Polynomial regression
import numpy as np
import statsmodels.formula.api as smf
from rpy2.robjects import r, pandas2ri
import matplotlib.pyplot as plt
pandas2ri.activate()
# data read
df = r['cars'] # read datasets of cars
print(df.head())
x = df.speed
y = df.dist
# 1st-order
result1 = smf.ols('dist ~ speed', data=df).fit()
print(result1.summary())
# 2nd-order
result2 = smf.ols('dist ~ speed + np.power(speed, 2)', data=df).fit()
print(result2.summary())
# 3rd-order
result3 = smf.ols('dist ~ speed + np.power(speed, 2) + np.power(speed, 3)', data=df).fit()
print(result3.summary())
#graph
plt.scatter(x, y)
b0_1, b1_1 = result1.params # beta0, beta1
b0_2, b1_2, b2_2 = result2.params # beta0, beta1, beta2
b0_3, b1_3, b2_3, b3_3 = result3.params # beta0, beta1, beta2, beta3
plt.plot(x, b0_1 + b1_1*x, 'r')
plt.plot(x, b0_2 + b1_2*x + b2_2*x**2, 'g')
plt.plot(x, b0_3 + b1_3*x + b2_3*x**2 + b3_3*x**3, 'b')
plt.xlabel("speed")
plt.ylabel("dist")
plt.show()
The output of the Python code is shown below.
# Line 11
speed dist
1 4.0 2.0
2 4.0 10.0
3 7.0 4.0
4 7.0 22.0
5 8.0 16.0
# Line 17: 1st-order (y = b0 + b1*x)
OLS Regression Results
==============================================================================
Dep. Variable: dist R-squared: 0.651
Model: OLS Adj. R-squared: 0.644
Method: Least Squares F-statistic: 89.57
Date: Sun, 01 Mar 2020 Prob (F-statistic): 1.49e-12
Time: 13:13:14 Log-Likelihood: -206.58
No. Observations: 50 AIC: 417.2
Df Residuals: 48 BIC: 421.0
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -17.5791 6.758 -2.601 0.012 -31.168 -3.990
speed 3.9324 0.416 9.464 0.000 3.097 4.768
==============================================================================
Omnibus: 8.975 Durbin-Watson: 1.676
Prob(Omnibus): 0.011 Jarque-Bera (JB): 8.189
Skew: 0.885 Prob(JB): 0.0167
Kurtosis: 3.893 Cond. No. 50.7
==============================================================================
# Line 21: 2nd-order (y = b0 + b1*x + b2*x^2)
OLS Regression Results
==============================================================================
Dep. Variable: dist R-squared: 0.667
Model: OLS Adj. R-squared: 0.653
Method: Least Squares F-statistic: 47.14
Date: Sun, 01 Mar 2020 Prob (F-statistic): 5.85e-12
Time: 13:13:14 Log-Likelihood: -205.39
No. Observations: 50 AIC: 416.8
Df Residuals: 47 BIC: 422.5
Df Model: 2
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept 2.4701 14.817 0.167 0.868 -27.338 32.278
speed 0.9133 2.034 0.449 0.656 -3.179 5.006
np.power(speed, 2) 0.1000 0.066 1.515 0.136 -0.033 0.233
==============================================================================
Omnibus: 11.173 Durbin-Watson: 1.762
Prob(Omnibus): 0.004 Jarque-Bera (JB): 11.061
Skew: 0.991 Prob(JB): 0.00396
Kurtosis: 4.173 Cond. No. 2.16e+03
==============================================================================
# Line 26: 3rd-order (y = b0 + b1*x + b2*x^2 + b3*x^3)
OLS Regression Results
==============================================================================
Dep. Variable: dist R-squared: 0.673
Model: OLS Adj. R-squared: 0.652
Method: Least Squares F-statistic: 31.58
Date: Sun, 01 Mar 2020 Prob (F-statistic): 3.07e-11
Time: 13:13:14 Log-Likelihood: -204.94
No. Observations: 50 AIC: 417.9
Df Residuals: 46 BIC: 425.5
Df Model: 3
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept -19.5050 28.405 -0.687 0.496 -76.682 37.672
speed 6.8011 6.801 1.000 0.323 -6.889 20.491
np.power(speed, 2) -0.3497 0.500 -0.699 0.488 -1.356 0.657
np.power(speed, 3) 0.0103 0.011 0.907 0.369 -0.012 0.033
==============================================================================
Omnibus: 11.647 Durbin-Watson: 1.813
Prob(Omnibus): 0.003 Jarque-Bera (JB): 11.650
Skew: 1.037 Prob(JB): 0.00295
Kurtosis: 4.136 Cond. No. 8.75e+04
==============================================================================
The graph is shown below, with the 1st-, 2nd-, and 3rd-order fits drawn as red, green, and blue lines, respectively.
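As a side note on the statsmodels formulas above: patsy, the formula engine behind statsmodels, also accepts I() to protect arithmetic, so I(speed**2) is equivalent to np.power(speed, 2). A minimal sketch (with the first rows of the cars dataset inlined so it runs without rpy2/R) confirming the two spellings give identical fits:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# First ten rows of the cars dataset, inlined for self-containment
df = pd.DataFrame({
    "speed": [4, 4, 7, 7, 8, 9, 10, 10, 10, 11],
    "dist":  [2, 10, 4, 22, 16, 10, 18, 26, 34, 17],
})

# Two equivalent ways to write the quadratic term in a patsy formula
m1 = smf.ols("dist ~ speed + np.power(speed, 2)", data=df).fit()
m2 = smf.ols("dist ~ speed + I(speed**2)", data=df).fit()

print(np.allclose(m1.params.values, m2.params.values))  # True
```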

Next, the R code. The regressions on lines 10, 15, and 20 use the lm function (linear models) together with the poly function (polynomials). Although the fitted curve is nonlinear in x, the model is linear in its coefficients, so lm handles it directly. Note that raw = TRUE makes poly use raw powers (x, x^2, ...) rather than orthogonal polynomials, so the coefficients match the b0, b1, ... parameterization above. Later I also show an example using nls, R's nonlinear least-squares function.
# Polynomial regression
# data read
df = cars # cars dataset
head(df)
x = df$speed
y = df$dist
# 1st-order (y = b0 + b1*x)
df.lm1 = lm(y ~ poly(x, degree = 1, raw = TRUE))
summary(df.lm1)
y_pred1 = predict(df.lm1)
# 2nd-order (y = b0 + b1*x + b2*x^2)
df.lm2 = lm(y ~ poly(x, degree = 2, raw = TRUE))
summary(df.lm2)
y_pred2 = predict(df.lm2)
# 3rd-order (y = b0 + b1*x + b2*x^2 + b3*x^3)
df.lm3 = lm(y ~ poly(x, degree = 3, raw = TRUE))
summary(df.lm3)
y_pred3 = predict(df.lm3)
# graph
plot(x, y, xlab='speed', ylab='dist', pch=16)
lines(x, y_pred1, col=2) # red
lines(x, y_pred2, col=3) # green
lines(x, y_pred3, col=4) # blue
The output of the R code is shown below.
# Line 5
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
# Line 11: 1st-order (y = b0 + b1*x)
Call:
lm(formula = y ~ poly(x, degree = 1, raw = TRUE))
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
poly(x, degree = 1, raw = TRUE) 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
# Line 16: 2nd-order (y = b0 + b1*x + b2*x^2)
Call:
lm(formula = y ~ poly(x, degree = 2, raw = TRUE))
Residuals:
Min 1Q Median 3Q Max
-28.720 -9.184 -3.188 4.628 45.152
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.47014 14.81716 0.167 0.868
poly(x, degree = 2, raw = TRUE)1 0.91329 2.03422 0.449 0.656
poly(x, degree = 2, raw = TRUE)2 0.09996 0.06597 1.515 0.136
Residual standard error: 15.18 on 47 degrees of freedom
Multiple R-squared: 0.6673, Adjusted R-squared: 0.6532
F-statistic: 47.14 on 2 and 47 DF, p-value: 5.852e-12
# Line 21: 3rd-order (y = b0 + b1*x + b2*x^2 + b3*x^3)
Call:
lm(formula = y ~ poly(x, degree = 3, raw = TRUE))
Residuals:
Min 1Q Median 3Q Max
-26.670 -9.601 -2.231 7.075 44.691
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -19.50505 28.40530 -0.687 0.496
poly(x, degree = 3, raw = TRUE)1 6.80111 6.80113 1.000 0.323
poly(x, degree = 3, raw = TRUE)2 -0.34966 0.49988 -0.699 0.488
poly(x, degree = 3, raw = TRUE)3 0.01025 0.01130 0.907 0.369
Residual standard error: 15.2 on 46 degrees of freedom
Multiple R-squared: 0.6732, Adjusted R-squared: 0.6519
F-statistic: 31.58 on 3 and 46 DF, p-value: 3.074e-11
The graph is identical to the Python result.

R also provides the nls function for nonlinear least squares, so I tried it as well; see lines 10, 14, and 18 of the code below. Writing the model equation out explicitly in the formula is just like the statsmodels formula interface in Python.
# Non-linear regression
# data read
df = cars # cars dataset
head(df)
x = df$speed
y = df$dist
# 1st-order (y = b0 + b1*x)
nls_1st = nls(y ~ b0_1 + b1_1*x, start = c(b0_1=1, b1_1=1), trace = TRUE)
summary(nls_1st)
# 2nd-order (y = b0 + b1*x + b2*x^2)
nls_2nd = nls(y ~ b0_2 + b1_2*x + b2_2*x^2, start = c(b0_2=1, b1_2=1, b2_2=1), trace = TRUE)
summary(nls_2nd)
# 3rd-order (y = b0 + b1*x + b2*x^2 + b3*x^3)
nls_3rd = nls(y ~ b0_3 + b1_3*x + b2_3*x^2 + b3_3*x^3, start = c(b0_3=1, b1_3=1, b2_3=1, b3_3=1), trace = TRUE)
summary(nls_3rd)
The results are shown below.
# Line 11: summary(nls_1st)
Formula: y ~ b0_1 + b1_1 * x
Parameters:
Estimate Std. Error t value Pr(>|t|)
b0_1 -17.5791 6.7584 -2.601 0.0123 *
b1_1 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.38 on 48 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 3.137e-09
# Line 15: summary(nls_2nd)
Formula: y ~ b0_2 + b1_2 * x + b2_2 * x^2
Parameters:
Estimate Std. Error t value Pr(>|t|)
b0_2 2.47014 14.81717 0.167 0.868
b1_2 0.91329 2.03422 0.449 0.656
b2_2 0.09996 0.06597 1.515 0.136
Residual standard error: 15.18 on 47 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 1.387e-07
# Line 19: summary(nls_3rd)
Formula: y ~ b0_3 + b1_3 * x + b2_3 * x^2 + b3_3 * x^3
Parameters:
Estimate Std. Error t value Pr(>|t|)
b0_3 -19.50505 28.40530 -0.687 0.496
b1_3 6.80111 6.80114 1.000 0.323
b2_3 -0.34966 0.49988 -0.699 0.488
b3_3 0.01025 0.01130 0.907 0.369
Residual standard error: 15.2 on 46 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 2.441e-08
These results also match those from lm with poly. To summarize: every approach, in both Python and R, produced identical estimates. For general-purpose nonlinear regression in R, I found the nls function the clearest to read.
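As one more cross-check outside both statsmodels and lm/nls, numpy.polyfit fits the same polynomial by least squares. A minimal sketch with the cars data inlined (so it runs without R/rpy2); note that polyfit returns the highest-order coefficient first:

```python
import numpy as np

# The cars dataset (speed, dist), inlined for self-containment
speed = np.array([4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12,
                  13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 16, 17,
                  17, 17, 18, 18, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20,
                  22, 23, 24, 24, 24, 24, 25], dtype=float)
dist = np.array([2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28,
                 26, 34, 34, 46, 26, 36, 60, 80, 20, 26, 54, 32, 40, 32,
                 40, 50, 42, 56, 76, 84, 36, 46, 68, 32, 48, 52, 56, 64,
                 66, 54, 70, 92, 93, 120, 85], dtype=float)

# 1st-order fit; polyfit returns coefficients from highest power down
b1, b0 = np.polyfit(speed, dist, 1)
print(round(b0, 4), round(b1, 4))  # -17.5791 3.9324
```

The intercept and slope agree with the statsmodels, lm, and nls estimates above.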

For regression analysis in Python, the statsmodels library lets you write code that closely resembles R's, so the transition feels natural. I plan to keep using it.
『RからPythonへの道』 (The Road from R to Python) back numbers:
(1) Introduction
(2) 0. Execution environment (working environment)
(3) 1. Using R from Python / 2. Using Python from R
(4) 3. Data frames
(5) 4. ggplot
(6) 5. Matrices
(7) 6. Basic statistics
(8) 7. Regression analysis (simple regression)
(9) 8. Regression analysis (multiple regression)
(10) 9. Regression analysis (logistic regression 1)
(11) 10. Regression analysis (logistic regression 2)
(12) 11. Regression analysis (ridge and lasso regression)