Notes/Python/Statistics/Regression Analysis (Linear Regression)

Related: Notes/Python/Statistics/Principal Component Analysis & Factor Analysis

There are various Python packages for regression analysis, and each one can do different things.

The linregress() function in scipy.stats

Manual: scipy.stats.linregress
A list of the other functions in stats: Statistical functions (scipy.stats)

Its main characteristics: it handles simple (single-variable) regression only, and a single call returns the slope, intercept, r-value, p-value, and standard error.

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
x = np.loadtxt('shusseiritsu.csv', delimiter=",")  # read the data from a CSV file
# Source: http://www.mhlw.go.jp/toukei/saikin/hw/jinkou/suikei15/dl/2015suikei.pdf
# print(x)
birth1 = [u[1] for u in x if u[0] <= 1975]
death1 = [u[2] for u in x if u[0] <= 1975]
birth2 = [u[1] for u in x if 1976 <= u[0] <= 1989]
death2 = [u[2] for u in x if 1976 <= u[0] <= 1989]
birth3 = [u[1] for u in x if 1990 <= u[0]]
death3 = [u[2] for u in x if 1990 <= u[0]]

print('correlation up to 1975', np.corrcoef(birth1, death1)[0, 1])
print('correlation 1976-1989', np.corrcoef(birth2, death2)[0, 1])
print('correlation from 1990 on', np.corrcoef(birth3, death3)[0, 1])
# simple linear regression
print('regression up to 1975', scipy.stats.linregress(birth1, death1))
print('regression 1976-1989', scipy.stats.linregress(birth2, death2))
print('regression from 1990 on', scipy.stats.linregress(birth3, death3))
'''
# draw the graph (disabled with the triple-quoted string)
plt.scatter(birth1, death1, marker='x', label='up to 1975')
plt.scatter(birth2, death2, marker='.', label='1976-1989')
plt.scatter(birth3, death3, marker='*', label='from 1990 on')
plt.title('Birth rate vs. death rate')
plt.xlabel('birth rate')
plt.ylabel('death rate')
plt.legend()
plt.show()
'''

The output:

correlation up to 1975 0.922404896289
correlation 1976-1989 -0.491032741331
correlation from 1990 on -0.965411930518
regression up to 1975 LinregressResult(slope=0.3451178361054752, intercept=0.96204012512558279,
   rvalue=0.92240489628864086, pvalue=1.1465561864080153e-12, stderr=0.027810161695884787)
regression 1976-1989 LinregressResult(slope=-0.038225255972696312, intercept=6.6939249146757698,
   rvalue=-0.49103274133050051, pvalue=0.074589391104373334, stderr=0.019576625863660331)
regression from 1990 on LinregressResult(slope=-1.7235815792285496, intercept=23.852665883107164,
   rvalue=-0.96541193051791485, pvalue=1.621614066506831e-15, stderr=0.09501748192512055)
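
The fields of the returned LinregressResult can be used directly. Below is a minimal sketch, assuming birth1, death1, and the imports from the code above, that overlays the fitted line on the scatter plot. Note that rvalue squared is the coefficient of determination: 0.9224**2 ≈ 0.8508, matching the scikit-learn score in the next section.

# A minimal sketch, assuming birth1/death1 and the imports from the code above.
res = scipy.stats.linregress(birth1, death1)
xs = np.linspace(min(birth1), max(birth1), 100)
plt.scatter(birth1, death1, marker='x')
plt.plot(xs, res.slope * xs + res.intercept)  # fitted line y = slope*x + intercept
plt.show()
print(res.rvalue ** 2)  # coefficient of determination, ~0.8508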

The LinearRegression class in scikit-learn's linear_model

Manual: sklearn.linear_model.LinearRegression

Its main characteristics: it also handles multiple regression, but it does not compute p-values (see the note after the example).

import numpy as np
from sklearn.linear_model import LinearRegression
x = np.loadtxt('shusseiritsu.csv', delimiter=",")  # read the data from a CSV file
# Source: http://www.mhlw.go.jp/toukei/saikin/hw/jinkou/suikei15/dl/2015suikei.pdf
birth1 = [u[1] for u in x if u[0] <= 1975]
death1 = [u[2] for u in x if u[0] <= 1975]
birth2 = [u[1] for u in x if 1976 <= u[0] <= 1989]
death2 = [u[2] for u in x if 1976 <= u[0] <= 1989]
birth3 = [u[1] for u in x if 1990 <= u[0]]
death3 = [u[2] for u in x if 1990 <= u[0]]

model = LinearRegression()
model.fit(np.array(birth1).reshape(-1, 1), np.array(death1).reshape(-1, 1))
print('up to 1975: coefficients', model.coef_)
print('up to 1975: intercept', model.intercept_)
print('up to 1975: R-squared', model.score(np.array(birth1).reshape(-1, 1), np.array(death1).reshape(-1, 1)))

The result:

up to 1975: coefficients [[ 0.34511784]]
up to 1975: intercept [ 0.96204013]
up to 1975: R-squared 0.850830792697

It does not compute p-values (significance tests). An example of how to extend it yourself is given in "Find p-value (significance) in scikit-learn LinearRegression" (StackOverflow) and the sources linked there; a sketch of the idea follows.
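
That approach essentially recomputes the OLS standard errors by hand. A minimal sketch (assuming birth1 and death1 from the code above; the variable names are mine, not from the StackOverflow answer):

import numpy as np
import scipy.stats
from sklearn.linear_model import LinearRegression

# A minimal sketch, assuming birth1/death1 from the code above.
X = np.array(birth1).reshape(-1, 1)
y = np.array(death1)
model = LinearRegression().fit(X, y)
X1 = np.column_stack([np.ones(len(X)), X])   # design matrix with a constant column
beta = np.r_[model.intercept_, model.coef_]  # [intercept, slope]
resid = y - X1 @ beta
dof = len(y) - X1.shape[1]                   # degrees of freedom
sigma2 = resid @ resid / dof                 # residual variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X1.T @ X1)))  # standard errors
t = beta / se
p = 2 * scipy.stats.t.sf(np.abs(t), dof)     # two-sided p-values
print('t =', t, 'p =', p)  # should reproduce the t and P>|t| columns of the statsmodels summary below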

<Multiple regression example>

Scipy lecture notes 3.1.6.5. Multiple Regression
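
For reference, a minimal multiple-regression sketch in the same spirit, using statsmodels' formula API on synthetic data (the DataFrame and the column names y, x1, x2 are hypothetical, not from that page):

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: y depends linearly on x1 and x2 plus noise.
rng = np.random.RandomState(0)
df = pd.DataFrame({'x1': rng.rand(50), 'x2': rng.rand(50)})
df['y'] = 2 * df['x1'] - 1 * df['x2'] + 0.1 * rng.randn(50)
model = ols('y ~ x1 + x2', data=df).fit()  # the formula API adds the intercept automatically
print(model.summary())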

The OLS class in statsmodels

Manual: Linear Regression
Class manual: statsmodels.regression.linear_model.OLS (http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html)
Manuals for the result classes that are returned:
(general) statsmodels.regression.linear_model.RegressionResults
(OLS) statsmodels.regression.linear_model.OLSResults

Examples in the manual: Ordinary Least Squares, Generalized Least Squares

Let's try OLS on the same program as above.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
x = np.loadtxt('shusseiritsu.csv', delimiter=",")  # read the data from a CSV file
# Source: http://www.mhlw.go.jp/toukei/saikin/hw/jinkou/suikei15/dl/2015suikei.pdf
birth1 = [u[1] for u in x if u[0] <= 1975]
death1 = [u[2] for u in x if u[0] <= 1975]
birth2 = [u[1] for u in x if 1976 <= u[0] <= 1989]
death2 = [u[2] for u in x if 1976 <= u[0] <= 1989]
birth3 = [u[1] for u in x if 1990 <= u[0]]
death3 = [u[2] for u in x if 1990 <= u[0]]

model2 = sm.OLS(death1, sm.add_constant(birth1))  # add_constant appends a column of ones to X so the intercept is estimated
results2 = model2.fit()
print(results2.summary())
b, a = results2.params  # params is [const, x1], i.e. [intercept, slope]
print('a=', a, 'b=', b)

The result:

The summary() portion:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.851
Model:                            OLS   Adj. R-squared:                  0.845
Method:                 Least Squares   F-statistic:                     154.0
Date:                Fri, 02 Feb 2018   Prob (F-statistic):           1.15e-12
Time:                        12:31:16   Log-Likelihood:                -32.653
No. Observations:                  29   AIC:                             69.31
Df Residuals:                      27   BIC:                             72.04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.9620      0.590      1.630      0.115      -0.249       2.173
x1             0.3451      0.028     12.410      0.000       0.288       0.402
==============================================================================
Omnibus:                        1.232   Durbin-Watson:                   1.021
Prob(Omnibus):                  0.540   Jarque-Bera (JB):                1.166
Skew:                           0.428   Prob(JB):                        0.558
Kurtosis:                       2.518   Cond. No.                         87.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The values of a and b:
a= 0.345117836105 b= 0.962040125126

Looking at the p-values: const (the intercept) has p = 0.115, so the hypothesis that it is zero cannot be rejected; x1 (the slope) has p = 0.000, so it is rejected.
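
These values can also be read off programmatically from the results object instead of parsing summary(). A minimal sketch, assuming results2 from the code above:

print(results2.pvalues)     # p-values, in the order [const, x1]
print(results2.conf_int())  # 95% confidence intervals
print(results2.rsquared)    # R-squared, the 0.851 in the summary above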

Processing the Boston data with various packages

# -*- coding: utf-8 -*-
# Boston-regression.py
# Quick introduction to linear regression in Python
#    https://towardsdatascience.com/simple-and-multiple-linear-regression-in-python-c928425168f9
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
from scipy.stats import pearsonr, linregress
import statsmodels
import statsmodels.api as sm

dset = datasets.load_boston()
print(dset.DESCR)
boston = pd.DataFrame(dset.data)
boston.columns = dset.feature_names
target = pd.DataFrame(dset.target)

# linear regression with scikit-learn linear_model
#    https://pythondatascience.plavox.info/scikit-learn/%E7%B7%9A%E5%BD%A2%E5%9B%9E%E5%B8%B0
model = linear_model.LinearRegression()
model.fit(boston, target)
# partial regression coefficients
print(pd.DataFrame({"Name": boston.columns,
                    "Coefficients": model.coef_[0]}).sort_values(by='Coefficients'))
# intercept and coefficient of determination
print('intercept', model.intercept_[0], 'r2', model.score(boston, target))

# scipy.stats.pearsonr
#   manual --- https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html
print(pearsonr(boston['NOX'], target[0]))
#print(pearsonr(boston, target[0]))  # multiple regression is not supported

# scipy.stats.linregress
#   manual --- https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html
print(linregress(boston['NOX'], target[0]))
#print(linregress(boston, target[0]))  # multiple regression is not supported

# statsmodels.OLS
#   examples:  http://www.statsmodels.org/dev/examples/notebooks/generated/ols.html
#   manual --- http://www.statsmodels.org/dev/regression.html#module-reference
#   OLSResults manual --- http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLSResults.html#statsmodels.regression.linear_model.OLSResults
#model3 = sm.OLS(target[0], sm.add_constant(boston))
model3 = sm.OLS(target, sm.add_constant(boston))
result3 = model3.fit()
print(result3.summary())
print(result3.pvalues)

The output:

/home/yamanouc/.pyenv/versions/notebook/lib/python3.5/site-
packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools 
module is deprecated and will be removed in a future version. Please use the 
pandas.tseries module instead.
  from pandas.core import datetools
Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive

    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie 
Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that 
address regression
problems.

**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and 
Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In 
Proceedings on the Tenth International Conference of Machine Learning, 236-243, 
University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

    Coefficients     Name           <result from sklearn linear_model>
4     -17.795759      NOX
7      -1.475759      DIS
10     -0.953464  PTRATIO
12     -0.525467    LSTAT
0      -0.107171     CRIM
9      -0.012329      TAX
6       0.000751      AGE
11      0.009393        B
2       0.020860    INDUS
1       0.046395       ZN
8       0.305655      RAD
3       2.688561     CHAS
5       3.804752       RM
intercept 36.4911032804 r2 0.740607742865

(-0.4273207723732827, 7.0650415862543328e-24)          <result from scipy.stats.pearsonr>

LinregressResult(slope=-33.916055008661104, intercept=41.345874467973246, <result from scipy.stats.linregress>
rvalue=-0.42732077237328259, pvalue=7.0650415862534586e-24, stderr=3.1963370321953462)

                            OLS Regression Results     <result from statsmodels.OLS>
==============================================================================
Dep. Variable:                      0   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     108.1
Date:                Sun, 04 Feb 2018   Prob (F-statistic):          6.95e-135
Time:                        10:21:48   Log-Likelihood:                -1498.8
No. Observations:                 506   AIC:                             3026.
Df Residuals:                     492   BIC:                             3085.
Df Model:                          13
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         36.4911      5.104      7.149      0.000      26.462      46.520
CRIM          -0.1072      0.033     -3.276      0.001      -0.171      -0.043
ZN             0.0464      0.014      3.380      0.001       0.019       0.073
INDUS          0.0209      0.061      0.339      0.735      -0.100       0.142
CHAS           2.6886      0.862      3.120      0.002       0.996       4.381
NOX          -17.7958      3.821     -4.658      0.000     -25.302     -10.289
RM             3.8048      0.418      9.102      0.000       2.983       4.626
AGE            0.0008      0.013      0.057      0.955      -0.025       0.027
DIS           -1.4758      0.199     -7.398      0.000      -1.868      -1.084
RAD            0.3057      0.066      4.608      0.000       0.175       0.436
TAX           -0.0123      0.004     -3.278      0.001      -0.020      -0.005
PTRATIO       -0.9535      0.131     -7.287      0.000      -1.211      -0.696
B              0.0094      0.003      3.500      0.001       0.004       0.015
LSTAT         -0.5255      0.051    -10.366      0.000      -0.625      -0.426
==============================================================================
Omnibus:                      178.029   Durbin-Watson:                   1.078
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              782.015
Skew:                           1.521   Prob(JB):                    1.54e-170
Kurtosis:                       8.276   Cond. No.                     1.51e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly 
specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
const      3.182440e-12
CRIM       1.126402e-03
ZN         7.836070e-04
INDUS      7.345971e-01
CHAS       1.912339e-03
NOX        4.117296e-06
RM         2.207486e-18
AGE        9.546859e-01
DIS        6.017651e-13
RAD        5.189664e-06
TAX        1.117826e-03
PTRATIO    1.268218e-12
B          5.072875e-04
LSTAT      6.595808e-23
dtype: float64
