ノート/Python/統計/主成分分析・因子分析

関連  ノート/Python/統計/回帰分析

データセット

irisサンプルを主成分分析してみる

Rで試す

> data(iris)
> iris.data <- iris[,1:4]
> pca <- prcomp(data.matrix(iris.data), scale=TRUE)
> biplot(pca)

R-version-iris-pca.png

Pythonで試す

#/usr/bin/env python
# -*- coding: utf-8 -*-
# from functools import reduce
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from matplotlib import pyplot as plt

colors = ['red', 'blue', 'green' ]
markers = ['x', 'point', 'plus' ]
iris = load_iris()
# print(iris.DESCR)
species = ['Setosa','Versicolour', 'Virginica']
irisdata = pd.DataFrame(iris.data, columns=iris.feature_names)
iristarget = pd.DataFrame(iris.target, columns=['target'])
irispd = pd.concat([irisdata, iristarget], axis=1)
pca = PCA(n_components = 4)
pca.fit(irisdata)
print('components', pca.components_)
print('mean', pca.mean_)
print('variance', pca.explained_variance_ )
print('covariance', pca.get_covariance())
print('各次元の寄与率', pca.explained_variance_ratio_)
print('累積寄与率', np.cumsum(pca.explained_variance_ratio_))

# 主成分をプロットする
'''
transformed0 = pca.transform(irisdata[irispd.target==0])
transformed1 = pca.transform(irisdata[irispd.target==1])
transformed2 = pca.transform(irisdata[irispd.target==2])
plt.scatter( [u[0] for u in transformed0], [u[1] for u in transformed0], c='red', label=species[0], marker='x' )
lt.scatter( [u[0] for u in transformed1], [u[1] for u in transformed1], c='blue', label=species[1], marker='.' )
plt.scatter( [u[0] for u in transformed2], [u[1] for u in transformed2], c='green', label=species[2], marker='+' )
plt.title('irisデータの主成分分析')
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.legend()
plt.show()
'''
# biplotもどきの借り物
# https://sukhbinder.wordpress.com/2015/08/05/biplot-with-python/
def biplot(score,coeff,pcax,pcay,labels=None):
    pca1=pcax-1
    pca2=pcay-1
    xs = score[:,pca1]
    ys = score[:,pca2]
    n=score.shape[1]
    scalex = 1.0/(xs.max()- xs.min())
    scaley = 1.0/(ys.max()- ys.min())
    plt.scatter(xs*scalex,ys*scaley)
    for i in range(n):
        plt.arrow(0, 0, coeff[i,pca1], coeff[i,pca2],color='r',alpha=0.5)
        if labels is None:
            plt.text(coeff[i,pca1]* 1.15, coeff[i,pca2] * 1.15, "Var"+str(i+1), color='g', ha='center', va='center')
        else:
            plt.text(coeff[i,pca1]* 1.15, coeff[i,pca2] * 1.15, labels[i], color='g', ha='center', va='center')
    plt.xlim(-1,1)
    plt.ylim(-1,1)
    plt.xlabel("PC{}".format(pcax))
    plt.ylabel("PC{}".format(pcay))
    plt.grid()
    plt.show()
biplot(pca.transform(irisdata), pca.components_, 0,1)

出力

components [[ 0.36158968 -0.08226889  0.85657211  0.35884393]
 [ 0.65653988  0.72971237 -0.1757674  -0.07470647]
 [-0.58099728  0.59641809  0.07252408  0.54906091]
 [ 0.31725455 -0.32409435 -0.47971899  0.75112056]]
mean [ 5.84333333  3.054       3.75866667  1.19866667]
variance [ 4.19667516  0.24062861  0.07800042  0.02352514]
covariance [[ 0.68112222 -0.03900667  1.26519111  0.51345778]
 [-0.03900667  0.18675067 -0.319568   -0.11719467]
 [ 1.26519111 -0.319568    3.09242489  1.28774489]
 [ 0.51345778 -0.11719467  1.28774489  0.57853156]]
各次元の寄与率 [ 0.92461621  0.05301557  0.01718514  0.00518309]
累積寄与率 [ 0.92461621  0.97763178  0.99481691  1.0 ]

python-version-iris-pca.png

因子分析

〜〜〜

〜〜〜

〜〜〜

〜〜〜

因子分析をRで試す。使い方はRと因子分析、 主成分分析と因子分析

> library(MASS)
> factanal(Boston, factors=2, scores="Bartlett")     scoresは"Bartlett"か"regression"、無指定だとscoresは計算されない

出力

Call:
factanal(x = Boston, factors = 2, scores="Bartlett") 

Uniquenesses:
   crim      zn   indus    chas     nox      rm     age     dis     rad 
  0.622   0.500   0.263   0.979   0.227   0.849   0.297   0.242   0.139 
    tax ptratio   black   lstat    medv 
  0.039   0.772   0.780   0.515   0.715 

Loadings:
         Factor1 Factor2
crim     0.593   0.163 
zn      -0.197  -0.679 
indus    0.609   0.605 
chas             0.129 
nox      0.551   0.685 
rm      -0.247  -0.300 
age      0.371   0.752 
dis     -0.394  -0.776 
rad      0.917   0.145 
tax      0.959   0.201 
ptratio  0.460   0.125 
black   -0.437  -0.170 
lstat    0.471   0.513 
medv    -0.428  -0.319 

               Factor1 Factor2
SS loadings      3.991   3.069
Proportion Var   0.285   0.219
Cumulative Var   0.285   0.504

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 1177.72 on 64 degrees of freedom.
The p-value is 1.74e-204 

更にコマンド

> fa <- factanal(Boston, factors=2, scores="Bartlett")
> plot(NULL, xlim=c(0,1), ylim=c(0,1), xlab="因子1", ylab="因子2")
> text(fa$loadings, names(Boston))

R-version-Boston-fa-plottext.png

また、

> biplot(fa$scores, fa$loadings)

R-version-Boston-fa-biplot.png

Pythonで試す

# -*- coding: utf-8 -*-
# Quick introduction to linear regression in Python
#    https://towardsdatascience.com/simple-and-multiple-linear-regression-in-python-c928425168f9
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import scale

dset = datasets.load_boston()
print(dset.DESCR)
boston = pd.DataFrame(dset.data)
boston.columns = dset.feature_names
target = pd.DataFrame(dset.target)
# scikit-learn FactorAnalysisを使った因子分析
#    https://pythondatascience.plavox.info/scikit-learn/%E7%B7%9A%E5%BD%A2%E5%9B%9E%E5%B8%B0
fa = FactorAnalysis(n_components=2)
boston_normal = scale(boston)
fa.fit(boston_normal)
print(fa.components_) # loadings
# 因子負荷
#print(pd.DataFrame({"Name":boston.columns,\
#                    "Coefficients":fa.coef_[0]}).sort_values(by='Coefficients') )
 
# https://sukhbinder.wordpress.com/2015/08/05/biplot-with-python/
def biplot(score,coeff,pcax,pcay,labels=None):
    pca1=pcax-1
    pca2=pcay-1
    xs = score[:,pca1]
    ys = score[:,pca2]
    n=score.shape[1]
    scalex = 1.0/(xs.max()- xs.min())
    scaley = 1.0/(ys.max()- ys.min())
    plt.scatter(xs*scalex,ys*scaley)
    for j in range(coeff.shape[1]):
        plt.arrow(0, 0, coeff[pca1,j], coeff[pca2,j],color='r',alpha=0.5)
        if labels is None:
            plt.text(coeff[pca1,j]* 1.15, coeff[pca2,j] * 1.15, "Var"+str(j+1), color='g', ha='center', va='center')
        else:
            plt.text(coeff[pca1,j]* 1.15, coeff[pca2,j] * 1.15, labels[j], color='g', ha='center', va='center')
    plt.xlim(-1,1)
    plt.ylim(-1,1)
    plt.xlabel("PC{}".format(pcax))
    plt.ylabel("PC{}".format(pcay))
    plt.grid()
    plt.show()

print('---')
print('fa.transform(boston_normal)', fa.transform(boston_normal))
biplot(fa.transform(boston_normal), fa.components_, 0,1, labels=boston.columns)

python-version-Boston-fa-biplot.png

分散分析?

主成分分析とscree plot


添付ファイル: filepython-version-Boston-fa-biplot.png 44件 [詳細] fileR-version-Boston-fa-plottext.png 54件 [詳細] fileR-version-Boston-fa-biplot.png 55件 [詳細] filepython-version-iris-pca.png 32件 [詳細] fileR-version-iris-pca.png 44件 [詳細]

トップ   編集 凍結 差分 バックアップ 添付 複製 名前変更 リロード   新規 一覧 単語検索 最終更新   ヘルプ   最終更新のRSS
Last-modified: 2018-02-04 (日) 09:23:57 (311d)