Present the data analysis report to obtain the best PCR (principal components regression) model and its interpretation using the deug data set. Do the following:
head(data.deug)
library("PerformanceAnalytics")
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
chart.Correlation(data.deug, histogram = TRUE, pch = 15)
# install.packages("corrplot")
library(corrplot)
par(mfrow=c(2,2))
corrMat<-cor(data.deug)
corrplot(corrMat, method = "number") # Display the correlation coefficients
corrplot.mixed(corrMat, lower.col = "black", number.cex = .7)
boxplot(data.deug, col="pink")
apply(data.deug,2,mean) #mean
## Algebra Analysis Proba Informatic Economy Option1 Option2
## 45.567308 33.985577 31.240385 26.990385 69.550962 22.750000 22.668269
## English Sport
## 21.125962 9.230769
apply(data.deug,2,sd) # standard deviations
## Algebra Analysis Proba Informatic Economy Option1 Option2
## 10.489722 8.811802 13.381987 8.247383 9.609744 5.204087 4.759564
## English Sport
## 4.261956 4.944313
If we failed to scale the variables before performing PCA, most of the leading PCs we observed would be driven by the few variables with the largest variances. Thus, it is important to standardize the variables to have mean zero and standard deviation one before performing PCA.
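To see this on deug itself, compare the first loading vector with and without scaling; a minimal sketch (output omitted):
round(prcomp(data.deug)$rotation[, 1], 2) # unscaled: tends to be dominated by high-variance columns such as Proba
round(prcomp(data.deug, scale=TRUE)$rotation[, 1], 2) # scaled: weights spread more evenly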
deug.centered <- scale(data.deug,scale=TRUE,center=TRUE)
head(deug.centered)
## Algebra Analysis Proba Informatic Economy Option1
## 1 -0.5307393 -0.90623657 -0.39159989 -0.1200847 -1.8367775 -1.10490076
## 2 -0.8167336 0.05837888 0.43040061 0.6074188 0.2548495 0.24019582
## 3 -0.8167336 0.79602598 -0.16741793 0.9105453 0.2548495 0.24019582
## 4 1.6618832 0.39883139 1.92494697 1.0317958 0.8167791 0.04803916
## 5 0.8992319 -0.28207364 0.20621866 1.0924211 -1.2124112 -0.72058746
## 6 0.4225748 0.45557347 0.05676402 -0.8475882 -0.2758618 -0.52843080
## Option2 English Sport
## 1 0.27980100 -0.49882294 0.4589578
## 2 -0.14040556 1.14361533 0.4589578
## 3 0.91011084 -0.35804252 0.4589578
## 4 0.06969772 -0.02955486 0.9645892
## 5 0.27980100 0.67434725 0.4589578
## 6 -1.61112851 0.25200598 -1.8669469
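A quick check (output omitted) confirms that each column now has mean zero and standard deviation one:
round(colMeans(deug.centered), 10) # all ~0
apply(deug.centered, 2, sd) # all 1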
Next, compute the correlation matrix:
(Sigma = cor(data.deug)) # the correlation matrix, i.e. the covariance matrix of the centered and scaled data
## Algebra Analysis Proba Informatic Economy Option1
## Algebra 1.0000000 0.44496518 0.5042599 0.38858039 0.36587492 0.5367073
## Analysis 0.4449652 1.00000000 0.5164676 0.31995199 0.20681450 0.4044038
## Proba 0.5042599 0.51646761 1.0000000 0.37287463 0.16706197 0.4440595
## Informatic 0.3885804 0.31995199 0.3728746 1.00000000 0.07564364 0.2496737
## Economy 0.3658749 0.20681450 0.1670620 0.07564364 1.00000000 0.3714455
## Option1 0.5367073 0.40440385 0.4440595 0.24967368 0.37144552 1.0000000
## Option2 0.1963220 0.06157671 0.1118533 0.08580382 0.33924787 0.2039705
## English 0.1140047 -0.11973500 0.1869370 0.13095780 0.40634881 0.0938390
## Sport 0.2347202 0.15834887 0.2693675 0.06243446 0.17844197 0.2594090
## Option2 English Sport
## Algebra 0.19632201 0.11400473 0.23472019
## Analysis 0.06157671 -0.11973500 0.15834887
## Proba 0.11185328 0.18693698 0.26936752
## Informatic 0.08580382 0.13095780 0.06243446
## Economy 0.33924787 0.40634881 0.17844197
## Option1 0.20397055 0.09383900 0.25940896
## Option2 1.00000000 0.02127238 0.07661760
## English 0.02127238 1.00000000 0.13728734
## Sport 0.07661760 0.13728734 1.00000000
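From this matrix, the strongest pairwise association is between Algebra and Option1 (0.54). To locate it programmatically:
offdiag <- Sigma # copy with the diagonal masked out
diag(offdiag) <- NA
which(offdiag == max(offdiag, na.rm=TRUE), arr.ind=TRUE) # strongest off-diagonal pair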
evec = eigen(Sigma)$vectors
(eval = eigen(Sigma)$values)
## [1] 3.1013578 1.3629834 1.0323269 0.9340533 0.7397529 0.5746693 0.5325414
## [8] 0.4375395 0.2847754
The ordered eigenvalues are:
\(\lambda_1 = Var(Z_1) = 3.1013578\)
\(\lambda_2 = Var(Z_2) = 1.3629834\)
\(\lambda_3 = Var(Z_3) = 1.0323269\)
\(\lambda_4 = Var(Z_4) = 0.9340533\)
\(\lambda_5 = Var(Z_5) = 0.7397529\)
\(\lambda_6 = Var(Z_6) = 0.5746693\)
\(\lambda_7 = Var(Z_7) = 0.5325414\)
\(\lambda_8 = Var(Z_8) = 0.4375395\)
\(\lambda_9 = Var(Z_9) = 0.2847754\)
With this ordering, the \(i^{th}\) PC is the eigenvector corresponding to \(\lambda_i\); these eigenvectors are our principal components, and each \(\lambda_i\) is the variance of the \(i^{th}\) PC score vector \(Z_i\).
pr.out <- prcomp(data.deug, scale=TRUE)
pr.out$rotation
## PC1 PC2 PC3 PC4 PC5 PC6
## Algebra 0.4499972 -0.07914523 0.0471192 -0.06031393 0.09364452 -0.35788011
## Analysis 0.3709055 -0.39627195 0.1062237 0.04140274 0.29340513 0.46128485
## Proba 0.4207824 -0.20740106 -0.2092587 0.01933918 -0.07006815 0.50306851
## Informatic 0.3002324 -0.24091467 -0.2375823 -0.53216019 -0.50423346 -0.27346361
## Economy 0.3145059 0.53687259 0.1442512 -0.06352658 0.35740240 0.02659076
## Option1 0.4211180 -0.01371946 0.1419643 0.13812595 0.27108324 -0.48696475
## Option2 0.1894393 0.32205782 0.6905267 -0.14506372 -0.49097358 0.23677894
## English 0.1564407 0.57374155 -0.5728454 -0.21630766 0.02022396 0.18099207
## Sport 0.2368946 0.11970164 -0.2007995 0.78748504 -0.45088419 -0.06792421
## PC7 PC8 PC9
## Algebra -0.06047918 0.77634677 -0.20533160
## Analysis 0.42112415 -0.16366337 -0.43519930
## Proba -0.49482138 0.07698087 0.47644731
## Informatic 0.33309171 -0.22661684 0.14436684
## Economy 0.44630634 -0.00238415 0.51026585
## Option1 -0.39603411 -0.55488478 -0.08875721
## Option2 -0.15766204 -0.02684335 -0.19955864
## English -0.14654224 -0.06652162 -0.46080281
## Sport 0.24664572 -0.01304809 -0.01359074
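Each loading vector has unit length by construction (the rotation matrix is orthonormal); a quick check:
colSums(pr.out$rotation^2) # each column sums to 1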
(pr.out$sdev)^2
## [1] 3.1013578 1.3629834 1.0323269 0.9340533 0.7397529 0.5746693 0.5325414
## [8] 0.4375395 0.2847754
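As a consistency check, the squared standard deviations from prcomp() reproduce the eigenvalues of the correlation matrix, and their sum equals the number of variables (the trace of a correlation matrix is p = 9):
all.equal(pr.out$sdev^2, eval) # TRUE: prcomp and eigen(cor(X)) agree
sum(eval) # 9, the total variance of the standardized data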
plot(pr.out,type="l", main = "scree plot")
(pve = eval/sum(eval))
## [1] 0.34459531 0.15144260 0.11470299 0.10378370 0.08219477 0.06385214 0.05917127
## [8] 0.04861550 0.03164171
yvec = cumsum(pve)
plot(yvec, type='b', xlab="Principal Components",
ylab="Cumulative variance", main = " ", ylim=c(0,1))
abline(h=0.8,lty=2)
The x-axis of the scree plot is the number of principal components; the y-axis is the proportion of variance explained, which ranges from 0 to 1. We usually keep the first k PCs that together explain at least 80% of the variance.
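The 80% cut-off can be read off programmatically from the cumulative proportions computed above:
which(yvec >= 0.8)[1] # smallest number of PCs explaining at least 80% of the variance (6 here)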
plot(pve, xlab="Principal Component", type='b',
ylab="Proportion of Variance Explained", ylim=c(0,1))
The elbow rule for deciding how many PCs to keep: find the cut-off where the curve bends like an elbow. By this rule, the best number of PCs is 2.
# Plot the first two PCs; loadings are only identified up to sign,
# so flip them for a more conventional orientation
pr.out$rotation = -pr.out$rotation
pr.out$x = -pr.out$x
biplot(pr.out, scale=0)
library(devtools)
## Loading required package: usethis
library(ggbiplot)
## Loading required package: plyr
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
## Loading required package: scales
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
## Loading required package: grid
ggbiplot(pr.out)
pr.out$rotation = -pr.out$rotation # flip the signs back to the original orientation
pr.out$x = -pr.out$x
ggbiplot(pr.out, labels=rownames(data.deug),labels.size = 2,varname.size = 3)
pr.out$rotation[,1:2]
## PC1 PC2
## Algebra 0.4499972 -0.07914523
## Analysis 0.3709055 -0.39627195
## Proba 0.4207824 -0.20740106
## Informatic 0.3002324 -0.24091467
## Economy 0.3145059 0.53687259
## Option1 0.4211180 -0.01371946
## Option2 0.1894393 0.32205782
## English 0.1564407 0.57374155
## Sport 0.2368946 0.11970164
The biplot displays both the PC scores and the loadings.
The black observation labels represent the scores on the first two PCs, whereas the arrows represent the first two PC loading vectors (with axes on the top and right).
For example, the loading for Algebra on PC1 is 0.45 and its loading on PC2 is -0.08, so the Algebra label sits at the point (0.45, -0.08).
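These coordinates can also be read directly from the fitted object rather than off the plot:
pr.out$rotation["Algebra", 1:2] # PC1 = 0.45, PC2 = -0.08, matching the table above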
#install.packages("pls")
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:corrplot':
##
## corrplot
## The following object is masked from 'package:stats':
##
## loadings
set.seed(23)
pcr.fit = pcr(Algebra~., data=data.deug, scale=TRUE,validation="CV")
summary(pcr.fit)
## Data: X dimension: 104 8
## Y dimension: 104 1
## Fit method: svdpc
## Number of components considered: 8
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 10.54 8.147 8.348 8.450 8.572 8.703 8.792
## adjCV 10.54 8.131 8.315 8.412 8.527 8.650 8.722
## 7 comps 8 comps
## CV 9.017 9.041
## adjCV 8.946 8.966
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.12 49.06 61.94 73.59 82.79 89.69 96.33 100.00
## Algebra 43.29 44.04 44.17 44.32 44.47 45.09 45.15 45.74
The number of components considered is 8 (the number of predictors), and the error estimates are cross-validated using 10 random segments (10-fold CV).
The summary() function reports the percentage of variance explained in the predictors and in the response for each number of components.
We can think of this as the amount of information about the predictors or the response that is captured using M principal components. For example, setting M=1 captures only 32.12% of all the variance, or information, in the predictors.
In contrast, using M=7 increases the value to 96.33%. If we were to use all M=p=8 components, this would increase to 100%.
validationplot(pcr.fit, val.type="MSEP", main = "Cross-validated mean squared error of prediction")
We see that the smallest cross-validation error (MSEP) occurs when M=1 component is used.
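Rather than reading the minimum off the plot, the cross-validated MSEP values can be extracted numerically with pls's MSEP(); a small sketch (the first entry of the returned array is the intercept-only model, hence the offset of one):
msep = MSEP(pcr.fit, estimate="CV")
which.min(msep$val["CV", 1, ]) - 1 # number of components at the CV minimum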
set.seed(24)
train = sample(1:nrow(data.deug), nrow(data.deug)/2) # select 50% of the total data (nrow(data.deug)/2=104/2=52) as the training sample.
pcr.fit2 = pcr(Algebra~., data=data.deug, subset=train, scale=TRUE, validation="CV")
validationplot(pcr.fit2, val.type="MSEP", main="MSE by number of components")
x = model.matrix(Algebra~.,data.deug)[,-1]
y = data.deug$Algebra
test = (-train)
pcr.pred = predict(pcr.fit,x[test,],ncomp=5) # note: pcr.fit was fit on the full data; strictly, the train-only pcr.fit2 should be used here to avoid leakage
y.test = y[test]
(mean((pcr.pred-y.test)^2)) # test MSE
## [1] 65.3277
This test set MSE of 65.3277 is competitive with the results obtained using other regression methods (ridge regression and the lasso).
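As a rough, informal benchmark (an addition to the original analysis), we can compare this against simply predicting every test score with the training-set mean; PCR should improve substantially on it:
mean((mean(y[train]) - y.test)^2) # naive test MSE using only the training mean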
However, as a result of the way PCR is implemented, the final model is more difficult to interpret because it does not perform any kind of variable selection or even directly produce coefficient estimates, though implied coefficients can be recovered, as sketched below.
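A minimal sketch using pls's coef() method to back-transform the 5-component fit into coefficients on the original (standardized) predictors:
coef(pcr.fit, ncomp=5) # implied regression coefficients for the 5-component model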
We then fit PCR on the full data set using M=5 because, for the model fit on the training sample (pcr.fit2), the lowest cross-validation error occurs when M=5 components are used.
library(ISLR)
pcr.fit7 = pcr(Algebra~.,data=data.deug,scale=TRUE,ncomp=5)
summary(pcr.fit7)
## Data: X dimension: 104 8
## Y dimension: 104 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 32.12 49.06 61.94 73.59 82.79
## Algebra 43.29 44.04 44.17 44.32 44.47