R Markdown

HW7

Problem 1: (25 points)

head(data.deug)
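
The analysis assumes data.deug is already in the workspace. A minimal loading sketch, under the assumption that it is the grade table of the deug data set shipped with the ade4 package (104 students, 9 subjects):

# Assumption: data.deug is the score table deug$tab from ade4's 'deug' data set.
library(ade4)
data(deug)
data.deug <- deug$tab  # 104 rows (students) x 9 columns (subjects)
head(data.deug)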

Present a data analysis report to obtain and interpret the best principal components (PCs) using the deug data set. Do the following:

a. (3 points/25) Present the exploratory visualization: scatterplot, correlation matrix plot, boxplot, means and variances.

Scatterplot

library("PerformanceAnalytics")
chart.Correlation(data.deug, histogram = TRUE, pch = 15)

# install.packages("corrplot")
library(corrplot)
corrMat <- cor(data.deug)

corrplot(corrMat, method = "number")  # display the correlation coefficients
corrplot.mixed(corrMat, lower.col = "black", number.cex = .7)

Boxplot

boxplot(data.deug, col="pink")

Means and Variances

apply(data.deug,2,mean) #mean
##    Algebra   Analysis      Proba Informatic    Economy    Option1    Option2 
##  45.567308  33.985577  31.240385  26.990385  69.550962  22.750000  22.668269 
##    English      Sport 
##  21.125962   9.230769
apply(data.deug,2,sd) # standard deviations (the variances are their squares)
##    Algebra   Analysis      Proba Informatic    Economy    Option1    Option2 
##  10.489722   8.811802  13.381987   8.247383   9.609744   5.204087   4.759564 
##    English      Sport 
##   4.261956   4.944313

If we fail to scale the variables before performing PCA, most of the PCs we observe will be driven by the variables with the largest variances (here, for example, Proba and Algebra), simply because they dominate the total variation. It is therefore important to standardize the variables to have mean zero and standard deviation one before performing PCA.
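
A quick way to see this is to compare the proportion of variance captured by PC1 with and without scaling; a minimal sketch using prcomp():

pr.raw    <- prcomp(data.deug, scale = FALSE)  # PCA on the covariance matrix
pr.scaled <- prcomp(data.deug, scale = TRUE)   # PCA on the correlation matrix
summary(pr.raw)$importance[2, 1]     # PVE of PC1 unscaled: inflated by the high-variance subjects
summary(pr.scaled)$importance[2, 1]  # PVE of PC1 scaled: 0.3446, as in part (f)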

b. (3 points/25) Compute the correlation matrix (using the centered and scaled data).

Standardize the data (Center and scale)

deug.centered <- scale(data.deug,scale=TRUE,center=TRUE)
head(deug.centered)
##      Algebra    Analysis       Proba Informatic    Economy     Option1
## 1 -0.5307393 -0.90623657 -0.39159989 -0.1200847 -1.8367775 -1.10490076
## 2 -0.8167336  0.05837888  0.43040061  0.6074188  0.2548495  0.24019582
## 3 -0.8167336  0.79602598 -0.16741793  0.9105453  0.2548495  0.24019582
## 4  1.6618832  0.39883139  1.92494697  1.0317958  0.8167791  0.04803916
## 5  0.8992319 -0.28207364  0.20621866  1.0924211 -1.2124112 -0.72058746
## 6  0.4225748  0.45557347  0.05676402 -0.8475882 -0.2758618 -0.52843080
##       Option2     English      Sport
## 1  0.27980100 -0.49882294  0.4589578
## 2 -0.14040556  1.14361533  0.4589578
## 3  0.91011084 -0.35804252  0.4589578
## 4  0.06969772 -0.02955486  0.9645892
## 5  0.27980100  0.67434725  0.4589578
## 6 -1.61112851  0.25200598 -1.8669469

Compute the correlation matrix

(Sigma = cor(data.deug)) # the correlation matrix; correlation is scale-invariant, so this equals cor(deug.centered)
##              Algebra    Analysis     Proba Informatic    Economy   Option1
## Algebra    1.0000000  0.44496518 0.5042599 0.38858039 0.36587492 0.5367073
## Analysis   0.4449652  1.00000000 0.5164676 0.31995199 0.20681450 0.4044038
## Proba      0.5042599  0.51646761 1.0000000 0.37287463 0.16706197 0.4440595
## Informatic 0.3885804  0.31995199 0.3728746 1.00000000 0.07564364 0.2496737
## Economy    0.3658749  0.20681450 0.1670620 0.07564364 1.00000000 0.3714455
## Option1    0.5367073  0.40440385 0.4440595 0.24967368 0.37144552 1.0000000
## Option2    0.1963220  0.06157671 0.1118533 0.08580382 0.33924787 0.2039705
## English    0.1140047 -0.11973500 0.1869370 0.13095780 0.40634881 0.0938390
## Sport      0.2347202  0.15834887 0.2693675 0.06243446 0.17844197 0.2594090
##               Option2     English      Sport
## Algebra    0.19632201  0.11400473 0.23472019
## Analysis   0.06157671 -0.11973500 0.15834887
## Proba      0.11185328  0.18693698 0.26936752
## Informatic 0.08580382  0.13095780 0.06243446
## Economy    0.33924787  0.40634881 0.17844197
## Option1    0.20397055  0.09383900 0.25940896
## Option2    1.00000000  0.02127238 0.07661760
## English    0.02127238  1.00000000 0.13728734
## Sport      0.07661760  0.13728734 1.00000000
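
Because the variables have been standardized, the covariance matrix of deug.centered is exactly this correlation matrix; a one-line sanity check:

all.equal(cov(deug.centered), Sigma, check.attributes = FALSE)  # TRUE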

c. (3 points/25) Compute the principal components (PCs) using the correlation matrix

Compute the Eigenvectors and Eigenvalues from the correlation matrix.

evec = eigen(Sigma)$vectors
(eval = eigen(Sigma)$values) 
## [1] 3.1013578 1.3629834 1.0323269 0.9340533 0.7397529 0.5746693 0.5325414
## [8] 0.4375395 0.2847754
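
Two quick checks on the decomposition: the eigenvectors (loading vectors) are orthonormal, and the eigenvalues of a correlation matrix sum to the number of variables:

max(abs(t(evec) %*% evec - diag(9)))  # ~0: the loading vectors are orthonormal
sum(eval)                             # 9, the number of standardized variables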

\(\lambda_1 = 3.1013578\)
\(\lambda_2 = 1.3629834\)
\(\lambda_3 = 1.0323269\)
\(\lambda_4 = 0.9340533\)
\(\lambda_5 = 0.7397529\)
\(\lambda_6 = 0.5746693\)
\(\lambda_7 = 0.5325414\)
\(\lambda_8 = 0.4375395\)
\(\lambda_9 = 0.2847754\)

With this ordering, the \(i^{th}\) PC is defined by the eigenvector corresponding to \(\lambda_i\): the eigenvectors are the principal component loading vectors, and the eigenvalues are the variances of the PCs.

d. (3 points/25) Write the expression for each principal component

Each PC is a linear combination of the standardized variables: \(Z_i = \phi_{1i} X_1 + \phi_{2i} X_2 + \cdots + \phi_{9i} X_9\), where \(\phi_i\) is the \(i^{th}\) eigenvector (loading vector) and \(Var(Z_i) = \lambda_i\):

\(\lambda_1 = 3.1013578 = Var(Z_1)\)
\(\lambda_2 = 1.3629834 = Var(Z_2)\)
\(\lambda_3 = 1.0323269 = Var(Z_3)\)
\(\lambda_4 = 0.9340533 = Var(Z_4)\)
\(\lambda_5 = 0.7397529 = Var(Z_5)\)
\(\lambda_6 = 0.5746693 = Var(Z_6)\)
\(\lambda_7 = 0.5325414 = Var(Z_7)\)
\(\lambda_8 = 0.4375395 = Var(Z_8)\)
\(\lambda_9 = 0.2847754 = Var(Z_9)\)
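
Written out with the loading values (rounded; they match the prcomp() rotation in part (e), though eigen() may return each eigenvector with the opposite sign), the first PC is, as a sketch:

# Column i of evec is the loading vector of Z_i; signs are arbitrary.
phi1 <- evec[, 1]
names(phi1) <- colnames(data.deug)
round(phi1, 3)
# Up to sign:
# Z1 = 0.450 Algebra + 0.371 Analysis + 0.421 Proba + 0.300 Informatic
#    + 0.315 Economy + 0.421 Option1 + 0.189 Option2 + 0.156 English
#    + 0.237 Sport   (all variables standardized)
# Z2, ..., Z9 use the remaining columns of evec in the same way.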

e. (3 points/25) Compute the PCs from the Singular Value Decomposition (SVD).

prcomp() centers (and, with scale=TRUE, scales) the data and computes the PCs via the Singular Value Decomposition (SVD) of the resulting data matrix.

pr.out <- prcomp(data.deug, scale=TRUE)

pr.out$rotation
##                  PC1         PC2        PC3         PC4         PC5         PC6
## Algebra    0.4499972 -0.07914523  0.0471192 -0.06031393  0.09364452 -0.35788011
## Analysis   0.3709055 -0.39627195  0.1062237  0.04140274  0.29340513  0.46128485
## Proba      0.4207824 -0.20740106 -0.2092587  0.01933918 -0.07006815  0.50306851
## Informatic 0.3002324 -0.24091467 -0.2375823 -0.53216019 -0.50423346 -0.27346361
## Economy    0.3145059  0.53687259  0.1442512 -0.06352658  0.35740240  0.02659076
## Option1    0.4211180 -0.01371946  0.1419643  0.13812595  0.27108324 -0.48696475
## Option2    0.1894393  0.32205782  0.6905267 -0.14506372 -0.49097358  0.23677894
## English    0.1564407  0.57374155 -0.5728454 -0.21630766  0.02022396  0.18099207
## Sport      0.2368946  0.11970164 -0.2007995  0.78748504 -0.45088419 -0.06792421
##                    PC7         PC8         PC9
## Algebra    -0.06047918  0.77634677 -0.20533160
## Analysis    0.42112415 -0.16366337 -0.43519930
## Proba      -0.49482138  0.07698087  0.47644731
## Informatic  0.33309171 -0.22661684  0.14436684
## Economy     0.44630634 -0.00238415  0.51026585
## Option1    -0.39603411 -0.55488478 -0.08875721
## Option2    -0.15766204 -0.02684335 -0.19955864
## English    -0.14654224 -0.06652162 -0.46080281
## Sport       0.24664572 -0.01304809 -0.01359074
(pr.out$sdev)^2
## [1] 3.1013578 1.3629834 1.0323269 0.9340533 0.7397529 0.5746693 0.5325414
## [8] 0.4375395 0.2847754
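
As a check that prcomp() really is an SVD computation, an explicit SVD of the standardized data reproduces the same variances (column signs may differ):

sv <- svd(deug.centered)            # deug.centered = U D V'
sv$d^2 / (nrow(deug.centered) - 1)  # equals (pr.out$sdev)^2 above
head(sv$v)                          # loading vectors; match pr.out$rotation up to sign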

f. (3 points/25) Present the scree plot and interpret it.

plot(pr.out,type="l", main = "scree plot")

(pve = eval/sum(eval)) # proportion of variance explained by each PC
## [1] 0.34459531 0.15144260 0.11470299 0.10378370 0.08219477 0.06385214 0.05917127
## [8] 0.04861550 0.03164171
yvec = cumsum(pve)  # cumulative proportion of variance explained
plot(yvec, type='b', xlab="Principal Components",
     ylab="Cumulative variance", ylim=c(0,1))
abline(h=0.8, lty=2)

The x-axis of the scree plot indexes the principal components, and the y-axis shows the proportion of variance explained (PVE), which ranges from 0 to 1. We usually keep the first \(k\) PCs that together explain at least 80% of the variance; the dashed line in the cumulative plot shows that the 80% threshold is first crossed at six components here.
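
The required number of components can be read off directly from the pve vector computed above:

which(cumsum(pve) >= 0.8)[1]  # 6: the first six PCs explain at least 80% of the variance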

g. (3 points/25) Select the best PCs using the Elbow rule

plot(pve, xlab="Principal Component", type='b', 
  ylab="Proportion of Variance Explained", ylim=c(0,1))

The elbow rule picks the cut-off where the curve bends like an elbow, i.e., where the drop in PVE flattens out. Here the curve flattens after the second component, so the best number of PCs is 2.

h. (4 points/25) Present and interpret the Biplot

# Plot the first two PCs. The sign of a PC is arbitrary, so we flip the
# loadings and scores to make the biplot easier to read.
pr.out$rotation = -pr.out$rotation
pr.out$x = -pr.out$x
biplot(pr.out, scale=0)

library(devtools)
library(ggbiplot)
ggbiplot(pr.out)

# Flip the signs back to the original prcomp() orientation for the labelled biplot
pr.out$rotation = -pr.out$rotation
pr.out$x = -pr.out$x
ggbiplot(pr.out, labels=rownames(data.deug), labels.size = 2, varname.size = 3)

pr.out$rotation[,1:2]
##                  PC1         PC2
## Algebra    0.4499972 -0.07914523
## Analysis   0.3709055 -0.39627195
## Proba      0.4207824 -0.20740106
## Informatic 0.3002324 -0.24091467
## Economy    0.3145059  0.53687259
## Option1    0.4211180 -0.01371946
## Option2    0.1894393  0.32205782
## English    0.1564407  0.57374155
## Sport      0.2368946  0.11970164

The biplot displays both the PC scores and the PC loadings.

The black labels (the students' row numbers) represent the scores on the first two PCs, whereas the orange arrows represent the first two PC loading vectors (with axes on the top and right).

For example, in the first biplot the loading of Algebra on PC1 is -0.45 and its loading on PC2 is 0.08 (the word Algebra is centered at the point (-0.45, 0.08)); the signs differ from the prcomp() output above only because we flipped them before plotting, and the sign of a PC is arbitrary.

Problem 2: (25 points)

a. (5 points/25) Calculate the PCR fit and use cross-validation to determine the number of components. To be reproducible, use set.seed(23).

#install.packages("pls")
library(pls)
set.seed(23)
pcr.fit = pcr(Algebra~., data=data.deug, scale=TRUE,validation="CV")
summary(pcr.fit)
## Data:    X dimension: 104 8 
##  Y dimension: 104 1
## Fit method: svdpc
## Number of components considered: 8
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           10.54    8.147    8.348    8.450    8.572    8.703    8.792
## adjCV        10.54    8.131    8.315    8.412    8.527    8.650    8.722
##        7 comps  8 comps
## CV       9.017    9.041
## adjCV    8.946    8.966
## 
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X          32.12    49.06    61.94    73.59    82.79    89.69    96.33   100.00
## Algebra    43.29    44.04    44.17    44.32    44.47    45.09    45.15    45.74

Eight components were considered in total, but cross-validation selects M = 1: the CV error (RMSEP = 8.147) is smallest with one component.

b. (5 points/25) Interpret the summary of the function pcr.

The model was cross-validated using 10 random segments (10-fold CV).

The summary() function provides the percentage of variance explained in the predictors and in the response using different numbers of components.

We can think of this as the amount of information about the predictors or the response that is captured using M principal components. For example, setting M = 1 captures only 32.12% of all the variance, or information, in the predictors.

In contrast, using M=7 increases the value to 96.33%. If we were to use all M=p=8 components, this would increase to 100%.

c. (5 points/25) Plot cross-validation Mean Square Error Prediction (MSEP)

validationplot(pcr.fit, val.type="MSEP", main = "cross-validation Mean Square Error Prediction")

We see that the smallest cross-validation error (MSEP) occurs when M = 1 component is used.
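
The same conclusion can be read off programmatically; a short sketch using the pls package's RMSEP() accessor:

cv.err <- RMSEP(pcr.fit)$val["CV", 1, ]  # intercept-only model, then 1-8 components
which.min(cv.err) - 1                    # 1: the CV error is minimized at M = 1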

d. (5 points/25) Perform PCR on the training data (select 50%) with the right number of components and evaluate its test set performance. Plot MSE by number of components. To be reproducible, use set.seed(24).

set.seed(24)
train = sample(1:nrow(data.deug), nrow(data.deug)/2) # select  50% of the total data (nrow(data.deug)/2=104/2=52) as the training sample.


pcr.fit2 = pcr(Algebra~., data=data.deug, subset=train, scale=TRUE, validation="CV")

validationplot(pcr.fit2, val.type="MSEP", main="MSE by number of components")

Test set performance

x = model.matrix(Algebra~., data.deug)[,-1]  # predictor matrix (drop the intercept column)
y = data.deug$Algebra
test = (-train)  # test set: all rows not sampled into train
pcr.pred = predict(pcr.fit2, x[test,], ncomp=5)  # predict with the training-data fit; M = 5 from its validation plot
y.test = y[test]
(mean((pcr.pred-y.test)^2)) # test MSE
## [1] 65.3277

This test set MSE of 65.3277 is competitive with the results obtained using other regression methods such as ridge regression and the lasso.

However, as a result of the way PCR is implemented, the final model is more difficult to interpret because it does not perform any kind of variable selection or even directly produce coefficient estimates.
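
To see how the choice of M affects the test error, a small sketch that loops the prediction above over all component counts:

test.mse <- sapply(1:8, function(m)
  mean((predict(pcr.fit2, x[test, ], ncomp = m) - y.test)^2))
plot(test.mse, type = "b", xlab = "Number of components", ylab = "Test MSE")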

e. (5 points/25) Perform PCR on the full data

We fit PCR on the full data set using M = 5, because the lowest cross-validation error for the training fit in part (d) occurs when M = 5 components are used.

library(ISLR)
pcr.fit7 = pcr(Algebra~.,data=data.deug,scale=TRUE,ncomp=5)
summary(pcr.fit7)
## Data:    X dimension: 104 8 
##  Y dimension: 104 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps
## X          32.12    49.06    61.94    73.59    82.79
## Algebra    43.29    44.04    44.17    44.32    44.47
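
Although PCR does not directly produce coefficient estimates, the pls package can map the M = 5 fit back to coefficients on the original predictors; a one-line sketch:

coef(pcr.fit7, ncomp = 5)  # implied regression coefficients of the 5-component fit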