An Introduction to PCA in R

Annotated and presented by Kat Placek:


# Use-R group 6 July 2017

# largely following this tutorial http://planspace.org/2013/02/03/pca-3d-visualization-and-clustering-in-r/

# principal component analysis (PCA) is a form of multidimensional scaling: a linear transformation of the variables into a lower-dimensional space that retains as much information about the original variables as possible.

# here we are using the iris data set, which has length (L) and width (W) measurements for both sepals and petals for 3 different species of irises, with 50 observations per species.

data(iris)
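# as a quick check of that description, count the observations per species and look at the structure of the data

table(iris$Species)

str(iris)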

# this is a little tweak to the order of the Species factor levels so that the species colours line up nicely with the k-means cluster colours later on

iris$Species <- factor(iris$Species, levels = c("versicolor","virginica","setosa"))

# we can see the first six rows (one row per observation) of the iris dataset

head(iris)

# we check the pairwise correlations among the four measurements (sepal L and W, petal L and W)

round(cor(iris[,1:4]), 2)

# now we run the PCA on the correlation matrix of the four measurements

# note: you can also run PCA on the covariance matrix. the covariance matrix is appropriate when the variables are on a comparable scale; the correlation matrix standardizes each variable first, which is the safer choice here because the four measurements have quite different variances (petal length varies much more than sepal width, for example)

# more info: https://stats.stackexchange.com/questions/53/pca-on-correlation-or-covariance

pc <- princomp(iris[,1:4], cor=TRUE, scores=TRUE)
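# for comparison (just a sketch, not used later on): rerunning princomp() with cor=FALSE works on the covariance matrix instead, and petal length then dominates the first component simply because it has by far the largest raw variance

pc_cov <- princomp(iris[,1:4], cor=FALSE, scores=TRUE)

summary(pc_cov)

# the raw variances of the four measurements, to show why the choice of matrix matters
apply(iris[,1:4], 2, var)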

# back to the correlation-based PCA (pc): the summary shows the proportion of variance accounted for by each component. the number of components returned ALWAYS equals the number of variables you include in the analysis - in our case 4 (L and W for both sepals and petals)

summary(pc)
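# the proportions reported by summary() can also be computed by hand from the component standard deviations, which shows where they come from

round(pc$sdev^2 / sum(pc$sdev^2), 4)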

# we can plot the variance of each component using a scree plot

plot(pc,type="lines")

# we can see the new scores assigned to each observation for each of the 4 components

pc$scores

# and we can see the loadings (the weight by which each standardized original variable was multiplied to get the component score)

loadings(pc)
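# as a quick sanity check (a small sketch, not part of the original tutorial): the scores are just the centered and scaled data multiplied by the loadings. we reuse pc$center and pc$scale because princomp() scales by standard deviations computed with divisor n rather than n - 1, so calling scale() with its defaults would not match exactly

manual_scores <- scale(iris[,1:4], center = pc$center, scale = pc$scale) %*% unclass(loadings(pc))

head(manual_scores)

head(pc$scores)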

# we can also plot the variables and the individual scores on the same diagram (this is called a biplot), which helps us interpret the principal axes while seeing where individual observations fall

biplot(pc)

# let's make a 3D plot of the first three components and see how they map onto the 3 iris species

if (!requireNamespace("scatterplot3d", quietly = TRUE)) install.packages("scatterplot3d")

library(scatterplot3d)

colors <- c("#FF0000", "#00FF00", "#0000FF")

colors1 <- colors[as.numeric(iris$Species)]

s3d <- scatterplot3d(pc$scores[,1:3], color=colors1, main="actual species")

legend(s3d$xyz.convert(3, 0, 0), legend = levels(iris$Species), col = colors, pch = 1)

# k-means clustering - for more on this with the iris dataset, go to https://datascienceplus.com/k-means-clustering-in-r/

# k-means clustering is an unsupervised learning algorithm that groups observations by how similar they are. unsupervised means there is no outcome to be predicted; the algorithm just looks for structure in the data. with k-means we have to specify the number of clusters we want the data grouped into. the algorithm starts by randomly assigning each observation to a cluster and computing the centroid of each cluster, and then iterates through two steps (a plain-R sketch of these two steps follows the list below):

# 1 Reassign data points to the cluster whose centroid is closest.

# 2 Calculate new centroid of each cluster.
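# purely as an illustration of those two steps (not part of the original tutorial), here is a minimal hand-rolled sketch in base R; the names X, centroids and assignment are ours, the fixed 10 iterations are arbitrary, and it assumes no cluster ever ends up empty - kmeans() below handles all of this properly

X <- as.matrix(iris[, 1:4])

set.seed(1)   # any seed works; this one is only for the sketch

centroids <- X[sample(nrow(X), 3), ]   # start from 3 randomly chosen observations

for (i in 1:10) {
  # step 1: reassign each observation to the cluster whose centroid is closest (Euclidean distance)
  d <- as.matrix(dist(rbind(centroids, X)))[-(1:3), 1:3]
  assignment <- apply(d, 1, which.min)
  # step 2: recompute each centroid as the mean of the observations assigned to it
  centroids <- apply(X, 2, function(col) tapply(col, assignment, mean))
}

table(assignment, iris$Species)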

# since the initial cluster assignments are random, let us set the seed to ensure reproducibility

set.seed(20)

# cluster the observations on the four measurements (sepal and petal L and W), asking for 3 clusters - we choose 3 because we expect the clusters to reflect the 3 flower species

cl <- kmeans(iris[,1:4],3)
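# because the starting assignments are random, a single run can land in a poor local optimum. an optional safeguard (not in the original tutorial) is kmeans()'s nstart argument, which tries several random starts and keeps the solution with the smallest total within-cluster sum of squares

cl_multi <- kmeans(iris[,1:4], 3, nstart = 25)

cl_multi$tot.withinss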

# save the cluster scores as a factor in our iris dataset

iris$cluster <- as.factor(cl$cluster)

# plot clusters vs PCA components

colors2 <- colors[as.numeric(iris$cluster)]

s3d2 <- scatterplot3d(pc$scores[,1:3], color=colors2, main="k-means clusters")

# note that the legend here labels the k-means clusters; their numbering is arbitrary, so it need not match the species colours above

legend(s3d2$xyz.convert(3, 0, 0), legend = paste("cluster", levels(iris$cluster)), col = colors, pch = 1)

with(iris, table(cluster, Species))

# so k-means recovered all of the setosas perfectly but made some mistakes with the other two species, lumping too many of the versicolor and virginica flowers into a single cluster
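# to put a rough number on that agreement (a quick base-R sketch), match each cluster to its most common species and count how many observations fall into those cells

tab <- with(iris, table(cluster, Species))

sum(apply(tab, 1, max)) / nrow(iris)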
