Simple And Complete Tutorial For Understanding Principal Component Analysis

Understanding Principal Component Analysis

In most practical applications of data science, you will end up with data with a lot of dimensions.

It is not easy to process all these dimensions because of considerations of cost and processing power limitations.

These data dimensions could be highly correlated to each other so you will lose a lot of time and money, without much to show for it.

Want to be a smarter data scientist and work only with uncorrelated dimensions?

You have to master Principal Component Analysis. It is one of the most used techniques for dimensionality reduction.

In this post, these are the steps we are going to take to master just that:

1. Understand the Important Math Terms Behind PCA
2. Singular Value Decomposition
3. Why Principal Component Analysis Works
4. Step By Step Implementation of PCA
5. Tips To Be Careful About with PCA

MATHS AND IMPORTANT TERMS TO KNOW BEFORE STARTING WITH PCA

If you don’t understand linear algebra, it can be tricky to understand how PCA works, so I am going to first cover the related topics briefly. If you are still lost, I will link to related articles.

Variance

Q. Which of these distributions of data has more variance?

Image From Wikipedia, https://en.wikipedia.org/wiki/Variance#/media/File:Comparison_standard_deviations.svg

ANSWER:- The blue population has higher variation because the data is more spread out from the mean.

Variance is a measure of the spread of a variable in a data set. It is the square of standard deviation.

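Higher variance implies the data is more spread out. In matrix form, for a mean-centered column vector $x$ with $n$ observations, we can represent it as follows:

$$\mathrm{Var}(x) = \frac{x^T x}{n - 1}$$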
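Here is a minimal sketch of that calculation. The two samples are made-up numbers chosen only so that they share a mean but differ in spread, and `sample_variance` is a helper name I am introducing for illustration.

```python
import numpy as np

# Two made-up samples with the same mean (100) but different spread.
narrow = np.array([98.0, 99.0, 100.0, 101.0, 102.0])
wide = np.array([80.0, 90.0, 100.0, 110.0, 120.0])

def sample_variance(x):
    """Sample variance computed as x_c^T x_c / (n - 1), with x_c mean-centered."""
    centered = x - x.mean()
    return centered.dot(centered) / (len(x) - 1)

print(sample_variance(narrow))  # 2.5
print(sample_variance(wide))    # 250.0

# NumPy's built-in variance with ddof=1 uses the same n - 1 denominator.
print(np.var(wide, ddof=1))
```

The wider sample has a variance 100 times larger, which matches the intuition that it is more spread out around the mean.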

Co-variance

Variance is one-dimensional: it only works with one variable, so its utility is limited to understanding a single variable.

We need to understand how much the dimensions vary with respect to each other.

Quick example: which of the two pairs of arrays will have the higher covariance?

(S1, S2)  or (S1, S3)

s1=[1,2,3,4,5,6,7,8,9,10]
s2=[1,2,3,4,5,6,7,8,9,10]
s3=[1,1,1,1,1,1,1,1,1,1]


Answer:- (S1, S2), as both variables increase together. S3 remains constant, so its covariance with S1 or S2 is 0.

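Covariance is a measure of the joint variability of two random variables. It can be positive, negative, or 0.

Here is the formula to calculate the covariance of two variables $x$ and $y$ with $n$ observations each:

$$\mathrm{cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

For a whole data set, the same calculation in matrix form is

$$C = \frac{X^T X}{n - 1}$$

Here X is a standardized matrix (each column has mean 0), with one row per observation.

In practice, data scientists usually prefer to use the NumPy function np.cov().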
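To make the quick example concrete, here is a small sketch that computes the covariance of the three arrays from the question, both by hand and with np.cov(). The helper name `covariance` is mine, not a library function.

```python
import numpy as np

s1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
s2 = s1.copy()
s3 = np.ones(10)

def covariance(x, y):
    """Sample covariance: mean-center both variables, multiply, divide by n - 1."""
    return ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

cov_s1_s2 = covariance(s1, s2)
cov_s1_s3 = covariance(s1, s3)

print(cov_s1_s2)  # positive: the two arrays rise together
print(cov_s1_s3)  # 0.0: s3 never moves away from its mean

# np.cov returns a 2x2 matrix; the off-diagonal entry is the covariance.
print(np.cov(s1, s2)[0, 1])
```

Since s3 never deviates from its own mean, every product in the sum is zero, which is why its covariance with anything is 0.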

Co-variance Matrix

In cases where we have more than 2 dimensions, we get a lot more covariance values.

For n dimensions, you will have this number of distinct covariance values (one per pair of dimensions):

$$\frac{n(n-1)}{2}$$

We can represent all these values in the form of a matrix, known as the covariance matrix.
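As a rough illustration, the sketch below builds a covariance matrix for a made-up data set with 4 dimensions and counts the distinct pairs. The random data is purely an assumption for the demo.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))          # made-up data: 50 observations, 4 dimensions

# Mean-center each column, then apply C = X^T X / (n - 1).
X_centered = X - X.mean(axis=0)
n_samples, n_dims = X.shape
cov_matrix = X_centered.T @ X_centered / (n_samples - 1)

# One distinct covariance value per unordered pair of dimensions.
pairs = list(combinations(range(n_dims), 2))

print(cov_matrix.shape)  # (4, 4)
print(len(pairs))        # 6, i.e. 4 * 3 / 2
```

The diagonal of the matrix holds the variance of each dimension, and the matrix is symmetric because cov(x, y) equals cov(y, x).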

Matrix Transformation

When we multiply a vector on the left by a matrix, we are actually transforming the axes of the data set, i.e. the basis of the vector.

Most of you will be familiar with i hat and j hat. When I say the basis of a vector, picture these axes.

In a vector representation, it implies the vector lies at 1 i + 1 j (i and j are the coordinate axes). Here is the vector representation.

When we multiply it by a 2 by 2 matrix, the two columns of the matrix respectively show where the i and j axes will land.

When we apply this transformation to the vector, the new vector is actually -1 (transformed i) + 2 (transformed j).

Transformed Vector =

Transformed Vector in the original coordinate space

Okay, simple, right?

But now let’s take it one step further. Say I don’t believe in this traditional basis system of i hat and j hat. At the end of the day, the coordinate system is a construct.

My coordinate system will have these axes

In my coordinate system, the vector after the transformation is still.

The blue vector is the Transformed Vector in the New coordinate system

Let’s now work backwards.

What if I tell you I have a vector in my coordinate system, and you, being a traditional math lover, want to stick to the original coordinate system? How do you interpret my vector?

Another way to represent this thing would be:-

This is the beginning of a cool realization: not only does a matrix transformation help you transform a vector from one coordinate system to another,

it also helps you find the coordinates, in your system, of a vector that is currently represented in some other coordinate system.

The last thing I would like to show is how, without doing any visualization, I get the coordinates of a vector in my coordinate system. How do we solve that problem?

If we take the transformation matrix to the other side, we get

Hence we can represent the above vector in my coordinate system if we multiply it with the inverse of the transformation matrix.

In the above example, we can also get the vector [1 1] in my coordinate system by multiplying the original vector [3 5] by the inverse of the transformation matrix.

In the case of an orthogonal matrix, which is a matrix in which each column has unit length and is perpendicular to every other column, we can work with the transpose of the transformation matrix, as it is equivalent to the inverse.

If you give it a few shots, you will start to see that matrix multiplication can not only help us change the basis of a vector but also represent the same vectors in different coordinate systems.

INSIGHT:- Rather than working with existing dimensions, it is possible to change the basis of a vector using a linear transformation
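Here is a small sketch of the change of basis in both directions. The basis matrix `B` is hypothetical: I picked it only so that [1 1] in the new system maps to [3 5] in the standard system, as in the example above. It is not necessarily the matrix from the original figures.

```python
import numpy as np

# Hypothetical basis matrix: its columns are where i hat and j hat land.
B = np.array([[1.0, 2.0],
              [2.0, 3.0]])

v_new = np.array([1.0, 1.0])     # coordinates in "my" coordinate system
v_std = B @ v_new                # the same vector in the standard system
print(v_std)                     # [3. 5.]

# Going backwards: solve B x = v_std, which is what multiplying by inv(B) does.
v_back = np.linalg.solve(B, v_std)
print(v_back)                    # [1. 1.]

# For an orthogonal matrix (here a 30 degree rotation) the transpose is the inverse.
theta = np.deg2rad(30)
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(Q.T, np.linalg.inv(Q)))  # True
```

Using np.linalg.solve instead of explicitly forming the inverse is a common numerical habit; for a 2 by 2 matrix either approach gives the same answer.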

If you still have doubts, please head over to this excellent explanation here

Eigen Vectors and Eigen Values

Eigenvectors are the vectors that, after a linear transformation, do not change their direction; they are only scaled.

Eigenvalues are the factors by which these vectors are stretched or squished:

$$A v = \lambda v$$

In the above equation we multiply a matrix transformation A by our eigenvector v, but rather than getting a vector with its axis transformed, we end up with the same vector with its magnitude scaled by the eigenvalue λ.

Lastly, the eigenvectors of a symmetric matrix can always be chosen to be perpendicular to each other, i.e. at right angles, no matter how many dimensions you have. Eigenvectors that belong to different eigenvalues are automatically perpendicular.

By the way, another word for perpendicular, in maths talk, is orthogonal.

INSIGHT – This is important because it means that you can express the data in terms of these perpendicular eigenvectors.
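A quick way to convince yourself is to check it numerically. The sketch below uses a small symmetric matrix that I made up for the demo and np.linalg.eigh, NumPy's eigendecomposition routine for symmetric matrices.

```python
import numpy as np

# A made-up symmetric matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns eigenvalues in ascending order and eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(A)
print(eigenvalues)  # approximately [1. 3.]

# Check A v = lambda v for every eigenpair.
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))  # True

# The eigenvectors are orthonormal: V^T V is the identity matrix.
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(2)))  # True
```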

The first thing we do in PCA is compute the covariance matrix. The covariance matrix helps to summarize the similarities and relationships between the different dimensions of the data.

When we calculate the eigenvalues and eigenvectors of the covariance matrix, we are basically calculating the principal directions and magnitudes of the data.

What are the directions of the data you ask?

They’re the underlying structure in the data. They are the directions where there is the most variance or the directions where the data is most spread out.

Eigenvalues show the variance explained in each direction. The higher the eigenvalue, the more of the total variance that direction explains.

https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/

As mentioned above, the eigenvectors of the (symmetric) covariance matrix are orthogonal to each other, so we can do a linear transformation and plot the data in terms of the principal components.

PRINCIPLE BEHIND  PCA

The principle behind PCA is to find the first principal component by minimizing the reconstruction (projection) error, which simultaneously maximizes the variance of the projected data.

Here is what I mean by the projection error:

Let’s now go step by step to understand what is happening here:-

1) Let’s start with a random data set that plots the relationship between two variables, x and y.

2) We can see that the dimensions x and y of the data are correlated, as both of them are increasing together.

3) Now let’s come back to our initial aim: finding the axis with the smallest projection error and the highest explained variance.

In simple terms, we want to find the line on which the projections of the data are the most spread out.

This GIF from Stack Overflow does an excellent job of visualizing our objective.

We can also understand the projection error from the gif, it is the length of red lines that connect the blue data points to the black axis.

4) We can visualize here that the line passing through the center of the data has the smallest projection error and the highest spread of data i.e maximum variance.

We can reduce the data to one dimension, as this single axis explains more than 90 percent of the variability.
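If you want to see the two objectives agree without a GIF, here is a rough sketch. It generates made-up correlated data (the slope and noise level are arbitrary assumptions), sweeps a candidate axis through 180 degrees, and records both the variance of the projections and the mean squared projection error.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.3, size=200)   # made-up correlated data
X = np.column_stack([x, y])
X = X - X.mean(axis=0)                          # center the data

variances, errors = [], []
for degrees in range(180):
    angle = np.deg2rad(degrees)
    w = np.array([np.cos(angle), np.sin(angle)])   # unit vector for the candidate axis
    proj = X @ w                                   # position of each point along the axis
    residual = X - np.outer(proj, w)               # the "red lines" from point to axis
    variances.append(proj.var(ddof=1))
    errors.append((residual ** 2).sum(axis=1).mean())

best_by_variance = int(np.argmax(variances))
best_by_error = int(np.argmin(errors))
print(best_by_variance, best_by_error)   # the same angle wins on both criteria
```

The reason is Pythagoras: for each point, squared distance from the origin equals squared projection length plus squared residual length, so making one sum larger makes the other smaller.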

Why are Eigen vectors the principal components for the data

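1. By now we are clear about the aim: we want to find the axis along which the data is most spread out.

2. We want to find a unit vector $w$, so that $\|w\| = 1$.
3. Let’s find the projection of the data matrix $X$ on this axis. It is given by the dot product $Xw$.
4. We want to maximize the variance of these projections, which is given by

$$\frac{1}{n-1}(Xw)^T (Xw) = w^T \left(\frac{X^T X}{n-1}\right) w = w^T C w$$

where $C$ is the covariance matrix of the mean-centered data.

5. Since there is a constraint on $w$, $\|w\| = 1$ or $w^T w = 1$, we need to use the Lagrange method to solve the problem. If you are not sure what the Lagrange method is, it is best you go through my blog post on support vector machines, where I cover it in more detail.

We get the equation we need to maximize as

$$L(w, \lambda) = w^T C w - \lambda (w^T w - 1)$$

6. Differentiating the above equation with respect to $w$ and setting the result to zero, we get:

$$2 C w - 2 \lambda w = 0 \quad\Rightarrow\quad C w = \lambda w$$

This is also the equation of the eigenvectors.

Substituting back, the variance along such a $w$ is $w^T C w = \lambda w^T w = \lambda$, so it is largest for the eigenvector with the largest eigenvalue. Hence we can prove that the first principal direction is given by the eigenvector with the largest eigenvalue. This is a nontrivial and surprising statement.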
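A numerical sanity check of this result, under made-up assumptions: the data below is random and the mixing matrix is arbitrary. The sketch compares the variance along the top eigenvector with the variance along 1000 random unit vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])          # arbitrary matrix to correlate the columns
X = rng.normal(size=(300, 3)) @ mixing
X = X - X.mean(axis=0)
C = X.T @ X / (X.shape[0] - 1)                # covariance matrix

# eigh sorts eigenvalues in ascending order, so the last column is the top eigenvector.
eigvals, eigvecs = np.linalg.eigh(C)
w_top = eigvecs[:, -1]
top_variance = w_top @ C @ w_top              # w^T C w for the top eigenvector

# Variance w^T C w along 1000 random unit vectors.
random_w = rng.normal(size=(1000, 3))
random_w /= np.linalg.norm(random_w, axis=1, keepdims=True)
random_variances = np.einsum('ij,jk,ik->i', random_w, C, random_w)

print(np.isclose(top_variance, eigvals[-1]))      # True: the variance equals lambda
print(random_variances.max() <= top_variance)     # True: no random direction beats it
```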

STEP BY STEP IMPLEMENTATION OF PCA

Now that we have the basics down, let’s work on the step-by-step implementation of Principal Component Analysis.

For this example we will work with the iris dataset. I chose this data set because I know a lot of you are familiar with this type of standard Kaggle data set, and it is a good place to start learning.

The flowers belong to three different species (array spec) (shown as blue, green, yellow dots in the graphs below):

• 0: setosa (blue dots), 1: versicolor (green dots), 2: virginica (yellow dots)

The data points are in 4 dimensions.

A good image to explain the variables and add some color to the article

Here is a peek into the dataset; I used the head() function from pandas.

Another thing I would like you to follow along with is the dimensions of the data, so that we know we are heading in the right direction.

Overall the data has 6 columns, but we will work with only the 4 independent variables, so the dimensions of the dataset are (150, 4).

1. Standardize the Data – We first standardize the data such that the mean is 0. This ensures that bigger values do not bias the calculations.

If the importance of features is independent of the variance of the features, then divide each observation in a column by that column’s standard deviation.

In Python, we can accomplish this by using the StandardScaler class from scikit-learn.

# Data handling and numerics
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Standardization and a reference PCA implementation
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

sns.set(color_codes=True)
# Jupyter/IPython magic: render plots inline (remove when running as a plain script)
%matplotlib inline
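The standardization step itself could look like the sketch below. Note the assumptions: I load iris from scikit-learn's bundled copy with load_iris (the original post reads a Kaggle CSV whose path is not shown), and I reuse the names `data` and `data_pca` that appear in the later snippets.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Assumption: use scikit-learn's bundled iris data instead of the Kaggle CSV.
iris = load_iris(as_frame=True)
features = iris.data                              # DataFrame with the 4 measurements
data = features.assign(Species=iris.target)       # species encoded as 0, 1, 2

# Standardize every column to mean 0 and unit (population) standard deviation.
scaler = StandardScaler()
data_pca = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)

print(data_pca.shape)                 # (150, 4)
print(data_pca.mean().round(6))       # all approximately 0
```

Keeping `data_pca` as a DataFrame means corr() works on it directly in the next step.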


2. Calculate the Covariance matrix – Now we calculate the covariance of the data. Let’s first visualize the correlation matrix, to decide whether we need PCA or not.

Here is the code. I use the corr() function to find the correlation values and then use seaborn to plot the values in the form of a heatmap. For this particular task, I always prefer to use seaborn over matplotlib.


# Pairwise correlation between the four features (data_pca is a DataFrame)
correlation = data_pca.corr()
plt.figure(figsize=(10, 10))
sns.heatmap(correlation, vmax=1, square=True, annot=True, cmap='cubehelix')

plt.title('Correlation between different features')



For calculating the covariance matrix, we can use the NumPy function np.cov(), or in this case let’s do it using matrix algebra.

cov_mat = (data_pca).T.dot(data_pca) / (data_pca.shape[0]-1)
print('Covariance matrix \n%s' %cov_mat)


We get the output
Covariance matrix
[[ 1.00671141 -0.11010327  0.87760486  0.82344326]
 [-0.11010327  1.00671141 -0.42333835 -0.358937  ]
 [ 0.87760486 -0.42333835  1.00671141  0.96921855]
 [ 0.82344326 -0.358937    0.96921855  1.00671141]]

As the data has shape (150, 4), we get a covariance matrix of dimension (4, 4).
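You may wonder why the diagonal is 1.00671141 rather than exactly 1. A likely explanation, assuming the data was standardized with StandardScaler: the scaler divides by the population standard deviation (denominator n), while the covariance formula divides by n - 1, so each diagonal entry becomes 150/149. The sketch below reproduces this with made-up data of the same shape.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=2.0, size=(150, 4))   # made-up stand-in for the features

# Standardize the way StandardScaler does: population standard deviation (ddof=0).
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=0)

n = Z.shape[0]
cov_mat = Z.T @ Z / (n - 1)

print(np.diag(cov_mat))   # every entry equals n / (n - 1)
print(n / (n - 1))        # 150 / 149

# The matrix-algebra result matches np.cov (rowvar=False: columns are variables).
print(np.allclose(cov_mat, np.cov(Z, rowvar=False)))  # True
```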

3. Find the eigenvectors and eigenvalues of the Covariance matrix – Using NumPy’s eigendecomposition routine np.linalg.eig, we calculate the eigenvectors and eigenvalues of the covariance matrix.


eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)



Eigenvectors

[[ 0.52237162 -0.37231836 -0.72101681  0.26199559]
[-0.26335492 -0.92555649  0.24203288 -0.12413481]
[ 0.58125401 -0.02109478  0.14089226 -0.80115427]
[ 0.56561105 -0.06541577  0.6338014   0.52354627]]

Eigenvalues
[2.93035378 0.92740362 0.14834223 0.02074601]


The eigenvector matrix has dimension (4, 4), with one eigenvector per column.
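Singular Value Decomposition gives the same answer by a different route. The sketch below, on made-up standardized data, compares the eigendecomposition of the covariance matrix with the SVD of the data matrix itself: the squared singular values divided by n - 1 are the eigenvalues, and the right singular vectors are the eigenvectors, up to sign.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))   # made-up correlated data
X = (X - X.mean(axis=0)) / X.std(axis=0)                  # standardize the columns

n = X.shape[0]
C = X.T @ X / (n - 1)

# Route 1: eigendecomposition of the covariance matrix (sorted to descending order).
eig_vals, eig_vecs = np.linalg.eigh(C)
eig_vals, eig_vecs = eig_vals[::-1], eig_vecs[:, ::-1]

# Route 2: SVD of the data matrix; singular values come out in descending order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
svd_eig_vals = S ** 2 / (n - 1)

print(np.allclose(svd_eig_vals, eig_vals))                # True
print(np.allclose(np.abs(Vt.T), np.abs(eig_vecs)))        # True (signs may differ)
```

I used np.linalg.eigh here rather than np.linalg.eig because the covariance matrix is symmetric; eigh guarantees real, sorted eigenvalues.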

4. Sort the eigenvalues and eigenvectors in descending order– We sort the eigenvectors according to eigenvalues.

Eigenvalues show the variance explained in each direction. The higher the eigenvalue, the more of the total variance is explained.

Here is the code to sort the eigenvalues


# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])



Here is the output we get,

Eigenvalues in descending order:
2.9303537755893165
0.9274036215173421
0.14834222648163997
0.020746013995596228

The proportion of variance explained by including only the first eigenvector is λ₁/(λ₁ + λ₂ + … + λp)

Here λ₁, λ₂, …, λp are the respective eigenvalues.

Similarly, the proportion of variance explained by including the first two eigenvectors is

(λ₁ + λ₂)/(λ₁ + λ₂ + … + λp)

We choose the required number, k, of eigenvectors to create the final feature vector.

For the above example we select the first two eigenvectors, as together they explain about 96% of the total variance of the data.

We can choose the number of eigenvectors based on a preset target, for example, you want the top 5 dimensions.

Or you can choose based on the total variance explained, for example, you might want to explain 80 % of the variance of the data set.

Also, when you are working with cross-validation sets, we work with only the training set to find the number of components required.

You can also treat the number of PCA dimension as a hyperparameter to select using a grid search on the final supervised score.
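The variance-based rule can be sketched in a few lines. The eigenvalues below are the ones printed earlier in this post; the 95% threshold is just an example value.

```python
import numpy as np

# Eigenvalues from the output above, already sorted in descending order.
eig_vals = np.array([2.93035378, 0.92740362, 0.14834223, 0.02074601])

explained_ratio = eig_vals / eig_vals.sum()   # share of variance per component
cumulative = np.cumsum(explained_ratio)       # running total

print(explained_ratio.round(4))
print(cumulative.round(4))

# Smallest k whose cumulative explained variance reaches the threshold.
threshold = 0.95
k = int(np.searchsorted(cumulative, threshold) + 1)
print(k)  # 2
```

The first two components together reach roughly 0.958, which is where the "about 96%" figure comes from.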

5. Dimensionality Reduction:- Finally we use the final feature vector to find the transformed data set.

In the transposed formulation, FeatureVector is the matrix with the eigenvectors in the columns, transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top, and OriginalData is the mean-adjusted data, transposed. The code below computes the equivalent product with the observations kept in rows.


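matrix_w = np.column_stack((eig_pairs[0][1], eig_pairs[1][1]))  # assumed definition: top two eigenvectors as columns, shape (4, 2)
TransformedData = np.asarray(data_pca).dot(matrix_w)  # one row per flower, shape (150, 2)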



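The dimension of the final data is (150, 2): one row per flower and one column per principal component.

The original data is basically a set of vectors in 4-dimensional space, and we are multiplying them by the selected eigenvectors so that we know the coordinates of these data points in our new coordinate system. Because the eigenvectors are orthonormal, the eigenvector matrix is orthogonal, and its transpose acts as its inverse.

In general, the transformed data computed this way has dimension m × k, where m is the number of training examples and k is the number of principal components we kept. The transposed formulation described above gives the same numbers arranged as k × m.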

Here is our final data plotted along the principal component axes.


fig, ax = plt.subplots(figsize=(10, 10))

ax.scatter(TransformedData[:, 0], TransformedData[:, 1], alpha=0.2,
c=data['Species'], cmap='viridis')

plt.xlabel("PCA1")
plt.ylabel("PCA2")
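As a final check, we can compare the manual steps with scikit-learn's PCA class. This is a sketch under one assumption: it uses scikit-learn's bundled iris data via load_iris, so the exact numbers can differ slightly from the CSV used above. The signs of the components may also be flipped, which is why the comparison uses absolute values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the four iris features.
X = StandardScaler().fit_transform(load_iris().data)

# Manual PCA: covariance matrix, eigendecomposition, sort, project.
cov_mat = X.T @ X / (X.shape[0] - 1)
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)
order = np.argsort(eig_vals)[::-1]            # indices from largest to smallest eigenvalue
matrix_w = eig_vecs[:, order[:2]]             # top two eigenvectors as columns
manual = X @ matrix_w                         # shape (150, 2)

# Library PCA on the same standardized data.
pca = PCA(n_components=2)
sk = pca.fit_transform(X)

print(manual.shape)                                            # (150, 2)
print(np.allclose(np.abs(manual), np.abs(sk)))                 # True
print(np.allclose(pca.explained_variance_, eig_vals[order[:2]]))  # True
```

An eigenvector multiplied by -1 is still an eigenvector, so a mirrored plot from the two approaches describes exactly the same principal axes.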