# Principal Component Analysis (PCA)

Principal Component Analysis is an unsupervised learning algorithm used for dimensionality reduction in machine learning. It is a statistical procedure that uses an orthogonal transformation to convert observations of correlated features into a set of linearly uncorrelated features. These new transformed features are called the **Principal Components**. PCA is a popular tool for exploratory data analysis and predictive modeling. It draws out the strong patterns in a dataset by keeping the directions along which the data varies the most and discarding those along which it varies the least.

PCA generally tries to find a lower-dimensional surface onto which the high-dimensional data can be projected.

PCA works by considering the variance of each attribute: an attribute with high variance is assumed to separate the data well (for example, to show a good split between classes), so the low-variance directions can be dropped to reduce the dimensionality. Some real-world applications of PCA are **image processing, movie recommendation systems, and optimizing the power allocation in various communication channels.** It is a feature extraction technique, so it keeps the important variables and drops the least important ones.

The PCA algorithm is based on some mathematical concepts such as:

Variance and Covariance

Eigenvalues and Eigenvectors

Some common terms used in the PCA algorithm:

**Dimensionality:** It is the number of features or variables present in the given dataset. More simply, it is the number of columns in the dataset.

**Correlation:** It signifies how strongly two variables are related to each other, that is, whether a change in one is accompanied by a change in the other. The correlation value ranges from -1 to +1. Here, -1 indicates a perfect inverse (negative linear) relationship between the variables, and +1 indicates a perfect direct (positive linear) relationship.

**Orthogonal:** It means that the variables are not correlated with each other, and hence the correlation between the pair of variables is zero.

**Eigenvectors:** Given a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is a scalar multiple of v.

**Covariance Matrix:** A matrix containing the covariances between each pair of variables is called the Covariance Matrix.
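To make the eigenvector definition and the correlation range concrete, here is a minimal NumPy sketch. The matrix `M`, the vector `v`, and the sample array `x` are made-up illustrative values, not data from this article.

```python
import numpy as np

# Hypothetical 2x2 symmetric matrix, chosen only for illustration.
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])
v = np.array([1.0, 1.0])        # candidate eigenvector

Mv = M @ v                      # matrix-vector product
lam = Mv[0] / v[0]              # the scalar factor, if v is an eigenvector
is_eigenvector = np.allclose(Mv, lam * v)

# Correlation always lies between -1 and +1.
x = np.array([1.0, 2.0, 3.0, 4.0])
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]   # perfectly increasing linear relation
r_neg = np.corrcoef(x, -3 * x)[0, 1]      # perfectly decreasing linear relation

print(is_eigenvector, lam, round(r_pos, 6), round(r_neg, 6))
```

Here `Mv` is exactly three times `v`, so `v` is an eigenvector of `M` with eigenvalue 3.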

#### Principal Components in PCA

As described above, the transformed new features, or the output of PCA, are the Principal Components. The number of these PCs is either equal to or less than the number of original features in the dataset. Some properties of these principal components are given below:

Each principal component must be a linear combination of the original features.

The components are orthogonal, i.e., the correlation between any pair of components is zero.

The importance of each component decreases going from 1 to n: the first PC has the most importance, and the nth PC has the least.
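These three properties can be checked numerically. The sketch below is an illustration under stated assumptions: the data is randomly generated (it is not from this article), and the components are obtained from an eigendecomposition of the sample covariance matrix with `numpy.linalg.eigh`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 3 features, the first two strongly correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)),
               2 * base + 0.5 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])
Xc = X - X.mean(axis=0)                      # center each feature

cov = np.cov(Xc, rowvar=False)               # 3x3 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # reorder from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Property 1: each component is a linear combination of the original features.
scores = Xc @ eigvecs

# Property 2: the directions are orthonormal and the scores are uncorrelated.
gram = eigvecs.T @ eigvecs                   # should be the identity matrix
score_cov = np.cov(scores, rowvar=False)     # should be diagonal

# Property 3: the variance carried by each component decreases from first to last.
print(np.round(eigvals, 3))
print(np.round(score_cov, 3))
```

The diagonal of `score_cov` holds the eigenvalues, which is why the eigenvalues serve as a measure of each component's importance.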

#### Applications of Principal Component Analysis

PCA is mainly used as a dimensionality reduction technique in various AI applications such as **computer vision, image compression, etc.** It can also be used for finding hidden patterns in high-dimensional data. Some fields where PCA is used are finance, data mining, psychology, etc.

## Steps for PCA algorithm

**Getting the dataset:** Firstly, we take the input dataset and divide it into two subparts X and Y, where X is the training set and Y is the validation set.

**Representing data into a structure:** Now we represent our dataset as a structure, namely a two-dimensional matrix of the independent variable X. Each row corresponds to a data item, and each column corresponds to a feature. The number of columns is the number of dimensions of the dataset.

**Standardizing the data:** In this step, we standardize our dataset. In the raw data, features with high variance count as more important than features with lower variance. If the importance of a feature should be independent of its variance, we subtract the column mean from each data item in a column and divide by the standard deviation of that column. We name the resulting matrix Z.

**Calculating the Covariance of Z:** To calculate the covariance of Z, we take the matrix Z, transpose it, and multiply the transpose by Z. The output matrix ZᵀZ is, up to a constant factor of 1/(n − 1) for n observations, the covariance matrix of Z.

**Calculating the Eigen Values and Eigen Vectors:** Now we calculate the eigenvalues and eigenvectors of the resultant covariance matrix. The eigenvectors of the covariance matrix are the directions of the axes with high information, and the corresponding eigenvalues measure how much variance lies along each of these directions.

**Sorting the Eigen Vectors:** In this step, we take all the eigenvalues and sort them in decreasing order, from largest to smallest, and simultaneously sort the eigenvectors accordingly in the matrix P of eigenvectors. The resultant matrix is named P*.

**Calculating the new features or Principal Components:** Here we calculate the new features. To do this, we multiply Z by the P* matrix. In the resultant matrix Z*, each observation is a linear combination of the original features, and the columns of Z* are uncorrelated with each other.

**Removing less important or unimportant features from the new dataset:** Now that we have the new feature set, we decide what to keep and what to remove: only the relevant or important features are kept in the new dataset, and the unimportant ones are removed.
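The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `pca`, its parameter `k`, and the random input data are all assumptions made for this example, and the code assumes no column is constant (a zero standard deviation would cause a division by zero).

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Standardize: zero mean and unit sample standard deviation per column -> Z.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Covariance of Z: Z^T Z scaled by 1 / (n - 1).
    cov = Z.T @ Z / (n - 1)

    # Eigendecomposition; eigh suits symmetric matrices and returns
    # eigenvalues in ascending order, with eigenvectors as columns of P.
    eigvals, P = np.linalg.eigh(cov)

    # Sort eigenvalues in decreasing order and reorder the eigenvectors to match.
    order = np.argsort(eigvals)[::-1]
    eigvals, P_star = eigvals[order], P[:, order]

    # New features: each row of Z_star is a linear combination of the original features.
    Z_star = Z @ P_star

    # Keep only the k most important components.
    return Z_star[:, :k], eigvals, P_star

# Usage on hypothetical data: 100 samples, 4 features, two of them nearly redundant.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
X[:, 1] = 3 * X[:, 0] + 0.1 * rng.normal(size=100)

reduced, eigvals, P_star = pca(X, k=2)
print(reduced.shape)                        # (100, 2)
print(np.round(eigvals / eigvals.sum(), 3)) # share of total variance per component
```

Because the data is standardized, the eigenvalues sum to the number of features, so dividing by that sum gives each component's share of the total variance, which is one common way to decide how many components to keep.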

## How to Calculate PCA

**1. Take the whole dataset consisting of *d+1 dimensions* and ignore the labels such that our new dataset becomes *d dimensional*.**

Let’s say we have a dataset which is *d+1 dimensional*, where the d feature columns can be thought of as *X_train* and the 1 remaining column as *y_train* (the labels) in the modern machine learning paradigm. So, *X_train + y_train* makes up our complete training dataset.

So, after we drop the labels we are left with a *d dimensional* dataset, and this is the dataset we will use to find the principal components. Also, let’s assume we are left with a three-dimensional dataset after ignoring the labels, i.e., d = 3.
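As a small sketch of this step, assuming the labels sit in the last column of a NumPy array (the values below are invented purely for illustration):

```python
import numpy as np

# Hypothetical (d+1)-dimensional dataset: three feature columns plus one label column.
data = np.array([[5.1, 3.5, 1.4, 1],
                 [4.9, 3.0, 1.4, 1],
                 [6.3, 3.3, 6.0, 2],
                 [5.8, 2.7, 5.1, 2]])

X_train = data[:, :-1]   # the d = 3 feature columns, used to find the principal components
y_train = data[:, -1]    # the labels, set aside and not used by PCA

print(X_train.shape, y_train.shape)   # (4, 3) (4,)
```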

We will assume that the samples stem from two different classes, where one half of the samples in our dataset are labeled class 1 and the other half class 2.

Let our data matrix **X** be the scores of three students:

**2. Compute the mean of every dimension of the whole dataset.**

The data from the above table can be represented in matrix **A**, where each column shows the scores on a test and each row shows the scores of a student.

*Matrix A*

So, the mean of matrix **A** would be

*Mean of Matrix A*
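As a sketch of this step, assume for illustration a 5×3 score matrix with columns for math, English, and art. These particular scores are an assumption made for this example; they were chosen because they reproduce the variances and covariances quoted in this walkthrough when the covariance is normalized by the number of rows.

```python
import numpy as np

# Assumed illustrative scores (rows: students; columns: math, English, art).
A = np.array([[90, 60, 90],
              [90, 90, 30],
              [60, 60, 60],
              [60, 60, 90],
              [30, 30, 30]], dtype=float)

# Mean of every dimension (column) of the dataset.
mean_vector = A.mean(axis=0)
print(mean_vector)   # [66. 60. 60.]
```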

**3. Compute the *covariance matrix* of the whole dataset (sometimes also called the variance-covariance matrix)**

So, we can compute the covariance of two variables **X** and **Y** using the following formula
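A common form of the covariance formula, assuming population normalization (dividing by the number of observations n; many texts and libraries divide by n − 1 instead), is:

$$
\operatorname{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)
$$

Here x̄ and ȳ are the means of **X** and **Y**, and setting Y = X gives the variance of X.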

Using the above formula, we can find the covariance matrix of **A**. The result is a *square matrix of d × d dimensions*.

Let’s rewrite our original matrix like this:

*Matrix A*

Its *covariance matrix* would be

*Covariance Matrix of A*
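The computation can be sketched in NumPy. The score matrix below is an assumption made for illustration (rows are students; columns are math, English, and art), and the covariance is normalized by the number of rows N, which is also an assumption; with these choices the result contains the values discussed in this section.

```python
import numpy as np

# Assumed illustrative scores (rows: students; columns: math, English, art).
A = np.array([[90, 60, 90],
              [90, 90, 30],
              [60, 60, 60],
              [60, 60, 90],
              [30, 30, 30]], dtype=float)

centered = A - A.mean(axis=0)                     # subtract each column's mean
cov_matrix = centered.T @ centered / A.shape[0]   # d x d matrix, normalized by N

# The same result via NumPy's helper; bias=True selects normalization by N.
cov_check = np.cov(A, rowvar=False, bias=True)

print(cov_matrix)
# [[504. 360. 180.]
#  [360. 360.   0.]
#  [180.   0. 720.]]
```

The diagonal holds the variance of each test and the off-diagonal entries hold the covariances between pairs of tests.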

A few points can be noted here:

Shown in *blue* along the diagonal, we see the variance of scores for each test. The art test has the biggest variance (720), and the English test the smallest (360). So we can say that art test scores have more variability than English test scores.

The covariance is displayed in black in the off-diagonal elements of the covariance matrix of **A**.

**a)** The covariance between math and English is positive (360), and the covariance between math and art is positive (180). This means the scores tend to covary in a positive way. As scores on math go up, scores on art and English also tend to go up, and vice versa.

**b)** The covariance between English and art, however, is zero. This means there tends to be no predictable relationship between the movement of English and art scores.

**4. Compute Eigenvectors and corresponding Eigenvalues**

*Intuitively, an eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it.*

Now, we can easily compute the eigenvalues and eigenvectors of the covariance matrix that we have above.

*Let A be a square matrix, ν a vector and λ a scalar that satisfies Aν = λν, then λ is called eigenvalue associated with eigenvector ν of A.*

The eigenvalues of **A** are the roots of the characteristic equation *det(A − λI) = 0*.

Calculating *det(A − λI)* first, where *I* is an identity matrix:

Simplifying the matrix first, we can calculate the determinant later:

Now that we have our simplified matrix, we can find its determinant:

We now have the equation and need to solve it for *λ* to get the *eigenvalues of the matrix*. So, equating the above expression to zero:

After solving this equation for **λ**, we get the following values:

*Eigenvalues*

Now, we can calculate the eigenvectors corresponding to the above eigenvalues. I will not show how to calculate the eigenvectors here; visit this __link__ to understand how to calculate them.

So, after solving for the *eigenvectors*, we get the following solution for the corresponding *eigenvalues*:
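In practice the eigenvalues and eigenvectors are usually computed numerically rather than by hand. The sketch below uses `numpy.linalg.eigh` on a covariance matrix built from the values quoted in this section (360, 720, 360, 180, 0); the math variance of 504 is an assumed value used for illustration.

```python
import numpy as np

# Covariance matrix (order: math, English, art).
# The entry 504 (math variance) is an assumption; the others are quoted in the text.
cov_matrix = np.array([[504.0, 360.0, 180.0],
                       [360.0, 360.0,   0.0],
                       [180.0,   0.0, 720.0]])

# eigh is intended for symmetric matrices: it returns the eigenvalues in
# ascending order and the unit-length eigenvectors as the columns of eigvecs.
eigvals, eigvecs = np.linalg.eigh(cov_matrix)

# Check the defining relation A v = lambda v for every eigenpair.
residuals = [np.linalg.norm(cov_matrix @ v - lam * v)
             for lam, v in zip(eigvals, eigvecs.T)]

print(np.round(eigvals, 2))
print(np.round(eigvecs, 3))
```

The eigenvalues sum to the trace of the covariance matrix, that is, to the total variance of the three tests.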

**5. Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest eigenvalues to form a *d × k dimensional* matrix W.**

We started with the goal of reducing the dimensionality of our feature space, i.e., projecting the feature space via PCA onto a smaller subspace, where the eigenvectors will form the axes of this new feature subspace. However, the eigenvectors only define the directions of the new axes, since they all have the same unit length 1.

So, in order to decide which eigenvector(s) we want to drop for our lower-dimensional subspace, we have to take a look at the corresponding eigenvalues of the eigenvectors. Roughly speaking, the eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data, and those are the ones we want to drop.

The common approach is to rank the eigenvectors from highest to lowest corresponding eigenvalue and choose the top *k* eigenvectors.

So, after sorting the eigenvalues in decreasing order, we have:

For our simple example, where we are reducing a 3-dimensional feature space to a 2-dimensional feature subspace, we are combining the two eigenvectors with the highest eigenvalues to construct our *d × k* dimensional eigenvector matrix **W**.

So, the *eigenvectors* corresponding to the two largest eigenvalues are:
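Ranking the eigenvectors and assembling **W** can be sketched as follows. The covariance matrix reuses the values quoted earlier in this walkthrough, with the math variance of 504 as an assumed value; the variable names are illustrative.

```python
import numpy as np

# Covariance matrix (order: math, English, art); 504 is an assumed value.
cov_matrix = np.array([[504.0, 360.0, 180.0],
                       [360.0, 360.0,   0.0],
                       [180.0,   0.0, 720.0]])

eigvals, eigvecs = np.linalg.eigh(cov_matrix)   # ascending eigenvalues

# Rank eigenvectors from highest to lowest eigenvalue.
order = np.argsort(eigvals)[::-1]
sorted_vals = eigvals[order]

# Keep the k eigenvectors with the largest eigenvalues as the columns of W.
k = 2
W = eigvecs[:, order[:k]]                       # d x k = 3 x 2

# Share of the total variance carried by each eigenvector.
explained = sorted_vals / sorted_vals.sum()

print(W.shape)                                  # (3, 2)
print(np.round(explained, 3))
```

With these values the two retained directions account for roughly 97% of the total variance, which is why dropping the third eigenvector loses little information.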

**6. Transform the samples onto the new subspace**

In the last step, we use the 3×2 dimensional matrix **W** that we just computed to transform our samples onto the new subspace via the equation *y = W′ × x*, where *W′* is the *transpose* of the matrix *W*.

__Resource__: Towardsdatascience.com, TutorialPoint

The Tech Platform