scikit-learn : Data Compression via Dimensionality Reduction I - Principal component analysis (PCA)
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. - wiki
PCA tries to find the directions of maximum variance (the orthogonal axes, or principal components) in the data and projects the data onto a new subspace with fewer dimensions than the original one.
Principal Components Analysis (PCA)
Here, $x_1$ and $x_2$ are the original feature axes, and $z_1$ and $z_2$ are the principal components.
In order to reduce dimensionality using PCA, we construct a $d \times k$-dimensional transformation matrix $\mathbf W$.
With the $W$ matrix we can map a sample vector $x$ onto a new $k$-dimensional feature subspace that has fewer dimensions than the original $d$-dimensional feature space:
$$ \mathbf x=[x_1, x_2, \cdots ,x_d], \qquad \mathbf x \in \mathbb R^d $$

$$ \downarrow \ \mathbf x \mathbf W, \qquad \mathbf W \in \mathbb R^{d \times k} $$

$$ \mathbf z=[z_1, z_2, \cdots ,z_k], \qquad \mathbf z \in \mathbb R^k $$

After the transformation of the original $d$-dimensional data onto this new $k$-dimensional subspace ($k \le d$), the first principal component will have the largest possible variance, and each subsequent principal component will have the largest possible variance under the constraint that it is uncorrelated with (orthogonal to) the preceding principal components.
Note that the PCA directions are highly sensitive to data scaling, and most likely we need to standardize the features prior to PCA if the features were measured on different scales and we want to assign equal importance to all features.
Here are the steps of PCA algorithm for dimensionality reduction:
- Standardize the $d$-dimensional dataset.
- Construct the covariance matrix.
- Decompose the covariance matrix into its eigenvectors and eigenvalues.
- Select $k$ eigenvectors that correspond to the $k$ largest eigenvalues, where $k$ is the dimensionality of the new feature subspace ( $k \le d$ ).
- Construct a projection matrix $W$ from the "top" $k$ eigenvectors.
- Transform the $d$-dimensional input dataset $\mathbf x$ using the projection matrix $W$ to obtain the new $k$-dimensional feature subspace.
Continuing from the previous discussion of principal component analysis, in this section we'll standardize the data, construct the covariance matrix, obtain the eigenvalues and eigenvectors of the covariance matrix, and sort the eigenvalues in decreasing order to rank the eigenvectors.
Let's start by loading the Wine dataset from "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data":
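A minimal sketch of the read-in step with pandas; the column names follow the UCI Wine documentation (the first column holds the class label):

```python
import pandas as pd

# Read the Wine data straight from the UCI repository; the file has no header row
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)

# Name the columns following the UCI Wine documentation (column 0 is the class label)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']

df_wine.head()
```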
Note that PCA is an unsupervised method, which means that information about the class labels is ignored. This is in clear contrast to a random forest, which uses the class membership information to compute the node impurities; variance, on the other hand, measures the spread of values along a feature axis regardless of class. Recall that we did the following for the random forest:
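For reference, a condensed sketch of that random-forest feature ranking from the earlier section; it assumes the training split (X_train, y_train) and the feature-name list feat_labels prepared there:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest and rank the features by impurity-based importance
forest = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)

importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
```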
After the read-in, we process the Wine data into separate training (70%) and test (30%) sets and then standardize it to unit variance:
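A sketch of the split-and-standardize step using scikit-learn's train_test_split and StandardScaler (the stratify and random_state arguments are one reasonable choice, not the only one):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate the features from the class labels
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

# 70% training / 30% test split, stratified by class label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Standardize the features to zero mean and unit variance
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
```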
Now we want to construct the covariance matrix, which is a symmetric $d \times d$ matrix, where $d$ is the number of dimensions of the dataset. The covariance matrix stores the pairwise covariances between the different features.
The covariance between two features $x_j$ and $x_k$ on the population level can be calculated via the equation below:
$$ \sigma_{jk} = \frac {1}{N} \sum_{i=1}^N \left(x_j^{(i)} - \mu_j\right)\left(x_k^{(i)} - \mu_k\right) $$

where $\mu_j$ and $\mu_k$ are the sample means of features $j$ and $k$, respectively.
Note that the sample means are zero if we standardize the dataset.
A positive covariance between two features indicates that the features increase or decrease together, while a negative covariance means that the features vary in opposite directions.
A covariance matrix of three features can then be written as $A$:
$$ A = \begin{bmatrix} \sigma_{11}^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22}^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33}^2 \\ \end{bmatrix} $$

The eigenvectors of the covariance matrix represent the principal components (the directions of maximum variance), while the corresponding eigenvalues define their magnitude.
In the case of the Wine dataset, we can obtain 13 eigenvectors and eigenvalues from the $13 \times 13$ covariance matrix.
An eigenvector $\mathbf \nu$ satisfies the following condition, where $\lambda$ is the corresponding eigenvalue:

$$ \mathbf A \mathbf \nu = \lambda \mathbf \nu $$

We're going to use the linalg.eig function from NumPy to obtain the eigenpairs of the Wine covariance matrix:
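A sketch of this step, assuming the standardized training array X_train_std from above:

```python
import numpy as np

# Covariance matrix of the standardized training set
# (np.cov expects variables as rows, hence the transpose)
cov_mat = np.cov(X_train_std.T)

# Eigendecomposition: 13 eigenvalues and a 13x13 matrix whose columns are the eigenvectors
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)
print('\nEigenvalues \n%s' % eigen_vals)
```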
We computed the covariance matrix of the standardized training dataset using the numpy.cov() function.
Using the linalg.eig function, we performed the eigendecomposition that yielded 13 eigenvalues and the corresponding eigenvectors stored as columns in a $13 \times 13$ matrix.
Since we want to reduce the dimensionality of our dataset by compressing it onto a new feature subspace, we only select the subset of the eigenvectors (principal components) that contains most of the information (variance).
Since the eigenvalues define the magnitude of the eigenvectors, we have to sort the eigenvalues by decreasing magnitude, and we are interested in the top $k$ eigenvectors based on the values of their corresponding eigenvalues.
But before we collect those $k$ most informative eigenvectors, let's plot the variance explained ratios of the eigenvalues.
The explained variance ratio of an eigenvalue $\lambda_j$ is simply the fraction of that eigenvalue over the total sum of the eigenvalues:
$$ \frac {\lambda_j} {\sum_{j=1}^d \lambda_j} $$

Using the NumPy cumsum() function, we can then calculate the cumulative sum of explained variances, which we will plot via matplotlib's step() function:
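One way to compute and plot these quantities, assuming eigen_vals from the previous step:

```python
import numpy as np
import matplotlib.pyplot as plt

tot = sum(eigen_vals)
# Individual explained variance ratios, largest eigenvalue first
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
# Cumulative sum of the explained variance ratios
cum_var_exp = np.cumsum(var_exp)

plt.bar(range(1, 14), var_exp, alpha=0.5, align='center',
        label='individual explained variance')
plt.step(range(1, 14), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.show()
```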
The plot shows that the first principal component alone accounts for 40 percent of the variance. Also, we can see that the first two principal components combined explain almost 60 percent of the variance in the data.
Now that we have decomposed the covariance matrix into eigen-pairs, we want to transform the Wine dataset onto the new principal component axes.
We're going to sort the eigen-pairs by descending order of the eigenvalues, and construct a projection matrix from the selected eigenvectors. Then, using the projection matrix we will transform the data onto the lower-dimensional subspace.
Let's start by sorting the eigen-pairs by decreasing order of the eigenvalues:
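A sketch of the sorting step (taking the absolute value guards against tiny negative eigenvalues produced by numerical round-off):

```python
import numpy as np

# Pair each eigenvalue with its eigenvector and sort the pairs from high to low
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i])
               for i in range(len(eigen_vals))]
eigen_pairs.sort(key=lambda k: k[0], reverse=True)
```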
Next, we collect the two eigenvectors that correspond to the two largest eigenvalues, which together capture about 60 percent of the variance in this dataset.
Note that the number of principal components has to be determined by a trade-off between computational efficiency and the performance of the classifier; here, we choose only two eigenvectors for demonstration purposes.
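One way to stack the two leading eigenvectors column-wise into the projection matrix:

```python
import numpy as np

# Stack the two leading eigenvectors column-wise into a 13x2 projection matrix
w = np.hstack((eigen_pairs[0][1][:, np.newaxis],
               eigen_pairs[1][1][:, np.newaxis]))
print('Matrix W:\n', w)
```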
Now we've created a $13 \times 2$ projection matrix $\mathbf W$ from the top two eigenvectors.
Using the projection matrix, we can now transform a sample $\mathbf x$ (represented as $1 \times 13$ row vector) onto the PCA subspace obtaining $\mathbf x^\prime$ which is a 2-D sample vector consisting of two new features:
$$ \mathbf x^\prime = \mathbf x \mathbf W $$

In the same way, we can transform the entire $124 \times 13$ training dataset onto the two principal components by calculating the matrix dot product:
$$ \mathbf X^\prime = \mathbf X \mathbf W $$

Finally, it's time to visualize the transformed Wine training set, now stored as a $124 \times 2$ matrix, in a two-dimensional scatterplot:
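A sketch of the projection and the scatter plot, assuming X_train_std, y_train, and the projection matrix w from above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Project the standardized 124 x 13 training data onto the two principal components
X_train_pca = X_train_std.dot(w)

# One color/marker per class label, for illustration only (PCA itself ignores the labels)
colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']
for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_pca[y_train == l, 0],
                X_train_pca[y_train == l, 1],
                c=c, label=l, marker=m)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.show()
```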
As we can see from the plot, the data is more spread along the $x$-axis (the first principal component) than along the $y$-axis (the second principal component). This is consistent with the explained variance ratio plot that we created in the previous subsection, and we can intuitively see that a linear classifier will likely be able to separate the classes well.
One more reminder: although we encoded the class label information in the preceding scatter plot for the purpose of illustration, keep in mind that PCA is an unsupervised technique that doesn't use any class label information.
In this section we want to learn how to use the PCA class implemented in scikit-learn.
PCA is another one of scikit-learn's transformer classes: we first fit the model using the training data, and then transform both the training data and the test data using the same model parameters.
Let's use the PCA from scikit-learn on the Wine training dataset, and classify the transformed samples via logistic regression.
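A minimal sketch of this pipeline with scikit-learn's PCA and LogisticRegression, reusing the standardized arrays from above (default solver settings are assumed):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Fit PCA on the standardized training data, then apply the same transformation to the test data
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

# Train a logistic regression classifier on the two principal components
lr = LogisticRegression()
lr.fit(X_train_pca, y_train)
```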
To visualize the decision regions, we'll use the following code:
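One possible implementation of a plot_decision_regions helper, which predicts the class over a grid spanning the two principal components and overlays the training samples:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):
    # Marker generator and color map, one entry per class
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # Evaluate the classifier on a grid covering the feature space
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # Plot the samples, one marker/color per class
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=colors[idx], marker=markers[idx], label=cl)

plot_decision_regions(X_train_pca, y_train, classifier=lr)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.show()
```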
From the picture above, we can see the decision regions for the logistic regression model trained on the data reduced to the two principal component axes.
This section explains the core concept of the covariance matrix (ref. Statistics 101: The Covariance Matrix).
Here is the data we're going to use:
Here are the scatter plots of each variable against the others:
So, we get the following covariance matrix.
As we can see from the picture above, the diagonal of the covariance matrix contains the variances of the individual variables, that is, each variable's covariance with itself, while the off-diagonal entries are the covariances between pairs of variables.
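As a small illustration of how such a matrix is computed, here is a sketch using NumPy's cov() on a hypothetical three-variable dataset (the numbers below are made up for demonstration and are not the data referenced above):

```python
import numpy as np

# Hypothetical observations of three variables (five samples each); values are made up
x = np.array([2.1, 2.5, 3.6, 4.0, 4.6])
y = np.array([8.0, 10.0, 12.2, 14.1, 15.9])
z = np.array([1.0, 0.8, 0.9, 0.5, 0.3])

# np.cov treats each row as one variable: the diagonal holds the variances
# (each variable's covariance with itself), the off-diagonal entries the pairwise covariances
cov_mat = np.cov(np.vstack((x, y, z)))
print(cov_mat)
```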
Source is available from bogotobogo-Machine-Learning.
Next:
Data Compression via Dimensionality Reduction II - Linear Discriminant Analysis (LDA)