scikit-learn : Data Compression via Dimensionality Reduction I - Principal component analysis (PCA)
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. - wiki
PCA tries to find the directions of maximum variance (the orthogonal axes, or principal components) in the data and projects the data onto a new subspace with fewer dimensions than the original one.
Principal Components Analysis (PCA)
Here, $x_1$ and $x_2$ are the original feature axes, and $z_1$ and $z_2$ are the principal components.
In order to reduce dimensionality using PCA, we construct a $d \times k$-dimensional transformation matrix $\mathbf W$.
With the $W$ matrix we can map a sample vector $x$ onto a new $k$-dimensional feature subspace that has fewer dimensions than the original $d$-dimensional feature space:
$$ \mathbf x=[x_1, x_2, \cdots ,x_d], \qquad \mathbf x \in \mathbb R^d $$

$$ \downarrow \ \mathbf x \mathbf W, \qquad \mathbf W \in \mathbb R^{d \times k} $$

$$ \mathbf z=[z_1, z_2, \cdots ,z_k], \qquad \mathbf z \in \mathbb R^k $$

After the transformation of the original $d$-dimensional data onto this new $k$-dimensional subspace ($k \le d$), the first principal component will have the largest possible variance, and each subsequent principal component will have the largest possible variance under the constraint that it is uncorrelated with (orthogonal to) the preceding principal components.
Note that the PCA directions are highly sensitive to data scaling, and most likely we need to standardize the features prior to PCA if the features were measured on different scales and we want to assign equal importance to all features.
Here are the steps of PCA algorithm for dimensionality reduction:
- Standardize the $d$-dimensional dataset.
- Construct the covariance matrix.
- Decompose the covariance matrix into its eigenvectors and eigenvalues.
- Select $k$ eigenvectors that correspond to the $k$ largest eigenvalues, where $k$ is the dimensionality of the new feature subspace ( $k \le d$ ).
- Construct a projection matrix $W$ from the "top" $k$ eigenvectors.
- Transform the $d$-dimensional input dataset $\mathbf x$ using the projection matrix $W$ to obtain the new $k$-dimensional feature subspace.
Continuing from the previous discussion of principal component analysis, in this section we'll standardize the data, construct the covariance matrix, obtain the eigenvalues and eigenvectors of the covariance matrix, and sort the eigenvalues in decreasing order to rank the eigenvectors.
Let's start by loading the Wine dataset from "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data":
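A minimal sketch of the read-in step with pandas; the column names follow the UCI Wine documentation (the first column holds the class label):

```python
import pandas as pd

# Read the Wine data straight from the UCI repository; the file has no header row
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)

# Name the columns following the UCI Wine documentation (column 0 is the class label)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']

df_wine.head()
```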
Note that PCA is an unsupervised method, which means that information about the class labels is ignored. This is in clear contrast to a random forest, which uses the class membership information to compute the node impurities; variance, on the other hand, measures the spread of values along a feature axis regardless of class. Recall that we did the following for the random forest:
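For reference, a condensed sketch of that random-forest feature ranking from the earlier section; it assumes the training split (X_train, y_train) and the feature-name list feat_labels prepared there:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest and rank the features by impurity-based importance
forest = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)

importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
```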
After the read-in, we process the Wine data into separate training (70%) and test (30%) sets and then standardize it to unit variance:
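A sketch of the split-and-standardize step using scikit-learn's train_test_split and StandardScaler (the stratify and random_state arguments are one reasonable choice, not the only one):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate the features from the class labels
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

# 70% training / 30% test split, stratified by class label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Standardize the features to zero mean and unit variance
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
```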
Now we want to construct the covariance matrix, which is a symmetric $d \times d$ matrix, where $d$ is the number of dimensions of the dataset. The covariance matrix stores the pairwise covariances between the different features.
The covariance between two features $x_j$ and $x_k$ on the population level can be calculated via the equation below:
$$ \sigma_{jk} = \frac {1}{N} \sum_{i=1}^N \left(x_j^{(i)} - \mu_j\right)\left(x_k^{(i)} - \mu_k\right) $$

where $\mu_j$ and $\mu_k$ are the sample means of features $j$ and $k$, respectively.
Note that the sample means are zero if we standardize the dataset.
A positive covariance between two features indicates that the features increase or decrease together, while a negative covariance means that the features vary in opposite directions.
A covariance matrix of three features can then be written as $A$:
$$ A = \begin{bmatrix} \sigma_{11}^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22}^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33}^2 \\ \end{bmatrix} $$

The eigenvectors of the covariance matrix represent the principal components (the directions of maximum variance), while the corresponding eigenvalues define their magnitude.
In the case of the Wine dataset, we can obtain 13 eigenvectors and eigenvalues from the $13 \times 13$ covariance matrix.
An eigenvector $\mathbf \nu$ satisfies the following condition, where $\lambda$ is the corresponding eigenvalue:

$$ \mathbf A \mathbf \nu = \lambda \mathbf \nu $$

We're going to use the linalg.eig function from NumPy to obtain the eigenpairs of the Wine covariance matrix:
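A sketch of this step, assuming the standardized training array X_train_std from above:

```python
import numpy as np

# Covariance matrix of the standardized training set
# (np.cov expects variables as rows, hence the transpose)
cov_mat = np.cov(X_train_std.T)

# Eigendecomposition: 13 eigenvalues and a 13x13 matrix whose columns are the eigenvectors
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)
print('\nEigenvalues \n%s' % eigen_vals)
```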
We computed the covariance matrix of the standardized training dataset using the numpy.cov() function.
Using the linalg.eig function, we performed the eigendecomposition that yielded 13 eigenvalues and the corresponding eigenvectors stored as columns in a $13 \times 13$ matrix.
Since we want to reduce the dimensionality of our dataset by compressing it onto a new feature subspace, we only select the subset of the eigenvectors (principal components) that contains most of the information (variance).
Since the eigenvalues define the magnitude of the eigenvectors, we have to sort the eigenvalues by decreasing magnitude, and we are interested in the top $k$ eigenvectors based on the values of their corresponding eigenvalues.
But before we collect those $k$ most informative eigenvectors, let's plot the variance explained ratios of the eigenvalues.
The explained variance ratio of an eigenvalue $\lambda_j$ is simply the fraction of that eigenvalue over the total sum of the eigenvalues:
$$ \frac {\lambda_j} {\sum_{j=1}^d \lambda_j} $$

Using the NumPy cumsum() function, we can then calculate the cumulative sum of explained variances, which we will plot via matplotlib's step() function:
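One way to compute and plot these quantities, assuming eigen_vals from the previous step:

```python
import numpy as np
import matplotlib.pyplot as plt

tot = sum(eigen_vals)
# Individual explained variance ratios, largest eigenvalue first
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
# Cumulative sum of the explained variance ratios
cum_var_exp = np.cumsum(var_exp)

plt.bar(range(1, 14), var_exp, alpha=0.5, align='center',
        label='individual explained variance')
plt.step(range(1, 14), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.show()
```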
The plot shows that the first principal component alone accounts for 40 percent of the variance. Also, we can see that the first two principal components combined explain almost 60 percent of the variance in the data.
Now that we have decomposed the covariance matrix into eigen-pairs, we want to transform the Wine dataset onto the new principal component axes.
We're going to sort the eigen-pairs by descending order of the eigenvalues, and construct a projection matrix from the selected eigenvectors. Then, using the projection matrix we will transform the data onto the lower-dimensional subspace.
Let's start by sorting the eigen-pairs by decreasing order of the eigenvalues:
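A sketch of the sorting step (taking the absolute value guards against tiny negative eigenvalues produced by numerical round-off):

```python
import numpy as np

# Pair each eigenvalue with its eigenvector and sort the pairs from high to low
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i])
               for i in range(len(eigen_vals))]
eigen_pairs.sort(key=lambda k: k[0], reverse=True)
```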
Next, we collect the two eigenvectors that correspond to the two largest eigenvalues, which together capture about 60 percent of the variance in this dataset.
Note that the number of principal components has to be determined by a trade-off between computational efficiency and the performance of the classifier; here, we choose only two eigenvectors for demonstration purposes.
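One way to stack the two leading eigenvectors column-wise into the projection matrix:

```python
import numpy as np

# Stack the two leading eigenvectors column-wise into a 13x2 projection matrix
w = np.hstack((eigen_pairs[0][1][:, np.newaxis],
               eigen_pairs[1][1][:, np.newaxis]))
print('Matrix W:\n', w)
```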
Now we've created a $13 \times 2$ projection matrix $\mathbf W$ from the top two eigenvectors.
Using the projection matrix, we can now transform a sample $\mathbf x$ (represented as $1 \times 13$ row vector) onto the PCA subspace obtaining $\mathbf x^\prime$ which is a 2-D sample vector consisting of two new features:
$$ \mathbf x^\prime = \mathbf x \mathbf W $$

In the same way, we can transform the entire $124 \times 13$ training dataset onto the two principal components by calculating the matrix dot product:
$$ \mathbf X^\prime = \mathbf X \mathbf W $$

Finally, it's time to visualize the transformed Wine training set, now stored as a $124 \times 2$ matrix, in a two-dimensional scatterplot:
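A sketch of the projection and the scatter plot, assuming X_train_std, y_train, and the projection matrix w from above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Project the standardized 124 x 13 training data onto the two principal components
X_train_pca = X_train_std.dot(w)

# One color/marker per class label, for illustration only (PCA itself ignores the labels)
colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']
for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_pca[y_train == l, 0],
                X_train_pca[y_train == l, 1],
                c=c, label=l, marker=m)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.show()
```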
As we can see from the plot, the data is more spread along the $x$-axis (the first principal component) than along the $y$-axis (the second principal component). This is consistent with the explained variance ratio plot that we created in the previous subsection, and we can intuitively see that a linear classifier will likely be able to separate the classes well.
One more reminder: although we encoded the class label information in the preceding scatter plot for the purpose of illustration, keep in mind that PCA is an unsupervised technique that doesn't use any class label information.
In this section we want to learn how to use the PCA class implemented in scikit-learn.
PCA is another one of scikit-learn's transformer classes: we first fit the model using the training data, and then transform both the training data and the test data using the same model parameters.
Let's use the PCA from scikit-learn on the Wine training dataset, and classify the transformed samples via logistic regression.
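A minimal sketch of this pipeline with scikit-learn's PCA and LogisticRegression, reusing the standardized arrays from above (default solver settings are assumed):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Fit PCA on the standardized training data, then apply the same transformation to the test data
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

# Train a logistic regression classifier on the two principal components
lr = LogisticRegression()
lr.fit(X_train_pca, y_train)
```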
To visualize the decision regions, we'll use the following code:
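One possible implementation of a plot_decision_regions helper, which predicts the class over a grid spanning the two principal components and overlays the training samples:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):
    # Marker generator and color map, one entry per class
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # Evaluate the classifier on a grid covering the feature space
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # Plot the samples, one marker/color per class
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=colors[idx], marker=markers[idx], label=cl)

plot_decision_regions(X_train_pca, y_train, classifier=lr)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.show()
```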
From the picture above, we can see the decision regions for the logistic regression model trained on the data reduced to the two principal component axes.
This section explains the core concept of the covariance matrix (ref. Statistics 101: The Covariance Matrix).
Here is the data we're going to use:
Here are the scatter plots of each variable against the others:
So, we get the following covariance matrix.
As we can see from the picture above, the diagonal of the covariance matrix contains the variances of the individual variables, that is, each variable's covariance with itself, while the off-diagonal entries are the covariances between pairs of variables.
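As a small illustration of how such a matrix is computed, here is a sketch using NumPy's cov() on a hypothetical three-variable dataset (the numbers below are made up for demonstration and are not the data referenced above):

```python
import numpy as np

# Hypothetical observations of three variables (five samples each); values are made up
x = np.array([2.1, 2.5, 3.6, 4.0, 4.6])
y = np.array([8.0, 10.0, 12.2, 14.1, 15.9])
z = np.array([1.0, 0.8, 0.9, 0.5, 0.3])

# np.cov treats each row as one variable: the diagonal holds the variances
# (each variable's covariance with itself), the off-diagonal entries the pairwise covariances
cov_mat = np.cov(np.vstack((x, y, z)))
print(cov_mat)
```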
Source is available from bogotobogo-Machine-Learning.
Next:
Data Compression via Dimensionality Reduction II - Linear Discriminant Analysis (LDA)