My Introduction to Principal Component Analysis
AI/ML is a big field, and there are a lot of things to potentially learn. Even though I’ve been working in and around it for a number of years now, I’m constantly learning about ideas and approaches to complex problems that I’d heard mentioned but never really got to dive into from a professional or academic standpoint. This week, it was an eye-opening exploration of Principal Component Analysis, or PCA.
PCA can solve a couple of different problems in machine learning that, because of the industries and projects I’ve worked on, I had never run into or needed to explore solutions for. The one that fascinated me the most is the situation where your predictor variables outnumber your observations.
Say you have a data set that has 200 features / variables / columns, but only 100 observations. If you’ve never run into a situation like this before, it may not be obvious why this would be a problem. However, it turns out that a classic equation-solving model such as ordinary linear regression can no longer pin down a single answer. It’s similar to the situation where I give you a math problem with two variables and ask you for the exact value of each one. Example:
X + Y = 50
What are the values for X and Y? There is absolutely no way to tell if this is all of the information I provide you. I have two variables and only one equation, so given no additional information, it is mathematically impossible to determine a unique answer. There are, in fact, infinitely many pairs of values for X and Y that work. The only way to solve for them is to come up with a second, independent formula that uses these two variables, perhaps something like this:
Y - X = 100
Now that we have two formulas, we can solve for one variable in terms of the other and find the correct value for each one. Adding the two formulas together cancels X and leaves 2Y = 150, which gives:
(X = -25, Y = 75)
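For anyone who would rather let the computer do the algebra, here is a minimal sketch of the same two-formula problem written as a small linear system. It assumes NumPy is installed; the names `A`, `b`, `x`, and `y` are mine, chosen for illustration.

```python
import numpy as np

# Each row is one formula; the columns hold the coefficients on X and Y.
A = np.array([
    [1.0, 1.0],    #  X + Y = 50
    [-1.0, 1.0],   # -X + Y = 100  (the same thing as Y - X = 100)
])
b = np.array([50.0, 100.0])

# Two independent formulas for two unknowns: exactly one solution exists.
x, y = np.linalg.solve(A, b)
print(x, y)

# With only the first formula, the system has rank 1 but two unknowns,
# which is why X and Y could not be pinned down from it alone.
rank_one_formula = np.linalg.matrix_rank(A[:1])
print(rank_one_formula)
```

The rank check at the end is the formal version of "one formula is not enough": the rank counts how many independent formulas you really have.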
If you think about it, this is exactly what a lot of machine learning models are really doing. You have a set of predictor variables and a known output for each observation, and you’re finding the right way to weight those variables (via coefficients and an offset) so that the weighted values add up to the label you see in the data, under the assumption that these coefficient values reflect the impact of the variables in the real world.
The problem in our scenario, though, is that you need at least as many independent formulas as you have unknowns, or else you can’t actually solve for X and Y like I did above. If you had three variables, you would need at least three formulas. Likewise, if you have a data set with 10 predictor variables, you would need at least 10 observations (our formulas that provide the inputs to get to our answer / label), plus one more if the model has an offset, in order to generate **any** unique set of coefficients. With fewer, infinitely many coefficient sets fit the data equally well, and the math alone cannot choose between them.
This makes our initial scenario challenging. If you have and need to use 200 predictor variables and only 100 observations, you cannot fit a plain, unregularized linear model and get a unique answer. Approaches that add extra constraints (regularized regression such as ridge or lasso, for example) can still produce one, and tree-based approaches are probably workable, but the results may not be satisfactory.
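To make the wide-data problem concrete, here is a rough sketch using synthetic data (random numbers, not a real data set) with 100 observations and 200 predictors. It assumes NumPy, and every variable name is hypothetical. It shows that two very different coefficient vectors both reproduce the labels exactly, even though the labels here are pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_features = 100, 200

# Synthetic, assumed data: random predictors and random (meaningless) labels.
X = rng.normal(size=(n_obs, n_features))
y = rng.normal(size=n_obs)

# lstsq picks one exact solution out of infinitely many (the smallest-norm one).
coef_a, *_ = np.linalg.lstsq(X, y, rcond=None)

# Any direction that X maps to zero can be added without changing the fit.
# With 200 columns and at most 100 independent rows, such directions exist.
_, _, Vt = np.linalg.svd(X, full_matrices=True)
null_direction = Vt[-1]
coef_b = coef_a + 10.0 * null_direction

rank = np.linalg.matrix_rank(X)
fits_a = np.allclose(X @ coef_a, y)
fits_b = np.allclose(X @ coef_b, y)
gap = np.linalg.norm(coef_a - coef_b)
print(rank, fits_a, fits_b, gap)
```

Both coefficient vectors "explain" the training labels perfectly, so there is no basis for trusting either one as a description of the real world.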
Enter PCA (and other dimensionality-reduction approaches). PCA creates new variables, called principal components, that are weighted combinations of the original variables in your data. Each component is built to explain as much of the remaining variance in the data as possible, so no component represents any particular original variable. By using principal components, you lose the ability to directly read off the impact of any given original variable. But in our wide data scenario, you gain the ability to generate a prediction based on the data the principal components represent.
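Here is a small sketch of what that looks like in practice. It computes principal components with a singular value decomposition of the centered data, which is one common way to do it and not necessarily how any particular library implements PCA. The toy data and all names are assumptions for illustration; it requires NumPy.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed toy data: 50 observations of 3 variables, two of them strongly related.
base = rng.normal(size=(50, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(50, 1)),
    2.0 * base + 0.1 * rng.normal(size=(50, 1)),
    rng.normal(size=(50, 1)),
])

# PCA works on centered data, so subtract each column's mean first.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Each row of `components` holds the weights on the ORIGINAL variables,
# so a component is a blend of all of them rather than any single one.
components = Vt
scores = Xc @ components.T                  # the data expressed as components
explained_variance = S**2 / (len(X) - 1)    # variance captured by each component

print(components)
print(explained_variance)
```

The weights in `components` are the reason interpretability suffers: a coefficient on a component is a coefficient on a mixture of every original variable at once.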
The practical application of this is that I could take my data with 200 predictor variables and use PCA to reduce the number of variables to fewer than 100 (with 100 observations, PCA can produce at most 99 components that carry any variance), making mathematical prediction possible while still retaining a decent percentage of the overall signal in the data. But you don’t have to stop there. You might be in a resource-constrained computing environment where reducing the number of predictor variables dramatically cuts the time it takes to generate a model, so you might choose to reduce those 200 predictor variables down to 10 instead. Assuming you can capture enough signal in these 10 principal components (the first principal component captures the most variance by construction, and each later component captures no more than the one before it), it may well be that any resulting loss in predictive power is more than made up for by the reduction in time to produce the model.
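As a sketch of that workflow, the example below builds synthetic data with 200 predictors and 100 observations, keeps the first 10 principal components, and fits an ordinary linear model on them. Everything about the data is an assumption made for illustration (in particular, that the 200 predictors are driven by a handful of hidden factors, which is the kind of structure PCA handles well). It requires NumPy, and the fit quality it reports is measured on the training data only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_features, k = 100, 200, 10

# Assumed structure: 5 hidden factors drive all 200 predictors, plus noise.
latent = rng.normal(size=(n_obs, 5))
loadings = rng.normal(size=(5, n_features))
X = latent @ loadings + 0.5 * rng.normal(size=(n_obs, n_features))
y = latent @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=n_obs)

# PCA via SVD of the centered data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Centering removes one degree of freedom, so one singular value is ~0.
n_useful = int(np.sum(S > 1e-8 * S[0]))

# Share of total variance captured by each component, largest first.
variance_ratio = S**2 / np.sum(S**2)
kept = variance_ratio[:k].sum()

# Replace the 200 predictors with 10 component scores, then fit a linear
# model with an offset: 11 unknowns against 100 observations.
Z = Xc @ Vt[:k].T
design = np.column_stack([np.ones(n_obs), Z])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
pred = design @ coef
r2_train = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

print(n_useful, kept, r2_train)
```

On real data, the share of variance kept by 10 components could be far lower than it is here, so checking `variance_ratio` before choosing how many components to keep is a sensible first step.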
That ends up being the other reason PCA and other dimension reduction approaches might be used: raw compute limitations. If prediction is your only goal and model updates need to happen frequently, using PCA as a preprocessing step can deliver more value simply because you can update faster and more often than you could with a slightly more predictive but slower-to-compute model. The crux here is the actual accuracy lost by PCA (if any; there are situations where it actually corrects problems in your data and produces **better** predictions…) vs. what is gained in terms of time to value. If the PCA approach ends up being close enough, and you don’t care about interpretability of results, then it may be the better approach.
I haven’t done a deep dive on how principal components are generated yet, but based on my initial pass it looks like a typical optimization problem. It’s just maximizing the variance each component captures (equivalently, minimizing the error you get when you rebuild the data from the components) rather than minimizing something like prediction error. If I have a reason to go deeper here, I’ll post an update as to how it works. In the meantime, understanding more about dimensionality reduction, including what it’s really doing and why you might use it, was a really fun learning experience for me.
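One quick way to sanity-check what that optimization is doing: in the standard formulation, the first principal component is the unit-length direction along which the projected data has the largest variance. The sketch below compares it against 1,000 random directions on synthetic data. It assumes NumPy, and `projected_variance` is a helper I made up for this check.

```python
import numpy as np

rng = np.random.default_rng(7)

# Assumed toy data: 200 observations of 5 correlated variables.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)

# First principal component from the SVD of the centered data.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
first_pc = Vt[0]

def projected_variance(direction):
    """Variance of the data after projecting it onto a unit-length direction."""
    unit = direction / np.linalg.norm(direction)
    return np.var(Xc @ unit, ddof=1)

pc_variance = projected_variance(first_pc)

# No random direction should beat the first principal component.
random_directions = rng.normal(size=(1000, 5))
best_random = max(projected_variance(d) for d in random_directions)

print(pc_variance, best_random)
```

Random directions can get close to the first component's variance, but none should exceed it.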