Nutrition & Principal Component Analysis: A Tutorial
A great overview of Principal Component Analysis (PCA), with an example application in the field of nutrition.
Using data from the United States Department of Agriculture, we analyzed the nutritional content of a random sample of food items. Four nutrition variables were analyzed: Vitamin C, Fiber, Fat and Protein. For fair comparison, food items were raw and measured by 100g.
Among food items, the presence of certain nutrients appears correlated. This is illustrated in the bar plot below with 4 example items:
Specifically, fat and protein levels seem to move in the same direction as each other, and in the opposite direction from fiber and vitamin C levels. To confirm our hypothesis, we can check for correlations (tutorial: correlation analysis) between the nutrition variables. As expected, there are large positive correlations between fat and protein levels (r = 0.56), as well as between fiber and vitamin C levels (r = 0.57).
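As a rough illustration of this correlation check, the sketch below builds a small synthetic dataset and computes its correlation matrix with NumPy. The data are made up for the example (they are not the USDA values), and the variable names and the hidden "meat-like vs. vegetable-like" factor are assumptions chosen only to mimic the pattern described above.

```python
import numpy as np

# Synthetic stand-in for the food dataset (NOT the USDA values).
rng = np.random.default_rng(0)
n_items = 200

# Hypothetical hidden factor: high values are meat-like, low values vegetable-like.
latent = rng.normal(size=n_items)

fat = latent + rng.normal(scale=0.8, size=n_items)
protein = latent + rng.normal(scale=0.8, size=n_items)
fiber = -latent + rng.normal(scale=0.8, size=n_items)
vitamin_c = -latent + rng.normal(scale=0.8, size=n_items)

names = ["fat", "protein", "fiber", "vitamin_c"]
data = np.column_stack([fat, protein, fiber, vitamin_c])

# Pearson correlation between every pair of columns (rowvar=False: columns are variables).
corr = np.corrcoef(data, rowvar=False)

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"r({names[i]}, {names[j]}) = {corr[i, j]:+.2f}")
```

In this toy data, fat pairs positively with protein and fiber pairs positively with vitamin C, while correlations across the two pairs are negative. That is the same qualitative pattern as in the food data, although the exact values will differ.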
Therefore, instead of analyzing all 4 nutrition variables, we can combine highly-correlated variables, leaving just 2 dimensions to consider. This is the same strategy used in PCA – it examines correlations between variables to reduce the number of dimensions in the dataset. This is why PCA is called a dimension reduction technique.
Applying PCA to this food dataset results in the following principal components:
The numbers represent weights used in combining variables to derive principal components. For example, to get the top principal component (PC1) value for a particular food item, we add up the amount of Fiber and Vitamin C it contains, with slightly more emphasis on Fiber, and then from that we subtract the amount of Fat and Protein it contains, with Protein negated to a larger extent.
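To make the weighted sum concrete, here is a minimal sketch of how a PC1 value could be computed for a single food item. The weights and the nutrient levels below are hypothetical, picked only to follow the verbal description (positive weights for fiber and vitamin C with fiber slightly larger, negative weights for fat and protein with protein larger in magnitude). The nutrient levels are assumed to be standardized, which is common practice before PCA.

```python
# Hypothetical PC1 weights that follow the verbal description; not the actual results.
pc1_weights = {"fiber": 0.55, "vitamin_c": 0.44, "fat": -0.45, "protein": -0.55}


def pc1_score(item):
    """Weighted sum of an item's (standardized) nutrient levels."""
    return sum(weight * item[name] for name, weight in pc1_weights.items())


# Hypothetical standardized nutrient levels (z-scores), not real measurements.
vegetable_like = {"fiber": 1.2, "vitamin_c": 1.0, "fat": -0.8, "protein": -0.9}
meat_like = {"fiber": -0.9, "vitamin_c": -0.7, "fat": 1.1, "protein": 1.3}

print("vegetable-like item:", round(pc1_score(vegetable_like), 2))
print("meat-like item:", round(pc1_score(meat_like), 2))
```

With weights of this shape, an item rich in fiber and vitamin C gets a positive PC1 value, while an item rich in fat and protein gets a negative one.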
We observe that the top principal component (PC1) summarizes our findings so far – it has paired fat with protein, and fiber with vitamin C. It also takes into account the inverse relationship between the pairs. Hence, PC1 likely serves to differentiate meat from vegetables. The second principal component (PC2) is a combination of two unrelated nutrition variables – fat and vitamin C. It serves to further differentiate sub-categories within meat (using fat) and vegetables (using vitamin C).
Using the top 2 principal components to plot food items results in the best data spread thus far:
Meat items (blue) have low PC1 values, and are thus concentrated on the left of the plot, on the opposite side from vegetable items (orange). Among meats, seafood items (dark blue) have lower fat content, so they have lower PC2 values and are at the bottom of the plot. Several non-leafy vegetarian items (dark orange), having lower vitamin C content, also have lower PC2 values and appear at the bottom.
Choosing the Number of Components. As principal components are derived from existing variables, the information available to differentiate data points is constrained by the number of variables you start with. Hence, the above PCA on food items only generated 4 principal components, corresponding to the original number of variables in the dataset.
Principal components are also ordered by their effectiveness in differentiating data points, with the first principal component doing so to the largest degree. To keep results simple and generalizable, only the first few principal components are selected for visualization and further analysis. The number of principal components to keep is typically chosen with the help of a scree plot:
A scree plot shows the decreasing effectiveness of subsequent principal components in differentiating data points. A rule of thumb is to use the number of principal components corresponding to the location of a kink. In the plot above, the kink is located at the second component. This means that even though having three or more principal components would better differentiate data points, this extra information may not justify the resulting complexity of the solution. As we can see from the scree plot, the top 2 principal components already account for about 70% of data spread. Using fewer principal components to explain the current data sample better ensures that the same components can be generalized to another data sample.
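The numbers behind a scree plot can be sketched as follows. This example runs PCA with plain NumPy on synthetic data (not the USDA values), by eigendecomposition of the correlation matrix. The dataset, the variable order (fat, protein, fiber, vitamin C) and all names are assumptions made for illustration, and the percentages it prints will not match the roughly 70% quoted above.

```python
import numpy as np

# Synthetic stand-in for the food data: 200 items, 4 variables
# ordered as fat, protein, fiber, vitamin C (NOT the USDA values).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
signs = np.array([[1.0, 1.0, -1.0, -1.0]])
data = latent * signs + rng.normal(scale=0.8, size=(200, 4))

# Standardize each variable so that no single unit of measurement dominates.
z = (data - data.mean(axis=0)) / data.std(axis=0)

# PCA via eigendecomposition of the correlation matrix.
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
order = np.argsort(eigvals)[::-1]  # largest spread first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of total spread per component: these are the heights in a scree plot.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)

for i, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {e:.1%} of spread, {c:.1%} cumulative")

# Project every item onto the top 2 components for a 2-D plot.
scores_2d = z @ eigvecs[:, :2]
```

Four input variables yield exactly four components, and their shares of the spread are sorted from largest to smallest. Plotting `explained` against the component number gives the scree plot, and `cumulative` shows how much spread the first few components account for together.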
Maximizing Spread. The main assumption of PCA is that dimensions that reveal the largest spread among data points are the most useful. However, this may not be true. A popular counterexample is the task of counting pancakes arranged in a stack, with pancake mass representing data points:
To count the number of pancakes, one pancake is differentiated from the next along the vertical axis (i.e. height of the stack). However, if the stack is short, PCA would erroneously identify a horizontal axis (i.e. diameter of the pancakes) as a useful principal component for our task, as it would be the dimension along which there is largest spread.
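The pancake argument can be checked numerically with a toy side view of a short, wide stack. All dimensions below (3 pancakes, thickness 1, radius 10) are hypothetical; the sketch only shows that the direction of largest spread is horizontal, even though the pancakes are separated along the vertical axis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pancakes, points_per_pancake = 3, 500
thickness, radius = 1.0, 10.0  # hypothetical dimensions: a short, wide stack

# Side view of the stack: x = horizontal position, y = height.
x = rng.uniform(-radius, radius, size=n_pancakes * points_per_pancake)
layer = np.repeat(np.arange(n_pancakes), points_per_pancake)
y = layer * thickness + rng.uniform(0.0, 0.2, size=x.size)
points = np.column_stack([x, y])

# First principal component = eigenvector of the covariance matrix
# with the largest eigenvalue.
centered = points - points.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)]

print("PC1 direction (horizontal, vertical):", np.round(pc1, 3))
```

The first component points almost entirely along the horizontal axis, so projecting onto it would blur the layers together rather than separate them.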
Interpreting Components. If we are able to interpret the principal components of the pancake stack, with intelligible labels such as “height of stack” or “diameter of pancakes”, we might be able to select the correct principal components for analysis. However, this is often not the case. Interpretations of generated components have to be inferred, and sometimes we may struggle to explain the combination of variables in a principal component.
Nonetheless, having prior domain knowledge could help. In our example with food items, prior knowledge of major food categories helps us to comprehend why nutrition variables are combined the way they are to form principal components.
Orthogonal Components. One major drawback of PCA is that the principal components it generates must not overlap in space; such components are said to be orthogonal. This means that the components are always positioned at 90 degrees to each other. However, this assumption is restrictive as informative dimensions may not necessarily be orthogonal to each other:
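The orthogonality constraint is easy to verify numerically. The sketch below runs PCA on arbitrary correlated data (randomly generated, purely for illustration) using a singular value decomposition, then checks that every pair of principal directions has a dot product of zero.

```python
import numpy as np

# Arbitrary correlated data: 100 points, 4 variables (illustrative only).
rng = np.random.default_rng(1)
x = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
x = x - x.mean(axis=0)

# The rows of vt are the principal directions of the centered data.
_, _, vt = np.linalg.svd(x, full_matrices=False)
components = vt

# Dot products between all pairs of components: 1 on the diagonal
# (unit length), 0 elsewhere (90 degrees apart).
gram = components @ components.T
print(np.round(gram, 6))
```

The result is the identity matrix, up to rounding. This holds by construction, whatever the data look like, which is what makes the assumption restrictive.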
To resolve this, we can use an alternative technique called Independent Component Analysis (ICA).
ICA allows its components to overlap in space, thus they do not need to be orthogonal. Instead, ICA forbids its components to overlap in the information they contain, aiming to reduce mutual information shared between components. Hence, ICA’s components are independent, with each component revealing unique information on the data set.
Information has thus far been represented by the degree of data spread, with dimensions along which data is more spread out being more informative. This may not always be true, as seen from the pancake example. However, ICA is able to overcome this by taking into account other sources of information apart from data spread.
Therefore, ICA may be a backup technique to use if we suspect that components need to be derived based on information beyond data spread, or that components may not be orthogonal.
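To give a feel for how ICA can find non-orthogonal directions, here is a deliberately simple two-dimensional sketch. It is not a production ICA algorithm. It whitens the data, then searches over rotation angles for the one that makes the components as non-Gaussian as possible, measured by excess kurtosis (one of several criteria used in ICA). The sources, the mixing matrix and all names are hypothetical. For real work, an established implementation such as FastICA in scikit-learn would be the more usual choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two independent, non-Gaussian (uniform) hypothetical sources.
sources = rng.uniform(-1.0, 1.0, size=(n, 2))

# Hypothetical mixing: the two informative directions, [1, 0] and [1, 1],
# are 45 degrees apart rather than 90.
mixing = np.array([[1.0, 0.0],
                   [1.0, 1.0]])
observed = sources @ mixing

# Step 1: whiten (decorrelate and scale to unit variance). This is the PCA part.
centered = observed - observed.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
whitener = eigvecs / np.sqrt(eigvals)
whitened = centered @ whitener


def excess_kurtosis(x):
    """Column-wise excess kurtosis; zero for Gaussian data."""
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    return (x ** 4).mean(axis=0) - 3.0


def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


# Step 2: grid search for the rotation that maximizes non-Gaussianity.
angles = np.linspace(0.0, np.pi / 2, 181)
scores = [np.abs(excess_kurtosis(whitened @ rotation(t))).sum() for t in angles]
best = angles[int(np.argmax(scores))]
recovered = whitened @ rotation(best)

# Rows of the inverse unmixing matrix estimate the mixing directions
# (up to order, sign and scale).
estimated_mixing = np.linalg.inv(whitener @ rotation(best))
a, b = estimated_mixing
cos_angle = abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print("angle between ICA directions (degrees):", np.degrees(np.arccos(cos_angle)))
```

The printed angle should come out near the 45 degrees built into the hypothetical mixing matrix, whereas the PCA directions in `eigvecs` are at 90 degrees by construction. The grid search only works in two dimensions; it is used here because it is easy to follow, not because it scales.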
PCA is a classic technique to derive underlying variables, reducing the number of dimensions we need to consider in a dataset. In our example above, we were able to visualize the food dataset in a 2-dimensional graph, even though it originally had 4 variables. However, PCA makes several assumptions, such as relying on data spread and orthogonality to derive components. On the other hand, ICA is not subject to these assumptions. Therefore, when in doubt, one could consider running an ICA to verify and complement results from a PCA.
Bio: Annalyn Ng has worked as a data analyst at Disney Research, Cambridge University, and Singapore's military.
Original. Reposted with permission.
For more posts like this, visit: https://annalyzin.wordpress.com.