# Dummies guide to the latest “Hockey Stick” controversy Guide pratique pour comprendre la dernière controverse sur la « crosse de hockey »

Often the first few components represent something recognisable and physical meaningful (at least in climate data applications). If a large part of the data set has a trend, than the mean trend may show up as one of the most important PCs. Similarly, if there is a seasonal cycle in the data, that will generally be represented by a PC. However, remember that PCs are just mathematical constructs. By themselves they say nothing about the physics of the situation. Thus, in many circumstances, physically meaningful timeseries are ‘distributed’ over a number of PCs, each of which individually does not appear to mean much. Different methodologies or conventions can make a big difference in which pattern comes up tops. If the aim of the PCA analysis is to determine the most important pattern, then it is important to know how robust that pattern is to the methodology. However, if the idea is to more simply summarize the larger data set, the individual ordering of the PCs is less important, and it is more crucial to make sure that as many significant PCs are included as possible.

3) How do you know whether a PC has significant information?

This determination is usually based on a ‘Monte Carlo’ simulation (so-called because of the random nature of the calculations). For instance, if you take 1000 sets of random data (that have the same statistical properties as the data set in question), and you perform the PCA analysis 1000 times, there will be 1000 examples of the first PC. Each of these will explain a different amount of the variation (or variance) in the original data. When ranked in order of explained variance, the tenth one down then defines the 99% confidence level: i.e. if your real PC explains more of the variance than 99% of the random PCs, then you can say that this is significant at the 99% level. This can be done for each PC in turn. (This technique was introduced by Preisendorfer et al. (1981), and is called the Preisendorfer N-rule).

The figure to the right gives two examples of this. Here each PC is plotted against the amount of fractional variance it explains. The blue line is the result from the random data, while the blue dots are the PC results for the real data. It is clear that at least the first two are significantly separated from the random noise line. In the other case, there are 5 (maybe 6) red crosses that appear to be distinguishable from the red line random noise. Note also that the first (‘most important’) PC does not always explain the same amount of the original data.

4) What do different conventions for PC analysis represent?

Some different conventions exist regarding how the original data should be normalized. For instance, the data can be normalized to have an average of zero over the whole record, or over a selected sub-interval. The variance of the data is associated with departures from the whatever mean was selected. So the pattern of data that shows the biggest departure from the mean will dominate the calculated PCs. If there is an *a priori* reason to be interested in departures from a particular mean, then this is a way to make sure that those patterns move up in the PC ordering. Changing conventions means that the explained variance of each PC can be different, the ordering can be different, and the number of significant PCs can be different.

5) How can you tell whether you have included enough PCs?

Page 2 of 9 | Previous page | Next page