On overfitting

First, we are interested in distinguishing those PCs that may be related to the true ‘modes’ of variability in the data. In our case, we are interested in those PCs that represent variations in Antarctic temperature that are related to, for example, changes in the circumpolar wind strength, variations in sea ice distribution, etc. While linear methods like PCA are inherently just an approximation to the real modes in the climate system, in general the first few PCs, which by construction capture the variability that occurs over large spatial scales, do bear a strong relationship with the underlying physics. At the same time, we want to avoid those PCs that are unlikely to represent physically meaningful variability, either because they are simply noise (whether instrumental noise or true random variability in the climate) or because they represent variations of only local significance. This is not to say that local or ‘noisy’ parts of the system aren’t important, but this kind of variability is unlikely to be well represented by a network of predictor variables that is even sparser than the original data field from which the PCs are obtained.

The standard approach to determining which PCs to retain is to use the criterion of North et al. (1982), which provides an estimate of the uncertainty in the eigenvalues of the linear decomposition of the data into its time-varying PCs and spatial eigenvectors (or EOFs). Most of the higher-order PCs will have indistinguishable variance. In the case where the lower-order PCs (which explain the majority of the variance) also have indistinguishable variance, caution needs to be applied in choosing which to retain, because PCA will usually scramble the underlying spatial (and temporal) structure these PCs represent. One can easily demonstrate this by creating an artificial spatio-temporal data set from simple time functions (say, a bunch of sines and cosines), plus random noise: PCA will only extract the original functions if they have significantly different amplitudes, as the sketch below illustrates.
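To make this concrete, here is a minimal sketch of that demonstration in Python (the frequencies, amplitudes, noise level, and grid size are arbitrary choices for illustration, not values from our analysis): two sinusoidal ‘modes’ of equal amplitude are projected onto random spatial patterns, noise is added, and PCA is applied via the SVD. With equal amplitudes the leading eigenvalues are effectively degenerate, so each PC typically comes out as a mixture of the two true modes; make one mode much stronger and the PCs separate cleanly.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)

# Two underlying "modes" with EQUAL amplitudes (the degenerate case)
mode1 = np.sin(2 * np.pi * 0.5 * t)
mode2 = np.cos(2 * np.pi * 1.3 * t)

# Random spatial patterns over 50 grid points
pat1, pat2 = rng.normal(size=50), rng.normal(size=50)

# Synthetic field (time x space): the two modes plus random noise
data = (np.outer(mode1, pat1) + np.outer(mode2, pat2)
        + 0.5 * rng.normal(size=(500, 50)))

# PCA via SVD of the anomaly matrix
anom = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(anom, full_matrices=False)
pcs = u * s  # time-varying PCs

# With equal amplitudes, each leading PC tends to correlate with BOTH
# true modes, i.e. PCA has scrambled the underlying structure.
for i in range(2):
    r1 = np.corrcoef(pcs[:, i], mode1)[0, 1]
    r2 = np.corrcoef(pcs[:, i], mode2)[0, 1]
    print(f"PC{i + 1}: r(mode1) = {r1:+.2f}, r(mode2) = {r2:+.2f}")
```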

The figure below shows the eigenvalue spectrum, including the uncertainties, for both the satellite data from the main temperature reconstruction and the occupied weather station data used in Steig et al. (2009).

It’s apparent that in the satellite data (our predictand data set), there are three eigenvalues that lie well above the rest. One could argue for retaining #4 as well, though it does slightly overlap with #5. Retaining more than 4 requires retaining at least 6, and at most 7, to avoid having to retain all the rest (due to their overlapping error bars). With the weather station data (our predictor data set), one could justify retaining 4 by the same criterion, or at most 7. Together, this suggests that in the combined data sets, a maximum of 7 PCs should be retained, and as few as 3. Retaining just 3 is a very reasonable choice, given the significant drop-off in variance explained in the satellite data after this point: remember, we are trying to avoid including PCs that simply represent noise. For simple filtering applications (image processing, for example), the risk of retaining too much noise is small. For extrapolation in time (the climate reconstruction problem) it is critical that the PCs approximate the dynamics of the system. In that application, retaining fewer PCs, and in particular only those that are set apart by a large break in slope (i.e., PCs 3 or 4 in the actual data, above), is the choice least likely to needlessly inflate the noise.
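For concreteness, here is a minimal sketch of the separability logic behind this reasoning, using North et al.’s ‘rule of thumb’ that the sampling error of each eigenvalue is roughly the eigenvalue times sqrt(2/n*), with n* the effective number of independent samples. The eigenvalues and n* below are invented for illustration; they are not the values from the actual figure.

```python
import numpy as np

def north_error(eigs, n_eff):
    """North et al. (1982) rule of thumb: the sampling uncertainty of
    each eigenvalue is approximately lambda * sqrt(2 / n_eff)."""
    return eigs * np.sqrt(2.0 / n_eff)

def valid_truncations(eigs, err):
    """Truncation points that don't split a degenerate group: retaining
    k PCs is defensible only if eigenvalue k is cleanly separated from
    eigenvalue k+1 (their error bars do not overlap)."""
    return [k + 1 for k in range(len(eigs) - 1)
            if eigs[k] - err[k] > eigs[k + 1] + err[k + 1]]

# Illustrative spectrum (NOT the real data), ~300 effective samples
eigs = np.array([40.0, 20.0, 12.0, 5.0, 4.5, 4.0, 3.8, 3.5])
err = north_error(eigs, n_eff=300)
print(valid_truncations(eigs, err))  # -> [1, 2, 3]: stop at 1, 2, or 3 PCs
```

Applied to the spectra in the figure, this is exactly the logic that yields the ‘as few as 3, at most 7’ range described above.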

In short, a priori we would not want to retain more than 7 PCs, and retaining as few as 3 is clearly justifiable. We would most certainly not want to retain 9, 11, or 25, since doing so is almost certain to result in extrapolation errors. Which leads us to our a posteriori test: how well do the various reconstructions do, as a function of the number of retained PCs?

As in the simple x,y example above, we now want to take a look at how our extrapolation compares with reality. To do this, we withhold some of the data, calibrate the model (a separate calculation for each number of PCs we want to consider), and then compare the resulting reconstruction with the withheld data. Such calibration/verification tests were, of course, done in our paper, and they are the major thing that distinguished our paper from previous work.
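As a rough illustration of the procedure (this is not the actual code from the paper; the function and variable names are hypothetical, and the RE statistic is used here as the verification score), one such test might look like this:

```python
import numpy as np

def verification_scores(field, predictors, k_list, cal, ver):
    """Hypothetical sketch of a calibration/verification test.

    For each candidate truncation k: take the leading k PCs/EOFs of the
    field over the calibration period, fit a least-squares map from the
    predictor network to those PCs, reconstruct the withheld verification
    period, and score it against the withheld data with RE
    (1 = perfect; 0 = no better than the calibration-period mean).
    """
    anom = field - field[cal].mean(axis=0)       # anomalies w.r.t. calibration mean
    _, _, vt = np.linalg.svd(anom[cal], full_matrices=False)
    scores = {}
    for k in k_list:
        eofs = vt[:k]                            # k spatial patterns
        pcs_cal = anom[cal] @ eofs.T             # calibration-period PCs
        beta, *_ = np.linalg.lstsq(predictors[cal], pcs_cal, rcond=None)
        recon = (predictors[ver] @ beta) @ eofs  # extrapolated field
        truth = anom[ver]
        scores[k] = 1.0 - np.sum((truth - recon) ** 2) / np.sum(truth ** 2)
    return scores

# e.g. verification_scores(T_sat, T_stations, [3, 5, 7, 9, 11, 25], cal_idx, ver_idx)
```

Comparing the scores across k is the a posteriori test: if retaining more PCs degraded the verification skill, that would be direct evidence of overfitting.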
