I don’t tend to read other blogs much, despite contributing to RealClimate. And I’m especially uninterested in spending time reading blogs full of ad hominem attacks. But a handful of colleagues apparently do read this stuff, and have encouraged me to take a look at the latest commentaries on our Antarctic temperature story. Since I happen to be teaching about principal component analysis to my graduate students this week, I thought it would be worthwhile to put up a pedagogical post on the subject. (If you don’t know what principal component analysis (PCA) is, take a look at our earlier post, Dummy’s Guide to the Hockey Stick Controversy).
For starters, consider the following simple example. Suppose we have measured two variables over some time period (say, temperature and humidity, call them x and y). We manage to get 10 observations of x and y simultaneously, but unfortunately one of our instruments breaks and we wind up with 10 additional measurements of x only, but none of y. Fortunately, we know that x and y are physically related to one another, so our hope is that we can use the paired observations of x and y to estimate what y should have been, had we been able to measure it throughout the experiment. We plot the variables y vs. x and try fitting some function to the data. If we get the function right, we ought to be able to estimate what y is for any arbitrary value of x. The obvious thing one might try, given the apparent curve to the data, is to use a 2nd-order polynomial (that is, a parabola):
Looks pretty good, right?
Well... no. Actually, for this particular example, we should have just used a straight line. This becomes obvious after we fix the broken instrument and manage to increase the size of the data set:
Of course, our estimate of the best fit line to the data is itself uncertain, and when we include more data, we wind up with a slightly different result (shown by the dashed green line, below). But we are a lot closer than we would have been using the parabola, which diverges radically from the data at large values of x:
Of course, the parabola looked like it might be a better fit — after all, it is closer to more of the data points, which really do seem to be curving upwards. And we couldn’t have known in advance, could we? Well, we could have increased our chances of being right, by using some basic statistics (a chi-squared test for example). The results would have shown us that there were not enough degrees of freedom in the data to justify the use of the higher-order curve (which reduces the degrees of freedom from 8 to 7 in this example). Choosing the parabola in this case is a classic example of overfitting.
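For readers who want to see how reliably this goes wrong, here is a minimal Python sketch. The numbers are entirely synthetic (not the data in the figures above): we generate observations from a straight line plus noise, pretend the instrument broke after the first 10 pairs, fit both a line and a parabola to those pairs, and check how often the extra parameter hurts the extrapolation.

```python
import numpy as np

rng = np.random.default_rng(0)
x_all = np.linspace(0.0, 10.0, 20)

worse = 0
trials = 500
for _ in range(trials):
    # Truth is linear; add instrument noise.
    y_all = 2.0 * x_all + 1.0 + rng.normal(0.0, 1.5, size=x_all.size)

    # Only the first 10 (x, y) pairs were observed before the instrument broke.
    x_obs, y_obs = x_all[:10], y_all[:10]
    x_new, y_new = x_all[10:], y_all[10:]

    # Fit a straight line (2 parameters) and a parabola (3 parameters)
    # to the paired observations only.
    line = np.polyfit(x_obs, y_obs, 1)
    parab = np.polyfit(x_obs, y_obs, 2)

    # Compare mean squared extrapolation error over the unobserved range.
    err_line = np.mean((np.polyval(line, x_new) - y_new) ** 2)
    err_parab = np.mean((np.polyval(parab, x_new) - y_new) ** 2)
    worse += err_parab > err_line

print(f"parabola extrapolates worse in {worse / trials:.0%} of trials")
```

The point is not that the parabola always loses — in any single realization it might get lucky — but that the extra, unjustified parameter is fit to noise, and noise amplified over the extrapolation range loses far more often than not.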
The basic lesson here is that one should avoid using more parameters than necessary when fitting a function (or functions) to a set of data. Doing otherwise, more often than not, leads to large extrapolation errors.
In using PCA for statistical climate reconstruction, avoiding overfitting means — in particular — carefully determining the right number of PCs to retain. Just as in the simple example above, there are two different — but both important — tests to apply here, one a priori and one a posteriori.
First, we are interested in distinguishing those PCs that may be related to the true ‘modes’ of variability in the data. In our case, we are interested in those PCs that represent variations in Antarctic temperature that are related to, for example, changes in the circumpolar wind strength, variations in sea ice distribution, etc. While linear methods like PCA are inherently just an approximation to the real modes in the climate system, in general the first few PCs — which, by construction, capture the variability that occurs over large spatial scales — do bear a strong relationship with the underlying physics. At the same time, we want to avoid those PCs that are unlikely to represent physically meaningful variability, either because they are simply noise (whether instrumental noise or true random variability in the climate) or because they represent variations of only local significance. This is not to say that local or ‘noisy’ parts of the system aren’t important, but this kind of variability is unlikely to be well represented by a network of predictor variables that is even sparser than the original data field from which the PCs are obtained.
The standard approach to determining which PCs to retain is to use the criterion of North et al. (1982), which provides an estimate of the uncertainty in the eigenvalues of the linear decomposition of the data into its time varying PCs and spatial eigenvectors (or EOFs). Most of the higher order PCs will have indistinguishable variance. In the case where the lower order PCs (which explain the majority of the variance) also have indistinguishable variance, caution needs to be applied in choosing which to retain, because PCA will usually scramble the underlying spatial (and temporal) structure these PCs represent. One can easily demonstrate this by creating an artificial spatio-temporal data set from simple time functions (say, a bunch of sines and cosines), plus random noise. PCA will only extract the original functions if they have significantly different amplitudes.
The figure below shows the eigenvalue spectrum — including the uncertainties — for both the satellite data from the main temperature reconstruction and the occupied weather station data used in Steig et al., 2009.
It’s apparent that in the satellite data (our predictand data set), there are three eigenvalues that lie well above the rest. One could argue for retaining #4 as well, though it does slightly overlap with #5. Retaining more than 4 requires retaining at least 6, and at most 7, to avoid having to retain all the rest (due to their overlapping error bars). With the weather station data (our predictor data set), one could justify choosing to retain 4 by the same criteria, or at most 7. Together, this suggests that in the combined data sets, a maximum of 7 PCs should be retained, and as few as 3. Retaining just 3 is a very reasonable choice, given the significant drop off in variance explained in the satellite data after this point: remember, we are trying to avoid including PCs that simply represent noise. For simple filtering applications (image processing for example), the risk of retaining too much noise is small. For extrapolation in time — the climate reconstruction problem — it is critical that the PCs approximate the dynamics in the system. In that application, retaining fewer PCs, and in particular only those that are distinguished by a large break in slope (i.e. PCs 3 or 4 in the actual data, above), is the choice least likely to unnecessarily inflate possible noise.
In short, a priori we would not want to retain more than 7 PCs, while retaining as few as 3 is clearly justifiable. We would most certainly not want to retain 9, 11, or 25, since doing so is almost certain to result in extrapolation errors. Which leads us to our a posteriori test: how well do the various reconstructions do, as a function of the number of retained PCs?
As in the simple x,y example above, we now want to take a look at how our extrapolation compares with reality. To do this, we withhold some of the data, calibrate the model (a separate calculation for each number of PCs we want to consider), and then compare the resulting reconstruction with the withheld data. Such calibration/verification tests were, of course, done in our paper, and are the major thing that distinguishes it from previous work.
As shown in the figure below, there is not much change to the goodness-of-fit (as measured in this case by the correlation coefficient of the withheld and reconstructed data in the split calibration/verification tests), whether one uses 3, 4, 5, or 6 PCs. There is a significant (and continuous) drop off in skill, however, if one uses more than 7. We can therefore eliminate using more than 7 PCs by this a posteriori test. And since little was gained by using more than 3 or 4, parsimony would indicate using no more.
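The logic of that test can be sketched in a few lines of Python. To be clear about what is assumed: this is a toy version on synthetic data, not our actual algorithm (the paper uses RegEM as well). I build a field from three known modes plus noise, designate a handful of grid points as “stations,” regress the retained PCs on the station data over a calibration period, and then score the reconstruction over the withheld period. Because the data are synthetic, I score against the known noise-free truth; with real data one scores against the withheld observations themselves.

```python
import numpy as np

rng = np.random.default_rng(2)
nt, nx = 200, 60          # time steps, grid points
n_cal = 100               # calibration period; the rest is withheld
x = np.linspace(0, 2 * np.pi, nx, endpoint=False)

# Three "true" modes of variability plus white noise.
patterns = np.array([np.sin(x), np.sin(2 * x), np.cos(3 * x)])
patterns /= np.linalg.norm(patterns, axis=1, keepdims=True)
amps = np.array([6.0, 4.0, 3.0])
truth = (rng.normal(size=(nt, 3)) * amps) @ patterns
field = truth + rng.normal(0.0, 1.0, size=(nt, nx))

stations = np.arange(0, nx, 3)           # 20 "station" grid points
others = np.setdiff1d(np.arange(nx), stations)

def verification_skill(k):
    """Calibrate with k retained PCs, reconstruct the withheld period,
    and return the mean correlation with truth at non-station points."""
    cal = field[:n_cal]
    cal_anom = cal - cal.mean(axis=0)
    u, s, vt = np.linalg.svd(cal_anom, full_matrices=False)
    pcs = u[:, :k] * s[:k]               # PCs over the calibration period
    eofs = vt[:k]

    # Regress each retained PC on the station data (least squares).
    beta, *_ = np.linalg.lstsq(cal_anom[:, stations], pcs, rcond=None)

    # Predict the PCs in the verification period and rebuild the field.
    ver = field[n_cal:] - cal.mean(axis=0)
    recon = (ver[:, stations] @ beta) @ eofs

    r = [np.corrcoef(recon[:, j], truth[n_cal:, j])[0, 1] for j in others]
    return float(np.mean(r))

for k in (1, 3, 10, 30):
    print(k, round(verification_skill(k), 3))
```

With too few PCs the reconstruction misses real modes; with far too many, the regression fits noise PCs in the calibration period that contribute nothing but noise in the verification period, and skill drops — the spatio-temporal analogue of the parabola above.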
Now the alert reader may point out that the reconstruction skill (as calculated with simple PCA and correlation scores at least) seems to be the greatest in West Antarctica when we use 4 PCs, rather than 3 as in the paper. However, as shown in the first figure below, choosing more than 3 PCs results in the same or even larger trends (especially in West Antarctica) than does using 3 PCs. Of course, we can get smaller trends if we use just two PCs, or more than 10, but this can’t be justified under either a priori or a posteriori criteria — no more than using 13 or 25 can.* The result is the same whether we use a simple PCA (as I’ve mostly restricted myself to here, for simplicity of discussion), or the RegEM algorithm (we used both methods in the paper):
In summary: our choice to retain 3 PCs was not only a reasonable a priori choice, but it produces comparable or smaller trends than other reasonable a priori choices. Using 4, 6, or 7 produces larger trends in West Antarctica, and little change in East Antarctica. Using more than 7 (at least up to 12) increases the trend in both areas. So much for the claim, reported on several web sites, that we purposefully chose 3 PCs to maximize the estimated warming trend!
All of this talk about statistics, of course, can distract one from the fact that the key finding in our paper — warming in West Antarctica, akin to that on the Peninsula — is obvious in the raw data:
Raw temperature trends in different versions of the monthly cloud-masked temperature data: (a) Comiso (decadal trends, 1982-1999), (b) Monaghan et al. (1982-1999), (c) Steig et al. (1982-1999), (d) Steig et al. (1982-2006).
It is, furthermore, exactly what one would expect from the atmospheric dynamics. Low-pressure storms rarely penetrate into the high polar plateau of East Antarctica, and temperatures there are radiation dominated, not transport dominated as they are in West Antarctica and on the Peninsula. This is why those general circulation models that properly simulate the observed atmospheric circulation and sea ice changes — increasing around most of East Antarctica, but decreasing off the coast of West Antarctica and the Peninsula — also match the pattern and magnitude of the temperature trends we observe:
Figure 3b from Goosse et al., 2009, showing temperature trends (1960-2000) simulated by an intermediate-complexity coupled ocean-atmosphere model that uses data assimilation to constrain the model to match observed surface temperature variations at surface weather stations. No satellite temperature data are used. Yellow areas are warming at 0.1 to 0.2 degrees/decade in the simulation, and light orange at 0.2 to 0.3 degrees/decade.
A final point is that it is actually rather bizarre that so much effort has been spent in trying to find fault with our Antarctic temperature paper. It appears this is a result of the persistent belief that by embarrassing specific scientists, the entire edifice of ‘global warming’ will fall. This is remarkably naive, as we have discussed before. The irony here is that our study was largely focused on regional climate change that may well be largely due to natural variability, as we clearly state in the paper. This seems to have escaped much of the blogosphere, however.
*While I was working on this post, someone called “Ryan O” posted a long discussion claiming that he gets better verification skill than in our paper using 13 PCs. This is curious, since it contradicts my finding that using so many PCs substantially degrades reconstruction skill. It appears that what has been done is first to adjust the satellite data so that it better matches the ground data, and then to do the reconstruction calculations. This doesn’t make any sense: it amounts to pre-optimizing the validation data (which are supposed to be independent), violating the very point of ‘verification’ altogether. This is not to say that some adjustment of the satellite data is unwarranted, but it can’t be done this way if one wants to use the data for verification. (And the verification results one gets this way certainly cannot be compared against the verification results based on untuned satellite data.)
Further, the claim made by ‘Ryan O’ that our calculations ‘inflate’ the temperature trends in the data is completely specious. All that has done is take our results, add additional PCs (resulting in a lower trend in this case), and then subtract those PCs (thereby getting the original trends back). In other words, 2 + 1 – 1 = 2.
Since I said at the outset that this is a pedagogical post, I will end by noting that perhaps something has been learned here: it appears that using our methods, and our data, even a self-professed ‘climate skeptic’ can obtain essentially the same results that we did: long term warming over virtually all of the ice sheet.