On overfitting

I don’t tend to read other blogs much, despite contributing to RealClimate. And I’m especially uninterested in spending time reading blogs full of ad hominem attacks. But a handful of colleagues apparently do read this stuff, and have encouraged me to take a look at the latest commentaries on our Antarctic temperature story. Since I happen to be teaching about principal component analysis (PCA) to my graduate students this week, I thought it would be worthwhile to put up a pedagogical post on the subject. (If you don’t know what PCA is, take a look at our earlier post, Dummy’s Guide to the Hockey Stick Controversy.)

For starters, consider the following simple example. Suppose we have measured two variables over some time period (say, temperature and humidity, call them x and y). We manage to get 10 observations of x and y simultaneously, but unfortunately one of our instruments breaks and we wind up with 10 additional measurements of x only, but none of y. Fortunately, we know that x and y are physically related to one another, so our hope is that we can use the paired observations of x and y to estimate what y should have been, had we been able to measure it throughout the experiment. We plot the variables y vs. x and try fitting some function to the data. If we get the function right, we ought to be able to estimate what y is for any arbitrary value of x. The obvious thing one might try, given the apparent curve to the data, is to use a 2nd-order polynomial (that is, a parabola):

Looks pretty good, right?

Well... no. Actually, for this particular example, we should have just used a straight line. This becomes obvious after we fix the broken instrument and manage to increase the size of the data set:

Of course, our estimate of the best fit line to the data is itself uncertain, and when we include more data, we wind up with a slightly different result (shown by the dashed green line, below). But we are a lot closer than we would have been using the parabola, which diverges radically from the data at large values of x:

Of course, the parabola looked like it might be a better fit; after all, it is closer to more of the data points, which really do seem to be curving upwards. And we couldn’t have known in advance, could we? Well, we could have increased our chances of being right by using some basic statistics (a chi-squared test, for example). The results would have shown us that there were not enough degrees of freedom in the data to justify the use of the higher-order curve (which reduces the degrees of freedom from 8 to 7 in this example). Choosing the parabola in this case is a classic example of overfitting.

The basic lesson here is that one should avoid using more parameters than necessary when fitting a function (or functions) to a set of data. Doing otherwise, more often than not, leads to large extrapolation errors.
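For readers who want to try this themselves, here is a minimal sketch of the two-variable example in Python with NumPy. Everything specific in it is an assumption made for illustration, not the data behind the figures: the "true" relationship (y = 2x + 1), the noise level sigma, the random seed, and the helper name fit_and_score are all hypothetical. It fits a line and a parabola to 10 paired observations, reports the residual degrees of freedom (8 and 7) and a reduced chi-squared for each, and then compares how far each fit strays from the assumed truth at 10 larger values of x.

```python
import numpy as np


def fit_and_score(x, y, degree, sigma):
    """Least-squares polynomial fit (hypothetical helper for this sketch).

    Returns the coefficients, the residual degrees of freedom, and the
    reduced chi-squared, assuming a known measurement error sigma.
    """
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    dof = len(x) - (degree + 1)  # 10 points: 8 for a line, 7 for a parabola
    chi2_red = np.sum((residuals / sigma) ** 2) / dof
    return coeffs, dof, chi2_red


# Assumed setup: the true relationship is linear, with Gaussian noise.
rng = np.random.default_rng(0)
sigma = 1.0
x_obs = np.linspace(0.0, 5.0, 10)  # the 10 paired observations
y_obs = 2.0 * x_obs + 1.0 + rng.normal(0.0, sigma, x_obs.size)

# The 10 extra x values for which y was never measured.
x_new = np.linspace(5.5, 10.0, 10)
y_true_new = 2.0 * x_new + 1.0

results = {}
for degree in (1, 2):
    coeffs, dof, chi2_red = fit_and_score(x_obs, y_obs, degree, sigma)
    # Extrapolation error against the assumed truth, outside the fitted range.
    extrap_rmse = np.sqrt(np.mean((np.polyval(coeffs, x_new) - y_true_new) ** 2))
    results[degree] = {"dof": dof, "chi2_red": chi2_red, "extrap_rmse": extrap_rmse}
    print(f"degree {degree}: dof={dof}, reduced chi2={chi2_red:.2f}, "
          f"extrapolation RMSE={extrap_rmse:.2f}")
```

The parabola can never fit the 10 observed points worse than the line does, since the line is a special case of it, so a lower misfit alone proves nothing. The question is whether the improvement is large enough to pay for the lost degree of freedom. If the reduced chi-squared of the line is already close to 1, the extra parameter is not justified. Try a few different seeds: the parabola will often, though not on every draw, do noticeably worse than the line at large x.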

In using PCA for statistical climate reconstruction, avoiding overfitting means — in particular — carefully determining the right number of PCs to retain. Just as in the simple example above, there are two different — but both important — tests to apply here, one a priori and one a posteriori.
