Gavin Schmidt and Michael Mann
Readers may recall a flurry of excitement in the blogosphere concerning the McShane and Wyner paper in August. Well, the discussions on the McShane and Wyner paper in AOAS have now been put online. There are a stunning 13 different discussion pieces, an editorial and a rebuttal. The invited discussions and rebuttal were basically published ‘as is’, with simple editorial review, rather than proper external peer review. This is a relatively unusual way of doing things in our experience, but it does seem to have been effective at getting rapid responses with a wide variety of perspectives. The downside is that, without peer review, a large number of unjustified, unsupportable and irrelevant statements have also got through.
A few of these discussions were already online, i.e. those from Martin Tingley, from Schmidt, Mann and Rutherford (SMR), and from Smerdon. Others, including contributions from Nychka & Li, Wahl & Ammann, McIntyre & McKitrick, Smith, Berliner and Rougier, are newly available on the AOAS site and we have not yet read these as carefully.
Inevitably, the focus in the discussions is on problems with MW, but it is worth stating up front here (as is also stated in a number of the papers) that MW made positive contributions to the discussion as well – they introduced a number of new methods (and provided code that allows everyone to try them out), and their use of the Markov Chain Monte Carlo (MCMC) Bayesian approach to assess uncertainties in the reconstructions is certainly interesting. This does not excuse their rather poor framing of the issues, and the multiple errors they made in describing previous work, but it does make the discussions somewhat more interesting than a simple error-correcting exercise might have been. MW are also to be commended on actually following through on publishing a reconstruction and its uncertainties, rather than simply pointing to potential issues and never working through the implications.
The discussions raise some serious general issues with MW’s work – with respect to how they use the data, the methodologies they introduce (specifically the ‘Lasso’ method), the conclusions they draw, whether there are objective methods to decide whether one method of reconstruction is better than another and whether the Bayesian approach outlined in the last part of the paper is really what it is claimed to be. But there are also a couple of issues that are very specific to the MW analysis; for instance, the claim that MW used the same data as Mann et al, 2008 (henceforth M08).
On that specific issue, presumably just an oversight, MW apparently used the “Start Year” column in the M08 spreadsheet instead of the “Start Year (for recon)” column. The difference between the two is related to the fact that many tree ring reconstructions only have a small number of trees in their earliest periods, which greatly inflates their uncertainty (and therefore reduces their utility). To reduce the impact of this problem, M08 only used tree ring records when they had at least 8 individual trees, which left 59 series in the AD 1000 frozen network. The fact that there were only 59 series in the AD 1000 network of M08 was stated clearly in the paper, and the criterion regarding the minimal number of trees (8) was described in the Supplementary Information. The difference in results between the correct M08 network and the spurious 95-record network MW actually used is unfortunately quite significant. Using the correct data substantially reduces the estimates of peak medieval warmth shown by MW (as well as reducing the apparent spread among the reconstructions). This is even more true when the frequently challenged “Tiljander” series are removed, leaving a network of 55 series. In their rebuttal, MW claim that the M08 quality control is simply an ‘ad hoc’ filtering and deny that they made a mistake at all. This is not really credible, and it would have done them much credit to simply accept this criticism.
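To make the difference between the two start-year columns concrete, here is a minimal Python sketch of a minimum-sample-depth screen. It is an illustration only, not the M08 code: the function names, the year-to-tree-count input format, and the rule that the count must stay at or above the threshold from the start year onward are all our assumptions.

```python
def recon_start_year(sample_depth, min_trees=8):
    """First year from which the tree count stays >= min_trees.

    sample_depth: dict mapping year -> number of trees (hypothetical format).
    Returns None if the threshold is never met through the end of the record.
    """
    start = None
    for year in sorted(sample_depth):
        if sample_depth[year] >= min_trees:
            if start is None:
                start = year
        else:
            start = None  # depth fell below the threshold, so reset
    return start


def frozen_network(series, network_year, min_trees=8):
    """Names of the series that are usable by network_year after screening."""
    selected = []
    for name, depth in series.items():
        start = recon_start_year(depth, min_trees)
        if start is not None and start <= network_year:
            selected.append(name)
    return sorted(selected)


# Two made-up records: both begin before AD 1000, but only one has
# at least 8 trees by then.
series = {
    "site_a": {y: (3 if y < 1100 else 12) for y in range(950, 1201)},
    "site_b": {y: 10 for y in range(900, 1201)},
}
network_1000 = frozen_network(series, 1000)
```

Selecting on the raw first year of each record would keep both made-up sites in an AD 1000 network; screening on sample depth keeps only `site_b`.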
With just this correction, applying MW’s own procedures yields strong conclusions regarding how anomalous recent warmth is in the longer-term context. MW found recent warmth to be unusual in a long-term context: they estimated an 80% likelihood that the decade 1997-2006 was warmer than any other for at least the past 1000 years. Using the more appropriate 55-proxy dataset with the same estimation procedure (which involved retaining K=10 PCs of the proxy data) yields a higher probability of 84% that recent decadal warmth is unprecedented for the past millennium.
However, K=10 principal components is almost certainly too large, and the resulting reconstruction likely suffers from statistical over-fitting. Objective selection criteria applied to the M08 AD 1000 proxy network, as well as independent “pseudoproxy” analyses (discussed below), favor retaining only K=4 PCs. (Note that MW correctly point out that SMR made an error in calculating this, but correct application of the Wilks (2006) method fortunately does not change the result: 4 PCs should be retained in each case.) Nonetheless, this choice yields a very close match with the relevant M08 reconstruction. It also yields considerably higher probabilities, up to 99%, that recent decadal warmth is unprecedented for at least the past millennium. These posterior probabilities imply substantially higher confidence than the “likely” assessment by M08 and IPCC (2007) (a 67% level of confidence). Indeed, a probability of 99% not only exceeds the IPCC “very likely” threshold (90%), but reaches the “virtually certain” (99%) threshold. In this sense, the MW analysis, using the proper proxy data and proper methodological choices, yields inferences regarding the unusual nature of recent warmth that are even more confident than expressed in past work.
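For readers wondering how a figure like 80% or 99% comes out of a Bayesian reconstruction, the sketch below shows the basic counting step in Python. It is a simplified illustration, not the MW procedure: it assumes the posterior is available as a list of sampled annual temperature paths, and it compares non-overlapping decades aligned to the end of the record. The function name and inputs are hypothetical.

```python
def prob_recent_decade_unprecedented(draws, decade=10):
    """Fraction of posterior draws in which the final decade has the
    highest mean of all non-overlapping decades in that draw.

    draws: list of equally long lists of annual values, one per posterior
    sample (hypothetical format). Each path must span at least two decades.
    """
    hits = 0
    for path in draws:
        n_blocks = len(path) // decade
        # Align blocks to the end so the last block is the most recent decade.
        offset = len(path) - n_blocks * decade
        means = [
            sum(path[offset + i * decade: offset + (i + 1) * decade]) / decade
            for i in range(n_blocks)
        ]
        if means[-1] > max(means[:-1]):
            hits += 1
    return hits / len(draws)


# Four toy "posterior draws" with a two-year "decade" to keep it readable.
toy_draws = [
    [0, 0, 1, 1, 2, 2],  # last block warmest
    [0, 0, 3, 3, 2, 2],  # an earlier block is warmer
    [1, 1, 0, 0, 5, 5],  # last block warmest
    [2, 2, 2, 2, 2, 2],  # tie, so not strictly warmer
]
p = prob_recent_decade_unprecedented(toy_draws, decade=2)
```

In this toy case two of the four draws put the final block strictly above every earlier one, so `p` is 0.5. Ties are counted as not unprecedented, which is a design choice of this sketch.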
An important real issue is whether proxy data provide more information than naive models (such as the mean of the calibrating data, for instance) or outperform random noise of various types. This is something that has been addressed in many previous studies, which have come to very different conclusions than MW, and so the reasons why MW came to their conclusion are worth investigating. Two factors appear to be important – their use of the “Lasso” method exclusively to assess this, and the use of short holdout periods (30 years) for both extrapolated and interpolated validation periods.
So how do you assess how good a method is? This is addressed in almost half of the discussion papers. Tingley in particular gives strong evidence that Lasso is not in fact a very suitable method, and is outperformed by his Composite Regression method in test cases. Kaplan points out that using noise with significant long term trends will also perform well in interpolation. Both Smith and the paper by Craigmile and Rajaratnam also address this point.
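The holdout idea can be sketched in a few lines of Python. This is a generic illustration under our own assumptions, not the MW code: blocks that touch either end of the calibration period are labelled "extrapolated" and all others "interpolated", and the naive benchmark is the mean of the years outside the block. The function names and the 149-year example length are hypothetical.

```python
import math


def holdout_blocks(n_years, block=30):
    """Yield (start, end, kind) for every contiguous holdout block."""
    for start in range(n_years - block + 1):
        end = start + block
        at_edge = start == 0 or end == n_years
        yield start, end, "extrapolated" if at_edge else "interpolated"


def rmse(pred, obs):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def naive_mean_rmse(temps, start, end):
    """Holdout RMSE of the naive model that predicts the in-sample mean."""
    train = temps[:start] + temps[end:]
    mean = sum(train) / len(train)
    return rmse([mean] * (end - start), temps[start:end])


# All 30-year blocks in a hypothetical 149-year calibration period.
blocks = list(holdout_blocks(149, block=30))

# A toy series with a step at the end: the naive mean misses the held-out
# final block by exactly the size of the step.
toy = [0.0, 0.0, 0.0, 0.0, 2.0, 2.0]
edge_error = naive_mean_rmse(toy, 4, 6)
```

A reconstruction method would be scored on the same blocks with the same `rmse`, and the comparison against `naive_mean_rmse` (or against noise-based predictors) is what the discussion papers are arguing about.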
In our submission, we tested all of the MW methods in “pseudoproxy” experiments based on long climate simulations (a standard benchmark used by practitioners in the field). Again, Lasso was outperformed by almost every other method, especially the EIV method used in M08, but even in comparison with the other methods MW introduced. The only support for ‘Lasso’ comes from McIntyre and McKitrick, who curiously claim that the main criterion in choosing a method should be how long it has been used in other contexts, regardless of how poorly it performs in practice for a specific new application. A very odd criterion indeed, which if followed would lead to the complete cessation of any innovation in statistical approaches.
The MW rebuttal focuses a lot on SMR and we will take the time to look into the specifics more closely, but some of their criticism is simply bogus. They claim our supplemental code was not usable, but in fact we provided a turnkey R script for every single figure in our submission – something not true of their code, so that is a little cheeky of them [as is declaring one of us to be a mere blogger, rather than a climate scientist ;-) ]. They make a great deal of the fact that we only plotted the ~50 year smoothed data rather than the annual means. But this seems to be more a function of their misconstruing what these reconstructions are for (or are capable of) rather than a real issue. Not least, the smoothing allows the curves and methods to be more easily distinguished – it is not a ‘correction’ to plot noisy annual data in order to obscure the differences in results!
Additionally, MW make an egregiously wrong claim about centering in our calculations. All the PC calculations use prcomp(proxy, center=TRUE, scale=TRUE) to specifically deal with that, while the plots use a constant baseline of 1900-1980 for consistency. They confuse plotting convention with a calculation.
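The distinction between a plotting baseline and centering in the calculation can be checked directly. The Python sketch below is an illustration, not our R script: `standardize` mimics the per-column centering and unit-variance scaling that center=TRUE, scale=TRUE request, and `rebaseline` subtracts a reference-period mean as one would for a plot. Both function names and the toy series are hypothetical.

```python
def standardize(column):
    """Center to zero mean and scale to unit sample standard deviation
    (n - 1 denominator), as a stand-in for center=TRUE, scale=TRUE."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((x - mean) ** 2 for x in column) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in column]


def rebaseline(series, years, start=1900, end=1980):
    """Subtract the mean over a reference period (a plotting convention)."""
    ref = [v for y, v in zip(years, series) if start <= y <= end]
    base = sum(ref) / len(ref)
    return [v - base for v in series]


# Toy series: a linear trend over 1850-1999.
years = list(range(1850, 2000))
series = [0.01 * (y - 1850) for y in years]

shifted = rebaseline(series, years)
max_diff = max(
    abs(a - b) for a, b in zip(standardize(series), standardize(shifted))
)
```

Because re-baselining only adds a constant offset, `max_diff` is zero up to floating-point rounding: the standardized values that would feed a PC calculation do not depend on the baseline chosen for the plots.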
There is a great deal to digest in these discussions, and so we would like to open the discussion here to all of the authors to give their thoughts on how it all stacks up, what can be taken forward, and how such interactions might be better managed in future. For instance, we are somewhat hesitant to support non-peer-reviewed contributions (even our own) in the literature, but perhaps others can make a case for it.
In summary, there is much sense in these contributions, and Berliner’s last paragraph sums this up nicely:
The problem of anthropogenic climate change cannot be settled by a purely statistical argument. We can have no controlled experiment with a series of exchangeable Earths randomly assigned to various forcing levels to enable traditional statistical studies of causation. (The use of large-scale climate system models can be viewed as a surrogate, though we need to better assess this.) Rather, the issue involves the combination of statistical analyses and, rather than versus, climate science.