Responses to McShane and Wyner

13 Dec 2010 by group

Gavin Schmidt and Michael Mann

Readers may recall a flurry of excitement in the blogosphere concerning the McShane and Wyner paper in August. Well, the discussions on the McShane and Wyner paper in AOAS have now been put online. There are a stunning 13 different discussion pieces, an editorial and a rebuttal. The invited discussions and rebuttal were basically published ‘as is’, with simple editorial review, rather than proper external peer review. This is a relatively unusual way of doing things in our experience, but it does seem to have been effective at getting rapid responses with a wide variety of perspectives. Without peer review, though, a large number of unjustified, unsupportable and irrelevant statements have also got through.

A few of these discussions were already online, namely those from Martin Tingley, from Schmidt, Mann and Rutherford (SMR), and from Smerdon. Others, including contributions from Nychka & Li, Wahl & Ammann, McIntyre & McKitrick, Smith, Berliner and Rougier, are newly available on the AOAS site, and we have not yet read these as carefully.

Inevitably, focus in the discussions is on problems with MW, but it is worth stating upfront here (as is also stated in a number of the papers) that MW made positive contributions to the discussion as well – they introduced a number of new methods (and provided code that allows everyone to try them out), and their use of the Markov Chain Monte Carlo (MCMC) Bayesian approach to assess uncertainties in the reconstructions is certainly interesting. This does not excuse their rather poor framing of the issues, and the multiple errors they made in describing previous work, but it does make the discussions somewhat more interesting than a simple error-correcting exercise might have been. MW are also to be commended on actually following through on publishing a reconstruction and its uncertainties, rather than simply pointing to potential issues and never working through the implications.

The discussions raise some serious general issues with MW’s work – with respect to how they use the data, the methodologies they introduce (specifically the ‘Lasso’ method), the conclusions they draw, whether there are objective methods to decide whether one method of reconstruction is better than another and whether the Bayesian approach outlined in the last part of the paper is really what it is claimed to be. But there are also a couple of issues very specific to the MW analysis; for instance, the claim that MW used the same data as Mann et al, 2008 (henceforth M08).

On that specific issue, presumably just an oversight, MW apparently used the “Start Year” column in the M08 spreadsheet instead of the “Start Year (for recon)” column. The difference between the two is related to the fact that many tree ring reconstructions only have a small number of trees in their earliest periods and that greatly inflates their uncertainty (and therefore reduces their utility). To reduce the impact of this problem, M08 only used tree ring records when they had at least 8 individual trees, which left 59 series in the AD 1000 frozen network. The fact that there were only 59 series in the AD 1000 network of M08 was stated clearly in the paper, and the criterion regarding the minimal number of trees (8) was described in the Supplementary Information. The difference in results between the correct M08 network and the spurious 95-record network MW actually used is unfortunately quite significant. Using the correct data substantially reduces the estimates of peak medieval warmth shown by MW (as well as reducing the apparent spread among the reconstructions). This is even more true when the frequently challenged “Tiljander” series are removed, leaving a network of 55 series. In their rebuttal, MW claim that M08 quality control is simply an ‘ad hoc’ filtering and deny that they made a mistake at all. This is not really credible, and it would have done them much credit to simply accept this criticism.

With just this correction, applying MW’s own procedures yields strong conclusions regarding how anomalous recent warmth is in the longer-term context. MW found recent warmth to be unusual in a long-term context: they estimated an 80% likelihood that the decade 1997-2006 was warmer than any other for at least the past 1000 years. Using the more appropriate 55-proxy dataset with the same estimation procedure (which involved retaining K=10 PCs of the proxy data) yields a higher probability of 84% that recent decadal warmth is unprecedented for the past millennium.
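To make concrete what a statement like "an 80% likelihood that the decade was warmer than any other" means, here is a minimal Python sketch. It assumes the reconstruction is available as a set of posterior draws (one simulated temperature history per row) and that "any other decade" means any earlier 10-year window that does not overlap the final one. The function name and that windowing choice are illustrative assumptions, not MW's actual code.

```python
import numpy as np

def prob_recent_decade_warmest(draws, decade=10):
    """Fraction of posterior draws in which the mean of the final `decade`
    years exceeds the mean of every earlier, non-overlapping window.

    draws: array of shape (n_draws, n_years), one simulated history per row.
    (Hypothetical helper; the windowing convention is an assumption.)
    """
    draws = np.asarray(draws, dtype=float)
    n_draws, n_years = draws.shape
    if n_years < 2 * decade:
        raise ValueError("need at least two full windows of data")
    # Rolling means along the time axis via cumulative sums.
    csum = np.cumsum(np.insert(draws, 0, 0.0, axis=1), axis=1)
    rolling = (csum[:, decade:] - csum[:, :-decade]) / decade
    recent = rolling[:, -1]            # mean of the final window
    earlier = rolling[:, :-decade]     # windows ending before the final one starts
    return float(np.mean(recent > earlier.max(axis=1)))

# Toy illustration with synthetic draws (not real reconstruction output).
rng = np.random.default_rng(0)
toy = rng.normal(size=(500, 100))
toy[:, -10:] += 1.0                    # artificially warm final decade
print(prob_recent_decade_warmest(toy))
```

The point of the sketch is only that the quoted percentages are counts over draws: the more draws in which some earlier window beats the recent decade, the lower the probability.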

However, K=10 principal components is almost certainly too large, and the resulting reconstruction likely suffers from statistical over-fitting. Objective selection criteria applied to the M08 AD 1000 proxy network, as well as independent “pseudoproxy” analyses (discussed below), favor retaining only K=4 PCs. (Note that MW correctly point out that SMR made an error in calculating this, but correct application of the Wilks (2006) method fortunately does not change the result: 4 PCs should be retained in each case.) Nonetheless, this choice yields a very close match with the relevant M08 reconstruction. It also yields considerably higher probabilities, of up to 99%, that recent decadal warmth is unprecedented for at least the past millennium. These posterior probabilities imply substantially higher confidence than the “likely” assessment by M08 and IPCC (2007) (a 67% level of confidence). Indeed, a probability of 99% not only exceeds the IPCC “very likely” threshold (90%), but reaches the “virtually certain” (99%) threshold. In this sense, the MW analysis, using the proper proxy data and proper methodological choices, yields inferences regarding the unusual nature of recent warmth that are even more confident than expressed in past work.

An important real issue is whether proxy data provide more information than naive models (such as the mean of the calibrating data for instance) or outperform random noise of various types. This is something that has been addressed in many previous studies which have come to very different conclusions than MW, and so the reasons why MW came to their conclusion are worth investigating. Two factors appear to be important – their use of the “Lasso” method exclusively to assess this, and the use of short holdout periods (30 years) for both extrapolated and interpolated validation periods.
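As a rough illustration of the holdout comparison described above, the following Python sketch withholds a 30-year block, fits a model on the remaining calibration years, and compares its holdout error with that of the naive "mean of the calibration data" predictor. Ordinary least squares is used here purely as a stand-in for brevity; it is not the Lasso procedure MW used, and the function and variable names are hypothetical.

```python
import numpy as np

def holdout_rmse(proxies, temps, start, length=30):
    """Return (rmse_model, rmse_naive) for one withheld block of years.

    proxies: (n_years, n_series) array; temps: (n_years,) array.
    The model is plain least squares with an intercept (an assumption for
    illustration); the naive predictor is the calibration-period mean.
    """
    X = np.asarray(proxies, dtype=float)
    y = np.asarray(temps, dtype=float)
    test = np.zeros(len(y), dtype=bool)
    test[start:start + length] = True
    train = ~test
    # Fit on calibration years only.
    A_train = np.column_stack([np.ones(train.sum()), X[train]])
    coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
    # Predict the withheld years and score both predictors.
    A_test = np.column_stack([np.ones(test.sum()), X[test]])
    pred = A_test @ coef
    rmse_model = float(np.sqrt(np.mean((y[test] - pred) ** 2)))
    rmse_naive = float(np.sqrt(np.mean((y[test] - y[train].mean()) ** 2)))
    return rmse_model, rmse_naive

# Synthetic example: one noisy proxy tracking a trending temperature series.
rng = np.random.default_rng(1)
temps = np.linspace(-0.3, 0.5, 150) + rng.normal(scale=0.1, size=150)
proxy = (temps + rng.normal(scale=0.2, size=150)).reshape(-1, 1)
print("interior block:", holdout_rmse(proxy, temps, start=60))
print("final block:   ", holdout_rmse(proxy, temps, start=120))
```

Setting `start` to an interior block gives an interpolated validation period, while a block at either end of the record gives an extrapolated one; with only 30 withheld years, both scores can be quite noisy.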

So how do you assess how good a method is? This is addressed in almost half of the discussion papers – Tingley in particular gives strong evidence that Lasso is not in fact a very suitable method, and is outperformed by his Composite Regression method in test cases; Kaplan points out that using noise with significant long term trends will also perform well in interpolation. Both Smith and the paper by Craigmile and Rajaratnam also address this point.

In our submission, we tested all of the MW methods in “pseudoproxy” experiments based on long climate simulations (a standard benchmark used by practitioners in the field). Again, Lasso was outperformed by almost every other method, especially the EIV method used in M08, but even in comparison with the other methods MW introduced. The only support for ‘Lasso’ comes from McIntyre and McKitrick who curiously claim that the main criteria in choosing a method should be how long it has been used in other contexts, regardless of how poorly it performs in practice for a specific new application. A very odd criteria indeed, which if followed would lead to the complete cessation of any innovation in statistical approaches.

The MW rebuttal focuses a lot on SMR and we will take the time to look into the specifics more closely, but some of their criticism is simply bogus. They claim our supplemental code was not usable, but in fact we provided a turnkey R script for every single figure in our submission – something not true of their code, so that is a little cheeky of them [as is declaring one of us to be a mere blogger, rather than a climate scientist ;-) ]. They make a great deal of the fact that we only plotted the ~50 year smoothed data rather than the annual means. But this seems to be more a function of their misconstruing what these reconstructions are for (or are capable of) rather than a real issue. Not least, the smoothing allows the curves and methods to be more easily distinguished – it is not a ‘correction’ to plot noisy annual data in order to obscure the differences in results!

Additionally, MW make an egregiously wrong claim about centering in our calculations. All the PC calculations use prcomp(proxy, center=TRUE, scale=TRUE) to specifically deal with that, while the plots use a constant baseline of 1900-1980 for consistency. They confuse plotting convention with a calculation.
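The distinction between centering inside the PC calculation and choosing a baseline for plotting can be sketched as follows. This is a Python/NumPy analogue written for illustration, not the R code from the SMR supplement; `pca_scores` and `rebaseline` are hypothetical helper names. It mirrors what `center=TRUE, scale=TRUE` does (subtract each column's mean, divide by its sample standard deviation) and shows that a plotting baseline is just a constant shift applied afterwards.

```python
import numpy as np

def pca_scores(proxy, k):
    """Leading k principal-component scores of a (years x series) matrix,
    with every column centered and scaled to unit variance first."""
    X = np.asarray(proxy, dtype=float)
    X = X - X.mean(axis=0)              # analogue of center=TRUE
    X = X / X.std(axis=0, ddof=1)       # analogue of scale=TRUE (n-1 denominator)
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

def rebaseline(series, years, start=1900, end=1980):
    """Shift a series so its mean over [start, end] is zero.
    This only moves the curve up or down on a plot."""
    series = np.asarray(series, dtype=float)
    years = np.asarray(years)
    mask = (years >= start) & (years <= end)
    return series - series[mask].mean()

# Synthetic demonstration: adding a constant offset to the proxies leaves the
# PC scores unchanged (up to sign), because centering removes it.
rng = np.random.default_rng(0)
proxy = rng.normal(size=(200, 6))
same = np.allclose(np.abs(pca_scores(proxy, 2)), np.abs(pca_scores(proxy + 5.0, 2)))
print("scores unaffected by offset:", same)
```

In other words, whatever reference period is used for the figures has no bearing on whether the data entering the PC step were centered.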

There is a great deal to digest in these discussions, and so we would like to open the discussion here to all of the authors to give their thoughts on how it all stacks up, what can be taken forward, and how such interactions might be better managed in future. For instance, we are somewhat hesitant to support non-peer-reviewed contributions (even our own) in the literature, but perhaps others can make a case for it.

In summary, there is much sense in these contributions, and Berliner’s last paragraph sums this up nicely:

The problem of anthropogenic climate change cannot be settled by a purely statistical argument. We can have no controlled experiment with a series of exchangeable Earths randomly assigned to various forcing levels to enable traditional statistical studies of causation. (The use of large-scale climate system models can be viewed as a surrogate, though we need to better assess this.) Rather, the issue involves the combination of statistical analyses and, rather than versus, climate science.

Hear, hear.

PS: The full code, data, and additional supplements from SM&R are available here.

89 Responses to "Responses to McShane and Wyner"

Stephen says

17 Dec 2010 at 6:13 AM

Hi Gavin,

thanks for the considered reply to my #46 above. I believe I follow that; you’re saying that the information that one would want to test against is lost in averaging in the ensemble mean. My point remains, I think. Which models should one examine? You say look at all of them. Which runs? – you’d presumably want to do a bunch of them and look at the results you get. But is there not so much variability, so much noise, across different models and different runs of different models that you are bound to find – by chance – examples of the claimed real data pattern? How many possible “false positives” would you need to see to satisfy yourself that you were getting useful information, and not coincidental matches from the individual runs? A single example? A certain percentage?

[Response: This is hardly a unique problem to this case. Classical statistics of the sort practiced by McKitrick by convention uses a 95% cutoff, so if you found that this pattern occurred less than one time in 20 runs you might start to think that there was something to be explained. However, it still would not prove that there was contamination – perhaps the real world was just one of those times. McKitrick’s hypothesis would have a lot more traction if it actually predicted something observable, rather than being a post hoc validation. – gavin]

Furthermore, the “results” are just the (highly variable) output of a computer model. What’s to stop a researcher from simply pushing the model “go” button until he or she gets the desired result (in whatever direction), and their opponent doing the same in the opposite direction (or at least each side being suspicious that the other had done that)?

[Response: The set of model runs useful for this kind of exercise is quite limited – essentially to the simulations archived for IPCC AR4/CMIP3 (and those now being produced for CMIP5). They take months of super-computer time and I assure you that no group is doing them for any reason remotely connected to McKitrick. That’s an amusing thought actually… – gavin]

In short, how can it be useful to declare “ah look, I found the same pattern in a model run that you claim to have found in the real data, therefore the null hypothesis is not disproved” (or the reverse, mutatis mutandis) when there is so much noise and variability in individual model runs? Different people doing different runs of different models will (may) get different results.

[Response: It’s precisely because there is variability in the runs that this is useful. At minimum, the fact that I found similar ‘significant’ patterns in the 5 runs I looked at (and I only looked at those 5) indicated to me that the significance tests McKitrick proposed are way too generous to his hypothesis – which in my opinion is hopelessly flawed in conception in any case. It is open to him to make his case more forcefully, but unfortunately each of his attempts has only added noise to an already very weak signal. I told him years ago what would be needed to make a reasonable case, but he has yet to do it. – gavin]
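The "one time in 20 runs" reasoning in the responses above can be written down as a simple count. The Python sketch below assumes each model run has been reduced to a single number measuring how strongly the claimed pattern appears (for example, a regression coefficient or test statistic); the function names and that choice of summary are illustrative assumptions, not anyone's published procedure.

```python
def fraction_of_runs_matching(run_stats, observed_stat):
    """Share of individual model runs whose statistic is at least as large
    as the one found in the real-world data."""
    run_stats = list(run_stats)
    if not run_stats:
        raise ValueError("need at least one model run")
    hits = sum(1 for s in run_stats if s >= observed_stat)
    return hits / len(run_stats)

def looks_unusual(run_stats, observed_stat, cutoff=0.05):
    """True if the pattern turns up in fewer than `cutoff` of the runs,
    i.e. less than one run in 20 for the conventional 95% level."""
    return fraction_of_runs_matching(run_stats, observed_stat) < cutoff

# Hypothetical statistics from 20 runs, plus a hypothetical observed value.
runs = [0.2, 0.5, 1.1, 0.9, 2.3, 0.4, 1.7, 0.8, 0.3, 1.2,
        0.6, 2.1, 0.7, 1.0, 0.1, 1.5, 0.9, 1.3, 0.5, 1.8]
print(fraction_of_runs_matching(runs, observed_stat=2.0))
print(looks_unusual(runs, observed_stat=2.0))
```

With only a handful of runs the estimated fraction is very coarse, and passing the cutoff would still not establish a cause, as the response notes. It does show, though, why finding 'significant' patterns in most of a small set of runs already suggests that a test is too permissive.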
Martin Vermeer says

17 Dec 2010 at 6:23 AM

Philip #47, I would say that this supposed issue of removing “poor” proxies is really a diversion. I have been playing with M&W’s code (well written and easy to use BTW) and, while I did notice that the number of proxies they used was larger than that of Mann et al. for the corresponding period, I never got around to figuring out why.

What I did notice was that M&W performed a set of traditional regressions (as reported in their Figure 14) truncating not only the proxy data set by retaining only a limited number of principal components, but similarly also truncating the instrumental data used for calibration. In this way they produced 4×4 = 16 of the total of 27 reconstruction curves plotted in Figure 14, for numbers of PCs of 1, 5, 10 and 20, for both the proxy and the instrumental data.

The remarkable thing that I found was that of those 16 recons, no less than 8 — for proxy PC numbers of both 5 and 10, and for any number of instrumental PCs — lie very close to each other and the Mann et al. 2008 corresponding reconstruction curve!

It thus appears to me (as an amateur) that PCA truncation of the calibration data acts as a regularization technique rendering harmless the effect of including these “poor” proxies, as well as that of retaining too large a number (10) of proxy principal components.

So, M&W tried very hard not to reproduce the Mann et al. 2008 solution, but failed in no less than eight instances… see here, the curves in blue and cyan. See also my annotated Figure 13, showing that, even for the somewhat questionable verification metric used by M&W, these recons are actually performing pretty well…
Kevin McKinney says

17 Dec 2010 at 7:44 AM

#50–Interesting speculation, Dan, but what consequence is to be drawn? Do we ignore the knowledge we have (imperfect and incomplete though it may be) on the off chance that no matter what we do, things will get worse?

After all, there’s a chance that Earth could get hit with an intragalactic gamma ray burst or, more prosaically, a really massive asteroid, too. And then all this time spent commenting on RC would be wasted. . .
Yvan Dutil says

17 Dec 2010 at 9:05 AM

#48 “There have been some reports that the area weighting might not have been done correctly. This would have overweighted the Arctic region and, in consequence, the slope of the baseline. Could anyone who has actually read the code comment on it?

[Response: No idea what you are referring to. Can you be more precise? – gavin]”

The contentious point was that M&W combined the temperature proxies without weighting by area. The claim comes from the fact that they did not mention doing so in the paper, and also from the fact that their curve was almost a carbon copy of the Kaufman et al curve for the northern region. This gave the impression that high latitudes were overweighted.
Bob (Sphaerica) says

17 Dec 2010 at 9:21 AM

50 (Dan H),

History tells us of mass chaos caused by a massive downfall in agricultural yields occurring numerous times.

Yes. In fact, such climate change is suspected in the downfall of any number of civilizations (Maya, Anasazi, etc.). Of course, in almost all of those cases, the droughts/negative effects have been “local” (e.g. North America only), not global.

However, recent crop failures pale in comparison to those that have occurred throughout history.

This is in fact true in respect to the “relatively stable climate regime” of the past two thousand years. In that time, the regular, long term swings of the climate have been enough to destroy individual civilizations, but not all human civilization at once.

Today, however, we face a problem which is radically different in three respects.

First, civilization is global, not local. We’re not talking about Rome or North America, we’re talking about everyone, everywhere.

Second, the human population is currently dependent upon a greatly enhanced agricultural output. We’re not talking about a city state of 50,000 or 200,000 people where most of the population is directly involved in food production, we’re talking about an interconnected civilization of billions where (at least in the west) almost no one does anything more than shop for food. People don’t provide for themselves… the system provides.

Lastly, we’re not talking about a sustained 1C swing in global temperatures (and associated precipitation changes, which are the real killer) as in the past 2,000 years (with local swings of — I’m guessing — 3C to 5C). We’re talking about a 3C to 6C global swing, with local swings up to 8C or 12C or even 20C (very, very worst case, so don’t jump on that). And this isn’t something that will necessarily pass in “just” a few hundred years.

[As a side note, part of the problem in your presentation is the use of “recent crop failures” as a reference point. No one is all that worried about what has happened to date. Today isn’t the problem. 30 or 50 years from today is the problem.

This is very much like the rather optimistic man who fell off of the top of the Empire State Building. Each time he passed an open window, he was heard to say “so far, so good.”]

Regarding the relatively stable climate regime of the last couple hundred years, has anyone asked whether that was the anomaly?

Short answer — Yes, of course. That’s why deniers like to argue about the Medieval Warm Phallacy, because it implies that this has happened before, so it’s not a problem. But a rather large pile of paleoclimate evidence leads to a markedly different conclusion.
Stephen says

17 Dec 2010 at 9:30 AM

Gavin,

thanks for taking the time to respond again at #51; your thoughts are much appreciated.
Malcolm Hughes says

17 Dec 2010 at 12:18 PM

Phil #47: As the author responsible for suggesting the selection criteria for tree-ring data in the Mann et al papers (1998, 1999, 2000, 2008, 2009 etc) I can tell you why the particular criteria were used, and why there is a clear basis for them. It is important to note that the requirement for at least 8 series was coupled with the criterion that there be a mean correlation equal to or greater than 0.5 between the individual series at one site that were combined to produce the site chronology. Wigley et al (1984) derived and tested a statistic they called ‘Expressed Population Signal’ (EPS) that has since been referred to in many, many publications (444 by December 17 2010 according to the ISI Web Of Knowledge). They wrote ‘time series are averaged to enhance a common underlying signal or combined to produce area averages. How well, then, does the average of a finite number (N) of time series represent the population average…?’. To calculate EPS you need N and the mean correlation between the N series (rbar). In FORTRAN terms it is given by N*rbar/(1+(N-1)*rbar). If you write a simple MS-Excel formula you can calculate EPS for various values of rbar and N. Setting rbar as 0.5 shows EPS rising steeply up to 0.89 at N ~ 8, and then yielding very little increase in EPS for each additional series. By the way, the Wigley et al. (1984) paper includes not only testing of this and another statistic in real-life use, but also a formal derivation of them. Of course, as with any statistic, EPS is a guide to judgment and the assumptions on which it is based must be borne in mind. Given how much attention has been given to the problem of replication of tree-ring data in the published literature, as witnessed by the frequent citing of the Wigley et al (1984) paper, McShane and Wyner’s rejoinder reveals a distinct lack of familiarity with the most basic material on which they chose to pronounce.
Reference:
Wigley et al. 1984. Journal of Climate and Applied Meteorology, 23, 201-203.
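For anyone who would rather not open a spreadsheet, the EPS expression quoted above, N*rbar/(1+(N-1)*rbar), is easy to tabulate. The short Python sketch below is just that formula; the function name `eps` and the particular values of N printed are arbitrary choices made for illustration.

```python
def eps(n, rbar):
    """Expressed Population Signal for n series with mean inter-series
    correlation rbar, using the formula quoted in the comment above."""
    return n * rbar / (1 + (n - 1) * rbar)

# Tabulate EPS for rbar = 0.5 at a few values of N.
for n in (1, 2, 4, 8, 16, 32):
    print(n, round(eps(n, 0.5), 3))
```

With rbar = 0.5 the value at N = 8 is 8/9, or about 0.89, and doubling the number of series again adds only a few hundredths, which is the levelling-off described in the comment.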
steven mosher says

17 Dec 2010 at 1:52 PM

That’s a good point Gavin. With his code in hand could you do that?

[Response: Yes, and I have for each iteration I have looked at. And McKitrick’s conclusions fail to hold up every time. There is a limit to how many times I’m going to do it again. Anyone else could do so themselves using the archived data for Schmidt (2009) or data from IPCC AR4. Note that this requires looking at individual runs, not ensemble means. – gavin]

##
Thanks gavin. I think that kind of test could move discussions forward and get them off the whole discussion of “peer review.” The latter question just devolves into a side show, while the former is actually discussable even by people who don’t like each other personally. FWIW
CM says

17 Dec 2010 at 2:51 PM

Bob #54, re: “Medieval Warm Phallacy”,

Clearly you’re as fed up as I am with the phallic imagery of the [winks knowingly] “hockey stick” debate.

(Sorry. Friday evening. Will shut up now.)
Bob (Sphaerica) says

17 Dec 2010 at 3:40 PM

57 (CM),

Must be something in the air. Visit my recent comments here and here.

Nothing, however, beats retitling the Musthavebeen Warm Period for quick, random laughs.

Tis the season.

(Sorry. Friday afternoon, too near to Xmas. Will shut up now, too.)
Energy Moron says

19 Dec 2010 at 9:38 AM

1) M&W have convinced me that the previous 30 years are sufficiently outside of the norm that there is something significant going on. Quite frankly I don’t think it makes a difference what happened 1,000 years ago. The M&W paper has significantly strengthened the case that something is happening.

2) The most telling comment actually appears as a footnote: “On the other hand, perhaps our model is unable to detect the high level of and sharp run-up in recent temperatures because anthropogenic factors have, for example, caused a regime change in the relation between temperatures and proxies. While this is certainly a consistent line of reasoning, it is also fraught with peril for, once one admits the possibility of regime changes in the instrumental period, it raises the question of whether such changes exist elsewhere over the past 1,000 years. Furthermore, it implies that up to half of the already short instrumental record is corrupted by anthropogenic factors, thus undermining paleoclimatology as a statistical enterprise.”

Going back to the editorial the comment regarding the null hypothesis remains. The physics of warming have been known since Fourier and Tyndall; the question is how quickly it will occur. Once again I don’t care about what happened 1,000 years ago… the data is screaming that something is happening now.

While the statistical methods cannot say what is causing the deviation, the first place to look is to somehow incorporate what the large supercomputer runs are saying as a proxy within the proxy.

Bringing in another paper, the Dessler… 2% correlation coefficient? You have to be kidding… and the technique is very elementary, not even filtering out the known MJO impacts. I look at the interpretation in that paper and simply note it convinces me that other things are much more important than the effect which they are trying to measure!

No, somebody is going to have to form a proxy of the long term models to incorporate as “physics” for the past 30 years for these models to be worthwhile…

3) There are lies, damn lies, and statistics–Mark Twain
Energy Moron says

19 Dec 2010 at 10:36 AM

Figure 8 of M&W is the most important result. Once again this shows that no matter how one slices and dices the information something happened in the 1960’s. Statistics alone cannot tell you anything about the causes.

It’s getting hotter folks.

On a lighter note it should be noted that this website made an important advancement as to the possible causes back in 2007 with the rediscovery of the sheep-albedo feedback caused by the introduction of polyester back in the 1960’s.

I have to point out that the effect (not the cause) was noted by E. Presley back in 1972 with his observation that “I feel the temperature rising” without realizing the impact of the polyester jumpsuit.
Hank Roberts says

19 Dec 2010 at 5:16 PM

Meta note on blog science:

When folks reading here run into discussions about this topic elsewhere, it will inform them if you add a cross-reference to info above from Malcolm Hughes (17 December 2010 at 12:18 PM) that begins:
“Phil #47: As the author responsible for suggesting the selection criteria for tree-ring data in the Mann et al papers (1998, 1999, 2000, 2008, 2009 etc) I can tell you why the particular criteria were used, and why there is a clear basis for them. It is important to note ….”
Nicolas Nierenberg says

20 Dec 2010 at 12:18 PM

Gavin, since you commented on our paper I have put up my comments at my blog.
Philip Machanick says

21 Dec 2010 at 4:42 AM

Malcolm #57: Thanks for the explanation. Once again the willingness of the real experts to help out those of us willing to learn is a great feature of this site.

I don’t know of any other area of science where those unfamiliar with the field have the arrogant presumption that they know better than those who’ve worked with the data for decades, and attack their work without a thorough analysis of how they got there – and do so publicly. An outsider can of course expose errors missed by those close to the work, but I’m still waiting for that to happen with anything significant to any fundamental findings.
Philip Machanick says

21 Dec 2010 at 4:47 AM

Nitpick: “A very odd criteria indeed” should be “A very odd criterion indeed”.
Philip Machanick says

21 Dec 2010 at 4:58 AM

Nicolas #65: if you have something to add to the discussion, please add it here. Some of us are limited in time to go to every blog where someone has an opinion.
Nicolas Nierenberg says

21 Dec 2010 at 1:03 PM

Philip,

A very silly comment indeed. I am the co-author of the paper, and all you would have to do is follow the link to see my remarks. That is hardly going to every blog where someone has an opinion. If you don’t want my opinion about my paper that’s fine. I really keep hoping we can raise the level of discourse.
Bob (Sphaerica) says

21 Dec 2010 at 5:34 PM

67 (Philip),

Nitpick counter…

“Criteria” has been in fairly common, informal usage as an alternative singular form of criterion for over half a century (see Webster here). As such, it is likely to soon be granted status as a new, official definition. Consider the similar but different case of the word “agenda”.

The word “nitpick,” on the other hand, has itself been around for less than 50 years, and so its particular use in accosting the use of “criteria” in the singular puts you in a bit of an embarrassingly contradictory sticky wicket!
Donna says

22 Dec 2010 at 11:53 AM

Nicolas – it really would be nice if you did put your comments here. The reason is that it is very hard to follow the discussion if you put all your refutations/comments about what you see as errors etc that you feel people like Gavin made in your blog, rather than putting them here, where there has been an invitation to use this spot as a point where the discussion can be focused.
It’s very hard to keep flipping back and forth to try and see what you think Gavin did/didn’t do – particularly because you don’t provide any links to his side of the discussion.
oneuniverse says

23 Dec 2010 at 4:59 PM

Hi Donna, if you have questions for Nicolas Nierenberg, you may want to post them at his site. However, I’ve pasted his response below:

“Socioeconomic Patterns in Climate Data” by Ross McKitrick and me (MN 2010) has just been accepted for publication by the Journal of Economic and Social Measurement. It can be accessed here. This paper is largely in response to Gavin Schmidt’s 2009 paper “Spurious Correlations…” (S09) that I have discussed earlier. S09 was published in the International Journal of Climatology (IJOC), which subsequently rejected an earlier version of MN2010. I was very happy to provide a bit of the work on this paper. In particular I did some analysis, some modeling, and helped a bit with the editing.
There is, as often seems to be the case in climate science, some heated discussion surrounding two distinct areas with our paper. First there is the question of whether we received a fair hearing in peer review from Journal of Climate. Second once again Gavin is saying that our conclusions are incorrect. I should add that he has done this without benefit of reading our actual paper, but it seems fairly clear that reading the paper will not change his mind.
For me there are two distinct fairness and good practice issues. First S09 was clearly a response to Ross’s earlier work. I’m sure this is too much to ask, but Gavin should have sent his paper to Ross for comments before publishing. It would have been the right thing to do scientifically, but I’m not sure how much this is about science. Failing Gavin doing that then IJOC certainly should have asked the authors of the previous papers if they had comments. At the absolute minimum they should have offered space for responses in their publication. They didn’t do any of these things, and it doesn’t appear to me as if the reviewers of either S09 or MN2010 even read the predecessor papers. Second the objections to MN2010 from IJOC didn’t have to do with whether we were right. They had to do with whether they felt the predecessor papers were the right approach at all. But the problem is that they were different, and less specific, arguments than those in S09. The weird thing is that these comments weren’t themselves subject to peer review or response, so from IJOC’s perspective Gavin’s incorrect arguments were allowed to stand, because the reviewers had altogether different objections to Ross’s earlier work. In my opinion they should have asked us to submit a response rather than a paper in order to resolve the situation, but they didn’t.
In response to our paper Gavin is now making new technical arguments about why we are incorrect. The first argument is that he has drawn a graph that shows spatial autocorrelation (SAC) of the residuals. It is at least nice of him to acknowledge that the argument in S09 was incorrect, and that you need to look at the residuals. The problem is that he is still not doing any type of standard test for SAC. These are well known, and we have done those tests in our paper. This part is really amazing. I’m not an expert in this area, but back when I was looking at this I was able to quickly find a text on the subject and find these standard tests. Who would make a statistical argument without using the standard statistical tests in the literature? We have also shown the effect of allowing for SAC where necessary and that the results stand. So in my opinion that is what he needs to respond to. His second argument is that it is possible to see these types of correlations in a single instance of a GCM run. This will take a little more examining.
In S09 Gavin showed several GCM runs. Using those he showed that some economic variables were significant in the same regression. Since, of course, socioeconomic variables can’t be influencing a GCM this shows that these types of correlations are spurious. There are two problems. First, where they were significant the coefficients were very small, and of the opposite sign of those found with the real world climate data. Second, and rather ironically, if you allow for SAC they lose all significance, unlike those from real world climate data. In other words he managed to incorrectly argue that Ross’s earlier results were wrong because of SAC, and then make a flawed argument because he didn’t allow for SAC.
Now he is making a different argument, which is that if you do a whole bunch of GCM runs you will see a result exactly like Ross’s earlier work. The problem is that none of the runs in S09 look like that, and he isn’t producing any others. If he does then I guess we could take a look. Even if it does happen sometimes, and I guess it could as a matter of random outcomes, it would need to happen a lot for our conclusions to be incorrect. That is the whole idea of significance testing.
These results indicate urban heat island (UHI) and other measurement issues may be affecting the published long-term land temperature trends. I believe that this result is plausible given what is known about UHI and the lack of meta data for large portions of the world. The results also indicate that it is in fact areas where we have the least amount of meta data and the poorest records that are the most affected. Also remember that land makes up only one third of the Earth’s surface so even if there were a 50% error in land trends this would only be a 15% difference in the overall trend. Therefore this shouldn’t be an argument over the big picture. But people building models need accurate measurements of the various portions of the temperature trend, so they should be quite interested if corrections need to be made. The results of any one study aren’t definitive of course, but it should be taken seriously and additional work should be encouraged rather than huge amounts of energy and time being spent on spurious arguments trying to get rid of it.
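The land-fraction arithmetic in the paragraph above can be checked with a rough sketch. It assumes the global trend is an area-weighted average of land and ocean trends, that ocean trends are unaffected, and that land and global trends are of similar size; the function name and the 0.29 land fraction (closer to the usual figure than one third) are illustrative assumptions, not values taken from MN2010.

```python
# Hypothetical helper, for illustration only: how a fractional error in the
# land trend propagates to the area-weighted global trend.
def global_trend_error(land_fraction: float, land_trend_error: float) -> float:
    """Fractional error in the global trend, assuming the ocean trend is exact
    and the land and global trends are of comparable magnitude."""
    return land_fraction * land_trend_error


# "One third" of the surface, as stated in the comment above
one_third_case = global_trend_error(1 / 3, 0.50)

# Assumed land fraction of roughly 29%
observed_case = global_trend_error(0.29, 0.50)

print(f"{one_third_case:.3f} {observed_case:.3f}")
```

Both cases land near the quoted 15% (about 17% with one third, about 14.5% with 29%), so the figure in the comment is a reasonable round number under these assumptions.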
Donna says

23 Dec 2010 at 10:28 PM

oneuniverse – thank you for posting the comments. I guess I would have preferred that Nicolas Nierenberg had done so, because that would be a sign to me that he understood the point I was making.
Given some of his complaints about wanting people to email, to send papers to others for a read before publishing, his behavior seems odd. If he and his coauthor did specifically request that Gavin not be a reviewer and he thinks that is fine, then why all the comments that others should have sent their work to those that they were commenting on? That does not appear to be a consistent standard of behavior.
Also his last statement seems, for lack of a better term, “off” – it seems like their paper was treated seriously. Whether or not he agrees with the arguments/comments, people evidently did not just dismiss it out of hand but took the time to do a technical analysis of it and post what they thought were the weaknesses. That hardly counts as spurious arguments trying to get rid of it.
Philip Machanick says

24 Dec 2010 at 3:06 AM

Bob (Sphaerica) #70: Websters have not yet accorded it official status. What next? Accepting that “begs the question” now means “invites the question”, rather than a flawed logical argument? “Agenda” is a different case as you could argue it is a kind of collective noun. “Nitpick” may be new but that doesn’t make it wrong.

Back to the main topic: Nicolas #69: calling me silly is hardly raising “the level of discourse”.

While I agree that there is value in letting an author know if you are planning a rebuttal, this is not in my experience (in fields other than climate science) universal practice. Did you ask your co-author if he routinely does this? If you read this article by Ben Santer, you may get some hint as to why the “level of discourse” may be problematic.
Barton Paul Levenson says

24 Dec 2010 at 6:58 AM

one 72: First there is the question of whether we received a fair hearing in peer review from Journal of Climate.

BPL: Right, because the warmers are out to suppress dissent! Why don’t you try Energy & Environment, instead? I’m sure they’ll give you a fair hearing–i.e., one which lets you publish whatever you want, as long as it’s anti-AGW theory.
Donna says

24 Dec 2010 at 10:15 AM

One useful comment in the excerpt that one pulled over from Nicolas Nierenberg’s blog is where he states “The weird thing is that these comments weren’t themselves subject to peer review or response, so from IJOC’s perspective Gavin’s incorrect arguments were allowed to stand”. I don’t know much about comments being subject to peer review or response but it seems unlikely to me that any journal would do what this seems to imply. Comments are comments and I would guess that most people would know that, just like blog posts, they are hardly the final say on anything. Of course, since my impression of science is that nothing is the “final say” since people are always challenging how we can best describe how things work, that is not surprising. I guess what is surprising is that someone would think that the comments were the final say.
I think in science what constitutes the closest to the final say is when you go back and look to see that a mechanism/model/idea in a paper is getting cited and used in more research so that it has become one of the building blocks to an understanding of some aspect of how things work.
Bob (Sphaerica) says

24 Dec 2010 at 10:56 AM

Has anyone read the blog post by Dr. Rosemary Redfield — Arsenic-associated bacteria (NASA’s claims) — addressing the recent rather well publicized and seemingly ground-breaking paper by Dr. Wolfe-Simon, along with the long stream of comments??

The parallels with McShane and Wyner are striking, as well as with climate change science in general and specifically many other “climate soap operas.”

What is interesting is to see it happen in a completely different branch of science, in a microcosm totally isolated from the wider, carnivorous world of AGW science. It’s almost like a test tube experiment in itself of the collision of traditional science, media promotion, human nature and the Internet. There are even implications of using deceptive tricks in constructing graphs!!!

Key components:

1) The science goes against the mainstream.
2) The science was announced in a live NASA press briefing (an unusual step, and one that apparently puts off other scientists).
3) The idea that science was done “by press release” (i.e. a dramatic press briefing) was criticized.
4) A scientist used a blog post to criticize (with details) the paper.
5) That scientist also got more than a little “snarky” in the blog post.
6) Commenters, often fairly erudite and apparently qualified, piled on, or attacked the blogger, or each other (including a fair number of comments on spelling and grammar, just like here!).
7) Science was discussed.
8) Grammar was discussed.
9) Etiquette (in both science and blogging) was discussed.
10) The quality of the reviewers was discussed.
11) The value and quality of the peer review process was discussed (should the paper ever have made it past peer review?).
12) The value and quality of the blog-post approach to science was discussed.
13) The authors of the original paper asked that comments and questions be sent to them rather than discussed on the blog, which was seen as contradictory to the public media fanfare given to the paper’s announcement.

I’d strongly recommend reading the blog post and comments to everyone. It makes fascinating reading, and the fact that the parallels occur in a closed environment (a sort of test tube experiment, where the scientists are in the tube!), isolated from the political nature of the climate change debate, helps to highlight the fact that science itself may (under the pressures of the Internet Revolution) be on the cusp of a dramatic change in the paradigm that defines business-as-usual, or at least it’s at a crisis point where perhaps science as a culture needs to get together and lay out new ground rules on when, where, what and how things should be discussed — and what is appropriate blog behavior for reputable scientists in the field, and how they treat each other.

Basically, one could afford to be snarky in 1960, because such snarkiness was expressed in phone calls or conversations at conventions, in ways that lived and died in moments, or perhaps by he-said-she-said rumors circulated among other scientists. That is nothing like the open-to-everyone-out-there-forever buzz-mill that the Internet has created.
Bob (Sphaerica) says

24 Dec 2010 at 11:00 AM

74 (Philip),

I was joking (mostly).
oneuniverse says

24 Dec 2010 at 12:51 PM

Dear moderators, I’ve posted a nearly identical comment to my #72 (reworded to get through the ‘duplicate comment’ block), apologies, please delete if possible. As explanation, I received an ‘ERROR: That reCAPTCHA response was incorrect.’ message for every submission, and thought that none had got through.

Donna, my only complaint against M&N’s behaviour during the review process would be if they hadn’t provided Gavin with a copy of MN2010. I’m more interested in the technical arguments, however, which is why I reproduced his comments.

Philip, I thought your comment to Nicolas was unnecessarily rude. It was maybe humorously ironic – you wrote: “Some of us are limited in time to go to every blog where someone has an opinion.”, yet you link your name to your blog, which is called “Opinionations – Philip Machanick’s views on the world”.

BPL, you’ve wrongly attributed Nicolas’s comment as mine in your #75. Your own response then mischaracterises what he said – his comments are restricted to the particular review process, he makes no extrapolations (BPL: “Right, because the warmers are out to suppress dissent!”). I think you’ve created a strawman.
Donna says

24 Dec 2010 at 4:28 PM

oneuniverse – while your only complaint against M&N’s behaviour during the review process would be if they hadn’t provided Gavin with a copy of MN2010, I noted several things that seemed off. I remain unconvinced that they held themselves to the standard to which they seem to think others should have been held.
I am also interested in the technical discussions though I suspect that some may be tired of explaining their points again.
oneuniverse says

24 Dec 2010 at 6:28 PM

Donna, there’s an interesting discussion at the ClimateAudit thread – e.g. commenter ‘pete’ has made some good critical points, and Ross McKitrick has responded, including some new calculations, although more work needs to be done before the area illuminated by the criticism is explored by analysis.

I don’t think that such discussion can be fairly described as one ‘side’ futilely explaining things to the other.
Barton Paul Levenson says

25 Dec 2010 at 5:24 AM

one,

Sorry I misattributed that quote. My other comments I think I’ll let stand. Anyone who complains that a respected journal didn’t give them a fair hearing usually has an axe to grind.
Hank Roberts says

25 Dec 2010 at 10:25 AM

> flipping back and forth to try and see what you think …
> you don’t provide any links to his side of the discussion.

Quoting and citing sources — not required for blog science.
Nevertheless, it would be a good idea to raise the level that much.
oneuniverse says

25 Dec 2010 at 3:10 PM

Barton (#81) : “Anyone who complains that a respected journal didn’t give them a fair hearing usually has an axe to grind.”

Perhaps, but each case should be judged on its own merit. Not all complaints against respected institutions are bogus. Not sure if you’ve read McKitrick and Nierenberg’s letter to the IJOC explaining their grievances with the review, but if their descriptions of and quotations from the reviewers’ comments are accurate, they have a good point – I’d agree with them that two of the three reviewers seem to have raised many spurious or incorrect criticisms as justifications for not publishing their paper. You’ll note that MN weren’t given an opportunity to respond to the reviewers’ comments.

Hank: “Quoting and citing sources — not required for blog science.”

Nicolas’ post was written partly in response to Gavin’s comment #39. Both made reference only to S09 and MN10.
Someone says

1 Jan 2011 at 1:27 AM

Re #72: It appears on his blog (see #65), in his comment 12/26/2010, Nicolas Nierenberg acknowledges not having tested whether “the significance was lost if you adjusted for SAC” also applies if you use individual runs instead of ensembles. (This refers to the false-positive significance in GCMs). He also indicates he doesn’t intend to make that test as “we could keep going indefinitely”, meaning he intends to ignore Gavin’s point since he doesn’t know if it would be the “definite test”.
Hank Roberts says

1 Jan 2011 at 3:15 AM

> on his blog

December 26, 2010 10:20 AM
Donna says

1 Jan 2011 at 4:08 PM

one – “Perhaps, but each case should be judged on its own merit. Not all complaints against respected institutions are bogus. Not sure if you’ve read McKitrick and Nierenberg’s letter to the IJOC explaining their grievances with the review, but if their descriptions of and quotations from the reviewers’ comments are accurate, they have a good point – I’d agree with them that two of the three reviewers seem to have raised many spurious or incorrect criticisms as justifications for not publishing their paper. You’ll note that MN weren’t given an opportunity to respond to the reviewers’ comments”
To decide if there was really a case against the journal, there is a whole lot of information that is missing. You said “if their description of and quotation from the reviewers are accurate”, which, without the actual complete documents, there is no way to prove one way or the other. Also – their claim is that they were treated badly (implying that they were given special treatment different from what others receive). They need to prove that too – if others also get criticism that the authors think is spurious or incorrect, that would tend to show that it’s not a case of “they are out to get us” but of the fact that reviewers may make mistakes (if the criticism really is inaccurate) or don’t see things exactly the same as the authors (what the author thinks is spurious, the reviewer thought was pertinent).
I really doubt that they can show that they got treated worse than others – particularly when they seem to have gone out of their way to have the person who was most likely to have been the best reviewer excluded.
And then you say that MN weren’t given the opportunity to respond to the reviewers’ comments. Again, this is only relevant if the most common practice at that journal was to have the authors respond to the reviewers’ comments and they were treated outside of the norm (which I doubt).
They come across more as people who are new to a process and don’t understand how it works than as having any true case of unequal treatment. And like many people who believe that they are “right”, they are having some trouble understanding why others are saying that they don’t agree and think that they missed some important items. Of course they (MN) don’t like it but it hardly constitutes any sort of proof of a conspiracy against them.
I read the blog etc. and find some of the comments weird. A comment was made that a critique made wasn’t part of the original concerns raised so it would basically be ignored. If I cared about whether what I was proposing was accurate or not, then whether a point was raised today or yesterday would be irrelevant. Was the critique valid, did it raise questions about the results I got would be what matters, not some weird sort of “well, you raised that point too late”.
Maybe they think this is a game of some sort versus an issue with profound implications. It isn’t.
Peter Guttorp says

11 Jan 2011 at 1:09 AM

Two points: It is standard in statistical journals to have the discussions of a paper be judged by the editors only. They are typically not peer reviewed. The original paper, however, is typically reviewed more carefully than most.
(I know this because I am one of two co-editors of Environmetrics).
Second, the lasso method is a standard approach to constrained regression in modern statistical science. The constraint is on the absolute values of the coefficients.
That being said, I think the paper by McShane and Wyner suffers from a lack of understanding of the data. But it is an effort by skilled statisticians to analyze what they think is a standard data set. As such it is a useful addition to the literature, and what I really think is needed is more skilled statisticians getting interested in the field. There has been some growth, lately, and I am involved in several efforts to increase that growth.
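
For readers unfamiliar with the lasso constraint mentioned above, the usual textbook statement is sketched below. The notation is assumed here for illustration and is not taken from McShane and Wyner's paper: y_i is the quantity being predicted, x_ij the predictors, β the coefficients, and t the bound on their total absolute size.

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta_0,\,\beta}
    \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
  \quad \text{subject to} \quad
  \sum_{j=1}^{p} \lvert \beta_j \rvert \le t
```

Equivalently, the constraint can be written as a penalty term \lambda \sum_j \lvert \beta_j \rvert added to the sum of squares. Smaller t (larger \lambda) shrinks more coefficients exactly to zero, which is why the method is also used for variable selection.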

[Response: The problem with basically unsupervised discussion on an issue that has a lot of ‘background’, is that the quality of many statements rarely rises above a blog comment. Whether M&W’s efforts are ‘useful’ depends clearly on whether the few innovative elements they introduced can be extracted from post-hoc justifications like the idea that large interannual variability must be plotted to avoid anyone noticing the variation in skill in reconstructing long term variability across different methods. Their use of lasso in one section to demonstrate how bad reconstructions are, and then its abandonment for another method in another section without any declared justification is another issue that peer-review from within the paleo-climatological community would have likely flagged. As the discussion papers clearly show, there is no shortage of independent people capable of assessing new statistical papers in this field. Outside peer-review is even more warranted when people come in and make sweeping statements based on, as you say, a lack of understanding of the data, and a lack of having even read (let alone understood) most of the extant literature. I’m sure a better managed approach from the outset would have given rise to a much more focused and useful exercise. – gavin]
Dikran Marsupial says

12 Jan 2011 at 4:33 AM

Peter Guttorp@88 I would suggest that any value the McS&W paper has is largely negated by its implicit encouragement for statisticians to analyse environmental datasets without fully engaging with the existing literature and hence gaining an awareness of the likely issues. It would indeed be a good thing for more statisticians to become interested in the field, however it is best done by close collaboration with climatologists (as McS&W demonstrate).

BTW, the McShane and Wyner paper will be discussed at the Cross-Validated journal club later this month. I expect the discussion would greatly benefit from statistically minded RCers’ contributions!

http://meta.stats.stackexchange.com/questions/685/second-cross-validated-journal-club