On replication

Filed under: — gavin @ 8 February 2009

This week has been dominated by questions of replication and of what standards are required to serve the interests of transparency and/or science (not necessarily the same thing). Possibly a recent example of replication would be helpful in showing up some of the real (as opposed to manufactured) issues that arise. The paper I’ll discuss is one of mine, but in keeping with our usual stricture against too much pro-domo writing, I won’t discuss the substance of the paper (though of course readers are welcome to read it themselves). Instead, I’ll focus on the two separate replication efforts I undertook in order to do the analysis. The paper in question is Schmidt (2009, IJoC), and it revisits two papers published in recent years purporting to show that economic activity is contaminating the surface temperature records – specifically de Laat and Maurellis (2006) and McKitrick and Michaels (2007).

Both of these papers were based on analyses of publicly available data – the EDGAR gridded CO2 emissions, UAH MSU-TLT (5.0) and HadCRUT2 in the first paper, UAH MSU-TLT, CRUTEM2v and an eclectic mix of economic indicators in the second. In the first paper (dLM06), no supplementary data were placed online, while the second (MM07) placed the specific data used in the analysis online along with an application-specific script for the calculations. In dLM06 a new method of analysis was presented (though a modification of their earlier work), while MM07 used standard multiple regression techniques. Between them these papers and their replication touch on almost all of the issues raised in recent posts and comments.

Data-as-used vs. pointers to online resources

MM07 posted their data-as-used, and since those data were drawn from dozens of different sources (GDP, Coal use, population etc. as well as temperature), trends calculated and then gridded, recreating this data from scratch would have been difficult to say the least. Thus I relied on their data collation in my own analysis. However, this means that the economic data and their processing were not independently replicated. Depending on what one is looking at this might or might not be an issue (and it wasn’t for me).

On the other hand, dLM06 provided no data-as-used, making do with pointers to the online servers for the three principal data sets they used. Unlike for MM07, the preprocessing of their data for their analysis was straightforward – the data were already gridded, and the only required step was regridding to a specific resolution (from 1ºx1º online to 5ºx5º in the analysis). However, since the data used were not archived, the text in the paper had to be relied upon to explain exactly what data were used. It turns out that the EDGAR emissions are disaggregated into multiple source types, and the language in the paper wasn’t explicit about precisely which source types were included. This became apparent when the total emissions I came up with differed from the number given in the paper. A quick email to the author resolved the issue: they hadn’t included aircraft, shipping or biomass sources in their total. This made sense, and did not affect the calculations materially.
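
The regridding step described above can be sketched roughly as follows. This is an illustration only, not the code used for the paper: it assumes the emissions are stored as per-cell totals on a 180x360 grid (so that summing blocks is the right aggregation), and the source names `industry` and `transport`, the function name `regrid_sum` and the random test data are all hypothetical placeholders.

```python
import numpy as np

def regrid_sum(field_1deg, factor=5):
    """Aggregate a 1x1 degree field to a coarser grid by summing factor x factor blocks."""
    nlat, nlon = field_1deg.shape
    if nlat % factor or nlon % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    # Split each axis into (coarse index, position within block), then sum within blocks.
    blocks = field_1deg.reshape(nlat // factor, factor, nlon // factor, factor)
    return blocks.sum(axis=(1, 3))

# Hypothetical per-source emission grids (random numbers standing in for real data).
rng = np.random.default_rng(0)
source_names = ["industry", "transport", "aircraft", "shipping", "biomass"]
sources = {name: rng.random((180, 360)) for name in source_names}

# Source types left out of the total, as clarified by the authors.
excluded = {"aircraft", "shipping", "biomass"}
total_1deg = sum(grid for name, grid in sources.items() if name not in excluded)

total_5deg = regrid_sum(total_1deg)  # shape (36, 72)
```

Comparing `total_5deg.sum()` with the total quoted in a paper is the kind of quick check that reveals a mismatch in which source types were included. If the data were instead stored as fluxes per unit area, an area-weighted mean would be needed rather than a plain sum.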

Data updates

In all of the data used, there are ongoing updates to the raw data. For the temperature records, there are variations over time in the processing algorithms (satellites as well as surface stations), for emissions and economic data, updates in reporting or estimation, and in all cases the correction of errors is an ongoing process. Since my interest was in how robust the analyses were, I spent some time reprocessing the updated datasets. This involved downloading the EDGAR3 data, the latest UAH MSU numbers, the latest CRUTEM2/HadCRU2v numbers, and alternative versions of the same (such as the RSS MSU data, HadCRUT3v, GISTEMP). In many cases, these updates are in different formats, have different ‘masks’ and require specific and unique processing steps. Given the complexity of (and my unfamiliarity with) economic data, I did not attempt to update them, or even ascertain whether updates had occurred.

In these two papers then, we have two of the main problems often alluded to. It is next-to-impossible to recreate exactly the calculation used in dLM06 since the data sets have changed in the meantime. However, since my scientific interest is in what their analysis says about the real world, any conclusion that was not robust to that level of minor adjustment would not have been interesting. By redoing their calculations with the current data, or with different analyses of analogous data, it is very easy to see that there is no such dependency, and thus reproducing their exact calculation becomes moot. In the MM07 case, it is very difficult for someone coming from the climate side to test the robustness of their analysis to updates in economic data and so that wasn’t done. Thus while we have the potential for an exact replication, we are no wiser about its robustness to possibly important factors. I was, however, able to easily test the robustness of their calculations to changes in the satellite data source (RSS vs. UAH) or to updates in the surface temperature products.


MM07 used an apparently widespread statistics program called STATA and archived a script for all of their calculations. While this might have been useful for someone familiar with this proprietary software, it is next to useless for someone who doesn’t have access to it. STATA scripts are extremely high level, implying they are easy to code and use, but since the underlying code in the routines is not visible or public, they provide no means by which to translate the exact steps taken into a different programming language or environment. However, the calculations mainly consisted of multiple linear regressions, which are a standard technique, and so other packages are relatively easily available. I’m an old-school Fortran programmer (I know, I know), and so I downloaded a Fortran package that appeared to have the same functionality and adapted it to my needs. Someone using Matlab or R could have done something very similar. It was a simple matter to then check that the coefficients from my calculation and those in MM07 were practically the same and that there was a one-to-one match in the nominal significance (which was also calculated differently). This also provides a validation of the STATA routines (which I’m sure everyone was concerned about).
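
For readers without STATA or Fortran, the kind of cross-check described above can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the MM07 script or the Fortran package: the function name `ols`, the predictors and the "true" coefficients are invented for the example, and the standard errors assume independent, equal-variance residuals.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares with an intercept; returns coefficients and standard errors."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]
    sigma2 = resid @ resid / dof           # residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)  # valid only for independent residuals
    return beta, np.sqrt(np.diag(cov))

# Synthetic data with known coefficients (intercept first).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_beta = np.array([0.5, 1.0, -2.0, 0.0])
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=200)

beta, se = ols(X, y)

# Independent cross-check: solve the normal equations directly.
A = np.column_stack([np.ones(len(y)), X])
beta_check = np.linalg.solve(A.T @ A, A.T @ y)
```

Agreement between `beta` and `beta_check` plays the same role as agreement between two statistics packages. It says the arithmetic is right, not that the nominal significance is meaningful, since that depends on assumptions about the residuals.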

The processing in dLM06 was described plainly in their paper. The idea is to define area masks as a function of the emissions data and calculate the average trend – two methods were presented (averaging over the area then calculating the trend, or calculating the trends and averaging them over the area). With complete data these methods are equivalent, but not quite when there are missing data, though the uncertainties in the trend are more straightforward in the first case. It was pretty easy to code this up myself so I did. It turns out that the method used in dLM06 was not the one they said, but again, having coded both, it is easy to test whether that was important (it isn’t).
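
The difference between the two averaging orders can be shown with a small synthetic example. This is a sketch under simplifying assumptions, not the dLM06 procedure: the average is unweighted (no area weighting or emissions mask), the three "grid cells" are exact straight lines, and the function names are hypothetical.

```python
import numpy as np

def trend(t, y):
    """Least-squares slope of y against t, ignoring missing (NaN) values."""
    ok = ~np.isnan(y)
    return np.polyfit(t[ok], y[ok], 1)[0]

def trend_of_mean(t, data):
    """Method 1: average over cells at each time step, then fit a single trend."""
    return trend(t, np.nanmean(data, axis=1))

def mean_of_trends(t, data):
    """Method 2: fit a trend in each cell, then average the trends."""
    return np.mean([trend(t, data[:, j]) for j in range(data.shape[1])])

# Three cells with different offsets and slopes, 20 time steps.
t = np.arange(20, dtype=float)
slopes = np.array([0.01, 0.02, 0.03])
offsets = np.array([0.0, 1.0, -1.0])
complete = offsets + np.outer(t, slopes)  # shape (20, 3)

# Same data, but the middle cell has no values for the first half of the record.
gappy = complete.copy()
gappy[:10, 1] = np.nan
```

With `complete`, both functions return the same slope. With `gappy`, `mean_of_trends` is unchanged, while `trend_of_mean` shifts because the set of cells entering the average changes partway through the record. Having both coded makes it simple to check how much the choice matters for a given data set.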


Given the data from various sources and my own codes for the processing steps, I did a few test cases to show that I was getting basically the same results in the same circumstances as were reported in the original papers. That worked out fine. Had there been any further issues at this point, I would have sent out a couple of emails, but this was not necessary. Jos de Laat had helpfully replied to two previous questions (concerning what was included in the emissions and the method used for the average trend), and I’m sure he or the other authors involved would have been happy to clarify anything else that might have come up.

Are we done? Not in the least.


Much of the conversation concerning replication often appears to be based on the idea that a large fraction of scientific errors, incorrect conclusions or problematic results are the result of errors in coding or analysis. The idealised implication is that if we could just eliminate coding errors, science would be much more error free. While there are undoubtedly individual cases where this has been the case (this protein folding code for instance), the vast majority of papers that turn out to be wrong or non-robust do so because of incorrect basic assumptions, overestimates of the power of a test, some wishful thinking, or a failure to take account of other important processes (It might be a good idea for someone to tally this in a quantitative way – any ideas for how that might be done?).

In the cases here, the issues that I thought worth exploring from a scientific point of view were not whether the arithmetic was correct, but whether the conclusions drawn from the analyses were. To test that I varied the data sources, the time periods used, the importance of spatial auto-correlation on the effective number of degrees of freedom, and most importantly, I looked at how these methodologies stacked up in numerical laboratories (GCM model runs) where I knew the answer already. That was the bulk of the work and where all the science lies – the replication of the previous analyses was merely a means to an end. You can read the paper to see how that all worked out (actually even the abstract might be enough).

Bottom line

Despite minor errors in the printed description of what was done and no online code or data, my replication of the dLM06 analysis and its application to new situations was more thorough than I was able to do with MM07 despite their more complete online materials. Precisely because I recreated the essential tools myself, I was able to explore the sensitivity of the dLM06 results to all of the factors I thought important. While I did replicate the MM07 analysis, the fact that I was dependent on their initial economic data collation means that some potentially important sensitivities did not get explored. In neither case was replication trivial, though neither was it particularly arduous. In both cases there was enough information to scientifically replicate the results despite very different approaches to archiving. I consider that both sets of authors clearly met their responsibilities to the scientific community to have their work be reproducible.

However, the bigger point is that reproducibility of an analysis does not imply correctness of the conclusions. This is something that many scientists clearly appreciate, and probably lies at the bottom of the community’s slow uptake of online archiving standards since they mostly aren’t necessary for demonstrating scientific robustness (as in these cases for instance). In some sense, it is a good solution to an unimportant problem. For non-scientists, this point of view is not necessarily shared, and there is often an explicit link made between any flaw in a code or description, however minor, and the dismissal of a result. However, it is not until the “does it matter?” question has been fully answered that any conclusion is warranted. The unsatisfying part of many online replication attempts is that this question is rarely explored.

To conclude? Ease of replicability does not correlate with the quality of the scientific result.

And oh yes, the supplemental data for my paper are available here.

295 Responses to “On replication”

  1. 201
    captdallas2 says:

    There is an interesting discussion on an attempt to replicate Steig et al Antarctic warming going on now in the blogosphere. Perhaps Gavin would like to comment.

  2. 202
    paulina says:

    Commentary for Michael Smith:

    “If the distribution encompasses the observed correlations, then the null hypothesis (that there is no contamination)cannot be rejected”

    This statement has the logical form:

    If X, then not-Y

    This is equivalent to:

    If Y, then not-X.

    But it implies nothing about the case:

    If not-X, then…

    i.e., it says nothing, “logically” or otherwise, about:

    If the distribution does NOT encompass the observed correlations, then…

  3. 203
    Lawrence Coleman says:

    Re:198 Thanks Pete, you’ve given me well moderated answers to my questions in the past, but in relation to my argument in 185: what degree of duplication is necessary for a hypothesis to become scientific fact, how many independent studies must be undertaken, and how many valuable years will this take (I almost used the term ‘waste’)? If we wait for governmental agencies to get 100% certainty as to what degree climate change is natural variability and what degree is anthropogenic, we’re cooked already. The luxury scientists had in the past of being fastidious is NOT relevant anymore for this subject…we are in unprecedented times. If a raging forest fire is hurtling towards your house at 50miles/hr (very relevant for me in Australia, although not living in Victoria, thank god!), even if you were a Nobel-winning scientist you are not likely to go and measure relative humidity, which vectors and variability the wind is coming from, the relative darkening of the sky, or a spectrograph of the brightness of the glow on the horizon; you are going to get your most valuable things and family, jump in your car and get the hell out of there, taking decisive action that will ultimately save your and your family’s lives. I can see no difference at all with climate change and the response we all have to take. Sorry for being blunt..but you would have to be the biggest numbskull not to believe that CC is caused primarily by human factors. 400000 years ago the water level was 21ft higher (recent studies in Bermuda) than today with similar and corresponding rise in CO2, today it is over 25-30% higher than 400kyrs ago and guess what the sea level rise is again accelerating year by year. We simply cannot wait for the type of replication scientists were used to in the past. The scientific community and media has to make it blindingly obvious to any policy maker that the time to take decisive action is TODAY.

  4. 204

    Hank #107, tamino #108: this is somewhat relevant.

  5. 205

    RE 201,

    Tell him to publish it in a scientific journal so we can see what it is worth… I don’t have time to check out all the “tries” that are done on blogs.

  6. 206

    Magnus, yeah.

    But off-the-cuff, one obvious problem with the method proposed is that the “simple data regrouping” also transmogrifies the empirical correlation matrix of the data…

    Remember the RegEM is based on using the empirical correlation directly between the station data. That already includes implicitly the long range spatial correlations present in the data, in a theoretically optimal way.

    Now you move around some points, by up to 550 nm (or more?) and merge a number of them into new points, averaging their data. The new “virtual stations” will have a vastly smeared out correlation matrix, with probably a much longer correlation range. IOW where the data is sparse, in the East, it will also get a higher areal weight. I think. Or something.

    Why oh why re-do something the method is already supposed to be doing right?

    [Response: Do you really need an answer to that? – gavin]

  7. 207
    Bill Hamilton says:

    I’m new to RealClimate, so perhaps I’m asking for something that is easily available but just don’t know where to find it. I would like to be able to understand how the various computer programs that simulate future climate are put together. If possible I’d like to be able to look at some of the code. Can you point me to a source? Thanks.

    [Response: NCAR CCSM or GISS ModelE are freely available for download (the former is prettier than the latter). – gavin]

  8. 208
    dhogaza says:

    There is an interesting discussion on an attempt to replicate Steig et al Antarctic warming going on now in the blogosphere

    Watts et al are science illiterates, there’s really no point for adults to waste their time with their pseudo-scientific endeavors.

  9. 209
    Ike Solem says:

    I find this discussion about code and data to be very entertaining – and I’d suggest that the same level of scrutiny be applied to the various clean coal proposals that are circulating through Congress, especially the issue of FutureGen, the public-private partnership between coal and mining and electric utility interests and the Department of Energy, set up in 2003 and managed by the contractor Battelle (who also operates the National Renewable Energy Lab for the DOE).

    There are no publications on the supposed technology, and no data on performance has been made available. Even though the project was set up with DOE funding, all data on the technology has been kept secret because it is “proprietary” – and the recent stimulus package included $2 billion for FutureGen. There is certainly a large issue there, right?

    Most likely, the technology just doesn’t work and the project is simply being promoted as a greenwashing effort by the coal industry and a cash cow for whoever gets the contract. In particular, the effort to sequester the emissions is doomed to failure, which we’ve explained on simple mass balance and thermodynamic grounds: there’s no place to store the CO2, and capturing all of it would probably suck up 90% of the power plant’s electric output – meaning 1/10th the power per ton of coal compared to a modern dirty power plant.

    What is obvious is that the world will have to voluntarily stop burning coal, and that means shutting down the coal-fired power plants that do exist and replacing them with solar and wind energy. This will be a gigantic task; coal-fired electricity generation is at something like 2,000,000 GW of power, while wind is around 20,000 GW and solar is near 500 GW – and that’s just in the US. There is no way a ‘market-based approach’ is going to change that situation.

    For a good discussion of the political manipulation that results in $2 billion for a coal plant and nothing at all for solar and wind demonstration systems, despite popular demand, see this:

    At the same time Peabody Energy celebrated an eightfold increase in profits last quarter and announced its intention to reopen the controversial and widely denounced strip mine on tribal lands on Black Mesa in Arizona, Sen. Durbin has been arm-twisting Department of Energy Secretary Stephen Chu and President Barack Obama into subsidizing Peabody’s–and a host of the world’s largest extraction companies–FutureGen boondoggle.

    Another coal plant that should be halted is the Desert Rock coal plant – indeed, a public moratorium on coal plant construction should be implemented, and a plan for closing down and replacing every coal-fired plant in the U.S. should be drawn up immediately.

  10. 210
    Hank Roberts says:

    Yup. Wossname up there doesn’t like the recipe/cookbook metaphor and thinks that’s a herring of the reddish variety. Block that metaphor.

    Michael Tobis and I both saw the original exchange as a cookbook/recipe issue, though from different perspectives.

    The exchange escalated fast; it went something like

    —I want the code (reply, see the Matlab code)
    —I need _your_ code (reply, I’m packing, you can take my class)
    —I mean the _Antarctic_ code (reply, leaving for Antarctica now)
    —I run a company, give me it all (beep…please leave a message)

    Tobis didn’t recognize jeffid. He answered discussing what’s ideal.

    Cookbook/recipe demand? Yeah, I think that’s still about right.
    A good cookbook includes general instructions; a good recipe includes specifics about amounts. Between chefs, a respectful answer may well be “I got the ingredients _there_, of course you know what to do.”

    Ideally, each question is answered by providing a cookbook as well as the recipe, and perhaps even a guide to how to find a kitchen. But between chefs, that would be a snarky answer. Ya can’t win, when you don’t know the individual who’s asking the question and why.

    How much detail and direction? It always depends on the cook:

  11. 211
    pete best says:

    Re #203, I muse around the web and post on several sites about this topic, but James Hansen testified in 1988 and now it’s 2009, and look at the available mitigation of CO2: errr, none, emissions have increased globally. Politicians talk the talk but the walk is small, and in Australia, a baking rock with great solar potential, it’s essentially zero. 22 million people pump out around 330 million tonnes a year I believe; their per capita emissions are very high, although in global terms not that bad I suppose.

    The USA, Europe, Australia (and New Zealand) have got nowhere in emissions mitigation and have hooked India and China and many others besides. It’s all a bit tragic, and total renewable energy for the world is around 1%.

  12. 212
    Hank Roberts says:

    PS, I commend this — good common sense advice for anyone who has a problem figuring something out:

    (Hat tip to Bi, over at Michael Tobis’s thread)

  13. 213
    Steve Reynolds says:

    Ike:”I find this discussion about code and data to be very entertaining – and I’d suggest that the same level of scrutiny be applied to the various clean coal proposals that are circulating through Congress…”

    I agree, but am not surprised that politicians would respond as you discuss. How do you propose to change their response?

  14. 214


    It’s Mr. Nierenberg, but thanks for the upgrade.

    I actually have mailed Dr. McKitrick and he has been kind enough to answer. I don’t have Dr. Schmidt’s email address, and so far he hasn’t answered here.

    In this particular case the code isn’t as important to me, as an explanation of how the data was summarized from the underlying source. There is more than one way to do it, and for me it is a nuisance trying to explore and find out how each author did that.

    Neither paper explains the process. I believe that they thought it obvious, but even a first look for me shows that different choices could be made. So far I guess I haven’t made the same choices.

    As to code itself, it is the most accurate explanation generally, but I agree it isn’t the only way.

    [Response: I’m looking into it and I’ll let you know. – gavin]

  15. 215
    Dan says:

    An aside related to the fire issue mentioned farther up:

    Who is responsible? “Environmentalists”. But the image Ball uses to explain the Australia fires is priceless. An authority on Australian climate and environment can’t even take the time to find a southern hemisphere textbook image?

    “Environmentalists played a role in disastrous Australia fire”,
    By Dr. Tim Ball Monday, February 16, 2009

  16. 216
    Dan says:

    Sorry – I didn’t mean to change the subject – the fire comment started me reading on the factors affecting this event, and I ran across this silly figure.

  17. 217
    spilgard says:

    Re 201,
    Sorry, I couldn’t force myself to click the link. It’s too early in the week to endure the intellectual pain of a visit to WUWT. However, the title in the link suggests that the denizens have grown soft. Reduce the warming by just *half*? Back in the good ol’ days they’d have settled for nothing less than proving a complete reversal of the claimed trend, with falsification of the entire dataset as a bonus.

    Recaptcha: tales, 7:00 News

  18. 218
    caerbannog says:

    Off-topic, but George Will is at it again, spouting much of the same garbage that the realclimate folks took him to task for over 4 years ago!

    See for Will’s latest.

    Now jump into the realclimate “wayback machine” and go back to 2004:

    If I were a climate-scientist having a beer with Will, I’d tip the bartender generously and ask him/her to dip Will’s beer mug in soapy dishwater before filling it.

  19. 219
    captdallas2 says:

    There is an interesting discussion on an attempt to replicate Steig et al Antarctic warming going on now in the blogosphere

    Watts et al are science illiterates, there’s really no point for adults to waste their time with their pseudo-scientific endeavors.

    I don’t disagree. There is a cliche in my neck of the woods. “Even a blind hog finds an acorn once in a while.”

  20. 220
    dean says:

    One of the primary points of the WUWT discussion is that RegEM assumes that the missing data locations are random in nature. Since the missing data in the Steig paper isn’t random (almost all the interior is “missing”), then is it proper to use a method that assumes otherwise?

    [Response: No. The issue isn’t that the data have to be randomly missing in time or space, but that the value of the missing data is unrelated to the fact that it is missing. – gavin]

    On a similar note, and more on topic for this thread, when you use someone else’s code, it’s incumbent on you to understand the assumptions that go along with that code. Too often those assumptions are not clearly spelled out.

  21. 221
    AL says:

    Dr Schmidt,

    which of your papers are little things? And which not?



  22. 222
    gavin says:

    Nicolas, I went over what I did, and your guess was correct – the data I archived were sampled from the 2.5×2.5 MSU data, not averaged over the 5×5 CRU grid box. I also did it the other way – which is probably better – and re-did the analysis. It makes the coherence of the surface data to the MSU data stronger, but otherwise leaves the analysis untouched in any significant way. If you’d like to investigate further, I updated the supplemental data to include the averaged RSS data and added the results to the readme file.

  23. 223
    captdallas2 says:

    220 Gavin’s response.
    [Response: No. The issue isn’t that the data have to be randomly missing in time or space, but that the value of the missing data is unrelated to the fact that it is missing. – gavin]

    Incorrect, on WUWT the major issue is that RegEM doesn’t deal adequately with spatial averaging and/or weighting. Both CA and WUWT mention concerns of how RegEM can handle non-random data infilling, but that has little to do with replication, the topic of this thread I believe. In replicating Steig (the topic is replication, I believe), the Jeffs had difficulty determining how the occupied weather stations and AWS were spatially dealt with by RegEM. That is interesting and deserves more than a wave.

    [Response: WUWT is very confused. The whole point of RegEM is to impute (strange verb) what the spatial correlations are rather than assuming what they should be a priori. RegEM knows nothing about the distance between any points, all it knows about are the similarities in variability between stations. How you combine the reconstruction to form a regional average afterwards is a whole other issue. By the way, my answer was not incorrect – your question however might not have been what you really wanted answering. – gavin]

  24. 224
    captdallas2 says:

    Gavin’s response to 223
    [Response: WUWT is very confused. The whole point of RegEM is to impute (strange verb) what the spatial correlations are rather than assuming what they should be a priori. RegEM knows nothing about the distance between any points, all it knows about are the similarities in variability between stations. How you combine the reconstruction to form a regional average afterwards is a whole other issue. By the way, my answer was not incorrect – your question however might not have been what you really wanted answering. – gavin]

    I have no doubt they (WUWT) are often confused (frustratingly often). But there is little comfort knowing that RegEm is clueless of distances or number of inputs spatially. How RegEm was used spatially in the Steig paper is the question. With 36 to 40 percent of the stations in the west, how were they weighted? Not clear in my opinion, which is something that is required for replication. Hopefully, now my question is more clear. To add: your answer was not incorrect, the question was, for this thread at least.

    [Response: But “how RegEM was used spatially” is a meaningless statement. RegEM takes a data matrix and through an iterative process fills in the missing data based on the correlations of the existing data. It has no clue whether it is dealing with apple sales, tax filings or widgets. If a column in the matrix shows a coherent connection to another one, then it will be used to fill in data where needed. In this case, it is likely that WAIS stations will be more connected to the Peninsula data than that on the opposite coastline, RegEM will therefore fill in preferentially from there. This is simply a reflection of the local weather patterns (and is seen more clearly in the satellite reconstruction). Ask yourself why the trans-antarctic mountains are so prominent in the figures? These mountains are not put in by hand, they are simply the natural demarcation for the various influences. Thus RegEM decides for itself what the radius of influence of any one station is. – gavin]

  25. 225
    captdallas2 says:

    [But “how RegEM was used spatially” is a meaningless statement. RegEM takes a data matrix and through an iterative process fills in the missing data based on the correlations of the existing data. It has no clue whether it is dealing with apple sales, tax filings or widgets. If a column in the matrix shows a coherent connection to another one, then it will be used to fill in data where needed. In this case, it is likely that WAIS stations will be more connected to the Peninsula data than that on the opposite coastline, RegEM will therefore fill in preferentially from there. This is simply a reflection of the local weather patterns (and is seen more clearly in the satellite reconstruction). Ask yourself why the trans-antarctic mountains are so prominent in the figures? These mountains are not put in by hand, they are simply the natural demarcation for the various influences. Thus RegEM decides for itself what the radius of influence of any one station is. – gavin]

    Interesting. The satellite data was IR not RSS or MSU so the replicators could not gather the data needed to complete the replication. Do you know if that data is available online? Also, if the number of stations in each common area are averaged prior to RegEm input, pre-weighting spatially, there appears to be a significant difference in the results. How significant do you feel that is to the analysis?

    [Response: Joey Comiso is apparently working on the data preparation along with sufficient explanation to make it usable. I have no particular insight into the timetable. If all the stations being averaged together are coherent, then pre-averaging shouldn’t make that much difference. If instead there are some dud stations/bad data mixed in, they will corrupt the average and reduce the coherence with the far-field stations. Bad stations in the standard approach should be incoherent to any variability in the far field stations and so shouldn’t affect the result much. You’d be better off removing them rather than trying to average them away I would think. – gavin]

  26. 226
    captdallas2 says:

    Thanks Gavin, I’ll fish on it a while before I respond. Have a good night.

  27. 227

    This is actually off-topic but I thought this might interest you. Lord Monckton recently put online an article here:

    On page 3 there is a graph with the words:

    “It may be the sun: a strong anti-correlation between intensity and radiosonde temperatures over the past 50 years. Source: Svensmark and Friis-Christensen, 2007.”

    However, [edit – please be polite] the graph shows a heavily edited temperature trend with:

    1.) removed ENSO
    2.) removed effects of volcanic activity

    I saved the pic just in case it is removed:

    You can see that the upper part of the graph is cut off since there are parts of numbers visible at the upper corner. Here’s the full graph:

    The upper part has unedited temperatures, the lower one is edited, most importantly with removed warming trend.

    I’d love to see you publish some kind of response since this kind of behavior is something utterly unacceptable.

    Yours sincerely,

    Tuukka Simonen

  28. 228
    captdallas2 says:

    Ref 224,225

    “How RegEM is used spatially is a meaningless statement.” My bad. I assumed that the lat/lon of the predictor sites and sites to be predicted needed to be considered in RegEM, but that makes sense.

    [Response: No difference from PCA in this regard. Such methods make use only of the mutual covariance between the data, they know nothing about the locations or geometry. -mike]
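To illustrate the point in the response above, here is a minimal sketch (not taken from any of the papers discussed) showing that a covariance-based method sees only the data matrix. The synthetic "stations", the helper name `leading_pc_series` and all numbers are assumptions made up for the example. Shuffling the station columns, which is equivalent to scrambling any notion of location, leaves the leading principal component time series unchanged.

```python
import numpy as np


def leading_pc_series(data):
    """Time series of the leading principal component of a (time x station) matrix."""
    anomalies = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(anomalies, full_matrices=False)
    pc = u[:, 0] * s[0]
    # The sign of an SVD mode is arbitrary, so fix it for comparability.
    if pc[np.argmax(np.abs(pc))] < 0:
        pc = -pc
    return pc


rng = np.random.default_rng(0)
n_time, n_station = 200, 6

# Hypothetical records: one shared signal plus independent station noise.
common = rng.standard_normal(n_time)
loadings = rng.uniform(0.5, 1.5, n_station)
data = np.outer(common, loadings) + 0.3 * rng.standard_normal((n_time, n_station))

# Reorder the station columns: no coordinates are passed in either case.
shuffled = data[:, rng.permutation(n_station)]

pc_original = leading_pc_series(data)
pc_shuffled = leading_pc_series(shuffled)
max_difference = np.max(np.abs(pc_original - pc_shuffled))
print(f"largest difference between the two PC series: {max_difference:.2e}")
```

The same reasoning applies to any infilling scheme built only on the covariance of the columns: any spatial structure in the output comes from the data, not from coordinates.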

  29. 229

    I have been having a good time learning R and looking at MM07 and S09. I found some stuff that I think is interesting and look forward to comments. I have made a blog post here.

  30. 230

    Gavin (I hope it is ok to be informal),

    I wrote the entry above before I noticed that you had responded. I updated my post just now to reflect your comment.

    Since the points in the global table don’t correspond to the sat grid cells, which of the four choices did you make as to which grid cell to select? I am now planning to average all four, which is what I think you are saying you did in your update?

    [Response: In my original sample, it was top-right. The average I posted is for all four cells, weighted by the area (or cos(mid-point latitude)). – gavin]
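As a rough illustration of the weighting described in the response above, the sketch below averages four grid cells using cos(mid-point latitude) as a proxy for cell area. The trend values and the choice of a point at 60N on the corner of four 5×5 cells are hypothetical; only the weighting rule comes from the response.

```python
import math


def area_weighted_mean(values, mid_latitudes_deg):
    """Average values with weights proportional to cos(mid-point latitude)."""
    weights = [math.cos(math.radians(lat)) for lat in mid_latitudes_deg]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical trends for the four cells surrounding a point at 60N:
# two cells centred at 57.5N and two centred at 62.5N.
trends = [0.20, 0.24, 0.30, 0.34]
mid_latitudes = [57.5, 57.5, 62.5, 62.5]

weighted = area_weighted_mean(trends, mid_latitudes)
simple = sum(trends) / len(trends)
print(f"area-weighted mean: {weighted:.4f}, unweighted mean: {simple:.4f}")
```

Because the cells nearer the equator cover more area, they count for slightly more, so the weighted mean is pulled toward their values.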

  31. 231
    tamino says:

    Re: #229 (Nicolas Nierenberg)

    What you haven’t addressed, and neither did MM, is the spatial correlation of the data series. This is the reason for the spurious correlations, and it is why the results of MM are not just suspect, they’re in error.

    The claim in MM07 that they did GLS (generalized least squares) is incorrect. They did OLS (ordinary least squares), but attempted to correct the confidence intervals — not for correlation but for heteroskedasticity. In essence they assume that there’s no spatial correlation — an assumption which is, frankly, ludicrous. Doing so leads to hugely exaggerated estimates of significance.

  32. 232


    In my post I addressed what I set out to address which was mainly the demonstration that the choice of grid location was arbitrary and undocumented. My other comments just follow the logic of both papers.

    Why do you say that they didn’t use GLS? The paper says that they did.

  33. 233
    tamino says:

    Re: #232 (Nicolas Nierenberg)

    I said they didn’t use GLS because they didn’t.

    The paper states “Equation (3) was estimated using Generalized Least Squares (GLS) as follows.” They then proceed to describe a procedure which is most definitely not GLS. They may have made this mistake because some of the STATA documentation incorrectly indicates that certain procedures are a form of GLS when they’re not.

    What they did was OLS. They then attempt to compensate the probable variation in the parameters for heteroskedasticity of the residuals, by assuming that the variance-covariance matrix of the residuals is diagonal. This amounts to assuming that the noise in the data shows no spatial correlation — which assumption is, frankly, ludicrous.

    But: even under their no-correlation assumption, they didn’t do GLS. The way to do that is to use the assumed form for the var-covar matrix of the residuals in order to compute the regression itself; they only used it to estimate the probable variation of the parameters.

    And: even if you *do* use GLS, it won’t be right if you get the var-covar matrix wrong. Assuming no spatial correlation is wrong.
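The distinction drawn in this comment can be sketched numerically. The example below is not MM07's code or data: the synthetic regression, the noise model and every variable name are assumptions. It contrasts (a) OLS coefficients with heteroskedasticity-robust (White, HC0) standard errors, where the covariance assumption is used only for the uncertainties, with (b) GLS, where the assumed covariance matrix enters the coefficient estimate itself. A diagonal covariance is used here for simplicity; handling spatial correlation would require off-diagonal terms.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])

# Hypothetical heteroskedastic noise: standard deviation grows with x.
sigma = 0.5 + 0.3 * x
y = 1.0 + 2.0 * x + sigma * rng.standard_normal(n)

# (a) OLS coefficients, then a sandwich estimate of their covariance.
XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y
resid = y - X @ beta_ols
meat = X.T @ (X * resid[:, None] ** 2)      # sum of e_i^2 * x_i x_i'
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

# (b) GLS with an assumed diagonal covariance: the weights change the fit.
omega_inv = 1.0 / sigma**2
XtOiX = X.T @ (X * omega_inv[:, None])
beta_gls = np.linalg.solve(XtOiX, X.T @ (omega_inv * y))
se_gls = np.sqrt(np.diag(np.linalg.inv(XtOiX)))

print("OLS coefficients (robust s.e.):", beta_ols, se_robust)
print("GLS coefficients (GLS s.e.):   ", beta_gls, se_gls)
```

In (a) the coefficients are exactly what plain least squares gives; only the error bars change. In (b) the coefficients themselves move, which is the sense in which the comment says the reported procedure is not GLS.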

  34. 234

    Tamino (#233) yeah. Incredible.

    Don’t you think that scientists writing climatology papers involving non-trivial statistical methodology should consult with professional statisticians?

    Yeah I know, being mean ;-)

  35. 235
    Mark says:

    Martin Vermeer, #234. Or just average?

  36. 236
    Colin A says:

    Gavin and Mike (re. missing at random in RegEM) – I understand that spatiotemporal patterns in missing data are in general meaningless to RegEM. Per Rubin and Schneider (and your comments), the randomness in question only entails that the value of a missing datapoint cannot impact the probability that the datapoint is missing.

    However, very obvious spatial patterns in climate trends do exist (both in terms of absolute temperature and warming). Therefore, spatial patterns in missing data would often indicate nonrandomness in the data that is meaningful to the RegEM analysis, even if RegEM doesn’t care about the spatial pattern itself. We can account (and have accounted) for this issue on a global scale without major problems. But it’s less clear how this fits with the Steig et al. paper on Antarctic warming.

    On a global scale, spatial patterns in missing data often offset each other, and are thereby less problematic. For example, the randomness assumption would clearly be violated if the only place from which data was missing was the Arctic–because that data would be non-randomly cold, and would have a non-random warming trend (using Rubin’s definition of random). But if we are also missing a lot of data from the Sahara, the two sets of missing data largely offset each other as far as RegEM is concerned.

    However, when looking at a smaller geography like Antarctica, you don’t necessarily get these offsets. There does appear to be a spatiotemporal pattern to the missing data. There is also a fairly clear difference between the climate in East and West Antarctica in both the observed data and the imputed data (after all, the conclusion of the paper largely relies on the fact that the imputed data, on average, is colder than more recent, observed data for East Antarctica). Thus, the spatiotemporal pattern gives at least prima facie reason to question whether or not the missing data is random in a sense that is relevant to the RegEM analysis.

    I definitely do not have enough background here to diagnose whether this is actually an important issue in the Steig et al. paper. Maybe the missing data in Antarctica is still MAR, and even if it is not, it might not detract from the analysis. However, it does make me less surprised that the analysis produced a novel conclusion. I’d appreciate any feedback here. If there is a body of work already tackling this type of issue, a reference would also work fine.

    [Response: There is a substantial body of work testing how such infilling methods work in practice, i.e. whether the (somewhat weak) type of stationarity assumptions implicit in PCA- (or related variants such as RegEM)-based infilling algorithms work in the types of situations commonly encountered where it may not be obvious as to whether or not the assumptions are satisfied. See both the Schneider et al paper and various other papers cited in Steig et al (and papers cited by those papers) using artificial data, etc. It is precisely because one is never absolutely certain about the degree to which the underlying assumptions hold up, that cross-validation–i.e. testing whether the model works in predicting withheld data— as done in Steig et al, (using both PCA and RegEM-based infilling approaches), and other related studies, is so important. -mike]
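The idea of testing against withheld data can be sketched on artificial data. This is a toy example, not the procedure of Steig et al.: the two synthetic series, the simple linear-regression infilling and the reduction-of-error style score are all assumptions chosen for brevity. One record is treated as missing over part of the period, predicted from a complete record using only the calibration period, and then scored against the values that were held back.

```python
import numpy as np

rng = np.random.default_rng(3)
n_time = 300

# Artificial data: two records sharing a common signal plus independent noise.
signal = rng.standard_normal(n_time)
predictor = signal + 0.3 * rng.standard_normal(n_time)      # complete record
target = 0.8 * signal + 0.3 * rng.standard_normal(n_time)   # record with a gap

# Withhold the first 100 time steps of the target as if they were missing.
withheld = np.zeros(n_time, dtype=bool)
withheld[:100] = True
calibration = ~withheld

# Fit on the calibration period only, then predict the withheld values.
slope, intercept = np.polyfit(predictor[calibration], target[calibration], 1)
predicted = slope * predictor[withheld] + intercept

# Skill relative to simply using the calibration-period mean.
err_model = np.sum((target[withheld] - predicted) ** 2)
err_baseline = np.sum((target[withheld] - target[calibration].mean()) ** 2)
reduction_of_error = 1.0 - err_model / err_baseline
print(f"reduction of error on withheld data: {reduction_of_error:.2f}")
```

A score above zero means the infilled values beat the trivial baseline on data the fit never saw; a score at or below zero would be a warning that the method's assumptions are not holding.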

  37. 237
    Marcus says:

    A note on “using artificial data” from Mike’s response: This is one of my common critiques of skeptic studies (e.g., Schwartz and his “response time” method of calculating CS, or Spencer and his attempt to show that CO2 increases in the atmosphere are natural and not anthropogenic): using artificial data should be a first test of the method. Since we only have 1 earth, running control experiments is hard. And models, of course, can never be perfect representations of the earth. But they are a near-perfect testbed for methodologies: since we _know_ all the relevant information in the model (in this case, temperatures over the entire Antarctic; in Schwartz’s case, the response of the model to doubled CO2; in Spencer’s case, the real source of increased CO2), using a methodology on “imperfect information” from the model is a good first test of whether the methodology can back out a reasonable answer.

    This may not be a sufficient test of the methodology, but it is (almost always) a necessary test.

    And I’m glad (and not surprised) to hear that Schneider and other papers in the “reality” community do do the appropriate artificial data tests.

  38. 238
    Chris Ferrall says:

    Tamino does not seem to understand that Huber-White robust standard errors allow for arbitrary correlations within clusters. The procedure does not just correct for heteroscedasticity. MM07 corrected for clustering within countries because they used country-level data on smaller cells. The test then was not a test of spatial autocorrelation due to climate but due to data clustering. However, they fail to reject the null of no correlation (or heteroscedasticity) within countries. (Of course no real scientist would misinterpret that sentence as accepting the null hypothesis. Nor would any climate scientist misinterpret a p-value as an ex post probability.)

    Tamino is correct that the reported coefficients are OLS. He does not explain that in single equation estimation it makes sense NOT to adjust OLS coefficients for deviations from the Gauss-Markov assumptions, especially when the weights are not known a priori. Why? Because OLS estimates remain unbiased even under the alternative. Meanwhile if one attempts 2SLS on a single equation you introduce a bias in small samples. And if you fail to reject the null, which is what the test did, these estimates are now biased AND inefficient relative to OLS. Thus MM07 followed standard operating procedures.

    Tamino in his infinite knowledge does not realize that Stata (which is a name not an acronym, so not STATA) follows the convention in associating GLS with corrections to standard errors even if coefficient estimates are OLS, because it does GENERALIZE OLS and it is a form of observation WEIGHTING, not in terms of the coefficients but in terms of the information about the variance matrix of residuals (thus estimated standard errors).

    Regardless of naming conventions the results reported in the paper are clearly stated.

    But, hey, climate scientists are real scientists and they serve as the ultimate judge of all technique and interpretation. So don’t listen to me.

    [Response: The point tamino was making, which is correct, is that clustering by country is not the same as looking for and correcting the spatial correlation in the data. A claim that the generalisation of OLS ‘deals’ with this issue in this particular case, is therefore not supported. You just have to look at the difference between Europe and the N. America to see the very different impact country clustering will have – and in neither case is the real issue of spatial correlation dealt with. – gavin]
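To make concrete what a cluster-robust (Huber-White) covariance does and does not include, here is a minimal sketch on synthetic data. It is not MM07's calculation: the data, the cluster layout and the function name `cluster_robust_cov` are assumptions, and no finite-sample correction is applied. Residual cross-products are summed only within each cluster, so any correlation between observations in different clusters never enters the estimate.

```python
import numpy as np


def cluster_robust_cov(X, resid, clusters):
    """Sandwich covariance of OLS coefficients with clustered residuals."""
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        rows = clusters == g
        score = X[rows].T @ resid[rows]   # sum of x_i * e_i inside one cluster
        meat += np.outer(score, score)    # cross terms stay within the cluster
    return bread @ meat @ bread


rng = np.random.default_rng(2)
n_clusters, per_cluster = 20, 10
clusters = np.repeat(np.arange(n_clusters), per_cluster)
n = clusters.size

x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
shared = rng.standard_normal(n_clusters)[clusters]   # error shared within a cluster
y = 0.5 + 1.5 * x + shared + 0.5 * rng.standard_normal(n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

cov_by_cluster = cluster_robust_cov(X, resid, clusters)
# With every observation as its own cluster this reduces to the
# heteroskedasticity-only (HC0) estimate.
cov_singletons = cluster_robust_cov(X, resid, np.arange(n))

print("s.e. with clusters:  ", np.sqrt(np.diag(cov_by_cluster)))
print("s.e. with singletons:", np.sqrt(np.diag(cov_singletons)))
```

If neighbouring clusters (for example, adjacent countries) share correlated errors, those cross-cluster terms are simply absent from the sum, which is the gap the response points to.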

  39. 239
    Chris Ferrall says:

    “Assuming no spatial correlation is wrong.”

    Of course this is not a scientific statement, in that it is not quantitative. For example, according to Tamino the assumptions of general relativity are wrong because they are violated at the sub atomic level. The fact that this violation is insignificant at the galactic level does not matter. They are wrong, wrong, wrong!

    Tamino seems to believe that demonstrating (significant) correlation among observables is sufficient to make inference under the assumption of no correlation invalid. And Tamino continues to assert without justification that the robust standard errors reported in the second part of this paper do not relax the assumption of no correlation.

    If Tamino said “testing for arbitrary correlation in residuals within countries and failing to reject the null of no correlation is not likely to be a good test of autocorrelation in residuals due to climate” then I would find what he had to say next about this topic worth some attention.

  40. 240
    Ray Ladbury says:

    Chris Ferrall, we aren’t talking about anything requiring string theory to resolve here. Climate doesn’t respect borders. GDP does. As such, it seems to me that Tamino’s objection is quite cogent. If I were the authors, I would want to address it and see if it affected the analysis–assuming the purpose of the analysis was to illuminate the science.

  41. 241
    Chris Ferrall says:

    Tamino claims that robust standard errors as computed did not control for correlation across observations. That is NOT correct yet he continues to repeat that claim. He is wrong and continues to be wrong and misleading about the limitations of the original analysis (which indeed has limitations). One might call this Huberis.

    Indeed as everyone says … it would be useful to carry out an analysis that captures this vaunted spatial autocorrelation based on climate that everyone thinks is the Achilles heel of this study. Dr. Schmidt’s replication did not do this either and ignored the tests carried out in the original study.

    Everyone here falls into the trap that any deviation from the null hypothesis is critical. It is not a quantitative issue concerning residuals – it is a qualitative issue that can be answered by looking at observables. And everyone seems to believe that a test of a particular alternative (arbitrary correlation within countries) says nothing about a related alternative (spatial correlation due to climate).

    But … I agree with this: MM07 could have clustered Europe even though the data is not clustered. Perhaps that would have affected their failure to reject the null. Dr. Schmidt should be able to carry out that test very easily so I look forward to seeing the results.

  42. 242
    Hank Roberts says:

    Chris Ferrall claims Tamino made an error about general relativity.
    The word ‘relativity’ appears twice in the thread.

    Can you point to a basis for this claim? Use the ‘find’ tool.

    Or is this illustrating the tactic of taking something used in a particular context so far out of context that it breaks, and claiming that proves it couldn’t possibly have been correct where it was actually being used? Showing why scientists mistrust the motives of those who want to take all their work for free to nitpick it?

    Or some other intent here?

  43. 243
    Hank Roberts says:

    > One might call this Huberis.

    Petard, engineer, boom.

  44. 244
    tamino says:

    Re: Chris Ferrall

    I understand perfectly well that the procedure used by MM07 addresses the correlation within clusters. All you’ve done is attempt to deflect attention from the real point: that MM07 utterly ignores the correlation between clusters. Which invalidates their results.

    As for “assuming no spatial correlation is wrong,” I stand corrected. I should have said “There’s undeniable spatial correlation in these data, so ignoring it is wrong.”

    I’ve got no beef with correcting OLS for correlation rather than using GLS, I do it all the time. But I don’t call it GLS. It’s probably an honest mistake due to the way Stata (not STATA — thanks for the triviality) documents their procedures. But it’s already caused commenters here falsely to believe that their procedure does correct for all correlations, simply by the use of the term; as such it deserves correction.

    If you had contributed anything other than sound and fury signifying nothing, I might be interested in your further opinions. But since you chose only to strut and fret your hour upon the stage, I can only hope you’ll then be heard no more.

  45. 245
    Mark says:

    Chris Ferrall #241 “One might call this Huberis.”

    One could. One would wonder what you were on about. Maybe Hubris, but then again, who can tell.

    “Indeed as everyone says ” Isn’t true either. Many say. Not everyone. And many who say it would be useful would add that other things would be MORE useful. Cleaning the tiling grout in the bathroom may be useful, but if your mum is having a heart attack, phoning the ambulance would be more useful.

    And I also fail to see where you have shown Tamino was wrong except just by stating it is so. [edit – please keep it civil]

  46. 246
    dhogaza says:

    But, hey, climate scientists are real scientists and they serve as the ultimate judge of all technique and interpretation. So don’t listen to me.

    Tamino’s not a climate scientist, he’s a professional statistician, more than qualified to comment on the quality of the statistical analysis done by M&M.

    Your arrogance isn’t helping your cause, BTW. When people present themselves as being superior scientists to all those doing work in climate science, it’s pretty obvious to the objective observer that they’re just blowin’ smoke.

  47. 247

    I have updated my results using the 5×5 grids after a couple of false starts. My results while similar to S09 are not identical. I show more significance for the economic data using the RSS tropospheric values. This is probably caused by a difference in how the RSS data are processed which is not documented in either S09 or the SI.

    Before you all jump down my throat about spatial correlation I am just looking at the particular conclusion in S09 that using the RSS data got a significantly different result than the UAH data. I didn’t get the same result.

    My post is here.

    I also note that Dr. McKitrick wrote a paper discussing the spatial correlation issue. It can be found here. I don’t know whether the comments from tamino and Gavin already take this response into account.

  48. 248

    I’m probably the only one who cares, but I suddenly recalled that S09 used an updated temperature set in addition to the RSS data. I made that change and now my results are consistent with S09. It requires both changes to get this result. The UAH data combined with the updated surface data still gets the same result as MM07.

  49. 249
    Chris Ferrall says:

    Thanks for convincing me of your view by such reasoned arguments. Clearly peer review failed in the case of this science journal since, as you have so clearly convinced me, the results they published are wrong.

    “I can only hope you’ll then be heard no more.”

    Indeed! I have now deleted Real Climate from my bookmarks so that the science will not be impeded any further by dealing with such a feeble mind.

    [Response: How convenient for you. Heaven forbid that you not find an excuse to engage without the sarcasm. Regardless, the correlations published by MM07 aren’t GLS. – gavin]

  50. 250
    Ian says:

    re Chris Ferrall’s posts:

    For anyone not familiar with academic subcultures, many people in Econ departments are primed to be aggressive in every interaction. Econ people seem to like this style and take it as a sign of intelligence; most of the rest of us refer to it as “d*** measuring.” It’s especially embarrassing when the criticism and bluster wander so far from the operative point, as demonstrated by Mr. Ferrall above.