This week has been dominated by questions of replication and of what standards are required to serve the interests of transparency and/or science (not necessarily the same thing). Possibly a recent example of replication would be helpful in showing up some of the real (as opposed to manufactured) issues that arise. The paper I’ll discuss is one of mine, but in keeping with our usual stricture against too much pro-domo writing, I won’t discuss the substance of the paper (though of course readers are welcome to read it themselves). Instead, I’ll focus on the two separate replication efforts I undertook in order to do the analysis. The paper in question is Schmidt (2009, IJoC), and it revisits two papers published in recent years purporting to show that economic activity is contaminating the surface temperature records – specifically de Laat and Maurellis (2006) and McKitrick and Michaels (2007).
Both of these papers were based on analyses of publicly available data – the EDGAR gridded CO2 emissions, UAH MSU-TLT (5.0) and HadCRUT2 in the first paper, UAH MSU-TLT, CRUTEM2v and an eclectic mix of economic indicators in the second. In the first paper (dLM06), no supplementary data were placed online, while the second (MM07) placed the specific data used in the analysis online along with an application-specific script for the calculations. In dLM06 a new method of analysis was presented (though a modification of their earlier work), while MM07 used standard multiple regression techniques. Between them these papers and their replication touch on almost all of the issues raised in recent posts and comments.
Data-as-used vs. pointers to online resources
MM07 posted their data-as-used, and since those data were drawn from dozens of different sources (GDP, Coal use, population etc. as well as temperature), trends calculated and then gridded, recreating this data from scratch would have been difficult to say the least. Thus I relied on their data collation in my own analysis. However, this means that the economic data and their processing were not independently replicated. Depending on what one is looking at this might or might not be an issue (and it wasn’t for me).
On the other hand, dLM06 provided no data-as-used, making do with pointers to the online servers for the three principal data sets they used. Unlike for MM07, the preprocessing of their data for their analysis was straightforward – the data were already gridded, and the only required step was regridding to a specific resolution (from 1ºx1º online to 5ºx5º in the analysis). However, since the data used were not archived, the text in the paper had to be relied upon to explain exactly what data were used. It turns out that the EDGAR emissions are disaggregated into multiple source types, and the language in the paper wasn’t explicit about precisely which source types were included. This was apparent when the total emissions I came up with differed from the number given in the paper. A quick email to the author resolved the issue since they hadn’t included aircraft, shipping or biomass sources in their total. This made sense, and did not affect the calculations materially.
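For readers who want to see what that regridding step amounts to, here is a minimal sketch in Python. It is not the code used for the paper: it assumes the emissions arrive as a 180x360 array of per-cell totals (so that blocks can simply be summed), and the function name `regrid_sum` and the synthetic field are made up for illustration. A field such as temperature would need an area-weighted mean instead.

```python
import numpy as np

def regrid_sum(fine, factor=5):
    """Aggregate a (nlat, nlon) grid of per-cell totals into coarser
    cells by summing non-overlapping factor x factor blocks.

    Assumption: values are totals per cell (e.g. emitted mass), so
    summing is the right aggregation.
    """
    nlat, nlon = fine.shape
    if nlat % factor or nlon % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    # Split each axis into (coarse index, position within block).
    blocks = fine.reshape(nlat // factor, factor, nlon // factor, factor)
    return blocks.sum(axis=(1, 3))

# Synthetic 1x1 degree field with one unit of emissions per cell.
fine = np.ones((180, 360))
coarse = regrid_sum(fine)

# The global total should survive regridding unchanged; comparing such
# totals against a published number is a cheap consistency check.
print(coarse.shape, fine.sum(), coarse.sum())
```

Because the global total is conserved by construction, a mismatch with the number in a paper points to a difference in the inputs (here, which source types were included) rather than in the regridding.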
In all of the data used, there are ongoing updates to the raw data. For the temperature records, there are variations over time in the processing algorithms (satellites as well as surface stations), for emissions and economic data, updates in reporting or estimation, and in all cases the correction of errors is an ongoing process. Since my interest was in how robust the analyses were, I spent some time reprocessing the updated datasets. This involved downloading the EDGAR3 data, the latest UAH MSU numbers, the latest CRUTEM2/HadCRU2v numbers, and alternative versions of the same (such as the RSS MSU data, HadCRUT3v, GISTEMP). In many cases, these updates are in different formats, have different ‘masks’ and required specific and unique processing steps. Given the complexity of (and my unfamiliarity with) economic data, I did not attempt to update that, or even ascertain whether updates had occurred.
In these two papers then, we have two of the main problems often alluded to. It is next-to-impossible to recreate exactly the calculation used in dLM06 since the data sets have changed in the meantime. However, since my scientific interest is in what their analysis says about the real world, any conclusion that was not robust to that level of minor adjustment would not have been interesting. By redoing their calculations with the current data, or with different analyses of analogous data, it is very easy to see that there is no such dependency, and thus reproducing their exact calculation becomes moot. In the MM07 case, it is very difficult for someone coming from the climate side to test the robustness of their analysis to updates in economic data and so that wasn’t done. Thus while we have the potential for an exact replication, we are no wiser about its robustness to possibly important factors. I however was able to easily test the robustness of their calculations to changes in the satellite data source (RSS vs. UAH) or to updates in the surface temperature products.
MM07 used an apparently widespread statistics program called STATA and archived a script for all of their calculations. While this might have been useful for someone familiar with this proprietary software, it is next to useless for someone who doesn’t have access to it. STATA scripts are extremely high level, implying they are easy to code and use, but since the underlying code in the routines is not visible or public, they provide no means by which to translate the exact steps taken into a different programming language or environment. However, the calculations mainly consisted of multiple linear regressions, which are a standard technique, and so other packages are relatively easily available. I’m an old-school Fortran programmer (I know, I know), and so I downloaded a Fortran package that appeared to have the same functionality and adapted it to my needs. Someone using Matlab or R could have done something very similar. It was a simple matter to then check that the coefficients from my calculation and those in MM07 were practically the same and that there was a one-to-one match in the nominal significance (which was also calculated differently). This also provides a validation of the STATA routines (which I’m sure everyone was concerned about).
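To give a flavour of how little is needed to redo a multiple linear regression in another environment, here is a minimal sketch in Python with NumPy. It is neither the STATA script nor the Fortran package mentioned above: the data are synthetic, the function name `ols` is invented for illustration, and the standard errors are the classical ones for independent, equal-variance residuals, which is an assumption and not necessarily how significance was computed in either analysis.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares with an intercept.

    Returns the coefficients (intercept first) and their classical
    standard errors, assuming independent, equal-variance residuals.
    """
    A = np.column_stack([np.ones(len(y)), X])      # design matrix
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares fit
    resid = y - A @ beta
    dof = len(y) - A.shape[1]                      # residual degrees of freedom
    sigma2 = resid @ resid / dof                   # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)          # coefficient covariance
    return beta, np.sqrt(np.diag(cov))

# Synthetic stand-in for archived data-as-used: three predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([0.5, 1.0, -2.0, 0.0])
y = true_beta[0] + X @ true_beta[1:] + 0.1 * rng.normal(size=200)

beta, se = ols(X, y)
t_stats = beta / se   # nominal significance of each coefficient
print(beta, t_stats)
```

Comparing `beta` and the pattern of significant `t_stats` against the published table is the kind of check described above; agreement between two independent implementations is also a mild validation of both.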
The processing in dLM06 was described plainly in their paper. The idea is to define area masks as a function of the emissions data and calculate the average trend – two methods were presented (averaging over the area then calculating the trend, or calculating the trends and averaging them over the area). With complete data these methods are equivalent, but not quite when there are missing data, though the uncertainties in the trend are more straightforward in the first case. It was pretty easy to code this up myself so I did. Turns out that the method used in dLM06 was not the one they said, but again, having coded both, it is easy to test whether that was important (it isn’t).
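The difference between the two averaging methods described above can be seen in a toy example. What follows is a minimal sketch in Python, not the code from either paper: the three-cell "grid", the emissions threshold and all function names are invented for illustration, and the cells are averaged without area weighting, which a real analysis on a latitude-longitude grid would need.

```python
import numpy as np

def trend(t, y):
    """Least-squares slope of y against t, skipping missing (NaN) values."""
    ok = ~np.isnan(y)
    return np.polyfit(t[ok], y[ok], 1)[0]

def trend_of_mean(t, data, mask):
    """Method 1: average the masked cells at each time, then fit one trend."""
    return trend(t, np.nanmean(data[:, mask], axis=1))

def mean_of_trends(t, data, mask):
    """Method 2: fit a trend in each masked cell, then average the trends."""
    return np.mean([trend(t, data[:, j]) for j in np.flatnonzero(mask)])

# Toy data: 20 time steps, 3 cells with exact linear trends.
t = np.arange(20.0)
slopes = np.array([0.1, 0.3, 0.0])
data = t[:, None] * slopes                 # shape (time, cell)

# Area mask defined from (made-up) emissions: keep high-emission cells.
emissions = np.array([5.0, 8.0, 0.1])
mask = emissions > 1.0

# Complete data: the two methods agree.
complete = (trend_of_mean(t, data, mask), mean_of_trends(t, data, mask))

# Missing data: drop the first half of the record in the second cell.
gappy = data.copy()
gappy[:10, 1] = np.nan
missing = (trend_of_mean(t, gappy, mask), mean_of_trends(t, gappy, mask))

print(complete, missing)
```

With complete data both methods return the same number because the least-squares slope is linear in the data. Once one cell loses part of its record, the area mean changes composition part-way through, so the trend of the mean picks up a spurious step that the mean of the trends does not see.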
Given the data from various sources and my own codes for the processing steps, I did a few test cases to show that I was getting basically the same results in the same circumstances as were reported in the original papers. That worked out fine. Had there been any further issues at this point, I would have sent out a couple of emails, but this was not necessary. Jos de Laat had helpfully replied to two previous questions (concerning what was included in the emissions and the method used for the average trend), and I’m sure he or the other authors involved would have been happy to clarify anything else that might have come up.
Are we done? Not in the least.
Much of the conversation concerning replication often appears to be based on the idea that a large fraction of scientific errors, incorrect conclusions or problematic results are the result of errors in coding or analysis. The idealised implication is that if we could just eliminate coding errors, then science would be much more error free. While there are undoubtedly individual cases where this has been the case (this protein folding code for instance), the vast majority of papers that turn out to be wrong or non-robust are so because of incorrect basic assumptions, overestimates of the power of a test, some wishful thinking, or a failure to take account of other important processes (it might be a good idea for someone to tally this in a quantitative way – any ideas for how that might be done?).
In the cases here, the issues that I thought worth exploring from a scientific point of view were not whether the arithmetic was correct, but whether the conclusions drawn from the analyses were. To test that I varied the data sources, the time periods used, the importance of spatial auto-correlation on the effective number of degrees of freedom, and most importantly, I looked at how these methodologies stacked up in numerical laboratories (GCM model runs) where I knew the answer already. That was the bulk of the work and where all the science lies – the replication of the previous analyses was merely a means to an end. You can read the paper to see how that all worked out (actually even the abstract might be enough).
Despite minor errors in the printed description of what was done and no online code or data, my replication of the dLM06 analysis and its application to new situations was more thorough than I was able to do with MM07 despite their more complete online materials. Precisely because I recreated the essential tools myself, I was able to explore the sensitivity of the dLM06 results to all of the factors I thought important. While I did replicate the MM07 analysis, the fact that I was dependent on their initial economic data collation means that some potentially important sensitivities did not get explored. In neither case was replication trivial, though neither was it particularly arduous. In both cases there was enough information to scientifically replicate the results despite very different approaches to archiving. I consider that both sets of authors clearly met their responsibilities to the scientific community to have their work be reproducible.
However, the bigger point is that reproducibility of an analysis does not imply correctness of the conclusions. This is something that many scientists clearly appreciate, and probably lies at the bottom of the community’s slow uptake of online archiving standards since they mostly aren’t necessary for demonstrating scientific robustness (as in these cases for instance). In some sense, it is a good solution to an unimportant problem. For non-scientists, this point of view is not necessarily shared, and there is often an explicit link made between any flaw in a code or description, however minor, and the dismissal of a result. However, it is not until the “does it matter?” question has been fully answered that any conclusion is warranted. The unsatisfying part of many online replication attempts is that this question is rarely explored.
To conclude? Ease of replicability does not correlate with the quality of the scientific result.
And oh yes, the supplemental data for my paper are available here.