On replication

8 Feb 2009 by Gavin

This week has been dominated by questions of replication and of what standards are required to serve the interests of transparency and/or science (not necessarily the same thing). Possibly a recent example of replication would be helpful in showing up some of the real (as opposed to manufactured) issues that arise. The paper I’ll discuss is one of mine, but in keeping with our usual stricture against too much pro-domo writing, I won’t discuss the substance of the paper (though of course readers are welcome to read it themselves). Instead, I’ll focus on the two separate replication efforts I undertook in order to do the analysis. The paper in question is Schmidt (2009, IJoC), and it revisits two papers published in recent years purporting to show that economic activity is contaminating the surface temperature records – specifically de Laat and Maurellis (2006) and McKitrick and Michaels (2007).

Both of these papers were based on analyses of publicly available data – the EDGAR gridded CO2 emissions, UAH MSU-TLT (5.0) and HadCRUT2 in the first paper, UAH MSU-TLT, CRUTEM2v and an eclectic mix of economic indicators in the second. In the first paper (dLM06), no supplementary data were placed online, while the second (MM07) placed the specific data used in the analysis online along with an application-specific script for the calculations. In dLM06 a new method of analysis was presented (though a modification of their earlier work), while MM07 used standard multiple regression techniques. Between them these papers and their replication touch on almost all of the issues raised in recent posts and comments.

Data-as-used vs. pointers to online resources

MM07 posted their data-as-used, and since those data were drawn from dozens of different sources (GDP, Coal use, population etc. as well as temperature), trends calculated and then gridded, recreating this data from scratch would have been difficult to say the least. Thus I relied on their data collation in my own analysis. However, this means that the economic data and their processing were not independently replicated. Depending on what one is looking at this might or might not be an issue (and it wasn’t for me).

On the other hand, dLM06 provided no data-as-used, making do with pointers to the online servers for the three principal data sets they used. Unlike for MM07, the preprocessing of their data for their analysis was straightforward – the data were already gridded, and the only required step was regridding to a specific resolution (from 1ºx1º online to 5ºx5º in the analysis). However, since the data used were not archived, the text in the paper had to be relied upon to explain exactly what data were used. It turns out that the EDGAR emissions are disaggregated into multiple source types, and the language in the paper wasn’t explicit about precisely which source types were included. This became apparent when the total emissions I came up with differed from the number given in the paper. A quick email to the author resolved the issue: they hadn’t included aircraft, shipping or biomass sources in their total. This made sense, and did not affect the calculations materially.
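To make the regridding step concrete, here is a minimal Python sketch (not the code used for the paper, which was written in Fortran). It assumes the emissions are stored as per-cell totals on a 180 x 360 grid, so that coarsening to 5ºx5º means summing 5 x 5 blocks. The source names and the random data are hypothetical stand-ins, not the actual EDGAR categories or values.

```python
import numpy as np

def coarsen_sum(grid, factor=5):
    """Sum a (nlat, nlon) grid of per-cell totals into factor x factor blocks."""
    nlat, nlon = grid.shape
    if nlat % factor or nlon % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    # Split each axis into (coarse index, position within block), then sum the blocks.
    blocks = grid.reshape(nlat // factor, factor, nlon // factor, factor)
    return blocks.sum(axis=(1, 3))

def total_emissions(by_source, exclude=()):
    """Add up per-source grids, skipping any source names listed in `exclude`."""
    kept = [g for name, g in by_source.items() if name not in exclude]
    return np.sum(kept, axis=0)

# Synthetic stand-in data with hypothetical source names (assumption, not EDGAR).
rng = np.random.default_rng(0)
by_source = {name: rng.random((180, 360)) for name in
             ("industry", "transport", "aircraft", "shipping", "biomass")}

# Leave out the source types that were not part of the published total.
fine = total_emissions(by_source, exclude=("aircraft", "shipping", "biomass"))
coarse = coarsen_sum(fine, factor=5)  # 36 x 72 grid of 5x5-degree totals
```

Summing blocks conserves the global total, which gives a quick check against a number quoted in a paper. For quantities such as temperature, or emissions expressed per unit area, an area-weighted mean would be the appropriate choice instead of a sum.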

Data updates

In all of the data used, there are ongoing updates to the raw data. For the temperature records, there are variations over time in the processing algorithms (satellites as well as surface stations); for emissions and economic data, there are updates in reporting or estimation; and in all cases the correction of errors is an ongoing process. Since my interest was in how robust the analyses were, I spent some time reprocessing the updated datasets. This involved downloading the EDGAR3 data, the latest UAH MSU numbers, the latest CRUTEM2/HadCRU2v numbers, and alternative versions of the same (such as the RSS MSU data, HadCRUT3v, GISTEMP). In many cases, these updates are in different formats, have different ‘masks’ and required specific and unique processing steps. Given the complexity of (and my unfamiliarity with) economic data, I did not attempt to update that, or even ascertain whether updates had occurred.

In these two papers then, we have two of the main problems often alluded to. It is next-to-impossible to recreate exactly the calculation used in dLM06 since the data sets have changed in the meantime. However, since my scientific interest is in what their analysis says about the real world, any conclusion that was not robust to that level of minor adjustment would not have been interesting. By redoing their calculations with the current data, or with different analyses of analogous data, it is very easy to see that there is no such dependency, and thus reproducing their exact calculation becomes moot. In the MM07 case, it is very difficult for someone coming from the climate side to test the robustness of their analysis to updates in economic data and so that wasn’t done. Thus while we have the potential for an exact replication, we are no wiser about its robustness to possibly important factors. I was, however, able to easily test the robustness of their calculations to changes in the satellite data source (RSS vs. UAH) or to updates in the surface temperature products.

Processing

MM07 used an apparently widespread statistics program called STATA and archived a script for all of their calculations. While this might have been useful for someone familiar with this proprietary software, it is next to useless for someone who doesn’t have access to it. STATA scripts are extremely high level, implying they are easy to code and use, but since the underlying code in the routines is not visible or public, they provide no means by which to translate the exact steps taken into a different programming language or environment. However, the calculations mainly consisted of multiple linear regressions, which are a standard technique, and so other packages are readily available. I’m an old-school fortran programmer (I know, I know), and so I downloaded a fortran package that appeared to have the same functionality and adapted it to my needs. Someone using Matlab or R could have done something very similar. It was a simple matter to then check that the coefficients from my calculation and those in MM07 were practically the same and that there was a one-to-one match in the nominal significance (which was also calculated differently). This also provides a validation of the STATA routines (which I’m sure everyone was concerned about).
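For readers without STATA or Fortran, the same kind of cross-check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MM07 script or the package used for the paper: the data are synthetic, the function names are made up for the example, and the standard errors are the classical ones that ignore any spatial autocorrelation.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares with an intercept.

    Returns the coefficients and their classical standard errors
    (assumes independent, equal-variance residuals).
    """
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(A.T @ A)
    return beta, np.sqrt(np.diag(cov))

def ols_normal_equations(X, y):
    """Independent second implementation: solve the normal equations directly."""
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.solve(A.T @ A, A.T @ y)

# Synthetic predictors and response (assumption: stand-ins for the real data).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 0.5 + X @ np.array([1.0, -2.0, 0.3]) + rng.normal(scale=0.1, size=200)

beta, se = ols(X, y)
beta_check = ols_normal_equations(X, y)

# Two different routes to the same regression should agree to rounding error.
agree = np.allclose(beta, beta_check, atol=1e-8)
```

Agreement between two independently written routines is the point of the exercise: it says the arithmetic is right, and nothing yet about whether the regression is the right tool for the question.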

The processing in dLM06 was described plainly in their paper. The idea is to define area masks as a function of the emissions data and calculate the average trend – two methods were presented (averaging over the area then calculating the trend, or calculating the trends and averaging them over the area). With complete data these methods are equivalent, but not quite when there are missing data, though the uncertainties in the trend are more straightforward in the first case. It was pretty easy to code this up myself so I did. Turns out that the method used in dLM06 was not the one they said, but again, having coded both, it is easy to test whether that was important (it isn’t).
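The difference between the two averaging methods can be seen in a small Python sketch. Everything here is an assumption made for illustration: six synthetic grid cells with made-up trends and area weights, and a plain least-squares line fit. It is not the dLM06 code or the code written for the paper.

```python
import numpy as np

def slope(t, y):
    """Least-squares linear trend of y against t, ignoring missing (NaN) values."""
    ok = ~np.isnan(y)
    return np.polyfit(t[ok], y[ok], 1)[0]

def trend_of_mean(t, data, weights):
    """Method 1: average over the area at each time step, then fit one trend."""
    # Cells with missing data get zero weight at that time step.
    w = np.where(np.isnan(data), 0.0, weights)
    mean = np.nansum(data * weights, axis=1) / w.sum(axis=1)
    return slope(t, mean)

def mean_of_trends(t, data, weights):
    """Method 2: fit a trend in each cell, then average the trends over the area."""
    slopes = np.array([slope(t, data[:, j]) for j in range(data.shape[1])])
    return np.sum(slopes * weights) / np.sum(weights)

# Synthetic example: 30 time steps, 6 cells with different underlying trends.
rng = np.random.default_rng(2)
t = np.arange(30.0)
true_trends = np.linspace(0.01, 0.06, 6)
data = true_trends * t[:, None] + rng.normal(scale=0.02, size=(30, 6))
weights = np.array([1.0, 1.0, 2.0, 2.0, 1.5, 0.5])  # hypothetical area weights

# With complete data the two methods agree to rounding error.
gap_complete = abs(trend_of_mean(t, data, weights) - mean_of_trends(t, data, weights))

# Knock out the start of one record and the end of another.
gappy = data.copy()
gappy[:10, 0] = np.nan
gappy[20:, 5] = np.nan
gap_missing = abs(trend_of_mean(t, gappy, weights) - mean_of_trends(t, gappy, weights))
```

With complete data the least-squares slope is a linear function of the series, so averaging before or after the fit gives the same number. Once cells drop in and out, the area average at each time step is taken over a changing set of cells, and the two methods part company.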

Replication

Given the data from various sources and my own codes for the processing steps, I did a few test cases to show that I was getting basically the same results in the same circumstances as were reported in the original papers. That worked out fine. Had there been any further issues at this point, I would have sent out a couple of emails, but this was not necessary. Jos de Laat had helpfully replied to two previous questions (concerning what was included in the emissions and the method used for the average trend), and I’m sure he or the other authors involved would have been happy to clarify anything else that might have come up.

Are we done? Not in the least.

Science

Much of the conversation concerning replication often appears to be based on the idea that a large fraction of scientific errors, incorrect conclusions or problematic results are the result of errors in coding or analysis. The idealised implication is that if we could just eliminate coding errors, then science would be much more error free. While there are undoubtedly individual cases where this has been the case (this protein folding code for instance), the vast majority of papers that turn out to be wrong or non-robust are so because of incorrect basic assumptions, overestimates of the power of a test, some wishful thinking, or a failure to take account of other important processes. (It might be a good idea for someone to tally this in a quantitative way – any ideas for how that might be done?)

In the cases here, the issues that I thought worth exploring from a scientific point of view were not whether the arithmetic was correct, but whether the conclusions drawn from the analyses were. To test that I varied the data sources, the time periods used, the importance of spatial auto-correlation on the effective number of degrees of freedom, and most importantly, I looked at how these methodologies stacked up in numerical laboratories (GCM model runs) where I knew the answer already. That was the bulk of the work and where all the science lies – the replication of the previous analyses was merely a means to an end. You can read the paper to see how that all worked out (actually even the abstract might be enough).

Bottom line

Despite minor errors in the printed description of what was done and no online code or data, my replication of the dLM06 analysis and its application to new situations was more thorough than I was able to do with MM07 despite their more complete online materials. Precisely because I recreated the essential tools myself, I was able to explore the sensitivity of the dLM06 results to all of the factors I thought important. While I did replicate the MM07 analysis, the fact that I was dependent on their initial economic data collation means that some potentially important sensitivities did not get explored. In neither case was replication trivial, though neither was it particularly arduous. In both cases there was enough information to scientifically replicate the results despite very different approaches to archiving. I consider that both sets of authors clearly met their responsibilities to the scientific community to have their work be reproducible.

However, the bigger point is that reproducibility of an analysis does not imply correctness of the conclusions. This is something that many scientists clearly appreciate, and probably lies at the bottom of the community’s slow uptake of online archiving standards since they mostly aren’t necessary for demonstrating scientific robustness (as in these cases for instance). In some sense, it is a good solution to an unimportant problem. For non-scientists, this point of view is not necessarily shared, and there is often an explicit link made between any flaw in a code or description, however minor, and the dismissal of a result. However, it is not until the “does it matter?” question has been fully answered that any conclusion is warranted. The unsatisfying part of many online replication attempts is that this question is rarely explored.

To conclude? Ease of replicability does not correlate with the quality of the scientific result.

And oh yes, the supplemental data for my paper are available here.


295 Responses to "On replication"

Eli Rabett says

11 Feb 2009 at 5:24 PM

Seems to Eli that if a largish number of models delivers about the same answer that is pretty good assurance that what coding gremlins there are, have only a small effect. OTOH, the place to look for significant stuff is if there are cores of routines used in all of the models. Since those shared elements tend to be the ones produced by the professional coders (usually for mathematical applications such as integration), the Ball is Back in Your Court Dan.
Joe S says

11 Feb 2009 at 5:37 PM

All software can be Verified. Objective technical criteria and associated success metrics can be developed and applied in a manner that provides assurances about the correctness of the coding of the equations and their numerical solutions. Lack of Verification leaves open the potential that the numbers from the software are simply results of “bugs” in the coding.

If only the world were this simple. It’s not. Read a bit about the halting problem for a deeper understanding of some of the issues.
Hank Roberts says

11 Feb 2009 at 5:46 PM

> There is no other examples …, that are not
> Independently Verified.

Blather.

http://www.agent2005.anl.gov/2005pdf/Kennedy%20et%20al.pdf
VERIFICATION AND VALIDATION OF SCIENTIFIC AND ECONOMIC MODELS

“… subjective methods typically require less effort than quantitative methods, can detect flaws early in the simulation process, and are often the only applicable verification and validation methods for exploratory simulation studies.

“We next describe some of the subjective
techniques proposed by Balci (1998) that may be applicable to economic and agent-based scientific simulations. His techniques are widely used in validating the models of manufacturing, engineering, and business processes. …

“1. Face validation. This preliminary approach to validation involves asking domain experts whether the model behaves reasonably and is sufficiently accurate….”
________
Mark says

11 Feb 2009 at 6:02 PM

I’m sure nobody’s heard of the halting problem, Joe. It’s not like it’s ever brought up as the pitfall of the Newton-Raphson method in mathematics…
Ray Ladbury says

11 Feb 2009 at 8:09 PM

Dan Hughes #147: No validation, huh? Gee, I wonder why none of the dozens of scientific professional and honorific societies that have looked IN DETAIL at the consensus has reached a similar conclusion. One has to ask, Dan, what you know that the National Academies, the American Physical Society… hell, even the American Association of Petroleum Geologists doesn’t. Maybe, to paraphrase Mr. Twain–what you know is something that just ain’t so.
Ike Solem says

11 Feb 2009 at 8:47 PM

Dan: “demanding tests of climate models” are indeed important, but first, take a look at these:

http://www.aip.org/history/climate/simple.htm

http://www.aip.org/history/climate/co2.htm

http://www.aip.org/history/climate/oceans.htm

http://www.jamstec.go.jp/frcgc/manabe/cbiblio.html
(A bibliography of Syukuro Manabe, a good record of climate model development)

Manabe, S., and R.T. Wetherald, The Effects of Doubling the CO2 Concentration on the Climate of a General Circulation Model, Journal of the Atmospheric Sciences, 32(1): 3-15, 1975.

Validation: That’s when you do a 20th century model run with known CO2 and temperature data (as well as CH4, etc) and see what happens when you leave out the anthropogenic increases. If you do that, the models underestimate 20th century warming. If you include the forcing, you get an approximate fit to the observed warming.

Verification: That’s where you make predictions based on expected CO2 levels over the late 20th and 21st century, and compare them to events as they unfold. Predictions include warming at the poles and at elevations, leading to a loss of mountain glaciers and increased melt from Greenland and Antarctica, as well as expansion of subtropical dry zones associated with a warming troposphere. That’s all been observed.

The funny thing about denialists is that they are inconsistent – if the models show warming but the observational data shows less, they’ll howl about poor models – and if the data shows more warming than the models, they’ll point to the model prediction as proof that there’s something wrong with the observations – like clockwork. To see how this works, look here:

https://www.realclimate.org/index.php/archives/2008/12/2008-year-in-review/#comment-107910

Sallie Baliunas 1999: “One demanding test of the validity of the computer simulations of the climate of the earth is based on temperature records from the Arctic…”

You see, in the early 1990s a large volcanic eruption threw a lot of aerosols high into the atmosphere, providing a unique semi-experimental test of global model predictive abilities. It also cooled the planet slightly – but nevertheless, there was a record of steady sea ice loss in the Arctic in the classified archives of the U.S. Navy. The validated model’s predictions of the Pinatubo effect were also verified. Now, if you compared a model prediction prepared in 1988 with the observed temperature in 1997, you would find things cooler than expected… for obvious reasons.

Of course, after the past decade’s Arctic warming was noted, as per predictions, the denialists all admitted they were wrong and printed retractions – but only after the press dragged them all before the national cameras and gave them a Hansen-style grilling on Glen Beck and Larry King Live… or did they?
dhogaza says

11 Feb 2009 at 9:54 PM

Of course, after the past decade’s Arctic warming was noted, as per predictions, the denialists all admitted they were wrong and printed retractions – but only after the press dragged them all before the national cameras and gave them a Hansen-style grilling on Glen Beck and Larry King Live… or did they?

I always knew you normally reside in Bizarro World, where everything is the opposite of the way things are here! :)
Nicolas Nierenberg says

11 Feb 2009 at 10:25 PM

Dr. Schmidt,

I think it would be interesting to do additional tests on the model driven data from your paper. Are the surface temperature data and emulated MSU data from your model runs available for download?

[Response: Have at it – here. – gavin]
sidd says

12 Feb 2009 at 9:21 AM

Dan Hughes writes at 4:21 pm on the 11th of February 2009:
“…do the solutions of the discrete equations converge to solution of the continuous equations. Neither of these has been demonstrated for any GCM.”

Mr. Hughes will no doubt be shocked and appalled to learn that there is not even any proof or disproof of the mere existence, or lack thereof, of solutions to the 3-D Navier-Stokes equations, even in the incompressible case, and that a large prize is being offered by the Clay Institute for such proof or disproof.

His scepticism notwithstanding, I can, and do, wield the 3-D compressible Navier-Stokes in anger, and obtain many useful results therefrom. So does every fluid engineer in the world. Merely because we umble fizicists cannot prove that a solution exists does not prevent Nature from solving it every day all around us. All we have to do as someone said, is get close enough.
Robert says

12 Feb 2009 at 1:44 PM

Sorry, this isn’t necessarily relevant to the topic being discussed but I thought I would solicit a response here.

Although I disagree with the consensus of this community in regards to AGW, I do find that most of you are rational in terms of the predicted consequences of AGW, which are at odds with how the media generally portrays those consequences.

Recently, James Hansen has been attributed with the claim that we only have 4 years to do something about AGW, which seems to be an outlandish and alarmist claim. How do the members of this community regard this claim?

Your rational and non defensive responses are appreciated.

[Response: When somebody quite sensible is quoted as saying something patently ridiculous, the response should be to question whether that was really what was said and whether the context supports the interpretation. Hansen’s statements are all available on his website and are well worth reading carefully. In this case (see page 3), he is clearly referring to the length of a presidential term, not the imminent collapse of world civilisation. – gavin]
Maya says

12 Feb 2009 at 3:17 PM

“I’ve never understood all the hoopla about the codes.”

Ahhh…job security for me! ;) I’m a software test engineer, a.k.a. software quality assurance analyst, a.k.a. whatever we happened to be called at a given company.

In *any* field where results depend on numbers being crunched by software, there is the possibility that an error can be introduced by a bug in the code. It may be trivial or insignificant, but it can also be catastrophic (Mars lander, anyone?). “Catastrophic” doesn’t happen often, thank goodness, but having someone *other than the person who wrote the code* check it for bugs is a basic precaution against spectacular failure. It’s amazingly easy for the author of the code to overlook something that will be obvious to a tester, simply because the author knows what he/she *meant* the code to do, therefore it *looks* like the code does it that way.

So my point is, it may seem like bugs in the code are not a big deal because ideally they get fixed before software is used or released. You won’t notice they aren’t there because, well, they aren’t there.
Mark says

12 Feb 2009 at 5:56 PM

Maya, #161, so you’re requesting work that you can and are employed to do should be done in an entirely new field..?

Looking to make a lot more money, are we?
wildlifer says

12 Feb 2009 at 6:05 PM

@161,
Wouldn’t it behoove one to create and test one’s own code on the data? It seems to be all about theft over toil to me.
Arthur Smith says

12 Feb 2009 at 6:43 PM

I’m finding the discussion here reminiscent of my own career – 5 years as a postdoc mostly running computational codes of one sort or another, added to my graduate degree in physics, left me with somewhere around 50,000 lines of code I had either written or heavily modified for my purposes (mostly C, some fortran, some perl – this was 15 years ago). A few bits and pieces were original and I put some effort in to make them shareable – graphics and PostScript creation, a multi-dimensional function integrator, etc. A few were done as part of much larger projects and at least ended up under proper revision control as a contribution to that project (that was my intro to CVS). But most were one-off things that tested some hypothesis, interpreted some data file, or were some sort of attempt at analysis. 90% of the time they weren’t a lot of use, and spending extra time documenting would have seemed pretty worthless – I used “grep” a lot to find things later. Sure they could have been made public, but nobody would have any idea what command-line arguments I’d used or the processing steps I’d taken, except in those rare instances where I anticipated my own reuse and created an explanatory “README”. Probably simpler for another scientist to just do it over from scratch than try to figure out what I’d done from looking at the code.

And now I’m a professional software developer in a group where we have quite rigorous test and development procedures, everything is checked into a version control system and regularly built and run against regression tests to keep things robust. Nevertheless, I still have a directory with hundreds of one-off scripts that fit in that same category of being easier to rewrite than to generalize, and there’s little purpose in making them publicly available or putting them under version control since at most I’ll use them as starting points for other scripts rather than re-using as they are in any significant way.

I’m not sure it was Fred Brooks or somebody else, but the expression I recall reading long ago was that turning a prototype into an internal software product took roughly a factor of 3 more effort, and turning an internal product into something you could publicly distribute (or sell) took roughly a factor of 3 times the effort beyond that. Software always falls along this spectrum, and most of what scientists use tends to be at the “prototype” level, simply because of the exploratory nature of science. Theoretically it would be nice to have the resources to keep everything clean and nicely polished, but if 90% of it is code you’re never going to re-use, what’s the point?
Robert says

12 Feb 2009 at 8:58 PM

Gavin,

I appreciate the response. I read page 3 and take James Hansen’s interpretation to mean that his original answer gave the impression that we had 4 years to solve the problem but wasn’t reflective of what he meant. His solution was to be more careful in how he answered those types of questions. I applaud him for recognizing that he should be more precise when answering questions from the media.

Thanks for addressing my question.

-Robert
John Mashey says

13 Feb 2009 at 12:24 AM

re: #164 Arthur
Yes, that was Fred. It’s Figure 1.1, the first one in the book, a 2×2 square:
Program (1), Programming System (3)
Product (3), Programming System Product (9)
John Burgeson says

13 Feb 2009 at 10:24 AM

Thanks to this site for leading me to Spencer Weart’s great
book THE DISCOVERY OF GLOBAL WARMING.

I literally could not put it down. So much of the puzzle fell into place for me as I read it. I intend to recommend it (along with this site) in future articles for the Rico Bugle and will mention it on both the ASA (American Scientific Affiliation) mail list and my own web site.
Naj Tam Hudat says

13 Feb 2009 at 12:08 PM

On 13 February 2009 at 10:24 AM John Burgeson wrote:

“Thanks to this site for leading me to Spencer Weart’s great book THE DISCOVERY OF GLOBAL WARMING. .. So much of the puzzle fell into place for me as I read it…”

If you don’t mind my asking, I am interested in learning which pieces of the puzzle the book helped you with in order to know if it would be worth my time to read. Thank you.
John Burgeson says

13 Feb 2009 at 12:36 PM

A fair question. My introduction to the whole GW issue took place only recently (late 2006) when I was urged to read and review Al Gore’s book for the Rico Bugle. That review was published in PSCF (the journal of the ASA) and also in the Bugle. A copy of it is at http://www.burgy.50megs.com/inconvenient.htm

I had read (off and on) other articles and books on the issue, but Gore’s book convinced me that the subject was one in which I needed to invest some serious study. (For the past 15 or 20 years my efforts have been directed to fighting the pseudoscience of the “young earth” advocates.) Once a physicist, I spent most of my life in the computer business with IBM, retiring almost 15 years ago. I know — at my advanced age, what should I care? Well — I have 12 grandchildren!

Initially my reading was pretty much ad hoc. I wrote about the issue in December 2008 (www.burgy.50megs.com/gw.htm) but at that point I still had not really understood the science-history behind it all.

Weart’s book filled in the gaps. I had had no idea the work behind GW went back so many years, or had involved so many people. It is simply a great story, and forms, I assert, a foundational understanding for anyone who, having been sucked into the controversy, wants to understand it from the beginnings. The “puzzle pieces missing” were the pre-21st century efforts, studies and events.

Again, I recommend it. It is definitely “worth your time” to read.

Thanks for asking.

John (Burgy)
tamino says

13 Feb 2009 at 12:42 PM

Re: #168 (Naj Tam Hudat)

As about a thousand people who regularly read this blog can tell you, Spencer Weart’s Discovery of Global Warming is definitely worth the time to read. As for which pieces of the puzzle will fall into place … ALL of them.
Ray Ladbury says

13 Feb 2009 at 12:51 PM

Naj, Weart’s book is excellent for placing current science in a historical context. If you want to understand how climate scientists have reached their understanding of the driving forces in climate, this is an invaluable resource. READ IT!!! It’s also a good read.
Naj Tam Hudat says

13 Feb 2009 at 1:53 PM

To Burgy, tamino, & Ray (#’s 169,170,171) above:

Thank you very much for your comments. I will definitely look into it.
Chris Colose says

13 Feb 2009 at 6:32 PM

Naj Tam Hudat,

You will be interested also in Spencer Weart’s website. RC has a link on the right side of your screen “AIP:Discovery of Global Warming” which I personally find more useful and comprehensive than the book (and obviously more accessible).
Maya says

13 Feb 2009 at 8:22 PM

“Maya, #161, so you’re requesting work that you can and are employed to do should be done in an entirely new field..?

Looking to make a lot more money, are we?”

Um, no….. I’m really puzzled by your question. To what field do you imagine software testing is confined? Software = computer code. Certainly some fields use computers more than others, but I honestly can’t think of any scientific endeavor that doesn’t use computers at all.

“Wouldn’t it behoove one to create and test one’s own code on the data? It seems to be all about theft over toil to me.”

Theft by whom? The tester? Why? I have a hard time imagining what would be worth the risk. I mean, I won’t swear it never happens, but I’ve never personally encountered a case of industrial espionage. Heck, even insider trading will get you crucified – I shudder to think what would happen to someone caught selling proprietary source code.

And yes, a developer tests his/her own code, but it’s a basic principle of software development that you also have someone else look at the code. Even in shops that weren’t big enough to have a dedicated QA staff, we checked each other’s code, both in code reviews and in functional testing. It’s just way, way too easy to overlook an error in your own code. I’ve been a developer as well as a tester, so I’m not pointing fingers at developers – it’s just human nature. It’s metaphorically akin to not being able to find your eyeglasses because they’re on top of your head.
James says

13 Feb 2009 at 11:54 PM

Maya Says (13 February 2009 at 8:22 PM):

“And yes, a developer tests his/her own code, but it’s a basic principle of software development that you also have someone else look at the code…”

Even/especially in development. There’ve been plenty of times when I’ve been beating my brains out over some problem or other, and had someone else look at the code and spot the problem in minutes – and of course I’ve done the same for other people. All too often, you see what you expect, not what’s really there, as for instance repeated word errors in writing…
Mark says

14 Feb 2009 at 8:10 AM

But the product of software development IS the software.

The product of software development in science is the science.

Now, if there ARE errors, they could be found as easily by proper review of the code as by running the science again with different software.

After all, fly-by-wire systems use triple redundancy in hardware, software and language used because their designers believe that even the most rigorous review possible is not enough to determine there is no error, and that the only way to determine there is no error is to run the same process on the same data with different software and check the results.

Which, oddly enough, is what science does at the moment…
Michael Tobis says

14 Feb 2009 at 10:57 AM

Mark seems to think I am a delayer, because I have a disagreement with Gavin.

In fact, as far as policy goes I think I’m pretty much in Hansen’s camp. CO2 emissions need to be restrained as vigorously and as early as is feasible. I consider that to have been established by 1990. Whatever further climate research is or should be about, it is not that. (We can call on new science to argue about what the safe level should be, but that is no argument for delay in getting started on a new trajectory.)

So it’s peculiar to be cast as a member of the bad guys on the basis of a secondary disagreement. It’s both amusing and disturbing. Perhaps Mark is making this mistake because he has an oversimplified model of public discourse which has probably been set by the horribly polarized politics of the US over the past twenty years.

All I’m saying here is that the argumentative position taken by Eric and Gavin is one they ought to loosen their grip on.

Much talk here has been about publication and portability of code. These indeed are thorny issues. However, the idea that one should be able to replicate published results locally should not be controversial. What’s more, for desktop codes on commercial platforms like Matlab, this is easy to achieve. That’s all I’ve been trying to say here. Arguments to the contrary on those points seem awfully strained to me.

My reason for concern and active participation here is that arguments that seem plausible to scientists and implausible to practicing engineers weaken the reputation of the field among influential groups where it matters. This stems from a mutual disrespect between scientists and engineers that is very counterproductive. As someone who is a little of both, it hasn’t helped me any, either.
Curt Covey says

14 Feb 2009 at 3:20 PM

Gavin,

I’m glad you started this discussion. Clearly a lot of people who are interested in climate science, but are not scientists themselves, underestimate the inherent difficulty of perfectly duplicating a result that appears in a peer-reviewed scientific journal. Nevertheless we must exercise due diligence in presenting our results in ways that make them reproducible.

To me the key distinction is between “data sharing” and “data auditing.” The former is essential for science to progress. The latter, except in special cases where misconduct is reasonably suspected, just slows things down.

Regards,
Curt
Hank Roberts says

14 Feb 2009 at 9:54 PM

http://www.google.com/search?q=define%3Asciolist
Walt Bennett says

14 Feb 2009 at 10:54 PM

So, how’s the science coming?
Bernie says

15 Feb 2009 at 12:12 AM

#178 Curt Covey (14 February 2009 at 3:20 PM )
I think the argument that Michael T has been stressing is that some scientists underestimate the inherent ease “of perfectly duplicating a result that appears in a peer-reviewed journal.” Some scientific fields simply haven’t embraced the standards and work practices that will enable this. To argue otherwise exposes those claiming it is difficult to a loss of credibility, which casts a shadow on the substance of their research.
Lawrence Coleman says

15 Feb 2009 at 2:53 AM

Why must we wait another 5 years for the IPCC 5th assessment report? Why can’t this be fast-tracked to 2013 or earlier? My understanding is that the 4th IPCC assessment did not mention many +ve feedback loops because the mechanics behind them are poorly understood: the 1 GTonne of methane and CO2 tied up in the melting Arctic tundra; the fact that more equatorial forest is burning, turning a vital carbon sink into a very efficient carbon source; and the fact that the low-altitude ocean wind currents are increasing, bringing to the surface CO2-saturated water that can’t absorb much more atmospheric CO2. None of these factors was mentioned in the last report. Oh, I almost forgot: China and India are steaming ahead, despite the blip of a global financial crisis, spewing millions of additional tonnes of soot and CO2 into the air from uncontrolled reliance on coal-fueled industry and power stations. If you feed those positive feedback loops into the equation of how long we have until the point of no return is reached, I think you’ll find we have crossed that fateful mark already. What do you think? Unstoppable? Yep! Good description.
Mark says

15 Feb 2009 at 7:04 AM

Bernie, Bernie, I says, who but the scientist knows or needs the science? Surely since they are the ones who have to use the paper to replicate or refute the result, and the code doesn’t have to follow the documentation (see for example MSDN code that doesn’t work), should they not be considered when assessing the relative work/benefit assessment?
Mark says

15 Feb 2009 at 7:13 AM

Re 177, no, I think your idea is a delayer. Stop trying to work the victim angle. You’re not. Your idea isn’t going to do much more than slow things down.

In fact, if you figure that it is already proven, there’s no need to pursue more funding to investigate, and no need to re-prove old findings. And the idea you’re peddling is just a way to get more work (which requires more funding).
Lawrence Coleman says

15 Feb 2009 at 10:08 AM

Duplication in climate sciences must be extremely difficult; it’s not easy to compress our biosphere down to test tube size. It’s very dangerous as well, because if two results are out even slightly you will tend to cast suspicion on the entire experiment. Nothing in our biosphere is exactly the same moment by moment. There is way too much complex interaction among myriads of variables, from the atomic level up to entire weather systems. If we wait and wait and delay and procrastinate and call for other tests and universities to confirm previous experiments for precise duplication/replication, and do nothing to address climate change, we are history. Simple as that.
This is where governmental logic must play a vital part. If Barack Obama understands the mechanics and gist of climate change without having to wait for every scientific/bureaucratic ‘i’ to be dotted and ‘t’ crossed, and takes swift and decisive action, we might have a glimmer of a chance, but not if he waits for a certain unrealistic degree of accuracy or perfect replication in the data.
Bernie says

15 Feb 2009 at 10:50 AM

#184
Markie Mark: I am confused as to the meaning of your comment. Which scientists are you referring to? (Cui decide?) You also seem to be arguing that there is some significant burden in documenting the code and packaging the data? This simply does not seem to be the case in the vast majority of instances.
dhogaza says

15 Feb 2009 at 12:16 PM

You also seem to be arguing that there is some significant burden in documenting the code and packaging the data? This simply does not seem to be the case in the vast majority of instances.

Then why does it cost commercial software companies so much money?
Hank Roberts says

15 Feb 2009 at 12:18 PM

> this simply does not seem

Baloney, read the thread. The nonscientists can’t see the problem any scientist has with the notion of freely giving them everything they need to act like scientists — in a cookbook recipe form tied up with a bow. You want help to act like a scientist without knowledge.

If it were that easy, you’d have been taught how to do real science in grade school. Instead they tried to teach you basics. Guess why.
Nicolas Nierenberg says

15 Feb 2009 at 1:27 PM

Dr. Schmidt,

Could you describe the process that you used to convert the RSS data for use in your paper? I note that both RSS and UAH are on 2.5 degree grids, but I believe the surface temperature data is on a 5 degree grid.

I’m asking because the RSS data that you provided and the UAH data that Dr. McKitrick provided show identical trends over the time period (.237 versus .232), but they have very different standard deviations (.183 versus .133).

Given that both of these are measuring the same thing, and that their trends are identical, I’m wondering what is causing the difference in standard deviations between the two provided anomaly trends.

I note that I have posed the same question to Dr. McKitrick over at CA.
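
The regridding step asked about above can be illustrated with a minimal sketch. It assumes simple, unweighted 2x2 block averaging from a fine grid onto a coarser one; that is purely an assumption for illustration, not a method either author is known to have used, and the trend values are made up. A real conversion would also need area weighting by latitude and handling of missing cells. The sketch shows one way two summaries can share the same mean trend while differing in spread: averaging neighbouring cells leaves the mean unchanged but cannot increase the standard deviation across grid boxes.

```python
from statistics import mean, pstdev


def block_average(grid, factor=2):
    """Average non-overlapping factor x factor blocks of a 2-D list of floats.

    Unweighted averaging is an assumption made for illustration only.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    if n_rows % factor or n_cols % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    coarse = []
    for r in range(0, n_rows, factor):
        row = []
        for c in range(0, n_cols, factor):
            cells = [grid[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            row.append(mean(cells))
        coarse.append(row)
    return coarse


def flatten(grid):
    """Return all cell values of a 2-D list as one flat list."""
    return [value for row in grid for value in row]


# Hypothetical 4 x 4 field of trends on the fine grid (made-up numbers).
fine = [
    [0.10, 0.30, 0.20, 0.40],
    [0.30, 0.10, 0.40, 0.20],
    [0.25, 0.05, 0.35, 0.15],
    [0.05, 0.25, 0.15, 0.35],
]
coarse = block_average(fine)

print("mean  fine/coarse:", mean(flatten(fine)), mean(flatten(coarse)))
print("stdev fine/coarse:", pstdev(flatten(fine)), pstdev(flatten(coarse)))
```

If the two satellite products were summarized onto the surface grid in different ways, a difference of this kind would be expected even with matching mean trends. Whether that is what happened here cannot be settled from the numbers alone.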
Bernie says

15 Feb 2009 at 5:02 PM

Hank:
I followed the thread. Your “cookbook” argument is pure unadulterated evasion, a head fake. Just as the pretense that it takes too much time or that “scientists” need to write undocumented code. It is a silly and juvenile argument. Michael T called Eric on it. Times have changed. Expectations have changed. Standards have changed. This is an argument those who decline to post their code and data cannot possibly win.
Nicolas Nierenberg says

15 Feb 2009 at 5:07 PM

To the points argued here about replication.

I am currently looking at the Schmidt paper and the MM paper. Both authors have provided data sets. But as I looked at the data sets I was surprised to find that the distribution of data was different in the summarized RSS data than in the summarized UAH data even though the mean trends were identical.

At the moment I can’t tell if this was an artifact of the original data, or the way that it was summarized to produce the decadal trends on these particular grid locations.

The fact is that I don’t have the code or method that Dr. Schmidt used to summarize the RSS data. I have some old files written in an obscure language that Dr. McKitrick has pointed me to.

Now if I want to investigate I am forced to do work to try to duplicate the methodology used by each author, and I won’t know I am doing it correctly until I get something that matches their data. Wouldn’t it have been nice if they had used the same summarization code, or commented on why that wasn’t appropriate?

I want to point out that it is really no big deal in the grand scheme of things, but it is a good example.
Ray Ladbury says

15 Feb 2009 at 8:29 PM

Bernie, First, who says scientists are writing undocumented code? It is merely that the documentation is not intended for an outsider. My PhD was in another computer intensive area–experimental particle physics. Each group writes most of its own analysis, precisely because each analysis is unique. In our group, we usually had multiple people looking for the same particles, and we were careful not to share code even within our own group in order to preserve the independence of the efforts. Eventually, particle physics did start producing common code–for example, the GEANT package of Monte Carlo and analysis codes. The point is that when the need arises, the solution will be developed by the scientists themselves, and it will be a helluva lot better than any solution imposed from without by a bunch of self-appointed “auditors”.
I would suggest that if you are interested in the subject, you start with learning the science. You will then be able to produce your own code.
trying_to_make_sense says

15 Feb 2009 at 9:33 PM

Dr Nierenberg (#191)

Although I do not work in the area of climate science, what I would do when faced with such a situation in my own research is
(i) mail the authors,
(ii) irrespective of the reply, test the algorithm for both the grids. The fact that you find different results is a more interesting observation than mere duplication (note: not confirmation) of the results would have been. This means that either the results are not robust, or they might depend on some other factor. Figuring out why is where the next step in our understanding would lie.

In passing, your comment about the code from Dr McKitrick being in an obscure language summarizes one of the major arguments against the idea that providing code is in any way required or useful for science. Coding languages change, software versions change (not always in a backward-compatible manner), and even algorithm implementations change. A mathematical description is the only useful thing beyond a couple of years. I would even venture that, with the proliferation of versions of software tools as well as of the code that any functional research group would invariably produce, a mathematical description is the only thing that I would advise be made public.
Phil. Felton says

15 Feb 2009 at 10:21 PM

Re 191

I am currently looking at the Schmidt paper and the MM paper. Both authors have provided data sets. But as I looked at the data sets I was surprised to find that the distribution of data was different in the summarized RSS data than in the summarized UAH data even though the mean trends were identical.

UAH and RSS use quite different approaches in their coverage of the Antarctic, which probably accounts for the seasonal difference between the two. Perhaps that is related to your problem?
sidd says

16 Feb 2009 at 12:31 AM

This whole discussion seems to center on the lack of programs used to arrive at the conclusions in published papers. I suggest that the place to look is not in the papers, but rather in the dissertations of the grad students who are listed as authors. I recall that the last several score pages of my dissertation were a listing of code that I used. In my grad school days, some decades ago, when I wanted to steal code, if for some reason I could not get it directly from the principals, I found that University Microfilms (in the USA) was my friend. (I must say that I usually regretted stealing the code and rapidly wrote my own…) And these days, one can usually find the code on the web directories of the usual suspects.
Steve Bloom says

16 Feb 2009 at 12:50 AM

Re #182: “Why must we wait another 5 years for the IPCC 5th assessment report?”

Lawrence, your plea has been answered! Next month in Copenhagen there will be held what amounts to a “mid-course correction” conference for the IPCC. We’ll be hearing a lot about it, I’m sure.

Re #191: “Now if I want to investigate I am forced to do work to try to duplicate the methodology used by each author, and I won’t know I am doing it correctly until I get something that matches their data.”

But if you come up with a valid method that gets a different result, you’ve learned something important. It seems to me that if you’re not doing something along those lines, you’re doing “auditing” rather than science. I think this nicely illustrates Curt Covey’s point in #178.
Doug Bostrom says

16 Feb 2009 at 3:21 AM

Re 190, events leading up to it: I think we’ve just witnessed the birth of a brand-spanking new red herring. How touching.

I’m sorry to see how much time has been spent on this topic by the tiny handful of people actually performing useful work on the topic of climate change. I’m even sorrier to imagine the continued drain on time as disingenuous and strangely inconsistent demands for climate simulation-related code multiply like rabbits.

All this talk of textbook perfect documentation and transmission of what are essentially one-off runs is useless smoke; I’m sure there are a few people who can’t help but actually care deeply about buttoned down collars and perfectly knotted ties but in point of fact most of us prefer to use our time more productively.

In any case all this generalized discussion of how nice it would be to have prettified code sort of begs the question of whether anybody asking for it has even the most rudimentary capability of understanding what it does. Personally, I doubt it, otherwise they’d be publishing in the climate field.

At the end of the day, the only thing that really matters is data. It’s up to the peanut gallery to come up with their own processing methods.
pete best says

16 Feb 2009 at 5:07 AM

Re #182, I doubt that scientists will comment on anything that is not knowable in this realm. Lots of things that happen on our planet are short-term change or natural variability, and hence the things being talked about, a faster-warming Arctic and the potential large-scale release of methane clathrates, may or may not happen. The pro climate change media seem to be stating that it’s a bona fide fact that it will be released, but it’s not a scientific fact as yet. When the Arctic sea ice melted faster than predicted in 2007, it could have been due to AGW or it could have been due to natural variability, and within a few years the sea ice may indeed recover sufficiently that the predictions of AGW may be right.

The IPCC has got to be sure on this and cannot just state that their models are conservative relative to the real world and that warming is happening faster than predicted. Their next report is 2012, which is fine, as it takes all this time to collate all of the information. Let us get the science right and put to bed the arguments the media loves, opinion and debate on the subject.
Michael Smith says

16 Feb 2009 at 6:36 AM

Dr. Schmidt, in your paper you state the following criteria for evaluating Ross’s correlations:

“If the distribution encompasses the observed correlations, then the null hypothesis (that there is no contamination) cannot be rejected.” (Emphasis added)

Since the distribution of correlations that you generated does not encompass the correlations found by McKitrick, this means that logically — by your own criteria — we must reject the null hypothesis and conclude that there is contamination in the data. Why, then, have you (apparently) abandoned the very criteria you stated in the paper and instead concluded that the McKitrick correlations are “spurious”?

[Response: The statement you quote is true and applies in general, in particular, to the discussion of the de Laat and Maurellis paper. It is not however the sole reason why a correlation might be spurious. If you read the section regarding MM07, the issue there is that the quoted significance is likely to be very strongly over-estimated:

… the preponderance of nominally significant correlations certainly implies that the reported F-test values are not a fair assessment of the hypothesis put forward by MM07. We find that supposedly 95% significant correlations to ‘g’ and ‘e’ (in experiment G3) occur in 3 and 4 (respectively) simulations out of 9, roughly 7 times as often as should be expected if the ‘significance’ test used by MM07 had even its minimum reported power. This clearly demonstrates that there are far fewer degrees of freedom in these correlations than they assumed.

It would be nice to have a good estimate of the distribution of the correlations of the null hypothesis in this case, but I doubt it can be properly defined with only 5 model runs. I would encourage someone interested in pursuing this further to start looking at the full set of AR4 model runs to get a better handle on it. Until then, it will be unclear how the line you quote applies to MM07. – gavin]
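
The arithmetic behind “roughly 7 times as often” can be checked with a short sketch. It assumes that a nominal 95% test should give a false positive in about 5% of cases. It also treats the 9 simulations as independent trials in order to compute a binomial tail probability; that independence is an extra assumption made here for illustration and is not claimed in the response above.

```python
from math import comb


def binomial_tail(n, k, p):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


n_sims = 9     # number of simulations quoted in the response
alpha = 0.05   # false-positive rate implied by a nominal 95% test

ratios = {}
for hits in (3, 4):  # counts quoted for 'g' and 'e' respectively
    observed_rate = hits / n_sims
    ratios[hits] = observed_rate / alpha
    print(f"{hits}/{n_sims}: {ratios[hits]:.1f} times the nominal rate, "
          f"tail probability {binomial_tail(n_sims, hits, alpha):.4f}")
```

Under these assumptions, 3 of 9 is a little under 7 times the nominal rate and 4 of 9 is closer to 9 times. Either count would be very unlikely if the test really had a 5% false-positive rate.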
Mark says

16 Feb 2009 at 6:45 AM

Bernie-wernie, #186, it isn’t needed. It’s probably *nice* to have, but it isn’t needed and, like I say time and again but nobody seems to be reading it, fly-by-wire systems don’t rely on proven code methodologies. They rely on replication of results and voting on what to do to make the system reliable.
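
The replicate-and-vote idea can be sketched as follows. This is a toy illustration only, not how any actual fly-by-wire system is built: the three functions are hypothetical, independently written implementations of the same calculation, one of them deliberately buggy, and the vote simply compares results rounded to a fixed number of digits.

```python
from collections import Counter


def majority_vote(results, ndigits=6):
    """Return the value a strict majority of results agree on, else raise.

    Agreement is judged after rounding, a simplification for this sketch.
    """
    rounded = [round(r, ndigits) for r in results]
    value, count = Counter(rounded).most_common(1)[0]
    if count * 2 <= len(rounded):
        raise RuntimeError("no majority: implementations disagree")
    return value


def mean_a(xs):
    """First hypothetical implementation: built-in sum."""
    return sum(xs) / len(xs)


def mean_b(xs):
    """Second hypothetical implementation: explicit loop."""
    total = 0.0
    for x in xs:
        total += x
    return total / len(xs)


def mean_c(xs):
    """Third hypothetical implementation, deliberately wrong divisor."""
    return sum(xs) / (len(xs) - 1)


data = [0.1, 0.2, 0.3, 0.4]
agreed = majority_vote([f(data) for f in (mean_a, mean_b, mean_c)])
print("agreed result:", agreed)
```

The two correct implementations outvote the buggy one without anyone having reviewed a line of its code, which is the point being made about independent replication.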