Gavin Schmidt and Eric Steig
The last couple of weeks saw a number of interesting articles about archiving code – particularly for climate science applications. The Zeeya Merali news piece in Nature set the stage, and the commentary from Nick Barnes (of ClearClimateCode fame) proposed an ‘everything and the kitchen sink’ approach. Responses from Anthony Vejes and Stoat also made useful points concerning the need for better documentation and proper archiving. However, while everyone is in favor of openness, transparency, motherhood and apple pie, there are some serious issues that need consideration before the open-code revolution really gets going.
It would help to start by being clear about what is meant by ‘code’. Punditry about the need for release of ‘all supporting data, codes and programmes’ is not very helpful, because it lumps together very simple things, like a few lines of Matlab script used to do simple linear regressions, with very complex things, like climate model code, which is far more sophisticated. The issues involved in each case are quite different, for reasons that are scientific and professional as well as organizational.
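To make the contrast concrete, the simple end of that spectrum really can be just a few lines. Here is a minimal sketch (in Python rather than Matlab, with an entirely hypothetical input file and column layout) of the kind of ‘code’ that sits behind many published trend estimates:

```python
# Ordinary least-squares trend of a temperature anomaly series.
# The file name and column layout are hypothetical placeholders.
import numpy as np

data = np.loadtxt("anomalies.txt")        # columns: year, anomaly (deg C)
years, anomalies = data[:, 0], data[:, 1]

slope, intercept = np.polyfit(years, anomalies, 1)  # degree-1 (linear) fit
print(f"trend: {10 * slope:.3f} deg C/decade")
```

Archiving this is trivial; archiving a million-line climate model is a different proposition entirely.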
First, the practical scientific issues. Consider, for example, the production of key observational climate data sets. While replicability is a vital component of the enterprise, it is not the same thing as simple repetition. Independent replication counts far more towards acceptance of a result than merely demonstrating that, given the same assumptions, the same input, and the same code, somebody can get the same answer. It is far better to have two independent ice core isotope records from Summit in Greenland than it is to see the code used in the mass spectrometer for one of them. Similarly, it is better to have two (or three or four) independent analyses of the surface temperature station data showing essentially the same global trends than it is to see the code for one of them. It is better that an ocean sediment core corroborates a cave record than to inspect the code that produced the age model. Our point is not that the code is not useful, but that this level of replication is not particularly relevant to the observational sciences. In general, it is the observations themselves – not the particular manner in which they are processed – that are the source of the greatest uncertainty. Given that fundamental outlook, arguments for completely open code are not going to be seen as priorities in this area.
By contrast, for developers of climate models, the code is the number one issue, and debugging, testing and applying it to interesting problems is what they spend all their time on. Yet even there, it is very rare that the code itself (much of which has been freely available for some time) is an issue for replication; what matters much more is whether multiple independent models show the same result (and even then, you still don’t know for sure that it necessarily applies to the real world).
The second set of issues is professional. Different scientists, and different sciences, have very different paths to career success. Mathematicians progress by providing step-by-step, line-by-line documentation of every proof, while data-gathering paleo-climatologists thrive on their skill in finding interesting locations for records and applying careful, highly technical analyses to the samples. In neither case is ‘code’ a particularly important piece of the science.
However, there are many scientists working on analysis or synthesis who make heavy use of increasingly complex code, applied to increasingly complex data, and this is (rightly) where most of the ‘action’ has been in the open code debate so far. But it is also where the conflicts between scientific productivity at the individual level and at the community level are most stark. Much of the raw input data for climate analysis is freely available (reanalysis output, GCM output, paleo-records, weather stations, ocean records, satellite retrievals, etc.), and so the skill of the analyst lies in how they choose to analyse those data and the conclusions they are able to draw. Very often, a novel methodology applied to one set of data to gain insight can be applied to others as well. An individual scientist with such a methodology might therefore understandably feel that making duplication of their type of analysis ‘too simple’ – that is, providing the code rather than carefully describing the mathematical algorithm – will undercut their own ability to get future funding to do similar work. There is certainly no shortage of people happy to use someone else’s ideas to analyse data or model output (and in truth, there is no shortage of analyses that need to be done). But to assume there is no perceived conflict between open code and what may be thought necessary for career success – and for the advancement of a science that benefits from a bit of competition for ideas – would be naïve.
The process of making code available is clearly easier if it is established at the start of a project that any code developed will be open source, but taking an existing non-trivial code base and turning it into open source is not simple, even if all participants are willing. In a recent climate model source code discussion, for instance, lawyers for the various institutions involved were very concerned that code historically incorporated into the project might have come from outside parties who would assert copyright infringement over their bits of code if it were now to be freely redistributed (which is what the developers wanted). Given that a climate model project might have been in existence for 30 years or more, and involved hundreds of scientists and programmers from government, universities and the private sector, even sorting out who would need to be asked was unclear. And that didn’t even get into what happens if some code that was innocently used for a standard mathematical function (say, a matrix inversion) came from a commercial copyrighted source (see here for why that’s a problem).
Yet the need for more code archiving is clear. Analyses of the AR4 climate models done by hundreds of scientists not affiliated with the modelling groups are almost impossible for the groups developing the next generation of models to replicate on a routine and scalable basis, and so improvements in those metrics will not be priorities. When it comes to AR5 (for which model simulations are currently underway), archiving of code will certainly make replication of the analyses across all the models and all the model configurations much less hit-or-miss. Yet recently, it was only recommended, not mandated, that the code be archived, and no mechanisms (AFAIK) have yet been set up to make even that easy. In these cases, it makes far more sense to argue for better code archiving on the basis of operational need than on the basis of scientific replication.
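To see why the operational argument is compelling, consider what replication across an ensemble looks like once an analysis is archived as a callable function rather than described in a methods section. The sketch below is ours, not any group’s actual workflow: the directory layout, file naming and the diagnostic itself are hypothetical placeholders.

```python
# Rerunning one archived diagnostic across every model in an archive.
# Paths, file layout and the diagnostic are illustrative only.
from pathlib import Path
import numpy as np

def decadal_trend(series):
    """Placeholder diagnostic: linear trend per decade of an annual series."""
    slope, _ = np.polyfit(np.arange(len(series)), series, 1)
    return 10.0 * slope

results = {}
for f in sorted(Path("archive/ar5").glob("*_tas_global.txt")):
    model = f.stem.split("_")[0]          # e.g. "modelA_tas_global" -> "modelA"
    results[model] = decadal_trend(np.loadtxt(f))

for model, trend in sorted(results.items()):
    print(f"{model:20s} {trend:+.3f} deg C/decade")
```

With the code archived, extending the comparison to a new model configuration is one more file in the loop; without it, it is an email to the original author.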
This brings us to the third, and most important, issue, which is organizational. The currently emerging system of archiving ‘by paper’ does not serve the operational needs of ongoing research very well at all (and see here for related problems in other fields). Most papers for which code is archived demonstrate the application of a particular method (or methods) to a particular data set. That code can be broken down into generic code that implements the method (the function) and paper-specific code that applies the method to the data set at hand (the application). Many papers use a similar method in varied applications, and with the current system of archiving by ‘paper’, the archived code conflates the two aspects, making it harder than necessary to disentangle the functionality when it is needed in a new application. This leads to the archiving of multiple versions of essentially the same functional code, causing unnecessary confusion and poor version control.
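The distinction between the function and the application is easy to show in miniature. In the hypothetical sketch below (our illustration, not any published method), the generic half would live once in a shared archive, while the thin paper-specific half is all that is really particular to the paper:

```python
import numpy as np

# --- the generic 'function' half: lives once in a shared archive ----
def detrended_anomaly(series, base):
    """Remove a base-period mean and a linear trend from an annual series.
    (Illustrative method only, not taken from any particular paper.)"""
    anomaly = series - series[base].mean()
    t = np.arange(len(anomaly))
    slope, intercept = np.polyfit(t, anomaly, 1)
    return anomaly - (slope * t + intercept)

# --- the paper-specific 'application' half: archived with the paper --
# (in practice this would import detrended_anomaly from the archive)
station = np.loadtxt("station_1234.txt")            # hypothetical data file
result = detrended_anomaly(station, slice(0, 30))   # 30-year base period
np.savetxt("fig2_input.txt", result)
```

Archived ‘by paper’, these two halves stay tangled together in every paper’s supplement; archived ‘by function’, only the second half is paper-specific.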
It would be much better if there existed a stable master archive of code, organised ‘by function’ (not ‘by paper’), that was referenced by the specific applications in individual papers. Any new method would first be uploaded to the master archive, and then only the meta-script for the application, referencing the specific code version used, would need to be archived with an individual paper. It would then be much easier to build on a previous set of studies, it would be clear where further developments (by the original authors or others) should be archived, and it would be easy to test whether the results of older papers were robust to methodological improvements. Forward citation (keeping track of links to the papers that used any particular function) could be used to gauge impact and apportion credit.
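What would that leave in the per-paper archive? Not much more than a pointer. The sketch below is purely hypothetical – the archive URL, the tagging scheme and the function name (borrowed from the sketch above) are all placeholders for whatever conventions such a master archive adopted:

```python
# A hypothetical per-paper 'meta script': fetch the pinned version of
# the generic code from the master archive, then run this paper's
# specific application. Every name here is a placeholder.
import subprocess, sys
import numpy as np

ARCHIVE = "https://example.org/climate-code.git"  # hypothetical master archive
TAG = "detrended_anomaly-v1.4.2"                  # exact version used in the paper

# One possible pinning mechanism: a shallow clone of a tagged release.
subprocess.run(["git", "clone", "--branch", TAG, "--depth", "1",
                ARCHIVE, "code"], check=True)
sys.path.insert(0, "code")
from trend_tools import detrended_anomaly        # the archived function

station = np.loadtxt("station_1234.txt")         # this paper's data
np.savetxt("fig2_input.txt", detrended_anomaly(station, slice(0, 30)))
```

Because the tag pins an exact version, anyone could later swap in a newer release and test whether the paper’s result is robust to the methodological improvement – exactly the kind of check that archiving by paper makes awkward.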
One could envision this system being used profitably for climate model/reanalysis output analysis, paleo-reconstructions, model-data comparisons, surface station analyses, and even for age-model construction for paleo-climate records, but of course this is not specific to climate science. Perhaps Nick Barnes’ Open Climate Code project has this in mind, in which case, good luck to them. Either way, the time is clearly ripe for a meta-project for code archiving by function.