Gavin Schmidt and Eric Steig
The last couple of weeks saw a number of interesting articles about archiving code – particularly for climate science applications. The Zeeya Merali news piece in Nature set the stage, and the commentary from Nick Barnes (of ClearClimateCode fame) proposed an ‘everything and the kitchen sink’ approach. Responses from Anthony Vejes and Stoat also made useful points concerning the need for better documentation and proper archiving. However, while everyone is in favor of openness, transparency, motherhood and apple pie, there are some serious issues that need consideration before the open-code revolution really gets going.
It would help to start by being clear about what is meant by ‘code’. Punditry about the need for release of ‘all supporting data, codes and programmes’ is not very helpful, because it lumps together very simple things, like a few lines of Matlab script used to do simple linear regressions, with very complex things, like climate model code, which is far more sophisticated. The issues involved in each case are quite different, for reasons that are scientific and professional as well as organizational.
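To make the contrast concrete, the simple end of that spectrum really can be just a few lines. Here is a minimal sketch (in Python rather than Matlab, with an entirely hypothetical input file and column layout) of the kind of ‘code’ that sits behind many published trend estimates:

```python
# Ordinary least-squares trend of a temperature anomaly series.
# The file name and column layout are hypothetical placeholders.
import numpy as np

data = np.loadtxt("anomalies.txt")        # columns: year, anomaly (deg C)
years, anomalies = data[:, 0], data[:, 1]

slope, intercept = np.polyfit(years, anomalies, 1)  # degree-1 (linear) fit
print(f"trend: {10 * slope:.3f} deg C/decade")
```

Archiving this is trivial; archiving a million-line climate model is a different proposition entirely.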
First, the practical scientific issues. Consider, for example, the production of key observational climate data sets. While replicability is a vital component of the enterprise, it is not the same thing as simple repetition. Independent replication counts far more towards acceptance of a result than merely demonstrating that, given the same assumptions, the same input, and the same code, somebody can get the same answer. It is far better to have two independent ice core isotope records from Summit in Greenland than it is to see the code used in the mass spectrometer for one of them. Similarly, it is better to have two (or three or four) independent analyses of the surface temperature station data showing essentially the same global trends than it is to see the code for one of them. It is better that an ocean sediment core corroborates a cave record than to inspect the code that produced the age model. Our point is not that the code is not useful, but that this level of replication is not particularly relevant to the observational sciences. In general, it is the observations themselves – not the particular manner in which they are processed – that are the source of the greatest uncertainty. Given that fundamental outlook, arguments for completely open code are not going to be seen as priorities in this area.
By contrast, for developers of climate models, the code is the number one issue, and debugging, testing and applying it to interesting problems is what they spend all their time on. Yet even there, it is very rare that the code itself (much of which has been freely available for some time) is an issue for replication; what matters much more is whether multiple independent models show the same result (and even then, you still don’t know for sure that it necessarily applies to the real world).
The second set of issues is professional. Different scientists, and different sciences, have very different paths to career success. Mathematicians progress by providing step-by-step, line-by-line documentation of every proof, while data-gathering paleo-climatologists thrive on their skill in finding interesting locations for records and applying careful, highly technical analyses to the samples. In neither case is ‘code’ a particularly important piece of the science.
However, there are many scientists working on analysis or synthesis who make heavy use of increasingly complex code, applied to increasingly complex data, and this is (rightly) where most of the ‘action’ has been in the open code debate so far. But it is also where the conflicts between scientific productivity at the individual level and at the community level are most stark. Much of the raw input data for climate analysis is freely available (reanalysis output, GCM output, paleo-records, weather stations, ocean records, satellite retrievals, etc.), and so the skill of the analyst lies in how they choose to analyse those data and the conclusions they are able to draw. Very often, a novel methodology applied to one set of data to gain insight can be applied to others as well. An individual scientist with such a methodology might therefore understandably feel that making duplication of their type of analysis ‘too simple’ – that is, providing the code rather than carefully describing the mathematical algorithm – will undercut their own ability to get future funding to do similar work. There is certainly no shortage of people happy to use someone else’s ideas to analyse data or model output (and in truth, there is no shortage of analyses that need to be done). But to assume there is no perceived conflict between open code and what may be thought necessary for career success – and for the advancement of a science that benefits from a bit of competition for ideas – would be naïve.
The process of making code available is clearly easier if it is established at the start of a project that any code developed will be open source, but taking an existing non-trivial code base and turning it into open source is not simple, even if all participants are willing. In a recent climate model source code discussion, for instance, lawyers for the various institutions involved were very concerned that code historically incorporated into the project might have come from outside parties who would assert copyright infringement over their bits of code if it were now to be freely redistributed (which is what the developers wanted). Given that a climate model project might have been in existence for 30 years or more, and involved hundreds of scientists and programmers from government, universities and the private sector, even sorting out who would need to be asked was unclear. And that didn’t even get into what happens if some code that was innocently used for a standard mathematical function (say, a matrix inversion) came from a commercial copyrighted source (see here for why that’s a problem).
Yet the need for more code archiving is clear. Analyses of the AR4 climate models done by hundreds of scientists not affiliated with the modelling groups are almost impossible for the groups developing the next generation of models to replicate on a routine and scalable basis, and so improvements in those metrics will not be priorities. When it comes to AR5 (for which model simulations are currently underway), archiving of code will certainly make replication of the analyses across all the models and all the model configurations much less hit-or-miss. Yet recently, it was only recommended, not mandated, that the code be archived, and no mechanisms (AFAIK) have yet been set up to make even that easy. In these cases, it makes far more sense to argue for better code archiving on the basis of operational need than on the basis of scientific replication.
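To see why the operational argument is compelling, consider what replication across an ensemble looks like once an analysis is archived as a callable function rather than described in a methods section. The sketch below is ours, not any group’s actual workflow: the directory layout, file naming and the diagnostic itself are hypothetical placeholders.

```python
# Rerunning one archived diagnostic across every model in an archive.
# Paths, file layout and the diagnostic are illustrative only.
from pathlib import Path
import numpy as np

def decadal_trend(series):
    """Placeholder diagnostic: linear trend per decade of an annual series."""
    slope, _ = np.polyfit(np.arange(len(series)), series, 1)
    return 10.0 * slope

results = {}
for f in sorted(Path("archive/ar5").glob("*_tas_global.txt")):
    model = f.stem.split("_")[0]          # e.g. "modelA_tas_global" -> "modelA"
    results[model] = decadal_trend(np.loadtxt(f))

for model, trend in sorted(results.items()):
    print(f"{model:20s} {trend:+.3f} deg C/decade")
```

With the code archived, extending the comparison to a new model configuration is one more file in the loop; without it, it is an email to the original author.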
This brings us to the third, and most important, issue, which is organizational. The currently emerging system of archiving ‘by paper’ does not serve the operational needs of ongoing research very well at all (and see here for related problems in other fields). Most papers for which code is archived demonstrate the application of a particular method (or methods) to a particular data set. That code can be broken down into generic code that implements the method (the function) and paper-specific code that applies the method to the data set at hand (the application). Many papers use a similar method in varied applications, and with the current system of archiving by ‘paper’, the archived code conflates the two aspects, making it harder than necessary to disentangle the functionality when it is needed in a new application. This leads to the archiving of multiple versions of essentially the same functional code, causing unnecessary confusion and poor version control.
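The distinction between the function and the application is easy to show in miniature. In the hypothetical sketch below (our illustration, not any published method), the generic half would live once in a shared archive, while the thin paper-specific half is all that is really particular to the paper:

```python
import numpy as np

# --- the generic 'function' half: lives once in a shared archive ----
def detrended_anomaly(series, base):
    """Remove a base-period mean and a linear trend from an annual series.
    (Illustrative method only, not taken from any particular paper.)"""
    anomaly = series - series[base].mean()
    t = np.arange(len(anomaly))
    slope, intercept = np.polyfit(t, anomaly, 1)
    return anomaly - (slope * t + intercept)

# --- the paper-specific 'application' half: archived with the paper --
# (in practice this would import detrended_anomaly from the archive)
station = np.loadtxt("station_1234.txt")            # hypothetical data file
result = detrended_anomaly(station, slice(0, 30))   # 30-year base period
np.savetxt("fig2_input.txt", result)
```

Archived ‘by paper’, these two halves stay tangled together in every paper’s supplement; archived ‘by function’, only the second half is paper-specific.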
It would be much better if there existed a stable master archive of code, organised ‘by function’ (not ‘by paper’), that was referenced by the specific applications in individual papers. Any new method would first be uploaded to the master archive, and then only the meta-script for the application, referencing the specific code version used, would need to be archived with an individual paper. It would then be much easier to build on a previous set of studies, it would be clear where further developments (by the original authors or others) should be archived, and it would be easy to test whether the results of older papers were robust to methodological improvements. Forward citation (keeping track of links to the papers that used any particular function) could be used to gauge impact and apportion credit.
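What would that leave in the per-paper archive? Not much more than a pointer. The sketch below is purely hypothetical – the archive URL, the tagging scheme and the function name (borrowed from the sketch above) are all placeholders for whatever conventions such a master archive adopted:

```python
# A hypothetical per-paper 'meta script': fetch the pinned version of
# the generic code from the master archive, then run this paper's
# specific application. Every name here is a placeholder.
import subprocess, sys
import numpy as np

ARCHIVE = "https://example.org/climate-code.git"  # hypothetical master archive
TAG = "detrended_anomaly-v1.4.2"                  # exact version used in the paper

# One possible pinning mechanism: a shallow clone of a tagged release.
subprocess.run(["git", "clone", "--branch", TAG, "--depth", "1",
                ARCHIVE, "code"], check=True)
sys.path.insert(0, "code")
from trend_tools import detrended_anomaly        # the archived function

station = np.loadtxt("station_1234.txt")         # this paper's data
np.savetxt("fig2_input.txt", detrended_anomaly(station, slice(0, 30)))
```

Because the tag pins an exact version, anyone could later swap in a newer release and test whether the paper’s result is robust to the methodological improvement – exactly the kind of check that archiving by paper makes awkward.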
One could envision this system being used profitably for climate model/reanalysis output analysis, paleo-reconstructions, model-data comparisons, surface station analyses, and even for age-model construction for paleo-climate records, but of course this is not specific to climate science. Perhaps Nick Barnes’ Open Climate Code project has this in mind, in which case, good luck to them. Either way, the time is clearly ripe for a meta-project for code archiving by function.