We hold these truths to be self-evident, that all models are created equal, that they are endowed by their Creators with certain unalienable Rights, that among these are a DOI, Runability and Inclusion in the CMIP ensemble mean.
Well, not quite. But it is Independence Day in the US, and coincidentally a new discussion paper on model independence (Abramowitz et al) has just been posted at Earth System Dynamics.
What does anyone mean by model independence? In the international coordinated efforts to assess climate model skill (such as the Coupled Model Intercomparison Project), multiple groups from around the world submit their model results from specified experiments to a joint archive. The basic idea is that if different models from different groups agree on a result, then that result is likely to be robust based on the (shared) fundamental understanding of the climate system despite the structural uncertainty in modeling the climate. But there are two very obvious ways in which this ideal is not met in practice.
First, if the models are actually the same, then it’s totally unsurprising that a result might be common between them. One of the two models would be redundant and add nothing to our knowledge of structural uncertainties.
Second, the models might well be totally independent in formulation, history and usage, yet share a common, but fallacious, assumption about the real world. A common result might then reflect that shared error, and say nothing about the real world at all.
These two issues are also closely tied to the problem of model selection. Given an ensemble of models that have varied levels of skill across any number of metrics, is there a subset or weighting of models that could be expected to give the most skillful predictions? And if so, how would you demonstrate that?
These problems have been considered (within the climate realm) since the beginnings of the “MIP” process in the 1990s, but they are (perhaps surprisingly) very tough to deal with.
One of the most interesting things about the MIP ensembles is that the mean of all the models generally has higher skill than any individual model. This is illustrated in the graphic from Reichler and Kim (2008). Each dot is a model, with the ensemble mean in black, and an RMS score (across a range of metrics) increasing left to right, so that the most skillful models or means are those furthest to the left.
But as Reto Knutti and colleagues have shown, the increase in skill of the ensemble mean doesn’t keep growing as you add more models. After you’ve averaged 10 or 15 models, the skill no longer improves. This is not what you would expect if every model result was an unbiased independent estimate of the true climate. But since the models are neither unbiased, nor independent, the fact that there is any increase in skill after averaging is perhaps the more surprising result!
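This plateau is easy to reproduce with a toy calculation. In the sketch below (entirely synthetic numbers, not CMIP data), each "model" is the truth plus a bias shared by all models plus its own independent noise. Averaging washes out the independent noise roughly as 1/√k, but the RMS error of the ensemble mean flattens out at the level of the shared bias rather than going to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n_grid = 1000  # hypothetical number of grid points / metrics
truth = np.zeros(n_grid)

# an error common to *all* models (the shared, fallacious assumption)
shared_bias = rng.normal(0, 0.5, n_grid)

# each model = truth + shared bias + its own independent noise
models = [truth + shared_bias + rng.normal(0, 1.0, n_grid)
          for _ in range(30)]

def rmse(x):
    """Root-mean-square error against the known truth."""
    return np.sqrt(np.mean((x - truth) ** 2))

for k in (1, 2, 5, 10, 20, 30):
    mean_k = np.mean(models[:k], axis=0)
    print(f"{k:2d} models: RMSE = {rmse(mean_k):.3f}")
# the RMSE drops quickly at first, then plateaus near the
# shared-bias level (~0.5) instead of shrinking to zero
```

If the models really were unbiased and independent, the RMSE would keep falling as 1/√k; the shared term is what caps the benefit of adding more models.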
One Model, One Vote
The default approach to the ensemble (used almost uniformly in the IPCC reports for instance), is the notion of “model democracy”. Each model is weighted equally to all the others. While no-one thinks this is optimal, no-one has really been able to articulate a robust reasoning that would give a general method that’s better. Obviously, if two models are basically the same but have different names (which happened in CMIP5), such an ensemble would be wrongly (but only slightly) biased. But how different would two models need to be to be worthy of inclusion? What about models from a single modeling group that are just ‘variations on a theme’? They might provide a good test of a specific sensitivity, but would they be different ‘enough’ to warrant inclusion in the bigger ensemble?
Model selection has however been applied in hundreds of papers based on the CMIP5/CMIP3 ensemble. Generally speaking, authors have selected a metric that they feel is important for their topic, picked an arbitrary threshold for sufficient skill and produced a constrained projection based on a subset or weighted mean of the models. Almost invariably though, the constrained projection is very similar to the projection from the full ensemble. The key missing element is that people don’t often check to see whether the skill metric that is being used has any relationship to the quantity being predicted. If it is unrelated, then the sub-selection of models will very likely span the same range as the full ensemble.
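The missing check is simple to state: across the ensemble, is the chosen skill metric actually correlated with the quantity being projected? A toy illustration (synthetic skill scores and projections, purely hypothetical) shows why an unrelated metric fails to constrain anything:

```python
import numpy as np

rng = np.random.default_rng(1)
n_models = 40

skill = rng.uniform(0, 1, n_models)          # hypothetical skill scores (lower = better)
projection = rng.normal(3.0, 0.5, n_models)  # hypothetical projected warming (degC),
                                             # drawn independently of skill here

# step 1: does the metric relate to the projection at all?
r = np.corrcoef(skill, projection)[0, 1]
print(f"skill-projection correlation: {r:+.2f}")

# step 2: sub-select the 'better' half at an arbitrary threshold
subset = projection[skill < np.median(skill)]
print(f"full ensemble range: {np.ptp(projection):.2f}")
print(f"selected subset range: {np.ptp(subset):.2f}")
# with an unrelated metric, the 'constrained' subset spans
# nearly the same range as the full ensemble
```

Only when that correlation is demonstrably strong (as in the sea ice case discussed next) does sub-selection narrow the projection in a meaningful way.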
The one case where model selection was used in AR5 was for the Arctic sea ice projections (based on Massonnet et al, 2012), where it is relatively easy to demonstrate that the trends in sea ice are a function of how much sea ice you start with. This clarity has been surprisingly difficult to replicate in other studies though.
So what should we do? This topic was the subject of a workshop last year in Boulder, and the new ESD paper is a partial reflection of that discussion. There is a video presentation of some of these issues from Gab Abramowitz at the Aspen Global Change Institute that is worth viewing.
Unfortunately, we have not solved this problem, but maybe this paper and associated discussions can raise awareness of the issues.
In the meantime, a joint declaration of some sort is probably a little optimistic…
We, therefore, the Representatives of the united Modelling Groups of the World, in AGU Congress, Assembled, appealing to the Supreme Judge of the model ensemble for the rectitude of our intentions, do, in the Name, and by Authority of the good People of these modeling Centers, solemnly publish and declare, That these disparate Models are, and of Right ought to be Free and Independent Codes, that they are Absolved from all Allegiance to NCAR, GFDL and Arakawa, and that all algorithmic connection between them and the Met Office of Great Britain, is and ought to be totally dissolved; and that as Free and Independent Models, they have full Power to run Simulations, conclude Papers, contract Intercomparison Projects, establish Shared Protocols, and to do all other Acts and Things which Independent Models may of right do. — And for the support of this Declaration, with a firm reliance on the protection of Divine PCMDI, we mutually pledge to each other our Working Lives, our Git Repositories, and our sacred H-Index.
- G. Abramowitz, N. Herger, E. Gutmann, D. Hammerling, R. Knutti, M. Leduc, R. Lorenz, R. Pincus, and G.A. Schmidt, "Model dependence in multi-model climate ensembles: weighting, sub-selection and out-of-sample testing", 2018. http://dx.doi.org/10.5194/esd-2018-51
- T. Reichler, and J. Kim, "How Well Do Coupled Models Simulate Today's Climate?", Bulletin of the American Meteorological Society, vol. 89, pp. 303-312, 2008. http://dx.doi.org/10.1175/BAMS-89-3-303
- F. Massonnet, T. Fichefet, H. Goosse, C.M. Bitz, G. Philippon-Berthier, M.M. Holland, and P. Barriat, "Constraining projections of summer Arctic sea ice", The Cryosphere, vol. 6, pp. 1383-1394, 2012. http://dx.doi.org/10.5194/tc-6-1383-2012