In the lead up to the 4th Assessment Report, all the main climate modelling groups (17 of them at last count) made a series of coordinated simulations for the 20th Century and various scenarios for the future. All of this output is publicly available in the PCMDI IPCC AR4 archive (now officially called the CMIP3 archive, in recognition of the two previous, though less comprehensive, collections). We’ve mentioned this archive before in passing, but we’ve never really discussed what it is, how it came to be, how it is being used and how it is (or should be) radically transforming the comparisons of model output and observational data.
First off, it’s important to note that this effort was not organised by IPCC itself. Instead, it was coordinated by the Working Group on Coupled Modelling (WGCM), an unpaid committee that is part of an alphabet soup of committees, nominally run by the WMO, that try to coordinate all aspects of climate-related research. In the lead up to AR4, WGCM took up the task of deciding what the key experiments would be, what would be requested from the modelling groups and how the data archive would be organised. This was highly non-trivial, and adjustments to the data requirements were still being made right up until the last minute. While this may seem arcane, or even boring, the point I’d like to leave is that just ‘making data available’ is the least of the problems in making data useful. There was a good summary of the process in Bulletin of the American Meteorological Society last month.
Previous efforts to coordinate model simulations had come up against two main barriers: getting the modelling groups to participate and making sure enough data was saved that useful work could be done.
Modelling groups tend to work in cycles. That is, there will be a period of a few years of development of a new model then a year or two of analysis and use of that model, until there is enough momentum and new ideas to upgrade the model and starting a new round of development. These cycles can be driven by purchasing policies for new computers, staff turnover, general enthusiasm, developmental delays etc. and until recently were unique to each modelling group. When new initiatives are announced (and they come roughly once every six months), the decision of the modelling group to participate depends on where they are in their cycle. If they are in the middle of the development phase, they will likely not want to use their last model (because the new one will almost certainly be better), but they might not be able to use the new one either because it just isn’t ready. These phasing issues definitely impacted earlier attempts to produce model output archives.
What was different this time round is that the IPCC timetable has, after almost 20 years, managed to synchronise development cycles such that, with only a couple of notable exceptions, most groups were ready with their new models early in 2004 – which is when these simulations needed to start if the analysis was going to be available for the AR4 report being written in 2005/6. (It’s interesting to compare this with nonlinear phase synchronisation in, for instance, fireflies).
The other big change this time around was the amount of data requested. The diagnostics in previous archives had been relatively sparse – the main atmospheric variables (temperature, precipitation, winds etc.) but not huge amounts extra, and generally only at monthly resolution. This had limited the usefulness of the previous archives because if something interesting was seen, it was almost impossible to diagnose why it had happened without having access to more information. This time, the diagnostic requests for the atmospheric, ocean, land and ice were much more extensive and a significant amount of high-frequency data was asked for as well (i.e. 6 hourly fields). For the first time, this meant that outsiders could really look at the ‘weather’ regimes of the climate models.
The work involved in these experiments was significant and unfunded. At GISS, the simulations took about a year to do. That includes a few partial do-overs to fix small problems (like an inadvertent mis-specification of the ozone depletion trend), the processing of the data, the transfer to PCMDI and the ongoing checking to make sure that the data was what it was supposed to be. The amount of data was so large – about a dozen different experiments, a few ensemble members for most experiments, large amounts of high-frequency data – that transferring it to PCMDI over the internet would have taken years. Thus, all the data was shipped on terabyte hard drives.
Once the data was available from all the modelling groups (all in consistent netcdf files with standardised names and formatting), a few groups were given some seed money from NSF/NOAA/NASA to get cracking on various important comparisons. However, the number of people who have registered to use the data (more than 1000) far exceeded the number of people who were actually being paid to look at it. Although some of the people who were looking at the data were from the modelling groups, the vast majority were from the wider academic community and for many it was the first time that they’d had direct access to raw GCM output.
With that influx of new talent, many innovative diagnostics were examined. Many, indeed, that hadn’t been looked at by the modelling groups themselves, even internally. It is possibly under-appreciated that the number of possible model-data comparisons far exceeds the capacity of any one modelling center to examine them.
The advantages of the database is the ability to address a number of different kinds of uncertainty, not everything of course, but certainly more than was available before. Specifically, the uncertainty in distinguishing forced and unforced variability and the uncertainty due to model imperfections.
When comparing climate models to reality the first problem to confront is the ‘weather’, defined loosely as the unforced variability (that exists on multiple timescales). Any particular realisation of a climate model simulation, say of the 20th Century, will have a different sequence of weather – that is, the weather pattern on Jan 31, 1967 in one realisation will be uncorrelated to the weather pattern on Jan 31, 1967 in another realisation, even though each run has the same climate forcing (increases in greenhouse gases, volcanoes etc.). There is no expectation that the weather in any one model will be correlated to that in the real world either. So any comparison of climate models and data needs to estimate the amount of change that is due to the weather and the amount related to the forcing. In the real world, that is difficult because there is certainly a degree of unforced variability even at decadal scales (and possibly longer). However, in the model archive it is relatively easy to distinguish.
The standard trick is to look at the ensemble of model runs. If each run has different, uncorrelated weather, then averaging over the different simulations (the ensemble mean) gives an estimate of the underlying forced change. Normally this is done for one single model and for metrics like the global mean temperature, only a few ensemble members are needed to reduce the noise. For other metrics – like regional diagnostics – more ensemble members are required. There is another standard way to reduce weather noise, and that is to average over time, or over specific events. If you are interested in the impact of volcanic eruptions, it is basically equivalent to run the same eruption 20 times with different starting points, or collect together the response of 20 different eruptions. The same can be done with the response to El Niño for instance.
With the new archive though, people have tried something new – averaging the results of all the different models. This is termed a meta-ensemble, and at first thought it doesn’t seem very sensible. Unlike the weather noise, the difference between models is not drawn from a nicely behaved distribution, the models are not independent in any solidly statistical sense, and no-one really thinks they are all equally valid. Thus many of the pre-requisites for making this mathematically sound are missing, or at best, unquantified. Expectations from a meta-ensemble are therefore low. But, and this is a curious thing, it turns out that the meta-ensemble of all the IPCC simulations actually outperforms any single model when compared to the real world. That implies that at least some part of the model differences is in fact random and can be cancelled out. Of course, many systematic problems remain even in a meta-ensemble.
There are lots of ongoing attempts to refine this. What happens if you try and exclude some models that don’t pass an initial screening? Can you weight the models in an optimum way to improve forecasts? Unfortunately, there doesn’t seem to be any universal way to do this despite a few successful attempts. More research on this question is definitely needed.
Note however that the ensemble or meta-ensemble only gives a measure of the central tendency or forced component. They do not help answer the question of whether the models are consistent with any observed change. For that, one needs to look at the spread of the model simulations, noting that each simulation is a potential realisation of the underlying assumptions in the models. Do not – for instance, confuse the uncertainty in the estimate of the ensemble mean with the spread!
Particularly important simulations for model-data comparisons are the forced coupled-model runs for the 20th Century, and ‘AMIP’-style runs for the late 20th Century. ‘AMIP’ runs are atmospheric model runs that impose the observed sea surface temperature conditions instead of calculating them with an ocean model, optionally using other forcings as well and are particularly useful if it matters that you get the timing and amplitude of El Niño correct in a comparison. No more need the question be asked ‘what do the models say?’ – you can ask them directly.
The usefulness of any comparison is whether it really provides a constraint on the models and there are plenty of good examples of this. What is ideal are diagnostics that are robust in the models, not too affected by weather, and can be estimated in the real world e.g Ben Santer’s paper on tropospheric trends, the discussion we had on global dimming trends, and the AR4 report is full of more examples. What isn’t useful are short period and/or limited area diagnostics for which the ensemble spread is enormous.
In such a large endeavor, it’s inevitable that not everything is done to everyone’s satisfaction and that in hindsight some opportunities were missed. The following items should therefore be read as suggestions for next time around, and not as criticisms of the organisation this time.
Initially the model output was only accessible to people who had registered and had a specific proposal to study the data. While this makes some sense in discouraging needless duplication of effort, it isn’t necessary and discourages the kind of casual browsing that is useful for getting a feel for the output or spotting something unexpected. However, the archive will soon be available with no restrictions and hopefully that setup can be maintained for other archives in future.
Another issue with access is the sheer amount amount of data and the relative slowness of downloading data over the internet. Here some lessons could be taken from more popular high-bandwidth applications. Reducing time-to-download for videos or music has relied on distributed access to the data. Applications like BitTorrent manage download speeds that are hugely faster than direct downloads because you end up getting data from dozens of locations at the same time, from people who’d downloaded the same thing as you. Therefore the more popular an item, the quicker it is to download. There is much that could be learned from this data model.
The other way to reduce download times is to make sure that you only download what is wanted. If you only want a time series of global mean temperatures, you shouldn’t need to download the two-dimensional field and create your own averages. Thus for many purposes, automatic global, zonal-mean or vertical averaging would have saved an enormous amount of time.
Finally, the essence of the Web 2.0 movement is interactivity – consumers can also be producers. In the current CMIP3 setup, the modelling groups are the producers but the return flow of information is rather limited. People who analyse the data have published many interesting papers (over 380 and counting) but their analyses have not been ‘mainstreamed’ into model development efforts. For instance, there is a great paper by Lin et al on tropical intra-seasonal variability (such as the Madden-Julian Oscillation) in the models. Their analysis was quite complex and would be a useful addition to the suite of diagnostics regularly tested in model development, but it is impractical to expect Dr. Lin to just redo his analysis every time the models change. A better model would be for the archive to host the analysis scripts as well so that they could be accessed as easily as the data. There are of course issues of citation with such an idea, but it needn’t be insuperable. In a similar way, how many times did different people calculate the NAO or Niño 3.4 indices in the models? Having some organised user-generated content could have saved a lot of time there.
Maybe some of these ideas (and any others readers might care to suggest), could even be tried out relatively soon…
The diagnoses of the archive done so far are really only the tip of the iceberg compared to what could be done and it is very likely that the archive will be providing an invaluable resource for researchers for years. It is beyond question that the organisers deserve a great deal of gratitude from the community for having spearheaded this.