This is fantastic work. I’ve been making constant use of the BEST/Google maps station browser to tackle the very interesting question of why our trends are higher than GISTEMP (which also extrapolates air temperatures over sea ice). It’s unfortunately far from simple, but the combination of your tools and MERRA has, I think, allowed us to crack an important piece of the puzzle.
Writing it up is taking a while though. Hopefully we’ll have a rudimentary report later in the month.
Folks can also find full (1880-2013) daily temperature movies below. Note that they are each an hour or so long, so unless you are really a daily data aficionado they are mainly useful for finding specific days of interest.
It turns out that, prior to the airport’s construction, the station now associated with the airport was on the top of a black roofed building closer to the city. This is a common occurrence for stations in the U.S., where many stations were moved from city cores to newly constructed airports or wastewater treatment plants in the 1940s.
This effect can be seen with even “dead simple”*** processing of raw temperature data. Exclude the stations that have been moved to airports, and the raw data results align amazingly closely with the published NASA “met stations” results.
Below are 3 plots. The first shows the results of simple anomaly gridding/averaging using all temperature stations. The second shows the results of the same procedure using only stations located at airports. And the third shows the results using only stations *not* located at airports. (The airport/non-airport station status can be found in the GHCN metadata files).
The simple gridding/averaging results are shown in green. The official NASA/GISS “meteorological stations” results are shown in red (included for comparison purposes).
The pre-1940 warm bias is clearly seen in (1) and especially in (2). When all airport stations are excluded from the processing (3), the bias largely disappears.
*** Gridding with 20deg x 20deg grid-cells (at the Equator), longitude dimensions adjusted to keep the grid-cell areas as nearly constant as possible as you move N/S. (No grid-cell interpolation — just baselining, gridding, and averaging.) You could teach this stuff to on-the-ball first-year programming students.
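For anyone who wants to try it, that “dead simple” recipe can be sketched in a few lines of Python. The data layout here is hypothetical, and the sketch assumes the anomalies have already been baselined against each station’s own 1951-1980 mean; real GHCN processing also needs quality checks first.

```python
import math
from collections import defaultdict

# Grid cells are 20 deg of latitude tall; the longitude step is widened
# toward the poles so cell areas stay roughly constant (20 x 20 deg at
# the Equator).  No interpolation: average stations within a cell, then
# average the cells.

def cell_index(lat, lon):
    """Map a station position to an (approximately) equal-area grid cell."""
    lat_band = int((lat + 90) // 20)                     # bands 0..8
    mid_lat = -90 + 20 * lat_band + 10                   # band centre
    lon_step = min(360.0, 20.0 / max(math.cos(math.radians(mid_lat)), 1e-6))
    return (lat_band, int((lon % 360) // lon_step))

def global_mean(station_data):
    """station_data: {sid: (lat, lon, {year: anomaly})} -> {year: mean}."""
    cells = defaultdict(lambda: defaultdict(list))       # cell -> year -> anomalies
    for lat, lon, series in station_data.values():
        c = cell_index(lat, lon)
        for year, anom in series.items():
            cells[c][year].append(anom)
    by_year = defaultdict(list)                          # year -> cell averages
    for cell_series in cells.values():
        for year, vals in cell_series.items():
            by_year[year].append(sum(vals) / len(vals))
    return {y: sum(v) / len(v) for y, v in by_year.items()}
```

Because the cells are already (roughly) equal-area, the cell averages are combined without further weighting, matching the “baselining, gridding, averaging” description above.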
I am not sure whether this dataset is such an improvement. If I understand the homogenization procedure of BEST correctly, it only corrects errors in the mean. The example of a move from a black roof in a city to an airport is also a good example that not only the mean changes. The black roof will show more variability, because on hot sunny days the warm bias is larger than on windy cloudy days. Thus we would need a bias correction of the complete probability distribution, and not just its mean. Or we should homogenize the indices we are interested in, for example the number of days above 40°C. The BEST algorithm, being fully automatic, could be well suited for such an approach.
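Such an index is trivial to compute once the daily series itself can be trusted, which is exactly why homogenizing beyond the mean matters; a minimal sketch:

```python
def count_exceedances(daily_tmax, threshold=40.0):
    """Count days whose maximum temperature exceeds a threshold.
    daily_tmax: iterable of daily Tmax values in deg C; None marks a
    missing day and is skipped."""
    return sum(1 for t in daily_tmax if t is not None and t > threshold)
```

An inhomogeneity that widens the distribution (like the black roof) inflates this count even when the mean is corrected perfectly.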
Another problem I see is the use of kriging. The variability will change with the number of stations available to estimate the daily means of a grid box. It will be hard to distinguish changes in weather variability from changes in the error of this estimate due to changes in the configuration of the station network. Thus I feel it is safer to analyse changes in extremes and weather variability on station data and avoid the additional problems of gridded datasets, especially at daily scales.
I take it that the difference between your data set and Cowtan and Way in recent years is mainly due to a greater slowdown in the BE land temperature analysis (although the C&W hybrid methodology probably accounts for a small part of the difference as well). I have some ideas about the source of that slightly more pronounced slowdown in BE, and I look forward to discussing them with you.
Meanwhile, this does provide yet more confirmation that coverage bias is increasingly problematic in the data series that do not attempt to address observational gaps.
The problem, Victor, is that “now at an airport” doesn’t tell you anything about the history of a station. Plus, the GHCN metadata is not exactly current or accurate: “at an airport” can mean associated with an airport but located some distance away. So the charts indicate perhaps the upper bound on the effect. Further, Zeke was talking about the practice in the U.S.; whether that was the same worldwide is an open question.
This is good work but something has bothered me about the underlying SST data.
The new BEST Land+Ocean is largely the Berkeley Land data set tacked on to the HadSST3 data set with minimal changes to the latter (I can’t see any). BEST could have made the decision to go with the ERSST data set from NOAA but Steve Mosher indicated that the HadSST3 had better documentation on the errors than the ERSST.
HadSST3 I think does a better job than the ERSST on capturing the early years, for example it seems to detect the Krakatoa cold excursion better than did the ERSST.
Where I think HadSST3 fails is in capturing the WWII anomaly, and the years thereafter. http://imageshack.com/a/img35/3662/1jy8.gif
How can HadSST3 and ERSST both show a similar level during the 1940-1945 time frame, but HadSST3 have all those adjustments in the subsequent 10-20 years? I don’t understand this. One side of the Atlantic must think much differently about adjustments than the other.
Between GISS, NOAA-NCDC, HadCRUT, and BEST, there has to be some sort of “ground truth” here and I believe that an agreed upon common time series can be established. I hoped that BEST would act as a tie-breaker and figure this difference out because I thought that was their original charter — to make sense out of the confusion over the way people viewed historical temperature readings.
Steven Mosher says: “The problem, Victor, is that “now at an airport” doesn’t tell you anything about the history of a station. Plus, the GHCN metadata is not exactly current or accurate: “at an airport” can mean associated with an airport but located some distance away. So the charts indicate perhaps the upper bound on the effect. Further, Zeke was talking about the practice in the U.S.; whether that was the same worldwide is an open question.”
I am not sure which point you are making. Methodological questions should go to caerbannog; I only plotted his data. I think it is intriguing and I hope someone will find the time to make a scientific study out of it. There is a similar simple analysis of airports by Nick Barnes, which finds similar numbers.
We actually have one advantage over Watts et al., who only have information on the end point. In the case of airports we also know that in 1880 the stations were not at an airport.
The plot shows a trend bias for the stations that are now at an airport. It does not show the trend bias in the total dataset. Then we would need good metadata and would also need to know exactly how many stations are at airports.
I made my comment to you because I know you. It’s a pretty simple point regardless of who did the charts or who had similar results.
1. The metadata on airports in GHCN is dated. It’s easy to update it using current airport locations. Ask me if you’d like the code.
2. Even with that, you need to look at station history, which is no simple task.
So I would say that the effect one sees is a good indication that there may be an issue, but it’s not as definitive as the title of your chart indicates. That’s all.
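A crude version of point 1 — re-flagging stations against current airport coordinates — might look like the following. The 2 km threshold is illustrative, not GHCN’s actual convention, and the coordinates in the usage example are approximate.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def flag_airports(stations, airports, max_km=2.0):
    """stations: {sid: (lat, lon)}; airports: list of (lat, lon).
    Returns {sid: bool} - True if the station sits within max_km of
    any listed airport."""
    return {
        sid: any(haversine_km(lat, lon, alat, alon) <= max_km
                 for alat, alon in airports)
        for sid, (lat, lon) in stations.items()
    }
```

Of course, as noted above, proximity today says nothing about the station’s history, so this only refreshes the “now at an airport” flag.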
Next: I’m not so sure about your concerns over kriging. First, to qualify for use, a daily station required 100% coverage during the month.
But your concern is noted and it might be interesting to see if we can come up with a test for how important that is.
Sorry Victor, yes, it would make a good study. I started looking at it some time ago, but the station history data was in much worse shape then. Perhaps I’ll revisit it, or at least try to make the metadata available for somebody who wants to do the work.
“How about a map showing changes in population density over time? Correlate this with a change in temps and I’m sure we’ll see a connection between them.”
I’m working on that for another post, related to some errors in McKitrick and Michaels (2007, 2010). It may be done in a few weeks.
There are several difficulties with this.
1. Station locations. For a good number of the stations the location data may only be good to 10 km or so. For other stations the location data is good down to 30 meters or so. And locations change.
2. Because of the station location differences you have to make some choices about what population data to use. Do you use gridded data, and at what resolution (1 arc-second, 2.5 minutes, 5 minutes, etc.)?
3. What time periods do you want to look at? Depending on your answer to #2 you have different choices.
In any case I’m assembling a data set of population density back to 1750 so that folks can look at this if they want to. I don’t think the picture will be as clear as you suggest.
Thanks for the comments. I agree that our homogenization is necessarily imperfect; it relies on detecting larger changes in the mean and will not effectively capture some types of changes in variability. That said, there have been a number of papers trying to look at questions of variability and extremes using raw daily data, and this dataset should at the very least provide an improvement on that.
A good approach would be to find pairs of stations close to one another with overlapping records where one is an airport station and one is not. Updating airport classifications in the station metadata, as Mosher suggests, would also be helpful.
Hi Zeke, yes, there are far too many studies where people analyse raw data for changes in extremes and weather variability. Homogenization with respect to the mean is an improvement, but only a partial one. I thought this post was a good occasion to bring that to people’s attention. Too many authors and reviewers do not seem to worry about it enough.
The climate sceptics of the IPCC warned quite clearly that inhomogeneities in the tails of the distribution are more problematic than the ones in the mean.
Nice to see you have studied airports already in more detail. Looks like collocation is important. Strange that the difference becomes even smaller.
If the UHI has some importance, you would expect that a relocation from a city to an airport would on average result in a strong cooling trend bias. Thus I was already somewhat surprised that the data of caerbannog showed so little effect. What seemed to fit quite well was that the jump was around WWII, which at least in Europe was a frequent date for such relocations.
Naturally this jump would only be there for stations that were in a city before, so one may want to restrict such a study to long series and remove shorter stations that may have been erected together with the airport.
Steven and Zeke, why do you worry so much about the metadata? The stations marked as airports, especially the ones with “airport” in the name, will be at airports at the moment, no? That seems hard to get wrong. The non-airport class may not be perfect.
@Victor Venema, especially beginning at Comment #4,
These kinds of issues are one reason why I find Bayesian procedures in experimental work to be particularly compelling. Sure, you build in models comparable to kriging and there are commitments to general forms of priors (although that is less of an issue these days), but the setting of kriging parameters is not done primarily by numbers of points, but by hyperpriors which borrow information not only from the locale but from other cells. It’s hard to be specific without math. The availability of a hierarchy for a model also lets the student build in terms to account for possible distortions, and then regress on data to see directly how important or unimportant they are based upon the data.
I see a significant risk in a detailed, highly manual procedure trying to account for This Bias And That. You can never be sure if you have them all, you can never be sure you did them all right, and, worst, it’s not easy to put a number on these procedural risks. I’d rather let the data speak.
Anyone working on a comparison of these new datasets with the HadCRUT4 ensembles? I’m interested, and working on a study of how resampling of HadCRUT4 ensembles can in many instances deceive the student by producing an estimate of variance that is substantially too small. There are a couple of reasons for that which are too technical to go into here. I’m trying to figure out, however, how much an understatement in coverage affects that, and whether I can quantify, in the HadCRUT4 data, how much is due to missing-data effects and how much to this understatement of variance. In particular, Arctic or anywhere, if there are missing observations in HadCRUT4, I’m interested in what’s typically used as a means of imputation or, more in the spirit of that dataset, how to correct its variance for these missing values.
You are right to note that there are many kinds of inhomogeneities, and some are going to be harder to deal with than others. In part, we hope this first version of the data set will help to generate feedback from potential users about possible issues and limitations.
As you note, spurious changes in variability are a particularly challenging problem. In some small part, we improve this situation by determining homogeneity adjustments for max, min, and mean temperatures separately. This will help in some situations where the microclimate of the station has a spuriously high or low diurnal range (as might be observed from a station placed near a man-made surface). However, that will only help some kinds of situations (and can even give detrimental adjustments in some cases). The general problem of disturbed variance is an issue I’d like to look more closely at, though I don’t know of any truly good solutions.
Of course, looking at individual station data is always an alternative when studying variability and extremes. If you really trust that those stations are reliable, then station data is probably the best thing one can use. However, both Berkeley and NOAA estimate that most stations longer than a couple decades have been subject to significant biasing events, which suggests that determining long-term trends from station data is likely to be a perilous endeavor.
One hopes that by carefully combining many stations into regional averages and identifying biasing events, one can extract more reliable information than from individual records. Such interpolated fields may work well for some kinds of questions and less well for others. For example, the resulting field generally captures information at a regional scale (e.g. 100 km) while true local effects may get mistaken for noise.
As you note, changes in station network composition can become a confounding factor for some kinds of climate analysis. This will be an issue for all kinds of reconstruction that allow network composition to vary over time (not just Kriging). We can’t eliminate it, but hopefully one can use statistics and simulated reconstructions on GCM data to determine the likely magnitude of the impact for various variables of interest and avoid drawing erroneous conclusions. In general, we would expect the fields to be very robust when densely sampled data is available, and perhaps less so at times and places that have been less well monitored.
Ultimately, all forms of climate and weather analysis are likely to suffer from limitations. It is our hope that this new data set will provide a useful point of comparison relative to other techniques, and that different ways of looking at natural processes will make it easier to get at the underlying truth.
Berkeley Earth Temperature Analysis – possible flaws?
I have discovered some flaws in Berkeley Earth’s temperature analysis. In several cases they have double counted Danish weather stations. Below is a list of some of the double-counted stations:
1. Vestervig and Vestevig-1
2. Nordby and Nordby(FANO)-1
3. HammeroddeFyr and Hammerodde Fyr-1
4. Maribo and Maribo Airport
5. Floro AP and Floro Lufthavn (Norway)
These examples can be used for testing the weather station identification computer program of BEST.
If there are many of these artificial weather stations, questions might be asked about the validity of BEST’s results.
The cause of the existence of the artificial stations might come from reading data from several sources with differences in the geographical coordinates and differences in data. It is a problem not created by BEST, but by the providers of temperature data.
It might be discussed who is responsible for checking the data before they are used, however I think it must be the responsibility of BEST to do it.
In respect of Prof. Richard Muller’s harsh critique of climate science, BEST cannot or should not ignore this double counting problem.
Jan Galkowski says: “These kinds of issues are one reason why I find Bayesian procedures in experimental work to be particularly compelling.”
I would personally expect that any gridding method would have problems with a changing number of stations. Studying variability, and thus extremes, is fundamentally more complicated than the means we are used to studying.
Robert Rohde says: “As you note, spurious changes in variability are a particularly challenging problem. In some small part, we improve this situation by determining homogeneity adjustments for max, min, and mean temperatures separately. This will help in some situations where the microclimate of the station has a spuriously high or low diurnal range (as might be observed from a station placed near a man-made surface).”
The diurnal temperature range is a great variable for finding breaks. Breaks in Tmin and Tmax are often negatively correlated.
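As an illustration, a break in a series such as annual-mean DTR can be located by scanning for the largest segment-mean shift — a crude stand-in for proper break-detection tests such as SNHT:

```python
def largest_mean_shift(series, min_seg=5):
    """Scan a 1-D series (e.g. annual-mean DTR) for the split point with
    the largest difference in segment means.
    Returns (index, shift) where the series splits as
    series[:index] vs series[index:]."""
    best_i, best_shift = None, 0.0
    n = len(series)
    for i in range(min_seg, n - min_seg + 1):
        left = sum(series[:i]) / i
        right = sum(series[i:]) / (n - i)
        if abs(right - left) > abs(best_shift):
            best_i, best_shift = i, right - left
    return best_i, best_shift
```

A real test would also assess significance against the series’ noise level; this sketch only finds the candidate break point.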
Edward Greisch says: “Is there a correction you can apply for being on a black roof? I don’t envy you having to deal with that historical data.”
You can make two measurements in parallel. In this specific case: one on the roof and the other at the local airport. To study the problem of the roof itself, you can make a measurement on the roof and in a nearby garden.
Many such measurements exist and “only” need to be analysed, but we also need many more of them.
Klaus Flemløse says: “In several cases they have made a double counting of Danish weather stations. Below is a list of some of the double counted stations: “
The International Surface Temperature Initiative (or more accurately Jared Rennie) has developed a merging algorithm that should remove such double counting. I would appreciate it if you could have a look at whether you can find similar problems there. How did you find these doubles?
When there are two stations that do not have identical location data (lat/lon) or identical observations, our merging algorithm will tend to classify them as two different stations. This can happen occasionally when the same station appears in different datasets (say, GSOD and GHCN-D) but with different locational precision and different methods of calculating the mean. On the other hand, an algorithm that is too aggressive in merging will lump together stations that should not be, due to small overlaps of identical values (e.g. a station in one location that gets merged with a station at a new location in one database, and with a different station at a new location in another). We’ve run into this issue a few times as well, most notably with the Central Park/LaGuardia stations in New York City.
The good news is that having a few accidental duplicates will make no difference to the resulting temperature fields. Unlike a more traditional gridding analysis, where each station is given equal weight in estimating grid-cell-level anomalies, the kriging approach will give no additional weight to two stations at (effectively) the same location. The only time I could see it causing issues is if one duplicate station has significant differences in lat/lon coordinates compared to the other.
We are always working to improve our station merging process, but this is likely an issue that we can mitigate but not totally eliminate. As long as there are duplicate stations with different names, locations, and non-identical anomalies reported, it is difficult for us to automatically determine that they are, in fact, duplicates. We will maintain a list of ones that people have identified, and perhaps add in a manual duplicate removal step in a future release.
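For anyone curious, a toy version of such a duplicate screen — close coordinates plus nearly identical overlapping values — might look like this. The thresholds are purely illustrative, not the actual Berkeley Earth merge criteria.

```python
import math

def likely_duplicates(stations, max_km=1.0, min_overlap=24, max_rmsd=0.1):
    """Flag station pairs within max_km of each other whose overlapping
    monthly values agree to within max_rmsd deg C.
    stations: {sid: (lat, lon, {month_index: temp})}."""
    def dist_km(lat1, lon1, lat2, lon2):
        # flat-earth approximation is adequate at ~1 km scales
        ky = 111.32
        kx = ky * math.cos(math.radians((lat1 + lat2) / 2))
        return math.hypot(ky * (lat1 - lat2), kx * (lon1 - lon2))

    sids = sorted(stations)
    pairs = []
    for i, s1 in enumerate(sids):
        for s2 in sids[i + 1:]:
            lat1, lon1, v1 = stations[s1]
            lat2, lon2, v2 = stations[s2]
            if dist_km(lat1, lon1, lat2, lon2) > max_km:
                continue
            overlap = sorted(set(v1) & set(v2))
            if len(overlap) < min_overlap:
                continue
            rmsd = math.sqrt(sum((v1[m] - v2[m]) ** 2 for m in overlap)
                             / len(overlap))
            if rmsd <= max_rmsd:
                pairs.append((s1, s2))
    return pairs
```

As noted above, the hard cases are exactly the ones this kind of screen misses: duplicates with differing names, shifted coordinates, and non-identical anomalies.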
In general, I am a fan of the Berkeley Earth approach for reconstructing past temperature variations – particularly because they present their data in an open user-friendly manner which should have long ago been adopted by some of the other groups. Mathematically the approach they use is fairly solid which allows for a higher spatial and temporal resolution of the end product.
However, there are a few concerns that I have – maybe not as strong as to say concerns – but perhaps questions is a better word.
Approach – I believe they create a baseline climatology as a function of lat, long, and elev for all land areas, using what looks like a form of spline (Anuspline, a program for interpolating temperatures, uses a similar approach). The residual from the climatology surface is then kriged and re-added to the climatology. They then subtract the baseline climatology for every station and krige the anomalies back through time. Once anomalies are spatially and temporally kriged, they re-add the surface climatology to the anomalies to produce their absolute air temperature gridded product. Outliers are adjusted by comparing this climatology-through-time field to the actual observations, and then the process is iterated with the variables recomputed.
My only question pertains to how heavily reliant the approach is on a surface climatology, which is unlikely to be simply a function of X, Y, Z with a spline plus re-added interpolated residuals.
Perhaps it is naiveté on my part, but would it not be more accurate to produce a baseline surface by spatially interpolating the anomalies from a CAM-style calculation, then forward/backward model the stations which are not available during the CAM period using this kriged anomaly surface, and then estimate the surface climatology? My intuition tells me that there is less uncertainty in estimating the monthly baseline temperature at position X using the forward/backward model with spatially kriged anomalies than using an X,Y,Z function, where the errors can be an order of magnitude greater.
The adjustments – I’ve examined a lot of individual high latitude stations in BEST and I find it curious how frequently the raw temperatures are adjusted (even by small amounts). I can understand how the BEST approach works very well for monitoring long-term temperature trends, but for absolute values I am somewhat concerned about using the adjusted series. If there is a high quality series which has not changed position or had any other inhomogeneities, I’m having trouble understanding why this series would be adjusted. I guess what I’m describing is a concern that the tolerance threshold for deviation from the regional expectation is too low, and that series with local variations which differ from the regional expectation might experience some smoothing.
Overall very good work though and well-documented. Thank you for your contributions.
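To make the “climatology plus kriged residuals” scheme in my first question concrete, here is a much-simplified stand-in: an elevation-only climatology fitted by least squares, with station residuals spread by inverse-distance weighting instead of kriging. All numbers and the data layout are hypothetical; Berkeley’s actual procedure is far more elaborate.

```python
def climatology_plus_residuals(stations, query_points):
    """Crude sketch of a climatology-plus-interpolated-residuals scheme.
    stations: list of (lat, lon, elev_m, baseline_temp).
    query_points: list of (lat, lon, elev_m).
    Fits T = a + b * elev (a simple lapse-rate climatology; assumes the
    station elevations vary), then adds inverse-distance-weighted
    residuals at each query point."""
    n = len(stations)
    me = sum(s[2] for s in stations) / n
    mt = sum(s[3] for s in stations) / n
    b = (sum((s[2] - me) * (s[3] - mt) for s in stations)
         / sum((s[2] - me) ** 2 for s in stations))
    a = mt - b * me
    residuals = [(s[0], s[1], s[3] - (a + b * s[2])) for s in stations]

    out = []
    for lat, lon, elev in query_points:
        wsum = rsum = 0.0
        for slat, slon, r in residuals:
            d2 = (lat - slat) ** 2 + (lon - slon) ** 2
            if d2 < 1e-12:          # query sits exactly on a station
                rsum, wsum = r, 1.0
                break
            w = 1.0 / d2
            wsum += w
            rsum += w * r
        out.append(a + b * elev + rsum / wsum)
    return out
```

The point of the sketch is the structure: the better the climatology fit, the smaller the residual field the interpolation has to carry, which is exactly where my question about the X,Y,Z spline comes in.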
I am pleased to know that BEST is aware of the double counting in the database. But it is not sufficient to recognize the problem; you must estimate the extent of the double counting. Is it 1%, 5% or 10% of all weather stations?
For your information, here are two additional double-counted stations from Norway:
OKSOY FYR and OKSOEY FYR
KJEVIK and KRISTIANSAND/KJEVIK
I could probably continue for some time to find the double counting of weather stations in the BEST database.
You will end up with additional examples to test the merging algorithm. I suppose you will be happy as well as angry with me at the same time. However, I have other things to do.
Below are trends (in deg C per decade) for 1979-2013, 1998-2012 and 1999-2013. I have included both the hybrid and infilled versions of Cowtan and Way, version 2. As mentioned previously, Berkeley Earth is much closer to NASA GisTemp, than it is to Cowtan and Way.
I’ll check with Robert to see what the difference in the code was for those two runs.
I agree that trend-wise we are closer (albeit slightly warmer) to GISS, at least for the modern period (1979-present). We are notably closer to C&W in the earlier part of the record, however. We are working with Kevin and Robert to explore the differences between our results in more detail.
The former is more of a question for Robert Rohde, and I’ll pass it his way. As far as the latter goes, it’s quite possible that the Berkeley approach overly smooths at a local level. It’s an issue we’ve known about for some time, and a consequence of our decision to focus on getting the best regional climate reconstruction rather than the best local fidelity. NCDC’s PHA will likely give you better local temperatures, though (we’d argue) somewhat more inhomogeneous regional trends (see our AGU poster this year for some comparisons).
Here are recent trends (again in deg per decade) for land surface. Over the satellite period, Berkeley Earth has the lowest trend, but does show a bit less “slowdown” than HadCrut4 and NOAA. NASA Gistemp appears warmer over the entire period, and in the recent short term.
Trends for 1979-2011 are:
0.26 Berk. Earth
0.28 BE GHCN
Berkeley Earth has the lowest trend while the BE “GHCN only” is up with NOAA. This suggests that there may well be a cool bias related to additional stations added in certain regions. Alternatively, some sparse regions may have bona fide regional cooling that is not captured from GHCN. Careful testing of the Berkeley Earth sensitivity to different data scenarios in different regions is required to resolve this discrepancy.
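For reference, the trends quoted in this thread are ordinary least-squares slopes; a minimal version, assuming a series of annual anomalies (the groups’ published trends may use monthly data and differing endpoints):

```python
def trend_per_decade(years, anomalies):
    """Ordinary least-squares slope of anomaly vs year, scaled to
    deg C per decade."""
    n = len(years)
    my = sum(years) / n
    ma = sum(anomalies) / n
    num = sum((y - my) * (a - ma) for y, a in zip(years, anomalies))
    den = sum((y - my) ** 2 for y in years)
    return 10.0 * num / den
```

Note that over short windows like 1998-2012 these slopes are highly sensitive to the endpoints, which is part of why the “slowdown” numbers differ so much between datasets.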
Great that you have made a BerkeleyEarth land and ocean dataset of global mean temperature.
I have a question regarding absolute temperature.
The BerkeleyEarth 1951-1980 land+ocean global mean temperature is stated as 14.774 °C.
This is substantially more than HadCRUT’s 13.9 °C (1961-1990) and NOAA/NCDC’s, also 13.9 °C (1901-2000).
Is this difference due to the ocean or due to the station data?
Or are there differences in the sea-ice area? Or maybe with the altitude?
At least HadCRUT gives a 5°x5° gridded file with absolute temperatures. So I ask: what is the spatial distribution of the difference between BerkeleyEarth and HadCRUT in absolute temperature? Is this difference rather uniform, or is there large scatter? Does it depend on the month?
Any reason for that? Especially because you use the HadSST3 data for the ocean.
Regarding the average temperature values, there are some spatial differences in temperature pattern but the bulk of the difference relates to a systematic question of comparing sea surface temperatures to air temperatures.
Spatially the principal differences between HadCRUT and Berkeley annual averages occur over two regions, Antarctica and the North Atlantic (60-80 N). Over Antarctic land there are both warmer and colder spots in Berkeley’s estimation, though warmer on average. HadCRUT uses very little data over Antarctica, so I wouldn’t be surprised to see differences here. In the North Atlantic, we are systematically warmer over the sea ice regions and somewhat colder over Greenland. The spatial differences vary locally from a few degrees to several degrees C.
Those differences, though interesting and warmer on average, occupy relatively little surface area and aren’t the main reason for a different average. The biggest difference is a small systematic offset over the bulk of the oceans. To understand this I need to go on a little digression.
The HadCRUT absolute climatology is a quasi-independent data product that uses land and ocean data, but isn’t directly constructed from HadCRU’s land and ocean anomaly data sets (see Jones et al. 1999). For example, the HadCRUT absolute field brings together more station data in many land regions than they use in their anomaly data (see New et al. 1999). Another difference is that though sea surface temperature (SST) is used for determining HadSST and HadCRUT anomalies, the absolute temperature field over the oceans reported with HadCRUT is calibrated to match marine air temperature (MAT). SSTs are typically 0.5-1.5 C warmer than MAT at the same location.
In the Berkeley Land + Ocean data we consistently used SST values for both anomalies and absolute values over the open ocean. Doing so results in an estimate of the absolute field over the oceans that is systematically warmer than if you do the calibrations in terms of MAT. Using MAT is probably more consistent with the notion of defining a global average temperature over both land and ocean, so there is an argument to recommend that approach. On the other hand, creating data sets that mix MAT absolutes and SST anomalies is likely to cause confusion for some users. At present, we opted to use a consistent data type (all SST) for the oceans, though it is a question we may revisit in the future.
The original baseline mistake is just standard error for the source, but there still seems to be non-correspondence with the mapping program here: http://data.giss.nasa.gov/gistemp/maps/
Can you straighten this out?
Comment by Pete Dunkelberg — 21 Mar 2014 @ 8:51 AM