Thoughts on Severe Local Storm Forecasts:

Quality, Verification, and Scientific Integrity




Chuck Doswell


Posted: 15 February 1998. Last update: 10 April 2009 (some small changes, plus updates to some outdated links).

All the standard disclaimers apply. These are my personal opinions only and do not have any official status.

1. Introduction

It's essential to begin this with some assumptions that may seem simple and obvious, but which are heavily freighted with implications, as I'm going to try to show. The primary assumption is that we begin with a concern for forecast quality. To appreciate fully what this means, I recommend reading Murphy (1993); by quality I mean the degree of correspondence between the forecasts and the observations. Presumably, the greater the degree of correspondence between forecasts and observations, the higher the quality of those forecasts.

There are other aspects of the forecasts that might well be of some concern. For instance, grammatical, spelling, and punctuation errors certainly can be argued to be important parts of the "quality" of those forecasts. Some part of product "quality control" properly should be devoted to making the forecast's grammar, spelling, and punctuation as good as possible. However, for my purposes, such issues are irrelevant to my definition of quality. This appears to be an item of some difference between my priorities and those of the National Weather Service (NWS) these days.


2. Verification

A second assumption is that the purpose of verification is to assess quantitatively the quality of the forecasts. An immediate consequence of this assumption is that a forecast not verified is one about which we can have only qualitative assessments of its quality. Without verification, then, we have only flimsy evidence at best concerning forecast quality.

This suggests that anyone who cares about forecast quality must necessarily be heavily committed to verification. To the extent that the NWS is only now beginning to consider a revision to its long-deficient national forecast verification system, it certainly can be argued that the NWS seems to be saying that it doesn't care very much about the quality of its forecasts. The resources allocated to creating a meaningful national forecast verification system are rather meager. Naturally, it always can be argued that resources for the NWS are meager across the board; I'd be among the last to dispute that. Nevertheless, resource allocation is indeed a reasonably accurate measure of priorities and the conclusion seems inescapable. Please note: I mean no disservice to those who've worked tirelessly in obscurity to attempt to do something about the problem of minimal resources for verification work. In fact, I'm proud of those whose sacrifices on behalf of verification have given us inspiration to continue to try to do something to change the situation.

Another aspect of verification that follows from this assumption is that verification must account fully for the correspondence between forecasts and observations. That is, a verification system that's superficial and one-dimensional is correspondingly incapable of providing forecasters and managers with appropriate feedback regarding the correspondence between forecasts and observations. Again, the writings of Murphy (e.g., Murphy and Winkler 1987) come quickly to mind. Forecast verification is a multidimensional problem and inevitably resists compression into single numbers, no matter how much the bureaucrats want everything to be reduced to the level of simplicity that they can comprehend.
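To make the point about single numbers concrete, here is a minimal sketch in Python. The counts are hypothetical, chosen only for illustration, and the class name `Contingency` is my own invention. It computes four standard measures from a 2x2 table of yes/no forecasts and observations: probability of detection (POD), false alarm ratio (FAR), critical success index (CSI), and frequency bias.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    """2x2 table of yes/no forecasts against yes/no observations."""
    hits: int           # event forecast and observed
    false_alarms: int   # event forecast but not observed
    misses: int         # event observed but not forecast
    correct_nulls: int  # event neither forecast nor observed

    def pod(self) -> float:
        """Probability of detection: fraction of observed events that were forecast."""
        return self.hits / (self.hits + self.misses)

    def far(self) -> float:
        """False alarm ratio: fraction of 'yes' forecasts that did not verify."""
        return self.false_alarms / (self.hits + self.false_alarms)

    def csi(self) -> float:
        """Critical success index: hits over all forecast-or-observed events."""
        return self.hits / (self.hits + self.false_alarms + self.misses)

    def bias(self) -> float:
        """Frequency bias: number of 'yes' forecasts per observed event."""
        return (self.hits + self.false_alarms) / (self.hits + self.misses)


# Hypothetical counts (NOT real verification data), chosen to illustrate the point.
cautious = Contingency(hits=30, false_alarms=10, misses=30, correct_nulls=930)
aggressive = Contingency(hits=30, false_alarms=30, misses=10, correct_nulls=930)

for name, table in (("cautious", cautious), ("aggressive", aggressive)):
    print(f"{name:10s} CSI={table.csi():.3f} POD={table.pod():.2f} "
          f"FAR={table.far():.2f} bias={table.bias():.2f}")
```

The two hypothetical forecasters have identical CSI, yet one underforecasts (bias below 1) and the other overforecasts (bias above 1). A ranking based on CSI alone would call them equal and hide exactly the information each would need in order to improve.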

An often-overlooked aspect of verification is that the observations must coincide with the forecasts. Suppose we're forecasting thunderstorms ... a substitution of radar data for the "observations" of thunderstorms would be inappropriate, because there's no one-for-one relationship between radar reflectivity and the occurrence of lightning/thunder. Lightning ground strike location information, on the other hand, would be an appropriate substitution for hearing thunder, since every lightning strike is inevitably associated with thunder (although not all thunderstorms produce cloud-to-ground [CG] lightning, and not all CGs are detected). This may appear to be obvious ... the observed data must fit the forecast ... but it often seems that obvious things need to be stated.

Much has been written elsewhere about verification, by me (including Doswell 1996, or Brooks and Doswell 1996) and many, many others. I'm attempting to convince my readers that they must resist the temptation to reduce forecast quality assessment to some sort of "pecking order" ranking system based on some sort of single measure. I have no intention of repeating many of those arguments here (see the references). Suffice it to say that the present national verification system comes nowhere even remotely close to reflecting the true dimensions of the forecast verification task. Until this gap between the need and the system can be closed, I can't take the national verification program very seriously.


3. Forecast improvement

A third assumption I'm going to make is that the first and foremost reason for knowing the quality of forecasts is to improve that quality. Naturally, there might well be other objectives in knowing forecast quality, including knowing whether or not the forecasts offer value to the users of the forecast. [I've written about forecast value elsewhere, as have Roebber and Bosart (1996), for example ... in fact there's a whole book about the topic (Katz and Murphy 1997) now ... so I'm not about to delve into that topic at this time. Most forecasters aren't very concerned about the real value of their forecasts (see Doswell and Brooks 1998); they're more occupied with the meteorology than the question of forecast value.]

Disregarding other objectives for verification, then, it becomes critical to provide rapid feedback to forecasters if they are to improve upon their forecasts. There are at least two distinct aspects to forecast improvements based on that verification feedback.

The first concerns what might be termed calibration of the forecasts. As noted in Brooks and Doswell (1996), a comprehensive verification can bring to light many peculiar aspects of how forecasters distribute their forecasts relative to the events. They may have various kinds of biases they are not even cognizant of, and would change readily if they simply knew about them. Such changes would improve their verification in the complete absence of any new meteorological insight. Changing forecasting behaviors that inadvertently reduce the correspondence between forecasts and observations is simply common sense, but first you have to be aware that you are engaging in such counterproductive tactics.
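As a rough illustration of what calibration feedback might look like, the following Python sketch builds a simple reliability table: for each probability value a forecaster issued, how often did the event actually occur? The function name `reliability_table` and the sample data are hypothetical, and the sketch assumes the forecasts are issued as a small set of discrete probabilities.

```python
from collections import defaultdict


def reliability_table(forecast_probs, outcomes):
    """Map each issued probability to (number of forecasts, observed frequency).

    forecast_probs: iterable of forecast probabilities (discrete values)
    outcomes: iterable of 1 (event occurred) or 0 (event did not occur)
    """
    counts = defaultdict(int)
    events = defaultdict(int)
    for prob, occurred in zip(forecast_probs, outcomes):
        counts[prob] += 1
        events[prob] += occurred
    return {p: (counts[p], events[p] / counts[p]) for p in sorted(counts)}


# Hypothetical sample (NOT real data): ten 20% forecasts and ten 80% forecasts.
forecasts = [0.2] * 10 + [0.8] * 10
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

table = reliability_table(forecasts, observed)
for prob, (n, freq) in table.items():
    print(f"forecast {prob:.0%}: n={n}, observed frequency {freq:.0%}")
```

In this made-up sample, events follow the 20% forecasts 40% of the time and the 80% forecasts only 50% of the time. A forecaster shown such a table could pull the issued probabilities toward the observed frequencies without any new meteorological insight at all.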

Second, once you've calibrated your forecasts properly, you can turn your attention to the meteorological aspects of the events for which you have done badly (and those for which you have performed exceptionally well, also). Clearly, the idea is to benefit from both your mistakes and your successes. This process, so often not done, is what I've termed loop-closing ... the idea is that you try to develop new strategies to deal with your problem forecasts. This only makes sense after you've finished the calibration step, since many bad forecasts could simply be addressed by recalibration.

There are, in turn, two types of meteorologically-related forecast problems: 1) problems for which answers exist and you simply haven't been using the existing knowledge, or haven't been using it properly, and 2) problems for which answers simply don't exist. The former is a training problem, the latter is a research problem. I'll return to this shortly, but I must address another important issue first.


4. Verification Data

Once your attention has been turned to verification, a critical aspect of the problem becomes apparent. You must have appropriate data with which to verify your forecasts. I have addressed some of this elsewhere. The item that concerns me the most at the moment is severe weather occurrence data. The history of these data is long and complex but I want to point out some important aspects of the data:

a. The data are very inhomogeneous. Both the quality and quantity of these data depend on the resources devoted to the task of collecting and sifting through them, the degree of enthusiasm of the collector(s), various political pressures (e.g., for purely economic reasons, some states historically have pressured the collectors of the data to downplay the occurrence of tornadoes), the education and training of the spotters and the data collectors, etc. Factors such as population density, typical visibilities, etc. all have an impact on the reported occurrences of severe weather. When tornadoes develop, it often happens that reports of severe hail and wind diminish in frequency. As the emphasis on local warning verification has developed since the early 1980s, the occurrence frequencies of severe thunderstorm events have undergone a substantial increase. Even before that, the development of the watch/warning program (with its associated storm spotter network) also contributed to an increase in reports.

b. If a miracle occurred, and everything I asked for were suddenly granted, it would be at least 20 years before we had enough data to begin some of the serious work that needs to be done. The nature of this process means that some of the real value of any investment in the database won't become apparent for many years. Is there sufficient public will to support substantive change over a span of decades before the big payoffs begin? I wonder. Of course, there would be short-term payoffs, especially to forecast verification!

c. These data, with all their flaws and problems, are the basis for performing verification. To the extent that these data are flawed, so necessarily is any verification based upon them. This certainly suggests that doing something like, say, basing insurance rates on the data would be questionable ... yet this sort of application is going on and the pressure for it is increasing.

d. At present, we're in a "fox guarding the chickenhouse" mode, where the same NWS people that issued the severe weather warnings are those responsible for the collection and documentation of the data used to verify the warnings. It's clear to me that various forms of abuse of this are occurring, pious public denials notwithstanding. It may be as simple as working very hard to find that single 3/4 inch hailstone in a sea of much smaller hail, and only doing probing calls in counties for which warnings were issued, or it can be as devious as changing the reported occurrence times to after the warning issuance times, asking leading questions ["You did see hail at least as large as nickels, didn't you?"] during probing calls, or tilting ambiguous observations to the side most favorable to the warnings.

I know beyond any doubt that this is going on, although by no means everywhere, by everyone, all the time. In some ways, if it were uniformly done, it might even be better for the science, since we could simply ignore the marginal reports.

e. It needs to be recognized that the reports (and non-reports) of severe weather need not be treated as dichotomous (they either occurred or they didn't). The fact is that all reports and non-reports have varying degrees of confidence associated with them. If I get a report of a tornado via a phone call from Al Moller or Gene Moore (famous storm chasers!!), I'm vastly more confident of that actually being a tornado than a tornado report arising from a situation where a wall fell on a bunch of people in New York state during a thunderstorm and there was considerable political pressure to call it a tornado.

If we actually followed the recommendation that Don Burgess and I made 10 years ago (Doswell and Burgess 1988) to put the SOURCE of severe reports into the database ... [Why has this been so hard to do? - (Update - this may have changed, finally, in August of 2007)] ... we could assign probabilities to the reports, as well as to the forecasts. Where this has a potentially huge impact is in the non-occurrence of events. Consider the following scenarios, all of which are associated with a non-report of severe weather:

  1. A storm with all the radar and satellite signatures of a tornadic storm moves across an unpopulated area.
  2. A storm with marginal radar and satellite signatures of a severe storm moves across an unpopulated area.
  3. A storm with no radar and satellite signatures of severity moves across an unpopulated area.
  4. A storm is seen via satellite to move across an unpopulated area, but it is too far from any radar to be analyzed. The CG lightning flash detection network shows a few lightning flashes in the area.
  5. A cloud is seen via satellite to move across an unpopulated area, and the CG lightning flash detection network detects no lightning. No radar data are available.
  6. No clouds and no radar echoes are seen over an unpopulated area.

The non-report in each of these cases has a different probability of being a valid instance of a tornado non-occurrence. I'll leave it up to you to imagine other issues and scenarios that would alter the probability of reports and non-reports being valid. We have the tools to accomplish the transformation of our severe storm database in this way, but it would take a major effort to put the pieces together properly and, to date, we have trouble getting the resources for an occasional decent scientific survey after a major tornado event! Until there's a will to provide a database with scientific integrity, trying to improve our severe weather forecasts is nearly hopeless!!
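The six scenarios above can be sketched as a small Python data structure, purely to show the shape such a scheme might take. Everything here is an assumption: the class `NonReport`, the function `non_occurrence_confidence`, and above all the numerical values, which are uncalibrated placeholders that only preserve a plausible ordering. Real values would have to be derived from surveys and the other data sources discussed in this essay.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NonReport:
    """Evidence available for an area from which no severe report was received."""
    radar_signature: Optional[str]  # "tornadic", "marginal", "none"; None = no radar coverage
    cloud_seen: bool                # satellite shows a cloud or storm over the area
    cg_lightning: bool              # CG flashes detected in the area


# PLACEHOLDER values, not calibrated against anything: confidence that a
# non-report really means no tornado occurred, given the radar evidence.
PLACEHOLDER_BY_RADAR = {"tornadic": 0.30, "marginal": 0.70, "none": 0.95}


def non_occurrence_confidence(nr: NonReport) -> float:
    """Return an illustrative confidence that the non-report is a true non-event."""
    if not nr.cloud_seen and nr.radar_signature in (None, "none"):
        return 1.0  # no clouds and no echoes: treated as certain in this sketch
    if nr.radar_signature is not None:
        return PLACEHOLDER_BY_RADAR[nr.radar_signature]
    # No radar coverage: fall back on lightning evidence (placeholder values).
    return 0.80 if nr.cg_lightning else 0.97


# The six scenarios listed above, in order.
scenarios = [
    NonReport("tornadic", cloud_seen=True, cg_lightning=True),
    NonReport("marginal", cloud_seen=True, cg_lightning=True),
    NonReport("none", cloud_seen=True, cg_lightning=True),
    NonReport(None, cloud_seen=True, cg_lightning=True),
    NonReport(None, cloud_seen=True, cg_lightning=False),
    NonReport("none", cloud_seen=False, cg_lightning=False),
]
confidences = [non_occurrence_confidence(s) for s in scenarios]
for number, value in enumerate(confidences, start=1):
    print(f"scenario {number}: confidence of non-occurrence = {value:.2f}")
```

The point of the sketch is only that a non-report becomes a number between 0 and 1 instead of an automatic "no", so that verification can weight it accordingly.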


5. Integrity

As I've reviewed the data, it's become clear that it's nearly impossible to make much use of them for verification. The data are so biased, so inhomogeneous, so non-stationary, that I cannot recommend a very detailed verification. Interpreting these data literally is sheer folly. Too much of this has become a game we're playing with numbers, and the science of meteorology as well as the forecasting based on that science is inevitably going to suffer from this. Subverting the data for short-term gain in verification statistics is a manifestation of scientific misconduct and no person having any scrap of scientific integrity should engage in such activities. Failing to support an effort to upgrade and update the severe weather occurrence data is fatal to the science based on the data. Yet it goes on.

Let me review this so far, in order to make this perfectly clear:

Failure to understand the importance of the verifying data in the process of short-sighted verification "games" negates the long-term process of forecast improvement. No one who cares about the long-term outcome of the total process can tolerate anything less than our absolute best verification data. This means (a) recognition of its importance by providing sufficient resources to the task to ensure the highest possible quality, and (b) development of a "zero tolerance" at all levels throughout the system for any activity that compromises the integrity of the data.

I make no claim to knowing how to get around the "fox guarding the chickenhouse" problem. I'm not convinced some sort of independent verification team is the answer, though. [Update: now I am convinced of it!] The NWS forecasters are in the best possible position to know meteorological events happened and where to look for occurrence reports, but they must somehow be rededicated to the integrity of the results, rather than seeking to maximize personal verification statistics, or those of the office via various forms of manipulation of the reports. Further, it's clear that when verification is an additional duty of forecasters, added to their already excessive workload, it's unlikely that a really serious effort can be sustained (except in rare cases of major disasters). Many reports slip through the cracks within such a leaky system.

Management continues to adopt a less than satisfactory position with respect to this issue. Their main concern, and perhaps even arguably so, is the continued production of forecasts in a timely manner to serve the customers. Verification tends to take a back seat and its importance to the long-term effort of forecast quality improvement seems not to be recognized at high enough levels to make any real difference in the resource allocation. Nor is there any indication that scientific misconduct is seen as a serious problem in the offices, since the offices that engage in such things have the "best verification statistics." It seems that management has created a monster from the well-intentioned effort to verify the warnings and forecasts. Thus, the process of verification has been subverted to serve purely short-range ends and its long-term goals are thwarted. I can only surmise that my assumption #1 is being violated at high levels in the management chain: it seems there's not a very serious commitment to forecast quality!


6. Closing the loop

Yet another issue plagues us, at the moment. Let's assume for the sake of argument that somehow we are able to obtain verifying severe weather reports of sufficient quality to do an adequate verification. I've already described the two ways that this feeds back to forecast quality improvement: calibration and loop-closing.

Loop-closing necessarily entails going back and taking a look at the meteorology of the events and how the forecasters handled that meteorology. The object is not always to find fault; it can be just as useful and important to know what was done right. At the present time, unfortunately, it's difficult if not impossible for this to take place, simply because there's no commitment to develop, maintain, and make available to forecasters any archive of the vast amount of information that flows through their systems on a daily basis. In order to develop the probabilistic severe weather database I've described above, we need:

  1. Source data to be added to the severe weather reports
  2. Identification of cases where surveys were conducted and information about the nature of those surveys
  3. Radar, lightning flash detection, satellite, and conventional (i.e., surface and upper air observations) data
  4. Geographic information systems (GIS) data of all sorts (i.e., population, etc.)
  5. Development of software to meld these various information sources into a systematic estimate of confidence in a report

At the SPC, I'm told the daily data throughput is on the order of 4-5 gigabytes (probably more) of surface and upper air observations, lightning detection network data, model output, satellite images, profiler and WSR-88D radar data, etc. Most of this torrent of information passes through the systems and spills out into ... nowhere (the "bit bucket"). Without saving this, we're effectively blocking our ability to make use of it later. The taxpayers paid for the data and, apparently, must pay for it again in order to make use of it for anything other than daily forecast preparation. In this day and age, 4-5 Gbytes sounds like a lot, but it really is well within our technical ability with off-the-shelf hardware.

If we were to commit to save these data, along with keystroke information (to know what was actually looked at during the course of the forecast day), it wouldn't require a great technical leap to do so. Basically, software needs to be written to spool the data to an archive and then to recapture it. Hardware acquisition would probably be the cheapest and easiest part.

If we had such information, not only would the opportunities for loop-closing open up tremendously, but the value of such information for training would be enormous. "Displaced real-time" training exercises, where the data would become available exactly as they were received in real time would be enormously successful in a real training program. Of course, no such substantive training program, with simulated real-world environments, is currently available, nor is any such program even in the planning. But without saving the data, it's virtually impossible even to consider developing it. This is something that's within our grasp, if the will were there to seize upon the opportunity.
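The spool-and-recapture idea, and the "displaced real-time" replay built on it, can be sketched in a few lines of Python. This is a toy in-memory version under stated assumptions: the names `Record`, `Archive`, `spool`, and `available_at` are hypothetical, the payloads are stand-ins, and a real system would write to durable storage and also log keystroke information.

```python
import bisect
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass(order=True)
class Record:
    """One archived product, ordered only by the time it was received."""
    received: datetime
    product: str = field(compare=False)
    payload: bytes = field(compare=False)


class Archive:
    """Keeps records sorted by receipt time so they can be replayed later."""

    def __init__(self) -> None:
        self._records = []

    def spool(self, record: Record) -> None:
        """Store a record, keeping the archive ordered by receipt time."""
        bisect.insort(self._records, record)

    def available_at(self, simulated_now: datetime) -> list:
        """Return everything a forecaster would have had by `simulated_now`."""
        times = [r.received for r in self._records]
        cutoff = bisect.bisect_right(times, simulated_now)
        return self._records[:cutoff]


# Hypothetical products and times, spooled out of order on purpose.
archive = Archive()
t0 = datetime(1998, 2, 15, 12, 0)
archive.spool(Record(t0 + timedelta(minutes=60), "satellite image", b"..."))
archive.spool(Record(t0, "surface obs", b"..."))
archive.spool(Record(t0 + timedelta(minutes=30), "radar volume", b"..."))

# Replay: at a simulated clock of 12:45, only the first two products exist.
visible = archive.available_at(t0 + timedelta(minutes=45))
print([r.product for r in visible])
```

Advancing the simulated clock and calling `available_at` again would release data to a trainee exactly in the order and at the times it originally arrived, which is all that a "displaced real-time" exercise fundamentally requires.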

Moreover, the loop-closing data would be tremendously valuable for research aimed at closing the gaps between our goals and what we are currently able to do with today's meteorological understanding. A true partnership between research and operations is possible, as some researchers (including yours truly!) might well view the access to this volume of data as a real research opportunity. As with the other aspects of applying these data sets, commitment to this goal is really all that is lacking. The resources necessary to make this a reality simply aren't that large ... this isn't some pie-in-the-sky, impractical dream. I find it astonishing, in fact, that archiving this flood of data is not already being done!


7. Conclusions

There does not seem to be any evidence of any real commitment to quality. We have become so caught up in the technical details of the so-called NWS Modernization that we have not really taken advantage of the opportunities that the technological changes have presented. We have implemented technology in ways that have minimized its overall value to forecasting. I hate to say it, but I can only interpret the lack of commitment to meaningful verification, to high quality verification data, and to substantive training as a sign of preoccupation with facets of weather forecasting that have little or no intersection with forecast quality.

Although the relationship between forecast quality and value is a complicated one, this does not change the fact that we are not doing very much to improve forecast quality, again discounting pious pronouncements with little substance behind them. I'm of the opinion that a commitment to quality starts at the top and I believe that the current "atmosphere" at the top is focused on issues far removed from forecast quality. As is normal, decisions seem to be driven by politics and economics, not the scientific and technical issues. Unless that's changed by someone at high enough levels, it's unlikely to change at the local level, where the products that serve the weather forecast customers are created. To the extent that local managers are doing some or all of the right things to improve forecast quality (i.e., meaningful verification, loop-closing, substantive training, etc.), that's a tribute to their personal integrity rather than a reflection of any systemic commitment.




Brooks, H.E., and C.A. Doswell III, 1996: A comparison of measures-oriented and distributions-oriented approaches to forecast verification. Wea. Forecasting, 11, 288-303.

Doswell, C.A. III, 1996: Verification of Forecasts of Convection: Uses, Abuses, and Requirements. Preprints, 5th Australian Severe Thunderstorm Conference (Avoca Beach, NSW, Australia), Bureau of Meteorology, 191-196.

______, and D.W. Burgess, 1988: On some issues of United States tornado climatology. Mon. Wea. Rev., 116, 495-501.

______, and H.E. Brooks, 1998: Budget-cutting and the value of weather services. Wea. Forecasting, [in press].

Katz, R.W., and A.H. Murphy, 1997: Economic Value of Weather and Climate Forecasts. Cambridge University Press, 222 pp.

Murphy, A.H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281-293.

______, and R.L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330-1338.

Roebber, P.J., and L.F. Bosart, 1996: The complex relationship between forecast skill and forecast value. Wea. Forecasting, 11, 544-559.