The blog title has been brutally assaulted by alliteration, with the result that the metaphor limps. Knowledge is not some final destination off in the distance, but rather something you pick up along the way. I want to show findings I stumble across, for now only as small stubs, with the possibility of later embroidering. As a manifesto of sorts I choose, in all modesty, to let Erwin Schrödinger's apology sketch the frame of this blog. Comments are always welcome.
- Harald

Friday, February 8, 2008

Me, a climate skeptic?

I recently attended a seminar organized by the statistics community in Oslo, where Kjell Stordahl presented a critique of the IPCC's methods. In particular, he criticized the lack of any assessment of the uncertainty in the forecasts, or rather, of the uncertainty in the projected climate development under different assumptions about CO2 emissions (since the climate researchers will not admit that they are making forecasts).

Stordahl's presentation can be found in Tilfeldig Gang, September 2007, pp. 5-11 (published by Norsk Statistisk Forening).

What does it take to come across as a credible 'climate skeptic'?

There is also a debate at forskning.no:
Is the uncertainty in the climate forecasts underestimated? (Kjell Stordahl)
Misunderstandings about climate forecasts (Knut H. Alfsen and Helge Drange)
Climate forecasts and uncertainty (Kjell Stordahl)
More on climate forecasts and uncertainty (Knut H. Alfsen and Helge Drange)
Continued confusion about climate forecasts (Kjell Stordahl)

Wednesday, February 6, 2008

Lies, damned lies, and confused statistics...


More to chew on for anyone trawling for statistical significance:

Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy (pdf, commentary), Steven N. Goodman (Johns Hopkins University)

Goodman serves up an unusually well-written and thorough presentation of the logical fallacies involved in drawing conclusions from statistical data. Here is a handful of samples:

The P value:
The P value is defined as the probability, under the assumption of no effect or no difference (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed. [...] It is worth noting one widely prevalent and particularly unfortunate misinterpretation of the P value. Most researchers and readers think that a P value of 0.05 means that the null hypothesis has a probability of only 5%.
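
To make the definition concrete, here is a small simulation of my own (the z-test setup and the observed statistic are made up for illustration). It computes a two-sided P value exactly as defined above, and, importantly, nothing else:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-sided z-test: under the null hypothesis the
# test statistic is standard normal.
observed_z = 1.96  # illustrative observed statistic

# The P value by definition: the probability, ASSUMING THE NULL IS
# TRUE, of a statistic at least as extreme as the one observed.
null_draws = rng.standard_normal(1_000_000)
p_value = np.mean(np.abs(null_draws) >= abs(observed_z))
print(f"P value ~ {p_value:.3f}")  # ~0.05

# Note what was NOT computed: P(null | data). Everything above
# conditions on the null being true, so the number 0.05 cannot be
# read as "the null hypothesis has only a 5% probability".
```

The whole calculation lives inside the world where the null is true, which is exactly why the widespread reading of P = 0.05 as a 5% probability for the null does not follow.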


A concrete example:

A recent randomized, controlled trial of hydrocortisone treatment for the chronic fatigue syndrome showed a treatment effect that neared the threshold for statistical significance, P = 0.06. The discussion section began, “. . . hydrocortisone treatment was associated with an improvement in symptoms . . . This is the first such study . . . to demonstrate improvement with a drug treatment of [the chronic fatigue syndrome]”. What is remarkable about this paper is how unremarkable it is. [...] a conclusion is stated before the actual discussion, as though it is derived directly from the results, a mere linguistic transformation of P = 0.06. This is a natural consequence of a statistical method that has almost eliminated our ability to distinguish between statistical results and scientific conclusions. We will see how this is a natural outgrowth of the “P value fallacy.”

On hypothesis testing:
Hypothesis tests are equivalent to a system of justice that is not concerned with which individual defendant is found guilty or innocent (that is, “whether each separate hypothesis is true or false”) but tries instead to control the overall number of incorrect verdicts (that is, “in the long run of experience, we shall not often be wrong”). Controlling mistakes in the long run is a laudable goal, but just as our sense of justice demands that individual persons be correctly judged, scientific intuition says that we should try to draw the proper conclusions from individual studies.

The hypothesis test approach offered scientists a Faustian bargain—a seemingly automatic way to limit the number of mistaken conclusions in the long run, but only by abandoning the ability to measure evidence and assess truth from a single experiment. It is doubtful that hypothesis tests would have achieved their current degree of acceptance if something had not been added that let scientists mistakenly think they could avoid that trade-off. That something turned out to be Fisher’s “P value,” much to the dismay of Fisher, Neyman, Pearson, and many experts on statistical inference who followed.
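
The "long run" character of this guarantee is easy to demonstrate. A quick sketch (again my own illustration, with made-up trial sizes): run many experiments in which the null is actually true, reject at the 5% level, and the error rate is controlled on average, while nothing whatsoever is said about any single verdict:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative numbers: 10,000 hypothetical experiments in which the
# null is actually true (both arms drawn from the same distribution).
alpha, n_trials, n_per_arm = 0.05, 10_000, 30

rejections = 0
for _ in range(n_trials):
    a = rng.standard_normal(n_per_arm)
    b = rng.standard_normal(n_per_arm)
    _, p = stats.ttest_ind(a, b)
    rejections += int(p < alpha)

# The guarantee the Neyman-Pearson machinery delivers:
print(f"false rejection rate ~ {rejections / n_trials:.3f}")  # ~0.05
# What it does not deliver: any statement about whether a particular
# one of the rejected nulls was a correct or an incorrect verdict.
```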

On the fallacy:
The idea that the P value can play both of these roles is based on a fallacy: that an event can be viewed simultaneously both from a long-run and a short-run perspective. In the long-run perspective, which is error-based and deductive, we group the observed result together with other outcomes that might have occurred in hypothetical repetitions of the experiment. In the “short run” perspective, which is evidential and inductive, we try to evaluate the meaning of the observed result from a single experiment. If we could combine these perspectives, it would mean that inductive ends (drawing scientific conclusions) could be served with purely deductive methods (objective probability calculations).

Why has it turned out this way?
It is a complex story, but the basic theme is that therapeutic reformers in academic medicine and in government, along with medical researchers and journal editors, found it enormously useful to have a quantitative methodology that ostensibly generated conclusions independent of the persons performing the experiment. It was believed that because the methods were “objective,” they necessarily produced reliable, “scientific” conclusions that could serve as the bases for therapeutic decisions and government policy. This method thus facilitated a subtle change in the balance of medical authority from those with knowledge of the biological basis of medicine toward those with knowledge of quantitative methods, or toward the quantitative results alone, as though the numbers somehow spoke for themselves.

Back to the example:
The statement that there was a relation between hydrocortisone treatment and improvement of the chronic fatigue syndrome was a knowledge claim, an inductive inference. To make such a claim, a bridge must be constructed between “P = 0.06” and “treatment was associated with improvement in symptoms.” That bridge consists of everything that the authors put into the latter part of their discussion: the magnitude of the change (small), the failure to change other end points, the absence of supporting studies, and the weak support for the proposed biological mechanism. Ideally, all of this other information should have been combined with the modest statistical evidence for the main end point to generate a conclusion about the likely presence or absence of a true hydrocortisone effect. The authors did recommend against the use of the treatment, primarily because the risk for adrenal suppression could outweigh the small beneficial effect, but the claim for the benefit of hydrocortisone remained.

What can be done?
Some of the strongest arguments in support of standard statistical methods are that they are a great improvement over the chaos that preceded them and that they have proved enormously useful in practice. Both of these are true, in part because statisticians, armed with an understanding of the limitations of traditional methods, interpret quantitative results, especially P values, very differently from how most nonstatisticians do. But in a world where medical researchers have access to increasingly sophisticated statistical software, the statistical complexity of published research is increasing, and more clinical care is being driven by the empirical evidence base, a deeper understanding of statistics has become too important to leave only to statisticians.

The second article will explore the use of Bayes factor—the Bayesian measure of evidence—and show how this approach can change not only the numbers we report but, more important, how we think about them.

Toward Evidence-Based Medical Statistics. 2: The Bayes Factor
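
As an appetizer for the second article: if I read it right, the Gaussian "minimum Bayes factor" it discusses can be computed in a couple of lines. The sketch below is my own, assuming a normal test statistic; BF_min = exp(-z^2/2) is the strongest evidence against the null that any simple alternative can claim, and even with 50/50 prior odds a P value of 0.05 leaves the null with no less than about a 13% posterior probability:

```python
import math
from scipy.stats import norm

def min_bayes_factor(p_value: float) -> float:
    """Gaussian minimum Bayes factor: the best case against the
    null over all simple alternatives, BF_min = exp(-z**2 / 2)."""
    z = norm.ppf(1 - p_value / 2)  # two-sided z-statistic
    return math.exp(-z * z / 2)

for p in (0.05, 0.01, 0.001):
    bf = min_bayes_factor(p)
    # With 50/50 prior odds, posterior odds = BF, so the posterior
    # probability of the null is at least bf / (1 + bf).
    print(f"P = {p:<6} BF_min = {bf:.3f}  P(null|data) >= {bf/(1+bf):.2f}")
```

In other words, the data can at best shift even prior odds on the null down to roughly 1 to 7 when P = 0.05, which is a much weaker statement than the P value's face value suggests.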


Introduction to Bayesian methods I: measuring the strength of evidence
(Steven N Goodman, Clinical Trials 2005; 2: 282-290)

Bayesian inference is a formal method to combine evidence external to a study, represented by a prior probability curve, with the evidence generated by the study, represented by a likelihood function. Because Bayes theorem provides a proper way to measure and to combine study evidence, Bayesian methods can be viewed as a calculus of evidence, not just belief. In this introduction, we explore the properties and consequences of using the Bayesian measure of evidence, the Bayes factor (in its simplest form, the likelihood ratio). The Bayes factor compares the relative support given to two hypotheses by the data, in contrast to the P-value, which is calculated with reference only to the null hypothesis. This comparative property of the Bayes factor, combined with the need to explicitly predefine the alternative hypothesis, produces a different assessment of the strength of evidence against the null hypothesis than does the P-value, and it gives Bayesian procedures attractive frequency properties. However, the most important contribution of Bayesian methods is the way in which they affect both who participates in a scientific dialogue, and what is discussed. With the emphasis moved from "error rates" to evidence, content experts have an opportunity for their input to be meaningfully incorporated, making it easier for regulatory decisions to be made correctly.