# Measuring Statistical Evidence

## This site is concerned with discussions about how statistical evidence is to be measured. Of particular relevance is the book Measuring Statistical Evidence Using Relative Belief by M. Evans, Monographs on Statistics and Applied Probability 144, CRC Press, Taylor & Francis Group.

# Month: August 2015

# Science Isn’t Broken by Christie Aschwanden

An interesting article that includes some discussion about statistical evidence on the FiveThirtyEight website under Science (August 19):

http://fivethirtyeight.com/features/science-isnt-broken/

# Review by Christian Robert

Christian Robert wrote a review of the book on his blog which can be read at the following address:

I thank him for that and also for the comments which “lean” towards the favourable in part although there are certainly criticisms too. There are a few places where I disagree with what he writes and so, in the spirit of a good discussion, I’ll respond to those here. You have to read his review for some of these comments to make sense. Christian’s comments are quoted.

- “There is just the point made in the quote above that seems unclear, in that it implies an intrinsic belief in the model, which should be held with the utmost suspicion! That is, the data is almost certainly unrelated with the postulated model since all models are wrong et cetera…”

I’m not sure why it is unclear because I wholeheartedly agree with his comment and that is certainly the position taken in the book. In fact, while much criticism is targeted towards Bayesian inference because of its reliance on priors, a more serious criticism would be its reliance on models as typically the choice of the model is much more important. The position taken in the book is that both the model and the prior should be checked against the data for their reasonableness. Chapter 5 prescribes checking the model first and, if this passes, then checking the prior.

- “Speaking of paradoxes, the Jeffreys-Lindley paradox is discussed in the next chapter and blamed on the Bayes factor and its lack of “calibration as a measure of evidence” (p.84). The book claims a resolution of the paradox on p.132 by showing confluence between the p-value and the relative belief ratio. This simply shows confluence with the p-value in my opinion.”

The Jeffreys-Lindley paradox is definitely not being blamed on the Bayes factor. In fact, it is explicitly stated that the Bayes factor is doing the right thing in this example. The problem lies, as others have noted, with the use of diffuse priors. The discussion is perhaps not clear enough in the text but here is how I see the resolution of the paradox.

(a) As you make the prior more diffuse, the relative belief ratio/Bayes factor (they are the same in this example) converges to infinity, and this is appropriate: the prior pushes all the mass to ±∞, so the hypothesized value looks more and more reasonable against the observed data compared to the values that have the bulk of the prior belief.

(b) But we can’t simply measure evidence by a Bayes factor (and I think this example shows this perfectly), as we need to calibrate the value to say whether strong or weak evidence has been obtained. For example, you may obtain a Bayes factor of 200 in favor of a hypothesis. Is that strong or weak evidence in favor? A sensible calibration is the posterior probability that the book calls the strength: the posterior probability that the true value has a relative belief ratio no larger than the relative belief ratio of the hypothesized value. This looks like a p-value but it is not, as its value has a different interpretation depending on whether you have evidence for (BF>1) or evidence against (BF<1). The interesting thing is that, as the prior becomes more diffuse, the strength converges to the classical p-value here. So a small p-value can simply indicate that you have weak evidence in favor. Given that many agree that a p-value does not measure evidence, this seems like a satisfying result and resolves an apparent conflict, although you will have to change your viewpoint if you think p-values are measures of evidence.

(c) As discussed in the book there is another step in the resolution. Given that we have a measure of evidence of a hypothesis being true, the a priori probability of getting evidence in favor of the hypothesis when it is false can be computed (called the bias in favor of the hypothesis). In the context of the Jeffreys-Lindley paradox this probability converges to 1 as the prior becomes more diffuse. So if we knew a priori that we are almost certain to get evidence in favor, no matter what the data is and when the hypothesis is false, would we conclude there is evidence in favor when BF>1 is observed? I don’t think so. This result demonstrates the folly of simply choosing diffuse priors as defaults thinking there is no cost. Of course, you may elicit a prior and still encounter bias (there is also bias against). The cure for this, as established in the book, is data as the biases vanish with increasing amounts of data. For me this resolves the Jeffreys-Lindley paradox since I can calibrate a Bayes factor, state the biases and cure the problem via design.
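Steps (a) to (c) can be checked numerically in the simplest setting. The following is only a sketch under assumptions that are mine rather than the book's: a sample mean `xbar` of `n` observations from a normal model with known `sigma`, a N(`mu0`, `tau`²) prior centred at the hypothesized value `mu0`, and, for the bias in favor, a single hypothetical false value `mu0 + delta` at which the probability is evaluated. The function names are illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the classical p-value


def relative_belief_ratio(xbar, n, sigma, mu0, tau):
    """RB of mu0 (equal to the Bayes factor in this example): sampling
    density of xbar at mu0 divided by the prior predictive density of xbar."""
    s = sigma / math.sqrt(n)        # sd of xbar given mu
    t = math.sqrt(tau**2 + s**2)    # sd of xbar under the prior predictive
    return NormalDist(mu0, s).pdf(xbar) / NormalDist(mu0, t).pdf(xbar)


def strength(xbar, n, sigma, mu0, tau):
    """Posterior probability that RB(mu) <= RB(mu0). Here RB(mu) is
    proportional to the likelihood, so the event is |mu - xbar| >= |mu0 - xbar|."""
    s2 = sigma**2 / n
    post_var = 1.0 / (1.0 / tau**2 + 1.0 / s2)
    post_mean = post_var * (mu0 / tau**2 + xbar / s2)
    post = NormalDist(post_mean, math.sqrt(post_var))
    d = abs(mu0 - xbar)
    return post.cdf(xbar - d) + (1.0 - post.cdf(xbar + d))


def two_sided_p_value(xbar, n, sigma, mu0):
    """Classical two-sided z-test p-value for H0: mu = mu0."""
    z = abs(xbar - mu0) / (sigma / math.sqrt(n))
    return 2.0 * (1.0 - STD_NORMAL.cdf(z))


def bias_in_favor(n, sigma, mu0, tau, delta):
    """Probability of observing RB(mu0) > 1 when the true mean is
    mu0 + delta (assumed false value). RB(mu0) > 1 iff |xbar - mu0| < c."""
    s = sigma / math.sqrt(n)
    t = math.sqrt(tau**2 + s**2)
    c = math.sqrt(2.0 * math.log(t / s) / (1.0 / s**2 - 1.0 / t**2))
    sampling = NormalDist(mu0 + delta, s)
    return sampling.cdf(mu0 + c) - sampling.cdf(mu0 - c)


if __name__ == "__main__":
    xbar, n, sigma, mu0, delta = 0.3, 50, 1.0, 0.0, 0.3  # illustrative values
    print("p-value:", two_sided_p_value(xbar, n, sigma, mu0))
    for tau in (1.0, 10.0, 100.0, 1000.0):
        print(
            tau,
            relative_belief_ratio(xbar, n, sigma, mu0, tau),
            strength(xbar, n, sigma, mu0, tau),
            bias_in_favor(n, sigma, mu0, tau, delta),
        )
```

With these illustrative numbers the ratio moves from below 1 at `tau = 1` to above 1 for the more diffuse priors and keeps growing, the strength settles at the two-sided p-value, and the bias in favor climbs towards 1, which is the pattern described in (a), (b) and (c).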

- “The estimator he advocates in association with this evidence is the maximum relative belief estimator, maximizing the relative belief ratio, another type of MAP then. With the same drawbacks as the MAP depends on the dominating measure and is not associated with a loss function in the continuous case.”

In general, a relative belief ratio is defined as a limit via a sequence of shrinking sets, provided this limit exists. Whenever there is a support measure which gives a positive and continuous prior density at the limiting point, this limit exists and equals the ratio of the posterior density to the prior density (both taken with respect to the same support measure). In essence, the relative belief ratio is independent of the support measure, although choosing the support measure carefully may make it easier to compute. By the same argument, the relative belief ratio is invariant under (smooth) reparameterizations. So the relative belief estimate doesn’t suffer from the lack of invariance, or the dependence on the support measure, that MAP does.
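The invariance claim can be seen in one line. As a sketch in notation that is mine rather than the book's, let ψ = Ψ(θ) be a smooth one-to-one reparameterization with Jacobian factor J(ψ) = |det ∂θ/∂ψ|, and let π(·) and π(· | x) denote the prior and posterior densities of θ:

$$
\begin{aligned}
\pi_\Psi(\psi) &= \pi(\theta(\psi))\,J(\psi), \qquad
\pi_\Psi(\psi \mid x) = \pi(\theta(\psi) \mid x)\,J(\psi), \\
RB_\Psi(\psi \mid x) &= \frac{\pi_\Psi(\psi \mid x)}{\pi_\Psi(\psi)}
 = \frac{\pi(\theta(\psi) \mid x)\,J(\psi)}{\pi(\theta(\psi))\,J(\psi)}
 = RB(\theta(\psi) \mid x).
\end{aligned}
$$

The Jacobian cancels in the ratio, so the maximizer of the relative belief ratio transforms along with the parameter. The MAP maximizes the posterior density alone, where the factor J(ψ) remains and can move the maximum. The same cancellation occurs when the support measure is changed, since both densities pick up the same factor.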

- “A major surprise for me when reading the book is that Evans ends up with a solution [for assessing the strength of the evidence] that is [acknowledged to be] very close (or even equivalent) to Murray Aitkin’s integrated likelihood approach!”

Murray Aitkin (at least in his book) proposes to use a p-value based on profile likelihood to assess a hypothesis. This is different from the approach being advocated in my book. First, the basic measure of evidence in the book is the relative belief ratio. The strength is a calibration of this. So, for example, a small value of the strength is not to be interpreted as evidence against. When RB>1, a small value of the strength means weak evidence in favor, and when RB<1 it means strong evidence against. For Murray a small value of his p-value is evidence against.

As mentioned in the book, however, I do see that Dempster and Aitkin are on the right path to an “evidential” approach to statistics. The problem is the use of the p-value to measure evidence as this inherits all the usual criticisms of p-values. For me it is a real step forward to separate the relative belief ratio/Bayes factor as *the* measure of the evidence from its calibration via the strength which indeed looks like a p-value but, as noted, is quite different in its interpretation.

I don’t agree with the comments about “double use of the data” as a criticism of Aitkin’s proposal or of relative belief theory. The phrase itself is somewhat vague in its meaning although there are contexts where it seems clearer. For example, in model checking it doesn’t make sense to use just any aspect of the data to assess the model, which is what has led to conservative p-values with posterior predictive checks. On the other hand, using the data twice, to estimate a quantity and to say something about its accuracy, seems absolutely necessary for inference. It is in this latter sense that the relative belief ratio and its strength are being used. So far, I know of no bad mathematical characteristics (or at least none that can’t be corrected by remembering the role of continuous models as approximations) for the usage of the strength. In any case, the issues about “double use of the data” need to be clarified generally in statistics by more research.

- “The above remark is a very interesting point and one bound to appeal to critics of the mixture representation, like Andrew Gelman. However, given that the Dickey-Savage ratio is a representation of the Bayes factor in the point null case, I wonder how much of a difference this constitutes. Not mentioning the issue of defining a prior for the contingency of accepting the null. So in the end I do not see much of a difference.”

I think Christian is concerned here with how much of a difference there is between the relative belief approach and the traditional usage of Bayes factors. Strictly speaking, with a continuous prior, Bayes factors (as a ratio of posterior to prior odds) cannot be defined for sets having prior measure 0 even though we would definitely like to. The common approach to resolving this is to place a positive prior mass on the set and take the prior to be a mixture. One can’t argue with this when the mixture is *the* prior but generally it seems like a clumsy approach that results in some problems. A far better way is to remember (as is a theme in the book) that any time we are using continuous models we are approximating and then define the relevant concept via a limit. When you do this for the Bayes factor then, under very weak conditions, you get the relative belief ratio and I view this as a strong argument in favour of the relative belief ratio as the relevant measure of evidence. This, together with the calibration, is why I say in the Preface that relative belief theory is really about being careful about the definition and usage of the Bayes factor. It is also worth remarking that the theory of inference derived via the relative belief ratio (hypothesis assessment, estimation, prediction) is much simpler than trying to do this based on the Bayes factor. In fact, I would say that one of the reasons the Bayes factor has not been used as a basis for a theory of inference is because of the horrid (sorry but that is how I feel) definition via a mixture in the continuous case.

I agree with Christian, however, that the general spirit of relative belief theory is coming from the Bayes factor which I acknowledge as a central concept as a measure of evidence. The Bayes factor defined via the mixture will agree with the relative belief ratio when one is computing the Bayes factor of a specified value of the full model parameter. Otherwise, they are generally different and the relative belief ratio has much better mathematical properties. In the end, I think it is just simpler to measure change in belief by the ratio of the probabilities that measure the beliefs, rather than by a ratio of odds.
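The relation between the two quantities can be written out directly. As a sketch in my own notation, for a set A with prior probability Π(A) strictly between 0 and 1 and posterior probability Π(A | x):

$$
\begin{aligned}
RB(A \mid x) &= \frac{\Pi(A \mid x)}{\Pi(A)}, \\
BF(A \mid x) &= \frac{\Pi(A \mid x)\,/\,\bigl(1 - \Pi(A \mid x)\bigr)}{\Pi(A)\,/\,\bigl(1 - \Pi(A)\bigr)}
 = RB(A \mid x)\,\frac{1 - \Pi(A)}{1 - \Pi(A \mid x)}
 = \frac{RB(A \mid x)}{RB(A^{c} \mid x)}.
\end{aligned}
$$

If a sequence of sets A_ε shrinks to a point ψ₀ so that Π(A_ε) → 0 and Π(A_ε | x) → 0, the correction factor tends to 1, and the Bayes factor of A_ε has the same limit as the relative belief ratio whenever that limit exists. No mixture prior is needed for this.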

- “I was eagerly and obviously waiting for the model choice chapter, but it somewhat failed to materialise!”

We may be using different terminology, I’m not sure. By “model choice” I mean the process whereby a statistician writes down a sampling model to be used in the analysis and there is some (limited) discussion of how one goes about this in Chapter 5. I suspect Christian’s meaning for “model choice” is the process whereby one has a set of possible models M_1, M_2, … and then based on the data we pick one of these. For me this is an inference problem and, as such, it is covered by Chapter 4. In effect the statistician has chosen a model, namely, the union of M_1, M_2, … and is making inference about the index i. There is still the question of whether or not the union is appropriate and this is addressed by model checking as discussed in Chapter 5.
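Treating the index i as the parameter of interest, the computation is short. This is a minimal sketch under my own assumptions: each model M_i has a prior probability and a marginal likelihood m_i(x) that has already been computed, and the numbers in the usage example are made up for illustration.

```python
def model_index_relative_belief(prior_probs, marginal_likelihoods):
    """Posterior probabilities and relative belief ratios for the model index.

    prior_probs[i]          : prior probability of model M_i (sums to 1)
    marginal_likelihoods[i] : m_i(x), the prior predictive density of the
                              observed data under M_i (assumed given)
    """
    # Prior predictive density of the data under the union of the models.
    evidence = sum(p * m for p, m in zip(prior_probs, marginal_likelihoods))
    posterior = [p * m / evidence for p, m in zip(prior_probs, marginal_likelihoods)]
    # RB(i) = posterior / prior, which simplifies to m_i(x) / evidence.
    rb = [post / p for post, p in zip(posterior, prior_probs)]
    return posterior, rb


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    posterior, rb = model_index_relative_belief([0.5, 0.3, 0.2], [0.02, 0.05, 0.01])
    print(posterior)
    print(rb)
```

Since RB(i) reduces to m_i(x) divided by a common constant, the ordering of the models by relative belief ratio in this sketch does not depend on the prior placed on the index, although the posterior probabilities and the strength do.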

- “I also have a general difficulty with using ancillaries and sufficient statistics because, even when they are non-trivial and well-identified, they remain a characteristic of the model: using those to check the model thus sounds fraught with danger.”

Almost any check is going to depend on the model somehow it seems. For example, posterior predictives also depend on the model. In the book I’ve tried to be careful to distinguish inference (Chapter 4) from the checking of ingredients phase (Chapter 5). Not only are they quite different problems but it seems possible to come up with a pretty clear theory for the inference part. The checking aspect is more arbitrary and it is harder to come up with general principles. I certainly acknowledge that more can (and should) be said about the checking part but this is characteristic of all approaches to statistics. For example, there is some material in the book concerned with what one does when a model or prior fails but more needs to be said. Clearly, this is an issue of considerable practical importance.

- “I somehow find the approach lacking in several foundational and methodological aspects, maybe the most strident one being that the approach is burdened with a number of arbitrary choices, lacking the unitarian feeling associated with a regular Bayesian decisional approach.”

As in my reply to the previous point, we have to distinguish between the proposals for inference (Chapter 4) and for choosing and checking the ingredients (Chapter 5). I will grant the criticisms about “arbitrary choices” for the material of Chapter 5 but I don’t see that for the inference material. Basically there are three ingredients, (1) the sampling model, (2) the (proper) prior and (3) the data, and three principles of inference: (I) the principle of conditional probability to update beliefs, (II) the principle of evidence, whereby an increase in beliefs means evidence for and a decrease in beliefs means evidence against, and (III) ordering the evidence among alternatives using the relative belief ratio.

So where is the arbitrariness? Certainly (1) and (2) could be considered as such, as typically there is nothing that truly dictates the choices made; one just tries to use good and, hopefully, somewhat organized judgment through an elicitation process. Even after the choices have been made there is the checking phase, which can at least give us some comfort that inappropriate choices have not been made. Of course, (3) is never arbitrary if it is collected properly. Is (I) arbitrary? I suppose it might be but then, what is a better rule for updating beliefs? I think (I) is pretty solid and it would take a lot of persuasion for me to abandon it. (II) doesn’t get discussed in the statistical literature but it seems almost obvious; at least I can’t think of any reason to doubt it. There are a number of different valid ways to measure evidence via change in belief, and the relative belief ratio is just one (there is discussion of other choices in the book). But from many points of view it seems to be the simplest with the nicest properties. So (III), like (I), can be considered arbitrary but that seems to be too fussy for me unless one can come up with a better way to measure evidence. With (1), (2) and (3) combined with (I), (II) and (III) the inferences follow necessarily without arbitrary choices.
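How the three ingredients and three principles determine the inferences can be illustrated with a toy problem. This is a sketch with assumptions of my own: a binomial sampling model, a success probability restricted to three hypothetical values, a made-up prior, and made-up data of 7 successes in 10 trials.

```python
import math


def posterior(prior, likelihood):
    """(I) Update beliefs by conditional probability (Bayes' rule)."""
    joint = {theta: prior[theta] * likelihood[theta] for theta in prior}
    total = sum(joint.values())
    return {theta: j / total for theta, j in joint.items()}


def relative_belief(prior, post):
    """(II)/(III) RB > 1: belief increased, evidence in favor; RB < 1:
    belief decreased, evidence against. RB also orders the alternatives."""
    return {theta: post[theta] / prior[theta] for theta in prior}


# (1) sampling model: binomial(10, theta); (3) data: 7 successes (made up).
n_trials, successes = 10, 7
# (2) proper prior on three hypothetical values of theta.
prior = {0.2: 0.25, 0.5: 0.50, 0.8: 0.25}
likelihood = {
    theta: math.comb(n_trials, successes)
    * theta**successes
    * (1 - theta) ** (n_trials - successes)
    for theta in prior
}

post = posterior(prior, likelihood)
rb = relative_belief(prior, post)

map_estimate = max(post, key=post.get)   # maximizes the posterior probability
rb_estimate = max(rb, key=rb.get)        # maximizes the relative belief ratio

if __name__ == "__main__":
    print("posterior:", post)
    print("relative belief ratios:", rb)
    print("MAP:", map_estimate, "max relative belief:", rb_estimate)
```

In this toy case the posterior mode is 0.5, because the prior favors it, while the value whose belief increased the most is 0.8. Once the model, prior and data are fixed, nothing further has to be chosen to obtain these numbers.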

- “I also wonder at the scaling features of the method, namely how it can cope with high dimensional or otherwise complex models, without going all the way to ask for an ABC version!”

It would be nice if the relative belief approach could contribute something meaningful to the large scale inference context. This is a current research project. As mentioned in the text, the concern is with a gold standard for inference. That complicated problems might require compromises seems reasonable as long as the compromise can be viewed as an approximation to what the gold standard demands.

- One further point. I have no argument with decision theory except when told that it is *the* way to approach the subject of statistics. I don’t think there is a logical argument to support this assertion. Axiom systems such as Savage’s or coherency arguments like de Finetti’s certainly support a Bayesian approach, but there is no reason to take these as definitive. The role of statistics is to provide a sound reasoning process that informs us as clearly as possible about what the evidence, as expressed via the data, is saying about questions of interest. A decision may well contradict what the evidence says, and why not, if we are primarily concerned with maximizing utility. In a scientific context, however, I want to know what the evidence says without it being passed through the distorting lens of an (uncheckable) utility function. We may indeed end up contradicting the evidence for a wide variety of reasons. There is nothing wrong with that *provided* it is stated that we are contradicting the evidence and a justification is provided.