There is an interesting article in Quanta, ‘Is Infinity Real?’ (https://www.quantamagazine.org/20160630-infinity-puzzle-solution/), and this references an article in Discover, ‘Infinity Is a Beautiful Concept – And It’s Ruining Physics’. I like both articles because they support a point I make about theories of statistical inference: whenever infinity is used, we must be sure that we are in effect approximating something finite. This gets rid of a lot of paradoxical behavior. If you have a counterexample and it holds only when something is infinite, then you really don’t have a counterexample. A simple example of this occurs with the MLE. When the sample space and parameter space are finite, the MLE is consistent, but of course there are many counterexamples where the MLE is not consistent, and these depend intrinsically on infinity. So is the problem with the MLE or with the models? Certainly the mathematics is often much nicer when use is made of infinity, and that is fine by me; just don’t believe that models containing infinities represent the truth.

# Integrated Likelihood and Relative Belief and the Prosecutor’s Fallacy

Several times I’ve encountered the comment that inferences based on relative belief are the same as inferences based on integrated likelihoods. This is in a formal sense correct but it misses an important point. The purpose of this post is to make this clear. For what follows take all probability measures as discrete. There is no change in the argument for the continuous case except one has to add some irrelevant Jacobians. Also all priors are proper.

Suppose interest is in inference about a parameter psi=PSI(theta) where theta is the model parameter. Anything that locates the value of theta in the inverse image PSI^{-1}{psi} is a nuisance parameter and the standard, and appropriate, Bayesian approach is to integrate these out. Let f(x|theta) be the density of the data x when theta is true and L(theta|x) denote the likelihood. Note that L(theta|x)=cf(x|theta) for some c>0. So a likelihood is defined only up to a positive constant multiple and any function of theta in this equivalence class serves as a likelihood function.

Let pi(theta) be the prior on theta and pi(theta|psi) be the conditional prior of theta given psi. So

pi(theta|psi) = pi(theta)/(sum pi(theta) over theta in PSI^{-1}{psi}) = pi(theta)/pi_PSI(psi)

is the conditional prior density of theta given psi and pi_PSI(psi) is the marginal prior of psi. The integrated likelihood of psi is given by

sum L(theta|x)pi(theta|psi) over theta in PSI^{-1}{psi}

= c times sum f(x|theta)pi(theta|psi) over theta in PSI^{-1}{psi}.

The relative belief ratio of psi is given by

RB(psi|x) = pi_PSI(psi|x)/pi_PSI(psi)

= (sum pi(theta|x) over theta in PSI^{-1}{psi})/pi_PSI(psi)

where pi(theta|x) is the posterior density of theta and pi_PSI(psi|x) is the posterior density of psi. Now

pi(theta|x) = f(x|theta)pi(theta)/m(x) = L(theta|x)pi(theta)/cm(x)

where

m(x) = sum f(x|theta)pi(theta) over all theta

is the prior predictive density of the data x. It is immediate then that

RB(psi|x) = (integrated likelihood)/cm(x).

So indeed a relative belief ratio is an integrated likelihood. Since relative belief inferences for psi are determined by the ordering induced by the relative belief ratios, this implies that these inferences are the same as those induced by the integrated likelihood ordering.
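In the discrete case this identity is easy to verify numerically. The following sketch uses a toy model whose parameter space, sampling probabilities and prior are all made up for illustration (and takes c = 1), assuming Python with numpy:

```python
import numpy as np

# Toy discrete model (illustrative numbers only):
# theta in {0,1,2,3}, psi = PSI(theta) = theta mod 2, data x in {0,1}.
f = np.array([[0.9, 0.1],   # f(.|theta=0)
              [0.6, 0.4],   # f(.|theta=1)
              [0.3, 0.7],   # f(.|theta=2)
              [0.2, 0.8]])  # f(.|theta=3)
prior = np.array([0.1, 0.2, 0.3, 0.4])   # pi(theta)
PSI = np.array([0, 1, 0, 1])             # psi = PSI(theta)
x = 1                                    # observed data

m_x = float(np.sum(f[:, x] * prior))     # prior predictive m(x)
post = f[:, x] * prior / m_x             # posterior pi(theta|x)

for psi in (0, 1):
    idx = PSI == psi
    prior_psi = prior[idx].sum()                          # pi_PSI(psi)
    RB = post[idx].sum() / prior_psi                      # relative belief ratio
    int_lik = np.sum(f[idx, x] * prior[idx]) / prior_psi  # integrated likelihood (c = 1)
    assert np.isclose(RB, int_lik / m_x)  # RB(psi|x) = (integrated likelihood)/m(x)
```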

So what is the difference? The difference lies in the interpretation of these quantities, and this is significant. Note that RB(psi|x) is measuring change in belief from a priori to a posteriori. By the basic principle of evidence (this is an axiom, not a theorem), if the data have caused the probability to go up, then there is evidence in favor, and if the probability has gone down, there is evidence against. So RB(psi|x) > 1 means there is evidence that psi is the true value, RB(psi|x) < 1 means there is evidence that psi is not the true value, and RB(psi|x) = 1 means no evidence either way. Contrast this with the value that an integrated likelihood takes. Actually, the specific value is meaningless, as c>0 is arbitrary. Any likelihood can at most determine relative evidence between values.

Does this difference matter? Yes, in many ways, but as an illustration consider a well-known example, called the prosecutor’s fallacy, where the role of the relative belief ratio as a measure of evidence clarifies some issues. This is discussed in the book in Examples 4.5.4, 4.6.3 and 4.7.4, where the relevant numerical computations are provided. According to this example, the prosecutor has noted that a defendant shares a trait with the perpetrator of a crime and, since the trait is rare in the population, concludes (even calculating an erroneous probability of guilt) that this is overwhelming evidence of guilt. A statistician calculating the appropriate posterior probability of guilt finds that this is very small and concludes that this is evidence of innocence. Both are wrong! It defies common sense to suppose that the fact that the defendant and the perpetrator share the same trait is not evidence in favor of guilt, and indeed the relative belief ratio for guilt is greater than 1. The real question is whether this is strong or weak evidence of guilt, and to determine this it is necessary to calibrate the value of the relative belief ratio. The calibration issue for relative belief ratios is discussed in the book and, when the proposal for calibration is followed in this example, it is determined that the evidence for guilt is weak (which doesn’t mean the evidence for innocence is strong, as there is evidence for guilt and against innocence). So it is (hopefully) unlikely that we would convict based on only weak evidence for guilt.

But now change the circumstances of the problem but with exactly the same numbers. In this case the question is whether or not a person is a carrier of a deadly infectious disease when the data tell us that the individual has been in an area where the disease is rampant. So there is evidence of the person being a carrier and again it is weak, but should the person be quarantined or not? The answer to this question, as with the legal case, has nothing to do with statistics. The role of statistics is to provide a measure of the evidence and its strength. Once this is done other factors involving ethics, risks, etc. determine the outcome.
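A rough numerical sketch of the basic arithmetic (the numbers here are hypothetical, not those of the book’s examples): with a uniform prior over a population of N possible perpetrators and a trait of frequency p among the innocent, the posterior probability of guilt stays small while the relative belief ratio is much greater than 1, which is what misleads both parties.

```python
# Hypothetical numbers, not those of Examples 4.5.4, 4.6.3 and 4.7.4 in the book.
N = 1_000_000        # size of the population of possible perpetrators
p = 1e-4             # frequency of the trait among innocent individuals
prior_guilt = 1 / N  # uniform prior: each member equally likely to be the perpetrator

# Data: the defendant has the trait. The perpetrator has it with probability 1,
# an innocent person with probability p.
m = prior_guilt * 1.0 + (1 - prior_guilt) * p  # prior predictive prob. of a match
post_guilt = prior_guilt / m                   # posterior probability of guilt
RB_guilt = post_guilt / prior_guilt            # relative belief ratio for guilt

print(post_guilt)  # small (about 1%): the statistician's observation
print(RB_guilt)    # much larger than 1: there IS evidence in favor of guilt
```

Whether that evidence is strong or weak is then the separate calibration question discussed above.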

None of this follows from the integrated likelihood as it is not a measure of evidence. A central thesis of the book is that any theory of statistics has to be built on a measure of evidence as that is the main issue in applications of statistics.

There are a number of other benefits that arise from providing a measure of evidence. The book discusses many of these but there is one that is particularly notable. Given a measure of evidence you can measure a priori the bias in the prior, namely, you can calculate the prior probability, based on a particular amount of data, of obtaining evidence for or evidence against a particular hypothesis. For example, if you are told that the data provide evidence in favor of (against) a hypothesis, would you believe this is relevant if you were also told that the prior probability of obtaining evidence in favor (against) is very large? In other words, the prior may be such that there is a foregone conclusion, and that is what bias is measuring. Using integrated likelihood does not provide a means to answer such a question because it is not a measure of evidence, while the relative belief ratio is a measure of evidence and so leads directly to a measure of bias.

So I do not agree that relative belief is just using integrated likelihood and too much that is relevant is lost by thinking of relative belief in this way. In fact, there is no need to even mention the concept of likelihood from the perspective of relative belief.

# Article in Times Higher Education on Reproducibility and Evidence

Article by Paul Jump, September 3, 2015

https://www.timeshighereducation.co.uk/features/reproducing-results-how-big-is-the-problem

# Science Isn’t Broken by Christie Aschwanden

An interesting article that includes some discussion about statistical evidence on the FiveThirtyEight website under Science (August 19):

# Review by Christian Robert

Christian Robert wrote a review of the book on his blog which can be read at the following address:

I thank him for that and also for the comments which “lean” towards the favourable in part although there are certainly criticisms too. There are a few places where I disagree with what he writes and so, in the spirit of a good discussion, I’ll respond to those here. You have to read his review for some of these comments to make sense. Christian’s comments are quoted.

- ” There is just the point made in the quote above that seems unclear, in that it implies an intrinsic belief in the model, which should be held with the utmost suspicion! That is, the data is almost certainly unrelated with the postulated model since all models are wrong et cetera…”

I’m not sure why it is unclear because I wholeheartedly agree with his comment and that is certainly the position taken in the book. In fact, while much criticism is targeted towards Bayesian inference because of its reliance on priors, a more serious criticism would be its reliance on models as typically the choice of the model is much more important. The position taken in the book is that both the model and the prior should be checked against the data for their reasonableness. Chapter 5 prescribes checking the model first and, if this passes, then checking the prior.

- “Speaking of paradoxes, the Jeffreys-Lindley paradox is discussed in the next chapter and blamed on the Bayes factor and its lack of “calibration as a measure of evidence” (p.84). The book claims a resolution of the paradox on p.132 by showing confluence between the p-value and the relative belief ratio. This simply shows confluence with the p-value in my opinion.”

The Jeffreys-Lindley paradox is definitely not being blamed on the Bayes factor. In fact, it is explicitly stated that the Bayes factor is doing the right thing in this example. The problem lies, as others have noted, with the use of diffuse priors. The discussion is perhaps not clear enough in the text but here is how I see the resolution of the paradox.

(a) As you make the prior more diffuse the relative belief ratio/Bayes factor (they are the same in this example) converges to infinity and this is appropriate (the prior pushes all the mass to +- infinity so the hypothesized value looks more and more reasonable against the observed data compared to the values that have the bulk of the prior belief).

(b) But we can’t simply measure evidence by a Bayes factor (and I think this example shows this perfectly) as we need to calibrate the value to say whether or not strong or weak evidence has been obtained. So for example, you may obtain a Bayes factor of 200 in favor of a hypothesis. Is that strong or weak evidence in favor? A sensible calibration is given by computing the posterior probability discussed in the book called the strength (the posterior probability of the true value having a relative belief ratio no larger than the relative belief ratio of the hypothesized value). Note that this looks like a p-value but it is not as its value has a different interpretation depending on whether you have evidence for (BF>1) or evidence against (BF<1). The interesting thing is that as the prior becomes more diffuse the strength converges to the classical p-value here. So a small value of the p-value can simply indicate that you have weak evidence in favor. Given that many agree that a p-value does not measure evidence, this seems like a satisfying result and resolves an apparent conflict, although you will have to change your viewpoint if you think p-values are measures of evidence.

(c) As discussed in the book there is another step in the resolution. Given that we have a measure of evidence of a hypothesis being true, the a priori probability of getting evidence in favor of the hypothesis when it is false can be computed (called the bias in favor of the hypothesis). In the context of the Jeffreys-Lindley paradox this probability converges to 1 as the prior becomes more diffuse. So if we knew a priori that we are almost certain to get evidence in favor, no matter what the data is and when the hypothesis is false, would we conclude there is evidence in favor when BF>1 is observed? I don’t think so. This result demonstrates the folly of simply choosing diffuse priors as defaults thinking there is no cost. Of course, you may elicit a prior and still encounter bias (there is also bias against). The cure for this, as established in the book, is data as the biases vanish with increasing amounts of data. For me this resolves the Jeffreys-Lindley paradox since I can calibrate a Bayes factor, state the biases and cure the problem via design.
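Points (a) and (c) can be sketched numerically in the standard Jeffreys-Lindley setup (the numbers here are my own illustrative choices, including the “meaningfully false” value mu* = 0.5; assumes Python with numpy and scipy): x_1,…,x_n ~ N(mu,1), H0: mu = 0, prior mu ~ N(0, tau^2), with the relative belief ratio at 0 computed in its Savage-Dickey form as posterior over prior density.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 10
se2 = 1.0 / n                  # variance of xbar
xbar_obs = 1.96 / np.sqrt(n)   # observed data held fixed at z = 1.96 (p ~ 0.05)
mu_star = 0.5                  # a hypothetical, practically meaningful false value

def rb0(xbar, tau):
    """Relative belief ratio at mu = 0: posterior density over prior density."""
    post_var = se2 * tau**2 / (tau**2 + se2)
    post_mean = xbar * tau**2 / (tau**2 + se2)
    return norm.pdf(0.0, post_mean, np.sqrt(post_var)) / norm.pdf(0.0, 0.0, tau)

for tau in (1.0, 10.0, 100.0, 1000.0):
    # (a) with the data fixed, RB(0|x) grows without bound as the prior diffuses
    rb_obs = rb0(xbar_obs, tau)
    # (c) bias in favor: prior probability of evidence for H0 when mu = mu_star != 0
    xbar_sim = rng.normal(mu_star, np.sqrt(se2), size=100_000)
    bias = np.mean(rb0(xbar_sim, tau) > 1)
    print(f"tau = {tau:7.1f}   RB(0|x) = {rb_obs:9.3f}   bias in favor ~ {bias:.3f}")
```

The Bayes factor column diverges while the classical p-value stays fixed at about 0.05, and the simulated bias in favor climbs towards 1 as tau grows.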

- “The estimator he advocates in association with this evidence is the maximum relative belief estimator, maximizing the relative belief ratio, another type of MAP then. With the same drawbacks as the MAP depends on the dominating measure and is not associated with a loss function in the continuous case.”

In general, a relative belief ratio is defined as a limit via a sequence of shrinking sets, provided this limit exists. Whenever there is a support measure which gives a positive and continuous prior density at the limiting point, this limit exists and will equal the ratio of the posterior density to the prior density (both taken with respect to the same support measure). In essence the relative belief ratio is independent of the support measure, although choosing the support measure carefully may make it easier to compute. By the same argument the relative belief ratio is invariant under (smooth) reparameterizations. So the relative belief estimate doesn’t suffer from the lack-of-invariance problems or the dependence on a support measure that MAP does.
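The reparameterization point is easy to check directly: transforming both densities introduces the same Jacobian factor, which cancels in the ratio. A small sketch with hypothetical prior and posterior densities (my own numbers, assuming Python with numpy and scipy):

```python
import numpy as np
from scipy.stats import norm

prior = norm(0.0, 2.0)   # hypothetical prior density on mu
post = norm(0.5, 0.4)    # hypothetical posterior density on mu

mu = 0.7
RB_mu = post.pdf(mu) / prior.pdf(mu)   # relative belief ratio at mu

# Reparameterize to xi = exp(mu); each transformed density picks up the
# Jacobian |d mu/d xi| = 1/xi, so the factors cancel in the ratio.
xi = np.exp(mu)
prior_xi = prior.pdf(np.log(xi)) / xi
post_xi = post.pdf(np.log(xi)) / xi
RB_xi = post_xi / prior_xi

assert np.isclose(RB_mu, RB_xi)  # same value in either parameterization
```

By contrast, the MAP estimate maximizes the posterior density alone, where the Jacobian does not cancel, which is the source of its lack of invariance.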

- “A major surprise for me when reading the book is that Evans ends up with a solution [for assessing the strength of the evidence] that is [acknowledged to be] very close (or even equivalent) to Murray Aitkin’s integrated likelihood approach!”

Murray Aitkin (at least in his book) proposes to use a p-value based on profile likelihood to assess a hypothesis. This is different from the approach advocated in my book. First, the basic measure of evidence in the book is the relative belief ratio. The strength is a calibration of this. So, for example, a small value of the strength is not to be interpreted as evidence against. When RB>1, a small value of the strength means weak evidence in favor, and when RB<1 it means strong evidence against. For Murray a small value of his p-value is evidence against.

As mentioned in the book, however, I do see that Dempster and Aitkin are on the right path to an “evidential” approach to statistics. The problem is the use of the p-value to measure evidence as this inherits all the usual criticisms of p-values. For me it is a real step forward to separate the relative belief ratio/Bayes factor as *the* measure of the evidence from its calibration via the strength which indeed looks like a p-value but, as noted, is quite different in its interpretation.

I don’t agree with the comments about “double use of the data” as a criticism of Aitkin’s proposal or of relative belief theory. The phrase itself is somewhat vague in its meaning although there are contexts where it seems clearer. For example, in model checking it doesn’t make sense to use just any aspect of the data to assess the model – which is what has led to conservative p-values with posterior predictive checks. On the other hand, using the data twice, to estimate a quantity and to say something about its accuracy, seems absolutely necessary for inference. It is in this latter sense that the relative belief ratio and its strength are being used. So far, I know of no bad mathematical characteristics (or at least none that can’t be corrected by remembering the role of continuous models as approximations) for the usage of the strength. In any case, the issues about “double use of the data” need to be clarified generally in statistics by more research.

- “The above remark is a very interesting point and one bound to appeal to critics of the mixture representation, like Andrew Gelman. However, given that the Dickey-Savage ratio is a representation of the Bayes factor in the point null case, I wonder how much of a difference this constitutes. Not mentioning the issue of defining a prior for the contingency of accepting the null. So in the end I do not see much of a difference.”

I think Christian is concerned here with how much of a difference there is between the relative belief approach and the traditional usage of Bayes factors. Strictly speaking, with a continuous prior, Bayes factors (as a ratio of posterior to prior odds) cannot be defined for sets having prior measure 0 even though we would definitely like to. The common approach to resolving this is to place a positive prior mass on the set and take the prior to be a mixture. One can’t argue with this when the mixture is *the* prior but generally it seems like a clumsy approach that results in some problems. A far better way is to remember (as is a theme in the book) that any time we are using continuous models we are approximating and then define the relevant concept via a limit. When you do this for the Bayes factor then, under very weak conditions, you get the relative belief ratio and I view this as a strong argument in favour of the relative belief ratio as the relevant measure of evidence. This, together with the calibration, is why I say in the Preface that relative belief theory is really about being careful about the definition and usage of the Bayes factor. It is also worth remarking that the theory of inference derived via the relative belief ratio (hypothesis assessment, estimation, prediction) is much simpler than trying to do this based on the Bayes factor. In fact, I would say that one of the reasons the Bayes factor has not been used as a basis for a theory of inference is because of the horrid (sorry but that is how I feel) definition via a mixture in the continuous case.
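The limit definition can be illustrated numerically: the ratio of posterior to prior probabilities of shrinking intervals around a point converges to the ratio of the densities there, with no mixture component needed. A sketch with hypothetical continuous prior and posterior densities (my own numbers, assuming Python with scipy):

```python
from scipy.stats import norm

prior = norm(0.0, 2.0)   # hypothetical continuous prior on psi
post = norm(0.5, 0.4)    # hypothetical posterior on psi
psi0 = 0.0               # the point of interest, with prior probability 0

target = post.pdf(psi0) / prior.pdf(psi0)   # ratio of densities at psi0
for eps in (0.5, 0.1, 0.01, 0.001):
    # ratio of posterior to prior probability of the interval (psi0-eps, psi0+eps)
    ratio = ((post.cdf(psi0 + eps) - post.cdf(psi0 - eps))
             / (prior.cdf(psi0 + eps) - prior.cdf(psi0 - eps)))
    print(f"eps = {eps:6.3f}   interval ratio = {ratio:.6f}")
print(f"density ratio at psi0 = {target:.6f}")
```

The interval ratios converge to the density ratio as eps shrinks, which is the relative belief ratio at psi0.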

I agree with Christian, however, that the general spirit of relative belief theory is coming from the Bayes factor which I acknowledge as a central concept as a measure of evidence. The Bayes factor defined via the mixture will agree with the relative belief ratio when one is computing the Bayes factor of a specified value of the full model parameter. Otherwise, they are generally different and the relative belief ratio has much better mathematical properties. In the end, I think it is just simpler to measure change in belief by the ratio of the probabilities that measure the beliefs, rather than by a ratio of odds.

- “I was eagerly and obviously waiting for the model choice chapter, but it somewhat failed to materialise!”

We may be using different terminology, I’m not sure. By “model choice” I mean the process whereby a statistician writes down a sampling model to be used in the analysis and there is some (limited) discussion of how one goes about this in Chapter 5. I suspect Christian’s meaning for “model choice” is the process whereby one has a set of possible models M_1, M_2, … and then based on the data we pick one of these. For me this is an inference problem and, as such, it is covered by Chapter 4. In effect the statistician has chosen a model, namely, the union of M_1, M_2, … and is making inference about the index i. There is still the question of whether or not the union is appropriate and this is addressed by model checking as discussed in Chapter 5.

- “I also have a general difficulty with using ancillaries and sufficient statistics because, even when they are non-trivial and well-identified, they remain a characteristic of the model: using those to check the model thus sounds fraught with danger.”

Almost any check is going to depend on the model somehow it seems. For example, posterior predictives also depend on the model. In the book I’ve tried to be careful to distinguish inference (Chapter 4) from the checking of ingredients phase (Chapter 5). Not only are they quite different problems but it seems possible to come up with a pretty clear theory for the inference part. The checking aspect is more arbitrary and it is harder to come up with general principles. I certainly acknowledge that more can (and should) be said about the checking part but this is characteristic of all approaches to statistics. For example, there is some material in the book concerned with what one does when a model or prior fails but more needs to be said. Clearly, this is an issue of considerable practical importance.

- “I somehow find the approach lacking in several foundational and methodological aspects, maybe the most strident one being that the approach is burdened with a number of arbitrary choices, lacking the unitarian feeling associated with a regular Bayesian decisional approach.”

As in point 7, we have to distinguish between the proposals for inference (Chapter 4) and for choosing and checking the ingredients (Chapter 5). I will grant the criticisms about “arbitrary choices” for the material of Chapter 5 but I don’t see that for the inference material. Basically there are three ingredients (1) sampling model (2) (proper) prior (3) data and three principles of inference (I) principle of conditional probability to update beliefs (II) principle of evidence – beliefs increase means evidence for, beliefs decrease means evidence against (III) order the evidence among alternatives using the relative belief ratio.

So where is the arbitrariness? Certainly (1) and (2) could be considered as such, as typically there is nothing that truly dictates the choices made; one just tries to use good and, hopefully, somewhat organized judgment through an elicitation process. Even after the choices have been made there is the checking phase, which can at least give us some comfort that inappropriate choices have not been made. Of course, (3) is never arbitrary if the data are collected properly. Is (I) arbitrary? I suppose it might be but then, what is a better rule for updating beliefs? I think (I) is pretty solid and it would take a lot of persuasion for me to abandon it. (II) doesn’t get discussed in the statistical literature but it seems almost obvious; at least I can’t think of any reason to doubt it. There are a number of different valid ways to measure evidence via change in belief, and the relative belief ratio is just one (there is discussion of other choices in the book). But from many points of view it seems to be the simplest with the nicest properties. So (III), like (I), can be considered arbitrary, but that seems too fussy to me unless one can come up with a better way to measure evidence. With (1), (2) and (3) combined with (I), (II) and (III) the inferences follow necessarily without arbitrary choices.

- “I also wonder at the scaling features of the method, namely how it can cope with high dimensional or otherwise complex models, without going all the way to ask for an ABC version!”

It would be nice if the relative belief approach could contribute something meaningful to the large scale inference context. This is a current research project. As mentioned in the text, the concern is with a gold standard for inference. That complicated problems might require compromises seems reasonable as long as the compromise can be viewed as an approximation to what the gold standard demands.

- One further point. I have no argument with decision theory except when told that it is *the* way to approach the subject of statistics. I don’t think there is a logical argument to support this assertion. Axiom systems such as Savage’s or coherency arguments like de Finetti’s certainly support a Bayesian approach, but there is no reason to take these as definitive. The role of statistics is to provide a sound reasoning process that informs us as clearly as possible about what the evidence, as expressed via the data, is saying about questions of interest. A decision may well contradict what the evidence says, and why not if we are primarily concerned with maximizing utility. In a scientific context, however, I want to know what the evidence says without it being passed through the distorting lens of an (uncheckable) utility function. We may indeed end up contradicting the evidence for a wide variety of reasons. There is nothing wrong with that *provided* it is stated that we are contradicting the evidence and a justification is provided.

# Measuring Statistical Evidence

The role of “statistical evidence” in statistics is central and yet most approaches to developing a theory of statistics are somewhat ambiguous about how this is to be measured. At the very least this creates confusion and at its worst it leaves one with the impression that statistics is completely lacking in any logical foundations. If the worst case applies, why would one have any confidence in the inferences drawn from a statistical analysis of real data?

There are of course several attempts to deal with statistical evidence in the statistical literature. Perhaps the most prominent is the commonly used p-value. But this suffers from numerous well-documented difficulties as a measure of evidence. In fact, it is fair to say that the p-value is really not a valid measure of evidence. Pure likelihood theory comes closer to dealing adequately with the concept but there are basic gaps that need to be filled in and this seems unlikely to be possible without the addition of a prior to a problem. With the addition of a prior we do have a valid measure of statistical evidence, namely, the Bayes factor. But even here there are issues that need to be addressed. First there is the issue of the definition of the Bayes factor as this can be approached in several ways with some leading to better results than others. Second, and perhaps most important, there is the issue of calibration as in when is a Bayes factor reflecting strong evidence for or strong evidence against, etc.

The book Measuring Statistical Evidence Using Relative Belief (https://www.crcpress.com/Measuring-Statistical-Evidence-Using-Relative-Belief/Evans/9781482242799) discusses these issues. Furthermore, a measure of statistical evidence together with a calibration of this measure is presented and it is shown how a theory of statistical inference (estimation, prediction, hypothesis assessment, etc.) is determined by this. The basic measure of evidence is called the relative belief ratio and it is closely related to the Bayes factor. The approach produces inferences with many optimal properties as is demonstrated in the text.

Of course, this requires the prescription of a (proper) prior and many object to this addition because of its subjectivity. The issues surrounding objectivity and subjectivity are discussed in the text. The following quote somewhat summarizes the point of view taken.

“No matter how the ingredients are chosen they may be wrong in the sense that they are unreasonable in light of the data obtained. So, as part of any statistical analysis, it is necessary to check the ingredients chosen against the data. If the ingredients are determined to be flawed, then any inferences based on these choices are undermined with respect to their validity. Also, checking the model and the prior against the data, is part of how statistics can deal with the inherent subjectivity in any statistical analysis. There should be an explicit recognition that subjectivity is always part of a statistical analysis. A positive consequence of accepting this is that it is now possible to address an important problem, namely, the necessity of assessing and controlling the effects of subjectivity. This seems like a more appropriate role for statistics in science as opposed to arguing for a mythical objectivity or for the virtues of subjectivity based on some kind of coherency.”

So objectivity is indeed the (unattainable) goal in any scientific work and the necessity of subjectivity is dealt with, as much as is possible, through statistical tools. For example, with a measure of evidence we can assess the extent to which a prior induces bias into a statistical analysis.

Comments on these or any other issues associated with measuring evidence are welcome.