Wrong understanding of statistics leads to wrong science. The p-value is dead. Long live the e-value!

Rianne de Heide is a statistician at the Vrije Universiteit Amsterdam. While explaining her research, she repeatedly has to suppress the urge to draw on a blackboard, but the room at the VU in Amsterdam does not have one. She wants to show mathematical definitions and graphs, because that is what it takes to really understand the p-value. The p-value is the standard tool scientists use to demonstrate an effect or a relationship. “The problem is that researchers seem to find it hard to understand what a p-value actually is.”

P-values are widely used, especially in medicine, psychology and economics. A p-value is the probability of finding results at least as extreme as those observed if, in reality, there is no effect at all. A small p-value therefore means that the data would be very coincidental, an exception, if there were no effect. If the p-value is below 0.05, the result is called statistically significant and the effect is assumed to be real. For example, a p-value below 0.05 has been established as the official standard for demonstrating that a medicine works, used by the US Food and Drug Administration (FDA) and the European Medicines Agency (EMA).
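
To make this definition concrete, here is a minimal Python sketch of a p-value for a blood-pressure study. It assumes the simplest textbook setting, a one-sample z-test with a known standard deviation; the function name, the data and the value of sigma are hypothetical illustrations, not taken from De Heide's work.

```python
import math

def z_test_p_value(differences, sigma):
    """Two-sided p-value for H0: 'the mean change is zero'.

    Assumes a one-sample z-test with a known standard deviation `sigma`
    (a textbook simplification; real trials usually use a t-test).
    """
    n = len(differences)
    mean = sum(differences) / n
    z = mean / (sigma / math.sqrt(n))
    # P(|Z| >= |z|) for a standard normal Z: the probability of data
    # at least this extreme if the drug had no effect at all.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical blood-pressure changes (mmHg) for eight test subjects.
changes = [-3, -5, 1, -4, -2, -6, 0, -3]
print(z_test_p_value(changes, sigma=4.0))
```

Note what the number is not: it is not the probability that the drug works, nor the probability that the conclusion is wrong. It only says how surprising the data would be if there were no effect.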

It proves difficult for doctors, psychologists and anyone else who wants to use the p-value to understand how exactly the p-value works. Mistakes are sometimes made. De Heide has therefore worked with other mathematicians on a replacement for the p-value: the ‘e-value’.

In January, at the Royal Statistical Society in London, an important organization for statistics, she presented the research she has been doing with Peter Grünwald and Wouter Koolen since 2016. “It has been clear for years that the p-value does not really work well. It is a great honor that I can present my work here.”

Why is it so important to replace the p-value?

“In both medical and social science, researchers are talking about a replication crisis. It often happens that when a study is repeated, different results emerge. For example, one study may find a positive effect of a drug, while another finds no positive effect at all.

“It turns out that a lot of research is simply wrong. A famous article about this problem in medical science is even titled ‘Why Most Published Research Findings Are False’. The same is said about social science. The use of the p-value is one of the causes of this problem.”

What goes wrong with the p-value?

“There are all kinds of pitfalls in using a p-value to test a hypothesis, so a study has to follow strict rules. Scientists do not always stick to them, because they do not understand exactly how the p-value works.

“Questionnaires have been sent to doctors and psychologists, among others, and the answers show that many of them do not actually know what a p-value measures. And remember: doctors read articles about their field every week, and those articles are full of statements about p-values. Yet fewer than half of the doctors correctly answered the question of what the p-value means. Even math teachers often don’t know the right answer.”

So what are scientists doing wrong when it comes to statistics?

“Something that researchers often do, but which is actually not allowed, is adding extra data afterwards. Suppose researchers investigate whether a drug can lower blood pressure, using a group of thirty test subjects. Blood pressure may well drop in many of the subjects, but not by enough to get a p-value below 0.05. Researchers then often think: let’s add a few more test subjects to make the result statistically significant.”

“This is called ‘optional stopping’. Wanting more data is, in principle, a logical intuition. But with the p-value you are not allowed to do it this way. It can be proven mathematically that the chance of a false positive then becomes very high. So after adding test subjects you find a p-value below 0.05 and conclude that there is an effect, while in fact there is no effect at all. In some cases the chance of this is even 100 percent.”
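
The effect of optional stopping can be checked with a small simulation. The Python sketch below assumes a world in which the drug does nothing (the measurements are pure noise), starts with thirty subjects as in the example above, and recomputes a z-test p-value after every extra subject, stopping as soon as p < 0.05. The function names, the cap of 200 subjects and the number of simulated studies are hypothetical choices for illustration.

```python
import math
import random

def p_value(total, n, sigma=1.0):
    """Two-sided z-test p-value for H0: mean = 0, from the running sum of n values."""
    z = (total / n) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(n_start=30, n_max=200, trials=2000, alpha=0.05, seed=1):
    """Fraction of simulated studies that declare an effect although none exists.

    After the first n_start subjects, the p-value is checked again after every
    extra subject; the study stops as soon as p < alpha or n reaches n_max.
    With n_max == n_start there is a single look, i.e. a study done "by the book".
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        # H0 is true: the measured changes are pure noise with mean 0.
        total = sum(rng.gauss(0.0, 1.0) for _ in range(n_start))
        n = n_start
        while True:
            if p_value(total, n) < alpha:
                false_positives += 1  # "significant", but the effect is not real
                break
            if n >= n_max:
                break  # no effect claimed
            total += rng.gauss(0.0, 1.0)  # add one more test subject
            n += 1
    return false_positives / trials

print("single look: ", false_positive_rate(n_max=30))
print("keep peeking:", false_positive_rate(n_max=200))
```

With a single look the rate stays close to 5 percent; with repeated looks it climbs far above that, and if the peeking were allowed to continue without limit it would eventually reach 100 percent.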

That sounds crazy. If you add test subjects, are you certain to get incorrect results?

“Yes, in some cases. If you do everything by the book, the chance of a false positive is only 5 percent, because the significance threshold is 0.05. But if you do optional stopping, and add a few more people after looking at the results of the first group, this chance increases. Researchers often do not mention that they have done this, or are not even aware that it is not allowed.

“Sometimes scientists deliberately want to do optional stopping. For example, you analyze the data subject by subject and decide after each one whether to continue or stop. That is cheaper and often more ethical, for example when you want to find out whether a vaccine works. If you were to use the p-value there, the chance of a false positive could really reach 100 percent.”

Does this problem not exist with the new e-value that you propose?

“No, with the e-value you can simply do optional stopping. It has also already been used for research into the effectiveness of a vaccine. We also think that the e-value is generally easier to understand than the p-value and will therefore lead to fewer problems.”
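
Why optional stopping is safe with e-values can also be simulated. The sketch below uses the simplest kind of e-value, a likelihood ratio: it compares the hypothesis "the drug lowers blood pressure by half a standard deviation" (an arbitrary, hypothetical choice of effect size) against "no effect", multiplies in one factor per subject, and stops as soon as the running value reaches 20, the threshold De Heide mentions later (20 = 1/0.05). This is a standard textbook construction, not necessarily the specific one De Heide and colleagues use; strictly speaking such a running product is a "test martingale", and the 5 percent guarantee under continuous monitoring comes from Ville's inequality.

```python
import math
import random

def e_value_rejection_rate(true_mean=0.0, delta=0.5, n_max=200,
                           trials=2000, alpha=0.05, seed=2):
    """Fraction of simulated studies that reject H0: mean = 0 by monitoring an e-value.

    The e-value is the likelihood ratio of N(delta, 1) against N(0, 1),
    updated after every test subject; the study stops as soon as it reaches
    1/alpha (= 20 for alpha = 0.05). delta is a hypothetical assumed effect size.
    """
    rng = random.Random(seed)
    log_threshold = math.log(1 / alpha)
    rejections = 0
    for _ in range(trials):
        log_e = 0.0  # the e-value starts at 1 before any data are seen
        for _ in range(n_max):
            x = rng.gauss(true_mean, 1.0)  # one new subject (standardized drop)
            # Multiply by this subject's likelihood ratio exp(delta*x - delta^2/2).
            log_e += delta * x - delta ** 2 / 2
            if log_e >= log_threshold:
                rejections += 1
                break
    return rejections / trials

print("no real effect:", e_value_rejection_rate(true_mean=0.0))
print("real effect:   ", e_value_rejection_rate(true_mean=0.5))
```

Although the e-value is checked after every subject, the false positive rate stays below 5 percent when the drug does nothing, while a real effect is still detected in almost every simulated study.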

How does this e-value work?

“The e-value indicates how much evidence there is against the idea that the effect you want to demonstrate does not exist. For example, if you are researching the drug that should lower blood pressure, the e-value indicates how much evidence there is against the idea that the drug does not lower blood pressure.

“The e-value is a non-negative number that can, in principle, become arbitrarily large. The larger the e-value, the more evidence against the idea that there is no effect. Just as with the p-value, you can set a threshold. If the e-value is greater than 20, you can speak of statistical significance, and in this example you have reason to believe that the drug lowers blood pressure.
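
The threshold of 20 is not arbitrary. Under the standard formal definition of an e-value (which the interview does not spell out), an e-value is a non-negative statistic E whose expected value is at most 1 when there is no effect, the so-called null hypothesis H_0. Markov's inequality then bounds the chance of a false positive:

```latex
E \ge 0, \quad \mathbb{E}_{H_0}[E] \le 1
\;\Longrightarrow\;
P_{H_0}(E \ge 20) \le \frac{\mathbb{E}_{H_0}[E]}{20} \le \frac{1}{20} = 0.05
```

More generally, declaring significance when E is at least 1/α keeps the false positive rate at or below α, so the threshold of 20 plays the same role as the 0.05 cutoff for p-values.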

“A useful feature is that you can also combine e-values. This lets you express how two studies together strengthen the evidence for a hypothesis, simply by multiplying their e-values. If one research group finds an e-value of 5 and the other finds a value of 10, then together they have an e-value of 50. This is not possible with the p-value.”
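
The multiplication rule can be sketched in a few lines of Python. The sketch assumes the two studies are independent, which is the condition under which the product is guaranteed to be an e-value again (the interview does not state it); the function names are hypothetical. For contrast, the last function shows why simply multiplying p-values does not work.

```python
import math

def combine_e_values(e_values):
    """Combine e-values from independent studies of the same null hypothesis.

    Valid because, for independent non-negative E1, E2 with expectation <= 1
    under H0, E[E1 * E2] = E[E1] * E[E2] <= 1, so the product is again an e-value.
    """
    if any(e < 0 for e in e_values):
        raise ValueError("e-values cannot be negative")
    return math.prod(e_values)

def is_significant(e_value, alpha=0.05):
    """Reject H0 when the e-value reaches 1/alpha (20 for alpha = 0.05)."""
    return e_value >= 1 / alpha

def naive_p_product_error(alpha=0.05):
    """False positive rate if two independent p-values were simply multiplied.

    For independent uniform p-values U1, U2 under H0,
    P(U1 * U2 <= t) = t - t * ln(t), which is far above t.
    """
    return alpha - alpha * math.log(alpha)

combined = combine_e_values([5, 10])
print(combined, is_significant(5), is_significant(10), is_significant(combined))  # 50 False False True
print(round(naive_p_product_error(0.05), 3))  # 0.2
```

Neither study is significant on its own, but together they pass the threshold of 20. Treating the product of two p-values as if it were a p-value would, by contrast, push the false positive rate from 5 percent to about 20 percent.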

Correction (February 16, 2024): An earlier version of this article did not explain the e-value properly. This has been corrected above.