New study shows how not to use statistics

In the latest issue of Journal of Organic Production Systems, a long-term toxicology study by the Australian scientist Judy Carman purports to show that genetically modified (GM) feed grains cause severe stomach inflammation in pigs. With help from outlets like Reuters Health, the study’s claims have been disseminated widely and have become fodder for anti-GM activists looking to scare the broader public.

Yet the study has also encountered criticism from the scientific community. At the Biofortified Blog, Anastasia Bodnar argues that the study was invalid because the researchers failed to make sure that the GM and non-GM feed had similar nutrient composition. Swine health management specialist Robert Friendship has also said that the authors were incorrect to use redness as a measure of inflammation.

A number of critics have taken issue with the paper’s use of statistics. Several people have charged the authors with “statistical fishing,” and I plan to write about that charge in the very near future. For now, I want to focus on another criticism which has arisen: that the claimed relationship between inflammation and GM feed was based on an ill-chosen statistical test.

Why do we need statistical tests anyway?

Here it is worth taking a step back and thinking about why we do science. When scientists conduct a study on 145 pigs, chances are they aren’t just interested in those 145 pigs. The goal of an experiment is to help us make predictions – perhaps about pigs in general. To this end, an experiment should be designed to test a hypothesis.

In the Carman study, one hypothesis considered is that a diet of GM corn and soy increases the incidence of severe inflammation in pig stomachs. It should come as no surprise that testing this hypothesis involves feeding GM corn and soy to pigs. However, that alone isn’t enough. If we just leave a few pigs to eat GM corn and soy, and they develop stomach inflammation, we won’t necessarily know that the GM feed was the cause, because we won’t know whether that inflammation would have developed had the pigs eaten non-GM feed. Dealing with this possibility requires two groups of pigs: one is fed GM feed, and the other, called the control, is fed non-GM feed. Aside from the GM or non-GM status of their feed, the two groups of pigs should be as similar as possible so that if the incidence of inflammation in the GM-fed pigs is higher than in the non-GM-fed pigs, we won’t be left wondering whether the difference is due to some other factor.

Looking at the study’s data, we do see a higher incidence of severe stomach inflammation among the GM-fed group than the control (non-GM-fed) group. However, this is still not enough to conclude that GM feed increases the incidence of severe stomach inflammation. To see why, it might help to consider a simple example. Suppose that I have a coin, and I want to know whether my coin is “fair,” that is, whether the chance of it coming up “heads” when flipped is exactly one-half. If I flip my coin twice and it comes up heads both times, I wouldn’t conclude that the coin was unfair and that it would come up heads every time I flipped it. Even the fairest coin will not come up heads exactly half the time in every trial! In fact, even if I flip a fair coin 100 times, it is very likely¹ that it won’t come up heads exactly 50 times. So I should not assume that my coin is unfair just because I get a few more heads than tails. Some noise in experimental data is to be expected, and making sound inferences requires filtering that noise out.
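The coin-flip claim can be checked with a few lines of arithmetic. The sketch below is only an illustration (the function name `prob_exact_heads` is mine, not from any study): it counts the sequences of 100 flips that contain exactly 50 heads and divides by the total number of equally likely sequences.

```python
from math import comb


def prob_exact_heads(flips: int, heads: int) -> float:
    """Probability of exactly `heads` heads in `flips` tosses of a fair coin."""
    # comb(flips, heads) sequences have the right number of heads,
    # out of 2 ** flips equally likely sequences.
    return comb(flips, heads) / 2 ** flips


p_fifty = prob_exact_heads(100, 50)
print(f"P(exactly 50 heads in 100 flips) = {p_fifty:.4f}")
```

The result is just under 8%, so a perfectly fair coin misses the "expected" count of 50 more than nine times out of ten.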

Returning to the Carman study, 9 of 73 pigs (12%) in the control group developed severe stomach inflammation, compared to 23 of 72 (32%) in the GM-fed group. That might seem like a big difference, but it could be that the difference is due to chance and small sample size, rather than a causal relation between GM feed and stomach inflammation. To account for that possibility, scientists perform a test of statistical significance on the experimental data.

Testing for significance

A test for statistical significance starts from the assumption that the factors under consideration are not related. This assumption is called the null hypothesis. The statistical test will tell us whether we should reject the null hypothesis and conclude that there is evidence of a relationship between the factors. In the case of the Carman study, the null hypothesis is that there’s no relationship between severe stomach inflammation and GM feed.

Different tests work in different ways, but it is common to summarize the results with a number called a p-value. The p-value is an approximation of the probability that, under the assumption of the null hypothesis, we’d get data which are at least as “surprising” (i.e. as far from what the null hypothesis predicts) as the observed data². In the case of the Carman study, a small p-value would indicate that the experimental data would be very surprising if there were no relationship between severe inflammation and GM feed. By convention, a relationship is considered statistically significant if the p-value is less than 0.05.
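To make the definition concrete, here is a rough sketch of an exact p-value for the coin example. The 60-heads-in-100-flips figure is a made-up illustration, and the function name is my own. "At least as surprising" is taken to mean "a head count at least as far from half the flips as the one observed."

```python
from math import comb


def two_sided_p_value(flips: int, heads: int) -> float:
    """Exact two-sided p-value under the null hypothesis of a fair coin."""
    observed_gap = abs(heads - flips / 2)
    # Count every sequence whose head count is at least as far from
    # flips / 2 as the observed count.
    surprising = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_gap
    )
    return surprising / 2 ** flips


p_sixty = two_sided_p_value(100, 60)
print(f"p-value for 60 heads in 100 flips: {p_sixty:.4f}")
```

Under these assumptions the p-value for 60 heads lands a little above 0.05, so by the usual convention even a 60/40 split would not count as statistically significant evidence of an unfair coin.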

But there’s a catch. To arrive at a p-value, a test needs to make some assumptions beyond the null hypothesis, and different tests make different assumptions. Generally, a test will at least assume that each observation comes from the same distribution (each GM-fed pig has the same probability of developing any given inflammation level, and likewise for the non-GM-fed pigs), and that the observations are independent of each other (one pig’s chance of having a certain inflammation level doesn’t depend on any other pig’s inflammation level). Many tests will require further assumptions about the distribution underlying individual data points (that is, how likely any pig is to end up at each level of inflammation). Part of choosing a good test is choosing a test whose assumptions are likely to be satisfied by the data under consideration.

Stat fight

Different tests are appropriate for different data, and this is where the controversy has arisen. The inflammation data that Carman and her co-authors collected looked like this³:

| | Non-GM-fed | GM-fed |
|---|---|---|
| Nil inflammation | 4 | 8 |
| Mild inflammation | 31 | 23 |
| Moderate inflammation | 29 | 18 |
| Severe inflammation | 9 | 23 |

Actual stomach inflammation data (Table 3)

To test the severe inflammation data for statistical significance, Carman et al. first condensed the inflammation data into two categories, “Severe inflammation” and “Not severe inflammation,” to get a table that looked like this:

| | Non-GM-fed | GM-fed |
|---|---|---|
| Not severe inflammation | 64 | 49 |
| Severe inflammation | 9 | 23 |

Actual data, with categories combined

Next, they performed a test called the chi-squared test⁴ on this table.
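For readers who want to see the arithmetic, here is a minimal sketch of a Pearson chi-squared test on the combined table. I am assuming the plain version of the test with no continuity correction; the paper's exact settings are not given here, and the function names are my own. For a 2 by 2 table the statistic has one degree of freedom, and its upper-tail probability can be written with the complementary error function.

```python
from math import erfc, sqrt


def pearson_chi_squared(table):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if row and column were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def p_value_one_df(statistic):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return erfc(sqrt(statistic / 2))


# Rows: not severe, severe. Columns: non-GM-fed, GM-fed.
combined = [[64, 49], [9, 23]]
chi2 = pearson_chi_squared(combined)
p = p_value_one_df(chi2)
print(f"chi-squared = {chi2:.2f}, p = {p:.4f}")
```

Under these assumptions the p-value falls between 0.004 and 0.005, well below the conventional 0.05 threshold.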

Andrew Kniss, a University of Wyoming weed scientist, wrote on the Weed Control Freaks blog that he believed that the researchers had not chosen the correct test for the kind of data they were analyzing. His objection was that the choice of test did not account for the fact that the four inflammation categories are ordered. That is, “nil inflammation” is the lowest level of inflammation, “mild inflammation” is second lowest, “moderate inflammation” is third lowest, and “severe inflammation” is the highest level of inflammation. That means the data is very different from, say, brands of socks, which are not ordered in any natural way.

Kniss wrote that he had been taught that for such data, called “ordinal categorical data,” it was better to use the Wilcoxon test or a t-test. Though he later acknowledged that a t-test wasn’t ideal for the data, he noted that neither of these tests showed the results to be statistically significant (the Wilcoxon test yielded a p-value of 0.2081) and concluded that the evidence of harm from GM grain was therefore weak.

Shortly after Kniss posted his critique, a response appeared on the website GMO Judy Carman. The post is attributed to the site’s editors, one of whom is Howard Vlieger, a co-author on the pig paper. Here’s the blog post’s response to Kniss’s comments on the choice of test:

> For example, he acknowledges that the stomach data are categorical in nature.
> But he then suggests using statistical tests that should never be used on categorical data, such as a t-test.
> In order to do that, he had tried to change categorical data into continuous data so that he can apply statistics that are only applicable to continuous data.
> Categorical data are data that fit into categories, such as male / female or pregnant / not pregnant.
> He has tried to turn this sort of data into data that is continuous, like you get with body weight or height.

The argument made here is that the tests Kniss used, the t-test and the Wilcoxon test, would only be suitable if inflammation data were continuous, i.e. represented by decimal numbers rather than names like “moderate inflammation.” Kniss acknowledged in an update to his post that the t-test was not ideal for the data under consideration. As for the Wilcoxon test, it simply isn’t true that it is only for use with continuous data⁵. Instead of requiring continuous data, it requires the data to have an order, so that given two different values, we can say that one is larger than the other.⁶ This clearly is the case for categories like “nil inflammation” and “moderate inflammation.” Rather than trying to “change categorical data into continuous data,” Kniss simply used the natural ordering on the categories.

The blog post at GMO Judy Carman continues:

> This is really bad statistical methodology. It is like taking pregnant / not pregnant data and trying to twist that data into groups that could be described as: pregnant, half pregnant and fully pregnant. And you are right, it doesn’t make sense to even try to do something like that.

This is a bizarre analogy. It is true that the term “half pregnant” makes no sense, but inflammation is different from pregnancy. It does make sense to talk about “nil inflammation,” “mild inflammation,” “moderate inflammation,” and “severe inflammation.” Anybody who objects to these categories should take it up with Carman and her co-authors, because they chose to use the categories in their study!

Why Carman’s test is a poor choice

Recall that the p-value obtained from a test of statistical significance should tell us how likely we are to get data which is at least as “surprising” (under the assumption that there is no relationship between GM feed and inflammation) as our experimental data. There is not a single correct notion of “surprisingness,” and different tests measure the extent to which data is anomalous in different ways. Yet the test used by Carman and her co-authors is particularly problematic.

Because the authors’ test combined categories, it made no distinction between a stomach with nil inflammation and one with moderate inflammation. Their test considers the data they reported to be exactly as surprising as the following fictitious data:

| | Non-GM-fed | GM-fed |
|---|---|---|
| Nil inflammation | 45 | 0 |
| Mild inflammation | 9 | 24 |
| Moderate inflammation | 10 | 25 |
| Severe inflammation | 9 | 23 |

Hypothetical data 1

This is because when you combine the nil, mild, and moderate inflammation categories, you get the same table that came out of combining those categories with the experimental data.
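A short sketch makes the point mechanical. The helper `collapse_to_severe` is a name I made up for illustration; it merges every row except the last into a single “not severe” row, which is the condensing step described earlier.

```python
def collapse_to_severe(table):
    """Merge every row except the last ("severe") into one "not severe" row."""
    *not_severe_rows, severe_row = table
    not_severe = [sum(col) for col in zip(*not_severe_rows)]
    return [not_severe, list(severe_row)]


# Rows: nil, mild, moderate, severe. Columns: non-GM-fed, GM-fed.
actual = [[4, 8], [31, 23], [29, 18], [9, 23]]
hypothetical_1 = [[45, 0], [9, 24], [10, 25], [9, 23]]

print(collapse_to_severe(actual))
print(collapse_to_severe(hypothetical_1))
```

Both calls produce the same 2 by 2 table, so any test that only ever sees the collapsed table must return the same p-value for the two data sets.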

Under these hypothetical data, we see more than twice as many GM-fed pigs as non-GM-fed pigs at each inflammation level other than nil inflammation. It seems intuitive that this is quite unlikely to happen by chance if there’s no relation between GM feed and inflammation. This contrasts with the actual observed data, where GM-fed pigs have a lower incidence of mild and moderate inflammation, and a higher incidence of nil inflammation. If we knew that GM feed and inflammation were unrelated, then we’d be more surprised to see the hypothetical data than the actual data. A good test ought to capture this, but the test used by Carman does not.

One way to improve upon the authors’ test would be simply to refrain from merging three categories into one. The chi-squared test – the test which the authors applied to the 2 by 2 table – can also be applied to larger contingency tables, such as the full table of inflammation data. This test gives a p-value of about 0.01, which is higher than the value of 0.004 reported by Carman but still considered statistically significant.

Even if you don’t combine categories, however, the chi-squared test still doesn’t account for the ordinal nature of the data. That means, for instance, that it does not distinguish between the data reported by Carman and the following (hypothetical) table, which is arrived at by changing the order of the rows of the Carman data:

| | Non-GM-fed | GM-fed |
|---|---|---|
| Nil inflammation | 29 | 18 |
| Mild inflammation | 31 | 23 |
| Moderate inflammation | 4 | 8 |
| Severe inflammation | 9 | 23 |

Hypothetical data 2

It makes sense that these hypothetical data should be more strongly indicative of a relationship between GM feed and inflammation. Unlike the reported data, here we see that GM-fed pigs have higher rates of both severe and moderate inflammation, and a lower incidence of nil inflammation. A statistical test for ordinal categorical data would account for this.

There are numerous tests that work well with ordinal categorical data. These tests can use the fact that (for instance) “severe inflammation” is closer to “moderate inflammation” than it is to “nil inflammation” in deciding how anomalous a particular set of data is. One such test is the Wilcoxon test, which works by analyzing the ranks of the various data points from smallest to largest. As Kniss pointed out, the Wilcoxon test indicated that the difference in inflammation between the two groups was not statistically significant. A more complicated test is ordinal logistic regression, which also shows no statistically significant difference.⁷
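Here is a rough sketch of how a rank-based test of this kind can be computed from the category counts alone. I am assuming the common large-sample version of the Wilcoxon rank-sum test: a normal approximation with a correction for ties and a continuity correction. Kniss's exact software settings are not stated here, so treat those choices, and the function name, as my assumptions.

```python
from math import erfc, sqrt


def wilcoxon_rank_sum_p(counts_a, counts_b):
    """Two-sided p-value (normal approximation) for ordered categories.

    counts_a[i] and counts_b[i] are the two groups' counts in category i,
    with categories listed from lowest to highest.
    """
    n_a, n_b = sum(counts_a), sum(counts_b)
    n = n_a + n_b
    rank_sum_a = 0.0
    tie_term = 0
    ranks_used = 0
    for a, b in zip(counts_a, counts_b):
        tied = a + b
        # Every observation in a category shares the average of the ranks
        # that the category occupies.
        midrank = ranks_used + (tied + 1) / 2
        rank_sum_a += a * midrank
        tie_term += tied ** 3 - tied
        ranks_used += tied
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    # Variance of U under the null hypothesis, reduced to account for ties.
    variance = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    # Continuity correction: shrink the gap by 0.5 toward zero.
    gap = max(abs(u_a - mean_u) - 0.5, 0.0)
    z = gap / sqrt(variance)
    return erfc(z / sqrt(2))


# Counts from nil to severe inflammation.
non_gm = [4, 31, 29, 9]
gm = [8, 23, 18, 23]
p_wilcoxon = wilcoxon_rank_sum_p(non_gm, gm)
print(f"Wilcoxon rank-sum p-value: {p_wilcoxon:.4f}")
```

With these settings the p-value comes out at about 0.21, in line with the figure Kniss reported and nowhere near the 0.05 threshold.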

Conclusion

In short, the trouble with the purported relationship between GM feed and inflammation is that the authors need to have things both ways to claim that it exists. The statement of the claim depends on data expressed in terms of the tiered inflammation scale; it pertains specifically to “severe inflammation.” But to claim that the relationship is statistically significant, the researchers had to ignore a good chunk of the information they had collected. Looking at all of the data and choosing a reasonable test shows that the data are not particularly anomalous under the null hypothesis. The claimed relationship is not statistically significant and does not provide a sound basis for predicting harm to pigs from GM feed.

¹ The chance of getting exactly 50 heads when flipping a fair coin 100 times is less than 8%.
² Note that a p-value doesn’t tell us the probability that the null hypothesis is true! If you want to answer that kind of question, you’ll need to study Bayesian statistics.
³ The authors also reported similar findings for male and female pigs separately (Table 4). Here I refer to the aggregate data in the interest of conciseness. The same things could be said about the data for the male pigs or female pigs separately.
⁴ How exactly the chi square statistic is calculated is beyond the scope of this post, but there are a number of good resources available.
⁵ It is true that an early form of the test, which was introduced by Mann and Whitney in 1947, was only suitable for continuous data. However, the field of statistics has come a long way since 1947.
⁶ The Wilcoxon test is also an example of a nonparametric test, which means that it does not assume a particular distribution behind the data.
⁷ A more detailed treatment of analysis of ordinal categorical data can be found in a book by Alan Agresti which is titled Analysis of Ordinal Categorical Data.

Inexact Change

Thoughts on science, politics, and social progress.
