Our ANOVA gave us strong evidence that the mean cholesterol level of the B6 mice was not the same in all three diet groups. However, we typically want to know more: which groups differ from each other, and by how much. In our example, this means asking which diet groups have different cholesterol levels, and how large the increase or decrease in cholesterol level is.
One approach that might occur to you at this point is to test each pair of diets with a t-test: we could perform a t-test for the HF diet against the Chow diet, a t-test for the LF diet against the Chow diet, and a third t-test for the HF diet against the LF diet. However, there is a fundamental problem with this approach. For each t-test, we would obtain a p-value. If that p-value were below a pre-determined threshold (usually 0.05), we would conclude we had evidence to reject the corresponding null hypothesis. Our false positive rate (the probability that we obtain a false positive result when the null hypothesis is true) is supposed to be controlled by the threshold we set for p. The problem is that by performing multiple t-tests, we greatly increase our chances of at least one false positive. In fact, if there were no difference in cholesterol level between the three diet groups, then treating the three tests as independent, there would be about a 14.3% chance (1 − 0.95³) that at least one of the three p-values would be less than 0.05. In general, if you have six or more groups and test every group against every other group with a t-test, there is a better than 50% chance you will see at least one p-value less than 0.05, even if all the null hypotheses are true. This is an example of the problem of multiple hypothesis testing, which we'll examine in a later section of the course.
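The figures above can be reproduced with a short calculation. The sketch below is illustrative only: the function name `familywise_error_rate` is our own, and it assumes that the pairwise tests are independent, which is only an approximation because pairwise t-tests share groups.

```python
from math import comb

def familywise_error_rate(n_groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one p < alpha).

    Assumes every null hypothesis is true and that the tests are
    independent (an approximation for pairwise t-tests).
    """
    n_tests = comb(n_groups, 2)            # every group against every other group
    fwer = 1 - (1 - alpha) ** n_tests      # 1 - P(no false positives at all)
    return n_tests, fwer

for k in range(3, 8):
    n_tests, fwer = familywise_error_rate(k)
    print(f"{k} groups: {n_tests:2d} tests, P(at least one p < 0.05) = {fwer:.3f}")
```

With three groups this gives 3 tests and a probability of about 0.143; with six groups it gives 15 tests and a probability of about 0.537, which is where the "better than 50%" figure comes from.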
There are various ways to identify which groups differ from which other groups, depending on the exact question you want to ask. These are called post hoc tests, and they typically compute a p-value and a confidence interval for each comparison, adjusted to account for the fact that we are performing multiple comparisons. The p-values and confidence intervals are interpreted family-wise: if we set a threshold for our p-values at 0.05, there is a 0.05 chance of one or more false positives across the whole set of comparisons, if all the null hypotheses are true. Likewise, if we calculate 95% confidence intervals, we are 95% confident that all the intervals simultaneously contain the actual population values of the differences in means.
We'll look at two kinds of post hoc tests: Tukey's Honest Significant Differences (Tukey's HSD), which compares every group to every other group, and Dunnett's Test, which treats one group as a reference/control and compares every other group to it. Because Dunnett's Test makes fewer comparisons, it is less conservative (i.e. it will generally give lower p-values and narrower confidence intervals). On the other hand, a Dunnett's Test may not suit our design or answer our question of interest.
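To make the difference in the number of comparisons concrete, here is a minimal sketch that lists the comparisons each test would make for our three diets. It only enumerates the pairs; it does not run either test. Treating Chow as the control group is an assumption made for illustration.

```python
from itertools import combinations

diets = ["Chow", "HF", "LF"]   # diet groups from the running example
control = "Chow"               # assumption: Chow is the reference/control diet

# Tukey's HSD: every group against every other group
tukey_pairs = list(combinations(diets, 2))

# Dunnett's Test: every non-control group against the control only
dunnett_pairs = [(diet, control) for diet in diets if diet != control]

print("Tukey's HSD comparisons:", tukey_pairs)
print("Dunnett's Test comparisons:", dunnett_pairs)
```

With three groups the difference is small (3 comparisons against 2), but it grows quickly: with k groups, Tukey's HSD makes k(k − 1)/2 comparisons while Dunnett's Test makes k − 1, so six groups would give 15 against 5.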