BMR 617: Statistical Techniques for the Biomedical Sciences

Hypothesis Testing for a Single Proportion

Last time we introduced Hypothesis Testing. The framework is as follows:

We form a hypothesis we want to test. We state this as a competition between two hypotheses:
- The null hypothesis is the "default"; it's what we would believe without any evidence to the contrary.
- The alternative hypothesis is the opposite of the null hypothesis. It's (almost always) the hypothesis for which we want to provide strong evidence.
We collect data from an experiment or study designed to test the hypothesis.
We compute a test statistic from our data. The test statistic is designed to have a known distribution if the null hypothesis is true.
We calculate the probability of obtaining data at least as extreme as the data we actually obtained, were the null hypothesis true. This is called the p-value.
If the p-value is "small", i.e. if data at least as extreme as ours would be very unlikely were the null hypothesis true, we reject the null hypothesis in favor of the alternative hypothesis.

Let's look at a fairly simple case. We'll consider looking at a single proportion and testing if it is the same as, or different to, some fixed value.

Single Proportion Example

In a study of Neonatal Abstinence Syndrome, DNA samples were taken from 34 pregnant women in West Virginia with substance use disorder and sequenced.

For the variant rs6972158 in the gene NPSR1, 27 of the 68 alleles were found to be the variant allele G (41 were the reference allele A).

According to the 1000 Genomes Project, the G allele frequency in the American population is 0.235.

We want to know whether the variant allele frequency in this population of pregnant women in West Virginia with substance use disorder is different from that of the general American population.

Formulating the question as null and alternative hypotheses.

The null hypothesis is the hypothesis we would tend to assume with no other data. The alternative hypothesis is typically the hypothesis we would like to "prove" (or provide evidence to support).

The null hypothesis here is:

The G allele frequency in the study population is 0.235

The alternative hypothesis is:

The G allele frequency in the study population is not 0.235

R code for testing a value of a single proportion

In R, we can test the null hypothesis that a proportion is equal to some given value using the prop.test() function.

To see how to use the function, type ?prop.test in the console, or search for prop.test in the help tab.

Which parameters are required for the prop.test function, and which are optional?

x, n, and p are required; alternative, conf.level, and correct are optional.

x and n are required; p, alternative, conf.level, and correct are optional.

x, n, p, alternative, conf.level, and correct are all required.

x, n, p, alternative, conf.level, and correct are all optional.

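Remember that optional parameters have a default value assigned in the function definition; required parameters do not. In the "Usage" section of the help, x and n have no default, so they are required; p, alternative, conf.level, and correct all have defaults, so they are optional.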
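As a cross-check on the help page, the argument list can also be printed from the console. This is a minimal sketch using base R's args() and formals(); the variable name arg_names is only for illustration.

```r
# Print the signature of prop.test(). Arguments shown without "= value"
# have no default in the function definition.
args(prop.test)

# The argument names alone, in the order they appear in the signature
arg_names <- names(formals(prop.test))
print(arg_names)
```

Arguments that appear with "= value" in the printed signature are the ones that have defaults.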

Read the "Details" section of the help, in particular the third paragraph (in our example, we only have one group). Should we provide a value for p here? If so, what should it be?

We should not provide a value for p

We should provide a value for p of 0.5

We should provide a value for p of 0.235

We should provide a value for p of NULL

The Help says "the null tested is that the underlying probability of success is p." Our null hypothesis is that the probability of a G allele is 0.235, so p should be 0.235.

In our case, we have 27 G alleles out of a total of 68 alleles, and the probability of a G allele under the null hypothesis is 0.235:


prop.test(x = 27, n = 68, p = 0.235)

Run this test. The output you should see is:


	1-sample proportions test with continuity correction

data:  27 out of 68, null probability 0.235
X-squared = 9.053, df = 1, p-value = 0.002623
alternative hypothesis: true p is not equal to 0.235
95 percent confidence interval:
 0.2826780 0.5231249
sample estimates:
        p 
0.3970588

Interpreting the output

The sample estimate (i.e. the estimate of the probability of a G allele from the sample) is 0.397.

You can check that this is just the number of G alleles in the sample divided by the total number of alleles: 27/68.
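One way to run that check is sketched below. The variable names x, n, p_hat, and res are illustrative and do not come from the original notes.

```r
x <- 27  # number of G alleles observed
n <- 68  # total number of alleles (2 per woman, 34 women)

# Sample proportion computed by hand
p_hat <- x / n
print(p_hat)

# The same value is stored in the object returned by prop.test()
res <- prop.test(x = x, n = n, p = 0.235)
print(unname(res$estimate))
```

Both printed values should agree with the "sample estimates" line of the output above.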

The 95% confidence interval for the proportion is [0.283, 0.523].

We are 95% confident that the interval [0.283, 0.523] contains the true proportion of G alleles among pregnant West Virginian women with substance use disorder.
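The interval can also be pulled out of the result object rather than read off the printed output. This is a small sketch; the names res and ci are illustrative.

```r
res <- prop.test(x = 27, n = 68, p = 0.235)

# conf.int holds the lower and upper limits of the interval
ci <- res$conf.int
print(round(as.numeric(ci), 3))

# The confidence level is stored as an attribute of the interval
print(attr(ci, "conf.level"))
```

A different level can be requested through the conf.level argument of prop.test(), which defaults to 0.95.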

The p-value is 0.002623. This means that if the null hypothesis were true, there would be a 0.2623% chance of seeing data at least "this extreme".
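To see where these numbers come from, the test statistic and p-value can be reproduced by hand. The sketch below assumes the continuity-corrected chi-squared statistic for a single proportion, which is what the output title ("with continuity correction") indicates. The variable names are illustrative.

```r
x  <- 27     # observed G alleles
n  <- 68     # total alleles
p0 <- 0.235  # G allele frequency under the null hypothesis

expected <- n * p0  # expected number of G alleles if the null is true

# Continuity correction: 0.5, but never more than the observed deviation
yates <- min(0.5, abs(x - expected))

# Chi-squared statistic with 1 degree of freedom
x_squared <- (abs(x - expected) - yates)^2 / (n * p0 * (1 - p0))

# p-value: upper tail of the chi-squared distribution with df = 1
p_value <- pchisq(x_squared, df = 1, lower.tail = FALSE)

print(c(X_squared = x_squared, p_value = p_value))

# Compare with what prop.test() reports
res <- prop.test(x = x, n = n, p = p0)
print(c(X_squared = unname(res$statistic), p_value = res$p.value))
```

The two printed lines should match each other and the X-squared and p-value shown in the output above.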

That is, if the proportion of G alleles among pregnant West Virginia women with SUD were 0.235, there would be a 0.2623% chance of seeing a sample proportion at least this far from 0.235 in a study of 34 women (68 alleles) from this population.
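This "if the null were true" reasoning can be illustrated with a simulation. The sketch below is not what prop.test() does internally: it assumes the 68 alleles are independent draws with G probability 0.235, repeats the study many times, and counts how often the result is at least as far from expectation as the 27 observed. The seed, number of replicates, and variable names are arbitrary choices for illustration.

```r
set.seed(617)  # arbitrary seed, for reproducibility only

n_alleles <- 68
p0        <- 0.235
observed  <- 27
expected  <- n_alleles * p0

# Simulate the number of G alleles in 100,000 studies under the null
sim_counts <- rbinom(100000, size = n_alleles, prob = p0)

# Fraction of simulated studies at least as far from expectation as ours
p_sim <- mean(abs(sim_counts - expected) >= abs(observed - expected))
print(p_sim)
```

The simulated fraction will not equal 0.002623 exactly, because prop.test() uses a chi-squared approximation with a continuity correction, but it should be of the same order of magnitude.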

Interpreting the result

Since the p-value of 0.002623 is less than our predetermined threshold of 0.05, we would reject the null hypothesis and conclude that the proportion of G alleles in this population is different to 0.235.

Since the entire 95% confidence interval lies above 0.235, we'd conclude that the proportion is greater than 0.235.
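Both checks can be written out explicitly. This is a minimal sketch; alpha, reject_null, and ci_above_null are illustrative names, and 0.05 is the threshold used in the text.

```r
alpha <- 0.05   # significance threshold
p0    <- 0.235  # null value of the G allele frequency

res <- prop.test(x = 27, n = 68, p = p0)

# Reject the null hypothesis if the p-value is below the threshold
reject_null <- res$p.value < alpha

# Check whether the whole confidence interval lies above the null value
ci_above_null <- res$conf.int[1] > p0

print(c(reject_null = reject_null, ci_above_null = ci_above_null))
```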

Some possible explanations:

- The allele frequency in West Virginians is greater than that of the general US population.
- The allele frequency among women likely to become pregnant is greater than that of the general US population.
- The allele frequency among those with SUD is greater than that of the general US population.
- Some combination of the above.