A statistical distribution is a rule (or function) that describes the probability that a variable takes on its possible values.
The prevalence of diabetes (type I or type II) in the US is 10.5%. If the variable X is the diabetes status of a randomly selected person in the US, then X has two possible values: "diabetic", and "not diabetic". The probability distribution is simply \[ P(X=\text{diabetic}) = 0.105\] \[ P(X=\text{not diabetic}) = 0.895 \]
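As a minimal sketch, the distribution above can be written in R as a named vector of probabilities. The object names `diabetes_dist` and `status`, the seed, and the simulated sample size are illustrative choices, not part of the course materials.

```r
# Distribution of diabetes status as a named vector (probabilities from the text)
diabetes_dist <- c("diabetic" = 0.105, "not diabetic" = 0.895)

# The probabilities of all possible values must add up to 1
sum(diabetes_dist)

# Simulate 10,000 randomly selected people (the sample size is an arbitrary choice)
set.seed(1)
status <- sample(names(diabetes_dist), size = 10000, replace = TRUE,
                 prob = diabetes_dist)

# The observed proportions should be close to the probabilities above
prop.table(table(status))
```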
In the BMI example above, we made X into a discrete (categorical) variable by breaking it into (essentially arbitrary) ranges. In reality, BMI is a continuous, quantitative variable which can take on any positive value, not necessarily a whole number. In this situation we can no longer talk about the "probability a variable is equal to" some particular number. Why not?
Consider trying to establish if an individual's BMI is equal to 26. BMI is weight (in kilograms) divided by the square of height (in meters). In theory, we could measure both of these quantities to an arbitrary degree of precision. So while an individual might have a BMI of 26 when rounded to the nearest whole number, and maybe even 26.0 when rounded to one decimal place, if we increase the precision enough we will find some difference, no matter how small, between the individual's BMI and the value 26. Consequently, the probability their BMI is exactly equal to 26 is mathematically zero. The same is true for any other value to which we want to compare their BMI!
Instead, we can talk about the probability that an individual's BMI is in some range of values. In this case, the graph of the distribution is interpreted as having an area in any given range equal to the probability the variable lies in that range.
We've already seen that the probability the BMI lies in the range 25 to 30 is 0.34 (i.e. 34%), so the distribution
of BMI as a continuous variable might look like:
Thinking of the probability that the BMI lies in a range, instead of being exactly equal to some number, also tells us what "equal" means at a given precision. For example, the probability that an individual's BMI is equal to 26 when rounded to the nearest whole number is the probability that their BMI lies in the range 25.5 < BMI < 26.5. Similarly, the probability that it is equal to 26.0 to one decimal place is the probability that it lies in the range 25.95 < BMI < 26.05.
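To make the idea of "probability as area over a range" concrete, here is a rough sketch that assumes, purely for illustration, that BMI follows a normal distribution with mean 27 and standard deviation 5. Those parameters and the helper `prob_in_range` are hypothetical; they are not estimated from any data and will not reproduce the 0.34 figure quoted earlier.

```r
# Hypothetical parameters, chosen only for illustration (not estimated from data)
bmi_mean <- 27
bmi_sd <- 5

# P(lower < BMI < upper) is the area under the curve between the two limits,
# computed as the difference of two cumulative probabilities
prob_in_range <- function(lower, upper) {
  pnorm(upper, mean = bmi_mean, sd = bmi_sd) -
    pnorm(lower, mean = bmi_mean, sd = bmi_sd)
}

prob_in_range(25.5, 26.5)    # "26 to the nearest whole number"
prob_in_range(25.95, 26.05)  # "26.0 to one decimal place"
prob_in_range(26, 26)        # exactly 26: a range of zero width has zero area
```

As the range narrows around 26, the probability shrinks toward zero, which matches the argument above.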
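When working with quantitative data, it's useful to be able to provide some brief summaries of a data set which describe its distribution. Typically we want to know what a typical value looks like (the central tendency) and how widely the values are spread around it.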
Computing some kind of average is really a way of finding "the most typical" value in our data set. For quantitative data, there are two averages we typically compute: the mean and the median.
We’ll load a data set that we’ll use frequently throughout this course.
This data set comprises metabolic data from a mouse experiment in Dr. Kim’s lab. Two strains of mouse, C57BL/6 ("B6") and Tallyho ("TH"), were fed three different diets: standard chow ("Chow", the control), a low-fat, high-calorie diet ("LF"), and a high-fat diet ("HF").
Various metabolic measurements were taken from each mouse after 16 weeks on the diet.
We will use the tidyverse library throughout the course to manage our data.
Remember tidyverse is a package (a collection of functions). Packages must
be installed only once, with install.packages("tidyverse").
We did that last time, so there is no need to do it again (unless you reinstalled R since the last class).
Packages must be loaded once for each R session with the library function:
library(tidyverse)
The data are stored in a comma-separated value (CSV) file.
Once you have loaded tidyverse, you can load the data with the read_csv function:
met <- read_csv("https://denvirlab.marshall.edu/BMR617-2021/data/TH-B6-metabolic.csv")
This will read the CSV file, divide it up into columns at each comma, and rows at each newline,
creating a table. We called the R object holding the table met (of course, you
can call it anything you like). You will now see met in the Environment tab:
Click on met in the Environment tab to view the data:
Note that the MouseID column contains the strain (TH or B6), diet (Chow, HF, or LF),
along with a mouse id. We'd like to have access to the strain and diet in a more convenient way.
We can separate that column out into three columns using the tidyverse separate
function. We need to specify what character to use to separate out the new columns (here it is -),
and what the new columns should be called. We do this with:
met <- separate(met, MouseID, sep="-", into=c("Strain", "Diet", "Id"))
Run that command and view the data again.
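To see what separate does without downloading anything, here is a small sketch on a made-up table. The table `toy`, its `Weight` column, and the ID values are hypothetical; they only mimic the Strain-Diet-Id pattern described above and are not taken from the real file.

```r
library(tidyr)  # separate() lives in tidyr, which library(tidyverse) also loads

# Made-up table; the IDs only mimic the Strain-Diet-Id pattern described above
toy <- data.frame(
  MouseID = c("TH-Chow-1", "B6-HF-2", "TH-LF-3"),
  Weight = c(30.1, 41.5, 28.7)
)

# Split MouseID at each "-" into three new columns, replacing the original column
toy_split <- separate(toy, MouseID, sep = "-", into = c("Strain", "Diet", "Id"))
toy_split
```

Note that the new columns hold text, so Id is stored as characters such as "1" rather than as numbers.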
Let’s focus on just one group of mice. We can do this by filtering the data with filter, another tidyverse function:
th_chow <- filter(met, Strain=="TH" & Diet=="Chow")
Explain this code
The filter function takes a data table (met in our case)
and a condition. It will create a new data table containing only the rows
in the original table for which the condition is true.
The condition in our case is
Strain=="TH" & Diet=="Chow"
There are two things to explain here. The == (a double equals sign)
is a comparison for equality. It is true if the left hand side is equal to
the right hand side, and false otherwise. Don't confuse this with the single equals
sign =, which at the top level means the same as <- (i.e. it assigns a
value to an object). Inside a function call, as in sep="-", a single = gives a value to a named argument.
The other operator is &, which is "and". It results in a true value
only if the left hand side and the right hand side are both true.
So our filter gets only the rows which have "TH" for Strain and "Chow" for Diet, and
puts them in a new data table which we called th_chow.
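The two operators can be tried on their own, outside filter. This sketch uses two short made-up vectors (`strain` and `diet` are illustrative names and values, not the real data) to show that each comparison is evaluated element by element:

```r
# Made-up values, for illustration only
strain <- c("TH", "TH", "B6", "B6")
diet <- c("Chow", "HF", "Chow", "HF")

strain == "TH"   # TRUE TRUE FALSE FALSE
diet == "Chow"   # TRUE FALSE TRUE FALSE

# & is TRUE only where both sides are TRUE
keep <- strain == "TH" & diet == "Chow"
keep             # TRUE FALSE FALSE FALSE
```

filter keeps exactly the rows where a condition like this evaluates to TRUE.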
Run the code and view the new data table th_chow. Make sure it has only the expected rows.
We can “pull” the cholesterol values from this filtered table:
th_chow_chol <- pull(th_chow, Cholesterol)
Find th_chow_chol in the Environment tab. What data type is this? Is that what you
expected?
To find the mean cholesterol for this group, use
mean(th_chow_chol)
To find the median, use
median(th_chow_chol)
Repeat the previous steps to find the mean and median Cholesterol for the TH HF group.
Which measure of central tendency should we use? Mean or median?
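One thing that may inform the choice is how each measure reacts to an extreme value. The following is a minimal sketch on made-up numbers (not the mouse data):

```r
# Made-up values, not taken from the mouse data
values <- c(100, 105, 110, 115, 120)
mean(values)     # 110
median(values)   # 110

# Replace the largest value with an extreme one
values_outlier <- c(100, 105, 110, 115, 400)
mean(values_outlier)     # 166: pulled upward by the single extreme value
median(values_outlier)   # 110: unchanged
```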
As well as knowing what a “typical” value in our data set looks like, we should also ask how representative this typical value is of the data set.
To do this, we can measure the “spread” of the data: how far the values in the data set typically lie from our average. There are two ways to do this: the standard deviation and the interquartile range (IQR).
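The standard deviation is computed by taking the sum of the squares of the differences between each data point and the mean, dividing by one less than the number of data points, and then taking the square root. (Dividing by n - 1 rather than n gives the sample standard deviation, which is what R computes.)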
In R, we can do
sd(th_chow_chol)
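The calculation can be checked by hand. This sketch uses a small made-up vector `x` (not the cholesterol data) and shows that R's sd divides by n - 1, the sample standard deviation, rather than by n:

```r
# Made-up values, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Squared difference between each data point and the mean
squared_diffs <- (x - mean(x))^2

# Sample standard deviation: divide by n - 1, then take the square root
manual_sd <- sqrt(sum(squared_diffs) / (n - 1))
manual_sd
sd(x)            # same value as manual_sd

# Dividing by n instead gives the (slightly smaller) population version
population_sd <- sqrt(sum(squared_diffs) / n)
population_sd    # 2
```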
The interquartile range is the difference between the 75th percentile and the 25th percentile of the data, so it measures the width of the middle half of the values.
Experiment with the following in R:
quantile(th_chow_chol, 0.25)
quantile(th_chow_chol, 0.75)
IQR(th_chow_chol)
Note that the value returned by IQR is the difference between the two quantile values above: the 75th percentile minus the 25th percentile.
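The same relationship can be verified on a small made-up vector (the values in `x` are illustrative only, and the results rely on R's default quantile method):

```r
# Made-up values, for illustration only
x <- c(1, 3, 5, 7, 9, 11, 13, 15, 17)

q1 <- quantile(x, 0.25)   # 25th percentile: 5
q3 <- quantile(x, 0.75)   # 75th percentile: 13

# The interquartile range is the distance between the two quartiles
unname(q3 - q1)   # 8
IQR(x)            # 8
```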
Repeat these calculations for the TH mice fed the HF diet.