BMR 617: Statistical Techniques for the Biomedical Sciences

Exploring data and relationships between variables: Review and Summary

This review includes statistical concepts and R commands from the following lectures:

Types of Variables
Exploring Distributions, Averages, and Spread
The Normal Distribution
Data Wrangling
Introduction to Graphing with ggplot2
Relative and Attributable Risks
Correlation

Types of Data

Categorical (C) data

Nominal (N): have no order
Ordinal (O): have order; no scale

Quantitative (Q) data

Interval (I): have order and scale; no meaningful zero
Ratio (R): have order, scale, and a meaningful zero

Remember the NOIR mnemonic: Nominal, Ordinal, Interval, Ratio.
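As a rough sketch, the four types might be represented in R as follows. The variable names and values are made up for illustration, and the use of factors here is an assumption about how you might choose to store categorical data, not something these notes prescribe.

```r
# Nominal: categories with no order (hypothetical mouse strains)
strain <- factor(c("B6", "TH", "B6"))

# Ordinal: categories with an order but no scale (hypothetical ratings)
pain <- factor(c("mild", "severe", "moderate"),
               levels = c("mild", "moderate", "severe"),
               ordered = TRUE)

# Interval: order and scale, but zero is arbitrary (temperature in Celsius)
temp_c <- c(36.5, 37.2, 38.1)

# Ratio: scale with a meaningful zero (body weight in grams)
weight_g <- c(24.1, 30.5, 27.8)

class(strain)    # "factor"
class(pain)      # "ordered" "factor"
class(temp_c)    # "numeric"
class(weight_g)  # "numeric"

# Order comparisons are defined for ordinal data
pain[2] > pain[1]   # TRUE: "severe" ranks above "mild"
```

Note that R stores interval and ratio data the same way (as numbers), so it is up to you to remember whether zero is meaningful.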

Roles of Variables

Response Variables

Variables that are the particular focus of a question in the study or experiment.

Explanatory variables

Variables that predict or explain changes in the response variable.

The role of a variable is defined by the experimental design.

The type and role of variables determine much of the analysis.

Relationships between Two Variables

Recall the four possible relationships between two variables.

Categorical explanatory variable, quantitative response : C-Q
Categorical explanatory variable, categorical response : C-C
Quantitative explanatory variable, quantitative response : Q-Q
Quantitative explanatory variable, categorical response : Q-C

C-Q Case

Explanatory variable is categorical (C) and the response variable (outcome) is quantitative (Q).

Example: mouse data

Categorical explanatory variables: mouse strain [C57BL/6 (B6) or TALLYHO (TH)], and diet (Chow, High-fat, Low-fat high calorie)
Quantitative response variables: cholesterol level, fat mass, body weight, triglyceride level, glucose level

Visualize the data using box plots, column scatter plots, or bar charts.

Compute the mean or median, along with measures of spread, such as standard deviation, standard error of the mean, or interquartile range.
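The summaries above can be sketched with group_by() and summarize(). The data frame below uses made-up values for illustration only (it is not the course mouse data set), and the column names are assumptions.

```r
library(dplyr)  # part of tidyverse; library(tidyverse) also works

# Made-up values for illustration only (not the course data set)
mice <- data.frame(
  Strain = c("B6", "B6", "B6", "TH", "TH", "TH"),
  Cholesterol = c(100, 110, 120, 150, 160, 170)
)

# One row of summaries per level of the categorical explanatory variable
chol_summary <- mice %>%
  group_by(Strain) %>%
  summarize(
    count = n(),
    mean_chol = mean(Cholesterol),
    median_chol = median(Cholesterol),
    sd_chol = sd(Cholesterol),
    sem_chol = sd(Cholesterol) / sqrt(n()),  # standard error of the mean
    iqr_chol = IQR(Cholesterol)
  )

chol_summary
```

The standard error of the mean is computed here as the standard deviation divided by the square root of the group size.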

C-C Case

Explanatory and response variables are both categorical (C).

Example: vaccine trial

Categorical explanatory variable: treatment (placebo and vaccine)
Categorical response variable: SARS-CoV-2 status (infected or not infected)

We usually present the data using a contingency table.

We typically compute the risk for each group and the relative risk.

Other measures include the attributable risk and number needed to treat (NNT).
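As a minimal sketch, these measures can be computed from a contingency table in base R. The counts are made up for illustration only (they are not results from any real trial). The direction of the comparison (vaccine relative to placebo) is an assumption, so check it against the convention used in lecture.

```r
# Made-up counts for illustration only (not results from any real trial)
infected     <- c(placebo = 100, vaccine = 10)
not_infected <- c(placebo = 9900, vaccine = 9990)

# Contingency table: rows are treatment groups, columns are outcomes
trial <- cbind(infected, not_infected)
trial

# Risk in each group = infected / total in that group
risk <- trial[, "infected"] / rowSums(trial)

# Relative risk: risk in the vaccine group relative to the placebo group
relative_risk <- risk[["vaccine"]] / risk[["placebo"]]

# Attributable risk (risk difference): placebo risk minus vaccine risk
attributable_risk <- risk[["placebo"]] - risk[["vaccine"]]

# Number needed to treat = 1 / attributable risk
nnt <- 1 / attributable_risk

round(risk, 4)               # placebo 0.010, vaccine 0.001
round(relative_risk, 2)      # 0.1
round(attributable_risk, 4)  # 0.009
round(nnt, 1)                # 111.1
```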

Q-Q Case

Explanatory and response variables are both quantitative (Q).

Example: insulin sensitivity data

We usually present the data using a scatterplot.

We compute the correlation coefficient.

We will also discuss linear regression later in the course.
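A short sketch of computing the correlation coefficient with cor(). The paired values below are made up for illustration and are not the insulin sensitivity data.

```r
# Made-up paired measurements for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Pearson correlation coefficient (the default method for cor)
r <- cor(x, y)
round(r, 3)

# Correlation is symmetric: swapping the variables gives the same value
all.equal(cor(x, y), cor(y, x))   # TRUE
```

A value of r close to 1 indicates a strong positive linear relationship, as in this deliberately near-linear example.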

Distributions

A distribution describes the probability a variable will take on a value, or a range of values.

The "Normal" or "Gaussian" distribution is a particular distribution with "nice" mathematical properties:

Symmetric about the mean
Unimodal
Determined entirely by the mean and standard deviation
Software (or statistical tables) can be used to find the probability a normally-distributed value takes on any given range of values
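For example, pnorm() can be used to find such probabilities. The mean and standard deviation below are hypothetical values chosen for illustration.

```r
# Hypothetical normally distributed variable: mean 100, standard deviation 15
mu <- 100
sigma <- 15

# P(X <= 115): pnorm returns the area to the left of the given value
p_below <- pnorm(115, mean = mu, sd = sigma)

# P(X > 115): the complement of the area to the left
p_above <- 1 - p_below

# P(85 < X < 115): subtract two cumulative probabilities
p_between <- pnorm(115, mean = mu, sd = sigma) - pnorm(85, mean = mu, sd = sigma)

round(c(p_below, p_above, p_between), 4)
# 0.8413 0.1587 0.6827
```

The last value reflects the familiar result that about 68% of a normal distribution lies within one standard deviation of the mean.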

Summary of R Commands

Let's recap all the R and tidyverse commands we've seen. You can always check these commands in the help in RStudio if you want the full details.

R commands

rep: Replicates the elements of vectors and lists.
rnorm: Gives random values from the normal distribution.
class: Used to ask what type a variable is.
install.packages(): Download and install packages from CRAN-like repositories or from local files. We used this to install tidyverse.
library(): Used for loading/attaching and listing of packages. We use this to load tidyverse.
sum: Calculates the sum of all the values present in its arguments.
round: Rounds the values in its first argument to the specified number of decimal places (default is 0).

cor: Calculates the correlation coefficient.
getwd(): Obtains the absolute file path of the current working directory of the R process.
setwd(): Sets the working directory.
dir(): Lists the files in your working directory or in a named directory.
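A brief sketch of several of these base R commands in use. All values are made up for illustration, and the seed is arbitrary.

```r
# rep: repeat values to build a vector of group labels
groups <- rep(c("placebo", "vaccine"), times = 3)
groups          # "placebo" "vaccine" "placebo" "vaccine" "placebo" "vaccine"

# class: check what type of object we have
class(groups)   # "character"

# rnorm: draw 5 random values from a normal distribution
set.seed(617)   # arbitrary seed so the draw is repeatable
values <- rnorm(5, mean = 10, sd = 2)
class(values)   # "numeric"

# sum and round
total <- sum(c(1.25, 2.5, 3.125))   # 6.875
round(total)                        # 7 (default is 0 decimal places)
round(total, 2)                     # 6.88

# getwd and dir: where am I, and which files are here?
wd <- getwd()
files <- dir(wd)
# setwd("some/folder") would change the working directory (path is hypothetical)
```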

Tidyverse commands

read_csv(filename): Reads a comma-separated value file, and creates a table from it. Note the filename can be a file on your computer, or a resource from the web.
separate(table, column, sep, into): Separates a column in a table into multiple columns, using sep as the separator character and into as the names of the new columns.
%>%: "And then". Take the result of the previous operation and pass it into the next operation. This is also called a "pipe".
group_by(table, columns): Adds groups to a data table. The groups should be variables (columns) in the table. This will not change any data, but will change how other functions, such as summarize(), behave.
summarize(table, summaryFunctions): Computes summaries of the data in the entire table. If the table has Groups, it will compute the summaries for each group. The special function n() just counts the number of rows in the table or in each group. You can also perform functions on the columns, e.g. mean(Cholesterol) will compute the mean of the Cholesterol column for the table or for each Group.
filter(table, condition): Creates a new table with only the rows from the original table for which the condition is true. Remember that you can "pull" the values from the filtered table by specifying the column, i.e., pull(table, column_name).
write_csv(table, filename): Writes the data in the table to the file, in comma separated value format.
ggplot(dataTable, aes(x=..., y=...)): Starts a plot using the ggplot2 library, which is part of tidyverse. Add layers by using geom functions.
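The sketch below chains several of these commands together with the pipe. The data frame holds made-up values for illustration only (not the course data set), and the column names are assumptions.

```r
library(dplyr)    # part of tidyverse
library(ggplot2)  # part of tidyverse

# Made-up values for illustration only (not the course data set)
mice <- data.frame(
  Strain = c("B6", "B6", "B6", "TH", "TH", "TH"),
  Diet = c("Chow", "High-fat", "Chow", "Chow", "High-fat", "High-fat"),
  Weight = c(25, 32, 27, 30, 41, 39)
)

# filter then pull: keep the TH rows, then extract one column as a vector
th_weights <- mice %>%
  filter(Strain == "TH") %>%
  pull(Weight)
th_weights   # 30 41 39

# group_by then summarize: count the rows in each Strain/Diet combination
counts <- mice %>%
  group_by(Strain, Diet) %>%
  summarize(count = n(), .groups = "drop")
counts

# ggplot: start the plot, then add a layer with a geom function
p <- ggplot(mice, aes(x = Strain, y = Weight)) +
  geom_boxplot()
p   # printing the object draws the plot
```

Loading the whole collection with library(tidyverse) works equally well; the two packages are loaded individually here only to show where each function comes from.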

Averages and spread

mean: Calculates the arithmetic mean.
median: Finds the median.
sd: Calculates the standard deviation.
quantile(pulled_filtered_table, probs): Produces sample quantiles corresponding to the given probabilities, which must be values between 0 and 1 (e.g., 0.25 for the 25th percentile).
IQR(pulled_filtered_table): Computes interquartile range of x values, i.e., quantile(x, 3/4) - quantile(x, 1/4).
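pnorm: Calculates cumulative probabilities for the normal distribution, i.e., the probability that a normally distributed value is less than or equal to a given value.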
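A quick sketch of the average and spread functions applied to a single vector. The values are made up for illustration. In practice the vector would typically come from pull() on a filtered table.

```r
# Made-up measurements for illustration only
x <- c(2, 4, 4, 5, 7, 9, 10)

mean(x)                   # arithmetic mean
median(x)                 # 5
sd(x)                     # sample standard deviation
sd(x) / sqrt(length(x))   # standard error of the mean

# quantile takes probabilities between 0 and 1, not percentages
q <- quantile(x, c(0.25, 0.75))
q                         # 25%: 4, 75%: 8

# IQR is the difference between the third and first quartiles
IQR(x)                    # 4
```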