BMR 617: Statistical Techniques for the Biomedical Sciences

Exploring data and relationships between variables: Review and Summary

This review includes statistical concepts and R commands from the following lectures:

Types of Data

Categorical (C) data

Quantitative (Q) data

Do not forget the NOIR mnemonic: Nominal, Ordinal, Interval, Ratio.

Roles of Variables

Response Variables

Explanatory Variables

The role of a variable is defined by the experimental design.

The type and role of variables determine much of the analysis.

Relationships between Two Variables

Recall the four possible relationships between two variables.

C-Q Case

The explanatory variable is categorical (C) and the response variable (outcome) is quantitative (Q).

Example: mouse data

Visualize the data using box plots, column scatter plots, or bar charts.

Compute the mean or median, along with measures of spread, such as standard deviation, standard error of the mean, or interquartile range.
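
As a minimal sketch of the C-Q summaries above, the following R code computes an average and several measures of spread for two groups. The weights and the names control and treated are made up for illustration; they are not the mouse data from the lectures.

```r
# Hypothetical weights for two groups of mice (illustrative values only)
control <- c(20, 22, 21, 23, 24)
treated <- c(25, 27, 26, 28, 29)

# Averages
mean_control   <- mean(control)
median_control <- median(control)

# Spread: standard deviation, standard error of the mean, interquartile range
sd_control  <- sd(control)
sem_control <- sd_control / sqrt(length(control))  # SEM = sd / sqrt(n)
iqr_control <- IQR(control)

# The same summaries for the second group
mean_treated <- mean(treated)
sem_treated  <- sd(treated) / sqrt(length(treated))

print(round(c(mean = mean_control, median = median_control,
              sd = sd_control, sem = sem_control, iqr = iqr_control), 2))
```

Note that R has no built-in function for the standard error of the mean, so the sketch computes it from sd and the sample size.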

C-C Case

Explanatory and response variables are both categorical (C).

Example: vaccine trial

We usually present the data using a contingency table.

We typically compute the risk for each group and the relative risk.

Other measures include the attributable risk and number needed to treat (NNT).
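
The C-C measures above can be sketched from a 2x2 contingency table as follows. The counts are invented for illustration and are not the vaccine trial data from the lectures. The sketch assumes attributable risk means the difference between the two risks, and NNT its reciprocal.

```r
# Hypothetical counts (illustrative only): rows are groups, columns are outcomes
trial <- matrix(
  c(10, 990,    # vaccine: infected, not infected
    40, 960),   # placebo: infected, not infected
  nrow = 2, byrow = TRUE,
  dimnames = list(group   = c("vaccine", "placebo"),
                  outcome = c("infected", "not_infected"))
)

# Risk = proportion infected within each group
risk <- trial[, "infected"] / rowSums(trial)

# Relative risk: risk in the vaccine group relative to the placebo group
relative_risk <- risk[["vaccine"]] / risk[["placebo"]]

# Attributable risk (assumed here to be the absolute difference in risk)
attributable_risk <- risk[["placebo"]] - risk[["vaccine"]]

# Number needed to treat = 1 / attributable risk
nnt <- 1 / attributable_risk

print(round(c(relative_risk = relative_risk,
              attributable_risk = attributable_risk,
              nnt = nnt), 3))
```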

Q-Q Case

Explanatory and response variables are both quantitative (Q).

Example: insulin sensitivity data

We usually present the data using a scatterplot.

We typically compute the correlation coefficient.
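
As a small illustration of the Q-Q case, the sketch below computes a correlation coefficient with cor. The vectors x and y are made-up values, not the insulin sensitivity data.

```r
# Hypothetical paired measurements (illustrative only)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# cor() returns the Pearson correlation coefficient by default
r <- cor(x, y)

print(round(r, 3))
```

A scatterplot of the same vectors could be drawn with plot(x, y) in base R, or with ggplot and geom_point.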

Distributions

A distribution describes the probability a variable will take on a value, or a range of values.

The "Normal" or "Gaussian" distribution is a particular distribution with "nice" mathematical properties:

Summary of R Commands

Let's recap all the R and tidyverse commands we've seen. You can always check these commands in the help in RStudio if you want the full details.

R commands

rep
Replicates the elements of vectors and lists.
rnorm
Gives random values from the normal distribution.
class
Used to ask what type a variable is.
install.packages()
Download and install packages from CRAN-like repositories or from local files. We used this to install tidyverse.
library()
Used for loading/attaching and listing of packages. We use this to load tidyverse.
sum
Calculates the sum of all the values present in its arguments.
round
Rounds the values in its first argument to the specified number of decimal places (default is 0).
cor
Calculates the correlation coefficient between two variables (Pearson's r by default).
getwd()
Obtains the absolute filepath representing the current working directory of the R process.
setwd()
Sets the working directory.
dir()
Lists the files in your working directory or in a named directory.
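
A few of the base R commands above in action, as a rough sketch. The values are arbitrary, and set.seed (not listed above) is added only so that the random draws are reproducible.

```r
set.seed(1)  # assumption: fixed seed so rnorm gives the same draws each run

# rep: repeat the elements of a vector
labels <- rep(c("a", "b"), times = 3)

# rnorm: five random values from a normal distribution with mean 0 and sd 1
draws <- rnorm(5, mean = 0, sd = 1)

# class: ask what type each variable is
class_labels <- class(labels)  # "character"
class_draws  <- class(draws)   # "numeric"

# sum and round
total      <- sum(c(1.234, 2.345))
total_1dp  <- round(total, 1)  # one decimal place
total_0dp  <- round(total)     # default: zero decimal places

# getwd: the current working directory, as a character string
wd <- getwd()

print(labels)
print(c(total = total, one_decimal = total_1dp, whole = total_0dp))
```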

Tidyverse commands

read_csv(filename)
Reads a comma-separated value file, and creates a table from it. Note the filename can be a file on your computer, or a resource from the web.
separate(table, column, sep, into)
Separates a column in a table into multiple columns, using sep as the separator character.
%>%
"And then". Take the result of the previous operation and pass it into the next operation. This is also called a "pipe".
group_by(table, columns)
Adds groups to a data table. The groups should be variables (columns) in the table. This will not change any data, but it will change how other functions, such as summarize(), behave.
summarize(table, summaryFunctions)
Computes summaries of the data in the entire table. If the table has groups, it will compute the summaries for each group. The special function n() counts the number of rows in the table or in each group. You can also apply functions to the columns, e.g. mean(Cholesterol) will compute the mean of the Cholesterol column for the table or for each group.
filter(table, condition)
Creates a new table with only the rows from the original table for which the condition is true. Remember that you can "pull" the values from the filtered table by specifying the column, i.e., pull(table, column_name).
write_csv(table, filename)
Writes the data in the table to the file, in comma separated value format.
ggplot(dataTable, aes(x=..., y=...))
Starts a plot using the ggplot2 library, which is part of tidyverse. Add layers by using geom functions.
Configure axis labels with xlab and ylab.
Manipulate axis ticks and tick labels with theme.
Configure the title with ggtitle.
Change color using scale_fill_xxx functions. The functions position_dodge and position_jitterdodge are very useful in "dodging" and adding "jitter" to your plots.
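
The tidyverse commands above can be chained together with the pipe. The following is a minimal sketch on a made-up table; the table name patients and its values are assumptions for illustration. It loads dplyr and ggplot2 directly, both of which are also attached by library(tidyverse).

```r
library(dplyr)    # group_by, summarize, filter, pull, %>%
library(ggplot2)  # ggplot, geom_boxplot, xlab, ylab, ggtitle

# Hypothetical table (illustrative values only)
patients <- tibble(
  Group       = c("control", "control", "control",
                  "treated", "treated", "treated"),
  Cholesterol = c(200, 210, 220, 180, 190, 200)
)

# Group the table, then summarize each group
by_group <- patients %>%
  group_by(Group) %>%
  summarize(n = n(), mean_chol = mean(Cholesterol))

# Keep only the treated rows, then pull one column out as a plain vector
treated_values <- patients %>%
  filter(Group == "treated") %>%
  pull(Cholesterol)

# Start a plot, add a box plot layer, then label the axes and add a title
p <- ggplot(patients, aes(x = Group, y = Cholesterol)) +
  geom_boxplot() +
  xlab("Group") +
  ylab("Cholesterol") +
  ggtitle("Cholesterol by group")

print(by_group)
print(treated_values)
```

Typing p at the console (or calling print(p)) would draw the plot.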

Averages and spread

mean
Calculates the arithmetic mean.
median
Finds the median.
sd
Calculates the standard deviation.
quantile(pulled_filtered_table, percentile)
Produces sample quantiles corresponding to the given probabilities. The probabilities are values between 0 and 1, e.g. 0.25 for the 25th percentile.
IQR(pulled_filtered_table)
Computes interquartile range of x values, i.e., quantile(x, 3/4) - quantile(x, 1/4).
pnorm
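Calculates cumulative probabilities for the normal distribution, i.e., the probability that a normally distributed variable is less than or equal to a given value.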
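
To illustrate quantile, IQR, and pnorm, here is a short sketch. The vector values is made up, and the pnorm calls use the standard normal distribution (mean 0, standard deviation 1), which is the function's default.

```r
# Hypothetical measurements (illustrative only)
values <- c(2, 4, 4, 5, 7, 9, 10, 12)

# quantile takes probabilities between 0 and 1
quartiles <- quantile(values, c(0.25, 0.75))

# IQR is the distance between the third and first quartiles
iqr <- IQR(values)

# pnorm(q) gives P(X <= q) for a normal distribution
p_below_mean <- pnorm(0)               # half the distribution lies below the mean
p_below_196  <- pnorm(1.96)            # close to 0.975
p_within_1sd <- pnorm(1) - pnorm(-1)   # roughly 0.68, within one sd of the mean

# A non-standard normal: specify mean and sd explicitly
p_custom <- pnorm(110, mean = 100, sd = 10)  # same as pnorm(1)

print(quartiles)
print(round(c(iqr = iqr, below_1.96 = p_below_196, within_1sd = p_within_1sd), 3))
```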