Exploring data and relationships between variables: Review and Summary
This review includes statistical concepts and R commands from the following lectures:
- Types of Variables
- Exploring Distributions, Averages, and Spread
- The Normal Distribution
- Data Wrangling
- Introduction to Graphing with ggplot2
- Relative and Attributable Risks
- Correlation
Types of Data
Categorical (C) data
- Nominal (N): have no order
- Ordinal (O): have order; no scale
Quantitative (Q) data
- Interval (I): have order and scale; no meaningful zero
- Ratio (R): have order, scale, and a meaningful zero
Do not forget the NOIR mnemonic.
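As a rough illustration of how the four NOIR types might be represented in R, here is a minimal sketch. The variable names and values are invented for this example, and factor() is not one of the commands recapped in this review; it is used here only as one common way to store categorical data.

```r
# Hypothetical values, for illustration only
strain   <- factor(c("B6", "TH", "B6"))                 # nominal: no order
severity <- factor(c("mild", "severe", "moderate"),
                   levels = c("mild", "moderate", "severe"),
                   ordered = TRUE)                      # ordinal: order, no scale
temp_c   <- c(36.5, 37.0, 38.2)   # interval: 0 degrees C is not "no temperature"
weight_g <- c(24.1, 30.5, 27.8)   # ratio: 0 g really means no mass

class(strain)    # "factor"
class(severity)  # "ordered" "factor"
class(weight_g)  # "numeric"

# Ordered factors can be compared; nominal factors cannot
severity[1] < severity[2]  # TRUE, because mild < severe
```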
Roles of Variables
Response variables
- Variables that are the particular focus of a question in the study or experiment.
Explanatory variables
- Variables that predict or explain changes in the response variable.
The role of a variable is defined by the experimental design.
The type and role of variables determine much of the analysis.
Relationships between Two Variables
Recall the four possible relationships between two variables.
- Categorical explanatory variable, quantitative response: C-Q
- Categorical explanatory variable, categorical response: C-C
- Quantitative explanatory variable, quantitative response: Q-Q
- Quantitative explanatory variable, categorical response: Q-C
C-Q Case
The explanatory variable is categorical (C) and the response variable (outcome) is quantitative (Q).
Example: mouse data
- Categorical explanatory variables: mouse strain [C57BL/6 (B6) or TALLYHO (TH)], and diet (Chow, High-fat, Low-fat high calorie)
- Quantitative response variables: cholesterol level, fat mass, body weight, triglyceride level, glucose level
Visualize the data using box plots, column scatter plots, or bar charts.
Compute the mean or median, along with measures of spread, such as standard deviation, standard error of the mean, or interquartile range.
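The summaries for a C-Q case can be sketched with group_by() and summarize(). The table below is a made-up stand-in for the mouse data (the column names Strain and Cholesterol and all values are assumptions), and the standard error of the mean is computed as the standard deviation divided by the square root of the group size.

```r
library(dplyr)

# Hypothetical mouse data; values are invented for illustration
mice <- data.frame(
  Strain      = c("B6", "B6", "B6", "TH", "TH", "TH"),
  Cholesterol = c(90, 100, 110, 140, 150, 160)
)

# One row of summaries per strain
chol_summary <- mice %>%
  group_by(Strain) %>%
  summarize(
    count       = n(),                            # rows in each group
    mean_chol   = mean(Cholesterol),
    median_chol = median(Cholesterol),
    sd_chol     = sd(Cholesterol),
    sem_chol    = sd(Cholesterol) / sqrt(n()),    # standard error of the mean
    iqr_chol    = IQR(Cholesterol)
  )

chol_summary
```

With real data you would replace the data.frame() call with the table you read in using read_csv().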
C-C Case
Explanatory and response variables are both categorical (C).
Example: vaccine trial
- Categorical explanatory variable: treatment (placebo and vaccine)
- Categorical response variable: SARS-CoV-2 status (infected or not infected)
We usually present the data using a contingency table.
We typically compute the risk for each group and the relative risk.
Other measures include the attributable risk and number needed to treat (NNT).
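The C-C measures can be worked through with a few lines of arithmetic. The counts below are invented and do not come from any real trial, and this sketch assumes the attributable risk is taken as the difference in risk between the placebo and vaccine groups, with NNT as its reciprocal.

```r
# Hypothetical counts, NOT from a real trial
infected <- c(placebo = 100,   vaccine = 10)
total    <- c(placebo = 10000, vaccine = 10000)

# Risk in each group: infected / total
risk <- infected / total

# Relative risk: risk in the vaccine group relative to the placebo group
relative_risk <- risk[["vaccine"]] / risk[["placebo"]]

# Attributable risk (assumed here to be the difference in risks)
attributable_risk <- risk[["placebo"]] - risk[["vaccine"]]

# Number needed to treat: reciprocal of the attributable risk
nnt <- 1 / attributable_risk

round(relative_risk, 3)
round(attributable_risk, 4)
round(nnt, 1)
```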
Q-Q Case
Explanatory and response variables are both quantitative (Q).
Example: insulin sensitivity data
We usually present the data using a scatterplot.
We compute the correlation coefficient.
- We will also discuss linear regression later in the course.
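A minimal sketch of the correlation calculation for a Q-Q case follows. The paired values are made up and are not the insulin sensitivity data; cor() computes the Pearson correlation coefficient unless another method is requested.

```r
# Hypothetical paired measurements, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Pearson correlation coefficient (the default method)
r <- cor(x, y)
round(r, 3)

# The order of the arguments does not matter
cor(y, x) == r
```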
Distributions
A distribution describes the probability that a variable takes on a particular value or falls within a range of values.
The "Normal" or "Gaussian" distribution is a particular distribution with "nice" mathematical properties:
- Symmetric about the mean
- Unimodal
- Determined entirely by the mean and standard deviation
- Software (or statistical tables) can be used to find the probability that a normally distributed value falls within any given range
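As a sketch of how software finds these probabilities, pnorm() returns the probability that a normally distributed value is less than or equal to a given number. The mean of 100 and standard deviation of 15 below are arbitrary values chosen for illustration.

```r
# Hypothetical normal distribution
mu    <- 100
sigma <- 15

# P(X <= 115): one standard deviation above the mean, about 0.841
p_below <- pnorm(115, mean = mu, sd = sigma)

# P(85 <= X <= 115): within one standard deviation of the mean, about 0.683
p_between <- pnorm(115, mean = mu, sd = sigma) - pnorm(85, mean = mu, sd = sigma)

# P(X > 130): more than two standard deviations above the mean, about 0.023
p_above <- 1 - pnorm(130, mean = mu, sd = sigma)

round(c(p_below, p_between, p_above), 3)
```

Because the distribution is determined entirely by the mean and standard deviation, these are the only two parameters pnorm() needs besides the cutoff value.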
Summary of R Commands
Let's recap all the R and tidyverse commands we've seen. You can always check these commands in the help in RStudio if you want the full details.
R commands
rep
- Replicates the elements of vectors and lists.
rnorm
- Gives random values from the normal distribution.
class
- Returns the class (type) of an object, such as "numeric" or "character".
install.packages()
- Download and install packages from CRAN-like repositories or from local files. We used this to install tidyverse.
library()
- Used for loading/attaching and listing of packages. We use this to load tidyverse.
sum
- Calculates the sum of all the values present in its arguments.
round
- Rounds the values in its first argument to the specified number of decimal places (default is 0).
cor
- Calculates the correlation coefficient.
getwd()
- Obtains the absolute filepath representing the current working directory of the R process.
setwd()
- Sets the working directory.
dir()
- Lists the files in your working directory or in a named directory.
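The base R commands above can be tried together in a short session. This is only a sketch: the values are arbitrary, and set.seed() is not in the list above but is added so the random numbers are reproducible. setwd() is left out because it changes the state of your session.

```r
set.seed(1)  # assumption: added only for reproducibility

# rep: repeat the whole vector, or repeat each element
alternating <- rep(c("placebo", "vaccine"), times = 3)
blocked     <- rep(c("placebo", "vaccine"), each = 3)

# rnorm: six random values from a normal distribution
values <- rnorm(6, mean = 10, sd = 2)

class(alternating)  # "character"
class(values)       # "numeric"

sum(c(1, 2, 3.5))   # 6.5
round(3.14159, 2)   # 3.14
round(2.71828)      # 3 (default is 0 decimal places)

getwd()  # current working directory
dir()    # files in the working directory
```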
Tidyverse commands
read_csv(filename)
- Reads a comma-separated value file, and creates a table from it. Note the filename can be a file on your computer, or a resource from the web.
separate(table, column, into, sep)
- Separates a column in a table into multiple columns named by into, using sep as the separator character.
%>%
- "And then". Takes the result of the previous operation and passes it into the next operation. This is also called a "pipe".
group_by(table, columns)
- Adds groups to a data table. The groups should be variables (columns) in the table. This will not change any data, but will change how other functions, such as summarize(), behave.
summarize(table, summaryFunctions)
- Computes summaries of the data in the table. If the table has groups, it computes the summaries for each group instead of for the entire table. The special function n() counts the number of rows in the table or in each group. You can also apply functions to the columns, e.g. mean(Cholesterol) will compute the mean of the Cholesterol column for the table or for each group.
filter(table, condition)
- Creates a new table with only the rows from the original table for which the condition is true. Remember that you can "pull" the values from the filtered table by specifying the column, i.e., pull(table, column_name).
write_csv(table, filename)
- Writes the data in the table to the file, in comma separated value format.
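The wrangling commands above can be chained with the pipe. The sketch below uses an invented table (the combined Group column, the other column names, and all values are assumptions) and writes to a temporary file so nothing in your working directory is touched.

```r
library(tidyverse)

# Hypothetical table; the "Group" column combines strain and diet
mice <- tibble(
  Group       = c("B6_Chow", "B6_HighFat", "TH_Chow", "TH_HighFat"),
  Cholesterol = c(95, 120, 150, 180)
)

# Split "Group" into two columns at the underscore
mice_tidy <- mice %>%
  separate(Group, into = c("Strain", "Diet"), sep = "_")

# Keep only the B6 rows, and then pull the Cholesterol values out as a vector
b6_chol <- mice_tidy %>%
  filter(Strain == "B6") %>%
  pull(Cholesterol)

mean(b6_chol)

# Write the tidy table to a temporary CSV file and read it back
out_file <- tempfile(fileext = ".csv")
write_csv(mice_tidy, out_file)
mice_again <- read_csv(out_file)
```

The pulled vector is what you would pass to functions such as mean(), sd(), or quantile().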
ggplot(dataTable, aes(x=..., y=...))
- Starts a plot using the ggplot2 library, which is part of tidyverse. Add layers by using geom functions.
- geom_boxplot
- geom_point
- geom_bar
- Configure axis labels with xlab and ylab.
- Adjust axis ticks and tick labels with theme.
- Configure the title with ggtitle.
- Change colors using the scale_fill_xxx functions.
- The functions position_dodge and position_jitterdodge are very useful for "dodging" and adding "jitter" to your plots.
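Putting the plotting pieces together, here is one possible sketch of a dodged box plot with jittered points. The data are invented, and the specific choices (scale_fill_brewer with the "Set2" palette, the dodge and jitter widths, the text size) are illustrative assumptions rather than the settings used in the lecture.

```r
library(ggplot2)

# Hypothetical mouse data: 2 strains x 2 diets x 3 mice
mice <- data.frame(
  Strain      = rep(c("B6", "TH"), each = 6),
  Diet        = rep(rep(c("Chow", "High-fat"), each = 3), times = 2),
  Cholesterol = c(90, 100, 110, 115, 125, 135,
                  140, 150, 160, 170, 180, 190)
)

p <- ggplot(mice, aes(x = Strain, y = Cholesterol, fill = Diet)) +
  # Box plots for each diet, placed side by side within each strain
  geom_boxplot(position = position_dodge(width = 0.8)) +
  # Individual points, dodged to match the boxes and jittered horizontally
  geom_point(position = position_jitterdodge(jitter.width = 0.1,
                                             dodge.width = 0.8)) +
  xlab("Mouse strain") +
  ylab("Cholesterol") +
  ggtitle("Cholesterol by strain and diet") +
  scale_fill_brewer(palette = "Set2") +           # one of the scale_fill_xxx functions
  theme(axis.text.x = element_text(size = 12))    # adjust the x-axis tick labels

p  # printing the object draws the plot
```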
Averages and spread
mean
- Calculates the arithmetic mean.
median
- Finds the median.
sd
- Calculates the standard deviation.
quantile(pulled_filtered_table, percentile)
- Produces sample quantiles corresponding to the given probabilities, which are values between 0 and 1 (e.g. 0.25 for the 25th percentile).
IQR(pulled_filtered_table)
- Computes interquartile range of x values, i.e., quantile(x, 3/4) - quantile(x, 1/4).
pnorm
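- Calculates the probability that a normally distributed value is less than or equal to a given value, for a given mean and standard deviation.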
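The averages and spread functions can be checked by hand on a small vector. The numbers below are arbitrary, and the quantile results assume R's default quantile method.

```r
# Hypothetical values, for illustration only
x <- c(2, 4, 4, 5, 7, 9, 11)

mean(x)    # 6
median(x)  # 5
sd(x)      # sqrt(10), about 3.16

# Quantiles take probabilities between 0 and 1
q1 <- quantile(x, 0.25)  # 4
q3 <- quantile(x, 0.75)  # 8

# IQR is the distance between the first and third quartiles
IQR(x)     # 4
```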