BMR 617: Statistical Techniques for the Biomedical Sciences

Data Wrangling

We're going to take a little break from statistical ideas and explore some techniques for managing data using R and the Tidyverse package.

This is known as data wrangling. It's important to note here that all this means is organizing and managing the data so that it's easy to work with. We are not changing the data values in any way. Manipulating data so that it changes the values and results is fraudulent, and we have to be vigilant in avoiding doing that either intentionally or by accidentally.

Tidy Data

We want our data to be organized in a way that is coherent both to us and to the software we're going to use to analyze and display it. Data that is organized this way is said to be "tidy".

Tidy data is organized as individual rectangular tables of data, with any table having a number of rows and columns. Tidy data has the following characteristics:

As an example, let's look at the mouse metabolic data again:

	
	met <- read_csv("https://denvirlab.marshall.edu/BMR617-2021/data/TH-B6-metabolic.csv")
	
	
View met by clicking on it in the Environment tab. Do you think this is tidy?

Every row in the table represents one mouse, or one observation. We measure six variables for each mouse, represented by the BodyWeight, Insulin, TG (triglycerides), Cholesterol, Glucose, and FatMass columns.

However, the MouseID column really represents three different pieces of data in a single column. It encapsulates the Strain, Diet, and a unique ID. Since this single column represents three variables, the data are not "tidy"

Separate

We've already seen how to tidy these data. The separate function takes a column in a data table and separates one column into multiple columns, given a separator character. In our case the column we want to separate is MouseID and the separator character is -. The columns into we want to separate this are Strain, Diet, and Id. So the command is:
	
	met <- separate(met, MouseID, sep="-", into=c("Strain", "Diet", "Id"))
	
	

Piping

So far we have
	
	met <- read_csv("https://denvirlab.marshall.edu/BMR617-2021/data/TH-B6-metabolic.csv")
	met <- separate(met, MouseID, sep="-", into=c("Strain", "Diet", "Id"))
	
	
We read the CSV file and assigned it to our object, met. And then we separated the MouseID column into three columns, reassigning it back to the same object.

Since it's quite common to have a series of steps like this, Tidyverse supplies a special operator, which we can think of as meaning "and then". So another style of these two steps, which accomplishes the exact same thing, is

	
	met <- read_csv("https://denvirlab.marshall.edu/BMR617-2021/data/TH-B6-metabolic.csv") %>%
		separate(MouseID, sep="-", into=c("Strain", "Diet", "Id"))
	
	
Note that the first argument to separate has been removed. The "and then" operator, %>% means "take the result of the previous operation and use it as the first argument to the next operation". So this basically says, "read the CSV file, and then separate the MouseID column".

This style is entirely optional, but I recommend you try it and see if you get a feel for it. It is called a "fluent programming style".

summarize

The >summarize function takes a table and produces summaries of the data in it. Try the following:
	
	summarize(met, Count=n())
	
	
What does this do? Notice that the result is presented as a table with one row and one column.

Try

	
	summarize(met, MeanCholesterol=mean(Cholesterol), Count=n())
	
	

Grouping

Try the following:
	
	metByStrain <- group_by(met, Strain)
	
	
View metByStrain. Does it look any different to met?

Type metByStrain directly in the console. Note that it has a Groups attribute displayed above the table.

Grouping doesn't change any data. However, it changes the way summarize behaves:

	
	summarize(metByStrain, MeanCholesterol=mean(Cholesterol), Count=n())
	
	

Use these techniques to find the mean cholesterol across each Diet.

Can you find the standard deviation of the cholesterol for each Diet group? Can you display this in a table with the mean cholesterol and the count for each diet?

Grouping by multiple variables

We can group by more than one variable. What happens if you do:

	
	metGrouped <- group_by(met, Strain, Diet)
	summarize(metGrouped, Count=n(), MeanCholesterol=mean(Cholesterol), CholesterolSD = sd(Cholesterol))
	
	

Writing CSV files

We can write (save) a CSV file using write_csv(table, filename). The following will save a data summary to your desktop (choose another folder if you like):
	
	metSummary <- summarize(metGrouped, 
		Count=n(), 
		MeanCholesterol=mean(Cholesterol), 
		CholesterolSD = sd(Cholesterol))
	write_csv(metSummary, "~/Desktop/MouseDataSummary.csv")
	
	
Find the file and double-click it; it should open in Excel or similar software.

Piping it all together

Read the following and see if you understand what it does. Run it to see if you are correct:

	
	read_csv("https://denvirlab.marshall.edu/BMR617-2021/data/TH-B6-metabolic.csv") %>%
	  separate(MouseID, sep="-", into=c("Strain", "Diet", "Id")) %>%
	  group_by(Strain, Diet) %>%
	  summarize(
		Count=n(), 
		MeanCholesterol=mean(Cholesterol), 
		CholesterolSD = sd(Cholesterol)
	  ) %>%
	  write_csv("~/Desktop/MouseDataSummary.csv")
	
	

Summary

Let's recap all the tidyverse commands we've seen. Remember you can look any of these up in the help in RStudio if you want the full details.

read_csv(filename)
Reads a comma-separated value file, and creates a table from it. Note the filename can be a file on your computer, or a resource from the web.
separate(table, column, sep, into)
Separates a column in a table into multiple columns, using sep as the separator character.
%>%
"And then". Take the result of the previous operation and pass it into the next operation. This is also called a "pipe".
group_by(table, columns)
Adds Groups to a data table. The groups should be variable (columns) in the table. This will not change any data, but will change how other functions, such as summarize() behave.
summarize(table, summaryFunctions)
Computes summaries of the data in the entire table. If the table has Groups, it will compute the summaries for each group. The special function n() just counts the number of rows in the table or in each group. You can also perform functions on the columns, e.g. mean(Cholesterol) will compute the mean of the Cholesterol column for the table or for each Group.
filter(table, condition)
Creates a new table with only the rows from the original table for which the condition is true.
write_csv(table, filename)
Writes the data in the table to the file, in comma separated value format.