We're going to take a little break from statistical ideas and explore some techniques for managing data using R and the Tidyverse package.
This is known as data wrangling. It's important to note here that all this means is organizing and managing the data so that it's easy to work with. We are not changing the data values in any way. Manipulating data so that it changes the values and results is fraudulent, and we have to be vigilant in avoiding doing that either intentionally or by accidentally.
We want our data to be organized in a way that is coherent both to us and to the software we're going to use to analyze and display it. Data that is organized this way is said to be "tidy".
Tidy data is organized as individual rectangular tables of data, with any table having a number of rows and columns. Tidy data has the following characteristics:
As an example, let's look at the mouse metabolic data again:
met <- read_csv("https://denvirlab.marshall.edu/BMR617-2021/data/TH-B6-metabolic.csv")
View met by clicking on it in the Environment tab. Do you think this
is tidy?
Every row in the table represents one mouse, or one observation. We measure six variables for each mouse, represented by the BodyWeight, Insulin, TG (triglycerides), Cholesterol, Glucose, and FatMass columns.
However, the MouseID column really represents three different pieces of data in a single column. It encapsulates the Strain, Diet, and a unique ID. Since this single column represents three variables, the data are not "tidy"
Separateseparate function takes
a column in a data table and separates one column into multiple columns, given a separator
character. In our case the column we want to separate is MouseID and the
separator character is -. The columns into we want to separate
this are Strain, Diet, and Id. So the command
is:
met <- separate(met, MouseID, sep="-", into=c("Strain", "Diet", "Id"))
met <- read_csv("https://denvirlab.marshall.edu/BMR617-2021/data/TH-B6-metabolic.csv")
met <- separate(met, MouseID, sep="-", into=c("Strain", "Diet", "Id"))
We read the CSV file and assigned it to our object, met. And then
we separated the MouseID column into three columns, reassigning it back
to the same object.
Since it's quite common to have a series of steps like this, Tidyverse supplies a special operator, which we can think of as meaning "and then". So another style of these two steps, which accomplishes the exact same thing, is
met <- read_csv("https://denvirlab.marshall.edu/BMR617-2021/data/TH-B6-metabolic.csv") %>%
separate(MouseID, sep="-", into=c("Strain", "Diet", "Id"))
Note that the first argument to separate has been removed. The "and then"
operator, %>% means "take the result of the previous operation and use
it as the first argument to the next operation". So this basically says, "read the CSV
file, and then separate the MouseID column".
This style is entirely optional, but I recommend you try it and see if you get a feel for it. It is called a "fluent programming style".
summarizesummarize function takes a table and produces summaries of the
data in it. Try the following:
summarize(met, Count=n())
What does this do? Notice that the result is presented as a table with one row and
one column.
Try
summarize(met, MeanCholesterol=mean(Cholesterol), Count=n())
metByStrain <- group_by(met, Strain)
View metByStrain. Does it look any different to met?
Type metByStrain directly in the console. Note that it has a Groups
attribute displayed above the table.
Grouping doesn't change any data. However, it changes the way summarize
behaves:
summarize(metByStrain, MeanCholesterol=mean(Cholesterol), Count=n())
Use these techniques to find the mean cholesterol across each Diet.
Can you find the standard deviation of the cholesterol for each Diet group? Can you display this in a table with the mean cholesterol and the count for each diet?
We can group by more than one variable. What happens if you do:
metGrouped <- group_by(met, Strain, Diet)
summarize(metGrouped, Count=n(), MeanCholesterol=mean(Cholesterol), CholesterolSD = sd(Cholesterol))
write_csv(table, filename).
The following will save a data summary to your desktop (choose another folder if you like):
metSummary <- summarize(metGrouped,
Count=n(),
MeanCholesterol=mean(Cholesterol),
CholesterolSD = sd(Cholesterol))
write_csv(metSummary, "~/Desktop/MouseDataSummary.csv")
Find the file and double-click it; it should open in Excel or similar software.
Read the following and see if you understand what it does. Run it to see if you are correct:
read_csv("https://denvirlab.marshall.edu/BMR617-2021/data/TH-B6-metabolic.csv") %>%
separate(MouseID, sep="-", into=c("Strain", "Diet", "Id")) %>%
group_by(Strain, Diet) %>%
summarize(
Count=n(),
MeanCholesterol=mean(Cholesterol),
CholesterolSD = sd(Cholesterol)
) %>%
write_csv("~/Desktop/MouseDataSummary.csv")
Let's recap all the tidyverse commands we've seen. Remember you can look any of these up in the help in RStudio if you want the full details.
read_csv(filename)separate(table, column, sep, into)sep as the
separator character.%>%group_by(table, columns)summarize()
behave.
summarize(table, summaryFunctions)n()
just counts the number of rows in the table or in each group. You can also perform
functions on the columns, e.g. mean(Cholesterol) will compute the mean
of the Cholesterol column for the table or for each Group.filter(table, condition)write_csv(table, filename)