Today
Study: OCTO-TWIN
In focus: Personality and Dementia
Goals:
1. Practice RAW-TO-DRAW workflow
2. Reshaping data
- Understand how long-to-wide transformation works
- Understand how wide-to-long transformation works
First, we load the necessary packages.
base::require(base)
base::require(knitr)
base::require(markdown)
base::require(testit)
base::require(dplyr)
base::require(reshape2)
base::require(ggplot2)
base::require(sas7bdat)
The next section of the script (chunk) establishes the addresses for folders. Let’s read this line by line and understand why the two lines that call read.sas7bdat() and saveRDS() are commented out and what they do.
pathDir <- getwd()
pathFile <- file.path(pathDir,"Data/Extract/octomult_021114la.sas7bdat")
path_ds0 <- file.path(pathDir, "Data/Derived/ds0.Rds")
path_dsL <- file.path("../Data/Derived/dsL.Rds")
# ds0 <- read.sas7bdat(pathFile, debug=TRUE) # Use this when running for the first time
# saveRDS(object=ds0, file=path_ds0, compress="xz") # Use this when running for the first time
### Either use ds0 definition above or below.
ds0<-readRDS(path_ds0) # This saves time # Use for subsequent run
############## Define temporary variable sets
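The commented-out lines above follow a read-once, cache-afterwards pattern: the slow SAS import runs a single time, and every later run loads the compact .Rds copy. A minimal sketch of the same idea without manual commenting and uncommenting is given below. The helper `load_cached()` and the toy reader are hypothetical names introduced only for illustration; they are not part of the original script.

```r
# Hypothetical helper (not in the original script): read a slow source once,
# cache it as .Rds, and reuse the cache on later runs.
load_cached <- function(path_cache, read_slow) {
  if (file.exists(path_cache)) {
    return(readRDS(path_cache))   # fast path: the cache already exists
  }
  ds <- read_slow()               # slow path: parse the raw file
  saveRDS(object = ds, file = path_cache, compress = "xz")
  ds
}

# Toy stand-in for read.sas7bdat(pathFile); values are made up.
path_demo   <- tempfile(fileext = ".Rds")
slow_reader <- function() data.frame(Case = 1:3, Female = c(0, 1, 1))

ds_first  <- load_cached(path_demo, slow_reader)  # reads, then writes the cache
ds_second <- load_cached(path_demo, slow_reader)  # reads the cache only
all.equal(ds_first, ds_second)
```

In the project itself, `read_slow` would presumably wrap the `read.sas7bdat(pathFile)` call and `path_cache` would be `path_ds0`.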
We have just loaded the dataset ds0, which corresponds to a particular place in our data analysis project.
Remember the schematic road map for the workflow in data science projects:
Now we would like to understand how ds0 becomes dsL in this particular project. Remember that the dsL data stage reflects the scope of the project: what you want your readers to be aware is available. It is convenient to manage the process dsL <- ds0, during which dsL is created from ds0, with a separate script:
./Scripts/Data/dsL.R
At this moment in our workflow, however, we are only at the stage of ds0. Let’s take a moment to explore the dataset ds0 and recall some basic R vocabulary:
class()
dim()
str()
summary()
table()
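As a quick refresher, the sketch below applies each of these functions to a small toy data frame. The object `toy` and its values are made up for illustration; in the session the same calls would be applied to ds0 or to one of its columns.

```r
# Toy data frame with made-up values, only to illustrate the functions.
toy <- data.frame(
  Case   = 1:4,
  Female = c(0, 1, 1, 0),
  epi_e  = c(4, NaN, 5, 2)
)

class(toy)         # what kind of object it is
dim(toy)           # number of rows, then number of columns
str(toy)           # compact listing of each column and its type
summary(toy)       # per-column descriptive statistics
table(toy$Female)  # frequency count of each distinct value
```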
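To help us handle ds0, which has way too many variables, we develop a meaningful set of variables that we’d like to subset from ds0. It is convenient to distinguish two types: time-invariant (TI) variables, measured once per person, and time-variant (TV) variables, measured at each occasion.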
# Time invariant variables
TIvars <- c(
"Case", "PairID", "TwinID", "Zygosity", # Identification
"BirthDate", "DeadDate", "Female", # Demographics
"EducCat", "Educyrs", # Education
"Dead", "DeadAge", "YTDead", # Mortality
"TotDem", "DemEver", "DemType", "DemAge", "YTDem" # Dementia
)
# Small set of time invariant variables
TIsmall <- c("Case","PairID", "TwinID","Female")
## Select the variables whose names begin with ...
# names(ds0)[grep("^Marital", names(ds0))]
## Select the variables whose names contain ...
# names(ds0)[grep("arital", names(ds0))]
TVMarital <- names(ds0)[grep("Marital", names(ds0))] # Marital status
TVCompAge <- names(ds0)[grep("CompAge", names(ds0))] # Computed Age
TVbmi <- names(ds0)[grep("bmi", names(ds0))] # Body Mass Index
TVepi_e <- names(ds0)[grep("epi_e", names(ds0))] # Extraversion
TVepi_n <- names(ds0)[grep("epi_n", names(ds0))] # Neuroticism
TVsbp <- names(ds0)[grep("sbp", names(ds0))] # Systolic blood pressure
TVdbp <- names(ds0)[grep("dbp", names(ds0))] # Diastolic blood pressure
TVdemtime <- names(ds0)[grep("demtime", names(ds0))] # diagnosed dementia
# collect all TV variables into one character vector
TVvars <- c(TVMarital, TVCompAge, TVbmi, TVepi_e, TVepi_n, TVsbp, TVdbp, TVdemtime)
# Small set of time variant variables
TVsmall <- c(TVMarital, TVCompAge, TVepi_e, TVepi_n )
Let’s take a few minutes and understand how objects TIvars and TVvars were created, how they behave and how to use them.
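To see how these objects behave, the sketch below runs the same grep() idiom on a short, made-up vector of column names (`nms` stands in for names(ds0); "PremaritalX" is a hypothetical name added to show the difference between the two patterns).

```r
# Hypothetical column names standing in for names(ds0).
nms <- c("Case", "Marital1", "Marital2", "PremaritalX", "CompAge1", "epi_e1")

nms[grep("^Marital", nms)]          # names that begin with "Marital"
nms[grep("arital", nms)]            # names that contain "arital" anywhere
grep("Marital", nms, value = TRUE)  # value = TRUE returns the names directly

# Each group is a plain character vector, so c() concatenates groups.
TVdemo <- c(grep("^Marital", nms, value = TRUE),
            grep("^CompAge", nms, value = TRUE))
TVdemo
```

Because the result is an ordinary character vector of column names, it can be passed straight into the column position of the square brackets when subsetting a data frame.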
Using these groups of variables and base subsetting grammar, we can choose which variables to carry over into our dsW stage.
## Select the variables for your dsW
dsW <- ds0[,c(TIvars,TVvars)]
dim(dsW)
[1] 702 57
Naturally, we can change what variables get selected. Let’s try to get a more manageable set.
dsW <- ds0[,c(TIsmall,TVsmall)]
dim(dsW)
[1] 702 24
Notice that each occasion at which data was collected is hosted in a separate variable/column. This is a characteristic of data in WIDE format. However, most graph and model functions take data in LONG format. We transform the data with the melt function of the reshape2 package:
#Transform the wide dataset into a long dataset
dsLong <- reshape2::melt(dsW, id.vars=TIsmall) ## id.vars identify each row; all other columns are stacked
dsLong <- dsLong[order(dsLong$Case, dsLong$variable), ] #Sort for the sake of visual inspection.
head(dsLong, 11)
Case PairID TwinID Female variable value
1 1 28 1 0 Marital1 1.00
703 1 28 1 0 Marital2 NaN
1405 1 28 1 0 Marital3 NaN
2107 1 28 1 0 Marital4 NaN
2809 1 28 1 0 Marital5 NaN
3511 1 28 1 0 CompAge1 91.25
4213 1 28 1 0 CompAge2 NaN
4915 1 28 1 0 CompAge3 NaN
5617 1 28 1 0 CompAge4 NaN
6319 1 28 1 0 CompAge5 NaN
7021 1 28 1 0 epi_e1 NaN
Notice that ALL variables that change with time are stacked into two columns:
- variable
- value
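To make the mechanics of the wide-to-long step visible, here is a minimal base R sketch of the stacking that melt performs. It is not how reshape2 is implemented; it only reproduces the idea on a toy data frame with made-up values, and the object names (`wide`, `pieces`, `long`) are assumptions for illustration.

```r
# Toy wide data with made-up values: two measures at two occasions.
wide <- data.frame(
  Case     = c(1, 2),
  Marital1 = c(1, 2),
  Marital2 = c(NA, 2),
  CompAge1 = c(91.25, 91.23),
  CompAge2 = c(NA, 93.35)
)
id_vars      <- "Case"
measure_vars <- setdiff(names(wide), id_vars)

# For each measured column: keep the id columns, store the column's name in
# "variable" and its contents in "value". Then stack the pieces row-wise.
pieces <- lapply(measure_vars, function(v) {
  data.frame(wide[id_vars],
             variable = factor(v, levels = measure_vars),
             value    = wide[[v]])
})
long <- do.call(rbind, pieces)
long <- long[order(long$Case, long$variable), ]  # sort for visual inspection
long
```

Each of the four measured columns contributes one row per case, so the two-row wide table becomes an eight-row long table.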
However, some work may need to be done to clean up the new data:
# Create variable that counts time
dsLong$time <- stringr::str_sub(dsLong$variable,-1,-1)
# head(dsLong, 13)
# Remove the counter suffix (a trailing digit 1-5) from the "variable"
dsLong$variable <- gsub(pattern="[1-5]$", replacement="", x=dsLong$variable)
head(dsLong, 11)
Case PairID TwinID Female variable value time
1 1 28 1 0 Marital 1.00 1
703 1 28 1 0 Marital NaN 2
1405 1 28 1 0 Marital NaN 3
2107 1 28 1 0 Marital NaN 4
2809 1 28 1 0 Marital NaN 5
3511 1 28 1 0 CompAge 91.25 1
4213 1 28 1 0 CompAge NaN 2
4915 1 28 1 0 CompAge NaN 3
5617 1 28 1 0 CompAge NaN 4
6319 1 28 1 0 CompAge NaN 5
7021 1 28 1 0 epi_e NaN 1
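The clean-up step splits each stacked name into a stem and an occasion counter. A base R sketch of the same split on a few example names follows; the vector `variable` here is a stand-alone toy, and using sub() with regular expressions is an alternative to the stringr call above, not the original code.

```r
# Example names following the stem + occasion pattern used in this dataset.
variable <- c("Marital1", "Marital5", "CompAge2", "epi_e1", "epi_n3")

# Keep only the trailing digit(s) as the occasion counter.
time <- sub("^.*?([0-9]+)$", "\\1", variable, perl = TRUE)
# Drop the trailing digit(s) to recover the stem.
stem <- sub("[0-9]+$", "", variable)

data.frame(variable, stem, time)
```

Anchoring the pattern at the end of the string with `$` means a digit that appears elsewhere in a name is left untouched.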
Now we transform the data with dcast, the reverse function to melt, in order to take the values of the column “variable” and make them into the names of the new individual variables. Let’s take a few minutes to interpret the syntax of the code and play with the options.
dsL <- dcast(dsLong, Case + PairID + TwinID + Female + time ~ variable, value.var = "value")
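In the formula, the columns to the left of `~` stay as row identifiers and the values of the column on the right become new column names. The same long-to-wide idea can be sketched in base R with stats::reshape(); this is a swapped-in technique shown on a toy data frame with made-up values, not the reshape2 call used above.

```r
# Toy long data: one case, two occasions, two measures (made-up values).
long <- data.frame(
  Case     = c(1, 1, 1, 1),
  time     = c("1", "2", "1", "2"),
  variable = c("Marital", "Marital", "CompAge", "CompAge"),
  value    = c(1, NA, 91.25, 93.35)
)

# idvar   : columns that identify a row of the result (left of ~ in dcast)
# timevar : column whose values become new column names (right of ~)
# v.names : column that supplies the cell values (value.var in dcast)
wide <- reshape(long, idvar = c("Case", "time"), timevar = "variable",
                v.names = "value", direction = "wide")

# reshape() prefixes the new columns with "value."; strip that prefix.
names(wide) <- sub("^value\\.", "", names(wide))
wide
```

The result has one row per case and occasion, with Marital and CompAge as separate columns, which is the shape dsL takes after the dcast step.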
At the end, we apply descriptive labels to the categorical variables.
Now the data is ready to be graphed.
# standard of the fontsize in plots
baseSize <- 11
# add this theme to a ggplot to style it
themeLine <- ggplot2::theme_bw(base_size=baseSize) +
ggplot2::theme(title=ggplot2::element_text(colour="gray20",size = baseSize+1)) +
ggplot2::theme(axis.text=ggplot2::element_text(colour="gray40")) +
ggplot2::theme(axis.title=ggplot2::element_text(colour="gray40")) +
ggplot2::theme(panel.border = ggplot2::element_rect(colour="gray80")) +
ggplot2::theme(axis.ticks.length = grid::unit(0, "cm")) +
ggplot2::theme(text = ggplot2::element_text(size = baseSize+7))
We’ll use these style definitions in graphs.
We can start with a simple graph
dsM <- dsL %>%
dplyr::select(Case, PairID, TwinID, CompAge, epi_e, epi_n)
head(dsM, 11)
Case PairID TwinID CompAge epi_e epi_n
1 1 28 1 91.25 NaN NaN
2 1 28 1 NaN NaN NaN
3 1 28 1 NaN NaN NaN
4 1 28 1 NaN NaN NaN
5 1 28 1 NaN NaN NaN
6 2 28 2 91.23 4 0
7 2 28 2 93.35 NaN NaN
8 2 28 2 NaN NaN NaN
9 2 28 2 NaN NaN NaN
10 2 28 2 NaN NaN NaN
11 3 47 1 92.03 5 1
p <- ggplot(dsM, aes(x=CompAge, group=Case, color=factor(TwinID)))
p <- p + geom_line(aes(y=epi_e))
p
Warning: Removed 1349 rows containing missing values (geom_path).
We can then gradually evolve it into a more complex graph, as we’ve done before:
dsM <- dsL %>%
dplyr::select(Case, PairID, TwinID, CompAge, epi_e, epi_n) %>%
dplyr::filter(TwinID %in% c(1,2))
head(dsM, 11)
Case PairID TwinID CompAge epi_e epi_n
1 1 28 1 91.25 NaN NaN
2 1 28 1 NaN NaN NaN
3 1 28 1 NaN NaN NaN
4 1 28 1 NaN NaN NaN
5 1 28 1 NaN NaN NaN
6 2 28 2 91.23 4 0
7 2 28 2 93.35 NaN NaN
8 2 28 2 NaN NaN NaN
9 2 28 2 NaN NaN NaN
10 2 28 2 NaN NaN NaN
11 3 47 1 92.03 5 1
p <- ggplot(dsM, aes(x=CompAge, group=Case, color=factor(TwinID)))
p <- p + geom_line(aes(y=epi_e),position="jitter")
p <- p + geom_point(aes(y=epi_e), position="jitter")
# p <- p + geom_line(aes(y=epi_n),position="jitter")
# p <- p + geom_point(aes(y=epi_n), position="jitter")
p <- p + themeLine
p
Warning: Removed 1349 rows containing missing values (geom_path).
Warning: Removed 2383 rows containing missing values (geom_point).
Next time, we’ll take a look at graph production in detail.