Today

Study : OCTO-TWIN
In focus: Personality and Dementia

Goals:
1. Practice RAW-TO-DRAW workflow
2. Reshaping data
- Understand how long-to-wide transformation works
- Understand how wide-to-long transformation works


Initializing the data workflow

First, we load the necessary packages.

base::require(base)
base::require(knitr)
base::require(markdown)
base::require(testit)
base::require(dplyr)
base::require(reshape2)
base::require(ggplot2)
base::require(sas7bdat)

The next section of the script (chunk) establishes the addresses for folders. Let’s read this line by line and understand why the two lines that read and save the SAS file are commented out and what they do.

pathDir <- getwd()
pathFile <- file.path(pathDir,"Data/Extract/octomult_021114la.sas7bdat")
path_ds0 <- file.path(pathDir, "Data/Derived/ds0.Rds")
path_dsL <- file.path("../Data/Derived/dsL.Rds")
# ds0 <- read.sas7bdat(pathFile, debug=TRUE)          # Use this when running for the first time
# saveRDS(object=ds0, file=path_ds0, compress="xz")   # Use this when running for the first time
### Either use the ds0 definition above or the one below.
ds0 <- readRDS(path_ds0) # This saves time            # Use for subsequent runs
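
The read-once, cache-afterwards idea in this chunk can also be written so that no lines need to be commented in and out. The following is a minimal sketch, not the project's own script: `load_cached()` is a hypothetical helper, and the reader function is passed in as an argument so that the example runs on a toy CSV file instead of the SAS extract.

```r
# Hypothetical helper: parse the raw file only if no cached .Rds copy exists yet.
load_cached <- function(path_raw, path_rds, reader) {
  if (file.exists(path_rds)) {
    return(readRDS(path_rds))            # fast path: reuse the cached copy
  }
  ds <- reader(path_raw)                 # slow path: parse the raw file once
  saveRDS(object = ds, file = path_rds, compress = "xz")
  ds
}

# Toy demonstration: temporary files stand in for the real project paths.
path_raw <- tempfile(fileext = ".csv")
path_rds <- tempfile(fileext = ".Rds")
write.csv(data.frame(Case = 1:3, Female = c(0, 1, 1)), path_raw, row.names = FALSE)

first  <- load_cached(path_raw, path_rds, reader = read.csv)  # parses the CSV, writes the cache
second <- load_cached(path_raw, path_rds, reader = read.csv)  # reads the cache
identical(first, second)
```

In this project the reader would presumably be a wrapper around `sas7bdat::read.sas7bdat`, with `pathFile` and `path_ds0` as the two paths.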


############## Define temporary variable sets

We have just loaded the dataset ds0 which corresponds to a particular place in our data analysis project.

Remember the schematic road map for the workflow in data science projects:

ds0

  • verbatim input from raw files
  • minimum or no processing
  • “patient zero”

dsL

  • span of the project
  • subset of ds0
  • stored in long (dsL) or wide (dsW) format
  • provides richer context
  • invites future research
  • experimentation

dsM

  • model ready
  • used for estimating models and producing graphs
  • printed for display
  • subset of dsL
  • sometimes adds columns to service models

Now we would like to understand how ds0 becomes dsL in this particular project. Remember that the dsL data stage reflects the scope of the project: what you want your readers to know is available. It is convenient to manage this process as a single step

dsL <- ds0

during which dsL is created from ds0 with a separate script.

./Scripts/Data/dsL.R

At this moment in our workflow, however, we are only at the stage of ds0. Let’s take a moment to explore the dataset ds0 and recall some basic R vocabulary

class()
dim()
str()
summary()
table()
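
As a reminder of what each of these functions reports, here is a small sketch on a toy data frame. The object `toy` and its values are invented for illustration; in the workshop the same calls would be applied to ds0.

```r
# Toy stand-in for ds0; the values are invented for illustration only.
toy <- data.frame(
  Case   = 1:4,
  PairID = c(28, 28, 47, 47),
  Female = c(0, 0, 1, 1)
)

class(toy)          # what kind of object is it?
dim(toy)            # number of rows and columns
str(toy)            # compact overview of every column
summary(toy)        # distribution summaries, column by column
table(toy$Female)   # frequency count of a single variable
```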

Creating groups of variables

To help us handle ds0, which has way too many variables, we develop a meaningful set of variables that we’d like to subset from ds0. It is convenient to distinguish two types:

Type 1 : Time Invariant variables - TI

  • don’t change with time
# Time invariant variables
TIvars <- c(
  "Case", "PairID", "TwinID", "Zygosity", # Identification
  "BirthDate", "DeadDate", "Female", # Demographics
  "EducCat", "Educyrs", # Education
  "Dead", "DeadAge", "YTDead", # Mortality
  "TotDem", "DemEver", "DemType", "DemAge", "YTDem" # Dementia
)
# Small set of time invariant variables
TIsmall <- c("Case","PairID", "TwinID","Female")

Type 2 : Time Variant variables - TV

  • their values DO change with time
## Select the variables whose names begin with ...
# names(ds0)[grep("^Marital", names(ds0))]
## Select the variables whose names contain ...
# names(ds0)[grep("arital", names(ds0))]

TVMarital <- names(ds0)[grep("Marital", names(ds0))] # Marital status
TVCompAge <- names(ds0)[grep("CompAge", names(ds0))] # Computed Age
TVbmi <- names(ds0)[grep("bmi", names(ds0))] # Body Mass Index
TVepi_e <- names(ds0)[grep("epi_e", names(ds0))] # Extraversion
TVepi_n <- names(ds0)[grep("epi_n", names(ds0))] # Neuroticism
TVsbp <- names(ds0)[grep("sbp", names(ds0))] # Systolic blood pressure
TVdbp <- names(ds0)[grep("dbp", names(ds0))] # Diastolic blood pressure
TVdemtime <- names(ds0)[grep("demtime", names(ds0))] # Diagnosed dementia
# collect all TV variables into one character vector
TVvars <- c(TVMarital, TVCompAge, TVbmi, TVepi_e, TVepi_n, TVsbp, TVdbp, TVdemtime)

# Small set of time variant variables
TVsmall <- c(TVMarital, TVCompAge, TVepi_e, TVepi_n )

Let’s take a few minutes and understand how objects TIvars and TVvars were created, how they behave and how to use them.
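
To see how these objects behave, it may help to try the same grep idiom on a short vector of names. The sketch below uses hypothetical column names (`PreMarital1` in particular is invented) to show the difference between an anchored pattern and an unanchored one, and how the resulting character vector selects columns.

```r
# Hypothetical column names mimicking the wide layout of ds0.
toy_names <- c("Case", "Marital1", "Marital2", "PreMarital1", "CompAge1", "CompAge2")

begins   <- toy_names[grep("^Marital", toy_names)]  # names that begin with "Marital"
contains <- toy_names[grep("arital", toy_names)]    # names that contain "arital" anywhere
begins
contains

# grep(..., value = TRUE) returns the matching names directly.
same <- grep("^Marital", toy_names, value = TRUE)

# A character vector of names can then pick columns of a data frame.
toy <- as.data.frame(matrix(0, nrow = 2, ncol = length(toy_names),
                            dimnames = list(NULL, toy_names)))
dim(toy[, c("Case", begins)])
```

An unanchored pattern also picks up any variable that merely contains the string, so it is worth printing each TV object once to check that it holds only the intended names.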

Using these groups of variables and base subsetting grammar, we can choose the variables we’d like to carry into our dsW stage.

## Select the variables for your dsW
dsW <- ds0[,c(TIvars,TVvars)]
dim(dsW)
[1] 702  57

Naturally, we can change what variables get selected. Let’s try to get a more manageable set.

dsW <- ds0[,c(TIsmall,TVsmall)]
dim(dsW)
[1] 702  24

Reshaping the data

Notice that each occasion at which data were collected is stored in a separate variable/column. This is a characteristic of data in WIDE format. However, most graphing and modeling functions take data in LONG format. We transform the data with the melt function of the reshape2 package:

# Transform the wide dataset into a long dataset
dsLong <- reshape2::melt(dsW, id.vars=TIsmall)  ## id.vars identify the rows; all other columns are treated as measured
dsLong <- dsLong[order(dsLong$Case, dsLong$variable), ] # Sort for the sake of visual inspection.
head(dsLong, 11)
     Case PairID TwinID Female variable value
1       1     28      1      0 Marital1  1.00
703     1     28      1      0 Marital2   NaN
1405    1     28      1      0 Marital3   NaN
2107    1     28      1      0 Marital4   NaN
2809    1     28      1      0 Marital5   NaN
3511    1     28      1      0 CompAge1 91.25
4213    1     28      1      0 CompAge2   NaN
4915    1     28      1      0 CompAge3   NaN
5617    1     28      1      0 CompAge4   NaN
6319    1     28      1      0 CompAge5   NaN
7021    1     28      1      0   epi_e1   NaN

Notice that ALL variables that change with time are stacked into two columns:
- variable
- value
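
To make the stacking concrete, the same layout can be built by hand in base R on a tiny example. This is only a sketch of the resulting shape, not of how melt is implemented; the data frame `wide` and its values are invented.

```r
# Toy wide data: one row per person, one column per occasion (values invented).
wide <- data.frame(
  Case     = c(1, 2),
  Marital1 = c(1, 2),
  Marital2 = c(NaN, 2)
)
measured <- c("Marital1", "Marital2")

# Stack the measured columns by hand:
# column names go into `variable`, cell contents go into `value`.
long <- data.frame(
  Case     = rep(wide$Case, times = length(measured)),
  variable = rep(measured, each = nrow(wide)),
  value    = unlist(wide[measured], use.names = FALSE)
)
long
nrow(long) == nrow(wide) * length(measured)
```

Every person contributes one row per measured column, which is why the long dataset has many more rows than the wide one.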

However, some work may need to be done to clean up the new data:

# Create variable that counts time
dsLong$time <- stringr::str_sub(dsLong$variable,-1,-1) 
# head(dsLong, 13)
# Remove the counter suffix (a trailing digit 1-5) from the "variable"
dsLong$variable <- gsub(pattern="[1-5]$", replacement="", x=dsLong$variable)
head(dsLong, 11)
     Case PairID TwinID Female variable value time
1       1     28      1      0  Marital  1.00    1
703     1     28      1      0  Marital   NaN    2
1405    1     28      1      0  Marital   NaN    3
2107    1     28      1      0  Marital   NaN    4
2809    1     28      1      0  Marital   NaN    5
3511    1     28      1      0  CompAge 91.25    1
4213    1     28      1      0  CompAge   NaN    2
4915    1     28      1      0  CompAge   NaN    3
5617    1     28      1      0  CompAge   NaN    4
6319    1     28      1      0  CompAge   NaN    5
7021    1     28      1      0    epi_e   NaN    1
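
The same split of a name into its stem and its wave number can be tried in isolation with base R regular expressions. This is a sketch that assumes, as in the output above, that every name ends in a single digit from 1 to 5; the vector `variable` is a toy example.

```r
variable <- c("Marital1", "Marital5", "CompAge3", "epi_e2")

# Keep only the trailing digit and store it as a number.
time <- as.integer(sub("^.*([1-5])$", "\\1", variable))
# Drop only the trailing digit; digits elsewhere in a name would be left alone.
stem <- sub("[1-5]$", "", variable)

data.frame(variable, stem, time)
```

Storing time as an integer rather than as text can be convenient later, when it is used on an axis or in a model.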

Now we transform the data with dcast, the reverse of melt, which takes the values of the column “variable” and turns them into the names of new individual variables. Let’s take a few minutes to interpret the syntax of the code and play with the options.

dsL <- dcast(dsLong, Case + PairID + TwinID + Female + time ~ variable, value.var = "value")
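
One way to play with the options is to run dcast on a toy long dataset small enough to check by eye. In the sketch below the data frame `long` and its values are invented; only the formula changes between the two calls.

```r
library(reshape2)

# Toy long data (values invented): two people, two waves, two measures.
long <- data.frame(
  Case     = c(1, 1, 1, 1, 2, 2, 2, 2),
  time     = c(1, 1, 2, 2, 1, 1, 2, 2),
  variable = rep(c("CompAge", "epi_e"), times = 4),
  value    = c(91.2, 4, 93.3, 5, 80.1, 7, 82.0, 6)
)

# Left of ~ : columns that identify a row of the result.
# Right of ~ : the column whose values become new column names.
by_wave <- dcast(long, Case + time ~ variable, value.var = "value")
by_wave

# Moving `time` to the right of ~ gives one row per person instead.
by_person <- dcast(long, Case ~ variable + time, value.var = "value")
by_person
```

The first result has one row per person per wave, which is the shape of dsL; the second has one row per person, which is the shape of dsW.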

At the end, we apply descriptive labels to the categorical variables.
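
Labeling can be done with factor(). The sketch below is an illustration on toy data, not the project's labeling script; the coding 0 = male, 1 = female is an assumption based on the variable name Female.

```r
# Toy data; the coding 0 = male, 1 = female is assumed from the name `Female`.
toy <- data.frame(Case = 1:4, Female = c(0, 1, 1, 0))

# levels = the codes stored in the data; labels = the text shown in tables and legends.
toy$Female <- factor(toy$Female, levels = c(0, 1), labels = c("Male", "Female"))

levels(toy$Female)
table(toy$Female)
```

Labeled factors carry their labels into ggplot legends, so this step pays off as soon as graphing starts.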

Start Graphing

Now the data is ready to be graphed.

# standard font size for plots
baseSize <- 11
# add this theme to a ggplot to style it
themeLine <- ggplot2::theme_bw(base_size=baseSize) +
  ggplot2::theme(title=ggplot2::element_text(colour="gray20",size = baseSize+1)) +
  ggplot2::theme(axis.text=ggplot2::element_text(colour="gray40")) +
  ggplot2::theme(axis.title=ggplot2::element_text(colour="gray40")) +
  ggplot2::theme(panel.border = ggplot2::element_rect(colour="gray80")) +
  ggplot2::theme(axis.ticks.length = grid::unit(0, "cm")) +
  ggplot2::theme(text = ggplot2::element_text(size =baseSize+7))

We’ll use these style definitions in graphs.

We can start with a simple graph

dsM <- dsL %>%
  dplyr::select(Case, PairID, TwinID, CompAge, epi_e, epi_n)
head(dsM, 11)
   Case PairID TwinID CompAge epi_e epi_n
1     1     28      1   91.25   NaN   NaN
2     1     28      1     NaN   NaN   NaN
3     1     28      1     NaN   NaN   NaN
4     1     28      1     NaN   NaN   NaN
5     1     28      1     NaN   NaN   NaN
6     2     28      2   91.23     4     0
7     2     28      2   93.35   NaN   NaN
8     2     28      2     NaN   NaN   NaN
9     2     28      2     NaN   NaN   NaN
10    2     28      2     NaN   NaN   NaN
11    3     47      1   92.03     5     1
p <- ggplot(dsM, aes(x=CompAge, group=Case, color=factor(TwinID)))
p <- p + geom_line(aes(y=epi_e))
p
Warning: Removed 1349 rows containing missing values (geom_path).

plot of chunk SimplePlot

and gradually evolve it into a more complex graph, as we’ve done before

dsM <- dsL %>%
  dplyr::select(Case, PairID, TwinID, CompAge, epi_e, epi_n) %>%
  dplyr::filter (TwinID %in% c(1,2))
head(dsM, 11)
   Case PairID TwinID CompAge epi_e epi_n
1     1     28      1   91.25   NaN   NaN
2     1     28      1     NaN   NaN   NaN
3     1     28      1     NaN   NaN   NaN
4     1     28      1     NaN   NaN   NaN
5     1     28      1     NaN   NaN   NaN
6     2     28      2   91.23     4     0
7     2     28      2   93.35   NaN   NaN
8     2     28      2     NaN   NaN   NaN
9     2     28      2     NaN   NaN   NaN
10    2     28      2     NaN   NaN   NaN
11    3     47      1   92.03     5     1
p <- ggplot(dsM, aes(x=CompAge, group=Case, color=factor(TwinID)))
p <- p + geom_line(aes(y=epi_e),position="jitter")
p <- p + geom_point(aes(y=epi_e), position="jitter")
# p <- p + geom_line(aes(y=epi_n),position="jitter")
# p <- p + geom_point(aes(y=epi_n), position="jitter")
p <- p + themeLine
p
Warning: Removed 1349 rows containing missing values (geom_path).
Warning: Removed 2383 rows containing missing values (geom_point).

plot of chunk AdvancedPlot
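
The warnings above appear because ggplot2 drops rows in which a plotted variable is missing. If we prefer to see how many rows are affected before plotting, we can count and remove them ourselves. The sketch below uses an invented toy dataset `dsM_toy`; the same two steps could be applied to dsM.

```r
# Toy model-ready data (values invented); NaN marks a missed measurement.
dsM_toy <- data.frame(
  Case    = c(1, 1, 2, 2, 3),
  CompAge = c(91.25, NaN, 91.23, 93.35, 92.03),
  epi_e   = c(NaN, NaN, 4, NaN, 5)
)

# is.na() is TRUE for both NA and NaN.
incomplete <- is.na(dsM_toy$CompAge) | is.na(dsM_toy$epi_e)
sum(incomplete)                        # rows with a missing x or y value

# Keep only the rows that can actually be drawn.
dsM_complete <- dsM_toy[!incomplete, ]
nrow(dsM_complete)
```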

Next time, we’ll take a look at graph production in detail.