The tutorial walks through a simple project in which we import OCTO-TWIN data and learn to build line graphs of change over time.
First, we need to load the packages that will be used during this session.
# Load the necessary packages.
base::require(base)
base::require(knitr)
base::require(markdown)
base::require(testit)
base::require(dplyr)
base::require(reshape2)
base::require(ggplot2)
base::require(sas7bdat)
Sometimes it is useful to declare global objects to be used throughout the report, such as a font size shared by a collection of plots.
baseSize <- 11
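For instance, baseSize will later set the base font of our plot theme; a minimal illustration:
# baseSize feeds the base_size argument of a ggplot2 theme (used later in this tutorial)
ggplot2::theme_bw(base_size=baseSize)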
Now we need to locate the data file we would like to work with and bring it into R. Borrowing from epidemiology, we would like to establish a “dataset zero”, a milestone in the workflow of our reproducible project. This dataset, ds0, represents the state of the data immediately after the import from the secondary source.
pathDir <- getwd()
pathFile <- file.path(pathDir,"Data/Extract/octoall_021114.sas7bdat")
path_ds0 <- file.path(pathDir, "Data/Derived/ds0.Rds")
path_dsL <- file.path("../Data/Derived/dsL.Rds")
# ds0 <- read.sas7bdat(pathFile, debug=TRUE)
# saveRDS(object=ds0, file=path_ds0, compress="xz")
### Either use the ds0 definition above or the one below.
ds0 <- readRDS(path_ds0) # this saves time
To speed up execution after the initial import, the file is converted into R's native .Rds format, which can be read back in much faster than importing from the foreign data source.
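A common way to arrange this is to import from SAS only when the cached .Rds copy is absent; a sketch (the project scripts may organize it differently):
# import once, then reuse the cached copy
if( !file.exists(path_ds0) ){
  ds0 <- sas7bdat::read.sas7bdat(pathFile, debug=TRUE)
  saveRDS(object=ds0, file=path_ds0, compress="xz")
}else{
  ds0 <- readRDS(path_ds0) # much faster than re-importing
}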
Sometimes it is useful to think of the data preparation process as a series of transformations of the data on its way from raw digital records to graph production. We will identify three such stages: original input (ds0), project span (dsL or dsW), and modeling-ready (dsM) datasets.
- ds0 - original input
- dsL or dsW - project span
- dsM - modeling ready
- ds - nameless dataset; temporary, used to make code exchangeable
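Schematically, each stage is a short derivation of the previous one; a compressed sketch (the variable lists shown here are illustrative, the real ones appear below):
ds0 <- readRDS(path_ds0)                                  # original input
dsL <- ds0 %>% dplyr::select(PairID, Case, Wave, CompAge) # project span
dsM <- dsL %>% dplyr::filter(!is.na(CompAge))             # modeling ready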
Now that we have the dataset imported, let’s explore a few ways to quickly inspect it before undertaking further data processing steps. We know that ds0 is a data frame:
class(ds0)
[1] "data.frame"
of the following dimensions (rows, columns):
dim(ds0)
[1] 3510 592
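Before printing any structure, it can help to skim just the variable names; a minimal sketch:
head(names(ds0), 20) # first 20 of the 592 variable names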
With so many variables, even such a basic function as
str(ds0)
becomes of little use, as its output floods the console. After identifying the key variables needed to understand the longitudinal structure of the data, we can select them from the original dataset:
ds <- ds0 %>%
  dplyr::select(PairID, Case, TwinID, Wave, time, CompAge)
head(ds, 6)
PairID Case TwinID Wave time CompAge
1 28 1 1 1 0 91.25
2 28 1 1 2 NaN NaN
3 28 1 1 3 NaN NaN
4 28 1 1 4 NaN NaN
5 28 1 1 5 NaN NaN
6 28 2 2 1 0 91.23
PairID - identifies the twin pair
Case - unique personal identifier
Wave - round of data collection
time - years since the initial measurement
CompAge - composite age in years

To that we can add the outcomes we’d like to model, the predictors we’d like to test, and other contextual variables. For example,
ds <- ds0 %>%
  dplyr::select(PairID, Case, TwinID, Wave, time, CompAge, prose, sbp, Female)
head(ds, 6)
PairID Case TwinID Wave time CompAge prose sbp Female
1 28 1 1 1 0 91.25 3 120 0
2 28 1 1 2 NaN NaN NaN NaN 0
3 28 1 1 3 NaN NaN NaN NaN 0
4 28 1 1 4 NaN NaN NaN NaN 0
5 28 1 1 5 NaN NaN NaN NaN 0
6 28 2 2 1 0 91.23 3 120 0
Variables like Female and Marital are known as factors: they can take only a limited range of categorical values.
FemaleLevels <- c(0,1)
FemaleLabels <- c("Male","Female")
varlist <- c("Female")
for(i in varlist){
  # create an ordered factor alongside the raw numeric variable
  ds0[,paste0(i,"F")] <- ordered(ds0[,i],
                                 levels = FemaleLevels,
                                 labels = FemaleLabels)
}
Instead of replacing the existing variable, we create a duplicate, marked by the ending “F” for “factor”. It is typically useful to have quick access to both the raw and the labelled data.
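A quick cross-tabulation (a sketch) can confirm that the labelled duplicate lines up with the raw codes:
table(ds0$Female, ds0$FemaleF, useNA="ifany") # each raw code should map to exactly one label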
ds <- ds0 %>%
  dplyr::select(PairID, Case, TwinID, Wave, time, CompAge, prose, sbp, Female, FemaleF)
head(ds, 6)
PairID Case TwinID Wave time CompAge prose sbp Female FemaleF
1 28 1 1 1 0 91.25 3 120 0 Male
2 28 1 1 2 NaN NaN NaN NaN 0 Male
3 28 1 1 3 NaN NaN NaN NaN 0 Male
4 28 1 1 4 NaN NaN NaN NaN 0 Male
5 28 1 1 5 NaN NaN NaN NaN 0 Male
6 28 2 2 1 0 91.23 3 120 0 Male
We establish which variables should be within quick reach because they are likely to be used in the project. These variables are assembled in the dataset dsL.
dsL <- ds0 %>%
dplyr::select(Case, PairID, TwinID, Wave, time,
BirthDate, DeadDate, CompAge, Female, FemaleF, Marital, # demographic
weight, height, bmi, # physical
prose, synnum, # cognitive
demsev, demsyn, DemAge, # dementia
diabYN, # diabetes
sbp, HTyn, HTyn1:HTyn5, # hypertension
SRhlth, SRhlth1:SRhlth5 # self-reported health
)
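A quick dimension check shows how much narrower dsL is than the 592-column ds0:
dim(dsL) # same 3510 rows, far fewer columns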
However, even this file is too big to be convenient for the purposes of graph production and statistical modeling. We subset a more compact dataset and might augment it with service variables if needed for modeling or graphing.
dsM <- dsL %>%
  dplyr::filter(ave(!is.na(CompAge), PairID, FUN = all)) %>% # keep only pairs with no missing CompAge at any wave
  dplyr::select(PairID, Case, TwinID, Wave, time, CompAge, bmi, weight, sbp, prose, Female, FemaleF)
head(dsM)
PairID Case TwinID Wave time CompAge bmi weight sbp prose Female FemaleF
1 915 49 1 1 0.000 87.32 23.38 66 150 NaN 0 Male
2 915 49 1 2 2.249 89.57 NaN NaN NaN NaN 0 Male
3 915 49 1 3 4.260 91.58 24.91 67 140 NaN 0 Male
4 915 49 1 4 6.173 93.50 NaN NaN NaN NaN 0 Male
5 915 49 1 5 8.126 95.45 22.04 60 120 NaN 0 Male
6 915 50 2 1 0.000 87.33 25.56 73 140 NaN 0 Male
Now that the dataset is more manageable, we can get some descriptive statistics. Two functions are particularly useful:
base::summary(dsM)
PairID Case TwinID Wave time CompAge bmi weight
Min. : 915 Min. : 49 Min. :1.0 Min. :1 Min. :0.00 Min. :79.4 Min. :15.2 Min. :35.0
1st Qu.:2407 1st Qu.:268 1st Qu.:1.0 1st Qu.:2 1st Qu.:2.00 1st Qu.:83.7 1st Qu.:21.6 1st Qu.:53.0
Median :3137 Median :416 Median :1.5 Median :3 Median :4.06 Median :86.5 Median :24.1 Median :60.0
Mean :3014 Mean :410 Mean :1.5 Mean :3 Mean :4.05 Mean :86.5 Mean :24.5 Mean :61.2
3rd Qu.:3613 3rd Qu.:543 3rd Qu.:2.0 3rd Qu.:4 3rd Qu.:6.17 3rd Qu.:89.0 3rd Qu.:26.8 3rd Qu.:69.8
Max. :4254 Max. :684 Max. :2.0 Max. :5 Max. :8.43 Max. :95.5 Max. :41.1 Max. :95.0
NA's :36 NA's :20
sbp prose Female FemaleF
Min. : 90 Min. : 0.0 Min. :0.000 Male : 90
1st Qu.:140 1st Qu.: 9.0 1st Qu.:1.000 Female:340
Median :150 Median :12.0 Median :1.000
Mean :153 Mean :10.3 Mean :0.791
3rd Qu.:160 3rd Qu.:13.0 3rd Qu.:1.000
Max. :260 Max. :16.0 Max. :1.000
NA's :8 NA's :96
# and
psych::describe(dsM)
vars n mean sd median trimmed mad min max range skew kurtosis se
PairID 1 430 3014.0000 803.6795 3137.000 3074.0116 830.2560 915.00 4254.000 3339.000 -0.63803 -0.31280 38.75687
Case 2 430 409.8721 171.3158 415.500 417.1919 190.5141 49.00 684.000 635.000 -0.32125 -0.96553 8.26158
TwinID 3 430 1.5000 0.5006 1.500 1.5000 0.7413 1.00 2.000 1.000 0.00000 -2.00465 0.02414
Wave 4 430 3.0000 1.4159 3.000 3.0000 1.4826 1.00 5.000 4.000 0.00000 -1.30790 0.06828
time 5 430 4.0502 2.8438 4.059 4.0442 3.0710 0.00 8.427 8.427 -0.02027 -1.29804 0.13714
CompAge 6 430 86.4836 3.5046 86.554 86.3933 3.9458 79.37 95.458 16.088 0.20040 -0.55270 0.16901
bmi 7 394 24.4919 4.0290 24.135 24.3227 3.9097 15.24 41.118 25.877 0.54786 0.62745 0.20298
weight 8 410 61.2073 11.2077 60.000 60.9512 11.8608 35.00 95.000 60.000 0.24364 -0.34238 0.55351
sbp 9 422 152.6185 24.3139 150.000 151.0947 14.8260 90.00 260.000 170.000 0.95702 2.21738 1.18358
prose 10 334 10.3473 3.9317 12.000 10.8470 2.9652 0.00 16.000 16.000 -1.03149 0.30772 0.21514
Female 11 430 0.7907 0.4073 1.000 0.8634 0.0000 0.00 1.000 1.000 -1.42417 0.02835 0.01964
FemaleF* 12 430 1.7907 0.4073 2.000 1.8634 0.0000 1.00 2.000 1.000 -1.42417 0.02835 0.01964
A graph or a model is meaningful only to the degree that the data it manifests are meaningful. Understanding the precise origin of your data is, therefore, a key component in the construction of any graph. To simplify the exposition of this genealogy, we have introduced three data stages (see above). This section walks the path from the initiation of an R session to the declaration of the data genealogy immediately before the graph.
Imagine that you are loading a fresh RStudio session, with the Hypertension and Cognition project open. If you’d like to get quickly to playing with graphs and models of your data, open, from the project root,
./Scripts/Templates/explore-dsL.R
which contains the following default text:
# remove all elements for a clean start
rm(list=ls(all=TRUE))
## @knitr LoadPackages
# Load packages to use
base::require(base)
base::require(knitr)
base::require(markdown)
base::require(testit)
base::require(dplyr)
base::require(reshape2)
base::require(ggplot2)
base::require(sas7bdat)
## @knitr LoadData
# run the script that loads the ds0 and dsL data stages
source(file.path(getwd(),"Scripts/Data/dsL.R"))
rm(list=setdiff(ls(), c("ds0", "dsL"))) # remove all objects except ds0 and dsL
## @knitr FirstLook
dim(ds0)
dim(dsL)
str(dsL)
## @knitr DefinedsM
# define the modeling ready dataset
This script brings you to the first major milestone: the dsL data stage, which defines the scope of your research session, i.e. gives you a manageable, yet rich, data object to work with.
For the details of how this data object (dsL) has been derived, you are directed to the script “dsL.R”, located in /Scripts/Data in the project’s root. If you inspect the script dsL.R, you’ll see that as part of the data derivation it runs yet another remote script (LabelingFactorLevels.R), which organizes the measurement properties of our data. This way, we enter the modeling mental space fully equipped.
Although we might use dsL as the data source in the production of descriptive statistics, the production of graphs and the estimation of models frequently ask for modifications and augmentations of the data to serve particular tactical needs. Therefore we create and work with a disposable data stage, dsM, which is a direct derivative of dsL.
The idea is that at every moment of your modeling project you can, in a few lines of code, define precisely what data you are using. For the sake of consistency, I chose to employ dplyr syntax to declare such derivations. For example,
dsM <- dsL %>% # parent
  dplyr::group_by(PairID) %>% # new unit: groups of observations sharing a PairID
  dplyr::filter(all(!is.na(CompAge))) %>% # keep only groups in which no (!) value of CompAge is missing (is.na)
  dplyr::select(PairID, Case, TwinID, Wave, time, CompAge, bmi, weight, sbp, prose, FemaleF) # select these variables
dim(dsM) # dimensions (rows, columns)
[1] 430 11
(For other ways of subsetting based on a group condition, see this StackOverflow post.)
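For reference, the same group condition can also be written without group_by(), reusing the base-R ave() trick employed earlier in this tutorial; a sketch:
keep <- ave(!is.na(dsL$CompAge), dsL$PairID, FUN=all) # TRUE for rows of pairs with no missing CompAge
dsM2 <- dsL[keep, ] # same rows as dsM, before the column selection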
The origin of the dataset dsL has already been provided; therefore the above dplyr call is all that is necessary to offer a precise definition of the data. Every graph and model is expected to provide an account of such genealogy.
Now that we know what’s in the data we can turn to drawing graphs.
We start with the most basic plot:
p <- ggplot2::ggplot(data=dsM,
                     aes(x=CompAge, y=sbp, group=Case))
p <- p + geom_line()
p
and, through incremental augmentation during the lab session, develop it into the following graph:
themeLine <- ggplot2::theme_bw(base_size=baseSize) +
ggplot2::theme(title=ggplot2::element_text(colour="gray20",size = baseSize+1)) +
ggplot2::theme(axis.text=ggplot2::element_text(colour="gray40")) +
ggplot2::theme(axis.title=ggplot2::element_text(colour="gray40")) +
ggplot2::theme(panel.border = ggplot2::element_rect(colour="gray80")) +
ggplot2::theme(axis.ticks.length = grid::unit(0, "cm")) +
ggplot2::theme(text = ggplot2::element_text(size=baseSize+7))
dsM$TwinID <- factor(dsM$TwinID)
p <- ggplot2::ggplot(data=dsM,
                     aes(x=CompAge, y=sbp, group=Case, color=TwinID))
p <- p + geom_line()
p <- p + geom_point()
p <- p + themeLine
p <- p + labs(title="Dynamics of BP over time",
              x="Age of Respondent",
              y="Systolic blood pressure")
p <- p + facet_grid(FemaleF ~ TwinID)
p
### Consider the following options to further customize your graphs.
# # geoms
# p <- p + geom_line(aes(x=,y=),colour="red",alpha=.5,size=2, na.rm=T)
# p <- p + geom_line(aes(y=), fill=NA, na.rm=T)
# # scales & coordinates
# p <- p + scale_x_continuous(breaks=c(0:15))
# p <- p + scale_y_continuous(breaks=seq(0, 8, 1))
# p <- p + coord_cartesian(xlim=c(-.5, 15.5), ylim=c(.5, 8.5))
#
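To keep a copy of the final graph, it can be saved to disk; a sketch (file name and dimensions are arbitrary examples):
# p is the plot object assembled above
ggplot2::ggsave("sbp-trajectories.png", plot=p, width=8, height=6)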