Today
Study: OCTO-TWIN
In focus: Personality and Dementia
Goals:
1. To practice arriving at dsL
2. To learn two basic verbs of dplyr
- select()
- filter()
3. To develop a line graph for a single twin pair
Loading the packages gives us some idea of what we will be doing. One can think of this as a sort of summary of the report.
base::require(base)
base::require(sas7bdat)
base::require(dplyr)
base::require(reshape2)
base::require(ggplot2)
base::require(knitr)
base::require(markdown)
base::require(testit)
Last time (Fri-26-Sep-2014) we discussed in detail the first part of the road map for the workflow in our data analysis projects:
This is a general and simplified map: every project might modify its general structure. To illustrate, our current project has a separate script
./Scripts/Data/dsL.R
that produces a stable and familiar stage dsL, which we’ve annotated. Reviewing this script, one can see that the path from ds0 to dsL can be summarized by the following diagram:
The script ends by removing all the temporary objects except for those we’d like to keep.
rm(list=setdiff(ls(), c("ds0", "dsW", "dsLong", "dsL")))
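The cleanup idiom can be tried safely in isolation. The following is a minimal sketch, assuming a scratch environment and made-up temporary objects (`tmp1` and `tmp2` are hypothetical), so that the real workspace is not touched:

```r
# Scratch environment standing in for the workspace (contents are hypothetical).
e <- new.env()
assign("ds0",  data.frame(x = 1), envir = e)
assign("dsL",  data.frame(x = 2), envir = e)
assign("tmp1", 1:3,               envir = e)
assign("tmp2", "scratch",         envir = e)

keep    <- c("ds0", "dsL")          # objects that should survive
to_drop <- setdiff(ls(e), keep)     # every other object name
rm(list = to_drop, envir = e)

remaining <- ls(e)
print(remaining)
```

setdiff() returns the names present in the workspace but absent from the keep list, so rm() removes only the temporary objects.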
Let’s take a moment and inspect the script ./Scripts/Data/dsL.R, the Data Flow map above, and the contents of the four data stages.
Typically, you would want to have a separate report (.Rmd file) that annotates the production of the dsL data stage. This function is partially fulfilled by the report
./LabLog/26-09-2014-Dementia-Personality-OCTO.Rmd
which we covered last week. Having a dedicated narrative detailing the origin of a particular data stage saves time in modeling projects and, more importantly, allows us to quickly start graphing and modeling our data. To arrive at the dsL stage, run the script responsible for its derivation:
# run the script that loads ds0 and dsL data stages.
source(file.path(getwd(),"Scripts/Data/dsL.R"))
rm(list=setdiff(ls(), c("ds0", "dsL", "themeLine", "baseSize"))) # remove all objects except
The data stage dsL, at which we’ve just arrived, is still a bit cumbersome to handle and to think about. Ultimately, we want to comment on the relationship of just a few variables, so let’s start with the bare minimum.
Selecting variables from a dataset can be accomplished with the select() function of the dplyr package:
dsM <- dsL %>%
dplyr::select(Case, PairID, TwinID, Zygosity,time, CompAge, epi_e, epi_n, DemEver)
head(dsM,12)
Case PairID TwinID Zygosity time CompAge epi_e epi_n DemEver
1 1 28 1 2 1 91.25 NaN NaN 0
2 1 28 1 2 2 NaN NaN NaN 0
3 1 28 1 2 3 NaN NaN NaN 0
4 1 28 1 2 4 NaN NaN NaN 0
5 1 28 1 2 5 NaN NaN NaN 0
6 2 28 2 2 1 91.23 4 0 0
7 2 28 2 2 2 93.35 NaN NaN 0
8 2 28 2 2 3 NaN NaN NaN 0
9 2 28 2 2 4 NaN NaN NaN 0
10 2 28 2 2 5 NaN NaN NaN 0
11 3 47 1 1 1 92.03 5 1 1
12 3 47 1 1 2 94.29 NaN NaN 1
The current data stage dsM is completely defined by the dplyr call above (remember, we take dsL as the stable milestone now). We read from the dplyr definition that dsM inherits all the rows from dsL; however, only the nine variables listed in select() (Case, PairID, TwinID, Zygosity, time, CompAge, epi_e, epi_n, and DemEver) are carried into dsM.
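To see the "same rows, fewer columns" behavior on something small, here is a sketch with a toy data frame. The column names mimic dsL, but the values and the `extra` column are illustrative assumptions, not the study data:

```r
library(dplyr)

# Toy data frame; values are illustrative only.
toy <- data.frame(
  Case    = c(1, 1, 2, 2),
  PairID  = c(28, 28, 28, 28),
  TwinID  = c(1, 1, 2, 2),
  time    = c(1, 2, 1, 2),
  CompAge = c(91.25, NA, 91.23, 93.35),
  extra   = c("a", "b", "c", "d")   # a column we do not need
)

toy_sel <- toy %>%
  dplyr::select(Case, PairID, time, CompAge)

dim(toy)      # 4 rows, 6 columns
dim(toy_sel)  # 4 rows, 4 columns: same rows, fewer columns
```

The columns come back in the order in which they are listed inside select().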
As part of the dsL derivation, we have assigned labels to the factors (categorical variables); their definitions are located in
./Scripts/Data/LabelingFactorLevels.R
The labeled factors are stored in separate columns, whose names are formed by appending the suffix “F” to the names of the original columns.
dsM <- dsL %>%
dplyr::select(Case, PairID, TwinID, TwinIDF)
head(dsM)
Case PairID TwinID TwinIDF
1 1 28 1 First-Born
2 1 28 1 First-Born
3 1 28 1 First-Born
4 1 28 1 First-Born
5 1 28 1 First-Born
6 2 28 2 Second-Born
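The pairing of a numeric code with a labeled "F" column can be sketched as follows. This is only an illustration: the actual contents of LabelingFactorLevels.R are not shown here, and the coding 1 = First-Born, 2 = Second-Born is assumed from the output above.

```r
# Toy rows; the level-to-label mapping is inferred from the printed output.
toy <- data.frame(Case = c(1, 2), PairID = c(28, 28), TwinID = c(1, 2))

# Keep the numeric code and add a labeled copy with the suffix "F".
toy$TwinIDF <- factor(toy$TwinID,
                      levels = c(1, 2),
                      labels = c("First-Born", "Second-Born"))
toy
```

Keeping both columns lets us filter on the numeric code or on the readable label, whichever is more convenient.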
We know that dsL contains data for
length(unique(dsL$Case))
[1] 702
individuals and/or
length(unique(dsL$PairID))
[1] 351
twin pairs.
The function filter() of the dplyr package allows us to select rows that match the given criteria. Let’s select all the rows in which the column “PairID” contains the numeric value 28:
dsM <- dsL %>%
dplyr::select (Case, PairID, TwinIDF, ZygosityF, time, CompAge, epi_e, epi_n, DemEver) %>%
dplyr::filter (PairID == 28)
dsM
Case PairID TwinIDF ZygosityF time CompAge epi_e epi_n DemEver
1 1 28 First-Born DZ 1 91.25 NaN NaN 0
2 1 28 First-Born DZ 2 NaN NaN NaN 0
3 1 28 First-Born DZ 3 NaN NaN NaN 0
4 1 28 First-Born DZ 4 NaN NaN NaN 0
5 1 28 First-Born DZ 5 NaN NaN NaN 0
6 2 28 Second-Born DZ 1 91.23 4 0 0
7 2 28 Second-Born DZ 2 93.35 NaN NaN 0
8 2 28 Second-Born DZ 3 NaN NaN NaN 0
9 2 28 Second-Born DZ 4 NaN NaN NaN 0
10 2 28 Second-Born DZ 5 NaN NaN NaN 0
We can apply more than one condition. For example, to select only the first-born of this twin pair:
dsM <- dsL %>%
dplyr::select (Case, PairID, TwinIDF, ZygosityF, time, CompAge, epi_e, epi_n, DemEver) %>%
dplyr::filter (PairID == 28, TwinIDF == "First-Born")
dsM
Case PairID TwinIDF ZygosityF time CompAge epi_e epi_n DemEver
1 1 28 First-Born DZ 1 91.25 NaN NaN 0
2 1 28 First-Born DZ 2 NaN NaN NaN 0
3 1 28 First-Born DZ 3 NaN NaN NaN 0
4 1 28 First-Born DZ 4 NaN NaN NaN 0
5 1 28 First-Born DZ 5 NaN NaN NaN 0
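Conditions separated by commas inside filter() are combined with a logical AND. A small sketch on a toy data frame (values are illustrative) compares the comma form with the explicit `&` and `|` operators:

```r
library(dplyr)

toy <- data.frame(
  PairID  = c(28, 28, 47, 47),
  TwinIDF = c("First-Born", "Second-Born", "First-Born", "Second-Born")
)

# Comma-separated conditions: a row must satisfy all of them.
with_comma <- toy %>% dplyr::filter(PairID == 28, TwinIDF == "First-Born")

# The same filter written with an explicit AND.
with_and <- toy %>% dplyr::filter(PairID == 28 & TwinIDF == "First-Born")

# OR has to be spelled out with `|`.
either <- toy %>% dplyr::filter(PairID == 28 | TwinIDF == "First-Born")

nrow(with_comma)  # 1
nrow(either)      # 3
```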
We can also use more complex criteria, for example, selecting only the observations for which the CompAge variable has a valid (non-missing) value:
dsM <- dsL %>%
dplyr::select (Case, PairID, TwinIDF, ZygosityF, time, CompAge, epi_e, epi_n, DemEver) %>%
dplyr::filter(!is.na(CompAge)) # keep rows with a non-missing age; filter() takes a logical condition, not a named argument
head(dsM,12)
Case PairID TwinIDF ZygosityF time CompAge epi_e epi_n DemEver
1 1 28 First-Born DZ 1 91.25 NaN NaN 0
2 2 28 Second-Born DZ 1 91.23 4 0 0
3 2 28 Second-Born DZ 2 93.35 NaN NaN 0
4 3 47 First-Born MZ 1 92.03 5 1 1
5 3 47 First-Born MZ 2 94.29 NaN NaN 1
6 3 47 First-Born MZ 3 96.18 NaN NaN 1
7 3 47 First-Born MZ 4 98.12 NaN NaN 1
8 3 47 First-Born MZ 5 100.18 NaN NaN 1
9 4 47 Second-Born MZ 1 92.04 NaN NaN 0
10 5 188 First-Born MZ 1 92.94 NaN NaN 0
11 5 188 First-Born MZ 2 95.01 NaN NaN 0
12 5 188 First-Born MZ 3 97.02 NaN NaN 0
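The missing values in this dataset print as NaN rather than NA. In R, is.na() returns TRUE for both, so the same filter handles either kind. A quick check on a toy data frame (values are illustrative):

```r
library(dplyr)

toy <- data.frame(Case = 1:4, CompAge = c(91.25, NaN, NA, 93.35))

is.na(toy$CompAge)   # FALSE TRUE TRUE FALSE

# Keep only the rows with a usable age.
valid <- toy %>% dplyr::filter(!is.na(CompAge))

# A comparison against a missing value yields NA, and filter() drops such rows too.
older <- toy %>% dplyr::filter(CompAge > 90)

nrow(valid)  # 2
nrow(older)  # 2
```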
We can also combine filter() with a group_by() clause to issue very specific filters. For example, the following code selects the observations for the twin pairs (PairID) in which both twins have a non-missing record of age at all time points:
dsM <- dsL %>%
dplyr::group_by(PairID) %>%
dplyr::filter(all(!is.na(CompAge))) %>% # keep a pair only if none of its ages is missing
dplyr::select (Case, PairID, TwinIDF, ZygosityF, time, CompAge, epi_e, epi_n, DemEver)
head(dsM,12)
Source: local data frame [12 x 9]
Groups: PairID
Case PairID TwinIDF ZygosityF time CompAge epi_e epi_n DemEver
1 49 915 First-Born MZ 1 87.32 6 1 0
2 49 915 First-Born MZ 2 89.57 NaN NaN 0
3 49 915 First-Born MZ 3 91.58 NaN NaN 0
4 49 915 First-Born MZ 4 93.50 NaN NaN 0
5 49 915 First-Born MZ 5 95.45 NaN NaN 0
6 50 915 Second-Born MZ 1 87.33 NaN NaN 1
7 50 915 Second-Born MZ 2 89.49 NaN NaN 1
8 50 915 Second-Born MZ 3 91.47 NaN NaN 1
9 50 915 Second-Born MZ 4 93.38 NaN NaN 1
10 50 915 Second-Born MZ 5 95.32 NaN NaN 1
11 81 1265 First-Born DZ 1 87.07 3 9 0
12 81 1265 First-Born DZ 2 89.22 2 7 0
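When filter() follows group_by(), a summary such as all() is evaluated once per group, so a group is kept or dropped as a whole. A sketch with two hypothetical pairs, one of which has a missing age:

```r
library(dplyr)

# Pair 1 has one missing age; pair 2 is complete (values are made up).
toy <- data.frame(
  PairID  = c(1, 1, 1, 1, 2, 2, 2, 2),
  Case    = c(1, 1, 2, 2, 3, 3, 4, 4),
  time    = c(1, 2, 1, 2, 1, 2, 1, 2),
  CompAge = c(80, 82, 80, NA, 85, 87, 85, 87)
)

complete_pairs <- toy %>%
  dplyr::group_by(PairID) %>%
  dplyr::filter(all(!is.na(CompAge))) %>%  # one TRUE/FALSE per pair
  dplyr::ungroup()

unique(complete_pairs$PairID)  # 2
nrow(complete_pairs)           # 4
```

A single missing value for either twin removes all rows of that pair, which is why this filter is much stricter than the row-wise version.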
Further increasing the strictness of the filter, we’d like to select the subset of the previous dataset in which observations for “Extraversion” (epi_e) are available at every retained time point (waves 1 through 4):
dsM <- dsL %>%
dplyr::group_by(PairID) %>%
dplyr::filter(all(!is.na(CompAge))) %>% # pairs with complete age records
dplyr::filter(time %in% c(1:4)) %>% # keep waves 1 through 4
dplyr::select (Case, PairID, TwinIDF, ZygosityF, time, CompAge, epi_e, epi_n, DemEver)
dsM <- dsM %>%
dplyr::group_by(PairID) %>%
dplyr::filter(all(!is.na(epi_e))) # pairs with complete extraversion records
head(dsM,12)
Source: local data frame [12 x 9]
Groups: PairID
Case PairID TwinIDF ZygosityF time CompAge epi_e epi_n DemEver
1 81 1265 First-Born DZ 1 87.07 3 9 0
2 81 1265 First-Born DZ 2 89.22 2 7 0
3 81 1265 First-Born DZ 3 91.36 3 5 0
4 81 1265 First-Born DZ 4 93.32 5 8 0
5 82 1265 Second-Born DZ 1 87.07 8 3 0
6 82 1265 Second-Born DZ 2 89.26 7 2 0
7 82 1265 Second-Born DZ 3 91.37 6 1 0
8 82 1265 Second-Born DZ 4 93.32 8 5 0
9 369 2902 First-Born MZ 1 82.71 6 5 0
10 369 2902 First-Born MZ 2 84.87 3 3 0
11 369 2902 First-Born MZ 3 86.89 4 3 0
12 369 2902 First-Born MZ 4 88.88 4 3 0
Let’s work with this definition of dsM. It contains only
length(unique(dsM$PairID))
[1] 11
twin pairs.
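As an aside, dplyr offers n_distinct() as a shorthand for length(unique()). The sketch below uses a toy data frame whose identifiers are copied from the output above, and counts pairs and individuals in one summarise() call:

```r
library(dplyr)

toy <- data.frame(
  Case   = c(81, 81, 82, 82, 369, 369),
  PairID = c(1265, 1265, 1265, 1265, 2902, 2902)
)

length(unique(toy$PairID))     # base R way
dplyr::n_distinct(toy$PairID)  # dplyr equivalent

counts <- toy %>%
  dplyr::summarise(pairs   = dplyr::n_distinct(PairID),
                   persons = dplyr::n_distinct(Case))
counts
```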
Let’s take a pair for which age and extraversion measures are available for both twins at each of the four time points:
dsM <- dsL %>%
dplyr::filter(PairID == 1265, time %in% c(1:4)) %>%
dplyr::select (Case, PairID, TwinIDF, ZygosityF, time, CompAge, epi_e, epi_n, DemEver)
dsM
Case PairID TwinIDF ZygosityF time CompAge epi_e epi_n DemEver
1 81 1265 First-Born DZ 1 87.07 3 9 0
2 81 1265 First-Born DZ 2 89.22 2 7 0
3 81 1265 First-Born DZ 3 91.36 3 5 0
4 81 1265 First-Born DZ 4 93.32 5 8 0
5 82 1265 Second-Born DZ 1 87.07 8 3 0
6 82 1265 Second-Born DZ 2 89.26 7 2 0
7 82 1265 Second-Born DZ 3 91.37 6 1 0
8 82 1265 Second-Born DZ 4 93.32 8 5 0
We’d like to graph the changes in extraversion and neuroticism over time in these two individuals. The code below plots extraversion (epi_e):
p <- ggplot(dsM, aes(x=CompAge, group=Case, color=TwinIDF))
p <- p + geom_line(aes(y=epi_e))
p <- p + geom_point(aes(y=epi_e))
p <- p + themeLine
p
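One possible way to show both measures side by side is to stack them into a long format and facet by measure. The sketch below retypes the eight rows of pair 1265 from the printed output so that it runs on its own. Reading epi_n as neuroticism is an assumption, and theme_bw() is a stand-in for the project's themeLine object, which is defined elsewhere.

```r
library(ggplot2)

# Pair 1265, typed in from the printed output above.
dsPair <- data.frame(
  TwinIDF = rep(c("First-Born", "Second-Born"), each = 4),
  CompAge = c(87.07, 89.22, 91.36, 93.32, 87.07, 89.26, 91.37, 93.32),
  epi_e   = c(3, 2, 3, 5, 8, 7, 6, 8),
  epi_n   = c(9, 7, 5, 8, 3, 2, 1, 5)
)

# Stack the two measures so that each one gets its own panel.
ids <- dsPair[c("TwinIDF", "CompAge")]
dsPlot <- rbind(
  data.frame(ids, measure = "Extraversion", score = dsPair$epi_e),
  data.frame(ids, measure = "Neuroticism",  score = dsPair$epi_n)  # label assumed
)

p2 <- ggplot(dsPlot, aes(x = CompAge, y = score,
                         group = TwinIDF, color = TwinIDF)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ measure) +
  theme_bw()   # stand-in for themeLine
p2
```

Faceting keeps one line per twin in each panel, so the two traits can be compared on a common age axis.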