Arriving at the dsL stage
Exploring dsL with dplyr
- dplyr::select()
- dplyr::filter()
Developing graphs

Today

Study: OCTO-TWIN
In focus: Personality and Dementia

Goals:
1. To practice arriving at dsL
2. To learn two basic verbs of dplyr
- select()
- filter()
3. To develop a line graph for a single twin pair

Arriving at the dsL stage

Loading packages gives us some idea of what we will be doing. One can think of this as a sort of summary of the report.

base::require(base)
base::require(sas7bdat)
base::require(dplyr)
base::require(reshape2)
base::require(ggplot2)
base::require(knitr)
base::require(markdown)
base::require(testit)
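Because require() only warns and returns FALSE when a package is missing, it can help to check availability up front. The following is a minimal sketch, not part of the original report; the names pkgs, installed, and missing_pkgs are introduced here for illustration only.

```r
# Packages loaded by the report (base is always available, so it is omitted).
pkgs <- c("sas7bdat", "dplyr", "reshape2", "ggplot2",
          "knitr", "markdown", "testit")

# requireNamespace() returns TRUE/FALSE without attaching the package.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report anything that would make a later require() call fail.
missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```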

Last time (Fri-26-Sep-2014) we discussed in detail the first part of the road map for the workflow in our data analysis projects:
[Figure: workflow road map (image label: Load RStudio)]
This is a general and simplified map: every project might modify its general structure. To illustrate, our current project has a separate script

./Scripts/Data/dsL.R

that produces a stable and familiar stage dsL, which we’ve annotated. Reviewing this script, one can see that the path from ds0 to dsL can be summarized by the following diagram:

[Figure: CurrentMap]

The script ends by removing all the temporary objects except those we’d like to keep.

rm(list=setdiff(ls(), c("ds0", "dsW", "dsLong", "dsL")))
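To see what this idiom does without touching a real workspace, here is a small sketch that runs the same logic inside a throwaway environment. The environment env and the objects tmp1 and tmp2 are hypothetical; the script itself operates on the global environment.

```r
# A toy environment stands in for the global workspace (illustration only).
env <- new.env()
assign("ds0",  data.frame(x = 1), envir = env)
assign("dsL",  data.frame(x = 2), envir = env)
assign("tmp1", 1:3,               envir = env)   # hypothetical temporary object
assign("tmp2", "scratch",         envir = env)   # hypothetical temporary object

keep <- c("ds0", "dsW", "dsLong", "dsL")   # names we want to survive
drop <- setdiff(ls(envir = env), keep)     # everything that is not on the list
rm(list = drop, envir = env)

remaining <- ls(envir = env)
print(remaining)
```

Names in the keep list that do not exist (here dsW and dsLong) are harmless: setdiff() only returns names that are actually present.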

Let’s take a moment and inspect the script ./Scripts/Data/dsL.R, the Data Flow map above, and the contents of the four data stages.

Typically, you would want to have a separate report (.Rmd file) that annotates the production of the dsL data stage. This function is partially fulfilled by the report

./LabLog/26-09-2014-Dementia-Personality-OCTO.Rmd

which we covered last week. Having a dedicated narrative detailing the origin of a particular data stage saves time in modeling projects and, more importantly, allows us to quickly start graphing and modeling our data.
To arrive at the dsL stage, run the script responsible for its derivation:

# run the script that loads ds0 and dsL data stages.
source(file.path(getwd(),"Scripts/Data/dsL.R"))
rm(list=setdiff(ls(), c("ds0", "dsL", "themeLine", "baseSize"))) # remove all objects except those listed

Exploring dsL with dplyr

The data stage dsL, at which we’ve just arrived, is still a bit cumbersome to handle and to think about. Ultimately, we want to comment on the relationship among just a few variables, so let’s start with the bare minimum.

dplyr::select()

Selecting variables from a dataset can be accomplished with the select() function of the dplyr package:

dsM <- dsL %>%
  dplyr::select(Case, PairID, TwinID, Zygosity, time, CompAge, epi_e, epi_n, DemEver)
head(dsM,12)

   Case PairID TwinID Zygosity time CompAge epi_e epi_n DemEver
1     1     28      1        2    1   91.25   NaN   NaN       0
2     1     28      1        2    2     NaN   NaN   NaN       0
3     1     28      1        2    3     NaN   NaN   NaN       0
4     1     28      1        2    4     NaN   NaN   NaN       0
5     1     28      1        2    5     NaN   NaN   NaN       0
6     2     28      2        2    1   91.23     4     0       0
7     2     28      2        2    2   93.35   NaN   NaN       0
8     2     28      2        2    3     NaN   NaN   NaN       0
9     2     28      2        2    4     NaN   NaN   NaN       0
10    2     28      2        2    5     NaN   NaN   NaN       0
11    3     47      1        1    1   92.03     5     1       1
12    3     47      1        1    2   94.29   NaN   NaN       1

The current data stage dsM is completely defined by the dplyr call above (remember, we now take dsL as the stable milestone). We read from the dplyr definition that dsM inherits all the rows from dsL; however, only the nine listed variables (“Case”, “PairID”, “TwinID”, “Zygosity”, “time”, “CompAge”, “epi_e”, “epi_n”, and “DemEver”) are selected into dsM.
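The point that select() changes the columns but never the rows can be checked on a toy data frame. This is a minimal sketch: toy is a stand-in for dsL built from a few of the values printed above, and the column extra is hypothetical.

```r
library(dplyr)

# Toy stand-in for dsL; `extra` is a hypothetical column we do not want to keep.
toy <- data.frame(
  Case    = c(1, 1, 2, 2),
  PairID  = c(28, 28, 28, 28),
  time    = c(1, 2, 1, 2),
  CompAge = c(91.25, NaN, 91.23, 93.35),
  extra   = c("a", "b", "c", "d")
)

toy_small <- toy %>%
  dplyr::select(Case, PairID, time, CompAge)

dim(toy)         # rows and columns before selecting
dim(toy_small)   # same number of rows, fewer columns
```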

As part of the dsL derivation, we have assigned labels to the factors (categorical variables); the definitions are located in

./Scripts/Data/LabelingFactorLevels.R

The labeled factors are stored in separate columns, whose names are formed by adding the suffix “F” to the original column names:

dsM <- dsL %>%
  dplyr::select(Case, PairID, TwinID, TwinIDF)
head(dsM)

  Case PairID TwinID     TwinIDF
1    1     28      1  First-Born
2    1     28      1  First-Born
3    1     28      1  First-Born
4    1     28      1  First-Born
5    1     28      1  First-Born
6    2     28      2 Second-Born
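A labeled “F” column of this kind can be produced with base R's factor(). The sketch below is an assumption about how such columns might be built, not a copy of LabelingFactorLevels.R; the level-to-label mapping is inferred from the output printed in this report (TwinID 1 = First-Born, 2 = Second-Born; Zygosity 1 = MZ, 2 = DZ).

```r
# Toy data; the real definitions live in LabelingFactorLevels.R and may differ.
toy <- data.frame(TwinID = c(1, 1, 2), Zygosity = c(2, 2, 1))

# Keep the numeric original and add a labeled copy with the suffix "F".
toy$TwinIDF   <- factor(toy$TwinID,   levels = c(1, 2),
                        labels = c("First-Born", "Second-Born"))
toy$ZygosityF <- factor(toy$Zygosity, levels = c(1, 2),
                        labels = c("MZ", "DZ"))
toy
```

Keeping both versions side by side lets us filter on the readable label while the numeric code remains available for modeling.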

dplyr::filter()

We know that dsL contains data for

length(unique(dsL$Case))

[1] 702

individuals and/or

length(unique(dsL$PairID))

[1] 351

twin pairs.

The function filter() of the dplyr package allows us to select rows that match given criteria. Let’s select all the rows in which the column “PairID” contains the numeric value 28:

dsM <- dsL %>%
  dplyr::select (Case, PairID, TwinIDF, ZygosityF, time, CompAge, epi_e, epi_n, DemEver) %>%
  dplyr::filter (PairID == 28)
dsM

   Case PairID     TwinIDF ZygosityF time CompAge epi_e epi_n DemEver
1     1     28  First-Born        DZ    1   91.25   NaN   NaN       0
2     1     28  First-Born        DZ    2     NaN   NaN   NaN       0
3     1     28  First-Born        DZ    3     NaN   NaN   NaN       0
4     1     28  First-Born        DZ    4     NaN   NaN   NaN       0
5     1     28  First-Born        DZ    5     NaN   NaN   NaN       0
6     2     28 Second-Born        DZ    1   91.23     4     0       0
7     2     28 Second-Born        DZ    2   93.35   NaN   NaN       0
8     2     28 Second-Born        DZ    3     NaN   NaN   NaN       0
9     2     28 Second-Born        DZ    4     NaN   NaN   NaN       0
10    2     28 Second-Born        DZ    5     NaN   NaN   NaN       0

We can apply more than one condition. For example, to select the first-born of this twin pair:

dsM <- dsL %>%
  dplyr::select (Case, PairID, TwinIDF, ZygosityF, time, CompAge, epi_e, epi_n, DemEver) %>%
  dplyr::filter (PairID == 28, TwinIDF == "First-Born")
dsM

  Case PairID    TwinIDF ZygosityF time CompAge epi_e epi_n DemEver
1    1     28 First-Born        DZ    1   91.25   NaN   NaN       0
2    1     28 First-Born        DZ    2     NaN   NaN   NaN       0
3    1     28 First-Born        DZ    3     NaN   NaN   NaN       0
4    1     28 First-Born        DZ    4     NaN   NaN   NaN       0
5    1     28 First-Born        DZ    5     NaN   NaN   NaN       0
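Conditions separated by commas in filter() are combined with a logical AND. A small sketch on hypothetical toy data makes the equivalence explicit and contrasts it with OR:

```r
library(dplyr)

# Toy data with two twin pairs (values are illustrative).
toy <- data.frame(
  PairID  = c(28, 28, 47, 47),
  TwinIDF = c("First-Born", "Second-Born", "First-Born", "Second-Born")
)

by_comma <- toy %>% dplyr::filter(PairID == 28, TwinIDF == "First-Born")
by_and   <- toy %>% dplyr::filter(PairID == 28 & TwinIDF == "First-Born")
by_or    <- toy %>% dplyr::filter(PairID == 28 | TwinIDF == "First-Born")

nrow(by_comma)   # rows meeting both conditions
nrow(by_or)      # rows meeting at least one condition
```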

We can also use more complex criteria, for example selecting only the observations for which the CompAge variable has valid (non-missing) data:

dsM <- dsL %>%
  dplyr::select (Case, PairID, TwinIDF, ZygosityF, time, CompAge, epi_e, epi_n, DemEver) %>%
  dplyr::filter(!is.na(CompAge))
head(dsM,12)

   Case PairID     TwinIDF ZygosityF time CompAge epi_e epi_n DemEver
1     1     28  First-Born        DZ    1   91.25   NaN   NaN       0
2     2     28 Second-Born        DZ    1   91.23     4     0       0
3     2     28 Second-Born        DZ    2   93.35   NaN   NaN       0
4     3     47  First-Born        MZ    1   92.03     5     1       1
5     3     47  First-Born        MZ    2   94.29   NaN   NaN       1
6     3     47  First-Born        MZ    3   96.18   NaN   NaN       1
7     3     47  First-Born        MZ    4   98.12   NaN   NaN       1
8     3     47  First-Born        MZ    5  100.18   NaN   NaN       1
9     4     47 Second-Born        MZ    1   92.04   NaN   NaN       0
10    5    188  First-Born        MZ    1   92.94   NaN   NaN       0
11    5    188  First-Born        MZ    2   95.01   NaN   NaN       0
12    5    188  First-Born        MZ    3   97.02   NaN   NaN       0
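The missing values in this dataset are printed as NaN rather than NA. The filter still works because is.na() is TRUE for both, whereas is.nan() is TRUE only for NaN. A minimal sketch with made-up values:

```r
library(dplyr)

x <- c(91.25, NaN, NA, 93.35)
is.na(x)    # flags both NaN and NA as missing
is.nan(x)   # flags only NaN

# Rows with either kind of missing value are dropped.
toy   <- data.frame(Case = 1:4, CompAge = x)
valid <- toy %>% dplyr::filter(!is.na(CompAge))
valid
```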

We can also combine filter() with a group_by() clause to issue very specific filters. For example, the following code selects the observations for the twin pairs (PairID) that have a non-missing record of age at all time points:

dsM <- dsL %>%
  dplyr::group_by(PairID) %>%
  dplyr::filter(all(!is.na(CompAge))) %>%
  dplyr::select (Case, PairID, TwinIDF, ZygosityF, time, CompAge, epi_e, epi_n, DemEver)
head(dsM,12)

Source: local data frame [12 x 9]
Groups: PairID

   Case PairID     TwinIDF ZygosityF time CompAge epi_e epi_n DemEver
1    49    915  First-Born        MZ    1   87.32     6     1       0
2    49    915  First-Born        MZ    2   89.57   NaN   NaN       0
3    49    915  First-Born        MZ    3   91.58   NaN   NaN       0
4    49    915  First-Born        MZ    4   93.50   NaN   NaN       0
5    49    915  First-Born        MZ    5   95.45   NaN   NaN       0
6    50    915 Second-Born        MZ    1   87.33   NaN   NaN       1
7    50    915 Second-Born        MZ    2   89.49   NaN   NaN       1
8    50    915 Second-Born        MZ    3   91.47   NaN   NaN       1
9    50    915 Second-Born        MZ    4   93.38   NaN   NaN       1
10   50    915 Second-Born        MZ    5   95.32   NaN   NaN       1
11   81   1265  First-Born        DZ    1   87.07     3     9       0
12   81   1265  First-Born        DZ    2   89.22     2     7       0
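The difference between a row-wise and a group-wise filter is easiest to see on a tiny example. In the sketch below (hypothetical values), pair 1 has a single missing age, so the grouped filter with all() removes every row of that pair, while the plain filter removes only the one incomplete row.

```r
library(dplyr)

toy <- data.frame(
  PairID  = c(1, 1, 1, 1, 2, 2, 2, 2),
  Case    = c(1, 1, 2, 2, 3, 3, 4, 4),
  time    = c(1, 2, 1, 2, 1, 2, 1, 2),
  CompAge = c(80, 82, 80, NaN, 85, 87, 85, 87)
)

# Row-wise: drops only the row with the missing age.
row_wise <- toy %>% dplyr::filter(!is.na(CompAge))

# Group-wise: all() is evaluated once per PairID, so a pair is kept or dropped whole.
complete_pairs <- toy %>%
  dplyr::group_by(PairID) %>%
  dplyr::filter(all(!is.na(CompAge))) %>%
  dplyr::ungroup()

nrow(row_wise)
unique(complete_pairs$PairID)
```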

Further increasing the strictness of the filter, we’d like to select the subset of the previous dataset in which observations for “Extraversion” (epi_e) are available for the entire span of the study:

dsM <- dsL %>%
  dplyr::group_by(PairID) %>%
  dplyr::filter(all(!is.na(CompAge))) %>%
  dplyr::filter(time %in% c(1:4)) %>%
  dplyr::select (Case, PairID, TwinIDF, ZygosityF, time, CompAge, epi_e, epi_n, DemEver)
dsM <- dsM %>%
  dplyr::group_by(PairID) %>%
  dplyr::filter(all(!is.na(epi_e)))
head(dsM,12)

Source: local data frame [12 x 9]
Groups: PairID

   Case PairID     TwinIDF ZygosityF time CompAge epi_e epi_n DemEver
1    81   1265  First-Born        DZ    1   87.07     3     9       0
2    81   1265  First-Born        DZ    2   89.22     2     7       0
3    81   1265  First-Born        DZ    3   91.36     3     5       0
4    81   1265  First-Born        DZ    4   93.32     5     8       0
5    82   1265 Second-Born        DZ    1   87.07     8     3       0
6    82   1265 Second-Born        DZ    2   89.26     7     2       0
7    82   1265 Second-Born        DZ    3   91.37     6     1       0
8    82   1265 Second-Born        DZ    4   93.32     8     5       0
9   369   2902  First-Born        MZ    1   82.71     6     5       0
10  369   2902  First-Born        MZ    2   84.87     3     3       0
11  369   2902  First-Born        MZ    3   86.89     4     3       0
12  369   2902  First-Born        MZ    4   88.88     4     3       0

Let’s work with this definition of dsM. It contains only

length(unique(dsM$PairID))

[1] 11

twin pairs.

Developing graphs

Let’s take a pair for which age and extraversion measures are available for both twins for the entire duration of the study:

dsM <- dsL %>%
  dplyr::filter(PairID == 1265, time %in% c(1:4)) %>%
  dplyr::select (Case, PairID, TwinIDF, ZygosityF, time, CompAge, epi_e, epi_n, DemEver)
dsM

  Case PairID     TwinIDF ZygosityF time CompAge epi_e epi_n DemEver
1   81   1265  First-Born        DZ    1   87.07     3     9       0
2   81   1265  First-Born        DZ    2   89.22     2     7       0
3   81   1265  First-Born        DZ    3   91.36     3     5       0
4   81   1265  First-Born        DZ    4   93.32     5     8       0
5   82   1265 Second-Born        DZ    1   87.07     8     3       0
6   82   1265 Second-Born        DZ    2   89.26     7     2       0
7   82   1265 Second-Born        DZ    3   91.37     6     1       0
8   82   1265 Second-Born        DZ    4   93.32     8     5       0

We’d like to graph the changes in extraversion and neuroticism over time in these two individuals. The code below plots extraversion (epi_e):

p <- ggplot(dsM, aes(x=CompAge, group=Case, color=TwinIDF))
p <- p + geom_line(aes(y=epi_e))
p <- p + geom_point(aes(y=epi_e))
p <- p + themeLine
p

[Figure: line graph of epi_e against CompAge for the two twins of pair 1265 (chunk SimplePlot)]
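One possible way to show neuroticism alongside extraversion is to stack the two measures into a single column and draw one panel per measure. This is only a sketch under stated assumptions: the data frame is rebuilt by hand from the dsM printout for pair 1265, the names dsPlot, measure, and score are introduced here, and theme_bw() stands in for the project's themeLine, whose definition is not shown in this report.

```r
library(ggplot2)
library(reshape2)

# Values copied from the dsM printout for pair 1265.
dsM <- data.frame(
  Case    = rep(c(81, 82), each = 4),
  TwinIDF = rep(c("First-Born", "Second-Born"), each = 4),
  CompAge = c(87.07, 89.22, 91.36, 93.32, 87.07, 89.26, 91.37, 93.32),
  epi_e   = c(3, 2, 3, 5, 8, 7, 6, 8),
  epi_n   = c(9, 7, 5, 8, 3, 2, 1, 5)
)

# Stack epi_e and epi_n into one column so each measure gets its own panel.
dsPlot <- reshape2::melt(
  dsM,
  id.vars       = c("Case", "TwinIDF", "CompAge"),
  measure.vars  = c("epi_e", "epi_n"),
  variable.name = "measure",
  value.name    = "score"
)

p <- ggplot(dsPlot, aes(x = CompAge, y = score, group = Case, color = TwinIDF)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ measure) +
  theme_bw()   # stand-in for themeLine
p
```

Faceting keeps the two scales visually separate while reusing the same grouping by Case and coloring by TwinIDF as in the single-measure plot.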