Demonstrating the language of data manipulation in dplyr packages using dsL as an example
back to Appendix
For a more detailed discussion of basic verbs and operations consult the R-Studio guide or internal vignette
vignette("introduction",package="dplyr")
The following is a brief demonstration of dplyr syntax using dsL dataset as an example. I attach prefix dplyr:: to avoid possible conflicts with plyr package on which ggplot2 package relies. I recommend such practice in all dplyr expressions in sharable publications.
One of the innovations in dplyr is the ability to chain phrases in the data manipulationsentence. The operator %>% (or %.%), accomplishes this, turning x %>% f(y) into f(x, y) .
select()selects variables into a smaller data set
ds<-dsL
dim(ds)
[1] 134745 60
ds<- dsL %>%
dplyr::select(id,year, byear, attend, attendF)
head(ds,13)
id year byear attend attendF
1 1 1997 1981 NA <NA>
2 1 1998 1981 NA <NA>
3 1 1999 1981 NA <NA>
4 1 2000 1981 1 Never
5 1 2001 1981 6 About once/week
6 1 2002 1981 2 Once or Twice
7 1 2003 1981 1 Never
8 1 2004 1981 1 Never
9 1 2005 1981 1 Never
10 1 2006 1981 1 Never
11 1 2007 1981 1 Never
12 1 2008 1981 1 Never
13 1 2009 1981 1 Never
dim(ds)
[1] 134745 5
filter()Removes observations that do not meet criteria. The following code selects observation based on the type of sample
sample sampleF
1 1 Cross-Sectional
2 0 Oversample
and only between years 2000 and 2011, as only during those years the outcome of interest attend was recorded.
ds<- dsL %>%
dplyr::filter(sample==1, year %in% c(2000:2011))%>%
dplyr::select(id, year, attend, attendF)
head(ds,13)
id year attend attendF
1 1 2000 1 Never
2 1 2001 6 About once/week
3 1 2002 2 Once or Twice
4 1 2003 1 Never
5 1 2004 1 Never
6 1 2005 1 Never
7 1 2006 1 Never
8 1 2007 1 Never
9 1 2008 1 Never
10 1 2009 1 Never
11 1 2010 1 Never
12 1 2011 1 Never
13 2 2000 2 Once or Twice
arrange()Sorts observations
ds<- dsL %>%
dplyr::filter(sample==1, year %in% c(2000:2011)) %>%
dplyr::select(id, year, attend) %>%
dplyr::arrange(year, desc(id))
head(ds,13)
id year attend
1 9022 2000 1
2 9021 2000 2
3 9020 2000 2
4 9018 2000 4
5 9017 2000 6
6 9012 2000 5
7 9011 2000 6
8 9010 2000 1
9 9009 2000 2
10 9008 2000 6
11 8992 2000 NA
12 8991 2000 3
13 8987 2000 6
ds<- dplyr::arrange(ds, id, year)
head(ds, 13)
id year attend
1 1 2000 1
2 1 2001 6
3 1 2002 2
4 1 2003 1
5 1 2004 1
6 1 2005 1
7 1 2006 1
8 1 2007 1
9 1 2008 1
10 1 2009 1
11 1 2010 1
12 1 2011 1
13 2 2000 2
mutate()Creates additional variables from the values of existing.
ds<- dsL %>%
dplyr::filter(sample==1, year %in% c(2000:2011)) %>%
dplyr::select(id, byear, year, attend) %>%
dplyr::mutate(age = year-byear,
timec = year-2000,
linear= timec,
quadratic= linear^2,
cubic= linear^3)
head(ds,13)
id byear year attend age timec linear quadratic cubic
1 1 1981 2000 1 19 0 0 0 0
2 1 1981 2001 6 20 1 1 1 1
3 1 1981 2002 2 21 2 2 4 8
4 1 1981 2003 1 22 3 3 9 27
5 1 1981 2004 1 23 4 4 16 64
6 1 1981 2005 1 24 5 5 25 125
7 1 1981 2006 1 25 6 6 36 216
8 1 1981 2007 1 26 7 7 49 343
9 1 1981 2008 1 27 8 8 64 512
10 1 1981 2009 1 28 9 9 81 729
11 1 1981 2010 1 29 10 10 100 1000
12 1 1981 2011 1 30 11 11 121 1331
13 2 1982 2000 2 18 0 0 0 0
summarize()collapses data into a single value computed according to the aggregate functions.
require(dplyr)
ds<- dsL %>%
dplyr::filter(sample==1) %>%
dplyr::summarize(N= n_distinct(id))
ds
N
1 6747
Other functions one could use with summarize() include:
From base
min() max() mean() sum() sd() median() IQR() Native to dplyr
n() - number of observations in the current group n_distinct(x) - count the number of unique values in x. first(x) - similar to x[ 1 ] + control over NA last(x) - similar to x[length(x)] + control over NA nth(x, n) - similar to x[n] + control over NAThe function group_by() is used to identify groups in split-apply-combine (SAC) procedure: it splits the initial data into smaller datasets (according to all possible interactions between the levels of supplied variables). It is these smaller datasets that summarize() will individually collapse into a single computed value according to its formula.
ds<- dsL %>%
dplyr::filter(sample==1, year %in% c(2000:2011)) %>%
dplyr::select(id, year, attendF) %>%
dplyr::group_by(year,attendF) %>%
dplyr::summarise(count = n()) %>%
dplyr::mutate(total = sum(count),
percent= count/total)
head(ds,10)
Source: local data frame [10 x 5]
Groups: year
year attendF count total percent
1 2000 Never 1580 6747 0.234178
2 2000 Once or Twice 1304 6747 0.193271
3 2000 Less than once/month 775 6747 0.114866
4 2000 About once/month 362 6747 0.053653
5 2000 About twice/month 393 6747 0.058248
6 2000 About once/week 1101 6747 0.163184
7 2000 Several times/week 463 6747 0.068623
8 2000 Everyday 36 6747 0.005336
9 2000 NA 733 6747 0.108641
10 2001 Never 1626 6747 0.240996
To verify :
dplyr::summarize( filter(ds, year==2000), should.be.one=sum(percent))
Source: local data frame [1 x 2]
year should.be.one
1 2000 1
Generally, we can compose any desired dataset by using matrix calls. The general formula is of the form: ds[ rowCond , colCond ], where ds is a dataframe, and rowCond and colCond are conditions for including rows and columns of the new dataset, respectively. One can also call a variable by attaching $ followed variable name to the name of the dataset: ds$variableName.
ds<-dsL[dsL$year %in% c(2000:2011),c('id',"byear","year","agemon","attendF","ageyearF")]
print(ds[ds$id==1,])
id byear year agemon attendF ageyearF
4 1 1981 2000 231 Never 19
5 1 1981 2001 243 About once/week 20
6 1 1981 2002 256 Once or Twice 21
7 1 1981 2003 266 Never 22
8 1 1981 2004 279 Never 23
9 1 1981 2005 290 Never 24
10 1 1981 2006 302 Never 25
11 1 1981 2007 313 Never 26
12 1 1981 2008 325 Never 27
13 1 1981 2009 337 Never 28
14 1 1981 2010 350 Never 29
15 1 1981 2011 360 Never 29
The following is a list of operatiors that can be used in these calls.
+, -, *, /, %%, ^
abs, acos, acosh, asin, asinh, atan, atan2, atanh, ceiling, cos, cosh, cot, coth, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh
<, <=, !=, >=, >, ==, %in%
&, &&, |, ||, !, xor
mean, sum, min, max, sd, var
dplyr can translate all of these into SQL. For more of on dplyr and SQL compatibility consult another built-in vignette
vignette("database",package="dplyr")
The following unary and binary operators are defined for base. They are listed in precedence groups, from highest to lowest.
:: ::: - access variables in a namespace $ @ - component / slot extraction [ [[ - indexing ^ - exponentiation (right to left) - + - unary minus and plus : - sequence operator %any% - special operators (including %% and %/%) * / - multiply, divide + - - (binary) add, subtract < > <= >= == != - ordering and comparison ! - negation & && - and | || - or ~ - as in formulae -> ->> - rightwards assignment <- <<- - assignment (right to left) = - assignment (right to left) ? - help (unary and binary)While merge works just fine , joining data frames with dplyr might offer some additional conveniences:
dplyr implements the four most useful joins from SQL:
inner_join - similar to merge(…, all.x=F, all.y=F)ileft_join - similar to merge(…, all.x=T, all.y=F)isemi_join - no equivalent in merge() unless y includes only join fieldsianti_join - no equivalent in merge(), this is all x without a match in y