Establishing Data Access to the data of a specific study.
Today we will try start on the process of study curation. Ever member of the lab will “adopt” a study, which s/he works or would like to work and practice delivering the study’s data to the dsL stage, as discussed described here and at COAG colloquium.
Each study will be placed in a separate repository that would contain the code for preparing the study’s data for analysis. These repositories will be the first step in the research projects based on these studies.
You can create a new repository either at the github.com/user-name webpage ( in this case, at the IALSA organization account github.com/ialsa) or inside the GitHub client. I will describe the former option. With it implemented, the cloning of the created repository could be accomplished by the sequence of steps described in the part I of from RAW do DRAW session.
open github.com/ialsa. Make sure you are logged on.
click “+ New Repository” button in the middle of the screen. The menu will appear:
Use the abbreviation of the study to name your repository. If data is sensitive and cannot be shared publically, select a Private repository. If you are working on a paper for which you’d rather not disclose details yet, select Private too. Do however, prefer public repos. Private repos are valuable. We, as an organization get 20 private repoes. You, as an academician can request up to 5. Ask for an individual account.
Add a brief description. In our current case, it will be a full name of the study, years of collection and the nature of the data. We are creating data access point to the data collected and preserved by a particular study.
Check “Initialize this repository with a README” box
The purpose of this repository is to make data in a particular study approachable. This means providing clear and thorough documentation. Therefore managing project description will be just as important as developing the code itselt. We’ll be developing and storing our documentation in a format called markdown. Files, stored in this format have extention .md The first opportunity to practice markdown editing will be with the description of this project and the associated study.
A repository is basically a folder, which you will populate with other folders, data, scripts, images, and other files. At this moment a single file exists in your repository, called README.md.
Notice that README is spelled with all caps. Files of this name are automatically rendered by GitHub into html which you see down below at the root folder of your newly created repository. When you click on README.md, you’re are taken to the menu where you can view or edit the contents of this file.
With editing mode activated by clicking the pencil icon we can start typing in additional information we’d like the researchers to see the first thing when they open this GitHub repository. The goal of this repository is to deliver access to the data of a particular study, so don’t write 4000 words worth of description. Write in the “inverted pyramid” format : give the essense first formulated as a sentence, then a paragraph, and then if, necessary in a series of paragraphs. The key information, however, ideally should be delivered withing 10 seconds of looking at the page.
README.md is actually a script that instructs how the webpage should be assembled. The language used in this script is call markdown, which is probably the simplest programming language out there. Markdown is text that has been somehow “marked” to impress additional information on it, such as size or color of the font, bullets, numbers, tables, etc. Very simple, yet fundamental building blocks of textual information in practice. See this 3 minute read on what is markdwon and its basic syntax.
One can add a hyperlink by eclosing the target name in square brackets followed by target url in parentheses
[target name](http://target.com/target-url.html)
Notice thet there are no blanks between “]” and “(”. Aslo make sure that you don’t leave out “http://”, otherwise it will start looking for files in your local folder.
same as for links, but now preceded by “!” before the square brackets:
![target name](http://target.com/target-url.png)
See 3 minute read for more markdown sytax.
Ideally, each folder must contain a readme file with a brief description. This adds to the transparency of the project and directly influences its reproducibility. Do take time to write up and update REAMDME files.
dsL.R
.dsL.R
) and
containing service scripts. A particular script to take notice of is
which you would run in the beginning of the RStudio session to install all necessary packages.
/Scripts/Utility/InstallPackages.R
in some other repository and save a copy in the appropriate foldercontains instructions on what content in the project to prevent from uploading into the cloud and making public. YOU ARE PERSONALLY RESPONSIBLE FOR KEEPING YOUR DATA SECURE BY EDITING .gitignore. Sorry for yelling, but his is very important. Here’s an example on how to assign permissions:
Data/Raw/*
!Data/Raw/README.md
All files in the folder
will NOT be uploaded, which an exception (!) of “README.md” file. Here’s a starter code with some referenced documentation
# R and RStudio files.
.Rhistory
.RData
.Rproj.user/
# Protected Information; see cubuspl42's answer in http://stackoverflow.com/questions/2415873/exceptions-in-gitignore
Data/Derived/ds0.Rds
# Keep the README files in the unshared directories. They provide documentation, and also force the directories to be created on other machines.
!DataUnshared/README.md
!DataUnshared/Contents.md
!DataUnshared/Raw/README.md
!DataUnshared/Derived/README.md
# Example code in package build process with a wildcard.
*-Ex.R
# Exclude caches in case they contain PHI
cache/
*/cache/
*/*/cache/