
dsHelpersSophia

This package provides a series of helper functions intended to make it easier to work with the SOPHIA federated database. The goal is to lower the barrier to entry by (1) having the user install a single package that in turn ensures all other required packages are installed, and (2) providing functions for easy access to the federated database and its resources.

The SOPHIA federated database is built and maintained by Vital-IT at the Swiss Institute of Bioinformatics, and is based around DataSHIELD and Opal. More information about the SOPHIA project itself is available here and here.

Installation

Using devtools: devtools::install_github("carldelfin/dsHelpersSophia")

Using remotes: remotes::install_github("carldelfin/dsHelpersSophia")

For Windows users, that should be enough. For Linux users, some system packages are required. On a Debian-based system, install these using:

sudo apt install -y cmake libxml2-dev libcurl4-openssl-dev libssl-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev

Adapt accordingly for non-Debian Linux flavours.

Several additional R packages are installed via dependencies. The following are the most important:

Usage

Login credentials

It is never a good idea to keep sensitive credentials such as login information in R scripts. Although it is possible to enter credentials manually (see dshSophiaPrompt), it is usually more convenient to store them in an environment file. An environment file is a list of variables that R reads automatically when a session starts.

Thus, for a more streamlined experience, most functions in dsHelpersSophia will look for user credentials in the .Renviron file. If you don't already have one, you can either create it manually (R will look for it in the current user's home directory, which on Linux would be ~/.Renviron and on Windows C:\Users\USERNAME\Documents\.Renviron) or use the usethis package: usethis::edit_r_environ().

Enter your credentials like this:

fdb_user="username"
fdb_password="password"
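
Because .Renviron is read when R starts, a newly added entry is not visible until R is restarted or the file is re-read with readRenviron(). As a minimal sketch, the helper below reports which of the two variables named above are unset or empty. The name missing_credentials is hypothetical and not part of dsHelpersSophia.

```r
# Hypothetical helper, not part of dsHelpersSophia: returns the names of
# credential variables that are unset or empty in the current session.
missing_credentials <- function(vars = c("fdb_user", "fdb_password")) {
  values <- Sys.getenv(vars, unset = "")
  vars[!nzchar(values)]
}

# Re-read the user-level file without restarting R (only if it exists).
renviron <- path.expand("~/.Renviron")
if (file.exists(renviron)) readRenviron(renviron)

not_set <- missing_credentials()
if (length(not_set) > 0) {
  message("Not set: ", paste(not_set, collapse = ", "))
} else {
  message("Both credentials are available.")
}
```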

dshSophiaConnect

This function allows the user to connect to the SOPHIA federated database, with the option of including and excluding specific nodes (a node in this context is a server that hosts a database).

Connecting to the federated database is a two-step process. First, the user connects to each individual node to retrieve a list of all cohorts hosted on that node (a cohort in this context is a specific dataset associated with a study or research project). Then, the user is disconnected and reconnects to each individual cohort. Note that some nodes only host a single cohort.

Importantly, this function does not return anything in the conventional sense, but assigns two objects (opals and nodes_and_cohorts) to the Global environment using the superassignment operator (<<-). These two objects are necessary in order to load and assign database resources and do further work on the federated database. (Note that superassignment in R is generally considered a bad idea.)

  • Connect to all available nodes, assuming user credentials are available in .Renviron:
dshSophiaConnect()
  • Connect to all available nodes, manually providing credentials:
dshSophiaConnect(username = "username", password = "password")
  • Connect to specific nodes:
dshSophiaConnect(include = c("node1", "node2"))
  • Omit specific nodes:
dshSophiaConnect(exclude = c("node1", "node2"))
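
To illustrate the superassignment behaviour described above, here is a toy sketch. The function connect_sketch and the object demo_connections are hypothetical stand-ins. This is not the package's implementation, only a demonstration of how <<- makes an object available after a function that returns nothing useful.

```r
# Toy stand-in for a function that hands back its result via superassignment.
connect_sketch <- function() {
  # <<- looks for the name in enclosing environments; when it is not
  # found anywhere, the assignment happens in the global environment.
  demo_connections <<- list(node1 = "placeholder", node2 = "placeholder")
  invisible(NULL)
}

result <- connect_sketch()

# The return value carries no information ...
is.null(result)

# ... but the object now exists in the global environment.
exists("demo_connections", envir = globalenv(), inherits = FALSE)
names(demo_connections)
```

One practical consequence, assuming dshSophiaConnect behaves as described, is that existing global objects named opals or nodes_and_cohorts would be overwritten on each call.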

dshSophiaLoad

This function loads and assigns all database resources into the current R DataSHIELD session. As with dshSophiaConnect, this is done on a per-cohort basis. The two objects assigned by dshSophiaConnect (opals and nodes_and_cohorts) are expected to exist in the Global environment, and if either is missing the user will be prompted to run dshSophiaConnect (via dshSophiaPrompt).

This function takes no arguments:

dshSophiaLoad()

When run successfully, the user can continue to work with the federated database.
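
In a non-interactive script it may be preferable to fail early rather than be prompted. A minimal sketch of such a pre-flight check follows. The helper name missing_session_objects is hypothetical. Only the object names opals and nodes_and_cohorts come from the description above.

```r
# Hypothetical helper: which of the expected objects are absent from the
# global environment?
missing_session_objects <- function(required = c("opals", "nodes_and_cohorts")) {
  present <- vapply(
    required,
    function(name) exists(name, envir = globalenv(), inherits = FALSE),
    logical(1)
  )
  required[!present]
}

absent <- missing_session_objects()
if (length(absent) > 0) {
  message("Run dshSophiaConnect() first; missing: ",
          paste(absent, collapse = ", "))
}
```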

dshSophiaExit

This is a wrapper function that allows the user to disconnect from the federated database. It takes no arguments:

dshSophiaExit()

dshSophiaShow

This function gathers information about available nodes and cohorts and returns this in a data frame. It will look for login credentials in .Renviron, but the user may also supply credentials manually.

  • Show all nodes and cohorts, assuming username and password are specified in .Renviron:
dshSophiaShow()
  • Show all nodes and cohorts, manually providing username and password:
dshSophiaShow(username = "username", password = "password")

dshSophiaPrompt

This function prompts the user for login details (if those are not available via Sys.getenv()) and then connects via dshSophiaConnect. The user is also given the option to supply a single character string, or several separated by single spaces, naming the nodes to either include or exclude. The function is primarily a fallback used within dshSophiaLoad when the user has not logged in to the federated system.
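
For illustration, a space-separated answer can be turned into the kind of character vector that include or exclude accepts roughly as follows. The helper parse_nodes is hypothetical and is not taken from the package source.

```r
# Hypothetical helper: split a space-separated answer into node names.
parse_nodes <- function(answer) {
  answer <- trimws(answer)
  if (!nzchar(answer)) {
    return(character(0))  # nothing entered
  }
  strsplit(answer, " ", fixed = TRUE)[[1]]
}

nodes <- parse_nodes("node1 node2")
length(nodes)
```

The resulting vector could then be passed on, for example as dshSophiaConnect(include = nodes).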

dshSophiaOverview

This function gathers information about the data in each available cohort and returns this in a data frame. It assumes that the user has connected via dshSophiaConnect and loaded database resources via dshSophiaLoad. The function takes no arguments:

overview <- dshSophiaOverview()
head(overview)

dshSophiaMeasureDesc

This function gathers descriptive information (e.g., N, mean, SD, median, IQR) about a variable in the measurement table. If the data is longitudinal it does so for every time point available. The results are returned in a data frame. The function assumes that the user has connected via dshSophiaConnect and loaded database resources via dshSophiaLoad, and takes a single, valid Concept ID as argument:

# get a descriptive overview of BMI measurements
df <- dshSophiaMeasureDesc(concept_id = 3038553)
print(df)
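
To make these statistics concrete, the sketch below computes the same kind of per-time-point summary locally on made-up values. It does not touch the federated database, and the column names (time, value, n, mean and so on) are illustrative assumptions rather than the layout returned by dshSophiaMeasureDesc.

```r
# Made-up measurements at two time points (illustration only).
toy <- data.frame(
  time  = rep(c("t_1", "t_2"), each = 4),
  value = c(22, 24, 26, 28, 21, 23, 25, 31)
)

# One row of descriptive statistics for a numeric vector.
describe <- function(v) {
  data.frame(n = length(v), mean = mean(v), sd = sd(v),
             median = median(v), iqr = IQR(v))
}

# Summarise each time point separately, then stack the rows.
by_time <- lapply(split(toy$value, toy$time), describe)
desc <- cbind(time = names(by_time), do.call(rbind, by_time))
rownames(desc) <- NULL
print(desc)
```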

dshSophiaCreateBaseline

This function creates a 'baseline' data frame on the federated cohort(s). Note that 'baseline' here refers to the first available date for that variable in the measurement table. The function also calculates an approximate age at baseline variable, based on date of birth and date of first available measurement. Note that several cohorts have nonsensical or incorrect dates for date of birth and/or date of first measurement, which renders the age at baseline variable incorrect. The function also creates a gender column based on data from the person table.

The function assumes that the user has connected via dshSophiaConnect and loaded database resources via dshSophiaLoad, and takes a vector of valid Concept IDs as argument:

# connect to the federated system
dshSophiaConnect()

# load database resources
dshSophiaLoad()

# create a 'baseline' data frame on the federated node
dshSophiaCreateBaseline(concept_id = c(3038553, 3025315, 37020574))
 
# check result
dsBaseClient::ds.summary("baseline")
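
The caveat about nonsensical dates can be illustrated with a local sketch on made-up values. The 365.25-day year and the 0 to 120 plausibility range are assumptions chosen for this example, and the column names are hypothetical. The package may compute and name things differently.

```r
# Made-up dates (illustration only): one sensible record, one with an
# implausibly early date of birth, one born after the first measurement.
toy <- data.frame(
  date_of_birth     = as.Date(c("1960-05-17", "1800-01-01", "2015-03-02")),
  first_measurement = as.Date(c("2010-05-17", "2012-09-30", "2011-01-15"))
)

# Approximate age in years at the first measurement.
days <- as.numeric(difftime(toy$first_measurement, toy$date_of_birth,
                            units = "days"))
toy$age_baseline <- days / 365.25

# Flag ages outside an assumed plausible range.
toy$age_plausible <- toy$age_baseline >= 0 & toy$age_baseline <= 120
print(toy)
```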

dshSophiaMergeLongMeas

This function takes a variable in the measurement table, pivots it to wide format, and merges it with the 'baseline' federated data frame. If multiple measurements are available, they are all included and the variable name is appended with t_x for each time point. In addition, the raw difference and the percentage change between t_1 and t_x are calculated and added to 'baseline'. Note that, as with dshSophiaCreateBaseline, t_1 refers to the first available date for that variable.

The function assumes that the user has connected via dshSophiaConnect, loaded database resources via dshSophiaLoad, and created a federated baseline data frame with dshSophiaCreateBaseline. It takes a single valid Concept ID as argument:

# connect to the federated system
dshSophiaConnect()

# load database resources
dshSophiaLoad()

# create a 'baseline' data frame on the federated node
dshSophiaCreateBaseline(concept_id = c(4111665, 3004410, 3001308))

# add a longitudinal measure
dshSophiaMergeLongMeas(concept_id = 3038553)

# check result
dsBaseClient::ds.summary("baseline")
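
The pivot and change calculations described for this function can be sketched locally on made-up data. This is an illustration in base R, not the package's code. The column names and the percentage-change formula, 100 * (t_x - t_1) / t_1, are assumptions.

```r
# Made-up long-format measurements (illustration only).
long <- data.frame(
  person_id = c(1, 1, 1, 2, 2),
  date      = as.Date(c("2010-01-01", "2011-01-01", "2012-01-01",
                        "2010-06-01", "2011-06-01")),
  value     = c(30, 28, 27, 25, 26)
)

# Number the measurements per person in date order: 1 = first available date.
long <- long[order(long$person_id, long$date), ]
long$time <- ave(seq_along(long$person_id), long$person_id, FUN = seq_along)

# Pivot to wide format: one row per person, one column per time point.
wide <- reshape(long[, c("person_id", "time", "value")],
                idvar = "person_id", timevar = "time",
                direction = "wide", sep = "_t_")

# Raw difference and percentage change relative to the first time point.
t1 <- wide[["value_t_1"]]
for (k in sort(unique(long$time))[-1]) {
  tk <- wide[[paste0("value_t_", k)]]
  wide[[paste0("value_t_", k, "_diff")]] <- tk - t1
  wide[[paste0("value_t_", k, "_pct")]]  <- 100 * (tk - t1) / t1
}
print(wide)
```

Persons with fewer measurements than others end up with NA in the later time-point columns, here person 2 at the third time point.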

dshSophiaGetBeta

dshSophiaGetCor