Introduction
magmaR, magma, and the UCSF Data Library
This vignette focuses on how to explore, query, and retrieve data from magma via its R-client, magmaR.
Magma is the data warehouse of the UCSF Data Library.
The Data Library holds various research data sets, broadly broken up
into “projects”, and provides tools for adding to,
organizing, viewing and analyzing these data sets.
Internally, the system is composed of a set of applications that each
provides a different piece of the Data Library pie. Through the Magma
application, one can query and retrieve data from, or update data
within, “projects” that exist in the library.
We provide some more detail below, but for an even deeper overview of
the structure of magma and the Data Library system than is provided
here, you can refer to the main source of documentation, https://mountetna.github.io/magma.html.
Organization of data within magma
Data types within magma projects are
organized into models, and individual data
then make up the records of those models.
For example, information and data for 3 tubes run on a flow
cytometer might make up 3 individual records of a flow cytometry
model.
Each record might have multiple
attributes, such as the “gene_counts” matrix,
the “cell_number”, or sorted cell “fraction” attributes of
records that are part of an “rna_seq” model.
The set of attributes that a record might possess is defined
separately for each model. Thus, records of a “flow” model might have an
“fcs_file” attribute, but records of the “rna_seq” model likely would
not.
Hierarchically, the root of a project is always the “project” model,
and every other model must have a single parent
model.
Thus, the data graph is like a tree.
(Technically, link-type attributes may be used to indicate additional
one-to-one or one-to-many relationships between models other than the
tree-like parent <- model <- “children” relationships,
which allows the graph to be more like a directed acyclic graph (DAG)
than a tree… but imagining projects as trees is certainly easier than as
an abstract blob.)
Here is a sketch of what an example project might look like. In fact,
it is the layout of the “example” project
which we will be working with later on in this vignette:
[Figure: example_project_map]
This “example” project has 6 different models,
including the project model itself. Each model holds different
chunks of information (attributes) about data in the “example”
project. For example, the subject model contains
information about individuals (records) for whom biospecimens
exist; the biospecimen model would then contain information
about specific specimens obtained from each subject; and the
flow and rna_seq models contain data and
information from individual flow cytometry or rna_seq assays that were
run on an individual biospecimen.
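As a rough illustration of the tree idea, the parent relationships described above can be written down as a named vector and walked from any model up to the root. This is only a sketch in base R: the `parents` vector and the `lineage()` helper are hypothetical names made up here, not part of magmaR, and the “population” model is left out because its parent is not stated in the text.

```r
# Parent of each model, as described in the text above.
# "population" is omitted because its parent is not stated there.
parents <- c(
  project     = NA_character_,  # the root has no parent
  subject     = "project",
  biospecimen = "subject",
  flow        = "biospecimen",
  rna_seq     = "biospecimen"
)

# Hypothetical helper: walk from a model up to the root of the tree.
lineage <- function(model, parents) {
  path <- model
  while (!is.na(parents[[model]])) {
    model <- parents[[model]]
    path <- c(path, model)
  }
  path
}

lineage("rna_seq", parents)
```

Every model other than the root has exactly one entry in `parents`, which is what makes the structure a tree.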
Each project in the data library system might have its own
distinct modeling layout, because where to split up the information
is highly dependent on a project’s data collection and
experimental plans. However, in general, one can think of
records of a model at the bottom of the tree as
inheriting attributes from the parent records of their parent
models. So for example, although we can also think of each
model as its own independent set of data, “rna_seq”-model
records are ultimately linked to individual “subject”-model records.
Thus, even though attributes of the “subject”-model are not directly
included in the “rna_seq”-model, all “subject”-model attributes of
“subject”-model records do apply to linked “rna_seq”-model records. In
magmaR, we include a function retrieveMetadata() for retrieving such
linked data. More on that later.
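To make the idea of linked records concrete, here is a minimal base-R sketch with small toy tables. It is not how retrieveMetadata() is implemented; it only shows, with `merge()`, how a “subject”-model attribute such as “group” can be carried down to “rna_seq”-model records. The record names follow the “example” project, but the tables themselves and the `subject` linking column are assumptions made for illustration.

```r
# Toy stand-ins for three models (illustrative only).
subject <- data.frame(
  name  = c("EXAMPLE-HS1", "EXAMPLE-HS10"),
  group = c("g1", "g3"),
  stringsAsFactors = FALSE
)
biospecimen <- data.frame(
  biospecimen = c("EXAMPLE-HS1-WB1", "EXAMPLE-HS10-WB1"),
  subject     = c("EXAMPLE-HS1", "EXAMPLE-HS10"),  # assumed linking column
  stringsAsFactors = FALSE
)
rna_seq <- data.frame(
  tube_name   = c("EXAMPLE-HS1-WB1-RSQ1", "EXAMPLE-HS10-WB1-RSQ1"),
  biospecimen = c("EXAMPLE-HS1-WB1", "EXAMPLE-HS10-WB1"),
  stringsAsFactors = FALSE
)

# Walk up the tree: rna_seq -> biospecimen -> subject.
linked <- merge(rna_seq, biospecimen, by = "biospecimen")
linked <- merge(linked, subject, by.x = "subject", by.y = "name")

# Each rna_seq record now carries its subject's "group".
linked[, c("tube_name", "group")]
```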
At this point, you should know enough about the structure of magma
projects to start using magmaR. But more information exists within
magma’s own documentation: https://mountetna.github.io/magma.html.
How magmaR functions work
In general, magmaR functions will:
- Take in inputs from the user.
- Make a curl request that calls on a magma function to either send or
receive desired data.
- Restructure the received data, typically minimally, to be more
accessible for downstream analyses.
- Return the output.
Data Restructuring Details
The goal of magmaR is to give users access to data within magma that
is as direct as possible, yet also as ready-to-analyze as possible.
Thus, magmaR functions perform some minor restructuring,
which does not change the underlying data, but does reorganize that data
into more efficient formats for downstream analysis within R.
The two main output structures of magma returns
There are two main output structures for returns from magma:
- Tab Separated Value (tsv) tables
- JavaScript Object Notation (json) objects
Both formats are received as character strings, but then:
- tsv format returns are converted to
data.frames.
- json format returns are converted to
nested lists.
The data.frame format tends to be easier to work with, but both of
these can be fit quite readily into downstream applications.
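The two conversions can be sketched with generic R tools. This is an illustration under assumptions, not magmaR's internal code: the strings below are made up, `read.delim()` is base R, and the JSON step uses the jsonlite package only if it happens to be installed. The nesting shown inside the JSON string is also an assumption.

```r
# A made-up tsv response, held as a single character string.
tsv_text <- paste(
  "name\tgroup",
  "EXAMPLE-HS1\tg1",
  "EXAMPLE-HS2\tg1",
  sep = "\n"
)
# tsv string -> data.frame
df <- read.delim(text = tsv_text, stringsAsFactors = FALSE)
df

# A made-up json response (the nesting here is an assumption).
json_text <- '{"models": {"subject": {"documents": {"EXAMPLE-HS1": {"group": "g1"}}}}}'
# json string -> nested list, only attempted if jsonlite is available.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  nested <- jsonlite::fromJSON(json_text, simplifyVector = FALSE)
  str(nested)
}
```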
Installation
magmaR will be submitted to CRAN soon. Once accepted, built, and
hosted by CRAN, users will be able to install the package with just…
install.packages("magmaR")
Alternatively/currently, development versions of magmaR can be
installed via GitHub with:
if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")
remotes::install_github("mountetna/monoetna", subdir = "etna/packages/magmaR")
After either of the above, one can check proper installation
by loading the package with library(magmaR).
Authorization process, a.k.a. janus token utilization:
In order to access data in magma, a user needs to be authorized to do
so. This is achieved by providing a user-specific, temporary
string which we call a token. This token can
be obtained from https://janus.ucsf.edu/.
Providing a token
Within magmaR, the token is provided as part of the target input,
which can be constructed with the magmaRset() function.
To this function, a user’s token can be provided in one of two
ways.
1) Via an interactive prompt (Recommended when coding
interactively):
When the token is not provided explicitly (the other method, described next),
the user will be prompted to provide their token via the interactive console. It is
recommended that you store the output of your call to magmaRset()
as a variable, and then provide this variable
within each subsequent call to a magmaR function, as below.
# Method 1: User will be prompted to give their token in the R console
prod <- magmaRset()
projects <- retrieveProjects(
    # Now, we give the output of magmaRset() to the 'target' input of any
    # other magmaR function.
    target = prod)
If you run the above code, you should be prompted to
Enter your Janus TOKEN (without quotes):
To fill this in, navigate to Janus via your favorite browser, click
the Copy Token button, then paste the value into your
console.
2) Give your token explicitly
Users can alternatively fill their token in by providing it
explicitly to the token input of magmaRset().
This is not the generally recommended method because it is not ideal to
have authorization values saved within potentially shareable locations.
However, the tokens are short-lived, and methods of mitigating the risk of
such token exposure exist; see below.
prod <- magmaRset(token = "<your-token-here>")
projects <- retrieveProjects(
    # Now, we give the output of magmaRset() to the 'target' input of any
    # other magmaR function.
    target = prod)
NOTE: Instead of adding your token directly to any file which
you might save, it is recommended that you utilize your
.Renviron file to store your token. To do so, you can:
- Utilize the convenient usethis::edit_r_environ() function to open your
.Renviron file. (Install usethis with install.packages("usethis") first.)
- Then add this line to the opened file: TOKEN="<your_token>"
- Save the file & restart your R session.
- Now you can provide magmaRset(token = Sys.getenv("TOKEN")), but when you save
your script or .Rmd, the token itself will not be included.
- Repeat these steps whenever your token refreshes. Tokens normally
refresh every 24 hours, but you’ll know when this happens because you
will get the error message below.
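One optional convenience, sketched here under assumptions, is a small wrapper that fails early with a clear message when the environment variable is empty, rather than sending an empty token to magma. The `get_token()` helper is hypothetical and not part of magmaR; it relies only on base R's `Sys.getenv()`.

```r
# Hypothetical helper (not part of magmaR): read a token from an
# environment variable and stop with a clear message if it is unset.
get_token <- function(var = "TOKEN") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop("Environment variable '", var, "' is empty. ",
         "Add it to your .Renviron file and restart R.")
  }
  token
}

# Intended usage (commented out, as it needs a real token and magmaR):
# prod <- magmaRset(token = get_token())
```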
When magma thinks you are unauthorized
If a request to magma returns that “You are unauthorized”, magmaR
will provide extra info so that users can fix this issue:
# Error message when magma sends back that user is unauthorized:
You are unauthorized. If you think this is a mistake, re-run `?magmaRset` to update your 'token' input, then retry.
Helper functions
These functions allow exploration of what data exists within a given
project.
Although it is possible to rely on
timur.ucsf.edu/<projectName>/map, or on Timur’s search
functionality, in order to determine options for
projectName, modelName, recordNames, or attributeName(s)
inputs, magmaR provides these helper functions to allow users
to achieve these goals without leaving R.
# projectName options:
retrieveProjects(
target = prod)
## project_name project_name_full role privileged
## 1 mvir1 COMET editor FALSE
## 2 ipi Immunoprofiler Initiative editor FALSE
## 3 dscolab Data Science CoLab editor FALSE
## 4 coprojects_template Coprojects Template administrator FALSE
## 5 example Example project administrator FALSE
## 6 xcrs1 Human-Mouse Cancer Translator administrator FALSE
# modelName options:
retrieveModels(
target = prod,
projectName = "example")
## [1] "subject" "rna_seq" "population" "project" "flow"
## [6] "biospecimen"
# recordNames options:
retrieveIds(
target = prod,
projectName = "example",
modelName = "subject")
## [1] "EXAMPLE-HS1" "EXAMPLE-HS10" "EXAMPLE-HS11" "EXAMPLE-HS12" "EXAMPLE-HS2"
## [6] "EXAMPLE-HS3" "EXAMPLE-HS4" "EXAMPLE-HS5" "EXAMPLE-HS6" "EXAMPLE-HS7"
## [11] "EXAMPLE-HS8" "EXAMPLE-HS9"
# attributeName(s) options:
retrieveAttributes(
target = prod,
projectName = "example",
modelName = "subject")
## [1] "name" "project" "biospecimen" "group"
For more complex needs, like a complicated query()
request, you might need to access the project’s template itself. That
can be achieved via the retrieveTemplate() function:
# To retrieve the project template:
temp <- retrieveTemplate(
target = prod,
projectName = "example")
To explore the return, I recommend starting with the str()
function, looking only a few levels in. You should see
something like this:
## List of 1
## $ models:List of 6
## ..$ subject :List of 2
## .. ..$ documents: Named list()
## .. ..$ template :List of 4
## ..$ rna_seq :List of 2
## .. ..$ documents: Named list()
## .. ..$ template :List of 4
## ..$ population :List of 2
## .. ..$ documents: Named list()
## .. ..$ template :List of 4
## ..$ project :List of 2
## .. ..$ documents: Named list()
## .. ..$ template :List of 3
## ..$ flow :List of 2
## .. ..$ documents: Named list()
## .. ..$ template :List of 4
## ..$ biospecimen:List of 2
## .. ..$ documents: Named list()
## .. ..$ template :List of 4
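A depth-limited view like the one above can be requested through the `max.level` argument of `str()`. The sketch below uses a small hand-built list, `toy`, as a stand-in for the real template so that it runs without a magma connection; its contents are assumptions that only mimic the shape of the output shown above.

```r
# Toy stand-in for a retrieved template (shape only, contents made up).
toy <- list(models = list(
  subject = list(
    documents = list(),
    template  = list(name = "subject", attributes = list(),
                     identifier = "name", parent = "project")
  ),
  project = list(
    documents = list(),
    template  = list(name = "project", attributes = list(),
                     identifier = "name")
  )
))

# Show only the first three levels of nesting.
str(toy, max.level = 3)

# Model names are the names of the 'models' element.
names(toy$models)
```

With the real return, the equivalent call would be `str(temp, max.level = 3)`.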
Then, follow up by looking further into the $template
of individual models, perhaps as below:
# For the "subject" model:
str(temp$models$subject$template)
## List of 4
## $ name : chr "subject"
## $ attributes:List of 6
## ..$ created_at :List of 8
## .. ..$ name : chr "created_at"
## .. ..$ attribute_name: chr "created_at"
## .. ..$ display_name : chr "Created At"
## .. ..$ restricted : logi FALSE
## .. ..$ read_only : logi FALSE
## .. ..$ hidden : logi TRUE
## .. ..$ validation : NULL
## .. ..$ attribute_type: chr "date_time"
## ..$ updated_at :List of 8
## .. ..$ name : chr "updated_at"
## .. ..$ attribute_name: chr "updated_at"
## .. ..$ display_name : chr "Updated At"
## .. ..$ restricted : logi FALSE
## .. ..$ read_only : logi FALSE
## .. ..$ hidden : logi TRUE
## .. ..$ validation : NULL
## .. ..$ attribute_type: chr "date_time"
## ..$ name :List of 8
## .. ..$ name : chr "name"
## .. ..$ attribute_name: chr "name"
## .. ..$ display_name : chr "Name"
## .. ..$ restricted : logi FALSE
## .. ..$ read_only : logi FALSE
## .. ..$ hidden : logi FALSE
## .. ..$ validation : NULL
## .. ..$ attribute_type: chr "identifier"
## ..$ project :List of 10
## .. ..$ name : chr "project"
## .. ..$ attribute_name : chr "project"
## .. ..$ model_name : chr "project"
## .. ..$ link_model_name: chr "project"
## .. ..$ display_name : chr "Project"
## .. ..$ restricted : logi FALSE
## .. ..$ read_only : logi FALSE
## .. ..$ hidden : logi FALSE
## .. ..$ validation : NULL
## .. ..$ attribute_type : chr "parent"
## ..$ biospecimen:List of 10
## .. ..$ name : chr "biospecimen"
## .. ..$ attribute_name : chr "biospecimen"
## .. ..$ model_name : chr "biospecimen"
## .. ..$ link_model_name: chr "biospecimen"
## .. ..$ display_name : chr "Biospecimen"
## .. ..$ restricted : logi FALSE
## .. ..$ read_only : logi FALSE
## .. ..$ hidden : logi FALSE
## .. ..$ validation : NULL
## .. ..$ attribute_type : chr "collection"
## ..$ group :List of 8
## .. ..$ name : chr "group"
## .. ..$ attribute_name: chr "group"
## .. ..$ display_name : chr "Group"
## .. ..$ restricted : logi FALSE
## .. ..$ read_only : logi FALSE
## .. ..$ hidden : logi FALSE
## .. ..$ validation : NULL
## .. ..$ attribute_type: chr "string"
## $ identifier: chr "name"
## $ parent : chr "project"
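When the full `str()` output is more than needed, individual fields can be pulled out of the attribute definitions with `vapply()`. The sketch below rebuilds a trimmed copy of the “subject” attributes by hand (the `attrs` list, keeping only two fields per attribute) so that it runs on its own; with a real template, `temp$models$subject$template$attributes` would take its place.

```r
# Trimmed, hand-built copy of the "subject" attribute definitions shown above.
attrs <- list(
  created_at  = list(attribute_type = "date_time",  hidden = TRUE),
  updated_at  = list(attribute_type = "date_time",  hidden = TRUE),
  name        = list(attribute_type = "identifier", hidden = FALSE),
  project     = list(attribute_type = "parent",     hidden = FALSE),
  biospecimen = list(attribute_type = "collection", hidden = FALSE),
  group       = list(attribute_type = "string",     hidden = FALSE)
)

# One attribute_type per attribute, as a named character vector.
types <- vapply(attrs, function(a) a$attribute_type, character(1))
types

# Names of the attributes that are not marked hidden.
visible <- names(attrs)[!vapply(attrs, function(a) a$hidden, logical(1))]
visible
```

In this example, the non-hidden names coincide with the retrieveAttributes() output shown earlier for the “subject” model.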
Main data download functions:
Finally, the meat of why we’re here.
magma has two main data output functions, /retrieve and /query.
magmaR provides methods for both.
retrieve() & retrieveJSON()
retrieve() is probably the main workhorse function of magmaR.
If your goal is to download “subject” data for a
specific patient of a project, or for all patients of the project, this
is the function to start with.
The basic structure is to provide which project, projectName,
and which model, modelName, that you want data for.
df <- retrieve(
target = prod,
projectName = "example",
modelName = "subject")
head(df)
## name project biospecimen group
## 1 EXAMPLE-HS1 example EXAMPLE-HS1-WB1 g1
## 2 EXAMPLE-HS10 example EXAMPLE-HS10-WB1 g3
## 3 EXAMPLE-HS11 example EXAMPLE-HS11-WB1 g3
## 4 EXAMPLE-HS12 example EXAMPLE-HS12-WB1 g3
## 5 EXAMPLE-HS2 example EXAMPLE-HS2-WB1 g1
## 6 EXAMPLE-HS3 example EXAMPLE-HS3-WB1 g1
Optionally, a set of recordNames or attributeNames
can be given as well to grab a more specific
subset of data from the given project-model pair.
df <- retrieve(
target = prod,
projectName = "example",
modelName = "subject",
recordNames = c("EXAMPLE-HS1", "EXAMPLE-HS2"),
attributeNames = "group")
head(df)
## name group
## 1 EXAMPLE-HS1 g1
## 2 EXAMPLE-HS2 g1
(You can use the retrieveIds() and retrieveAttributes()
functions described above in the
Helper functions section to determine options for the
recordNames and attributeNames inputs, respectively.)
Unfortunately, for certain attribute data types, matrix
and table, the literal data are not actually given via
magma/retrieve when format = "tsv". Instead, only a pointer
is returned. For such attributes, the retrieveJSON()
function can retrieve the data (via a magma/retrieve call with
format = "json"), and a wrapper that makes efficient use of
retrieveJSON() specifically for matrix data retrieval is
also included. Users should not typically need to make use of
retrieveJSON() directly: when the desired data is a
matrix, retrieveMatrix() is recommended instead. More
details on that function follow.
json <- retrieveJSON(
target = prod,
projectName = "example",
modelName = "rna_seq",
recordNames = c("EXAMPLE-HS1-WB1-RSQ1", "EXAMPLE-HS2-WB1-RSQ1"),
attributeNames = "gene_counts")
retrieveMatrix()
Because matrices are a very common and important data structure, but
are not accessible via retrieve(), we provide this
function. For a single matrix-type attribute, it will obtain data from
magma in the required json structure, and then automatically reorganize
said data into the matrix structure that a user would typically
expect.
In the example below, we obtain the transcripts-per-million(-reads)
normalized counts data for all records/samples of the example project.
In this matrix, columns will be the individual records, and rows will be
features. Specifically, for the example data here, those row names are
“gene1”, “gene2”, and so on, but for real rna_seq data, those row names
would typically be the Ensembl gene ids that each row of the matrix
represents.
mat <- retrieveMatrix(
target = prod,
projectName = "example",
modelName = "rna_seq",
recordNames = "all",
attributeNames = "gene_tpm")
head(mat, n = c(6,3))
## EXAMPLE-HS10-WB1-RSQ1 EXAMPLE-HS11-WB1-RSQ1 EXAMPLE-HS12-WB1-RSQ1
## gene1 0.5187 7.9960 4.8278
## gene2 29.9572 42.9785 31.2540
## gene3 111.6587 154.9225 114.0897
## gene4 269.3555 426.7866 302.6299
## gene5 0.3891 1.9990 0.0000
## gene6 0.0000 0.0000 0.0000
Most users need not worry about the internal method, but for those
that are curious: Under the hood, data is grabbed via
retrieveJSON() for 10 records at a time. The relevant data
are then extracted from the complex list output of this retrieval route
and converted into a matrix structure where column names are
the recordNames. Row names, which indicate what each row of the data
represents, are then grabbed from the model’s template.
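The batching idea can be sketched in base R as follows. This is not magmaR's source code: `fetch_batch()` is a hypothetical stand-in that fabricates numbers instead of contacting magma, and the row names are written out by hand rather than read from a template. Only the overall shape (split into groups of 10, fetch, column-bind, name the rows) mirrors the description above.

```r
# Hypothetical stand-in for one json-style retrieval: returns a named list
# holding one (made-up) numeric vector per requested record.
fetch_batch <- function(ids) {
  lapply(stats::setNames(ids, ids), function(id) as.numeric(seq_len(3)) * nchar(id))
}

record_ids <- sprintf("EXAMPLE-HS%d-WB1-RSQ1", 1:12)

# Split the record names into groups of at most 10.
batches <- split(record_ids, ceiling(seq_along(record_ids) / 10))

# Fetch each batch, then flatten to a single list with one entry per record.
per_record <- do.call(c, unname(lapply(batches, fetch_batch)))

# Bind records as columns; list names become the column names.
mat <- do.call(cbind, per_record)

# In magmaR the row names come from the model's template; here they are
# written by hand for illustration.
rownames(mat) <- paste0("gene", seq_len(nrow(mat)))

dim(mat)
```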
query()
The Magma Query API lets you pull data out of Magma through an
expressive query interface. Often, if you want a specific set of data
from model-X, but only, say, for records where linked records of model-Y
have data for attribute-Z, then this is the endpoint you want.
But note: the format of query() calls can be a bit
complicated, so it is recommended to check if
retrieveMetadata() might better serve your purposes first.
We’ll describe that function a bit later.
For guidance on how to format query() calls, see
?query and https://mountetna.github.io/magma.html#query.
query_out <- query(
target = prod,
projectName = "example",
queryTerms =
list('rna_seq',
'::all',
'biospecimen',
'::identifier')
)
Details: The default output of this function is a list conversion of
the direct json output returned by magma/query. This list will contain
either 2 or 3 parts:
## [1] "answer" "type" "format"
answer, type (optional), and format.
Alternatively, the output can be reformatted as a data.frame if
format = "df" is given.
biospecimen_ids_of_rnaseq_records <- query(
    target = prod,
    projectName = "example",
    queryTerms =
        list('rna_seq',
             '::all',
             'biospecimen',
             '::identifier'),
    format = "df"
    )
head(biospecimen_ids_of_rnaseq_records)
## example::rna_seq#tube_name example::biospecimen#name
## 1 EXAMPLE-HS10-WB1-RSQ1 EXAMPLE-HS10-WB1
## 2 EXAMPLE-HS11-WB1-RSQ1 EXAMPLE-HS11-WB1
## 3 EXAMPLE-HS12-WB1-RSQ1 EXAMPLE-HS12-WB1
## 4 EXAMPLE-HS1-WB1-RSQ1 EXAMPLE-HS1-WB1
## 5 EXAMPLE-HS2-WB1-RSQ1 EXAMPLE-HS2-WB1
## 6 EXAMPLE-HS3-WB1-RSQ1 EXAMPLE-HS3-WB1
Details: When format = "df" is added, the list output
will be converted to a data.frame where data comes from the
answer and column names come from the format pieces.
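As a rough sketch of that conversion, assume the list output holds an `answer` made of identifier/value pairs and a `format` with one label per column. The toy `query_out` below is built by hand on that assumption (the values are copied from the table above), and the steps shown are an illustration in base R rather than magmaR's actual implementation.

```r
# Hand-built stand-in for a list-format query() return (assumed shape).
query_out <- list(
  answer = list(
    list("EXAMPLE-HS1-WB1-RSQ1", "EXAMPLE-HS1-WB1"),
    list("EXAMPLE-HS2-WB1-RSQ1", "EXAMPLE-HS2-WB1")
  ),
  format = list("example::rna_seq#tube_name", "example::biospecimen#name")
)

# Each element of 'answer' becomes one row.
rows <- lapply(query_out$answer, unlist)
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)

# Column names come from the 'format' piece.
colnames(df) <- unlist(query_out$format)
df
```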