--- title: "Downloading data from magma with the R-client magmaR" author: - name: Daniel Bunis affiliation: Data Science CoLab, University of California San Francisco, San Francisco, CA date: "Updated: April 7, 2021" output: BiocStyle::html_document: toc_float: true package: magmaR vignette: > %\VignetteIndexEntry{Downloading data from magma} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE, results="hide", message=FALSE} knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE) library(BiocStyle) library(vcr) library(magmaR) TOKEN <- magmaR:::.get_sysenv_or_mock("TOKEN") URL <- magmaRset("")$url vcr_configure( filter_sensitive_data = list("<<>>" = TOKEN), dir = "../tests/fixtures" ) insert_cassette(name = "Download-vignette") ``` # Quick Start ```{r, eval = FALSE} # Installation options (Choose one. CRAN method is recommended for most users.) ## 1. Most recent release version via CRAN install.packages("magmaR") ## 2. Development version via GitHub remotes::install_github("mountetna/monoetna", subdir = "etna/packages/magmaR") # Check installation and load the package library(magmaR) # Set up your authorization token and where to find magma magma <- magmaRset() ## Note: run as above, you will be prompted in the console to provide your token. ## This token can be obtained from Janus. # Now, you're ready to retrieve some data! retrieve( target = magma, projectName = "example", modelName = "subject" ) ``` # Introduction ## magmaR, magma, and the UCSF Data Library This vignette focuses on how to explore, query, and retrieve data from **magma** via its R-client, **magmaR**. Magma is the *data warehouse* of the **UCSF Data Library**. The Data Library holds various research data sets, broadly broken up into **“projects”**, and provides tools for adding to, organizing, viewing and analyzing these data sets. Internally, the system is composed of a set of applications that each provides a different piece of the Data Library pie. Through the Magma application, one can query and retrieve data from, or update data within, "projects" that exist in the library. We provide some more detail below, but for an even deeper overview of the structure of magma and the Data Library system than is provided here, you can refer to the main source of documentation, https://mountetna.github.io/magma.html. ## Organization of data within magma Data types within magma ***projects*** are organized into ***models***, and individual data then make up the ***records*** of those models. For example, information & data for 3 tubes run on a flow cytometer might make up 3 individual records of a flow cytometry model Each *record* might have multiple ***attributes***, such as the "gene_counts" matrix, the "cell_number", or sorted cell "fraction" *attributes* of *records* that are part of an "rna_seq" *model*. The set of attributes which a record might possess, are defined separately for each model. Thus, records of a "flow" model might have an "fcs_file" attribute, but records of the "rna_seq" model likely would not. Hierarchically, the root of a project is always the "project" model, and every other model must have a single `parent` model. Thus, the data graph is like a tree. (Technically, link-type attributes may be used to indicate additional one-to-one or one-to-many relationships between models other than the tree-like *parent <- model <- "children"* relationships, which allows the graph to be more like a directed acyclic graph (DAG) than a tree... but imagining projects as trees is certainly easier than as an abstract blob.) Here is a sketch of what an example project might look like. Quite literally though, it is in fact the layout of the "example" project which we will be playing with later on in this vignette: ![example_project_map](example_map.jpg) This "example" *project* has 6 different *models*, including the project model itself. Each *model* holds different chunks of information (*attributes*) about data in the "example" project. For example, the `subject` model contains information about individuals (*records*) for whom biospecimens exist; the `biospecimen` model would then contain information about specific specimens obtained from each subject; and the `flow` and `rna_seq` models contain data and information from individual flow cytometry or rna_seq assays that were run on an individual biospecimen. Each *project* in the data library system might have its own distinct modeling layout -- as where to split up the information sharing scheme is highly dependent on a project's data collection and experimental plans. However, in general, one can think of *records* of a *model* at the bottom of the tree as inheriting attributes from the parent *records* of their parent *models*. So for example, although we can also think of each *model* as it's own independent set of data, "rna_seq"-model records are ultimately linked to individual "subject"-model records. Thus, even though attributes of the "subject"-model are not directly included in the "rna_seq"-model, all "subject"-model attributes of "subject"-model records do apply to linked "rna_seq"-model records. In magmaR, we include a function retrieveMetadata() for retrieving such linked data. More on that later. At this point, you should know enough about the structure of magma projects to start using magmaR. But more information exists within magma's own documentation: https://mountetna.github.io/magma.html. ## How magmaR functions work In general, magmaR functions will: 1. Take in inputs from the user. 2. Make a curl request that calls on a magma function to either send or receive desired data. 3. Restructure the received data, typically minimally, to be more accessible for downstream analyses. 4. Return the output. ### Data Restructuring Details The goal of magmaR is to allow users as direct as possible, yet also as ready-to-analyze as possible, access to data that exists within magma. Thus, some minor restructuring is performed by magmaR functions which does not change the underlying data, but does reorganize that data into more efficient formats for downstream analysis within R. #### The two main output structures of magma returns There are two main output structures for returns from magma: - Tab Separated Value (**tsv**) tables - JavaScript Object Notation (**json**) objects Both formats are received as character strings, but then: - **tsv** format returns are converted to **data.frames**. - **json** format returns are converted to a **nested lists**. The data.frame format tends to be easier to work with, but both of these can be fit quite readily into downstream applications. # Installation magmaR will be submitted to CRAN soon. Once accepted, built, and hosted by CRAN, users will be able to install the package with just... ```{r, eval = FALSE} install.packages("magmaR") ``` Alternatively/currently, development versions of magmaR can be installed via the GitHub with: ```{r, eval = FALSE} if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes") remotes::install_github("mountetna/monoetna", subdir = "etna/packages/magmaR") ``` After either of the above, one can check proper installation with: ```{r} library(magmaR) ``` # Authorization process, a.k.a. janus token utilization: In order to access data in magma, a user needs to be authorized to do so. How this is achieved is via provision of a user-specific, temporary, string which we call a ***token***. This token can be obtained from https://janus.ucsf.edu/. ## Providing a token Within magmaR, the token is provided as part of the `target` input which can be constructed with the `magmaRset()` function. To this function, a user's token can be provided in one of two ways. **1) Via an interactive prompt (Recommended when coding interactively):** When not provided explicitly, as is the other method, the user will be prompted to provide their token via the interactive console. It is recommended that you store the output of your call to `magmaRset()` as a variable, and then provide this variable within each subsequent call to a magmaR function, as below. ```{r, hide = TRUE, echo=FALSE} prod <- magmaRset(token = TOKEN, url = URL) ``` ```{r, eval = FALSE} # Method1: User will be prompted to give their token in the R console prod <- magmaRset() ids_subject <- retrieveProjects( # Now, we give the output of magmaRset() to the 'target' input of any # other magmaR function. target = prod) ``` If you run the above code, you should be prompted to ``` Enter your Janus TOKEN (without quotes): ``` To fill this in, navigate to Janus via your favorite browser, click the `Copy Token` button, then paste the value into your console. **2) Give your token explicitly** Users can alternatively fill their token in by providing it explicitly to the `token` input of `magmaRset()`. This is not the generally recommended method because it is not ideal to have authorization values saved within potentially share-able locations. However the tokens are short-lived and methods of mitigating risk of such token exposure exist, see below. ```{r, eval = FALSE} prod <- magmaRset(token = "") ids_subject <- retrieveProjects( # Now, we give the output of magmaRset() to the 'target' input of any # other magmaR function. target = prod) ``` **NOTE: Instead of adding your token directly to any file which you might save, it is recommended that you utilize your `.Renviron` file to store your token.** To do so, you can: 1. Utilize the convenient `usethis::edit_r_environ()` function to open your `.Renviron` file. (Install `usethis` with `install.packages("usethis")` first.) 2. Then add this line to the opened file: `TOKEN=""`. 3. Save the file & restart your R session. 4. Now you can provide `magmaRset(token = Sys.getenv("TOKEN"))`, but when you save your script or .Rmd, the token itself will not be included. 6. Repeat these steps whenever your token refreshes. Tokens normally refresh every 24 hours, but you'll know when this happens because you will get the error message below. ## When magma thinks you are unauthorized If a request to magma returns that "You are unauthorized", magmaR will provide extra info so that users can fix this issue: ``` # Error message when magma sends back that user is unauthorized: You are unauthorized. If you think this is a mistake, re-run `?magmaRset` to update your 'token' input, then retry. ``` # Controlling which version of magma to target For advanced users with access to the staging or development versions of magma, switching can be achieved by adding `url = "production/staging/development-url"` when setting up your `target` with `magmaRset()`. ```{r, eval = FALSE} dev <- magmaRset(url = "http://magma.development.local") # When calling to magma... ids_subject <- retrieveIds( # Now give this to 'target': target = dev, # ^^ projectName = "example", modelName = "subject", url.base = "http://magma.development.local") ``` # Helper functions These functions allow exploration of what data exists within a given project. Although it is possible to rely on timur.ucsf.edu/\/map, or on Timur's search functionality, in order **to determine options for `projectName`, `modelName`, `recordNames` or `attributeName(s)` inputs**, magmaR provides these helper functions to allow users to achieve these goals without leaving R. ```{r} # projectName options: retrieveProjects( target = prod) # modelName options: retrieveModels( target = prod, projectName = "example") # recordNames options: retrieveIds( target = prod, projectName = "example", modelName = "subject") # attributeName(s) options: retrieveAttributes( target = prod, projectName = "example", modelName = "subject") ``` For more complex needs like a complicated `query()` request, you might require accessing the project's template itself. That can be achieved via the `retrieveTemplate()` function: ```{r} # To retrieve the project template: temp <- retrieveTemplate( target = prod, projectName = "example") ``` To explore the return, I recommend starting with the `str()` function looking only a few levels in. You should see something like this: ```{r} str(temp, max.level = 3) ``` Then, followup by looking into the `$template` of individual models further, perhaps as below: ```{r} # For the "subject" model: str(temp$models$subject$template) ``` # Main data download functions: Finally, the meat of why we're here. **magma** has two main data output functions, **`/retrieve`** and **`/query`**. **magmaR** provides methods for **both**. ## retrieve() & retrieveJSON() `retrieve()` is probably the main workhorse function of `magmaR`. If your goal is to download "subject" data for a specific patient of a project, or for all patients of the project, this is the function to start with. The basic structure is to provide which project, `projectName` and which model, `modelName`, that you want data for. ```{r} df <- retrieve( target = prod, projectName = "example", modelName = "subject") head(df) ``` Optionally, a set of `recordNames` or `attributeNames` can be given as well to grab a more specific subset of data from the given project-model pair. ```{r} df <- retrieve( target = prod, projectName = "example", modelName = "subject", recordNames = c("EXAMPLE-HS1", "EXAMPLE-HS2"), attributeNames = "group") head(df) ``` (You can use the `retrieveIDs()` and `retrieveAttributes()` functions described above in the *Helper functions* section to determine options for the `recordNames` and `attributeNames` inputs, respectively.) Unfortunately, for certain attribute data types, `matrix` and `table`, the literal data are not actually given via magma/retrieve when `format = "tsv"`. Instead only a pointer is returned. For such attributes, the `retrieveJSON()` function can retrieve such data (via a magma/retrieve call with `format = "json"`) and a wrapper that makes efficient use of `retrieveJSON()` specifically for matrix data retrieval is also included. Users should not typically need to make use of `retrieveJSON()` directly, as when the desired data is a matrix, `retrieveMatrix()` is recommended instead. More details on that function follow. ```{r retJSON} json <- retrieveJSON( target = prod, projectName = "example", modelName = "rna_seq", recordNames = c("EXAMPLE-HS1-WB1-RSQ1", "EXAMPLE-HS2-WB1-RSQ1"), attributeNames = "gene_counts") ``` ## retrieveMatrix() Because matrices are a very common and important data structure, but are not accessible via `retrieve()`, we provide this function. For a single matrix-type attribute, it will obtain data from magma in the required json structure, and then automatically reorganize said data into the matrix structure that a user would typically expect. In the example below, we obtain the transcripts-per-million(-reads) normalized counts data for all records/samples of the example project. In this matrix, columns will be the individual records, and rows will be features. Specifically, for the example data here, those row names are "gene1", "gene2", and so on, but for real rna_seq data, those row names would typically be the Ensembl gene ids that each row of the matrix represents. ```{r matrix} mat <- retrieveMatrix( target = prod, projectName = "example", modelName = "rna_seq", recordNames = "all", attributeNames = "gene_tpm") head(mat, n = c(6,3)) ``` Most user need not worry about the internal method, but for those that are curious: Under the hood, data is grabbed via `retrieveJSON()` for 10 records at a time. The relevant data are then extracted from the complex list output of this retrieval route, then they are converted into a matrix structure where column names are the `recordNames`. Row names are then grabbed from the model's template for what this data should represent. ## query() The Magma Query API lets you pull data out of Magma through an expressive query interface. Often, if you want a specific set of data from model-X, but only, say, for records where linked records of model-Y have data for attribute-Z, then this is the endpoint you want. But note: the format of `query()` calls can be a bit complicated, so it is recommended to check if `retreiveMetadata()` might better serve your purposes first. We'll describe that function a bit later. For guidance on how to format `query()` calls, see `?query` and https://mountetna.github.io/magma.html#query. ```{r query} query_out <- query( target = prod, projectName = "example", queryTerms = list('rna_seq', '::all', 'biospecimen', '::identifier') ) ``` Details: The default output of this function is a list conversion of the direct json output returned by magma/query. This list will contain either 2 or 3 parts: ```{r} names(query_out) ``` answer, type (optional), and format. Alternatively, the output can be reformatted as a dataframe if `format = "df"` is given. ```{r query2} subject_ids_of_rnaseq_records <- query( target = prod, projectName = "example", queryTerms = list('rna_seq', '::all', 'biospecimen', '::identifier'), format = "df" ) head(subject_ids_of_rnaseq_records) ``` Details: When `format = "df"` is added, the list output will be converted to a data.frame where data comes from the `answer` and column names come from the `format` pieces. ## retrieveMetadata() This function attempts to simplify the process of obtaining "metadata" from model X for "target data" of model Y. For example, this function could be used to extract "subject"-model data from the "example" project that is linked to "rna_seq"-model records. ```{r meta} meta <- retrieveMetadata( target = prod, projectName = "example", meta_modelName = "subject", meta_attributeNames = "all", target_modelName = "rna_seq", target_recordNames = "all") head(meta, n = c(6,10)) ``` General Details: The function determines how `target_modelName` and `meta_modelName` models relate to each other, then obtains data for `meta_attributeNames`-attributes, from the `meta_modelName`-model, for records of this model that are linked to `target_recordName`-records of the `target_modelName`-model. Data is then output as a data.frame with one row per `target_recordName`. Specific Details: The function first determines the model -> model path for navigating between the meta and target models. (At the moment, ONLY parent links are used for this purpose, but utilization of link attributes is planned for the future.) Then, `query()`s based on these paths are utilized to obtain how target-model and meta-model recordNames are linked. Data is then `retrieve()`d from the `meta_modelName`-model for `meta_attributeNames`-attributes of records that are linked to `target_recordNames`-records of the `target_modelName`-model. Next, if there is more than 1:1 mapping between meta-model records to target-model records, the metadata is reorganized rightwards in order to have one output row per "target"-record. Finally, this data is output as a dataframe with rows = `target_recordNames` and columns of linkage record identifiers followed by columns of each requested meta-model attribute. # Putting it all together In our example code for `retrieveMatrix()` and `retrieveMetadata()`, we obtained RNAseq data from the example project, and the linked metadata from the subject-model level. Now that we have these metadata for our rna_seq records, we could use them to start exploring our rna_seq data: ```{r, eval=TRUE, include=FALSE} ditto_available <- requireNamespace("dittoSeq") ``` ```{r, eval = ditto_available} library(dittoSeq) # Explore RNAseq data with dittoSeq sce <- importDittoBulk( list(tpm = mat), # mat was obtained with retrieveMatrix() metadata = meta # meta was obtained with retrieveMetadata() ) dittoBoxPlot(sce, "gene1", group.by = "group") ``` # Session information ```{r, include = FALSE} eject_cassette() ``` ```{r} sessionInfo() ```