Uploading data to magma with the R-client magmaR

Introduction

Mini overview of magmaR, magma, and the UCSF Data Library system

This vignette focuses on how to upload data to magma via its R-client, magmaR.

Magma is the data warehouse of the UCSF Data Library.

For a deeper overview of the structure of magma and the UCSF Data Library system, please see the download-focused vignette, vignette("Download", package = "magmaR"), or the Data Library’s own documentation, here https://mountetna.github.io/magma.html.

Scope of this vignette

This vignette assumes that you have already gone through the download-focused vignette, vignette("Download", package = "magmaR"), which covers how to 1) install magmaR, 2) use a token for authentication, and 3) switch, if needed, between the production / staging / development magma environments.

This vignette focuses on use-cases where a user wishes to push data, from their own system, to magma.

Not all Data Library users have write privileges, so not all magmaR users will have need for this vignette.

For those that do, please note: Sending data to magma is an advanced use-case which needs to be treated with due care. The functions involved have the ability to overwrite data, so it is imperative, for data integrity purposes, that inputs to these functions are double-checked in order to make sure that they target only the intended records & attributes.

Also note that a users’ write-privileges are project-specific, so it is unlikely that you will be able to run any code, exactly as it exists in this vignette, without getting an authorization error. (That also means you don’t run the risk of breaking our download vignette by testing out any fun alterations of the code in here… trade-offs =] .)

How magmaR functions work

In general, magmaR functions will:

  1. Take in inputs from the user.
  2. Perform some amount of input parsing and transformation.
  3. Make a curl request to magma via magma’s own retrieve, query, or update functions to send or receive desired data.
  4. Restructure any received data, typically minimally, to be more accessible for downstream analyses.
  5. Return the output.

Steps 4&5 are very simple for upload functions because the only return from magma will be curl request attributes that indicate whether the call to magma/update worked.

So in this vignette, our singular focus will be on how to input your data so that magmaR can send it to magma properly.

Upload Functions

Overview

The base function:

  • updateValues() = a few-frills wrapper of magma’s sole data input function, /update. It’s quite flexible, but that flexibility owes largely to a rigid input structure, so we provide a few wrapper functions which make some common upload needs easier.

Convenient wrapper functions:

  • updateFromDF() allows update of a set of attributes, of multiple records, via a simple data.frame structure, or via a csv/tsv file encoding such data.
  • updateMatrix() allows update of a matrix-type attribute, of multiple records, directly from an R matrix, data.frame, or a csv/tsv file encoding such data.

Common features

Jumping in, the first thing to know is that all magmaR update functions will:

  • Summarize updates to be made, prior to payload transfer to magma.
    • Currently, the summary describes record IDs for each mode that would be created or updated, but that summary may grow in the future.
  • Prompt the user to check that the summary matches their intended updates before proceeding, unless the input auto.proceed is set to FALSE.
  • Perform the Update and Report Status, success/failure, of the post to magma/update, including any error messages returned.

For example, after running some magmaR update code, you might see a summary like this:

For model "rna_seq", this update() will create 3 NEW records:
    ID1
    ID2
    ID3
WARNING: Check the above carefully. Once created, there is no easy way to remove records from magma.
For model "biospecimen", this update() will update 2 records:
    EXAMPLE-HS1-WB1
    EXAMPLE-HS2-WB1
For model "rna_seq", this update() will update 1 records:
    EXAMPLE-HS1-WB1-RSQ1

Proceed, Y/n?

It is highly recommended that these summaries be checked carefully for accuracy before proceeding as updates have the power to overwrite magmaR data! There is currently no history tracking in magma, so updates cannot be rolled back.

Also, contrary to the “update” portion of the function names, these functions can add totally new data to magma records.

Once created, it is not easy to fully remove records with an incorrect ID.

So, especially if you get a WARNING message like in the example above, we recommend to always double-check the summary output before proceeding!

That said, it is possible to bypass the user-prompt step.

To continue with your upload, simply enter y, yes, or the like, and hit enter to proceed. Or, enter anything else to stop.

After a successful update() users will see this message (unless verbose has been set to FALSE):

/update: successful.

Running non-interactively

In cases where you are trying to automate uploads with a script, inputting “yes” to proceed is not possible. In such cases, the user-prompt step can also be turned off by adding the input auto.proceed = TRUE.

Of course, extra care should be taken to ensure proper payload generation.

Example:

revs <- list(
    "biospecimen" = list(
        "EXAMPLE-HS1-WB1" = list(biospecimen_type = "Whole Blood"),
        "EXAMPLE-HS2-WB1" = list(biospecimen_type = "Whole Blood")
        ),
    "rna_seq" = list(
        "EXAMPLE-HS1-WB1-RSQ1" = list(fraction = "Tcells")
    )
)
updateValues(
    target = prod,
    project = "example",
    revisions = revs,
    auto.proceed = TRUE) ## <---
## For model "biospecimen", this update() will update 2 records:
##     EXAMPLE-HS1-WB1
##     EXAMPLE-HS2-WB1
## For model "rna_seq", this update() will update 1 records:
##     EXAMPLE-HS1-WB1-RSQ1
## /update: successful.

Example Data

Examples in this vignette will target the same “example” project used in the download vignette.

To refresh, the model map for that project is below.

example_project_map
example_project_map

The “biospecimen” and “rna_seq” models will be our particular targets, and have attributes…

library(magmaR)
retrieveAttributes(target = prod, "example", "biospecimen")
## [1] "subject"          "name"             "biospecimen_type" "rna_seq"         
## [5] "flow"
retrieveAttributes(target = prod, "example", "rna_seq")
## [1] "biospecimen"     "tube_name"       "expression_type" "cell_number"    
## [5] "gene_tpm"        "gene_counts"     "fraction"

Function Details

updateMatrix()

As the name suggests, updateMatrix() is meant specifically for matrix data. It allows a user to point magmaR to either a file containing matrix data, or to a readily constructed matrix, without needing to perform the manual conversion of such data to the complicated revisions-input format of the updateValues() function.

Internally, the function performs some necessary validations, adjusts the matrix into the revisions-input format of updateValues(), then passes its payload along to updateValues() where all the common update functionality is housed.

Necessary inputs:

In addition to the normal target and projectName inputs here, modelName, and attributeName are also used to direct where to upload your matrix data.

A separate matrix input takes in the target data. This can be given a matrix or data.frame directly, or a string representing the path to a csv or tsv file containing such data. In all cases, matrix must be formatted to have:

  • Column Names (or the first row when providing a file) = the identifiers of target records
  • Column Data = the values for each record
  • Row Names (or the first column when providing a file) = the names of what data is in each row, ex: gene ids for rna_Seq data. These must be among the allowed ‘options’ for the target attribute. (See the ‘See Also’ section of ?updateMatrix for one method of finding out what these ‘options’ are.)

When matrix points to a file, the optional separator input can be used to indicate between csv (default assumption) versus tsv parsing.

Example:

To update the raw counts of our “rna_seq” model from either a csv, a tsv, or directly from a matrix, we could use the code below:

### From a csv
updateMatrix(
    target = prod,
    projectName = "example",
    modelName = "rna_seq",
    attributeName = "gene_counts",
    matrix = "path/to/rna_seq_counts.csv")

### From a tsv, set the 'separator' input to "\t"
updateMatrix(
    target = prod,
    projectName = "example",
    modelName = "rna_seq",
    attributeName = "gene_counts",
    matrix = "path/to/rna_seq_counts.tsv",
    # Use separator to adjust parsing for tab-separated values
    separator = "\t")

### From an already loaded matrix:
matrix <- retrieveMatrix(target = prod, "example", "rna_seq", "all", "gene_counts")
updateMatrix(
    target = prod,
    projectName = "example",
    modelName = "rna_seq",
    attributeName = "gene_counts",
    matrix = matrix)

Let’s explore the structure of matrix a little bit, noting that:

  1. Column names for the matrix are record identifiers for the target “rna_seq” model.
  2. Row names for the matrix are the feature-names for the matrix. So for rna_seq data, for example, these will be gene names of some type, typically Ensembl IDs.
head(matrix, n = c(6,2))
##       EXAMPLE-HS10-WB1-RSQ1 EXAMPLE-HS11-WB1-RSQ1
## gene1                     4                     8
## gene2                   231                    43
## gene3                   861                   155
## gene4                  2077                   427
## gene5                     3                     2
## gene6                     0                     0

updateFromDF()

This function allows easy update of a set of attributes for of multiple records, via a rectangular data input structure.

Utility: When aiming to update the same set of attributes for many different records, which is probably the most common case when wanting to upload data, use this function.

Internally, the function performs some necessary validations, adjusts the df into the rigid revisions-input format of updateValues(), then passes its payload along to updateValues() where all the common update functionality is housed.

Necessary inputs:

In addition to the normal target and projectName inputs, here a modelName is also needed in order to direct where to upload your data.

A separate df input then takes in the target data. df can be given as data.frame directly, or as a string representing the path to a csv or tsv file containing such data. In all cases, df must be formatted to have:

  • Rows = records, with the first column indicating the record identifiers.
  • Columns = represent the data desired to be given for each attribute.
  • Column Names (or the first row when providing a file) = attribute names. Except for the first column (ignored as this column’s data are used as identifiers), all column names must be valid attribute names of the target modelName.

When df points to a file, the optional separator input can be used to indicate between csv (default assumption) versus tsv parsing.

Example:

To update multiple attributes of multiple records of our “rna_seq” model from either a csv, a tsv, or directly from a data.frame, we could use the code below:

### From a csv
updateFromDF(
    target = prod,
    projectName = "example",
    modelName = "rna_seq",
    df = "path/to/rna_seq_attributes.csv")

### From a tsv, set the 'separator' input to "\t"
updateFromDF(
    target = prod,
    projectName = "example",
    modelName = "rna_seq",
    df = "path/to/rna_seq_attributes.tsv",
    # Use separator to adjust parsing for tab-separated values
    separator = "\t")

### From an already loaded data.frame:
df <- retrieve(target = prod, "example", "rna_seq", "all",
               c("tube_name", "cell_number", "fraction"))
updateFromDF(
    target = prod,
    projectName = "example",
    modelName = "rna_seq",
    df = df)

Let’s explore the structure of df a little bit, noting that:

  1. The left-most column holds record identifiers.
  2. Column 2 onwards are data columns which hold intended data for attributes of each target record.
  3. Column names for column 2 onwards are the target attribute-names.
head(df, n = c(6,3))
##               tube_name cell_number fraction
## 1 EXAMPLE-HS10-WB1-RSQ1       50000   Tcells
## 2 EXAMPLE-HS11-WB1-RSQ1       50000   Tcells
## 3 EXAMPLE-HS12-WB1-RSQ1       50000   Tcells
## 4  EXAMPLE-HS1-WB1-RSQ1       50000   Tcells
## 5  EXAMPLE-HS2-WB1-RSQ1       50000   Tcells
## 6  EXAMPLE-HS3-WB1-RSQ1       50000   Tcells

One final note: The identifier column’s name, though accurately the name of the identifier attribute here, does not actually matter. ‘tube_name’ is the name of the identifier attribute of this “rna_seq” model, but we could have named the column anything. This is an intended feature as it allows users to adjust records’ identifiers with this function by providing a separate data column, of new identifiers, with the identifier attribute name as its name!

updateValues()

updateValues() is the main workhorse function of magmaR’s data upload capabilities. It largely mimics magma/update, where each target model, record, and attribute are indicated individually, thus giving very flexible control. The function relies on a rigid nested list structure for providing such data. This revisions input structure can feel a bit clunky, but is ultimately very powerful.

Utility: Useful for smaller updates, and for any updates that cannot be handled with the updateFromDF() and updateMatrix() wrappers.

Necessary inputs:

In addition to the normal target and projectName inputs, revisions is the primary input of updateValues():

  • revisions includes information about which model(s), which record(s), and which attribute(s) to update, and with what value(s). Each of these levels are encoded as a nested list where the format looks something like:
revisions = list(
    modelName = list(
        recordName = list(
            attributeName = value(s)
        )
    )
)

To make more than one update within a single call, you can simply add an additional index at any of these levels.

For example, the below revisions would encode some very different updates

# 2 attributes for the same record
revisions = list(
    modelName = list(
        recordName = list(
            attributeName1 = value(s),
            attributeName2 = value(s)
            )
        )
    )

# The same attribute for 2 different records
revisions = list(
    modelName = list(
        recordName1 = list(
            attributeName1 = value(s)
            ),
        recordName2 = list(
            attributeName1 = value(s)
            )
        )
    )

# Some attribute for 2 different records of two different models
revisions = list(
    modelName1 = list(
        recordName1 = list(
            attributeName1 = value(s)
            )
        ),
    modelName2 = list(
        recordName2 = list(
            attributeName2 = value(s)
            )
        )
    )

Say we wanted to update the “biospecimen_type” attribute of 2 records from the “biospecimen” model, and the “fraction” attribute for 1 record from the “rna_seq” model. The code for this could be:

# Create 'revisions'
revs <- list(
    "biospecimen" = list(
        "EXAMPLE-HS1-WB1" = list(biospecimen_type = "Whole Blood"),
        "EXAMPLE-HS2-WB1" = list(biospecimen_type = "Whole Blood")
        ),
    "rna_seq" = list(
        "EXAMPLE-HS1-WB1-RSQ1" = list(fraction = "Tcells")
    )
)

# Run update()
updateValues(
    target = prod,
    project = "example",
    revisions = revs)

A user would then see a summary of models/records to be updated, followed by a prompt to proceed or not:

For model "biospecimen", this update() will update 2 records:
    EXAMPLE-HS1-WB1
    EXAMPLE-HS2-WB1
For model "rna_seq", this update() will update 1 records:
    EXAMPLE-HS1-WB1-RSQ1

Proceed, Y/n?

As mentioned previously, it is highly recommended that summary outputs be checked carefully for accuracy before proceeding.

However, for running update*() code in non-interactive modes, like scripts or .Rmd knits, this user-prompt step can also be turned off by adding the input auto.proceed = TRUE. Example:

updateValues(
    target = prod,
    project = "example",
    revisions = revs,
    auto.proceed = TRUE)
## For model "biospecimen", this update() will update 2 records:
##     EXAMPLE-HS1-WB1
##     EXAMPLE-HS2-WB1
## For model "rna_seq", this update() will update 1 records:
##     EXAMPLE-HS1-WB1-RSQ1
## /update: successful.

After a successful update() a user should see this message (unless verbose has been set to FALSE):

/update: successful.

Session information

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dittoSeq_1.19.0  ggplot2_3.5.1    magmaR_1.0.3     vcr_1.6.0       
## [5] BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.37.0 gtable_0.3.6               
##  [3] xfun_0.49                   bslib_0.8.0                
##  [5] httr2_1.0.6                 ggrepel_0.9.6              
##  [7] lattice_0.22-6              Biobase_2.67.0             
##  [9] vctrs_0.6.5                 tools_4.4.2                
## [11] generics_0.1.3              stats4_4.4.2               
## [13] curl_6.0.1                  tibble_3.2.1               
## [15] fansi_1.0.6                 pkgconfig_2.0.3            
## [17] pheatmap_1.0.12             Matrix_1.7-1               
## [19] RColorBrewer_1.1-3          webmockr_1.0.0             
## [21] ggridges_0.5.6              S4Vectors_0.45.2           
## [23] lifecycle_1.0.4             GenomeInfoDbData_1.2.13    
## [25] farver_2.1.2                compiler_4.4.2             
## [27] munsell_0.5.1               GenomeInfoDb_1.43.1        
## [29] htmltools_0.5.8.1           sys_3.4.3                  
## [31] buildtools_1.0.0            sass_0.4.9                 
## [33] yaml_2.3.10                 crayon_1.5.3               
## [35] pillar_1.9.0                jquerylib_0.1.4            
## [37] whisker_0.4.1               SingleCellExperiment_1.29.1
## [39] DelayedArray_0.33.2         cachem_1.1.0               
## [41] abind_1.4-8                 digest_0.6.37              
## [43] labeling_0.4.3              maketools_1.3.1            
## [45] cowplot_1.1.3               fastmap_1.2.0              
## [47] grid_4.4.2                  SparseArray_1.7.2          
## [49] colorspace_2.1-1            cli_3.6.3                  
## [51] magrittr_2.0.3              S4Arrays_1.7.1             
## [53] base64enc_0.1-3             triebeard_0.4.1            
## [55] crul_1.5.0                  utf8_1.2.4                 
## [57] withr_3.0.2                 scales_1.3.0               
## [59] UCSC.utils_1.3.0            rappdirs_0.3.3             
## [61] rmarkdown_2.29              XVector_0.47.0             
## [63] httr_1.4.7                  matrixStats_1.4.1          
## [65] gridExtra_2.3               evaluate_1.0.1             
## [67] knitr_1.49                  GenomicRanges_1.59.1       
## [69] IRanges_2.41.1              rlang_1.1.4                
## [71] urltools_1.7.3              Rcpp_1.0.13-1              
## [73] glue_1.8.0                  httpcode_0.3.0             
## [75] BiocManager_1.30.25         fauxpas_0.5.2              
## [77] BiocGenerics_0.53.3         jsonlite_1.8.9             
## [79] R6_2.5.1                    MatrixGenerics_1.19.0      
## [81] zlibbioc_1.52.0