This vignette focuses on how to upload data to magma via its R-client, magmaR.
Magma is the data warehouse of the UCSF Data Library.
For a deeper overview of the structure of magma and the UCSF Data Library system, please see the download-focused vignette, vignette("Download", package = "magmaR"), or the Data Library’s own documentation at https://mountetna.github.io/magma.html.
This vignette assumes that you have already gone through the download-focused vignette, vignette("Download", package = "magmaR"), which covers how to 1) install magmaR, 2) use a token for authentication, and 3) switch, if needed, between the production / staging / development magma environments.
This vignette focuses on use-cases where a user wishes to push data, from their own system, to magma.
Not all Data Library users have write privileges, so not all magmaR users will have need for this vignette.
For those that do, please note: Sending data to magma is an advanced use-case which needs to be treated with due care. The functions involved have the ability to overwrite data, so it is imperative, for data integrity purposes, that inputs to these functions are double-checked in order to make sure that they target only the intended records & attributes.
Also note that a user’s write privileges are project-specific, so it is unlikely that you will be able to run any code, exactly as it exists in this vignette, without getting an authorization error. (That also means you don’t run the risk of breaking our download vignette by testing out any fun alterations of the code in here… trade-offs =] .)
In general, magmaR functions will make a request to magma via its retrieve, query, or update functions to send or receive desired data.
For upload functions, handling what magma sends back is very simple because the only return from magma will be curl request attributes that indicate whether the call to magma/update worked.
So in this vignette, our singular focus will be on how to input your data so that magmaR can send it to magma properly.
The base function:
updateValues() = a few-frills wrapper of magma’s sole data input function, /update. It’s quite flexible, but that flexibility owes largely to a rigid input structure, so we provide a few wrapper functions which make some common upload needs easier.
Convenient wrapper functions:
updateFromDF() allows update of a set of attributes, of multiple records, via a simple data.frame structure, or via a csv/tsv file encoding such data.
updateMatrix() allows update of a matrix-type attribute, of multiple records, directly from an R matrix, data.frame, or a csv/tsv file encoding such data.
Jumping in, the first thing to know is that all magmaR update functions will:
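Summarize the records that will be created or updated, and prompt the user before proceeding when auto.proceed is set to FALSE.
Report the outcome of the call to magma/update, including any error messages returned.
For example, after running some magmaR update code, you might see a summary like this: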
For model "rna_seq", this update() will create 3 NEW records:
ID1
ID2
ID3
WARNING: Check the above carefully. Once created, there is no easy way to remove records from magma.
For model "biospecimen", this update() will update 2 records:
EXAMPLE-HS1-WB1
EXAMPLE-HS2-WB1
For model "rna_seq", this update() will update 1 records:
EXAMPLE-HS1-WB1-RSQ1
Proceed, Y/n?
It is highly recommended that these summaries be checked carefully for accuracy before proceeding, as updates have the power to overwrite magma data! There is currently no history tracking in magma, so updates cannot be rolled back.
Also, contrary to the “update” portion of the function names, these functions can add totally new data to magma records.
Once created, it is not easy to fully remove records with an incorrect ID.
So, especially if you get a WARNING message like in the example above, we recommend always double-checking the summary output before proceeding!
That said, it is possible to bypass the user-prompt step.
To continue with your upload, simply enter y, yes, or the like, and hit enter to proceed. Or, enter anything else to stop.
After a successful update(), users will see this message (unless verbose has been set to FALSE):
/update: successful.
In cases where you are trying to automate uploads with a script, inputting “yes” to proceed is not possible. In such cases, the user-prompt step can be turned off by adding the input auto.proceed = TRUE.
Of course, extra care should then be taken to ensure proper payload generation.
Example:
revs <- list(
    "biospecimen" = list(
        "EXAMPLE-HS1-WB1" = list(biospecimen_type = "Whole Blood"),
        "EXAMPLE-HS2-WB1" = list(biospecimen_type = "Whole Blood")
        ),
    "rna_seq" = list(
        "EXAMPLE-HS1-WB1-RSQ1" = list(fraction = "Tcells")
        )
    )
updateValues(
    target = prod,
    projectName = "example",
    revisions = revs,
    auto.proceed = TRUE) ## <---
## For model "biospecimen", this update() will update 2 records:
## EXAMPLE-HS1-WB1
## EXAMPLE-HS2-WB1
## For model "rna_seq", this update() will update 1 records:
## EXAMPLE-HS1-WB1-RSQ1
## /update: successful.
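Because auto.proceed = TRUE removes the chance to read the summary before anything is sent, a script may want its own check of the payload first. The following is a minimal sketch using only base R: summarizeRevisions() and expected_models are hypothetical names invented for this illustration and are not part of magmaR.

```r
# Hypothetical helper (not part of magmaR): tabulate what a 'revisions'
# payload targets, one row per model/record pair.
summarizeRevisions <- function(revisions) {
    do.call(rbind, lapply(names(revisions), function(model) {
        records <- revisions[[model]]
        data.frame(
            model = model,
            record = names(records),
            attributes = vapply(
                records,
                function(rec) paste(names(rec), collapse = ", "),
                character(1)),
            row.names = NULL,
            stringsAsFactors = FALSE)
    }))
}

revs <- list(
    "biospecimen" = list(
        "EXAMPLE-HS1-WB1" = list(biospecimen_type = "Whole Blood"),
        "EXAMPLE-HS2-WB1" = list(biospecimen_type = "Whole Blood")
        ),
    "rna_seq" = list(
        "EXAMPLE-HS1-WB1-RSQ1" = list(fraction = "Tcells")
        )
    )

targets <- summarizeRevisions(revs)
print(targets)

# Stop the script early if a model outside the intended set is targeted.
expected_models <- c("biospecimen", "rna_seq")
stopifnot(all(targets$model %in% expected_models))
```

A check like this does not replace the summary printed by the update functions. It only makes a script fail before sending anything when the payload is not what was intended.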
Examples in this vignette will target the same “example” project used in the download vignette.
To refresh, the model map for that project is below.
The “biospecimen” and “rna_seq” models will be our particular targets, and have attributes…
## [1] "subject" "name" "biospecimen_type" "rna_seq"
## [5] "flow"
## [1] "biospecimen" "tube_name" "expression_type" "cell_number"
## [5] "gene_tpm" "gene_counts" "fraction"
As the name suggests, updateMatrix() is meant specifically for matrix data. It allows a user to point magmaR to either a file containing matrix data, or to a readily constructed matrix, without needing to perform the manual conversion of such data to the complicated revisions-input format of the updateValues() function.
Internally, the function performs some necessary validations, adjusts the matrix into the revisions-input format of updateValues(), then passes its payload along to updateValues(), where all the common update functionality is housed.
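To give a feel for that conversion, here is a rough base-R sketch of how a matrix could be reshaped into a revisions-style nested list. It is only an illustration and not magmaR’s actual implementation: matrixToRevisions() is a hypothetical name, and the sketch skips the validations mentioned above.

```r
# Hypothetical sketch (not magmaR's implementation): each column of the
# matrix becomes one record, holding the attribute's values for that record.
matrixToRevisions <- function(mat, modelName, attributeName) {
    stopifnot(!is.null(colnames(mat)), !anyDuplicated(colnames(mat)))
    records <- lapply(colnames(mat), function(id) {
        # attributeName -> this record's column of values
        setNames(list(unname(mat[, id])), attributeName)
    })
    setNames(list(setNames(records, colnames(mat))), modelName)
}

mat <- matrix(
    c(4, 231, 861, 8, 43, 155),
    nrow = 3,
    dimnames = list(
        c("gene1", "gene2", "gene3"),
        c("EXAMPLE-HS10-WB1-RSQ1", "EXAMPLE-HS11-WB1-RSQ1")))

revs <- matrixToRevisions(mat, "rna_seq", "gene_counts")
str(revs)
```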
In addition to the normal target and projectName inputs, modelName and attributeName are also used here to direct where to upload your matrix data.
A separate matrix input takes in the target data. This can be given a matrix or data.frame directly, or a string representing the path to a csv or tsv file containing such data. In all cases, matrix must be formatted to have:
column names that are the identifiers of the target records, and
row names that match the ‘options’ of the target attribute. (Check out ?updateMatrix for one method of finding out what these ‘options’ are.)
When matrix points to a file, the optional separator input can be used to choose between csv (default assumption) and tsv parsing.
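As a small illustration of that layout, the base-R sketch below builds a matrix with record identifiers as column names, writes it to csv and tsv files, and reads both back to confirm the layout survives. The file names and values are made up for the example, and it is an assumption here that the row names should sit in the first column of the file.

```r
counts <- matrix(
    c(4, 231, 861, 8, 43, 155),
    nrow = 3,
    dimnames = list(
        c("gene1", "gene2", "gene3"),
        c("EXAMPLE-HS10-WB1-RSQ1", "EXAMPLE-HS11-WB1-RSQ1")))

csv_path <- file.path(tempdir(), "rna_seq_counts.csv")
tsv_path <- file.path(tempdir(), "rna_seq_counts.tsv")

# write.csv() puts the row names in the first column by default.
write.csv(counts, csv_path)
# For write.table(), col.names = NA adds an empty header cell above the
# row-name column so that header and data rows have the same width.
write.table(counts, tsv_path, sep = "\t", quote = FALSE, col.names = NA)

# Read both files back. check.names = FALSE keeps the hyphens in the
# record identifiers instead of converting them to dots.
from_csv <- as.matrix(read.csv(csv_path, row.names = 1, check.names = FALSE))
from_tsv <- as.matrix(read.delim(tsv_path, row.names = 1, check.names = FALSE))

stopifnot(identical(colnames(from_csv), colnames(counts)))
stopifnot(identical(rownames(from_tsv), rownames(counts)))
```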
Example:
To update the raw counts of our “rna_seq” model from either a csv, a tsv, or directly from a matrix, we could use the code below:
### From a csv
updateMatrix(
    target = prod,
    projectName = "example",
    modelName = "rna_seq",
    attributeName = "gene_counts",
    matrix = "path/to/rna_seq_counts.csv")
### From a tsv, set the 'separator' input to "\t"
updateMatrix(
    target = prod,
    projectName = "example",
    modelName = "rna_seq",
    attributeName = "gene_counts",
    matrix = "path/to/rna_seq_counts.tsv",
    # Use separator to adjust parsing for tab-separated values
    separator = "\t")
### From an already loaded matrix:
matrix <- retrieveMatrix(target = prod, "example", "rna_seq", "all", "gene_counts")
updateMatrix(
    target = prod,
    projectName = "example",
    modelName = "rna_seq",
    attributeName = "gene_counts",
    matrix = matrix)
Let’s explore the structure of matrix a little bit, noting that column names are record identifiers and row names are the gene names:
## EXAMPLE-HS10-WB1-RSQ1 EXAMPLE-HS11-WB1-RSQ1
## gene1 4 8
## gene2 231 43
## gene3 861 155
## gene4 2077 427
## gene5 3 2
## gene6 0 0
updateFromDF() allows easy update of a set of attributes of multiple records, via a rectangular data input structure.
Utility: When aiming to update the same set of attributes for many different records, which is probably the most common case when wanting to upload data, use this function.
Internally, the function performs some necessary validations, adjusts the df into the rigid revisions-input format of updateValues(), then passes its payload along to updateValues(), where all the common update functionality is housed.
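The reshaping described here can be pictured with a short base-R sketch. It is an illustration only and not magmaR’s actual code: dfToRevisions() is a hypothetical name, and the sketch assumes the first column holds the record identifiers and every other column is an attribute.

```r
# Hypothetical sketch (not magmaR's implementation): each row becomes one
# record, and every column after the first becomes attribute = value.
dfToRevisions <- function(df, modelName) {
    ids <- as.character(df[[1]])
    stopifnot(!anyDuplicated(ids))
    records <- lapply(seq_len(nrow(df)), function(i) {
        as.list(df[i, -1, drop = FALSE])
    })
    setNames(list(setNames(records, ids)), modelName)
}

df <- data.frame(
    tube_name = c("EXAMPLE-HS1-WB1-RSQ1", "EXAMPLE-HS2-WB1-RSQ1"),
    cell_number = c(50000, 50000),
    fraction = c("Tcells", "Tcells"),
    stringsAsFactors = FALSE)

revs <- dfToRevisions(df, "rna_seq")
str(revs)
```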
In addition to the normal target and projectName inputs, a modelName is also needed here in order to direct where to upload your data.
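A separate df input then takes in the target data. df can be given as a data.frame directly, or as a string representing the path to a csv or tsv file containing such data. In all cases, df must be formatted to have:
one row per record, with the first column holding the record identifiers, and
the remaining columns named for the attributes to update, each holding the values for that attribute.
When df points to a file, the optional separator input can be used to choose between csv (default assumption) and tsv parsing.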
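Since the update functions can overwrite data, it may be worth checking a data.frame against these expectations before handing it over. Below is a minimal base-R sketch of such a check. checkUpdateDF() is a hypothetical helper, not a magmaR function, and the attribute names are simply copied from the “rna_seq” attribute listing shown earlier in this vignette.

```r
# Attribute names of the "rna_seq" model, as listed earlier in this vignette.
rna_seq_attributes <- c(
    "biospecimen", "tube_name", "expression_type", "cell_number",
    "gene_tpm", "gene_counts", "fraction")

# Hypothetical helper (not part of magmaR): returns a character vector of
# problems found, or character(0) if the data.frame looks fine.
checkUpdateDF <- function(df, attributes) {
    ids <- df[[1]]
    problems <- character(0)
    if (anyNA(ids) || any(!nzchar(as.character(ids)))) {
        problems <- c(problems, "missing identifiers in the first column")
    }
    if (anyDuplicated(ids)) {
        problems <- c(problems, "duplicated identifiers in the first column")
    }
    unknown <- setdiff(names(df)[-1], attributes)
    if (length(unknown) > 0) {
        problems <- c(problems, paste(
            "unknown attribute columns:", paste(unknown, collapse = ", ")))
    }
    problems
}

good <- data.frame(
    tube_name = c("EXAMPLE-HS1-WB1-RSQ1", "EXAMPLE-HS2-WB1-RSQ1"),
    fraction = c("Tcells", "Tcells"),
    stringsAsFactors = FALSE)
bad <- data.frame(
    tube_name = c("EXAMPLE-HS1-WB1-RSQ1", "EXAMPLE-HS1-WB1-RSQ1"),
    fractoin = c("Tcells", "Tcells"),  # misspelled on purpose
    stringsAsFactors = FALSE)

print(checkUpdateDF(good, rna_seq_attributes))
print(checkUpdateDF(bad, rna_seq_attributes))
```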
Example:
To update multiple attributes of multiple records of our “rna_seq” model from either a csv, a tsv, or directly from a data.frame, we could use the code below:
### From a csv
updateFromDF(
    target = prod,
    projectName = "example",
    modelName = "rna_seq",
    df = "path/to/rna_seq_attributes.csv")
### From a tsv, set the 'separator' input to "\t"
updateFromDF(
    target = prod,
    projectName = "example",
    modelName = "rna_seq",
    df = "path/to/rna_seq_attributes.tsv",
    # Use separator to adjust parsing for tab-separated values
    separator = "\t")
### From an already loaded data.frame:
df <- retrieve(target = prod, "example", "rna_seq", "all",
               c("tube_name", "cell_number", "fraction"))
updateFromDF(
    target = prod,
    projectName = "example",
    modelName = "rna_seq",
    df = df)
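Let’s explore the structure of df a little bit, noting that the first column holds the record identifiers and the remaining columns are named for attributes of the model: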
## tube_name cell_number fraction
## 1 EXAMPLE-HS10-WB1-RSQ1 50000 Tcells
## 2 EXAMPLE-HS11-WB1-RSQ1 50000 Tcells
## 3 EXAMPLE-HS12-WB1-RSQ1 50000 Tcells
## 4 EXAMPLE-HS1-WB1-RSQ1 50000 Tcells
## 5 EXAMPLE-HS2-WB1-RSQ1 50000 Tcells
## 6 EXAMPLE-HS3-WB1-RSQ1 50000 Tcells
One final note: The identifier column’s name, though accurately the name of the identifier attribute here, does not actually matter. ‘tube_name’ is the name of the identifier attribute of this “rna_seq” model, but we could have named the column anything. This is an intended feature as it allows users to adjust records’ identifiers with this function by providing a separate data column, of new identifiers, with the identifier attribute name as its name!
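To picture that last point, here is a sketch of what such a renaming data.frame might look like. The column name current_id and the new identifiers ending in RSQ2 are invented for this illustration, and the two checks are just sanity checks a user might want, not requirements stated by magmaR.

```r
# First column: the records' current identifiers (the column name is arbitrary).
# 'tube_name' column: the new identifiers, named for the identifier attribute.
renames <- data.frame(
    current_id = c("EXAMPLE-HS1-WB1-RSQ1", "EXAMPLE-HS2-WB1-RSQ1"),
    tube_name = c("EXAMPLE-HS1-WB1-RSQ2", "EXAMPLE-HS2-WB1-RSQ2"),
    stringsAsFactors = FALSE)

# Sanity checks before sending: new identifiers are unique and none of them
# repeats one of the current identifiers.
stopifnot(!anyDuplicated(renames$tube_name))
stopifnot(!any(renames$tube_name %in% renames$current_id))

print(renames)

# The data.frame could then be given to updateFromDF() as in the examples above:
# updateFromDF(
#     target = prod,
#     projectName = "example",
#     modelName = "rna_seq",
#     df = renames)
```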
updateValues() is the main workhorse function of magmaR’s data upload capabilities. It largely mimics magma/update, where each target model, record, and attribute are indicated individually, thus giving very flexible control. The function relies on a rigid nested list structure for providing such data. This revisions input structure can feel a bit clunky, but is ultimately very powerful.
Utility: Useful for smaller updates, and for any updates that cannot be handled with the updateFromDF() and updateMatrix() wrappers.
In addition to the normal target and projectName inputs, revisions is the primary input of updateValues():
revisions includes information about which model(s), which record(s), and which attribute(s) to update, and with what value(s). Each of these levels is encoded as a nested list, where the format looks something like:
revisions = list(
    modelName = list(
        recordName = list(
            attributeName = value(s)
            )
        )
    )
To make more than one update within a single call, you can simply add an additional index at any of these levels.
For example, the below revisions would encode some very different updates:
# 2 attributes for the same record
revisions = list(
    modelName = list(
        recordName = list(
            attributeName1 = value(s),
            attributeName2 = value(s)
            )
        )
    )
# The same attribute for 2 different records
revisions = list(
    modelName = list(
        recordName1 = list(
            attributeName1 = value(s)
            ),
        recordName2 = list(
            attributeName1 = value(s)
            )
        )
    )
# Some attribute for 2 different records of two different models
revisions = list(
    modelName1 = list(
        recordName1 = list(
            attributeName1 = value(s)
            )
        ),
    modelName2 = list(
        recordName2 = list(
            attributeName2 = value(s)
            )
        )
    )
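Typing these nested lists by hand gets tedious once more than a few records are involved. As a sketch of one alternative, the base-R code below builds the same kind of structure programmatically. The record identifiers and values are taken from this vignette’s “example” project, and nothing here calls magmaR itself.

```r
# Build the same attribute = value entry for several records at once.
record_ids <- c("EXAMPLE-HS1-WB1", "EXAMPLE-HS2-WB1")
biospecimen_revs <- setNames(
    lapply(record_ids, function(id) list(biospecimen_type = "Whole Blood")),
    record_ids)

# Combine models into a single 'revisions' list.
revs <- list(
    biospecimen = biospecimen_revs,
    rna_seq = list(
        "EXAMPLE-HS1-WB1-RSQ1" = list(fraction = "Tcells")
        )
    )

# Adding another attribute to an existing record is an assignment one level down.
revs$rna_seq[["EXAMPLE-HS1-WB1-RSQ1"]]$cell_number <- 50000

# str() gives a compact view of models > records > attributes for review.
str(revs)
```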
Say we wanted to update the “biospecimen_type” attribute of 2 records from the “biospecimen” model, and the “fraction” attribute for 1 record from the “rna_seq” model. The code for this could be:
# Create 'revisions'
revs <- list(
    "biospecimen" = list(
        "EXAMPLE-HS1-WB1" = list(biospecimen_type = "Whole Blood"),
        "EXAMPLE-HS2-WB1" = list(biospecimen_type = "Whole Blood")
        ),
    "rna_seq" = list(
        "EXAMPLE-HS1-WB1-RSQ1" = list(fraction = "Tcells")
        )
    )
# Run update()
updateValues(
    target = prod,
    projectName = "example",
    revisions = revs)
A user would then see a summary of models/records to be updated, followed by a prompt to proceed or not:
For model "biospecimen", this update() will update 2 records:
EXAMPLE-HS1-WB1
EXAMPLE-HS2-WB1
For model "rna_seq", this update() will update 1 records:
EXAMPLE-HS1-WB1-RSQ1
Proceed, Y/n?
As mentioned previously, it is highly recommended that summary outputs be checked carefully for accuracy before proceeding.
However, for running update*() code in non-interactive modes, like scripts or .Rmd knits, this user-prompt step can also be turned off by adding the input auto.proceed = TRUE.
Example:
## For model "biospecimen", this update() will update 2 records:
## EXAMPLE-HS1-WB1
## EXAMPLE-HS2-WB1
## For model "rna_seq", this update() will update 1 records:
## EXAMPLE-HS1-WB1-RSQ1
## /update: successful.
After a successful update(), a user should see this message (unless verbose has been set to FALSE):
/update: successful.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dittoSeq_1.19.0 ggplot2_3.5.1 magmaR_1.0.3 vcr_1.6.0
## [5] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] SummarizedExperiment_1.37.0 gtable_0.3.6
## [3] xfun_0.49 bslib_0.8.0
## [5] httr2_1.0.6 ggrepel_0.9.6
## [7] lattice_0.22-6 Biobase_2.67.0
## [9] vctrs_0.6.5 tools_4.4.2
## [11] generics_0.1.3 stats4_4.4.2
## [13] curl_6.0.1 tibble_3.2.1
## [15] fansi_1.0.6 pkgconfig_2.0.3
## [17] pheatmap_1.0.12 Matrix_1.7-1
## [19] RColorBrewer_1.1-3 webmockr_1.0.0
## [21] ggridges_0.5.6 S4Vectors_0.45.2
## [23] lifecycle_1.0.4 GenomeInfoDbData_1.2.13
## [25] farver_2.1.2 compiler_4.4.2
## [27] munsell_0.5.1 GenomeInfoDb_1.43.1
## [29] htmltools_0.5.8.1 sys_3.4.3
## [31] buildtools_1.0.0 sass_0.4.9
## [33] yaml_2.3.10 crayon_1.5.3
## [35] pillar_1.9.0 jquerylib_0.1.4
## [37] whisker_0.4.1 SingleCellExperiment_1.29.1
## [39] DelayedArray_0.33.2 cachem_1.1.0
## [41] abind_1.4-8 digest_0.6.37
## [43] labeling_0.4.3 maketools_1.3.1
## [45] cowplot_1.1.3 fastmap_1.2.0
## [47] grid_4.4.2 SparseArray_1.7.2
## [49] colorspace_2.1-1 cli_3.6.3
## [51] magrittr_2.0.3 S4Arrays_1.7.1
## [53] base64enc_0.1-3 triebeard_0.4.1
## [55] crul_1.5.0 utf8_1.2.4
## [57] withr_3.0.2 scales_1.3.0
## [59] UCSC.utils_1.3.0 rappdirs_0.3.3
## [61] rmarkdown_2.29 XVector_0.47.0
## [63] httr_1.4.7 matrixStats_1.4.1
## [65] gridExtra_2.3 evaluate_1.0.1
## [67] knitr_1.49 GenomicRanges_1.59.1
## [69] IRanges_2.41.1 rlang_1.1.4
## [71] urltools_1.7.3 Rcpp_1.0.13-1
## [73] glue_1.8.0 httpcode_0.3.0
## [75] BiocManager_1.30.25 fauxpas_0.5.2
## [77] BiocGenerics_0.53.3 jsonlite_1.8.9
## [79] R6_2.5.1 MatrixGenerics_1.19.0
## [81] zlibbioc_1.52.0