Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: aggregate by sample (i.e. column) #188

Open
csdaw opened this issue May 18, 2023 · 4 comments
Open

Feature request: aggregate by sample (i.e. column) #188

csdaw opened this issue May 18, 2023 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@csdaw
Copy link

csdaw commented May 18, 2023

Hi there,

I'm not sure if this functionality already exists, but I'd like to propose a function aggregateSamples() which is companion to aggregateFeatures(), and would do essentially the same thing but would combine columns instead of rows.
I'm currently using this to calculate log2 SILAC ratios, but it would have other general uses such as calculating row means.

Thanks!

Functions

suppressPackageStartupMessages({
  library(SummarizedExperiment)
  library(QFeatures)
})

# Define 'column-wise' version of MsCoreUtils::aggregate_by_vector()
aggregate_cols_by_vector <- function(x, INDEX, FUN, ...) {
  ## Input checks go here
  
  FUN <- match.fun(FUN)
  
  res <- lapply(
    split(seq_len(ncol(x)), INDEX),
    FUN = function(j) FUN(x[, j, drop = FALSE], ...)
  )
  nms <- names(res)
  res <- do.call(cbind, res)
  colnames(res) <- nms
  rownames(res) <- rownames(x)
  
  ## HDF5Matrix check goes here
  res
}

# .aggregateSamples, essentally a 'column-wise' version of .aggregateQFeatures
.aggregateSamples <- function(object, scol, fun, ...) {
  if (missing(scol))
    stop("'scol' is required")
  m <- assay(object, 1)
  cd <- colData(object)
  if (!scol %in% names(cd))
    stop("'scol' not found in the assay's colData.")
  groupBy <- cd[[scol]]
  
  ## Store class of assay i in case it is not a SummarizedExperiment
  ## so that the aggregated assay can be reverted to that class
  .class <- class(object)
  
  ## Message about NA values in quant/col data goes here...
  
  if (is.vector(groupBy)) {
    aggregated_assay <- aggregate_cols_by_vector(m, groupBy, fun, ...)
    aggcount_assay <- aggregate_cols_by_vector(m, groupBy, MatrixGenerics::rowCounts)
    aggregated_coldata <- QFeatures::reduceDataFrame(cd, cd[[scol]],
                                                     simplify = TRUE,
                                                     drop = TRUE,
                                                     count = TRUE)
    assays <- SimpleList(assay = aggregated_assay, aggcounts = aggcount_assay)
    coldata <- aggregated_coldata[colnames(aggregated_assay), , drop = FALSE]
  } else {
    stop("'scol' must be a vector")
  }
  se <- SummarizedExperiment(assays = assays,
                             colData = coldata,
                             rowData = rowData(object))
  
  ## SummarizedExperiment class check goes here...
  
  return(se)
}

# Define generic function and QFeatures method
setGeneric("aggregateSamples", function(object, ...) standardGeneric("aggregateSamples"))
#> [1] "aggregateSamples"

setMethod("aggregateSamples", "QFeatures",
          function(object, i, scol, name = "newAssay",
                   fun = base::rowSums, ...) {
            if (isEmpty(object))
              return(object)
            ## Check arguments
            if (any(present <- name %in% names(object)))
              stop("There's already one or more assays named: '",
                   paste0(name[present], collapse = "', '"), "'.")
            i <- QFeatures:::.normIndex(object, i)
            if (length(i) != length(name)) stop("'i' and 'name' must have same length")
            if (length(scol) == 1) scol <- rep(scol, length(i))
            if (length(i) != length(scol)) stop("'ii and 'scol' must have same length")
            ## Aggregate each assay
            for (j in seq_along(i)) {
              from <- i[[j]]
              to <- name[[j]]
              by <- scol[[j]]
              
              ## Create the aggregated assay
              aggAssay <- .aggregateSamples(object[[from]],
                                            by, fun, ...)
              ## Add the assay to the QFeatures object
              object <- addAssay(object, aggAssay, name = to)
              ## Link the input assay to the aggregated assay
              ## No new rows created, therefore linking is automatic!
            }
            object
          })

Usage example

# Make example SummarizedExperiment
m1 <- matrix(
  rep(1:4, each = 5),
  ncol = 4
)

sample_names <- c("c1", "c2", "t1", "t2")
feature_names <- paste0("pep", 1:5)

colnames(m1) <- sample_names
rownames(m1) <- feature_names

cd1 <- data.frame(
  treatment = rep(c("control", "treatment"), each = 2)
)

rownames(cd1) <- sample_names

se1 <- SummarizedExperiment(m1, colData = cd1)

assay(se1)
#>      c1 c2 t1 t2
#> pep1  1  2  3  4
#> pep2  1  2  3  4
#> pep3  1  2  3  4
#> pep4  1  2  3  4
#> pep5  1  2  3  4
colData(se1)
#> DataFrame with 4 rows and 1 column
#>      treatment
#>    <character>
#> c1     control
#> c2     control
#> t1   treatment
#> t2   treatment

# Make example QFeatures
qf1 <- QFeatures(list(peptides = se1))

qf1
#> An instance of class QFeatures containing 1 assays:
#>  [1] peptides: SummarizedExperiment with 5 rows and 4 columns

# Aggregate samples
qf1 <- aggregateSamples(qf1, i = "peptides", scol = "treatment", name = "sums", fun = rowSums)

qf1
#> An instance of class QFeatures containing 2 assays:
#>  [1] peptides: SummarizedExperiment with 5 rows and 4 columns 
#>  [2] sums: SummarizedExperiment with 5 rows and 2 columns
assay(qf1[['sums']])
#>      control treatment
#> pep1       3         7
#> pep2       3         7
#> pep3       3         7
#> pep4       3         7
#> pep5       3         7
@lgatto lgatto added the enhancement New feature or request label May 21, 2023
@lgatto
Copy link
Member

lgatto commented May 21, 2023

Hi Charlotte.

Thank you for the feature request - I indeed see some utility in the request, even though I am not sure that the current suggestion/implementation is adapted to QFeatures instances:

  1. One fundamental feature of the aggregateFeatures() function and the whole QFeatures class is that the link between features are recorded.
  2. After calling aggregateFeatures(), we still have features (precursors, peptides, proteins), and don't fundamentally change the nature of these rows.

Both of these rules would be broken with your request:

  1. There is no tracking along the columns (i.e. AssayLink)
  2. We don't have samples anymore.

The second point is merely a change in semantics (that would need some reflection), but the former seems essential to comply with the spirit of the package. It would be possible to implement (but probably not trivial), but we don't have the time at the moment and given point 2, I am not convinced it is advised (without more motivated use cases, at least).

However, in the light of the discussion above, I think it would make sense to add your request to the packge and to return a SummarizedExperiment, and leave it to the user to decide what to do with it, i.e. either continue with the SE, or start a new QFeatures object with a new set of columns/(aggregated) samples.

-> added to the TODO list.

@csdaw
Copy link
Author

csdaw commented May 25, 2023

Thank you for taking a look! On your points:

  1. I agree it would be more time/trouble than its worth to extend AssayLink to work on columns as well.

  2. I just wasn't sure there is a standard way to refer to 'columns in an assay' so I defaulted to 'sample' which seems to be what is used in MultiAssayExperiment docs.

I hope to have some time to submit a PR once I've wrapped up experiments.

@cvanderaa
Copy link
Collaborator

cvanderaa commented May 14, 2024

Hello @csdaw,

Thanks a lot for this concrete suggestion and implementation. I wanted you to know that I couldn't find the time to look into it yet, but this request is very relevant and I did not forget it.

To complement the discussion, here are two additional use cases where aggregating samples would come very handy:

  1. pseudo-bulking single-cell data, aggregate multiple cells within a cluster (eg cell type) within a patient (or other biological replication). This could facilitate assessing patient-level contrasts from single-cell data.
  2. Aggregate technical replicates (eg same sample acquired twice through LC-MS) into a single observation.

I'll be working on a project to evaluate the performance 1., so I plan to have a look at this issue soon.

@csdaw
Copy link
Author

csdaw commented May 30, 2024

Thanks @cvanderaa, good to hear there are other applications this could facilitate. I'm happy to test anything you come up with if that helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants