Normalization issues with data imported from bam files

perinom commented 3 years ago

Installed smoothly on 21/09/2021 after #192 had been addressed

Unfortunately, exp <- normalize_counts(exp, data_type='tss', method="DESeq2") fails with:

Warning in `[.data.table`(x, , !c("normalized_score")) : 
column(s) not removed because not found: [normalized_score]

exp <- normalize_counts(exp, data_type='tss', method="edgeR") fails with:

Aggregate function missing, defaulting to 'length'
Warning in `[.data.table`(x, , !c("normalized_score")) :  column(s) not removed because not found: [normalized_score]

exp <- normalize_counts(exp, data_type='tss', method="CPM") fails with:

Aggregate function missing, defaulting to 'length'

In all three cases the slot with normalised data is missing; only the raw counts are available in the exp object.
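
For reference, the "Aggregate function missing, defaulting to 'length'" message is what data.table's dcast() emits when it reshapes long data that has more than one row per row/column combination and no fun.aggregate is supplied. Whether normalize_counts() reshapes the counts this way internally is an assumption here, not something confirmed in this thread, and the table and column names below are made up for illustration. A minimal sketch of the behaviour:

```r
library(data.table)

# Hypothetical long-format counts: position chr1:100 appears twice for sample s1.
long <- data.table(
  position = c("chr1:100", "chr1:100", "chr1:200"),
  sample   = c("s1", "s1", "s1"),
  score    = c(5, 7, 3)
)

# Without fun.aggregate, dcast() falls back to length(), so each cell holds
# the number of rows per combination rather than the scores themselves.
wide_default <- suppressWarnings(suppressMessages(
  dcast(long, position ~ sample, value.var = "score")
))

# Supplying an explicit aggregate keeps the cells meaningful.
wide_summed <- dcast(long, position ~ sample, value.var = "score",
                     fun.aggregate = sum)
```

If this is what happens during normalization, duplicated positions within a sample would be worth checking for before calling normalize_counts().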

Using the included sample BAM, method="DESeq2" and method="edgeR" fail with:

Error: count_matrix is not a matrix

method="CPM" fails with:

Warning in eval(jsub, SDenv, parent.frame()) :
  NAs introduced by coercion
Warning in `[.data.table`(x, , !c("normalized_score")) :
  column(s) not removed because not found: [normalized_score]

gzentner commented 2 years ago

Hi there, sorry for the long silence (we both have new jobs and have been quite busy!). I ran through the workflow using some of our in-house BAMs and while I do get

Warning in `[.data.table`(x, , !c("normalized_score")) : column(s) not removed because not found: [normalized_score]

when normalizing, the normalized counts are there. There isn't a specific slot for the normalized counts; rather, they are included in exp@counts$TSSs$raw to avoid duplicating all the information associated with the raw counts. We will discuss renaming that slot to avoid confusion.
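
The warning can be reproduced with data.table alone: dropping a column that does not exist warns but returns the table unchanged, which would be consistent with the warning being harmless the first time counts are normalized. A minimal sketch, where the table x and its columns are hypothetical stand-ins rather than real package output:

```r
library(data.table)

# Hypothetical per-sample table that has not been normalized yet.
x <- data.table(seqnames = "chr1", start = c(100L, 200L), score = c(5, 3))

# Dropping an absent column triggers the warning; tryCatch() is used here
# only to capture the warning text instead of printing it.
msg <- tryCatch(
  {
    x[, !c("normalized_score")]
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)

# The result still contains every original column.
result <- suppressWarnings(x[, !c("normalized_score")])
```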

I do get the same errors when working through the data from the vignette; we will look into that. Thanks for your patience!

perinom commented 2 years ago

Alright, thanks for the clarification, I'll double check the values before and after.
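
One way to run that before/after check by hand, assuming method="CPM" means plain counts per million (each count divided by the sample total, times 1e6) and that each sample is a data.table with score and normalized_score columns. The table below is a made-up stand-in, not package output:

```r
library(data.table)

# Stand-in for one sample's table after normalization (illustrative values).
tss <- data.table(
  start            = c(100L, 200L, 300L),
  score            = c(10, 30, 60),
  normalized_score = c(1e5, 3e5, 6e5)
)

# Recompute CPM independently from the raw scores.
expected_cpm <- tss$score / sum(tss$score) * 1e6

# TRUE when the stored column agrees with the hand-computed values.
matches <- isTRUE(all.equal(tss$normalized_score, expected_cpm))
```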

I was confused because many downstream functions have an argument for using normalized counts, which led me to expect the raw data to be stored next to the normalised ones so that the functions could choose between them depending on the call.

Since the raw data are overwritten, should I expect downstream functions to use normalized counts even with normalized = FALSE, which is the default in all of them?

perinom commented 2 years ago

I do see the normalized_score column, indeed.

However

exp <- apply_threshold(exp,
                       threshold = 5,
                       n_samples = 1,
                       use_normalized = TRUE) # default FALSE

results in Error in eval(jsub, SDenv, parent.frame()) : object 'normalized_score' not found

which, combined with the error from normalize_counts() in the original message, led me to think the issue was with the normalization function.

use_normalized = FALSE runs w/o issues but I'm a bit hesitant to proceed as I'm not sure what's being used here.

Please let me know if you would prefer to have this as a separate issue.

gzentner commented 2 years ago

Both "score" and "normalized_score" are present in the exp@counts$TSSs$raw slot and so the raw data isn't overwritten; however, I do get the same issue when attempting to apply the threshold. We'll sort it out.
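
A quick way to confirm which samples carry both columns, assuming exp@counts$TSSs$raw is a named list of per-sample data.tables (an assumption about the object layout; the mock list and sample names below are hypothetical):

```r
library(data.table)

# Mock of a named list of per-sample tables, standing in for exp@counts$TSSs$raw.
raw <- list(
  sample_1 = data.table(start = c(100L, 200L), score = c(5, 3),
                        normalized_score = c(625000, 375000)),
  sample_2 = data.table(start = c(100L, 200L), score = c(2, 6))
)

# Flag each sample by whether both columns are present.
needed <- c("score", "normalized_score")
has_both <- vapply(raw, function(dt) all(needed %in% names(dt)), logical(1))

# Names of samples that still lack the normalized column.
missing_norm <- names(has_both)[!has_both]
```

With a real object, raw would be replaced by exp@counts$TSSs$raw; any sample listed in missing_norm would explain an "object 'normalized_score' not found" error downstream.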

zentnerlab / TSRexploreR

Normalization issues with data imported from bam files #193