Questions about normalization

Xueyao0830 commented 4 months ago

Hi,

Thank you for the fantastic work. However, I am confused about some details:

I checked TableS2, and the per-sample sums are different. Why would that happen? If I understand correctly, every sample should have the same total count after normalization.

So I renormalized it myself. Here is my code:


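# Drop rows with identical counts (keeps the first occurrence of each profile)
counts_t1_copy = counts_t1.drop_duplicates()
counts_t1_copy['total_reads'] = counts_t1_copy.sum(axis=1)
# Keep only samples with at least one microbial read
counts_t1_copy = counts_t1_copy[counts_t1_copy['total_reads'] > 0]

# Scale each sample to counts per million, then drop the helper column
counts_cpm = counts_t1_copy.div(counts_t1_copy['total_reads'], axis=0).iloc[:, 0:-1] * 1e6

# total_reads is already gone, so sum over all remaining columns
cpm_sums = counts_cpm.sum(axis=1)

cpm_sums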


The results are:
![image](https://github.com/yge15/TCGA_Microbial_Content/assets/126157528/b702cf17-2529-4e92-be97-3810bddfd8bd)

I would be very glad if you could correct my understanding.

3. I checked **TableS1** and found 12 duplicated rows, 8 of which have 0 total counts. Why would that happen? It seems very rare to me.
4. Regarding TableS3, did you correct the counts for each species by kilobases of genome? And how can I get the genome sizes across species?

I would appreciate it a lot if you could help me answer these questions!

Best,

yge15 commented 4 months ago

Hi,

TableS2 was normalized by taking TableS1 (raw counts) and dividing by the total number of reads in each sample (in millions), which can be found in TableS12. The row-wise sums in TableS2 therefore simply reflect the different proportions of microbial content in each sample.
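To make the arithmetic concrete, here is a minimal sketch of that normalization on toy data. The sample names, taxon names, and read totals are invented for illustration and are not taken from TableS1 or TableS12.

```python
import pandas as pd

# Toy raw microbial read counts (stand-in for TableS1): rows = samples, columns = taxa.
raw_counts = pd.DataFrame(
    {"taxon_a": [10, 0, 300], "taxon_b": [30, 0, 100]},
    index=["sample_1", "sample_2", "sample_3"],
)

# Toy total sequenced reads per sample (stand-in for the totals in TableS12).
total_reads = pd.Series(
    [20_000_000, 50_000_000, 40_000_000],
    index=raw_counts.index,
)

# Divide each row by that sample's total reads expressed in millions.
normalized = raw_counts.div(total_reads / 1e6, axis=0)

# Each row sum is microbial reads per million total reads, so it varies by sample.
row_sums = normalized.sum(axis=1)
print(row_sums)
```

Because the denominator is the total number of sequenced reads rather than the microbial total, the row sums are not expected to be equal across samples.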
I would NOT call drop_duplicates() on TableS1. Each sample has a unique identifier according to TCGA, which is provided in TableS12. It is not reasonable to drop samples just because they share the same microbial profile.
Again, regarding the "duplicates", and more specifically the all-zero rows: those samples do NOT contain any microbes according to our analysis. However, sterile samples should NOT be taken out of the game! For example, if you build ML models predicting one cancer vs. all others, or tumor vs. normal, you MUST take those microbe-free samples into account when evaluating performance!
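If it helps, identical profiles and all-zero rows can be inspected without removing any samples. This is only a sketch on an invented toy table; the sample and taxon names are hypothetical.

```python
import pandas as pd

# Toy count table: sample_2 and sample_3 are both all-zero,
# sample_1 and sample_4 happen to share the same non-zero profile.
counts = pd.DataFrame(
    {"taxon_a": [5, 0, 0, 5], "taxon_b": [1, 0, 0, 1]},
    index=["sample_1", "sample_2", "sample_3", "sample_4"],
)

# Flag samples with no microbial reads at all.
is_all_zero = (counts == 0).all(axis=1)

# Flag every sample whose profile also occurs in another sample.
shares_profile = counts.duplicated(keep=False)

# Rows that drop_duplicates() would have removed (all but the first of each profile).
n_would_be_dropped = int(counts.duplicated().sum())

report = pd.DataFrame({"all_zero": is_all_zero, "shares_profile": shares_profile})
print(report)
print("samples kept:", len(counts), "| would be dropped:", n_would_be_dropped)
```

The flags are kept alongside the data, so every sample identifier is still available for downstream evaluation.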
TableS13 provides a list of RefSeq accessions for the genomes we used to build the Microbial2023 database. Using this assembly summary https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt you can find their corresponding genome sizes.
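A rough sketch of that lookup with pandas follows. It assumes the summary file begins with one `##` comment line followed by a tab-separated header line starting with `#assembly_accession`, that it exposes a `genome_size` column, and that the TableS13 accessions match the `assembly_accession` column. Please check these assumptions against the README that accompanies the file. The helper name, the accessions, and the sizes below are made up so that the example runs offline.

```python
import csv
import io

import pandas as pd

# Made-up excerpt that mimics the assumed layout of assembly_summary_refseq.txt.
SAMPLE = (
    "##  See the README for column descriptions (made-up excerpt)\n"
    "#assembly_accession\torganism_name\tgenome_size\n"
    "GCF_000000001.1\tExample organism A\t5000000\n"
    "GCF_000000002.1\tExample organism B\t2500000\n"
)


def genome_sizes(handle, accessions):
    """Return genome sizes (bp) for the requested accessions (hypothetical helper)."""
    summary = pd.read_csv(
        handle,
        sep="\t",
        skiprows=1,              # skip the leading "##" comment line
        dtype=str,
        quoting=csv.QUOTE_NONE,  # treat quote characters as plain text
    )
    # The header line starts with "#", so strip it from the column names.
    summary = summary.rename(columns=lambda name: name.lstrip("#").strip())
    wanted = summary[summary["assembly_accession"].isin(accessions)]
    return wanted.set_index("assembly_accession")["genome_size"].astype(int)


sizes = genome_sizes(io.StringIO(SAMPLE), ["GCF_000000002.1"])
print(sizes)
```

For real use, the in-memory sample would be replaced by a downloaded copy of the summary file and the accession list by the entries from TableS13.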

Xueyao0830 commented 4 months ago

Thank you very much for your answer! Now I understand it better.
