Thank you for the fantastic work. However, I am confused about some details:
I checked TableS2, and the row sums differ from sample to sample.
Why does this happen? If I understand correctly, every sample should have the same total count after normalization.
The results are:
![image](https://github.com/yge15/TCGA_Microbial_Content/assets/126157528/b702cf17-2529-4e92-be97-3810bddfd8bd)
I would be very glad if you could correct my understanding.
3. I checked **TableS1** and found 12 duplicates, 8 of which have a total count of 0. Why would this happen? It seems very unlikely to me.
4. Regarding TableS3, did you correct it for each species by kilobases of genome? And how did you obtain the genome sizes across species?
I would appreciate it a lot if you could help me answer these questions!
Best,
TableS2 was normalized by taking TableS1 (raw counts) and dividing by the total number of reads in each sample (in millions), which can be found in TableS12. The row-wise sum in TableS2 therefore simply reflects the proportion of microbial content in each sample, and that proportion differs between samples.
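To make this concrete, here is a minimal sketch with made-up numbers. The sample names, taxa, and read counts are hypothetical stand-ins for TableS1 and the total-reads column of TableS12, not values from the actual tables.

```python
import pandas as pd

# Hypothetical raw microbial counts (stand-in for TableS1): rows are samples, columns are taxa.
raw = pd.DataFrame(
    {"taxon_A": [10, 200], "taxon_B": [30, 600]},
    index=["sample_1", "sample_2"],
)

# Hypothetical total sequenced reads per sample (stand-in for the TableS12 column).
total_reads = pd.Series([40_000_000, 80_000_000], index=raw.index)

# Divide each row by that sample's total reads, expressed in millions.
normalized = raw.div(total_reads / 1e6, axis=0)

# Each row sum is "microbial reads per million total reads",
# so it varies with how much microbial content the sample has.
row_sums = normalized.sum(axis=1)
print(row_sums)
```

The row sums would only be identical across samples if each row were divided by its own microbial total rather than by the sample's total read count.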
I would NOT attempt to `drop_duplicates()` on TableS1. Each sample has a unique identifier assigned by TCGA, which is provided in TableS12. It is not reasonable to drop samples just because they share the same microbial profile.
Regarding the "duplicates" again, and more specifically the all-zero rows: these are samples that do NOT contain any microbes according to our analysis. However, sterile samples should NOT be taken out of the game! If you build ML models predicting one cancer vs. all others, or tumor vs. normal, you MUST evaluate the performance with those microbe-free samples included!
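As a rough illustration, assuming the sample identifiers are used as the DataFrame index, you can flag identical or all-zero profiles without removing any rows. The IDs and counts below are invented for the example.

```python
import pandas as pd

# Hypothetical raw counts; sample IDs and values are invented for illustration.
counts = pd.DataFrame(
    {"taxon_A": [0, 5, 0, 5], "taxon_B": [0, 2, 0, 2]},
    index=["sample_1", "sample_2", "sample_3", "sample_4"],
)

# Rows with identical profiles are still distinct samples: the index stays unique.
ids_unique = counts.index.is_unique

# Flag (rather than drop) rows that share a profile with another sample.
same_profile = counts.duplicated(keep=False)

# Flag samples in which no microbial reads were detected.
microbe_free = counts.sum(axis=1) == 0

print(ids_unique, int(same_profile.sum()), int(microbe_free.sum()))
```

All four samples stay in `counts`, so the microbe-free ones remain available when evaluating a model.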
Hi,
So I renormalized the raw counts myself; here is my code:

    counts_cpm = counts_t1_copy.drop(columns='total_reads').div(counts_t1_copy['total_reads'], axis=0) * 1e6  # counts per million total reads
    cpm_sums = counts_cpm.sum(axis=1)  # per-sample sum over all taxa
    cpm_sums