yge15 / TCGA_Microbial_Content

Comprehensive analysis of microbial content in whole-genome sequencing samples from The Cancer Genome Atlas project

Questions about the data in TableS3 #1

Open Jiangx285 opened 1 week ago

Jiangx285 commented 1 week ago

Hi, I am trying to reanalyze the data, but I am confused about the data in Table S3. Thank you!

Q1: Why are the results in Table S1 and Table S3 so different? Did you do additional filtering for possible contaminating bacteria?

Taking the esophageal cancer data as an example, in Table S1, the top species are Cutibacterium acnes (prevalence 92%) and Escherichia coli (prevalence 83%); but in Table S3: Cutibacterium acnes (prevalence 3%), Escherichia coli (prevalence 0%).

Q2: "Table S3 further normalizes the counts by dividing by the genome size of each species, in kilobases. " This normalization method is rarely used. Why did you choose to use this method?

yge15 commented 1 week ago

Hi there, TableS1 is provided as raw read counts; TableS3 takes those raw counts and normalizes them by millions of reads sequenced and by kilobases of genome. There's no filtering of contaminants.

The idea is to correct for differences in sequencing depth across samples and in genome size across species. Say you're comparing a 5 Mb bacterium to a 5 Kb virus. In terms of raw counts, assuming the same coverage, you'd expect 1000x more reads from the bacterium simply because the virus has a much smaller genome. You can also think of TableS3 as counts per cell (or per particle).

Note that most of the raw counts are small numbers, while bacterial genomes are on the order of megabases, so many normalized counts in TableS3 drop below 0.01 and are rounded to 0.00. If you're concerned about losing resolution for bacteria, please consider using TableS1 or TableS2 (counts/Mrs).

This normalization scheme is similar to the one used to quantify transcript abundance in RNA-seq data. Read more if you're interested: https://bioinformatics.ccr.cancer.gov/btep/questions/what-is-the-difference-between-rpkm-fpkm-and-tpm
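To make the arithmetic concrete, here is a minimal Python sketch of the normalization described above. It is not the code used to build the tables: the function names, the sequencing depth, the per-sample counts, and the definition of prevalence as "fraction of samples with a nonzero value" are all assumptions for illustration. Only the 5 Mb and 5 Kb genome sizes come from the example in this reply.

```python
import math


def normalize(raw_count, total_reads, genome_size_bp):
    """Raw count -> count per million sequenced reads per kilobase of genome."""
    millions_of_reads = total_reads / 1e6
    genome_kb = genome_size_bp / 1e3
    return raw_count / millions_of_reads / genome_kb


def prevalence(values):
    """Fraction of samples with a nonzero value (assumed definition)."""
    return sum(v > 0 for v in values) / len(values)


TOTAL_READS = 500_000_000   # hypothetical sequencing depth per sample
BACTERIUM_BP = 5_000_000    # 5 Mb genome, as in the example above
VIRUS_BP = 5_000            # 5 Kb genome, as in the example above

# Same coverage: the bacterium yields 1000x more raw reads than the virus,
# but both end up with the same normalized value.
same_cov_bacterium = normalize(20_000, TOTAL_READS, BACTERIUM_BP)
same_cov_virus = normalize(20, TOTAL_READS, VIRUS_BP)

# Hypothetical raw counts of one bacterial species across four samples.
raw_counts = [3, 0, 12, 1]
# Normalize, then round to two decimals as in the published table.
normalized = [round(normalize(c, TOTAL_READS, BACTERIUM_BP), 2) for c in raw_counts]

print("equal after normalization:", math.isclose(same_cov_bacterium, same_cov_virus))
print("prevalence from raw counts:", prevalence(raw_counts))
print("prevalence after rounding:", prevalence(normalized))
```

With these made-up numbers the species is present in three of four samples by raw count, yet every normalized value rounds to 0.00. That is the same effect that makes the prevalence of a large-genome bacterium look much lower in TableS3 than in TableS1, even though nothing was filtered.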

Jiangx285 commented 1 week ago

Thanks for your reply! I will consider using the data in Table S1. Do you have any suggestions for removing contaminating microorganisms such as Cutibacterium acnes and Escherichia coli?