Question about normalization

Valentin-Bio commented 1 year ago

Hello developer, thanks for this awesome tool,

after running stage_two I got several output tables. Apparently args_oap produces ARGs count tables with different normalization methods am I right ? (rpkm, tpm, ppm, 16S gene copy ). My question is regarding to the abundance calculation that is mentioned on the paper https://www.sciencedirect.com/science/article/pii/S2095809922008062

from what I see , the abundance of an ARG is corrected by first dividing the abundance by the length of the reference ARG sequence and finally dividing that value by the estimated number of cells.

On the article it is mentioned that instead of normalizing by the number of cells , the abundance value can be normalized by the 16S gene copy number. with this in mind and considering the output tables, in which one of the output tables the article-mentioned abundance calculation is performed ?

considering that I'm interested in the "type" classification level, I want to perform further analysis considering the abundance calculation method that is mentioned on the paper so to achieve that I must prefer to use the normalized_cell.type.txt am I right ?

Finally, considering that bacteria may possess multiple 16S gene copies inside their chromosome, it is preferred to use the normalized_cell.type.txt instead of the normalized_16S.type.txt isn't it ?

best regards,

Valentín.

xinehc commented 1 year ago

Hi

with this in mind and considering the output tables, in which one of the output tables the article-mentioned abundance calculation is performed ?

For PPM/RPKM/TPM table, the nominator is file unnormalized.count* (number of reads that mapped to ARGs). For cell/16S table, the nominator is file unnormalized.scov* (number of bp mapped to ARGs divided by the corresponding gene length).

If you are particularly interested in copies of ARGs per copies of 16S, you can go with normalized_16S.*, but yes some bacteria/archaea have several 16S per genomes and that's the main reason in the article it is suggested to use per cell as unit.

HTH, Xi

Valentin-Bio commented 1 year ago

Thanks so much for this clarification and for this excellent tool,

bests,

Valentín.

Biofarmer commented 1 year ago

Hi Xi, may I further ask a question about the file unnormalized.count*? As mentioned, these files are number of reads that mapped to ARGs, which I understand should be integer, however I see some decimal values from these tables, may I know the reasons? Additionally, the number of nCell is also decimal values as shown in your examples with samples "STAS" and "SWHAS104", why is it not integer? Thank you very much.

xinehc commented 1 year ago

see https://github.com/xinehc/args_oap/issues/28

Biofarmer commented 1 year ago

Hi Xi, I saw this issue with "unnormalized count are counts of reads aligned to each gene (aggregated using the same way as before)." However, I am still confused why the count of reads are sometimes not integer. May I have your further explanation? Thanks

xinehc commented 1 year ago

Some genes are part of a 'two-component system' or 'multi-component system' (e.g. required two or more separate genes to present otherwise cannot be considered as resistant), in such cases, the counts are weighted by 1/2 or 1/3, see the structure files in https://github.com/xinehc/args_oap/tree/main/src/args_oap/db.

Biofarmer commented 1 year ago

Hi Xi, thanks for your answer. The "aggregated using the same way as before" in "unnormalized count are counts of reads aligned to each gene (aggregated using the same way as before)." refers to the three different files (gene,subtype,*type) in the output folder? Also, some counts are weighted by 1/6, right?

xinehc commented 1 year ago

refers to the three different files (gene,subtype,*type) in the output folder?

yes

HTH, Xi

Biofarmer commented 1 year ago

Thanks, Xi. I also saw some count values are weighted by 1/6, is this also due to "multi-component system"? I am trying to understand the system from https://github.com/xinehc/args_oap/tree/main/src/args_oap/db, but still confused.

xinehc commented 1 year ago

It could be a bug if you see 1/6. Are you using the example files?

Biofarmer commented 1 year ago

Hi Xi, I used my data, and some unnormalized count values from type files are like 16.83333, 19.1667 for multidrug type. The version I used is v3.2.2.

xinehc commented 1 year ago

the values are aggregated so 16.83333 might be 16 + 1/2 + 1/3 and 19.1667 might be 18 + 1/2 + 1/3*2.

Biofarmer commented 1 year ago

So, these values should be correct, not a bug, right? Additionally, this may be not very important for the output, but it will be great if you can guide me how to understand the two-component_structure.txt and multi-component_structure.txt?

xinehc commented 1 year ago

Yes it should be correct.

Regarding the two-component/multi-component, please see https://www.sciencedirect.com/science/article/pii/S2095809922008062 for more details.

Moreover, in SARG v3.0, special “two-component” and “three-component” tags are given to those ARG subtypes that have two-component systems or three-component systems encoding antibiotic resistance. For example, a pair of genes conferring an efflux pump (tetA(46) and tetB(46)) is required for tetracycline resistance [27]. AcrEF-TolC is another example of a three-component system from the subfamily of RND transporters, the function of which requires a membrane fusion protein (AcrE), inner membrane transporter (AcrF), and outer membrane factor (TolC) [28].

Biofarmer commented 1 year ago

That's great. Thanks. For the example of "two-component", I can see the following items in the two-component_structure.txt, but we cannot know that the tetracycline resistance requires this pair of genes if just based on two-component_structure.txt, right? "gb|AET10444.1|ARO:3004032|tetA(46) tetracyclinetetA(46) tetracycline gb|ANZ79240.1|ARO:3004035|tetA(60) tetracyclinetetA(60) tetracycline gb|AET10445.1|ARO:3004033|tetB(46) tetracyclinetetB(46) tetracycline gb|ANZ79241.1|ARO:3004036|tetB(60) tetracyclinetetB(60) tetracycline"

xinehc commented 1 year ago

but we cannot know that the tetracycline resistance requires this pair of genes if just based on two-component_structure.txt, right?

you are right, the structures files only show the hierarchal structures of genes

Biofarmer commented 1 year ago

Hi Xi, thanks, good to learn a lot from the communications. From my understanding, the counts are only weighted by 1/2 or 1/3 from 'two-component system' or 'multi-component system', right?

xinehc commented 1 year ago

the counts are only weighted by 1/2 or 1/3 from 'two-component system' or 'multi-component system', right?

yes

Biofarmer commented 1 year ago

Thanks a million.

xinehc / args_oap

Question about normalization #26