Stage_one WARNING: Duplicated sequences

xinehc / args_oap

ARGs-OAP: Online Analysis Pipeline for Antibiotic Resistance Genes Detection from Metagenomic Data Using an Integrated Structured ARG Database

Issue #58, closed by HuangLoong 3 months ago

HuangLoong commented 4 months ago

I got the warning messages below while running ARGs-OAP stage_one on a server. The metagenomic data were downloaded from NCBI and preprocessed with KneadData. When I ran my own sequencing data, no warnings appeared. What causes these warnings, and will they affect downstream analysis? Thank you very much for your answer!

WARNING: Duplicated sequences in sequence extraction.
WARNING: Duplicated sequences in 16S copy number calculation.
WARNING: Duplicated sequences in cell number calculation.

xinehc commented 4 months ago

This likely means some of your sequences have exactly identical sequence IDs. You can check for duplicates using, e.g., seqkit rmdup.

HuangLoong commented 4 months ago

We renamed the sequences after downloading the raw data from NCBI, and seqkit rmdup detected no duplicate sequence IDs.

xinehc commented 4 months ago

Hi,

Do you mind sharing a minimal reproducible example? Do your _1 and _2 files have identical sequence IDs?

cat _1.fa _2.fa > seq.fa
seqkit rmdup seq.fa -D dup.fa

HuangLoong commented 4 months ago

Thank you. Using the commands you provided, I did detect some duplicated sequences. I then looked up the duplicated IDs in each of the paired-end files and found them in only one file of the pair. Will this affect the final output?
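To see which file of a pair carries the duplicated IDs, each file can also be checked on its own with standard tools. The sketch below is only an illustration: `dup_ids` is a hypothetical helper, the two FASTA files are made-up stand-ins for the real _1.fa and _2.fa, and the ID is assumed to be the header text up to the first whitespace.

```shell
# Hypothetical helper: print every sequence ID that occurs more than once
# in a FASTA file. The ID is assumed to be the header up to the first whitespace.
dup_ids() {
    awk '/^>/ { print substr($1, 2) }' "$1" | sort | uniq -d
}

# Made-up example files standing in for the real _1.fa and _2.fa.
tmpdir=$(mktemp -d)
printf '>r1\nACGT\n>r2\nGGCC\n>r1\nTTAA\n' > "$tmpdir/_1.fa"
printf '>r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n' > "$tmpdir/_2.fa"

# Report the number of duplicated IDs per file.
for f in "$tmpdir/_1.fa" "$tmpdir/_2.fa"; do
    n=$(dup_ids "$f" | wc -l | tr -d ' ')
    echo "$(basename "$f"): $n duplicated ID(s)"
done
```

In this toy input only the first file reports a duplicated ID (r1), which mirrors the situation described above.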

xinehc commented 4 months ago

Only the first detected sequence will be counted. This will lead to an underestimation of ARG, 16S, or genome copies.
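A rough way to picture the effect is to keep only the first record per ID and compare record counts. This is a sketch on a made-up FASTA file, not the pipeline's own code; it assumes the ID is the header text up to the first whitespace.

```shell
# Made-up FASTA in which the ID "r1" appears twice.
tmp=$(mktemp)
printf '>r1\nACGT\n>r2\nGGCC\n>r1\nTTAA\n' > "$tmp"

# Keep only the first record for each ID; later records with the same ID are dropped.
# This imitates "only the first detected sequence is counted" for illustration only.
kept=$(awk '/^>/ { keep = !seen[$1]++ } keep' "$tmp" | grep -c '^>')
total=$(grep -c '^>' "$tmp")

echo "records in file: $total, records counted: $kept"
```

The gap between the two numbers is the number of reads that would be ignored, which is why giving every read a unique ID before running stage_one avoids the undercount.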