自定义核苷酸数据库

yangzuokun commented 9 months ago

有个非常奇怪的现象，使用自定义的核苷酸数据库，流程中是使用单端数据bwa分别比对，然后合并提取，我这边尝试了直接用双端数据比对提取序列，发现两者区别很大。单端提取只有24000条，双端提取9万多条。 QQ截图20231218143906

xinehc commented 9 months ago

我测试了一下用greengenes85用single/paired没有很大区别，也许和数据/数据库有关？

bwa mem gg85.fasta _1.fa | samtools flagstats -

5187355 + 0 in total (QC-passed reads + QC-failed reads)
5186373 + 0 primary
0 + 0 secondary
982 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
13473 + 0 mapped (0.26% : N/A)
12491 + 0 primary mapped (0.24% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

bwa mem gg85.fasta _2.fa | samtools flagstats -

5187264 + 0 in total (QC-passed reads + QC-failed reads)
5186373 + 0 primary
0 + 0 secondary
891 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
13324 + 0 mapped (0.26% : N/A)
12433 + 0 primary mapped (0.24% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

bwa mem gg85.fasta _1.fa _2.fa | samtools flagstats -

10374641 + 0 in total (QC-passed reads + QC-failed reads)
10372746 + 0 primary
0 + 0 secondary
1895 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
26997 + 0 mapped (0.26% : N/A)
25102 + 0 primary mapped (0.24% : N/A)
10372746 + 0 paired in sequencing
5186373 + 0 read1
5186373 + 0 read2
16650 + 0 properly paired (0.16% : N/A)
22580 + 0 with itself and mate mapped
2522 + 0 singletons (0.02% : N/A)
5688 + 0 with mate mapped to a different chr
2900 + 0 with mate mapped to a different chr (mapQ>=5)

yangzuokun commented 9 months ago

使用双端比对计算的丰度比单端比对计算的丰度多了3-4倍

xinehc commented 9 months ago

我测试了一下，VFDB_setB_nt是会有这种情况，但在state_two的blastn筛选后结果没有很大区别。

bwa mem VFDB_setB_nt.fas STAS_1.fa STAS_2.fa | samtools fasta -F 2308 - > b.fa
blastn -db VFDB_setB_nt.fas -query b.fa -qcov_hsp_perc 75 -perc_identity 85 -outfmt 6 -max_target_seqs 1 | wc -l

[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 5006 reads
Warning: [blastn] Examining 5 or more matches is recommended
      21

bwa mem VFDB_setB_nt.fas STAS_1.fa | samtools fasta -F 2308 > a.fa
blastn -db VFDB_setB_nt.fas -query a.fa -qcov_hsp_perc 75 -perc_identity 85 -outfmt 6 -max_target_seqs 1 | wc -l

[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 365 reads
Warning: [blastn] Examining 5 or more matches is recommended
      10

xinehc commented 9 months ago

你序列的长度是多少？能否发给我一个可以复现这个问题的序列文件？

我测试50/100/150bp的ecoli在id 80和qcov 75的cutoff下single/paired没有很大区别。

xinehc commented 9 months ago

测试Q5_L1_1使用single/paired经过blastn筛选后没有很大区别，即使bwa初筛出的reads差3-4倍：

bwa mem -t 10 VFDB_setB_nt.fas Q5_L1_1.fq.gz Q5_L1_2.fq.gz | samtools fasta -F 2308 - > p.fa
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 106964 reads

bwa mem -t 10 VFDB_setB_nt.fas Q5_L1_1.fq.gz | samtools fasta -F 2308 - > s.fa
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 12667 reads

blastn -db VFDB_setB_nt.fas -query p.fa -qcov_hsp_perc 75 -perc_identity 80 -evalue 1e-7 -outfmt 6 -max_target_seqs 1 | wc -l
Warning: [blastn] Examining 5 or more matches is recommended
   10625

blastn -db VFDB_setB_nt.fas -query s.fa -qcov_hsp_perc 75 -perc_identity 80 -evalue 1e-7 -outfmt 6 -max_target_seqs 1 | wc -l
Warning: [blastn] Examining 5 or more matches is recommended
    5400

使用stage_two的filter，注意qcov的计算方式和blast计算qcovhsp的方式不完全一样：

blastn -db VFDB_setB_nt.fas \
    -query p.fa \
    -evalue 1e-7 \
    -outfmt '6 qseqid sseqid pident length qlen slen evalue bitscore' \
    -max_target_seqs 5 > p.txt

python -c "
import pandas as pd
df = pd.read_table('p.txt',
                   header=None,
                   names=['qseqid', 'sseqid', 'pident', 'length', 'qlen', 'slen', 'evalue', 'bitscore'],
                   dtype={'sseqid': str})

df['qcov'] = df['length'] / df['qlen'] # not the same as (qstart - qend) / qlen
df = df[(df.pident >= 80) & (df.qcov >= 0.75) & (df.length > 75) & (df.evalue <= 1e-7)]
print('Reads remained after filtering: {}'.format(df.qseqid.nunique()))
"

Reads remained after filtering: 10436

xinehc / args_oap

自定义核苷酸数据库 #50