Open PhilPalmer opened 5 years ago
Hi Phil Palmer,
The output suggests that MosaicHunter is not reading the reads from the BAM files correctly: only 9 of 6077767 sites passed the very first filter, base_number_filter. It looks like you are using reads generated by Complete Genomics. Can you send me a few reads from your input in SAM format? I suspect that the read format and flags in Complete Genomics data are not set the same way as in Illumina data. Please keep in mind that our pipeline (especially the error filters) was designed for Illumina platforms, so we cannot guarantee its performance on other platforms.
Best, August
Hi @AugustHuang,
Thanks for your prompt response.
Do you know where I might be able to find some small test data for trio or paired BAM files?
Here are the first 10 reads from the BAM file I was using:
GS78791-FS3-L05-16:21308201 409 18 10005 0 25M * 0 0 AACCCTAACCCTAACCCCTAACCCT :<:;4<=<<<=/<;2;:5.556554 GC:Z:9S1G5S4G6S GS:Z:CNCCTTCCCT GQ:Z:<!;:,&555. R2:Z:TTGGCAGTAATTATTCATTNTTTACTTCAA Q2:Z:6666%)66663#6&:;;;<!:;;<;<<5:8 RG:Z:18_mapping_GS78791-FS3-L05_016_sorted
GS78791-FS3-L04-13:19196522 435 18 10010 0 27M = 10447 437 TCACCCTCACCCTCACCCCTCACCCTC :;<<<<===<=<<<<5;;656555554 GC:Z:9S1G7S2G8S GS:Z:CNCCCC GQ:Z:<!;'56 RG:Z:18_mapping_GS78791-FS3-L04_013_sorted
GS78790-FS3-L01-4:25749251 329 18 10015 0 28M * 0 0 CTAACCCTAACCCCTAACCCTAACCCTA 5666666669';;;;;:<;<;<<<<<<7 GC:Z:9S1G8S1G9S GS:Z:AANC GQ:Z:59!; R2:Z:AACCCTAACCCCCTAACCCNTAACCCTAAC Q2:Z:4556555566:;<<<<<<<!:<==<<:;<: RG:Z:18_mapping_GS78790-FS3-L01_004_sorted
GS78791-FS3-L03-8:27775956 329 18 10015 0 28M * 0 0 CTAACCCTAACCCCTAACCCTAACCCTA 5666655668.<;;;;:<<<<<<;<<<6 GC:Z:9S1G8S1G9S GS:Z:AANC GQ:Z:68!< R2:Z:AACCCTAACCCCCTAACCCNTAACCCTAAC Q2:Z:4555555555:;;9<;<==!:<<=<=<<<: RG:Z:18_mapping_GS78791-FS3-L03_008_sorted
GS78791-FS3-L03-12:11572173 329 18 10015 0 28M * 0 0 CTAACCCTAACCCCTAACCCTAACCCTA 56666666692;;;;;9<<<<<<=<<<8 GC:Z:9S1G8S1G9S GS:Z:AANC GQ:Z:69!< R2:Z:ACCCTAACCCCCCTAACCCNCTAACCCTAA Q2:Z:4555655556:;<:<<8==!======<<<: RG:Z:18_mapping_GS78791-FS3-L03_012_sorted
GS78791-FS3-L03-12:26327256 329 18 10015 0 28M * 0 0 CTAACCCTAACCCCTAACCCTAACCCTA 6666666628*:<;;;:<<<<<<<<<;8 GC:Z:9S1G8S1G9S GS:Z:AANC GQ:Z:68!< R2:Z:ACCCTAACCCACCCTAACCNCTAACCCTAA Q2:Z:45555455558;<<<<<==!=:====<<;: RG:Z:18_mapping_GS78791-FS3-L03_012_sorted
GS78791-FS3-L04-6:4649719 329 18 10015 0 28M * 0 0 CTAACCCTAACCCCTAACCCTAACCCTA 5556555668)<;<;<:<<=<<<===<9 GC:Z:9S1G8S1G9S GS:Z:AANC GQ:Z:48!< R2:Z:ACCCTAACCCTAACCCTAANCCCTAACCCT Q2:Z:4555555665&;.<<<<=<!=====<<<<: RG:Z:18_mapping_GS78791-FS3-L04_006_sorted
GS78791-FS3-L05-5:5840971 329 18 10015 0 28M * 0 0 CTAACCCTAACCCCTAACCCTAACCCTA 5666665668+<<;;;;<==<<<<==:9 GC:Z:9S1G8S1G9S GS:Z:AANC GQ:Z:48!= R2:Z:CCTAACCCTACCTAACCCTNAACCCTAACC Q2:Z:4455566655:;2<<<<=<!=:===8<<<: RG:Z:18_mapping_GS78791-FS3-L05_005_sorted
GS78791-FS3-L06-10:22112671 329 18 10015 0 28M * 0 0 CTAACCCTAACCCCTAACCCTAACCCTA 5566666669&;;;;;;<<=<<<<<<<8 GC:Z:9S1G8S1G9S GS:Z:AANC GQ:Z:69!< R2:Z:ACCCTAACCCCTAACCCTANACCCTAACCC Q2:Z:4555555666:3:8<<<<=!======<<<: RG:Z:18_mapping_GS78791-FS3-L06_010_sorted
GS78791-FS3-L07-2:22480995 329 18 10015 0 28M * 0 0 CTAACCCTAACCCCTAACCCTAACCCTA 6666666669'<;;;;;<<<<<<=<<<8 GC:Z:9S1G8S1G9S GS:Z:AANC GQ:Z:69!< R2:Z:AACCCTAACCAACCCTAACNCCTAACCCTA Q2:Z:4555555566:+<<<<<<=!======<<<: RG:Z:18_mapping_GS78791-FS3-L07_002_sorted
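The second column of each read above is the SAM FLAG, a bit field. As a rough aid for reading it, the sketch below decodes the three values that appear in these reads (409, 435 and 329) using the bit meanings from the SAM specification. The function name `describe_flag` is made up for this illustration and is not part of samtools or MosaicHunter.

```shell
# Decode a SAM FLAG into the names of the bits that are set.
# Bit values follow the SAM specification; only the bits relevant here are listed.
describe_flag() {
  flag=$1
  out=""
  for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped 16:reverse \
              32:mate_reverse 64:read1 128:read2 256:secondary; do
    bit=${pair%%:*}    # numeric bit value, e.g. 256
    name=${pair#*:}    # label for that bit, e.g. secondary
    if [ $(( $flag & $bit )) -ne 0 ]; then
      out="$out $name"
    fi
  done
  echo "${out# }"
}

# The three FLAG values that occur in the reads listed above.
for f in 409 435 329; do
  echo "$f: $(describe_flag "$f")"
done
```

All three values have bit 256 (secondary alignment) set. If samtools is installed, `samtools flags 329` should give an equivalent decoding.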
Hi Phil Palmer,
You can download the trio data (90X, Illumina platform) from the 1000 Genomes Project FTP site. See the URLs listed in Supplementary Table 2 of our NAR paper on MosaicHunter.
The first 10 reads you provided are all flagged as secondary alignments, probably because of the very short read length of Complete Genomics data. I also noticed that you down-sampled the input BAM file to 10%, which might be another reason that not enough sites passed the base_number_filter. I suggest at least 50X average depth for the input BAM of MosaicHunter. For trio calling, you should also specify the paths to the paternal and maternal sequencing data on your command line: "-P father_bam_file=
Best, August
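One way to check a BAM file against the 50X suggestion is to average the per-position depth. The sketch below assumes three-column input (chromosome, position, depth), which is the layout `samtools depth` prints. The helper name `mean_depth` is hypothetical, and the file name in the usage line is the one from this thread.

```shell
# Average the third column (depth) of "chrom pos depth" lines read from stdin.
# Prints NA when there is no input, to avoid dividing by zero.
mean_depth() {
  awk '{ sum += $3; n++ }
       END { if (n > 0) printf "%.1f\n", sum / n; else print "NA" }'
}

# Intended usage, assuming samtools is installed (-a also reports zero-depth positions):
#   samtools depth -a son_subsample_18_mapping_sorted_header.bam | mean_depth
```

The samtools documentation describes `samtools depth` as skipping secondary alignments by default, so a file whose reads are mostly flagged as secondary may report a much lower depth than its raw read count suggests.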
Hi,
I would like to run the pipeline in paired/trio mode. Is any test data available?
I have tried this myself; however, when I run the pipeline, much of the output produced is empty and the standard output shows ratio NaN% for many values. I think this is similar to issue #1; however, I have removed the chr prefix from my BAM file and am using the FASTA provided as test data.
To get my BAM file:
s3://giab/data/AshkenazimTrio/HG002_NA24385_son/CompleteGenomics_normal_RMDNA/son_NA24385_GS000037263-ASM/BAM/chr18_mapping_sorted_header.bam
samtools view -s 0.1 -b chr18_mapping_sorted_header.bam > son_subsample_chr18_mapping_sorted_header.bam
samtools view -h son_subsample_chr18_mapping_sorted_header.bam | sed 's/chr//g' | samtools view -Shb - -o son_subsample_18_mapping_sorted_header.bam
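A caveat on the rename step: `sed 's/chr//g'` removes the substring `chr` wherever it occurs on a line, not only in reference names. The RG tags in the reads above begin with `18_mapping_`, while the source file is named `chr18_mapping_...`, which suggests the read-group IDs were rewritten as well. A narrower variant, offered only as a sketch, restricts the change to the `SN:` field of `@SQ` header lines and to the RNAME and RNEXT columns (3 and 7) of alignment lines. The function name `strip_chr` is made up for this example, and other places that can carry reference names (for example some optional tags) are not handled.

```shell
# Remove a leading "chr" from reference names only, reading SAM text on stdin:
#   - @SQ header lines: the SN: field
#   - alignment lines: column 3 (RNAME) and column 7 (RNEXT)
# All other header lines and fields are passed through unchanged.
strip_chr() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^@SQ/ { for (i = 2; i <= NF; i++) sub(/^SN:chr/, "SN:", $i); print; next }
       /^@/   { print; next }
       { sub(/^chr/, "", $3); sub(/^chr/, "", $7); print }'
}

# Intended usage, assuming samtools is installed (in.bam and out.bam are placeholders):
#   samtools view -h in.bam | strip_chr | samtools view -b -o out.bam -
```

Because it edits only the named columns, read names, read-group IDs and quality strings are left unchanged.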
When I run the following command:
I get the following output:
Do you have any idea what the problem may be and how I can resolve it? Is the problem that the reference FASTA and BAM file do not correspond?
Thanks in advance; any help would be much appreciated.