yjx1217 / simuG

simuG: a general-purpose genome simulator
MIT License
86 stars 12 forks source link

Issue with -gene_gff file.gff -coding_partition_for_snp_simulation "coding" #6

Closed ppgardne closed 2 years ago

ppgardne commented 2 years ago

Hi There,

simuG looks like a fantastic tool -- I'm very keen to use it to randomise several genomes. I'm trialling a few bacterial genomes and have generated GFFs from the EMBL/Genbank annotations, but get an error:

Use of uninitialized value $mRNA_id in hash element at simuG.pl line 954, <$fh> line 1.

Is it possible that simuG requires a more specific GFF format than what we are using? If so, is this documented anywhere? I was hoping it only required the CDS+strand regions to be known (which I can easily generate from EMBL files), rather than full mRNA/spliced models.

Best wishes,

Paul.

yjx1217 commented 2 years ago

Hi Paul,

Thanks for trying out simuG! It expects standard GFF3 format with gene_id and mRNA_id clearly specified. You can check the the following GFF3 file as an example: https://github.com/yjx1217/simuG/blob/master/Testing_Example/SGDref.R64-2-1.gff3.gz

Best, Jia-Xing

ppgardne commented 2 years ago

Thanks for the speedy reply Jia-Xing!

So, just to be clear. The minimum requirements for a CDS region are both a CDS and mRNA line with identical/overlapping coordinates, e.g.:

chrI    SGD CDS 1807    2169    .   -   0   Parent=YAL068C_mRNA;Name=YAL068C_CDS;orf_classification=Verified
chrI    SGD mRNA    1807    2169    .   -   .   ID=YAL068C_mRNA;Name=YAL068C_mRNA;Parent=YAL068C

Where the 9th columns have a Parent=$mRNA_id for the CDS and ID=$mRNA_id for the mRNA? Is anything else expected?

Isn't that a little redundant? I was expecting a CDS line to be sufficient for a coding sequence.

yjx1217 commented 2 years ago

Hi Paul,

So, just to be clear. The minimum requirements for a CDS region are both a

CDS and mRNA line with identical/overlapping coordinates, e.g.:

chrI SGD mRNA 1807 2169 . - . ID=YAL068C_mRNA;Name=YAL068C_mRNA;Parent=YAL068C```

Where the 9th columns have a Parent=$mRNA_id for the CDS and ID=$mRNA_id for the mRNA? Is anything else expected?

Yes, the Parent=ABC, and ID=ABC tags are most critical. As for other Important notes, mabye: 1) the genomic coordinates are presorted, 2) for each gene, the gene track appears before the mRNA track, and the mRNA track appears before the CDS track. Basically, simuG expects a standard GFF3 file that follows the GFF3 specification ( https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md).

Isn't that a little redundant? I was expecting a CDS line to be sufficient for a coding sequence.

If it is just for simulating coding vs noncoding SNPs, I agree with you that the "Parent=" tag is not needed. But since I designed simuG to also be able to also simulate SNPs for 2D and 4D coding partitions, the "Parent=" tag become important for simuG to track the correspondence between each individual CDS and the mRNA it belongs. I feel tacking mRNA-CDS map in this way is more robust than doing it just based on track orders.

As you might found, the GFF3 files from the bioinformatics world may come with many different flavors and it is tricky to write a parser to fit them all. So I chose the lazy solution to write a parser only for the most standard GFF3. :-)

Best, Jia-Xing

--

岳家兴 研究组负责人 (PI) 中山大学肿瘤防治中心

腾飞园F座9层 中新知识城 广州市黄埔区,510555 广东省 中国

办公室电话: +86-20-87340150 Twitter: @iAmphioxus https://twitter.com/iAmphioxus 个人主页: http://www.iamphioxus.org/ 实验室主页: https://evomicslab.org/ 酵母群体参考基因组系: https://yjx1217.github.io/Yeast_PacBio_2016/welcome/

Jia-Xing Yue Principal Investigator (PI)

State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center (SYSUCC)

Address: Jia-Xing Yue's Group (Evomics Lab), 9th floor, Building F, Ascendas OneHub Business Park, Sino-Singaporc Guangzhou Knowledge City Huangpu District, 510555 Guangzhou Guangdong Province China

Office phone: +86-20-87340150 Twitter: @iAmphioxus https://twitter.com/iAmphioxus Personal website: http://www.iamphioxus.org/ Lab website: https://evomicslab.org/https://evomicslab.org/ Yeast Population Reference Panel: https://yjx1217.github.io/Yeast_PacBio_2016/welcome/

ppgardne commented 2 years ago

Kia ora Jia-Xing,

My apologies to keep bugging you. I'm just trying to generate a minimal working coding example that I can expand to 100s of genomes and/or genic regions.

File: U00096-short.3.gff
U00096  EMBL    gene    190 255 .   +   .   ID=thrL;Name=thrL;
U00096  EMBL    mRNA    190 255 .   +   .   ID=thrL;Name=thrL;Parent=thrL;
U00096  EMBL    CDS 190 255 .   +   0   Parent=thrL;Name=thrL;
File: U00096-short.3.fasta
>U00096 Escherichia coli str. K-12 substr. MG1655, complete genome.
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA
AAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAAT
TAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATA
GCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCAT
TAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGG
GCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG

Which I can successfully execute with:

simuG.pl -coding_partition_for_snp_simulation coding -gene_gff U00096-short.3.gff -refseq U00096-short.3.fasta -prefix sequence.simuG -snp_count 1000

However, the following warning appears in the output:

!!! Warning! Only 0 coding sites available in the genome whereas 1000 SNPs needs to be simulated !!!
!!! Will only introduce 0 SNPs in coding sites !!!

Which suggests I failed. I also tried running the GFF3 from ENSEMBL for Schizosaccharomyces pombe chromosome I through with their public gff3 files and fasta files. Which produces the same warning, whilst your SGD example seems to work.

Any thoughts/suggestions would be gratefully received.

Ngā mihi, Paul.

yjx1217 commented 2 years ago

Hi Paul,

The issue lies in the problem of your gff is that each ID tag should be unique according to the GFF3 specifications. So the gene ID and mRNA ID should be different. If you modify your GFF as follows, it will work. U00096 EMBL gene 190 255 . + . ID=thrL;Name=thrL; U00096 EMBL mRNA 190 255 . + . ID=thrL.mRNA;Name=thrL;Parent=thrL; U00096 EMBL CDS 190 255 . + 0 Parent=thrL.mRNA;Name=thrL;

BTW: the "Name=ABC" tag is not strictly required. So you can potentially remove it if you prefer.

@.*** simuG]$ ./simuG.pl -coding_partition_for_snp_simulation coding -gene_gff U00096-short.3.gff -refseq U00096-short.3.fasta -prefix sequence.simuG -snp_count 1000

[Fri Nov 12 09:30:44 2021] Starting simuG ..

[Fri Nov 12 09:30:44 2021] Check specified options .. Running simuG for SNP/INDEL simulation >> Ignore all options for CNV/inversion/translocation simulation.

This simulation use the random seed: 335081617

The option snp_count has been specified: snp_count = 1000 The option titv_ratio has been specified: titv_ratio = 0.5 The option gene_gff has been specified: gene_gff = U00096-short.3.gff

[Fri Nov 12 09:30:44 2021] Introducing random SNPs based on the following parameters:

snp_count = 1000 titv_ratio = 0.5 gene_gff = U00096-short.3.gff coding_partition_for_snp_simulation = coding !!! Warning! Only 66 coding sites available in the genome whereas 1000 SNPs needs to be simulated !!! !!! Will only introduce 66 SNPs in coding sites !!!

[Fri Nov 12 09:30:44 2021] Simulation completed! :)

[Fri Nov 12 09:30:44 2021] Generating output files ..

Generating the correspondance map for genomic variants introduced during simulation: sequence.simuG.refseq2simseq.map.txt

Generating reference-based vcf file for genomic variants introduced during simulation: sequence.simuG.refseq2simseq.SNP.vcf [Fri Nov 12 09:30:44 2021]

I will look into the issue with parsing Ensembl GFF3 in coming days when I have a bit more time. Thanks for reporting this issue.

Best, Jia-Xing

On Fri, Nov 12, 2021 at 9:06 AM Paul Gardner @.***> wrote:

Kia ora Jia-Xing,

My apologies to keep bugging you. I'm just trying to generate a minimal working coding example that I can expand to 100s of genomes and/or genic regions.

File: U00096-short.3.gff

U00096 EMBL gene 190 255 . + . ID=thrL;Name=thrL;

U00096 EMBL mRNA 190 255 . + . ID=thrL;Name=thrL;Parent=thrL;

U00096 EMBL CDS 190 255 . + 0 Parent=thrL;Name=thrL;

File: U00096-short.3.fasta

U00096 Escherichia coli str. K-12 substr. MG1655, complete genome.

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA

AAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAAT

TAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATA

GCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCAT

TAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGG

GCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG

Which I can successfully execute with:

simuG.pl -coding_partition_for_snp_simulation coding -gene_gff U00096-short.3.gff -refseq U00096-short.3.fasta -prefix sequence.simuG -snp_count 1000

However, the following warning appears in the output:

!!! Warning! Only 0 coding sites available in the genome whereas 1000 SNPs needs to be simulated !!!

!!! Will only introduce 0 SNPs in coding sites !!!

Which suggests I failed. I also tried running the GFF3 from ENSEMBL for Schizosaccharomyces pombe chromosome I through with their public gff3 files http://ftp.ensemblgenomes.org/pub/fungi/release-51/gff3/schizosaccharomyces_pombe/ and fasta files http://ftp.ensemblgenomes.org/pub/fungi/release-51/fasta/schizosaccharomyces_pombe/dna/. Which produces the same warning, whilst your SGD example seems to work.

Any thoughts/suggestions would be gratefully received.

Ngā mihi, Paul.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/yjx1217/simuG/issues/6#issuecomment-966736315, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFLL5SLDSL2HX62USMYFE3ULRR7ZANCNFSM5HUDA63A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

--

岳家兴 研究组负责人 (PI) 中山大学肿瘤防治中心

腾飞园F座9层 中新知识城 广州市黄埔区,510555 广东省 中国

办公室电话: +86-20-87340150 Twitter: @iAmphioxus https://twitter.com/iAmphioxus 个人主页: http://www.iamphioxus.org/ 实验室主页: https://evomicslab.org/ 酵母群体参考基因组系: https://yjx1217.github.io/Yeast_PacBio_2016/welcome/

Jia-Xing Yue Principal Investigator (PI)

State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center (SYSUCC)

Address: Jia-Xing Yue's Group (Evomics Lab), 9th floor, Building F, Ascendas OneHub Business Park, Sino-Singaporc Guangzhou Knowledge City Huangpu District, 510555 Guangzhou Guangdong Province China

Office phone: +86-20-87340150 Twitter: @iAmphioxus https://twitter.com/iAmphioxus Personal website: http://www.iamphioxus.org/ Lab website: https://evomicslab.org/https://evomicslab.org/ Yeast Population Reference Panel: https://yjx1217.github.io/Yeast_PacBio_2016/welcome/

ppgardne commented 2 years ago

Many thanks Jia-Xing! I have it going now.

Very much appreciate your advice.

Cheers, Paul.

yjx1217 commented 2 years ago

Hi Paul,

Just for your information, I figured out the problem that you encountered when parsing GFF3 files from Ensembl. A fix has been applied to simuG's repository on Github: https://github.com/yjx1217/simuG/commit/303dc2f86a6d7308c4530f65d0d572c48dd64941

Thanks again for spotting this issue.

Best, Jia-Xing

On Fri, Nov 12, 2021 at 12:00 PM Paul Gardner @.***> wrote:

Closed #6 https://github.com/yjx1217/simuG/issues/6.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/yjx1217/simuG/issues/6#event-5607327577, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFLL5VPLMEDE5A2ZEG65WTULSGMZANCNFSM5HUDA63A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

--

岳家兴 研究组负责人 (PI) 中山大学肿瘤防治中心

腾飞园F座9层 中新知识城 广州市黄埔区,510555 广东省 中国

办公室电话: +86-20-87340150 Twitter: @iAmphioxus https://twitter.com/iAmphioxus 个人主页: http://www.iamphioxus.org/ 实验室主页: https://evomicslab.org/ 酵母群体参考基因组系: https://yjx1217.github.io/Yeast_PacBio_2016/welcome/

Jia-Xing Yue Principal Investigator (PI)

State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center (SYSUCC)

Address: Jia-Xing Yue's Group (Evomics Lab), 9th floor, Building F, Ascendas OneHub Business Park, Sino-Singaporc Guangzhou Knowledge City Huangpu District, 510555 Guangzhou Guangdong Province China

Office phone: +86-20-87340150 Twitter: @iAmphioxus https://twitter.com/iAmphioxus Personal website: http://www.iamphioxus.org/ Lab website: https://evomicslab.org/https://evomicslab.org/ Yeast Population Reference Panel: https://yjx1217.github.io/Yeast_PacBio_2016/welcome/

asan-emirsaleh commented 2 years ago

I have had the same issue while processed gff3 file from NCBI RefSeq. I have used the conda installation.

yjx1217 commented 2 years ago

Hi Asan,

The version of simuG packed by conda is older than this fix. Please try if the current GitHub version solves your problem.

Best, Jia-Xing

ppgardne commented 2 years ago

Thanks Jia-Xing. I appreciate the update -- I'll have another crack at it. P.

ppgardne commented 2 years ago

Hi Jia-Xing,

Just trying to have another crack at this. If I use your GFF edits above, but instead use the full-length U00096 genome sequence (https://www.ebi.ac.uk/ena/browser/view/U00096) I get the following error message again: !!! Warning! Only 0 coding sites available in the genome whereas 10 SNPs needs to be simulated !!! !!! Will only introduce 0 SNPs in coding sites !!! Whereas if I use just the first 300 or so nucleotides of the sequence it seems to work. I'm not sure why a longer sequence which is identical over those coordinates 190..255 makes the parser apparently drop the CDS annotation.

Is this expected behaviour?

Best wishes, Paul.

ppgardne commented 2 years ago

I think I found the problem. I need to make sure the GFF & FASTA sequence names match. My apologies!

yjx1217 commented 2 years ago

Ha, Glad the problem is solved. :-)