zhangrengang / SubPhaser

Phase, partition and visualize subgenomes of a neoallopolyploid or hybrid based on the subgenome-specific repetitive kmers.
https://doi.org/10.1111/nph.18173
GNU General Public License v3.0
53 stars 12 forks source link

sg.config configuration (The homology of the de novo assembled genome is not known) #4

Open hushaoqiang opened 2 years ago

hushaoqiang commented 2 years ago

Hello, thank you very much for your SubPhaser software, which is very useful for genotyping. I used HIFI and HIC to assemble the genome of Fragaria x ananassa but I do not know the homology of the 28 scaffolds. Should I put all the chromosomes in the sg.config file in one line? I am looking forward to your reply.

zhangrengang commented 2 years ago

No. It may be very easy to know the homology by aligning your chromosomes to the genome of Fragaria vesca using tools such as minimap2 etc. You may be able to get a figure like below:

fan-fve

So the sg.config configuration can be set like:

1-1 1-2 1-3 1-4
2-1 2-2 2-3 2-4
...
zhangrengang commented 2 years ago

Another way for you, the HiC contact heatmap can also show diagonal signals between homeologous chromosomes like: hic-heatmap So the sg.config configuration can be set like:

chr01a chr01b chr01c chr01d
chr02a chr02b chr02c chr02d
...

Note that the HiC heatmap should not be filtered by mapping qualtiy of reads.

hushaoqiang commented 2 years ago

谢谢老师您的解答,我将使用您所推荐的方式来做同源分析。另外还有个问题请教一下老师您,为了说详细点,我就是用中文了。 老师,使用Camarosa草莓的基因组序列想要试运行一下,基因组文件是发表的序列。配置文件内容: Fvb1-1 Fvb1-2 Fvb1-3 Fvb1-4 Fvb2-1 Fvb2-2 Fvb2-3 Fvb2-4 Fvb3-1 Fvb3-2 Fvb3-3 Fvb3-4 Fvb4-1 Fvb4-2 Fvb4-3 Fvb4-4 Fvb5-1 Fvb5-2 Fvb5-3 Fvb5-4 Fvb6-1 Fvb6-2 Fvb6-3 Fvb6-4 Fvb7-1 Fvb7-2 Fvb7-3 Fvb7-4

运行命令是:subphaser -i F_ana_Camarosa_6-28-17_hardmasked.fasta -c sg.config 但是会出现: 22-04-28 18:21:39 [INFO] Consistent with subgenome assignment: 16 (6.93%); potential exchange: 149 (64.50%) 22-04-28 18:21:39 [INFO] Output: /home/hsq/hhh/subphaser/kmls/phase-results/k15_q200_f2.bin.enrich 22-04-28 18:21:39 [INFO] ###Step: LTR 22-04-28 18:21:39 [INFO] Identifying LTR-RTs by ['ltr_harvest'] 22-04-28 18:21:45 [INFO] Check point file: /home/hsq/hhh/subphaser/kmls/tmp/LTR.scn.ok exists; skip this step 22-04-28 18:21:45 [INFO] 8186 LTRs identified 22-04-28 18:21:45 [INFO] Check point file: /home/hsq/hhh/subphaser/kmls/tmp/LTR.tesort.ok exists; skip this step 22-04-28 18:21:45 [INFO] By TEsorter, 190 (2.3%) are classified as LTRs, of which 0 (0.0%) are intact with complete protein domains 22-04-28 18:21:45 [INFO] After filtering, 186 / 8186 (2.3%) LTRs retained 22-04-28 18:21:45 [INFO] Outputing coordinate - LTR maps to /home/hsq/hhh/subphaser/kmls/phase-results/k15_q200_f2.ltr.bin.count 22-04-28 18:21:45 [INFO] Start Pool with 32 process(es) 22-04-28 18:21:45 [INFO] Processed 0 sequences Traceback (most recent call last): File "/home/hsq/miniconda3/envs/SubPhaser/bin/subphaser", line 33, in sys.exit(load_entry_point('subphaser==1.2.5', 'console_scripts', 'subphaser')()) File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 779, in main pipeline.run() File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 516, in run ltr_bedlines, enrich_ltr_bedlines = self.step_ltr(d_kmers) if not self.disable_ltr else ([],[]) File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 563, in step_l tr Seqs.map_kmer3([ltrfile], d_kmers, fout=fout, k=self.k, ncpu=self.ncpu, File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/Seqs.py", line 97, in map_kmer3 logger.info('{} ({:.2%}) sequences contain subgenome-specific kmers'.format(mapped_seqs, mapped_seqs/i)) ZeroDivisionError: division by zero

k15_q200_f2.kmer_pca.pdf是成功生成的。生成的pca图与老师您附表中的图不一样。 请问一下老师我是不是需要调整参数或者其他设置有什么问题。 期待老师您的回复

zhangrengang commented 2 years ago

Do not use hardmasked genome sequences, because subphaser is based on repeated sequences (TEs in most cases) in fact.

hushaoqiang commented 2 years ago

老师,请问您对开麦罗莎草莓做分析时(Fig. S27 Subgenome phasing with SubPhaser in the Fragaria x ananassa genome.),使用的文件是哪一个啊。

zhangrengang commented 2 years ago

F_ana_Camarosa_6-28-17.fasta.gz. But you may not get the identical results but similar results.

hushaoqiang commented 2 years ago

老师,我刚刚又去GDR网站看了一下,我只看到了 F_ana_Camarosa_6-28-17_hardmasked.fasta.gz跟[F_ana_Camarosa_6-28-17_unmasked.fasta.gz两个文件,请问一下您告诉我的F_ana_Camarosa_6-28-17.fasta.gz文件这个在哪个链接下载啊,麻烦老师您了。

zhangrengang commented 2 years ago

F_ana_Camarosa_6-28-17_unmasked.fasta.gz

hushaoqiang commented 2 years ago

老师,已成功运行,4套亚基因组成功分型。图非常好看,感谢老师您的耐心指导与为科研工作者所开发的SubPhaser软件。

hushaoqiang commented 2 years ago

Teacher, I have successfully found homologous chromosomes using MUMmer software, and then performed genotyping using SubPhaser. The results are very good. Thank you!

zhangrengang commented 2 years ago

Great. Thanks for your feedback.

kashiff007 commented 2 years ago

Hi @zhangrengang, Thanks for this excellent software. I have been using this for my newly assembled allopolyploidy genome (2n = 4x = 40). The genome has two subgenomes, but I don't know the homologous pairs. So as you suggested in previous comments, I used mummer and minimapper2 to find the homologous pairs and assigned them in the configurations file.

But after running the SubPhaser with default parameters I got 19 chr in SG1 and only one chr in SG2. Idially, it should be 10 chr in each subgenome.

image

The differential k-mer heatmap also shows the similar result:

image

Can you suggest why it happend? Am I approaching rightly?

zhangrengang commented 2 years ago

@kashiff007 It seems that your genome is contig-level, but not chromosome-level?

kashiff007 commented 2 years ago

Yes, It is a newly assembled genome and from cytometry, we speculate that the largest 20 contigs are full genome length.

zhangrengang commented 2 years ago

@kashiff007 Can you show the homoelogous relationship such as dot plot, and your config file? How much percent do the largest 20 contigs account for the whole genome?

kashiff007 commented 2 years ago

image

here y-axis is my new genome and when we align with another closely related species Oropetium (x-asix), id gives the following plot. You can see each of the x-axis has two aligned positions in y-axis, which tells that these are homologous.

Based on this I made a config file which looks like: ptg000016l ptg000003l ptg000013l ptg000004l ptg000022l ptg000015l ptg000011l ptg000007l ptg000012l ptg000006l ptg000019l ptg000008l ptg000017l ptg000018l ptg000002l ptg000010l ptg000001l ptg000005l ptg000020l ptg000014l Each row has two contigs (assuming homologous chromosomes) separated by space.

Total genoem size is 587 mb (587146212) and top 20 contigs are abount 573mb (573812465). or 97.72% of the total genome.

zhangrengang commented 2 years ago

@kashiff007 It seems no problem. Do you have evidence for allo?

kashiff007 commented 2 years ago

I don't have evidence for allo. My primary job was to separate the subgenemes so I followed this thread. The circos looks like this: circos

kashiff007 commented 2 years ago

is it something to do with clustering?

zhangrengang commented 2 years ago

@kashiff007 If there are no much assembly errors (swtich errors, etc,), your genome is likely to be an auto-polyploid or be heavily recombined, which can not by phased by subphaser. You can try to change the parameters, such as setting -q 100 or reducing the number of chromosome sets like:

ptg000016l ptg000003l
ptg000013l ptg000004l
ptg000022l ptg000015l
ptg000011l ptg000007l
ptg000012l 
ptg000006l
ptg000019l 
...
ptg000020l 
ptg000014l

But these may also not work.

kashiff007 commented 2 years ago

OK, I appreciate the explanation.

kedduck commented 10 months ago

No. It may be very easy to know the homology by aligning your chromosomes to the genome of Fragaria vesca using tools such as minimap2 etc. You may be able to get a figure like below:

fan-fve

So the sg.config configuration can be set like:

1-1 1-2 1-3 1-4
2-1 2-2 2-3 2-4
...

Hi, @zhangrengang

I'm soryy to disturb you. The plot is very fancy. Is it plotted by "dosPlotly" ?

Thanks!

zhangrengang commented 10 months ago

@kedduck No, it is from the paper: Edger P P, Poorten T J, VanBuren R et. al. Origin and evolution of the octoploid strawberry genome [J]. Nature Genet., 2019, 51 (3): 541–547 [http://doi.org/10.1038/s41588-019-0356-4]

kedduck commented 10 months ago

Thanks for your quick reply!