wjian8 / psvcp_v1.01

Pan-genome Construction and Population Structure Variation Calling pipeline
GNU General Public License v3.0
32 stars 5 forks source link

variant coordinates on pan-genome reference #3

Open mb47 opened 1 year ago

mb47 commented 1 year ago

Hi, this looks like a great tool -- we haven't tried it yet but before we do I wanted to ask whether you have anything in place in the pipeline that converts variant coordinates from SV and small variant calling back to coordinates on the original reference genome (genotype 0 in your incremental approach)? This would be very helpful -- we want to compare this approach with the likes of minigraph and this only really works if all approaches tested use the same genomic coordinate space. Many thanks Micha

wjian8 commented 1 year ago
  1. The pipeline will give you a output file "pan.pav.gff"
    
    head pan.pav.gff
    Chr1    CG14    Insertion       13197   13252   .       +       .       ID=CG14_Chr1_17852;Name=CG14_Chr1_17852
    Chr1    CG14    Insertion       30419   31148   .       +       .       ID=CG14_Chr1_26765;Name=CG14_Chr1_26765
    Chr1    CG14    Insertion       40380   41455   .       +       .       ID=CG14_Chr1_36664;Name=CG14_Chr1_36664
    .....
You can sort it in Linux. `sort -k1.4,1n -k4,4n  pan.pav.gff > pan.pav.sorted.gff`

head pan.pav.sorted.gff Chr1 CN1 Insertion 12248 22621 . + . ID=CN1_Chr1_15320;Name=CN1_Chr1_15320 Chr1 CG14 Insertion 13197 13252 . + . ID=CG14_Chr1_17852;Name=CG14_Chr1_17852 Chr1 CG14 Insertion 30419 31148 . + . ID=CG14_Chr1_26765;Name=CG14_Chr1_26765 .....


2. I have upload a python script [14pan_pav_gff_to_no_overlap_bed.py](https://github.com/wjian8/psvcp_v1.01/blob/main/construct_pan_script/14pan_pav_gff_to_no_overlap_bed.py) in github (wjian8/psvcp_v1.01/construct_pan_script/). 
`python3 14pan_pav_gff_to_no_overlap_bed.py pan.pav.sorted.gff`
It can delete overlap region of PAVs.
The output is a bed format file pan.pav.sorted.gff.no_overlap.bed

head pan.pav.sorted.gff.no_overlap.bed Chr1 12248 22621 Chr1 30419 31148 Chr1 40380 41455 .....


3. I have upload two R scripts  [8.1.1nipToPanSwitch.R](https://github.com/wjian8/psvcp_v1.01/blob/main/construct_pan_script/8.1.1nipToPanSwitch.R) and [8.1.2panToNipSwitch.R](https://github.com/wjian8/psvcp_v1.01/blob/main/construct_pan_script/8.1.2panToNipSwitch.R) in github (wjian8/psvcp_v1.01/construct_pan_script/)
It needs three input parameters.
`Rscript    8.1.1nipToPanSwitch.R    pos_in_ref0.txt    pan.pav.sorted.gff.no_overlap.bed   output_pos_in_pan.txt`

head pos_in_ref0.txt Chr1 MSU snp 43267665 43267665 Chr5 MSU snp 28661123 28661123 Chr9 MSU snp 23003383 23003383 .....

The pos_in_ref0.txt file splits with tab. It contains the position (column 1 and 4) information you want to convert.

head output_pos_in_pan.txt Chr1 MSU snp 49051287 43267665 Chr5 MSU snp 32947169 28661123 Chr9 MSU snp 27248306 23003383 .....

The column 4 of output_pos_in_pan.txt is the position in pangenome.

`Rscript    8.1.2panToNipSwitch.R    output_pos_in_pan.txt    pan.pav.sorted.gff.no_overlap.bed   output_pos_in_ref0.txt`

head output_pos_in_ref0.txt Chr1 MSU snp 43267665 43267665 Chr5 MSU snp 28661123 28661123 Chr9 MSU snp 23003383 23003383 .....


8.1.2panToNipSwitch.R can convert the position in pan to the position in ref0.
mb47 commented 1 year ago

That's great, thank you. So is file pos_in_ref0.txt an output from the main pipeline?

wjian8 commented 1 year ago

No, It isn't an output from the main pipeline. You create pos_in_ref0.txt and edit it by filling in the position information which you are interested in.

mb47 commented 1 year ago

Ok, I see. This looks like the first half of a GFF file. Does the third column in pos_in_ref0.txt take values from a controlled vocabulary or can it be anything you like? And is column 2 a "source" field?

wjian8 commented 1 year ago

column 2 and 3 can be anything you like.

SchreiberM commented 1 year ago

Hi, Thanks for providing the scripts. The python script to create no overlaps, does not resolve all overlaps properly.

Here is an example of what it was not able to resolve correctly: Input: chr1H | 819649 | 820847 chr1H | 820507 | 821125 chr1H | 821044 | 831630

Output: chr1H | 819649 | 820847 chr1H | 821044 | 831630

Correct result: chr1H | 819649 | 831630

It would be great if you could have a look into this. Thanks, Miriam

wjian8 commented 1 year ago

Thank you. I have fixed the 14pan_pav_gff_to_no_overlap_bed.py. Please try it.