xfengnefx / PPGfinder

Discover processed pseudogene from long structural variations.
MIT License
2 stars 2 forks source link

GFF3/GTF to BED #2

Closed kullrich closed 2 years ago

kullrich commented 2 years ago

Dear Xiaowen Feng, dealing with GTF3/GTF files either from NCBI or ENSEMBL is sometimes cumbersome.

Do you suggest any tool to convert these file formats into the BED format that one would need?

As from the REAMDE I guess I need to export the coding fraction from the GFF3/GTF including the introns, since most of the available tools do extract just the exon part, like e.g. gffread gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf.

Another question is, how one should deal with different isoforms of genes, do you just take the "primary transcript" or use all annotated isoforms?

Thank you in anticiaption for your help and nice tool to find all these events between genome assemblies.

Best regards

Kristian

xfengnefx commented 2 years ago

convert GTF3 to BED

You can use minimap2's paftools gff2bed in.gtf[.gz] > out.bed.

different isoforms of genes

All isofroms in the gene references are considered if they could be aligned to. The script will then check the overlaps with introns/exons, the TSD and the polyA tail to choose a best match for each SV input. For results (humans and human vs ape) we described in the manuscript, this was good enough. I haven't tried other use cases.

If you want to double check, use paftools splice2bed with the paf file from step3. Load this bed file, the bed file of gene references and the reference genome into IGV. The processed pseudogene should look like its best-matching isoform, with a few flanking bases and a polyA tail.

kullrich commented 2 years ago

Thx, I will try. I am working with the mm39 annotated ensembl genome to compare some mouse genomes and wanted to use the corresponsing GFF3 file.

But for step3 refgene.fa.gz are ment to be full-length (including introns) cDNA?

kullrich commented 2 years ago

Hi, I managed now to run it. There is one open question for me. In the final results file for the DEL types, the span_on_ref and span_on_contig are always the same positions. Meaning that the position from the call.tsv are not correctly put into it. In the call.tsv the DEL position on the contig is correct and points to a single position. Am I doing something wrong, since in the supplementary of your paper for the DEL types the span_on_contig is like in the call.tsv file a single position? Thank you in anticipation Best regards Kristian

kullrich commented 2 years ago

Example used:

#get genome data
wget http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

wget http://ftp.ensembl.org/pub/release-107/fasta/pan_troglodytes/dna/Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa.gz
gunzip Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa.gz

#get reference annotations
wget http://ftp.ensembl.org/pub/release-107/gtf/homo_sapiens/Homo_sapiens.GRCh38.107.chr.gtf.gz

#create reference bed file
paftools.js gff2bed Homo_sapiens.GRCh38.107.chr.gtf.gz > Homo_sapiens.GRCh38.107.chr.bed

#create reference transcript fasta file (including introns)
bedtools getfasta -s -fi Homo_sapiens.GRCh38.dna.primary_assembly.fa -bed Homo_sapiens.GRCh38.107.chr.bed -name > Homo_sapiens.GRCh38.107.chr.refgene.fa
seqkit replace -p "::.+" Homo_sapiens.GRCh38.107.chr.refgene.fa > Homo_sapiens.GRCh38.107.chr.refgene.red.fa

#step1
minimap2 -z 10000,200 -r 5k -t 32 -cx asm5 --cs Homo_sapiens.GRCh38.dna.primary_assembly.fa Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa > aln.paf
sort -k6,6 -k8,8n aln.paf > aln_sorted.paf
paftools.js call -q0 -L10000 -l1000 aln_sorted.paf 1>call.var 2>call.vst

#step2
python findPPG_fileprep.py -o call.tsv call.var 1> SV.fa

#step3
minimap2 -f 10000 -N 100 -p 0.1 -t 32 -cx splice --cs Homo_sapiens.GRCh38.107.chr.refgene.red.fa SV.fa > aln_SV.paf

#step4
python findPPG_parsealn.py -o PPGFparse -t32 Homo_sapiens.GRCh38.107.chr.bed Homo_sapiens.GRCh38.dna.primary_assembly.fa aln_SV.paf

#step5
python findPPG_annotate.py -o PPGFannotate -t32 --assembly Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa \
Homo_sapiens.GRCh38.dna.primary_assembly.fa \
Homo_sapiens.GRCh38.107.chr.bed \
call.tsv \
PPGFparse.PPG.paf
xfengnefx commented 2 years ago

Sorry I didn't see your July 14's additional question. Github does not send notification about editted comment. Please make new posts as needed.

The coordinate is my mistake. Please either apply this patch (patch findPPG_annotate.py < patch.txt) or pull the git HEAD, then rerun the step5:

diff --git a/findPPG_annotate.py b/findPPG_annotate.py
index 41c9ce3..e4a8169 100644
--- a/findPPG_annotate.py
+++ b/findPPG_annotate.py
@@ -193,6 +193,7 @@ def worker(d):
             parentgene = genename_map[parentgene]
         parentgene_ID = entry[-6].split('|')[0]

+        coor_contig = str(contig_start)+'-'+str(contig_end)
         if TYPE=='INS':
             contig_start = contig_start
             contig_end = contig_end
@@ -330,7 +331,7 @@ def worker(d):
                     continue
             cleavage = the_retrocopy[16:20].lower() + the_retrocopy[20:24]
             cleavage = cleavage[::-1]
-            coor_contig = str(contig_start)+'-'+str(contig_end)
+            #coor_contig = str(contig_start)+'-'+str(contig_end)
             coor_ref = str(chrom_s)+'-'+str(chrom_e)
             holder = [SVname, SVsamplename, 
                       TYPE, contig_name, SVchrom, SVstrand,

Other columns or the sequences are not affected by this. I lazily recycled variables and forgot they are later used for printing the results.

kullrich commented 2 years ago

Thank you so much, now the span_on_contig positions are there.