Closed kullrich closed 2 years ago
convert GTF3 to BED
You can use minimap2's paftools gff2bed in.gtf[.gz] > out.bed
.
different isoforms of genes
All isofroms in the gene references are considered if they could be aligned to. The script will then check the overlaps with introns/exons, the TSD and the polyA tail to choose a best match for each SV input. For results (humans and human vs ape) we described in the manuscript, this was good enough. I haven't tried other use cases.
If you want to double check, use paftools splice2bed
with the paf file from step3. Load this bed file, the bed file of gene references and the reference genome into IGV. The processed pseudogene should look like its best-matching isoform, with a few flanking bases and a polyA tail.
Thx, I will try. I am working with the mm39 annotated ensembl genome to compare some mouse genomes and wanted to use the corresponsing GFF3 file.
But for step3 refgene.fa.gz
are ment to be full-length (including introns) cDNA?
Hi, I managed now to run it. There is one open question for me. In the final results file for the DEL types, the span_on_ref and span_on_contig are always the same positions. Meaning that the position from the call.tsv are not correctly put into it. In the call.tsv the DEL position on the contig is correct and points to a single position. Am I doing something wrong, since in the supplementary of your paper for the DEL types the span_on_contig is like in the call.tsv file a single position? Thank you in anticipation Best regards Kristian
Example used:
#get genome data
wget http://ftp.ensembl.org/pub/release-107/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget http://ftp.ensembl.org/pub/release-107/fasta/pan_troglodytes/dna/Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa.gz
gunzip Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa.gz
#get reference annotations
wget http://ftp.ensembl.org/pub/release-107/gtf/homo_sapiens/Homo_sapiens.GRCh38.107.chr.gtf.gz
#create reference bed file
paftools.js gff2bed Homo_sapiens.GRCh38.107.chr.gtf.gz > Homo_sapiens.GRCh38.107.chr.bed
#create reference transcript fasta file (including introns)
bedtools getfasta -s -fi Homo_sapiens.GRCh38.dna.primary_assembly.fa -bed Homo_sapiens.GRCh38.107.chr.bed -name > Homo_sapiens.GRCh38.107.chr.refgene.fa
seqkit replace -p "::.+" Homo_sapiens.GRCh38.107.chr.refgene.fa > Homo_sapiens.GRCh38.107.chr.refgene.red.fa
#step1
minimap2 -z 10000,200 -r 5k -t 32 -cx asm5 --cs Homo_sapiens.GRCh38.dna.primary_assembly.fa Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa > aln.paf
sort -k6,6 -k8,8n aln.paf > aln_sorted.paf
paftools.js call -q0 -L10000 -l1000 aln_sorted.paf 1>call.var 2>call.vst
#step2
python findPPG_fileprep.py -o call.tsv call.var 1> SV.fa
#step3
minimap2 -f 10000 -N 100 -p 0.1 -t 32 -cx splice --cs Homo_sapiens.GRCh38.107.chr.refgene.red.fa SV.fa > aln_SV.paf
#step4
python findPPG_parsealn.py -o PPGFparse -t32 Homo_sapiens.GRCh38.107.chr.bed Homo_sapiens.GRCh38.dna.primary_assembly.fa aln_SV.paf
#step5
python findPPG_annotate.py -o PPGFannotate -t32 --assembly Pan_troglodytes.Pan_tro_3.0.dna.toplevel.fa \
Homo_sapiens.GRCh38.dna.primary_assembly.fa \
Homo_sapiens.GRCh38.107.chr.bed \
call.tsv \
PPGFparse.PPG.paf
Sorry I didn't see your July 14's additional question. Github does not send notification about editted comment. Please make new posts as needed.
The coordinate is my mistake. Please either apply this patch (patch findPPG_annotate.py < patch.txt
) or pull the git HEAD, then rerun the step5:
diff --git a/findPPG_annotate.py b/findPPG_annotate.py
index 41c9ce3..e4a8169 100644
--- a/findPPG_annotate.py
+++ b/findPPG_annotate.py
@@ -193,6 +193,7 @@ def worker(d):
parentgene = genename_map[parentgene]
parentgene_ID = entry[-6].split('|')[0]
+ coor_contig = str(contig_start)+'-'+str(contig_end)
if TYPE=='INS':
contig_start = contig_start
contig_end = contig_end
@@ -330,7 +331,7 @@ def worker(d):
continue
cleavage = the_retrocopy[16:20].lower() + the_retrocopy[20:24]
cleavage = cleavage[::-1]
- coor_contig = str(contig_start)+'-'+str(contig_end)
+ #coor_contig = str(contig_start)+'-'+str(contig_end)
coor_ref = str(chrom_s)+'-'+str(chrom_e)
holder = [SVname, SVsamplename,
TYPE, contig_name, SVchrom, SVstrand,
Other columns or the sequences are not affected by this. I lazily recycled variables and forgot they are later used for printing the results.
Thank you so much, now the span_on_contig positions are there.
Dear Xiaowen Feng, dealing with GTF3/GTF files either from NCBI or ENSEMBL is sometimes cumbersome.
Do you suggest any tool to convert these file formats into the BED format that one would need?
As from the REAMDE I guess I need to export the coding fraction from the GFF3/GTF including the introns, since most of the available tools do extract just the exon part, like e.g. gffread
gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf
.Another question is, how one should deal with different isoforms of genes, do you just take the "primary transcript" or use all annotated isoforms?
Thank you in anticiaption for your help and nice tool to find all these events between genome assemblies.
Best regards
Kristian