gwajnberg commented 4 years ago

Hello ,

I just tried to use with my data PGA package and i had an error in dbCreator:

library(BSgenome.Hsapiens.UCSC.hg19) dbfile <- dbCreator(gtfFile=gtffile,vcfFile=vcffile,bedFile=bedfile,annotation_path=annotation_path,outfile_name=outfile_name,genome=Hsapiens,outdir=outfile_path) Error in y[, z] : subscript out of bounds this is the vcf header:

fileformat=VCFv4.1

source=VarScan2

INFO=

FILTER=

FORMAT=

CHROM POS ID REF ALT QUAL

chr1 10043516 . A G 0.01 chr1 10044378 . A G 0.01

iarc@iarc-H8QG6:/Processed_data/gabriel/amrita/tfRNA/EVs/derfinder-pipe$ head /New_data/pancreatic/new_data/junctions.bed chr1 12697 13221 JUNC_0 0 + 12697 13221 255,0,0 2 50,50 0,575 chr1 14829 14970 JUNC_1 1 - 14829 14970 255,0,0 2 50,50 0,192 chr1 14829 15021 JUNC_2 0 - 14829 15021 255,0,0 2 50,50 0,243 chr1 15012 25233 JUNC_3 0 + 15012 25233 255,0,0 2 50,50 0,10272 chr1 15038 15796 JUNC_4 3 - 15038 15796 255,0,0 2 50,50 0,809 chr1 15947 16607 JUNC_5 3 - 15947 16607 255,0,0 2 50,50 0,711 chr1 16765 16854 JUNC_6 0 - 16765 16854 255,0,0 2 50,50 0,140 chr1 16765 16858 JUNC_7 0 - 16765 16858 255,0,0 2 50,50 0,144 chr1 17055 17233 JUNC_8 0 - 17055 17233 255,0,0 2 50,50 0,229 chr1 17055 17606 JUNC_9 0 - 17055 17606 255,0,0 2 50,50 0,602

iarc@iarc-H8QG6:/Processed_data/gabriel/amrita/tfRNA/EVs/derfinder-pipe$ head /New_data/pancreatic/new_data/merged.gtf chr1 Cufflinks exon 852260 857889 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.6.1"; class_code "u"; tss_id "TSS1"; chr1 Cufflinks exon 857991 858025 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; oId "CUFF.6.1"; class_code "u"; tss_id "TSS1"; chr1 Cufflinks exon 878110 878438 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2"; chr1 Cufflinks exon 878633 878757 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2"; chr1 Cufflinks exon 879078 879188 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "3"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2"; chr1 Cufflinks exon 879288 880180 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "4"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2"; chr1 Cufflinks exon 880422 880526 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "5"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2"; chr1 Cufflinks exon 880898 881033 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "6"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2"; chr1 Cufflinks exon 881527 881666 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "7"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2"; chr1 Cufflinks exon 881782 881925 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "8"; oId "CUFF.36.1"; class_code "u"; tss_id "TSS2";

annotation driectory prepared with PrepareAnnotationRefseq2

Shawn-Xu commented 4 years ago

Hi,

Some information/columns are missing from your VCF file. You need to add these columns (e.g. content in the red box) to ensure the integrity of the format.

gwajnberg commented 4 years ago

ok my vcf was a product from a merging software that merged multiple vcf files of snps and indels...so i decided to use an vcf that was in this format generated by varscan. And now, I had another error:

dbfile <- dbCreator(gtfFile=gtffile,vcfFile=vcffile,bedFile=bedfile,annotation_path=annotation,outfile_name=outfile_name,genome=Hsapiens,outdir=outfile_path) Output novel junction peptides... Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") : In range 1: 'end' must be >= 'start' - 1.

Shawn-Xu commented 4 years ago

In your *.bed file, the coordinates of the start point (12697+575) in your junction part 2 exceed the end point (13221). This indicates that your file is not a standard output file generated by Tophat. Each junction from junctions.bed reported by TopHat will consists of two connected BED blocks in the column 11 and 12.

In the .gtf file. this function was designed for the cufflinks output. Only the transcript_id beginning with "Cuff." will be considered as novel sequences for the subsequent processing. If you use the gffcompare to merge all the .gtf, you need to manually add the prefix "Cuff." to the transcript_id of each entries.

gwajnberg commented 4 years ago

ok... so what pipeline do you recommend to generate a dabase from 3 patients raw sequencing data? since pga accepts only one vcf, one bed and one gtf? cheers

gwajnberg commented 4 years ago

and why you don't create an alternative for tophat, since it's being discontinued since 2016?

Shawn-Xu commented 4 years ago

and why you don't create an alternative for tophat, since it's being discontinued since 2016?

vcf and gtf are the common data formats for point mutations and transcripts. Creating a new format is even more likely to cause new misunderstandings.

Tophat2 has entered a low maintenance since 2016, and HISAT2 provides the same core functionality. So we developed an interface to hisat2's splicesites output file. You can use tabFile argument instead of bedFile to complete junction construction。 20200505120148

splicesites format demo: 20200505124637 This format only contains the left and right flanking genomic positions, which are more easier to understand.

Jokendo-collab commented 4 years ago

In your *.bed file, the coordinates of the start point (12697+575) in your junction part 2 exceed the end point (13221). This indicates that your file is not a standard output file generated by Tophat. Each junction from junctions.bed reported by TopHat will consists of two connected BED blocks in the column 11 and 12.

In the .gtf file. this function was designed for the cufflinks output. Only the transcript_id beginning with "Cuff." will be considered as novel sequences for the subsequent processing. If you use the gffcompare to merge all the .gtf, you need to manually add the prefix "Cuff." to the transcript_id of each entries.

This information is missing in the tutorial. I think the tools used in the generation of the GTF and BED files needs to be mentioned briefly for ease of following the workflow. I have struggled and this information solved my problem

gwajnberg commented 4 years ago

hello I started from the beggining then I have a new vcf, new gtf and new tabFile

gwajnberg commented 4 years ago

here is the junction tab

gwajnberg commented 4 years ago

the vcf

gwajnberg commented 4 years ago

the gtf

gwajnberg commented 4 years ago

and now I have got this error :

dbfile <- dbCreator(gtfFile=gtffile,vcfFile=vcffile,tabFile=tabfile, annotation_path=annotation,outfile_name=outfile_name, genome=Hsapiens,outdir=outfile_path) Output novel junction peptides... Error in $<-.data.frame(*tmp*, "V4", value = "+") : replacement has 1 row, data has 0 In addition: Warning messages: 1: In rbind(...) : number of columns of result is not a multiple of vector length (arg 1) 2: In rbind(...) : number of columns of result is not a multiple of vector length (arg 1) 3: In rbind(...) : number of columns of result is not a multiple of vector length (arg 1)

gwajnberg commented 4 years ago

any ideas?

Jokendo-collab commented 4 years ago

@gwajnberg I also had that problem and I was not able to solve it

gwajnberg commented 4 years ago

Hello javanOkendo, I found the problem there is a function in the end of the script called Tab2Range, which tries to replace if you have a row into your tab File with a "." instead of + or - ... but instead of using a condition the function just try to create a tmp variable, which if you don't have any "." it will be empty...here is it

Tab2Range function(tabfile) {
options(stringsAsFactors=FALSE) jun <- read.table(tabfile, sep='\t', header=F, stringsAsFactors=F)

replace the strand "." with "+" and "-".

tmp<-jun[jun$V4==".",] tmp_forward<-tmp tmp_reverse<-tmp tmp_forward$V4<-"+" tmp_reverse$V4<-"-" jun<-rbind(jun[jun$V4!=".",],tmp_forward,tmp_reverse)

part1_sta <- as.numeric(jun[,'V2'])+1-59 part1_end <- as.numeric(jun[,'V2'])+1 part2_sta <- as.numeric(jun[,'V3'])+1 part2_end <- as.numeric(jun[,'V3'])+1+59

junction <- data.frame(chr=jun[, 'V1'], id="NULL", start=jun[, 'V2'], end=jun[,'V3'], cov=255, strand=jun$V4, part1_len=0, part2_len=0, part1_sta, part1_end, part2_sta, part2_end) if('chrM' %in% junction$chr) junction <- junction[-which(junction$chr=='chrM'), ] if('MT' %in% junction$chr) junction <- junction[-which(junction$chr=='MT'), ]

junRange <- GRanges(seqnames=junction$chr, ranges=IRanges(start=junction$part1_end, end=junction$part2_sta), strand=junction$strand, id=junction$id, cov=junction$cov, part1_len=junction$part1_len, part2_len=junction$part2_len, part1_sta=junction$part1_sta, part1_end=junction$part1_end, part2_sta=junction$part2_sta, part2_end=junction$part2_end)

}

Jokendo-collab commented 4 years ago

@gwajnberg did this work with your data? if Yes can you share the workflow which you used so that I can give it a trial on my data? You can send the attachment to oknjav001@myuct.ac.za

gwajnberg commented 4 years ago

Hello javan, I will send you for sure!! I will just clean the script lol, because i went through line by line of each PGA's functions

Jokendo-collab commented 4 years ago

I will appreciate is and I look forward to it. Thanks for the good gesture

On Thu, Jun 4, 2020 at 5:53 PM gwajnberg notifications@github.com wrote:

Hello javan, I will send you for sure!! I will just clean the script lol, because i went through line by line of each PGA's functions

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/wenbostar/PGA/issues/13#issuecomment-638944926, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGJ34OY3A4OXC54YV7VQFVDRU67N7ANCNFSM4MTZLVYA .

Jokendo-collab commented 4 years ago

@gwajnberg did you manage to make the script work?

wenbostar / PGA

Error with dbCreator #13

fileformat=VCFv4.1

source=VarScan2

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

FILTER=

FILTER=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

CHROM POS ID REF ALT QUAL

replace the strand "." with "+" and "-".