wenbostar / PGA

PGA: a tool for ProteoGenomics Analysis
http://wenbostar.github.io/PGA/
7 stars 10 forks source link

PrepareAnnotationEnsembl2 error with enable snp and related dbCreator error #22

Closed wanghao1991217 closed 3 years ago

wanghao1991217 commented 3 years ago

I use the PrepareAnnotationEnsembl2 function to Preparing annotation files with Ensembl release 101. There's error with enable the snp.

ensembl <- useMart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl",
                   host="aug2020.archive.ensembl.org", path="/biomart/martservice",
                   archive=FALSE)

PrepareAnnotationEnsembl2(mart = ensembl, "D:/data_analysis/KIRC_multiomics/KIRC_rna_seq_rsem/PGA/ensembl_annotation",
                         splice_matrix = T,
                         dbsnp = "snp151", COSMIC = FALSE)

Prepare gene/transcript/protein id mapping information (ids.RData) ...  done
Build TranscriptDB object (txdb.sqlite) ... 
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... FAILED! (=> skipped)
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
 done
Prepare exon annotation information (exon_anno.RData) ...  done
Prepare protein coding sequence (procodingseq.RData)...  done
Prepare protein sequence (proseq.RData) ...  done
Prepare dbSNP information (dbsnpinCoding.RData) ... Error in function (type, msg, asError = TRUE)  : 
  transfer closed with outstanding read data remaining
此外: Warning message:
In .infer_chrominfo_from_transcripts_and_splicings(transcripts$tx_chrom,  :
  chromosome lengths and circularity flags are not available for this TxDb object

Then, I skip this error and run dbCreator.

##input files
vcf = paste(getwd(),list.files(getwd(),pattern = "\\.vcf$"),sep = "/")
tab = paste(getwd(),list.files(getwd(),pattern = "\\.tab$"),sep = "/")
gtf = paste(getwd(),list.files(getwd(),pattern = "\\.gtf$"),sep = "/")

##dbcreator
dbCreator(gtfFile = gtf, vcfFile  = vcf, tabFile = tab, 
          annotation_path = "D:/data_analysis/KIRC_multiomics/KIRC_rna_seq_rsem/PGA/ensembl_annotation",
          outfile_name = "a382607", outdir = "D:/data_analysis/KIRC_multiomics/KIRC_rna_seq_rsem/PGA/out",
          genome=Hsapiens,debug = T)

Output abberant protein FASTA file caused by short INDEL...  done
Output variation table and variant protein sequence caused bySNVs... Error in h(simpleError(msg, call)) : 
  在为'unique'函数选择方法时评估'x'参数出了错: 在为'substr'函数选择方法时评估'x'参数出了错: 在为'translate'函数选择方法时评估'x'参数出了错: input must be a single non-NA string   
此外: There were 19 warnings (use warnings() to see them)

I wonder if this error is related to the failed construction of SNP database with the PrepareAnnotationEnsembl2 function. Here is my workflow of analysis of vcf, transcript.gtf, and junction.tab.

####hisat2 build index
##extract exon 
python /home/rchenubuntu/software/hisat2-2.2.1/extract_exons.py /media/rchenubuntu/5618DC8B18DC6B8D/database/genome_and_annotation/ensembl_re101/Homo_sapiens.GRCh38.101.gtf > /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_index/genome.exon
##extract splice site
python /home/rchenubuntu/software/hisat2-2.2.1/extract_splice_sites.py /media/rchenubuntu/5618DC8B18DC6B8D/database/genome_and_annotation/ensembl_re101/Homo_sapiens.GRCh38.101.gtf > /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_index/genome.ss
##build index
hisat2-build -p 48 /media/rchenubuntu/5618DC8B18DC6B8D/database/genome_and_annotation/ensembl_re101/Homo_sapiens.GRCh38.dna.primary_assembly.fa --exon /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_index/genome.exon --ss /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_index/genome.ss /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_index/ensembl101

###hisat2 align     JUNCTION.tab 
hisat2 -x /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_index/ensembl101 -p 40 -1 /media/rchenubuntu/5618DC8B18DC6B8D/clean_data/A_382607_1_clean.fq.gz -2 /media/rchenubuntu/5618DC8B18DC6B8D/clean_data/A_382607_2_clean.fq.gz -S /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_output/A382607.sam --novel-splicesite-outfile /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_output/A382607.tab

##samtools view 
samtools view -@ 46 -S -b /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_output/A382607.sam > /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_output/A382607.bam
##samtools sort
samtools sort -@ 46 /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_output/A382607.bam -O /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_output/A382607_sorted.bam

##cufflinks     TRANSCRIPT.gtf
cufflinks /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_output/A382607_sorted.bam --library-type fr-firststrand -p 46 -g /media/rchenubuntu/5618DC8B18DC6B8D/database/genome_and_annotation/ensembl_re101/Homo_sapiens.GRCh38.101.gtf -o /media/rchenubuntu/5618DC8B18DC6B8D/cufflinks_output/xzm

###call SNPs and short indels
samtools mpileup -uf /media/rchenubuntu/5618DC8B18DC6B8D/database/genome_and_annotation/ensembl_re101/Homo_sapiens.GRCh38.dna.primary_assembly.fa /media/rchenubuntu/5618DC8B18DC6B8D/hisat2_output/A382607_sorted.bam | bcftools call -mv > /media/rchenubuntu/5618DC8B18DC6B8D/bcf_call_output/A382607.vcf
Shawn-Xu commented 3 years ago

Hi,

For the first error, I suggest you disable the dbsnp and COSMIC options (just set them with FALSE). And the second error may be caused by the incomplete annotations of PrepareAnnotationEnsembl2.

wanghao1991217 commented 3 years ago

Hi,

For the first error, I suggest you disable the dbsnp and COSMIC options (just set them with FALSE). And the second error may be caused by the incomplete annotations of PrepareAnnotationEnsembl2.

I have disable the snp and COSMIC option, but the same error still there with dbCreator.

ensembl <- useMart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl",
                   host="aug2020.archive.ensembl.org", path="/biomart/martservice",
                   archive=FALSE)

PrepareAnnotationEnsembl2(mart = ensembl, "/media/rchenubuntu/5618DC8B18DC6B8D/KIRC_R/PGA/ensembl_annotation",
                         splice_matrix = T,
                         dbsnp = NULL, COSMIC = FALSE)

Prepare gene/transcript/protein id mapping information (ids.RData) ...  done
Build TranscriptDB object (txdb.sqlite) ... 
Ensembl site unresponsive, trying uswest mirror
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... FAILED! (=> skipped)
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
 done
Prepare exon annotation information (exon_anno.RData) ...  done
Prepare protein coding sequence (procodingseq.RData)...  done
Prepare protein sequence (proseq.RData) ...  done
Prepare exon splice information (splicemax.RData) ...  done
Warning messages:
1: In .infer_chrominfo_from_transcripts_and_splicings(transcripts$tx_chrom,  :
  chromosome lengths and circularity flags are not available for this TxDb object
2: closing unused connection 3 () 

dbCreator(gtfFile = gtf, vcfFile  = vcf, tabFile = tab, 
          annotation_path = "/media/rchenubuntu/5618DC8B18DC6B8D/KIRC_R/PGA/ensembl_annotation",
          outfile_name = "dbcreator_ensembl", outdir = "/media/rchenubuntu/5618DC8B18DC6B8D/KIRC_R/PGA/out",
          genome=Hsapiens)

Output abberant protein FASTA file caused by short INDEL...  done
Output variation table and variant protein sequence caused bySNVs... Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'unique': error in evaluating the argument 'x' in selecting a method for function 'substr': error in evaluating the argument 'x' in selecting a method for function 'translate': input must be a single non-NA string
In addition: There were 19 warnings (use warnings() to see them)
traceback()
19: h(simpleError(msg, call))
18: .handleSimpleError(function (cond) 
    .Internal(C_tryCatchHelper(addr, 1L, cond)), "error in evaluating the argument 'x' in selecting a method for function 'substr': error in evaluating the argument 'x' in selecting a method for function 'translate': input must be a single non-NA string", 
        base::quote(h(simpleError(msg, call))))
17: h(simpleError(msg, call))
16: .handleSimpleError(function (cond) 
    .Internal(C_tryCatchHelper(addr, 1L, cond)), "error in evaluating the argument 'x' in selecting a method for function 'translate': input must be a single non-NA string", 
        base::quote(h(simpleError(msg, call))))
15: h(simpleError(msg, call))
14: .handleSimpleError(function (cond) 
    .Internal(C_tryCatchHelper(addr, 1L, cond)), "input must be a single non-NA string", 
        base::quote(.charToXString(seqtype, x, start, end, width)))
13: stop(wmsg("input must be a single non-NA string"))
12: .charToXString(seqtype, x, start, end, width)
11: XString("DNA", x, start = start, width = nchar)
10: XString("DNA", x, start = start, width = nchar)
9: DNAString(paste0("ATG", x))
8: translate(DNAString(paste0("ATG", x)))
7: substr(as.character(translate(DNAString(paste0("ATG", x)))), 
       2, 2)
6: unique(substr(as.character(translate(DNAString(paste0("ATG", 
       x)))), 2, 2))
5: FUN(X[[i]], ...)
4: lapply(vcodesDNAS, function(x) unique(substr(as.character(translate(DNAString(paste0("ATG", 
       x)))), 2, 2)))
3: lapply(vcodesDNAS, function(x) unique(substr(as.character(translate(DNAString(paste0("ATG", 
       x)))), 2, 2)))
2: aaVariation(postable_snv, codingseq)
1: dbCreator(gtfFile = gtf, vcfFile = vcf, tabFile = tab, annotation_path = "/media/rchenubuntu/5618DC8B18DC6B8D/KIRC_R/PGA/ensembl_annotation", 
       outfile_name = "dbcreator_ensembl", outdir = "/media/rchenubuntu/5618DC8B18DC6B8D/KIRC_R/PGA/out", 
       genome = Hsapiens)

how to completely prepare ensembl annotations with PrepareAnnotationEnsembl2 ?

Shawn-Xu commented 3 years ago

The error of "FAILED to get the chrominfo" was caused by some bugs of dependent package. We have resolved this by adding some modified functions. Please install the latest PGA packakge from github and try once again.

wanghao1991217 commented 3 years ago

I have updated the PGA to 1.15.3, re-prepare the ensembl annotation and run dbCreator.

ensembl <- useMart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl",
+                    host="aug2020.archive.ensembl.org", path="/biomart/martservice",
+                    archive=FALSE)
> PrepareAnnotationEnsembl2(mart = ensembl, "/media/rchenubuntu/5618DC8B18DC6B8D/KIRC_R/PGA/ensembl_annotation",
+                          splice_matrix = T,
+                          dbsnp = NULL, COSMIC = FALSE)
Prepare gene/transcript/protein id mapping information (ids.RData) ...  done
Build TranscriptDB object (txdb.sqlite) ... 
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
 done
Prepare exon annotation information (exon_anno.RData) ...  done
Prepare protein coding sequence (procodingseq.RData)...  done
Prepare protein sequence (proseq.RData) ...  done
Prepare exon splice information (splicemax.RData) ...  done
> dbCreator(gtfFile = gtf, vcfFile  = vcf, tabFile = tab, 
+           annotation_path = "/media/rchenubuntu/5618DC8B18DC6B8D/KIRC_R/PGA/ensembl_annotation",
+           outfile_name = "dbcreator_ensembl", outdir = "/media/rchenubuntu/5618DC8B18DC6B8D/KIRC_R/PGA/out",
+           genome=Hsapiens)
Output abberant protein FASTA file caused by short INDEL...  done
Output variation table and variant protein sequence caused bySNVs... Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'unique': error in evaluating the argument 'x' in selecting a method for function 'substr': error in evaluating the argument 'x' in selecting a method for function 'translate': input must be a single non-NA string
In addition: There were 19 warnings (use warnings() to see them)
> traceback()
19: h(simpleError(msg, call))
18: .handleSimpleError(function (cond) 
    .Internal(C_tryCatchHelper(addr, 1L, cond)), "error in evaluating the argument 'x' in selecting a method for function 'substr': error in evaluating the argument 'x' in selecting a method for function 'translate': input must be a single non-NA string", 
        base::quote(h(simpleError(msg, call))))
17: h(simpleError(msg, call))
16: .handleSimpleError(function (cond) 
    .Internal(C_tryCatchHelper(addr, 1L, cond)), "error in evaluating the argument 'x' in selecting a method for function 'translate': input must be a single non-NA string", 
        base::quote(h(simpleError(msg, call))))
15: h(simpleError(msg, call))
14: .handleSimpleError(function (cond) 
    .Internal(C_tryCatchHelper(addr, 1L, cond)), "input must be a single non-NA string", 
        base::quote(.charToXString(seqtype, x, start, end, width)))
13: stop(wmsg("input must be a single non-NA string"))
12: .charToXString(seqtype, x, start, end, width)
11: XString("DNA", x, start = start, width = nchar)
10: XString("DNA", x, start = start, width = nchar)
9: DNAString(paste0("ATG", x))
8: translate(DNAString(paste0("ATG", x)))
7: substr(as.character(translate(DNAString(paste0("ATG", x)))), 
       2, 2)
6: unique(substr(as.character(translate(DNAString(paste0("ATG", 
       x)))), 2, 2))
5: FUN(X[[i]], ...)
4: lapply(vcodesDNAS, function(x) unique(substr(as.character(translate(DNAString(paste0("ATG", 
       x)))), 2, 2)))
3: lapply(vcodesDNAS, function(x) unique(substr(as.character(translate(DNAString(paste0("ATG", 
       x)))), 2, 2)))
2: aaVariation(postable_snv, codingseq)
1: dbCreator(gtfFile = gtf, vcfFile = vcf, tabFile = tab, annotation_path = "/media/rchenubuntu/5618DC8B18DC6B8D/KIRC_R/PGA/ensembl_annotation", 
       outfile_name = "dbcreator_ensembl", outdir = "/media/rchenubuntu/5618DC8B18DC6B8D/KIRC_R/PGA/out", 
       genome = Hsapiens)
Shawn-Xu commented 3 years ago

Could you please provide a demo vcf file and all the annotations in the ensembl_annotation/ for me to debug? You can send it to me by xsh.skye@gmail.com.

Shawn-Xu commented 3 years ago

I have updated the PGA to 1.15.3, re-prepare the ensembl annotation and run dbCreator.

ensembl <- useMart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl",
+                    host="aug2020.archive.ensembl.org", path="/biomart/martservice",
+                    archive=FALSE)
> PrepareAnnotationEnsembl2(mart = ensembl, "/media/rchenubuntu/5618DC8B18DC6B8D/KIRC_R/PGA/ensembl_annotation",
+                          splice_matrix = T,
+                          dbsnp = NULL, COSMIC = FALSE)
Prepare gene/transcript/protein id mapping information (ids.RData) ...  done
Build TranscriptDB object (txdb.sqlite) ... 
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... OK
Download and preprocess the 'genes' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
 done
Prepare exon annotation information (exon_anno.RData) ...  done
Prepare protein coding sequence (procodingseq.RData)...  done
Prepare protein sequence (proseq.RData) ...  done
Prepare exon splice information (splicemax.RData) ...  done
> dbCreator(gtfFile = gtf, vcfFile  = vcf, tabFile = tab, 
+           annotation_path = "/media/rchenubuntu/5618DC8B18DC6B8D/KIRC_R/PGA/ensembl_annotation",
+           outfile_name = "dbcreator_ensembl", outdir = "/media/rchenubuntu/5618DC8B18DC6B8D/KIRC_R/PGA/out",
+           genome=Hsapiens)
Output abberant protein FASTA file caused by short INDEL...  done
Output variation table and variant protein sequence caused bySNVs... Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'unique': error in evaluating the argument 'x' in selecting a method for function 'substr': error in evaluating the argument 'x' in selecting a method for function 'translate': input must be a single non-NA string
In addition: There were 19 warnings (use warnings() to see them)
> traceback()
19: h(simpleError(msg, call))
18: .handleSimpleError(function (cond) 
    .Internal(C_tryCatchHelper(addr, 1L, cond)), "error in evaluating the argument 'x' in selecting a method for function 'substr': error in evaluating the argument 'x' in selecting a method for function 'translate': input must be a single non-NA string", 
        base::quote(h(simpleError(msg, call))))
17: h(simpleError(msg, call))
16: .handleSimpleError(function (cond) 
    .Internal(C_tryCatchHelper(addr, 1L, cond)), "error in evaluating the argument 'x' in selecting a method for function 'translate': input must be a single non-NA string", 
        base::quote(h(simpleError(msg, call))))
15: h(simpleError(msg, call))
14: .handleSimpleError(function (cond) 
    .Internal(C_tryCatchHelper(addr, 1L, cond)), "input must be a single non-NA string", 
        base::quote(.charToXString(seqtype, x, start, end, width)))
13: stop(wmsg("input must be a single non-NA string"))
12: .charToXString(seqtype, x, start, end, width)
11: XString("DNA", x, start = start, width = nchar)
10: XString("DNA", x, start = start, width = nchar)
9: DNAString(paste0("ATG", x))
8: translate(DNAString(paste0("ATG", x)))
7: substr(as.character(translate(DNAString(paste0("ATG", x)))), 
       2, 2)
6: unique(substr(as.character(translate(DNAString(paste0("ATG", 
       x)))), 2, 2))
5: FUN(X[[i]], ...)
4: lapply(vcodesDNAS, function(x) unique(substr(as.character(translate(DNAString(paste0("ATG", 
       x)))), 2, 2)))
3: lapply(vcodesDNAS, function(x) unique(substr(as.character(translate(DNAString(paste0("ATG", 
       x)))), 2, 2)))
2: aaVariation(postable_snv, codingseq)
1: dbCreator(gtfFile = gtf, vcfFile = vcf, tabFile = tab, annotation_path = "/media/rchenubuntu/5618DC8B18DC6B8D/KIRC_R/PGA/ensembl_annotation", 
       outfile_name = "dbcreator_ensembl", outdir = "/media/rchenubuntu/5618DC8B18DC6B8D/KIRC_R/PGA/out", 
       genome = Hsapiens)

Hi,

This error was caused by the multiple alternative alleles in ALT column. image We suggest to divide it to two lines by modifying vcf file in advance.

Please convert this format from image to image

The perl command line for your information: cat old.vcf |perl -F'\t' -lane 'if(/^#/){print;next};if( $F[4] =~/,/){my @arr=split /,/,$F[4]; map{$F[4]=$_; print join("\t",@F)}@arr}else{print }'>new.vcf

And we have fixed another minor bug. Please install the latest PGA packakge (version 1.15.4) from github and try once again.

wanghao1991217 commented 3 years ago

@Shawn-Xu The problem is solved. Thank you! However, I found the txFinder.fasta file which has been just generated from one sample is very large to more than 400 mb. And I checked the file manually and found many duplicate entries with same header and same amino acid sequence. 0901

Shawn-Xu commented 3 years ago

@Shawn-Xu The problem is solved. Thank you! However, I found the txFinder.fasta file which has been just generated from one sample is very large to more than 400 mb. And I checked the file manually and found many duplicate entries with same header and same amino acid sequence. 0901

I have resolved this problem in the latest PGA packakge (version 1.15.4). Alternatively, you could remove these redundant entries by yourself.

wanghao1991217 commented 3 years ago

@Shawn-Xu Thank you for help.

wenbostar commented 3 years ago

Please reopen this issue if you still have questions about it.