wenbostar / PGA

PGA: a tool for ProteoGenomics Analysis
http://wenbostar.github.io/PGA/

Customized database construction for de novo transcript assembly result #10

Closed shanzida45670 closed 5 years ago

shanzida45670 commented 5 years ago

Hello,

I am a graduate bioinformatics student at Hood College, and I am working with the Bioconductor package PGA to identify novel peptides from a human brain sample.

I prepared the customized database without genome information, following the approach "Based on the result from de novo assembly of RNASeq data without a reference genome".

I ran the following code:

transcript_seq_file <- system.file("extdata/input", "Trinity.fasta", package="PGA")
outdb <- createProDB4DenovoRNASeq(infa=transcript_seq_file, outfile_name = "denovo")

valid models = 61

Unique models = 51

Estimated false positives = 3 +/- 2

I ran all of the post-processing code successfully, but I couldn't generate the report with reportGear.

When I try to run it, I get this message: SNV(DB) didn't exist!

I don't understand at which step I should include this SNV database. Please help me.

wenbostar commented 5 years ago

Hi @shanzida45670, the current function reportGear requires SNV-derived variant peptide identification results to be present in your result folder in order to generate the report.

shanzida45670 commented 5 years ago

To get the HTML report, I used the following code:

reportGear(parser_dir = "parser_outdir", tab_dir = outfile_path, report_dir = "report")

In parser_outdir I have three files: 1. protein.txt, 2. peptide.txt and 3. Pga-rawPSMs.txt. Now I am confused about what should be in tab_dir = outfile_path, because to build the customized database I used the following code, where the annotation_path and outfile_name arguments were absent:

outdb <- createProDB4DenovoRNASeq(infa=transcript_seq_file, outfile_name = "denovo")

instead of the following code:

dbfile <- dbCreator(gtfFile=gtffile,vcfFile=vcffile,bedFile=bedfile, annotation_path=annotation,outfile_name=outfile_name, genome=Hsapiens,outdir=outfile_path)

Please let me know how the current reportGear function works.

wenbostar commented 5 years ago

How did you perform the peptide identification?

wenbostar commented 5 years ago

The function reportGear does not support generating an HTML-based report for search results derived from a database generated by createProDB4DenovoRNASeq.

wenbostar commented 5 years ago

If you want to visualize the identification results in this case, I can help you use another tool, PDV.

shanzida45670 commented 5 years ago

Thank you so much for your response.

A previous experiment by my professor identified 30,000 repetitive elements (REs) that are consistently transcribed in the human brain. Additionally, there are sequencing reads overlapping splice junctions between the REs and annotated exons of known genes, indicating that at least some of the expressed REs are novel exons in previously unannotated mRNA isoforms. Similar novel exons were also observed in data from a published mouse study that specifically sequenced RNAs bound to polyribosomes, indicating that they were being translated into protein. However, there is no evidence yet that proteins containing novel RE exons are actually produced in humans. So, the goal of my project is to determine whether any of these putative RE exons are expressed in human cells, using information from various protein databases.

I used an MGF file from the PRIDE database, from a human brain sample, to run the MS/MS search.

shanzida45670 commented 5 years ago

Please help me to visualize the result.

shanzida45670 commented 5 years ago

My professor did this experiment 8 years ago, so I don't have VCF, BED or GTF files.

shanzida45670 commented 5 years ago

She only provided me with an RNA-Seq data set that includes potential RE exons with putative three-frame translations.

shanzida45670 commented 5 years ago

I used the FASTA file of these exon sequences as input and converted it to the denovo FASTA text file using the function createProDB4DenovoRNASeq.

wenbostar commented 5 years ago

In this case, you can first build a customized database (the human reference protein database + the protein database derived from the RNA-Seq data, generated by the function createProDB4DenovoRNASeq). Then you can use MS-GF+ to search your MS/MS data against this customized database. The mzID file generated by MS-GF+ can be visualized using PDV.
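The search step might look roughly like the sketch below. It assumes MS-GF+ has been downloaded as MSGFPlus.jar and that Java is installed. The file names spectra.mgf, combined.fasta and result.mzid are hypothetical placeholders, not files produced earlier in this thread. Precursor tolerance, enzyme, instrument and modification settings are left at their defaults here and should be set to match the actual experiment.

```shell
# Hypothetical paths (assumptions for illustration); adjust to your own setup.
MSGF_JAR="${MSGF_JAR:-MSGFPlus.jar}"
SPECTRA="${SPECTRA:-spectra.mgf}"
DATABASE="${DATABASE:-combined.fasta}"
OUTPUT="${OUTPUT:-result.mzid}"

# Run the search only if Java and all input files are available.
if command -v java >/dev/null 2>&1 && [ -f "$MSGF_JAR" ] \
   && [ -f "$SPECTRA" ] && [ -f "$DATABASE" ]; then
    # -s: spectrum file, -d: protein database, -o: mzIdentML output,
    # -tda 1: search a target-decoy database so an FDR can be estimated
    if java -Xmx4G -jar "$MSGF_JAR" -s "$SPECTRA" -d "$DATABASE" -o "$OUTPUT" -tda 1; then
        search_status="done"
    else
        search_status="failed"
    fi
else
    echo "Skipping search: need java, $MSGF_JAR, $SPECTRA and $DATABASE" >&2
    search_status="skipped"
fi
echo "search status: $search_status"
```

The resulting .mzid file, together with the spectrum file, is what would then be loaded into PDV.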

shanzida45670 commented 5 years ago

Thank you so much for your suggestions. I tried, but I didn't find any R function that directly takes these two arguments to build the customized database. I used the following function instead:

library(customProDB)
PrepareAnnotationRefseq(genome='hg19', CDSfasta, pepfasta, annotation_path,
                        dbsnp = NULL, transcript_ids=transcript_ids,
                        splice_matrix=FALSE, ClinVar=FALSE)

Please let me know if you know any other R function for creating the database.

Thanks for your time. Shanzida.

wenbostar commented 5 years ago

You can follow the instructions here to download the reference protein database. Then combine the two databases.

shanzida45670 commented 5 years ago

Thank you for your reply. I have downloaded the human reference protein database and successfully run the PrepareAnnotationRefseq2 function. Could you please let me know which function I should use now to build a customized database by combining the two databases (the human reference protein database + the protein database derived from the RNA-Seq data, generated by the function createProDB4DenovoRNASeq)?

shanzida45670 commented 5 years ago

Hi, please let me know which function I should use to combine the two databases. I am stuck at this step and can't go any further. Please give me some suggestions.

wenbostar commented 5 years ago

On Linux or Mac, you can use the command "cat" to combine two databases. For example:

cat db1.fasta db2.fasta >combined.fasta
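
One thing to watch for when concatenating: if the first file does not end with a newline, its last sequence line gets fused with the first header of the second file. A minimal sketch that guards against this is shown below. The two tiny demo files and their entry names are made up for illustration; replace db1.fasta and db2.fasta with the reference database and the createProDB4DenovoRNASeq output.

```shell
# Create two tiny demo FASTA files so the sketch runs on its own
# (hypothetical content; use your real databases instead).
printf '>ref1\nMKT\n>ref2\nGAV' > db1.fasta     # note: no trailing newline
printf '>denovo1\nPEPTIDE\n' > db2.fasta

# awk prints every input line followed by a newline, so the missing final
# newline in db1.fasta cannot glue "GAV" onto the next header line.
awk '1' db1.fasta db2.fasta > combined.fasta

# Sanity check: the number of headers should equal the sum of both inputs.
grep -c '^>' combined.fasta    # prints 3
```

If the header count of the combined file is lower than the sum of the two inputs, some entries were probably merged during concatenation.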

wenbostar commented 5 years ago

Another option for you is our PepQuery web server.

shanzida45670 commented 5 years ago

Thanks a lot. I hope this will help.

wenbostar commented 5 years ago

If you still have questions about this issue, please re-open it.