Closed kimthaodao closed 5 years ago
@kimthaodao Happy to see your interest in PGA.
Q1 Generally, a nuclutide sequence translated by six-frame strategy will generate a lot of stop codons in the protein sequence. And then, the short amino acid sequence was split off whenever a stop codon was encountered. Thus, a nucleic acid sequence with 6-frame method will usually produce more than six short protein sequences. Moreover, the "createProDB4DenovoRNASeq" function from PGA package contains a default value ( >=30 amino acids) to limit the minimum amino acid length. Therefore, you can only get these two (>=30 amino acids) instead of more peptides. I will add a new parameter to fix this in the future. Since this function is just a small feature of PGA, you can use the following script directly with a variable parameter “aalimit”to retain much more peptides.
https://github.com/Shawn-Xu/PPIP/blob/master/Auxtools/createProDB4DenovoRNASeq.R
Q2 Usually, an ideal decoy database should contain at least the same number of amino acids than the target database. However, a translation of reverse complement of the initial nucleotide sequence will produce a random number of stop codons in the protein sequence. And as mentioned earlier, the amino acid sequence is required to be cut off before each stop codon. This process cannot guarantee that target and decoy databases have the same size. Therefore, I do not recommend you to use the strategy of translating the reverse complement nucleotide sequence. Of course, if you still have some reason to do this, you can use the ‘reverseComplement’function from Biostrings package to treat the sequence in advance. Then use “createProDB4DenovoRNASeq.R” to translate this part of sequences.
Hi Shawn-Xu,
Thank you for your reply. I have another question: Is it possible to change the default amino acids length of 30?
Kim
@kimthaodao The “aalimit” parameter (in the https://github.com/Shawn-Xu/PPIP/blob/master/Auxtools/createProDB4DenovoRNASeq.R) is used to control the amino acid length. You can download it and source this script in the R console. And then using this function just like in PGA.
@Shawn-Xu
Thank you for your reply. I modified the aalimit to 100 and source the script in R and then ran PGA again. These are my codes:
source("/Users/kim/Desktop/createProDB4DenovoRNASeq.R") denovo_100 <- "/Users/kim/Grad/Polar_lobe_project/trinity_genes_supertranscript.fasta"
outdb <- createProDB4DenovoRNASeq(infa="/Users/kim/Grad/Polar_lobe_project/trinity_genes_supertranscript.fasta",bool_use_3frame=FALSE,outmtab="novel_transcripts_ntx.tab",outfa="./denovo.fasta",bool_get_longest=TRUE,make_decoy=FALSE,decoy_tag="#REV#",aalimit=100, outfile_name="denovo_100")
....... Proteomic database construction ! ............
It worked but it did not say "writing successfully to file" as it did before. Also, it only gave out two files: one was a .fasta file and the other one is a .tab file. Before there was used to be another "_txFinder.fasta" that was used as the database for mass spec search. I was guessing the .fasta file would be the same as the "_txFinder.fasta". However, the name of the original RNA sequences was not retained in the header.
For example:
NTX44805 TITSEVKLDYHEESVTIHRQTDRRLESWVHWKKPRRTCRKESIGEAEENEEHDRRKESFTCISMQKFFFFKSVSLYIYAKFFFFFKSVSTVHDRRHISWSHGHGSLRPPHERLL
instead of this before:
DNO|NTX4557545|TRINITY_DN4621_c0_g1|-|F3|1 TVQTSNRFFFIATSIVIEWNPGCLSQGVHITAVQHHPFFVYVHFSIRWDGVYCACRIFYFCVYTQDLGLMSHSKD
This is a problem as we need to be able to know which RNA sequence the protein is translated from. Please let know what you think and how to fix it.
Kim
@kimthaodao
I am sorry to have confused you. The .fasta file (e.g. denovo.fasta) could be used for subsequent search process. In fact, all the additional information is now stored in the tab file (e.g. novel_transcripts_ntx.tab).
That's because some search engines for mass spectrometry have some restrictions on the ID format or length, so I changed them to the short ID with uniform indexs to avoid being too long in the recent version.
Actually, the original IDs from PGA code were concatenated by vertical bars(|) with the imformation from *.tab file. You can consult this file to trace back all the information you need.
Additionally, I see that you set aalimit to 100, which will cause translated sequences shorter than this limit to be filtered out. Therefore, the larger this value is set, the more candidate sequences will be screened out. Normally, we don't set a threshold too large unless there is a special requirement.
Moreover, as the outmtab file was writed with "append" mode, it is better to delete this file with unlink function in R Whenever you rerun the script.
Hi,
We used the protein database generated from PGA and found out that the 3 reverse reading frames were not from reverse complement strands of the transcriptome, but from the translating contigs backwards.
For example, I have this nucleotide sequence and want to find its 6-frame translation:
I ran the code:
And this was the result:
There were two problems:
Please let me know how to fix this.
Thanks a lot.