wenbostar / PGA

PGA: a tool for ProteoGenomics Analysis
http://wenbostar.github.io/PGA/
7 stars 10 forks source link

Reverse peptides are not from reverse complement strands. #4

Closed kimthaodao closed 5 years ago

kimthaodao commented 5 years ago

Hi,

We used the protein database generated from PGA and found out that the 3 reverse reading frames were not from reverse complement strands of the transcriptome, but from the translating contigs backwards.

For example, I have this nucleotide sequence and want to find its 6-frame translation:

TRINITY_DN125758_c12_g1_short GGTCAAGGTAATAAAGGTCAAGGTGAAATATCAAAAAGGTAATAAAAAAAAACCACCATTTTCTCAGAAACCCCTTGAGCTACAGTCACCAAATTTAT

I ran the code:

short <- "/Users/kim/Desktop/sample_short.fasta" outdb <- createProDB4DenovoRNASeq(infa=short, bool_use_3frame = FALSE, bool_get_longest = FALSE, outfile_name = "short_6_frame")

And this was the result:

DNO|NTX1|TRINITY_DN125758_c12_g1_short|+|F2|1 VKVIKVKVKYQKGNKKKPPFSQKPLELQSPNL

REV#DNO|NTX1|TRINITY_DN125758_c12_g1_short|+|F2|1

LNPSQLELPKQSFPPKKKNGKQYKVKVKIVKV

There were two problems:

TRINITY_DN125758_c12_g1_short_reverse_complement ATAAATTTGGTGACTGTAGCTCAAGGGGTTTCTGAGAAAATGGTGGTTTTTTTTTATTACCTTTTTGATATTTCACCTTGACCTTTATTACCTTGACC

REV_frame2 -IW-L-LKGFLRKWWFFFITFLIFHLDLYYLD

Please let me know how to fix this.

Thanks a lot.

Shawn-Xu commented 5 years ago

@kimthaodao Happy to see your interest in PGA.

Q1 Generally, a nuclutide sequence translated by six-frame strategy will generate a lot of stop codons in the protein sequence. And then, the short amino acid sequence was split off whenever a stop codon was encountered. Thus, a nucleic acid sequence with 6-frame method will usually produce more than six short protein sequences. Moreover, the "createProDB4DenovoRNASeq" function from PGA package contains a default value ( >=30 amino acids) to limit the minimum amino acid length. Therefore, you can only get these two (>=30 amino acids) instead of more peptides. I will add a new parameter to fix this in the future. Since this function is just a small feature of PGA, you can use the following script directly with a variable parameter “aalimit”to retain much more peptides.

https://github.com/Shawn-Xu/PPIP/blob/master/Auxtools/createProDB4DenovoRNASeq.R

Q2 Usually, an ideal decoy database should contain at least the same number of amino acids than the target database. However, a translation of reverse complement of the initial nucleotide sequence will produce a random number of stop codons in the protein sequence. And as mentioned earlier, the amino acid sequence is required to be cut off before each stop codon. This process cannot guarantee that target and decoy databases have the same size. Therefore, I do not recommend you to use the strategy of translating the reverse complement nucleotide sequence. Of course, if you still have some reason to do this, you can use the ‘reverseComplement’function from Biostrings package to treat the sequence in advance. Then use “createProDB4DenovoRNASeq.R” to translate this part of sequences.

kimthaodao commented 5 years ago

Hi Shawn-Xu,

Thank you for your reply. I have another question: Is it possible to change the default amino acids length of 30?

Kim

Shawn-Xu commented 5 years ago

@kimthaodao The “aalimit” parameter (in the https://github.com/Shawn-Xu/PPIP/blob/master/Auxtools/createProDB4DenovoRNASeq.R) is used to control the amino acid length. You can download it and source this script in the R console. And then using this function just like in PGA.

kimthaodao commented 5 years ago

@Shawn-Xu

Thank you for your reply. I modified the aalimit to 100 and source the script in R and then ran PGA again. These are my codes:

source("/Users/kim/Desktop/createProDB4DenovoRNASeq.R") denovo_100 <- "/Users/kim/Grad/Polar_lobe_project/trinity_genes_supertranscript.fasta"

outdb <- createProDB4DenovoRNASeq(infa="/Users/kim/Grad/Polar_lobe_project/trinity_genes_supertranscript.fasta",bool_use_3frame=FALSE,outmtab="novel_transcripts_ntx.tab",outfa="./denovo.fasta",bool_get_longest=TRUE,make_decoy=FALSE,decoy_tag="#REV#",aalimit=100, outfile_name="denovo_100")

....... Proteomic database construction ! ............

It worked but it did not say "writing successfully to file" as it did before. Also, it only gave out two files: one was a .fasta file and the other one is a .tab file. Before there was used to be another "_txFinder.fasta" that was used as the database for mass spec search. I was guessing the .fasta file would be the same as the "_txFinder.fasta". However, the name of the original RNA sequences was not retained in the header.

For example:

NTX44805 TITSEVKLDYHEESVTIHRQTDRRLESWVHWKKPRRTCRKESIGEAEENEEHDRRKESFTCISMQKFFFFKSVSLYIYAKFFFFFKSVSTVHDRRHISWSHGHGSLRPPHERLL

instead of this before:

DNO|NTX4557545|TRINITY_DN4621_c0_g1|-|F3|1 TVQTSNRFFFIATSIVIEWNPGCLSQGVHITAVQHHPFFVYVHFSIRWDGVYCACRIFYFCVYTQDLGLMSHSKD

This is a problem as we need to be able to know which RNA sequence the protein is translated from. Please let know what you think and how to fix it.

Kim

Shawn-Xu commented 5 years ago

@kimthaodao I am sorry to have confused you. The .fasta file (e.g. denovo.fasta) could be used for subsequent search process. In fact, all the additional information is now stored in the tab file (e.g. novel_transcripts_ntx.tab). That's because some search engines for mass spectrometry have some restrictions on the ID format or length, so I changed them to the short ID with uniform indexs to avoid being too long in the recent version.
Actually, the original IDs from PGA code were concatenated by vertical bars(|) with the imformation from *.tab file. You can consult this file to trace back all the information you need. 111

Additionally, I see that you set aalimit to 100, which will cause translated sequences shorter than this limit to be filtered out. Therefore, the larger this value is set, the more candidate sequences will be screened out. Normally, we don't set a threshold too large unless there is a special requirement.

Moreover, as the outmtab file was writed with "append" mode, it is better to delete this file with unlink function in R Whenever you rerun the script. image