Closed noepozzan closed 2 years ago
Hi! The qc step uses known coding genes. You need to input coding annotations. The noncoding analysis is in the predict step.
I see, didn't realize that. Thanks a lot for the help! Btw, I opened another issue since I have been using ribotish quite extensively.
Getting back to you here because even though ribotish predict works with the following parameters: "--seq --aaseq --tpth 1 --fpth 1 --fspth 1 -v --longest", I do not get a single prediction out, when predicting on the RNAcentral gtf.. The pred(_all).txt files only have a header and nothing else. Do you have any idea why?
I am asking this because I am specifically searching for these new ORFs that I see on IGV. There are clearly some reads lying around these ORFs and they even look quite periodic. Why is ribotish not identifying any of them, even with these permissive parameters? Thank you!
What about the quality result? What is the content in the para.py file?
This is one of them, they look like this:
offdict = {25: 7, 26: 7, 27: 10, 28: 7, 29: 7, 31: 7, 'm0': {25: 7, 26: 10, 28: 13, 31: 8, 33: 16}}
Looks kind of strange, or what do you think?
It's strange. What about the pdf figure?
The reads length s are too short. The lengths should be around 29nt. How did you trim the reads?
I got rid of the first 5 bases and then clipped at the adapter. Would it help you if I pasted some reads here? Thanks btw, for really trying to help me!
That's the reason. 5+7=12 is the most common offset value. Do not cut the first 5 bases.
I see. I'll give it a try. Thanks again!
SRR1333393.sorted_indexed_qual.pdf
What do you think of this? Definitely looks better, no? Still doesn't find those regions that I see some reads and periodicity in..
It's much better now. Are there any prediction results?
Yes, there are results. I should specify: The quality figure I posted above comes actually from a different experiment from CHX-treated riboseq data that I took from SRA trying to reproduce some paper's results. In that paper, they found 10 short ORFs. But using ribotish I only find 4 of them. What's also strange is that I found the same 4 ORFs even before you told me to correct the offset. So this doesn't seem to matter too much. The problem is that I have no idea why ribotish doesn't see them. These regions of interest aren't even in the predict outputs. I expected them to be in there, just maybe not with a passing p-value. Do you have any other idea why this may be?
Run predict with --transprofile option can output signal data on all transcripts. Check if there is any signal in your transcript.
At 2022-11-28 23:35:41, "Noè Pozzan" @.***> wrote:
Yes, there are results. I should specify: The quality figure I posted above comes actually from a different experiment from CHX-treated riboseq data that I took from SRA trying to reproduce some paper's results. In that paper, they found 10 short ORFs. But using ribotish I only find 4 of them. What's also strange is that I found the same 4 ORFs even before you told me to correct the offset. So this doesn't seem to matter too much. The problem is that I have no idea why ribotish doesn't see them. These regions of interest aren't even in the predict outputs. I expected them to be in there, just maybe not with a passing p-value. Do you have any other idea why this may be?
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you modified the open/close state.Message ID: @.***>
Hey @zhpn1024,
Sorry for the late reply..
I ran ribotish predict
with the --transprofile
option. Pretty handy!
I see signal in the transcripts im interested in, but not for the exact sequence I am interested in.
So, as an example this paper I am trying to replicate has the following: They found ribosome profiling and mass spec evidence for the following peptide: ATAAGLSAGLTR. They annotated this with id = ENST00000283195. Ribotish finds many things with an id of ENST00000283195, but not this exact peptide I'd be interested in. Below is a screenshot from IGV where I plot the full footprints and only their A-site for periodicity of the region of this peptide. You should be able to see it in the upper translation frame.
Do the reads not look good enough to be considered by ribotish?
Does it make sense what I am telling you? And thanks again for your help!
How do you generate the A site profile? From the A site result, the 394 data support your peptide, while 393 data supports translation in another frame (bottom). My transprofile results are P site profile. What are the signals in this region?
Good point. The A site profiles in the screenshot above were actually separately derived using a python script from transcriptome-aligned reads. The 393 is actually CHX-treated, while 394 is LTM-treated. Not sure how that should make a difference for periodicity, though...
Looking into the transprofile file, there is a lot in there for the entry with id = ENST00000283195. I only pasted about half of it due to the length.
ENSG00000153201 ENST00000283195 RANBP2 {} {48:1, 63:2, 64:1, 65:1, 66:22, 69:12, 72:1, 79:3, 135:3, 137:2, 147:1, 154:2
, 156:5, 176:1, 285:5, 288:1, 294:3, 295:6, 345:2, 346:6, 348:1, 351:4, 354:2, 360:1, 462:1, 463:1, 468:2, 471:5, 480:1, 490:
2, 642:6, 645:3, 648:2, 651:11, 652:1, 653:3, 672:4, 673:3, 690:2, 717:5, 718:2, 720:2, 721:1, 726:1, 727:1, 729:19, 730:1, 7
32:4, 733:6, 734:1, 735:21, 736:3, 737:3, 738:6, 888:3, 891:7, 903:16, 904:1, 912:1, 913:1, 924:1, 927:1, 928:1, 930:1, 933:5
0, 934:9, 1029:4, 1033:1, 1035:1, 1042:2, 1047:4, 1048:1, 1050:1, 1221:2, 1225:3, 1248:1, 1251:1, 1272:1, 1344:1, 1347:2, 134
8:3, 1350:1, 1362:3, 1368:1, 1374:5, 1375:1, 1377:5, 1379:1, 1380:2, 1383:1, 1385:1, 1386:2, 1392:1, 1509:1, 1512:1, 1513:1,
1527:1, 1671:2, 1752:4, 1755:8, 1758:25, 1767:1, 1915:2, 1917:1, 1918:1, 2068:1, 2100:1, 2118:1, 2195:2, 2196:3, 2202:2, 2208
:1, 2211:1, 2269:1, 2271:1, 2283:1, 2292:1, 2478:2, 2541:1, 2595:2, 2616:2, 2640:1, 2670:5, 2814:2, 2913:1, 2963:2, 2964:3, 2
979:3, 3026:2, 3039:2, 3077:1, 3081:3, 3090:2, 3102:1, 3111:1, 3135:1, 3165:2, 3247:1, 3282:2, 3312:1, 3354:1, 3355:1, 3360:3
, 3435:7, 3474:2, 3510:1, 3519:1, 3554:1, 3555:1, 3557:1, 3558:2, 3561:1, 3567:3, 3573:1, 3576:1, 3579:6, 3585:1, 3588:1, 359
1:2, 3594:4, 3597:2, 3600:3, 3612:3, 3615:1, 3618:1, 3633:4, 3657:2, 3660:10, 3663:10, 3684:2, 3687:1, 3699:1, 3711:3, 3738:2, 3753:4, 3754:1, 3755:1, 3759:2, 3765:6, 3774:3, 3825:1, 3830:1, 3834:1, 3837:5, 3840:2, 3864:3, 3871:2, 3909:1, 3918:6, 3945:1, 3981:1, 3984:4, 4042:3, 4090:1, 4098:2, 4116:3, 4119:1, 4131:1, 4138:1, 4171:1, 4176:1, 4227:1, 4228:1, 4230:1, 4251:7, 4260:2, 4320:1, 4327:1, 4335:1, 4338:3, 4341:1, 4356:1, 4359:1, 4362:2, 4368:1, 4374:2, 4421:2, 4427:2, 4437:3, 4440:1, 4449:6, 4509:17, 4542:1, 4545:5, 4550:1, 4551:1, 4553:1, 4561:1, 4575:1, 4593:2, 4605:1, 4629:2, 4632:1, 4635:3, 4645:1, 4647:10, 4678:3, 4701:8, 4704:1, 4725:5, 4726:2, 4737:6, 4738:2, 4744:5, 4752:4, 4755:2, 4767:2, 4785:1, 4794:6, 4797:1, 4803:6 ...
Your peptide is starting at position 63. So the translation signal is right here. Why this region not reported may be because there's no start codon upstream. Did you use the --alt option?
I think your script for the A site may be not correct.
No, I did not use the --alt option before actually. Thanks for telling me about it. I am running the workflow with it right now. What makes you think that the script is wrong for inferring the A site?
And how exactly do you see that the peptide is starting at position 63?
Hi @zhpn1024
I have another question that I mentioned before.
So, I run ribotish quality
on the ensembl gtf file with --th 0.4
.
The quality output looks pretty good, the offsets for an exemplary file:
offdict = {25: 10, 26: 8, 28: 10, 29: 11, 31: 11, 32: 11, 33: 11, 'm0': {28: 7, 29: 11, 31: 11, 33: 8}}
But then I want to run ribotish predict
on another gtf file from RNAcentral with the following parameters: --alt --transprofile TRANSPROFILE --seq --aaseq -v --longest
.
This is the log:
Mon Dec 12 17:18:47 2022 Loading genome...
Making fasta index for Mus_musculus.GRCm39.dna.primary_assembly.fa...
No input TIS data!
Mon Dec 12 17:19:25 2022 Predicting...
1
10
11
12
13
14
15
16
17
18
19
2
3
4
5
6
7
8
9
GL456210.1
GL456211.1
GL456212.1
GL456219.1
GL456221.1
GL456233.2
GL456239.1
GL456354.1
GL456359.1
GL456360.1
GL456366.1
GL456367.1
GL456368.1
GL456370.1
GL456372.1
GL456378.1
GL456379.1
GL456381.1
GL456382.1
GL456383.1
GL456385.1
GL456387.1
GL456389.1
GL456390.1
GL456394.1
GL456396.1
JH584296.1
JH584297.1
JH584298.1
JH584299.1
JH584300.1
JH584301.1
JH584302.1
JH584303.1
JH584304.1
MT
MU069434.1
MU069435.1
X
Y
Mon Dec 12 18:36:50 2022 Checking overlap with known CDS..
Mon Dec 12 18:36:50 2022 BH correcting...
Mon Dec 12 18:36:50 2022 Done! Time used: 4682.619626283646 s.
Even though the log file seems alright, there is no output besides the header to both pred(_all).txt files.
I am really out of ideas on this one. Can you help? Thanks for your help!
No, I did not use the --alt option before actually. Thanks for telling me about it. I am running the workflow with it right now. What makes you think that the script is wrong for inferring the A site?
Based on the offset. The A-site is upstream of P-site, and the offset is around 9nt. But your A-sites are nearly the end of reads. The position 63 is by calculating the mapping of genome location to transcript position.
Are there any data in TRANSPROFILE file?
Based on the offset. The A-site is upstream of P-site, and the offset is around 9nt. But your A-sites are nearly the end of reads. The position 63 is by calculating the mapping of genome location to transcript position.
I see. Yes, the A sites have not been computed based on the genome alignments but based on the transcriptome alignment. I would need to look more closely into what happened there exactly.
Btw, I really thought that the --alt
option was by default enabled, haha... Read over that every time.
So, when running with the --alt
option enabled and some additional alternative start codons, the peptide mentioned above is actually identified. It is also annotated with id = ENST00000283195:
ENSG00000153201 ENST00000283195 RANBP2 protein_coding 2:109335960-109400357:+ GCG 24 9801 Extended 0 None 3.32963423346377e-92 N None None 7.647073795067235e-89 None 3258
I didn't paste the sequence because of its length. So ATAAGLSAGLTR is really just a small part of that.
Coming back to calculating which positions this peptide is represented by in the TRANSPROFILE; I still don't really get how the exact numbers are calculated. But ignoring this, this small part of the larger TRANSPROFILE does seem to indicate periodicity:
... 63:2, 64:1, 65:1, 66:22, 69:12 ...
I also made sure to include A as a possible start codon. Why is ATAAGLSAGLTR not called by itself?
Are there any data in TRANSPROFILE file?
For this other experiment, there is nothing in the TRANSPROFILE except the header, either. Why does ribotish take more than an hour to not output anything at all?
Glad to see the new results. It is curious that GCG is not in the alt start codons. What's your command? The transprofile data is position: count. The position is the 5' end of P-site. Maybe you run with the --longest, so the most upstream start codon is used. You can try --framebest instead. This option will test all possible start codons and choose the most significant one.
Are there any data in TRANSPROFILE file?
For this other experiment, there is nothing in the TRANSPROFILE except the header, either. Why does ribotish take more than an hour to not output anything at all?
It is curious. Are chromosome names consistent in bam, genome fa and gtf files? For example 'chr1' with '1'.
Glad to see the new results. It is curious that GCG is not in the alt start codons. What's your command? The transprofile data is position: count. The position is the 5' end of P-site. Maybe you run with the --longest, so the most upstream start codon is used. You can try --framebest instead. This option will test all possible start codons and choose the most significant one.
I am glad, too. Thank you!
Since the --alt
is described to only consider start codons differing in one nt from ATG, I added some that were relevant for the small peptides I am searching for:
--alt --altcodons TCT,GCC,GCG,GGA --transprofile TRANSPROFILE --seq --aaseq --tpth 1 --fpth 1 --fspth 1 -v --longest
Thanks, I got that part. But the peptide I am interested in is starting at chr2:109'336'000. If I see that there were reads at 63, 66, 69 in the TRANSPROFILE of the respective entry, then how is this related to the absolute genome coordinates?
I'll certainly give --framebest
a try, aswell!
It is curious. Are chromosome names consistent in bam, genome fa and gtf files? For example 'chr1' with '1'.
All of the files have the usual names like 1,2,3 .. MT, X, Y.
But then the RNAcentral gtf has a couple entries like this: JH584298.1 which are only in there.
And the bam file header shows some entries like this: GL000192.1 which are also in the genome.
Does that tell you anything?
Mapping transcript coordinate to genome coordinate need custom scripts. If you are fammilar with python, you can use my ribotish.zbio.gtf module. The Trans.genome_pos() function does this job.
The small chromosomes like JH584298.1 and GL000192.1 do not affect. Can you show the QC figure?
Mapping transcript coordinate to genome coordinate need custom scripts. If you are fammilar with python, you can use my ribotish.zbio.gtf module. The Trans.genome_pos() function does this job.
I see. May have a look at it but I'll just trust you on that for now. =)
The small chromosomes like JH584298.1 and GL000192.1 do not affect. Can you show the QC figure?
This is the QC output for one exemplary file: Sample_11_Concatenated_WT_30M_132443_RIBO_ALL.sorted_indexed_qual.pdf
The data quality is not good. There are very few reads in coding regions, and the frame pattern is not clear.
Well, alright... We thought it was good enough, but thanks for the hint.
Hi @zhpn1024
So, I agree that the data mentioned just above is not very deep. But now I have tried to predict on the RNAcentral gtf with other data.
This data is almost strangely periodic and deep, or not?: SRR1333393.sorted_indexed_qual.pdf
Still, all output files (including the TRANSPROFILE) are empty.
I ran the following:
ribotish predict -b file.bam -g homo_sapiens.GRCh38.gtf -f Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --ribopara SRR1333393.sorted_indexed.bam.para.py --alt --altcodons TCT,GCC,GCG,GGA --transprofile TRANSPROFILE --seq --aaseq -v --longest -o ribotish_pred.txt
This is the log:
Wed Dec 14 13:03:49 2022 Loading genome...
Making fasta index for Homo_sapiens.GRCh37.75.dna.primary_assembly.fa...
No input TIS data!
Wed Dec 14 13:04:30 2022 Predicting...
.
.
CHR_HSCHRX_2_CTG3
GL000008.2
.
.
Wed Dec 14 13:04:37 2022 Checking overlap with known CDS..
Wed Dec 14 13:04:37 2022 BH correcting...
Wed Dec 14 13:04:37 2022 Done! Time used: 48.64457869529724 s.
I don't get it. Why doesn't ribotish predict anything on this gtf? And why was it so fast if it took way later on that other data?
This data looks good. GRCh37 or 38? The version for gtf and genome fa is not consistent.
Hmm... As far as I am aware RNAcentral doesn't exist for anything pre 38.
But I'll just run it with all files from 38 to make sure.
Will get back to you, haha. Thanks!
Well, I tried running with all 3 files (ensemnbl gtf, RNAcentral gtf and genome.fa) from the same release and it didn't help. The output files and TRANSPROFILE are still empty.
ribotish predict -b SRR1333393.bam -g homo_sapiens.GRCh38.gtf -f Homo_sapiens.GRCh38.dna.primary_assembly.fa --ribopara SRR1333393.sorted_indexed.bam.para.py --alt --altcodons TCT,GCC,GCG,GGA --transprofile TRANSPROFILE --seq --aaseq -v --longest -o ribotish_pred.txt
The log pretty much looks the same as just above. Quality as discussed earlier should also definitely not be an issue.
Again, if I aligned with the normal ensembl gtf, then estimate offsets with the normal gtf but then predict with the RNAcentral gtf this should work?
I mean mapping with STAR works fine (~50% of uniquely mapped reads), also if I only map using the RNAcentral gtf.
In summary; I am really out of ideas why ribotish doesn't predict with the RNAcentral gtf. Do you have an idea still?
Try to predict with ensembl gtf?
well, that works. That gives the results we saw further up in the thread.
I found RNAcentral data in gff3 format. Where do you download th gtf format?
You are right. I downloaded it in gff3 format, but then transformed to gtf using gffread.
I'll run it with the gff3 to make sure..
Maybe ribotish failed to parse the converted gtf format.
Hi @zhpn1024
So I ran ribotish with the RNAcentral gff3 file. Again the output files (TRANSPROFILE included) are empty. But this time atleast the log is telling something new:
Fri Dec 16 14:18:23 2022 Loading genome...
Making fasta index for Homo_sapiens.GRCh38.dna.primary_assembly.fa...
No input TIS data!
Fri Dec 16 14:19:06 2022 Predicting...
.
.
.
Inconsistent trans strand: URS00000235AA_9606.241936 - +
15
Inconsistent trans strand: URS000027D32E_9606.274248 + -
16
Inconsistent trans strand: URS00008B4EA9_9606.317462 + -
17
Inconsistent trans strand: URS00001AB772_9606.350408 + -
.
.
.
Fri Dec 16 14:19:48 2022 Checking overlap with known CDS..
Fri Dec 16 14:19:48 2022 BH correcting...
Fri Dec 16 14:19:48 2022 Done! Time used: 85.3397159576416 s.
This is the command I ran. I had ribotish autodetect the gff3 file type. Is that a problem?
ribotish predict -b SRR1333393.bam -g homo_sapiens.GRCh38.gff3 -f Homo_sapiens.GRCh38.dna.primary_assembly.fa --ribopara SRR1333393.sorted_indexed.bam.para.py --alt --altcodons TCT,GCC,GCG,GGA --transprofile TRANSPROFILE --seq --aaseq -v --longest -o ribotish_pred.txt
Do you have an idea what the error is trying to tell? Thanks!
I just checked the RNAcentral gff3 format. There are two major problems that my program fail to parse. The first is the 'exon' in the third column is changed to 'noncoding_exon', resulted in no exon in any transcript. The second is no gene annotation to organize transcripts. I think using the RNAcentral bed annotation would be OK.
Hi @zhpn1024
It really finally worked when running ribotish with the RNAcentral bed file. Thanks so much for your help & persistence.
So, on our own samples, which are not that deeply sequenced (QC output in the thread above), none of the ORFs passed the thresholds, so there are only entries in the "_all.txt" files. Still, I am glad we found the error.
I ran:
ribotish predict -b Sample_11.bam -g mus_musculus.GRCm39.bed -f Mus_musculus.GRCm39.dna.primary_assembly.fa --ribopara Sample_11_Concatenated_WT_30M_132443_RIBO_ALL.sorted_indexed.bam.para.py --alt --transprofile TRANSPROFILE --seq --aaseq -v --longest -o ribotish_pred.txt
The log file has >600'000 warning messages, though: I just copied you the tail of the log where 2 of those messages are visible.
Wrong CDS annotation: URS000028A7C4_10090 URS000028A7C4_10090 0 73 73
Wrong CDS annotation: URS000061C2F2_10090 URS000061C2F2_10090 0 73 73
Mon Dec 19 13:04:30 2022 Checking overlap with known CDS..
Mon Dec 19 13:04:30 2022 BH correcting...
Mon Dec 19 13:04:30 2022 Done! Time used: 5537.979197025299 s.
Seems like ribotish still doesn't like the annotation file.. I guess that means I am losing information. Is there any way around this?
Hi there,
I am trying to run ribotish on regular riboseq data from our lab. Unfortunately, this already fails at the quality step. I mapped the reads to their reference (GRCm39) with a special annotation file from RNAcentral containing annotation for ncRNAs. I used the same gtf below.
I ran:
The error I get is:
The first couple gtf entries look the following way:
Do you have any idea where this error may be coming from?
Thanks for your help!