zhpn1024 / ribotish

Ribo-seq TIS Hunter, predicting translation initiation sites and ORFs using riboseq data
http://dx.doi.org/10.1038/s41467-017-01981-8
GNU General Public License v3.0
27 stars 8 forks source link

No reads found! #29

Closed noepozzan closed 2 years ago

noepozzan commented 2 years ago

Hi there,

I am trying to run ribotish on regular riboseq data from our lab. Unfortunately, this already fails at the quality step. I mapped the reads to their reference (GRCm39) with a special annotation file from RNAcentral containing annotation for ncRNAs. I used the same gtf below.

I ran:

ribotish quality -b file.bam -g mus_musculus.RNAcentral.gtf --th 0.5

The error I get is:

Counted reads: 0
Error: no reads found! Check read length or protein coding annotation.

The first couple gtf entries look the following way:

1   RNAcentral  transcript  3056358 3056384 .   -   .   transcript_id "URS0000253E70_10090.0"; gene_id "URS0000253E70_10090.0";
1   RNAcentral  exon    3056358 3056384 .   -   .   transcript_id "URS0000253E70_10090.0";
1   RNAcentral  transcript  3056360 3056384 .   -   .   transcript_id "URS000042B783_10090.1"; gene_id "URS000042B783_10090.1";

Do you have any idea where this error may be coming from?

Thanks for your help!

zhpn1024 commented 2 years ago

Hi! The qc step uses known coding genes. You need to input coding annotations. The noncoding analysis is in the predict step.

noepozzan commented 2 years ago

I see, didn't realize that. Thanks a lot for the help! Btw, I opened another issue since I have been using ribotish quite extensively.

noepozzan commented 2 years ago

Getting back to you here because even though ribotish predict works with the following parameters: "--seq --aaseq --tpth 1 --fpth 1 --fspth 1 -v --longest", I do not get a single prediction out, when predicting on the RNAcentral gtf.. The pred(_all).txt files only have a header and nothing else. Do you have any idea why?

noepozzan commented 2 years ago

I am asking this because I am specifically searching for these new ORFs that I see on IGV. There are clearly some reads lying around these ORFs and they even look quite periodic. Why is ribotish not identifying any of them, even with these permissive parameters? Thank you!

zhpn1024 commented 2 years ago

What about the quality result? What is the content in the para.py file?

noepozzan commented 2 years ago

This is one of them, they look like this:

offdict = {25: 7, 26: 7, 27: 10, 28: 7, 29: 7, 31: 7, 'm0': {25: 7, 26: 10, 28: 13, 31: 8, 33: 16}}

Looks kind of strange, or what do you think?

zhpn1024 commented 2 years ago

It's strange. What about the pdf figure?

noepozzan commented 2 years ago

SRR1333393.sorted_indexed_qual.pdf

zhpn1024 commented 2 years ago

The reads length s are too short. The lengths should be around 29nt. How did you trim the reads?

noepozzan commented 2 years ago

I got rid of the first 5 bases and then clipped at the adapter. Would it help you if I pasted some reads here? Thanks btw, for really trying to help me!

zhpn1024 commented 2 years ago

That's the reason. 5+7=12 is the most common offset value. Do not cut the first 5 bases.

noepozzan commented 2 years ago

I see. I'll give it a try. Thanks again!

noepozzan commented 2 years ago

SRR1333393.sorted_indexed_qual.pdf

What do you think of this? Definitely looks better, no? Still doesn't find those regions that I see some reads and periodicity in..

zhpn1024 commented 2 years ago

It's much better now. Are there any prediction results?

noepozzan commented 2 years ago

Yes, there are results. I should specify: The quality figure I posted above comes actually from a different experiment from CHX-treated riboseq data that I took from SRA trying to reproduce some paper's results. In that paper, they found 10 short ORFs. But using ribotish I only find 4 of them. What's also strange is that I found the same 4 ORFs even before you told me to correct the offset. So this doesn't seem to matter too much. The problem is that I have no idea why ribotish doesn't see them. These regions of interest aren't even in the predict outputs. I expected them to be in there, just maybe not with a passing p-value. Do you have any other idea why this may be?

zhpn1024 commented 2 years ago

Run predict with --transprofile option can output signal data on all transcripts. Check if there is any signal in your transcript.

At 2022-11-28 23:35:41, "Noè Pozzan" @.***> wrote:

Yes, there are results. I should specify: The quality figure I posted above comes actually from a different experiment from CHX-treated riboseq data that I took from SRA trying to reproduce some paper's results. In that paper, they found 10 short ORFs. But using ribotish I only find 4 of them. What's also strange is that I found the same 4 ORFs even before you told me to correct the offset. So this doesn't seem to matter too much. The problem is that I have no idea why ribotish doesn't see them. These regions of interest aren't even in the predict outputs. I expected them to be in there, just maybe not with a passing p-value. Do you have any other idea why this may be?

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you modified the open/close state.Message ID: @.***>

noepozzan commented 1 year ago

Hey @zhpn1024,

Sorry for the late reply.. I ran ribotish predict with the --transprofile option. Pretty handy! I see signal in the transcripts im interested in, but not for the exact sequence I am interested in.

So, as an example this paper I am trying to replicate has the following: They found ribosome profiling and mass spec evidence for the following peptide: ATAAGLSAGLTR. They annotated this with id = ENST00000283195. Ribotish finds many things with an id of ENST00000283195, but not this exact peptide I'd be interested in. Below is a screenshot from IGV where I plot the full footprints and only their A-site for periodicity of the region of this peptide. You should be able to see it in the upper translation frame. ATAAGL

Do the reads not look good enough to be considered by ribotish?

Does it make sense what I am telling you? And thanks again for your help!

zhpn1024 commented 1 year ago

How do you generate the A site profile? From the A site result, the 394 data support your peptide, while 393 data supports translation in another frame (bottom). My transprofile results are P site profile. What are the signals in this region?

noepozzan commented 1 year ago

Good point. The A site profiles in the screenshot above were actually separately derived using a python script from transcriptome-aligned reads. The 393 is actually CHX-treated, while 394 is LTM-treated. Not sure how that should make a difference for periodicity, though...

Looking into the transprofile file, there is a lot in there for the entry with id = ENST00000283195. I only pasted about half of it due to the length.

ENSG00000153201 ENST00000283195 RANBP2  {}      {48:1, 63:2, 64:1, 65:1, 66:22, 69:12, 72:1, 79:3, 135:3, 137:2, 147:1, 154:2
, 156:5, 176:1, 285:5, 288:1, 294:3, 295:6, 345:2, 346:6, 348:1, 351:4, 354:2, 360:1, 462:1, 463:1, 468:2, 471:5, 480:1, 490:
2, 642:6, 645:3, 648:2, 651:11, 652:1, 653:3, 672:4, 673:3, 690:2, 717:5, 718:2, 720:2, 721:1, 726:1, 727:1, 729:19, 730:1, 7
32:4, 733:6, 734:1, 735:21, 736:3, 737:3, 738:6, 888:3, 891:7, 903:16, 904:1, 912:1, 913:1, 924:1, 927:1, 928:1, 930:1, 933:5
0, 934:9, 1029:4, 1033:1, 1035:1, 1042:2, 1047:4, 1048:1, 1050:1, 1221:2, 1225:3, 1248:1, 1251:1, 1272:1, 1344:1, 1347:2, 134
8:3, 1350:1, 1362:3, 1368:1, 1374:5, 1375:1, 1377:5, 1379:1, 1380:2, 1383:1, 1385:1, 1386:2, 1392:1, 1509:1, 1512:1, 1513:1, 
1527:1, 1671:2, 1752:4, 1755:8, 1758:25, 1767:1, 1915:2, 1917:1, 1918:1, 2068:1, 2100:1, 2118:1, 2195:2, 2196:3, 2202:2, 2208
:1, 2211:1, 2269:1, 2271:1, 2283:1, 2292:1, 2478:2, 2541:1, 2595:2, 2616:2, 2640:1, 2670:5, 2814:2, 2913:1, 2963:2, 2964:3, 2
979:3, 3026:2, 3039:2, 3077:1, 3081:3, 3090:2, 3102:1, 3111:1, 3135:1, 3165:2, 3247:1, 3282:2, 3312:1, 3354:1, 3355:1, 3360:3
, 3435:7, 3474:2, 3510:1, 3519:1, 3554:1, 3555:1, 3557:1, 3558:2, 3561:1, 3567:3, 3573:1, 3576:1, 3579:6, 3585:1, 3588:1, 359
1:2, 3594:4, 3597:2, 3600:3, 3612:3, 3615:1, 3618:1, 3633:4, 3657:2, 3660:10, 3663:10, 3684:2, 3687:1, 3699:1, 3711:3, 3738:2, 3753:4, 3754:1, 3755:1, 3759:2, 3765:6, 3774:3, 3825:1, 3830:1, 3834:1, 3837:5, 3840:2, 3864:3, 3871:2, 3909:1, 3918:6, 3945:1, 3981:1, 3984:4, 4042:3, 4090:1, 4098:2, 4116:3, 4119:1, 4131:1, 4138:1, 4171:1, 4176:1, 4227:1, 4228:1, 4230:1, 4251:7, 4260:2, 4320:1, 4327:1, 4335:1, 4338:3, 4341:1, 4356:1, 4359:1, 4362:2, 4368:1, 4374:2, 4421:2, 4427:2, 4437:3, 4440:1, 4449:6, 4509:17, 4542:1, 4545:5, 4550:1, 4551:1, 4553:1, 4561:1, 4575:1, 4593:2, 4605:1, 4629:2, 4632:1, 4635:3, 4645:1, 4647:10, 4678:3, 4701:8, 4704:1, 4725:5, 4726:2, 4737:6, 4738:2, 4744:5, 4752:4, 4755:2, 4767:2, 4785:1, 4794:6, 4797:1, 4803:6 ...
zhpn1024 commented 1 year ago

Your peptide is starting at position 63. So the translation signal is right here. Why this region not reported may be because there's no start codon upstream. Did you use the --alt option?

zhpn1024 commented 1 year ago

I think your script for the A site may be not correct.

noepozzan commented 1 year ago

No, I did not use the --alt option before actually. Thanks for telling me about it. I am running the workflow with it right now. What makes you think that the script is wrong for inferring the A site?

noepozzan commented 1 year ago

And how exactly do you see that the peptide is starting at position 63?

noepozzan commented 1 year ago

Hi @zhpn1024

I have another question that I mentioned before. So, I run ribotish quality on the ensembl gtf file with --th 0.4. The quality output looks pretty good, the offsets for an exemplary file:

offdict = {25: 10, 26: 8, 28: 10, 29: 11, 31: 11, 32: 11, 33: 11, 'm0': {28: 7, 29: 11, 31: 11, 33: 8}}

But then I want to run ribotish predict on another gtf file from RNAcentral with the following parameters: --alt --transprofile TRANSPROFILE --seq --aaseq -v --longest.

This is the log:

Mon Dec 12 17:18:47 2022 Loading genome...
Making fasta index for Mus_musculus.GRCm39.dna.primary_assembly.fa...
No input TIS data!
Mon Dec 12 17:19:25 2022 Predicting...
1
10
11
12
13
14
15
16
17
18
19
2
3
4
5
6
7
8
9
GL456210.1
GL456211.1
GL456212.1
GL456219.1
GL456221.1
GL456233.2
GL456239.1
GL456354.1
GL456359.1
GL456360.1
GL456366.1
GL456367.1
GL456368.1
GL456370.1
GL456372.1
GL456378.1
GL456379.1
GL456381.1
GL456382.1
GL456383.1
GL456385.1
GL456387.1
GL456389.1
GL456390.1
GL456394.1
GL456396.1
JH584296.1
JH584297.1
JH584298.1
JH584299.1
JH584300.1
JH584301.1
JH584302.1
JH584303.1
JH584304.1
MT
MU069434.1
MU069435.1
X
Y
Mon Dec 12 18:36:50 2022 Checking overlap with known CDS..
Mon Dec 12 18:36:50 2022 BH correcting...
Mon Dec 12 18:36:50 2022 Done! Time used: 4682.619626283646 s.

Even though the log file seems alright, there is no output besides the header to both pred(_all).txt files.

I am really out of ideas on this one. Can you help? Thanks for your help!

zhpn1024 commented 1 year ago

No, I did not use the --alt option before actually. Thanks for telling me about it. I am running the workflow with it right now. What makes you think that the script is wrong for inferring the A site?

Based on the offset. The A-site is upstream of P-site, and the offset is around 9nt. But your A-sites are nearly the end of reads. The position 63 is by calculating the mapping of genome location to transcript position.

zhpn1024 commented 1 year ago

Are there any data in TRANSPROFILE file?

noepozzan commented 1 year ago

Based on the offset. The A-site is upstream of P-site, and the offset is around 9nt. But your A-sites are nearly the end of reads. The position 63 is by calculating the mapping of genome location to transcript position.

I see. Yes, the A sites have not been computed based on the genome alignments but based on the transcriptome alignment. I would need to look more closely into what happened there exactly.

Btw, I really thought that the --alt option was by default enabled, haha... Read over that every time. So, when running with the --alt option enabled and some additional alternative start codons, the peptide mentioned above is actually identified. It is also annotated with id = ENST00000283195:

ENSG00000153201 ENST00000283195 RANBP2  protein_coding  2:109335960-109400357:+ GCG 24  9801    Extended    0   None    3.32963423346377e-92    N   None    None    7.647073795067235e-89   None    3258

I didn't paste the sequence because of its length. So ATAAGLSAGLTR is really just a small part of that.

Coming back to calculating which positions this peptide is represented by in the TRANSPROFILE; I still don't really get how the exact numbers are calculated. But ignoring this, this small part of the larger TRANSPROFILE does seem to indicate periodicity:

... 63:2, 64:1, 65:1, 66:22, 69:12 ...

I also made sure to include A as a possible start codon. Why is ATAAGLSAGLTR not called by itself?

noepozzan commented 1 year ago

Are there any data in TRANSPROFILE file?

For this other experiment, there is nothing in the TRANSPROFILE except the header, either. Why does ribotish take more than an hour to not output anything at all?

zhpn1024 commented 1 year ago

Glad to see the new results. It is curious that GCG is not in the alt start codons. What's your command? The transprofile data is position: count. The position is the 5' end of P-site. Maybe you run with the --longest, so the most upstream start codon is used. You can try --framebest instead. This option will test all possible start codons and choose the most significant one.

zhpn1024 commented 1 year ago

Are there any data in TRANSPROFILE file?

For this other experiment, there is nothing in the TRANSPROFILE except the header, either. Why does ribotish take more than an hour to not output anything at all?

It is curious. Are chromosome names consistent in bam, genome fa and gtf files? For example 'chr1' with '1'.

noepozzan commented 1 year ago

Glad to see the new results. It is curious that GCG is not in the alt start codons. What's your command? The transprofile data is position: count. The position is the 5' end of P-site. Maybe you run with the --longest, so the most upstream start codon is used. You can try --framebest instead. This option will test all possible start codons and choose the most significant one.

I am glad, too. Thank you!

Since the --alt is described to only consider start codons differing in one nt from ATG, I added some that were relevant for the small peptides I am searching for:

--alt --altcodons TCT,GCC,GCG,GGA --transprofile TRANSPROFILE --seq --aaseq --tpth 1 --fpth 1 --fspth 1 -v --longest

Thanks, I got that part. But the peptide I am interested in is starting at chr2:109'336'000. If I see that there were reads at 63, 66, 69 in the TRANSPROFILE of the respective entry, then how is this related to the absolute genome coordinates?

I'll certainly give --framebest a try, aswell!

noepozzan commented 1 year ago

It is curious. Are chromosome names consistent in bam, genome fa and gtf files? For example 'chr1' with '1'.

All of the files have the usual names like 1,2,3 .. MT, X, Y.

But then the RNAcentral gtf has a couple entries like this: JH584298.1 which are only in there.

And the bam file header shows some entries like this: GL000192.1 which are also in the genome.

Does that tell you anything?

zhpn1024 commented 1 year ago

Mapping transcript coordinate to genome coordinate need custom scripts. If you are fammilar with python, you can use my ribotish.zbio.gtf module. The Trans.genome_pos() function does this job.

zhpn1024 commented 1 year ago

The small chromosomes like JH584298.1 and GL000192.1 do not affect. Can you show the QC figure?

noepozzan commented 1 year ago

Mapping transcript coordinate to genome coordinate need custom scripts. If you are fammilar with python, you can use my ribotish.zbio.gtf module. The Trans.genome_pos() function does this job.

I see. May have a look at it but I'll just trust you on that for now. =)

noepozzan commented 1 year ago

The small chromosomes like JH584298.1 and GL000192.1 do not affect. Can you show the QC figure?

This is the QC output for one exemplary file: Sample_11_Concatenated_WT_30M_132443_RIBO_ALL.sorted_indexed_qual.pdf

zhpn1024 commented 1 year ago

The data quality is not good. There are very few reads in coding regions, and the frame pattern is not clear.

noepozzan commented 1 year ago

Well, alright... We thought it was good enough, but thanks for the hint.

noepozzan commented 1 year ago

Hi @zhpn1024

So, I agree that the data mentioned just above is not very deep. But now I have tried to predict on the RNAcentral gtf with other data.

This data is almost strangely periodic and deep, or not?: SRR1333393.sorted_indexed_qual.pdf

Still, all output files (including the TRANSPROFILE) are empty.

I ran the following:

ribotish predict -b file.bam -g homo_sapiens.GRCh38.gtf -f Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --ribopara SRR1333393.sorted_indexed.bam.para.py --alt --altcodons TCT,GCC,GCG,GGA --transprofile TRANSPROFILE --seq --aaseq -v --longest -o ribotish_pred.txt

This is the log:

Wed Dec 14 13:03:49 2022 Loading genome...
Making fasta index for Homo_sapiens.GRCh37.75.dna.primary_assembly.fa...
No input TIS data!
Wed Dec 14 13:04:30 2022 Predicting...
.
.
CHR_HSCHRX_2_CTG3
GL000008.2
.
.
Wed Dec 14 13:04:37 2022 Checking overlap with known CDS..
Wed Dec 14 13:04:37 2022 BH correcting...
Wed Dec 14 13:04:37 2022 Done! Time used: 48.64457869529724 s.

I don't get it. Why doesn't ribotish predict anything on this gtf? And why was it so fast if it took way later on that other data?

zhpn1024 commented 1 year ago

This data looks good. GRCh37 or 38? The version for gtf and genome fa is not consistent.

noepozzan commented 1 year ago

Hmm... As far as I am aware RNAcentral doesn't exist for anything pre 38.

But I'll just run it with all files from 38 to make sure.

Will get back to you, haha. Thanks!

noepozzan commented 1 year ago

Well, I tried running with all 3 files (ensemnbl gtf, RNAcentral gtf and genome.fa) from the same release and it didn't help. The output files and TRANSPROFILE are still empty.

ribotish predict -b SRR1333393.bam -g homo_sapiens.GRCh38.gtf -f Homo_sapiens.GRCh38.dna.primary_assembly.fa --ribopara SRR1333393.sorted_indexed.bam.para.py --alt --altcodons TCT,GCC,GCG,GGA --transprofile TRANSPROFILE --seq --aaseq -v --longest -o ribotish_pred.txt

The log pretty much looks the same as just above. Quality as discussed earlier should also definitely not be an issue.

Again, if I aligned with the normal ensembl gtf, then estimate offsets with the normal gtf but then predict with the RNAcentral gtf this should work?

I mean mapping with STAR works fine (~50% of uniquely mapped reads), also if I only map using the RNAcentral gtf.

In summary; I am really out of ideas why ribotish doesn't predict with the RNAcentral gtf. Do you have an idea still?

zhpn1024 commented 1 year ago

Try to predict with ensembl gtf?

noepozzan commented 1 year ago

well, that works. That gives the results we saw further up in the thread.

zhpn1024 commented 1 year ago

I found RNAcentral data in gff3 format. Where do you download th gtf format?

noepozzan commented 1 year ago

You are right. I downloaded it in gff3 format, but then transformed to gtf using gffread.

I'll run it with the gff3 to make sure..

zhpn1024 commented 1 year ago

Maybe ribotish failed to parse the converted gtf format.

noepozzan commented 1 year ago

Hi @zhpn1024

So I ran ribotish with the RNAcentral gff3 file. Again the output files (TRANSPROFILE included) are empty. But this time atleast the log is telling something new:

Fri Dec 16 14:18:23 2022 Loading genome...
Making fasta index for Homo_sapiens.GRCh38.dna.primary_assembly.fa...
No input TIS data!
Fri Dec 16 14:19:06 2022 Predicting...
.
.
.
Inconsistent trans strand:  URS00000235AA_9606.241936 - +
15
Inconsistent trans strand:  URS000027D32E_9606.274248 + -
16
Inconsistent trans strand:  URS00008B4EA9_9606.317462 + -
17
Inconsistent trans strand:  URS00001AB772_9606.350408 + -
.
.
.
Fri Dec 16 14:19:48 2022 Checking overlap with known CDS..
Fri Dec 16 14:19:48 2022 BH correcting...
Fri Dec 16 14:19:48 2022 Done! Time used: 85.3397159576416 s.

This is the command I ran. I had ribotish autodetect the gff3 file type. Is that a problem?

ribotish predict -b SRR1333393.bam -g homo_sapiens.GRCh38.gff3 -f Homo_sapiens.GRCh38.dna.primary_assembly.fa --ribopara SRR1333393.sorted_indexed.bam.para.py --alt --altcodons TCT,GCC,GCG,GGA --transprofile TRANSPROFILE --seq --aaseq -v --longest -o ribotish_pred.txt

Do you have an idea what the error is trying to tell? Thanks!

zhpn1024 commented 1 year ago

I just checked the RNAcentral gff3 format. There are two major problems that my program fail to parse. The first is the 'exon' in the third column is changed to 'noncoding_exon', resulted in no exon in any transcript. The second is no gene annotation to organize transcripts. I think using the RNAcentral bed annotation would be OK.

noepozzan commented 1 year ago

Hi @zhpn1024

It really finally worked when running ribotish with the RNAcentral bed file. Thanks so much for your help & persistence.

So, on our own samples, which are not that deeply sequenced (QC output in the thread above), none of the ORFs passed the thresholds, so there are only entries in the "_all.txt" files. Still, I am glad we found the error.

I ran:

ribotish predict -b Sample_11.bam -g mus_musculus.GRCm39.bed -f Mus_musculus.GRCm39.dna.primary_assembly.fa --ribopara Sample_11_Concatenated_WT_30M_132443_RIBO_ALL.sorted_indexed.bam.para.py --alt --transprofile TRANSPROFILE --seq --aaseq -v --longest -o ribotish_pred.txt

The log file has >600'000 warning messages, though: I just copied you the tail of the log where 2 of those messages are visible.

Wrong CDS annotation: URS000028A7C4_10090 URS000028A7C4_10090 0 73 73
Wrong CDS annotation: URS000061C2F2_10090 URS000061C2F2_10090 0 73 73
Mon Dec 19 13:04:30 2022 Checking overlap with known CDS..
Mon Dec 19 13:04:30 2022 BH correcting...
Mon Dec 19 13:04:30 2022 Done! Time used: 5537.979197025299 s.

Seems like ribotish still doesn't like the annotation file.. I guess that means I am losing information. Is there any way around this?