hg38.longest.transcripts.info.txt error

junjunlab commented 4 weeks ago

Hi, I ran MetageneAnalysis with the hg38.longest.transcripts.info.txt file from the RiboMiner package and got this error: ValueError: invalid literal for int() with base 10: 'CDS_length'

[screenshot: error traceback]

I don't know what the problem is. Is there any way to solve it? Thanks!

sherkinglee commented 3 weeks ago

Hi, I think the problem is in the longest.transcripts.info.txt file. A correct file should have the following columns:

$ less -S longest.transcripts.info.txt|sed -n "1p"|sed -s "s/\t/\n/g"|nl
     1  chrom
     2  trans_id
     3  strand
     4  gene_id
     5  gene_name
     6  transcript_biotype
     7  gene_start
     8  gene_stop
     9  CDS_start
    10  CDS_stop
    11  CDS_length
    12  5UTR_length
    13  3UTR_length
    14  transcript_length

If yours does not, check your reference files first and then try again.
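The column check above can also be scripted. The following is a minimal sketch, not part of RiboMiner: the function name `check_transcript_info` is hypothetical, and treating columns 7 to 14 as integers is an assumption based on their names. It reports a header that differs from the expected one and any data row whose numeric fields cannot be parsed, for example a stray second header line, which is one way a value such as 'CDS_length' could end up being passed to int().

```python
import csv
import io

# Column names as listed above for longest.transcripts.info.txt
EXPECTED_COLUMNS = [
    "chrom", "trans_id", "strand", "gene_id", "gene_name",
    "transcript_biotype", "gene_start", "gene_stop", "CDS_start",
    "CDS_stop", "CDS_length", "5UTR_length", "3UTR_length",
    "transcript_length",
]
# Assumption: every column from gene_start onward holds integers
INTEGER_COLUMNS = EXPECTED_COLUMNS[6:]


def check_transcript_info(handle):
    """Return a list of problems found in a tab-separated info file."""
    problems = []
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {header}")
        return problems
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append(f"line {line_no}: expected 14 fields, found {len(row)}")
            continue
        for name, value in zip(INTEGER_COLUMNS, row[6:]):
            try:
                int(value)
            except ValueError:
                problems.append(f"line {line_no}: {name}={value!r} is not an integer")
    return problems


# Small in-memory demonstration with made-up values and a duplicated header line
demo = "\t".join(EXPECTED_COLUMNS) + "\n"
demo += "\t".join(["chr1", "tx1", "+", "g1", "G1", "protein_coding",
                   "1", "1000", "101", "400", "300", "100", "600", "1000"]) + "\n"
demo += "\t".join(EXPECTED_COLUMNS) + "\n"
for problem in check_transcript_info(io.StringIO(demo)):
    print(problem)
```

To check a real file, open it with `open(path, newline="")` and pass the handle to the function; an empty result means the header and numeric columns look consistent.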

best wishes

junjunlab commented 3 weeks ago

I re-created longest.transcripts.info.txt and ran MetageneAnalysis -f attributes.txt -c longest.transcripts.info.txt -o ./4e2_metagene, but got another error:

[screenshot: error message]

longest.transcripts.info.txt:

[screenshot: contents of longest.transcripts.info.txt]

sherkinglee commented 3 weeks ago

That's weird! pysam's fetch retrieves the mapped reads from the BAM file. The only cause I can think of is that this transcript has no mapped reads and its transcript ID does not appear in the BAM file; otherwise, I cannot come up with anything that would produce this error. Could you please send your BAM file and attribute.txt file so I can reproduce the problem?
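One way to test this hypothesis is to compare the transcript IDs in the annotation with the reference names stored in the BAM header. The sketch below is an illustration under assumptions, not RiboMiner code: `missing_references` and `bam_reference_names` are hypothetical helper names, and the IDs in the demonstration are made up. It uses only `pysam.AlignmentFile` and its `references` attribute.

```python
def missing_references(transcript_ids, bam_references):
    """Return the transcript IDs, in input order, that are absent from the BAM header."""
    known = set(bam_references)
    return [tid for tid in transcript_ids if tid not in known]


def bam_reference_names(bam_path):
    """Read the reference (contig) names from a BAM header; requires pysam."""
    import pysam  # imported here so the comparison above works without pysam

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return list(bam.references)


# Demonstration with made-up IDs; with real data, pass the trans_id column and
# bam_reference_names("your.bam") instead.
print(missing_references(["tx1", "tx2", "tx3"], ["tx1", "tx3"]))
```

If many annotated transcripts are reported as missing, the BAM file and the annotation were probably built from different references.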

junjunlab commented 3 weeks ago

Sure, this is my upstream code:

[screenshot: upstream code]

attribute.txt file:

[screenshot: contents of attribute.txt]

This is my data:

File shared via Baidu Netdisk: bam. Link: https://pan.baidu.com/s/1U3Zmc0gwFAM7PHrN9qANKA?pwd=dddv Extraction code: dddv

sherkinglee commented 3 weeks ago

Hi, I converted your BAM file to FASTQ, re-analyzed it with RiboMiner, and did not encounter any problems. Here are the commands I used:

# $workdir is used below and must already be set to the analysis directory
Ref=/home/00.Reference/human/ensemble110
transcript=$Ref/RiboCode_annot/transcripts_sequence.fa
results=$workdir/MA
attribute=$workdir/configure.txt
trans_info=$Ref/longest.transcripts.info.txt
groups='4E2'
replicates='4E2'

mkdir -p "$results"

# Run the metagene analysis with the longest-transcript annotation
MetageneAnalysis -f "$attribute" -c "$trans_info" -o "$results/MA_normed" -U codon -M counts -u 0 -d 500 -l 100 -n 10 -m 5 -e 5 --norm yes -y 100 --CI 0.95 --type CDS

And here is the run log:

your input: 1 bam files
19726  transcripts will be used in the following analysis.

Length filter(-l)---Transcripts number filtered by criterion one is : 782
Length filter (3n)---Transcripts number filtered by criterion two is : 120
Total counts filter---Transcripts number filtered by criterion three is : 18324
CDS density filter(RPKM-n or counts-n)---Transcripts number filtered by criterion four is : 453
CDS density filter(normed-m)---Transcripts number filtered by criterion five is : 0
Metaplots Transcript Number for bam file../08.STAR/4E2_ribo_STAR/4E2_ribo.Aligned.toTranscriptome.out.sorted.bam is :47
Finish the step of ribosomeDensityNormPerTrans
Finish the step of MetageneAnalysis!
findfont: Font family 'Arial' not found.
Finish the step of metagenePlot!
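The filter counts in this log can be reconciled with the final number of transcripts used for the metaplot. The sketch below assumes that the five filters are applied in sequence and that each reported number is the count of transcripts removed at that step; under that assumption the arithmetic reproduces the 47 transcripts reported in the log. The dictionary labels are informal descriptions, not RiboMiner identifiers.

```python
# Numbers copied from the run log above
total_transcripts = 19726
removed_by_filter = {
    "length filter (-l)": 782,
    "length filter (3n)": 120,
    "total counts filter": 18324,
    "CDS density filter (counts, -n)": 453,
    "CDS density filter (normed, -m)": 0,
}

# Assumption: each count is the number of transcripts dropped at that step
remaining = total_transcripts - sum(removed_by_filter.values())
print(f"transcripts left for the metaplot: {remaining}")

# Share of all transcripts removed by each filter
for name, removed in removed_by_filter.items():
    print(f"{name:<34} {removed:>6} ({removed / total_transcripts:.1%})")
```

Most transcripts are dropped by the total counts filter, which is consistent with the low number of reads assigned to exons in this sample.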

I also noticed that although the read length distribution is OK, the periodicity of this sample is poor. In addition, too many reads map to intronic or intergenic regions, which suggests possible DNA contamination.

# cutadapt
sample            Total          Trimmed(Percent)           shortNum(Percentage)      LeftNum(Percentage)
4E2               4,787,469      152,867 (3.2%)             164 (0.0%)                4,787,305 (100.0%)

# filtering
sample            inputNum       Remained                   discarded(Percent)
4E2               4787305        4693320                    93985 (1%)

# remove rRNA contamination
sample            ProcessedNum   rRNA(Percent)              noContamRNA(Percent)
4E2_ribo          4693320        44 (0.00%)                 4693276 (100.00%)

# remove tRNA contamination
sample            ProcessedNum   tRNA(Percent)              noContamRNA(Percent)
4E2_ribo_tRNA     4693276        0 (0.00%)                  4693276 (100.00%)

# Star mapping
sample            input          UniquelyMapped(Percent)    MutipulMapped(Percent)
4E2_ribo          4693276        4495285 (95.78%)           187047 (3.99%)

# DNA contamination
sample            Exon           DNA                        Intron                    ambiguous_RNA
4E2               195542         4049406                    244464                    5873
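To put the DNA contamination table in proportion, the four categories can be converted to fractions. This is a minimal sketch using only the numbers reported above. The four counts sum to 4,495,285, the number of uniquely mapped reads in the STAR table, and roughly 90% of them fall in the DNA category. How the pipeline assigns a read to each category is not shown in this thread, so the labels are taken as given.

```python
# Counts copied from the "DNA contamination" table for sample 4E2
category_counts = {
    "Exon": 195542,
    "DNA": 4049406,
    "Intron": 244464,
    "ambiguous_RNA": 5873,
}

total = sum(category_counts.values())
fractions = {name: count / total for name, count in category_counts.items()}

print(f"total classified reads: {total:,}")
for name, count in category_counts.items():
    print(f"{name:>14}: {count:>9,} ({fractions[name]:.1%})")
```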

I also checked your original BAM file, and the transcript you mentioned is indeed not in it. Is it possible that the BAM file and the reference files do not match? If so, I suggest redoing the mapping step and trying again.

best wishes

junjunlab commented 3 weeks ago

Thanks for your reply and for testing! I will try again.
