zhqingit / giremi

GIREMI is a method that can identify RNA editing sites using one RNA-seq data set without requiring genome sequence data.
42 stars 15 forks source link

Low quantity of RNA editings found #37

Open s-a-nersisyan opened 7 years ago

s-a-nersisyan commented 7 years ago

Hi, Qing,

I'm using GIREMI to find some RNAE in my RNA-seq data. I have good aligned file src.bam, reads are paired end (and properly paired). I'm performing following steps:

  1. Sorting bam in coordinate order and indexing it via samtools sort and index. Output file: sorted.bam
  2. Variant calling: samtools mpileup -ugf genome.fa sorted.bam -q 1 | bcftools view -bvcg - > raw.bcf
  3. Filtering results: bcftools view raw.bcf | vcfutils.pl varFilter -D100 | awk '$6>=10' | grep "0\/1:" > filtered.vcf In this line I remove homozygous SNV's by grep "0\/1:"
  4. Annotating filtered.vcf and generating SNV list file (if dbSNP, gene name, strand etc) via snpEff and own simple python script
  5. Running GIREMI: ./giremi -f genome.fa -l SNV_list.txt -o result sorted.bam

My problem is that GIREMI generates 56616 lines .res file and only 3654 of them are RNAE. I have meanMI:0.669357 sdMI:0.143034 and it seems to be normal (as I read in previous issues). I think it's very low quantity of RNA editings (my bam file is 15GB) and they don't normally match with results of other RNA editing find tools. Maybe I'm doing something wrong or missing some important steps? Looking forward for your answer and thank you very much!

zhqingit commented 7 years ago

Hi,

Could you send me you input and output files? Thanks.

Best, Qing

2017-07-07 1:15 GMT-07:00 s-a-nersisyan notifications@github.com:

Hi, Qing,

I'm using GIREMI to find some RNAE in my RNA-seq data. I have good aligned file src.bam, reads are paired end (and properly paired). I'm performing following steps:

  1. Sorting bam in coordinate order and indexing it via samtools sort and index. Output file: sorted.bam
  2. Variant calling: samtools mpileup -ugf genome.fa sorted.bam -q 1 | bcftools view -bvcg - > raw.bcf
  3. Filtering results: bcftools view raw.bcf | vcfutils.pl varFilter -D100 | awk '$6>=10' | grep "0/1:" > filtered.vcf In this line I remove homozygous SNV's by grep "0/1:"
  4. Annotating filtered.vcf and generating SNV list file (if dbSNP, gene name, strand etc) via snpEff and own simple python script
  5. Running GIREMI: ./giremi -f genome.fa -l SNV_list.txt -o result sorted.bam

My problem is that GIREMI generates 56616 lines .res file and only 3654 of them are RNAE. I have meanMI:0.669357 sdMI:0.143034 and it seems to be normal (as I read in previous issues). I think it's very low quantity of RNA editings (my bam file is 15GB) and they don't normally match with results of other RNA editing find tools. Maybe I'm doing something wrong or missing some important steps? Looking forward for your answer and thank you very much!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/zhqingit/giremi/issues/37, or mute the thread https://github.com/notifications/unsubscribe-auth/ALL2zdAHaJ_1olyZ1necHmVoYCOlt2JHks5sLekggaJpZM4OQoVW .

s-a-nersisyan commented 7 years ago

Hello Qing, Sorry for late reply. I've re-tested all procedure on https://www.encodeproject.org/experiments/ENCSR000COR/ dataset and it found 8228 RNA editings (5114 from them was found by MI). I think that it is low quantity of RNAE and it must find much more. Please find SNV list and .res file in the attachment. Thank you! ENCFF836AMM.zip