williamritchie / IRFinder

Detecting intron retention from RNA-Seq experiments

Mapping speed too low in hg19 #65

Closed yangjywhu closed 4 years ago

yangjywhu commented 5 years ago

Hello, I'm trying to run BuildRefProcess on hg19. There were no problems building the STAR index, but the mapping step (in Mapability/) is far too slow: about 0.1M/hr, whereas mm10 runs at about 600M/hr. What can I do to improve the speed?

Best, Jiayi Yang

dg520 commented 5 years ago

Hi @yangjywhu

Are you using all the chromosomes, including the scaffolds, or only the main chromosomes (e.g. 1-22, X, Y and MT)? I suggest the latter, to shrink the search space. I'm surprised there is such a huge speed gap between human and mouse. Typically both should complete within 2 hours when computational resources are sufficient.

Best, Dadi

yangjywhu commented 5 years ago

Thank you @dg520

I was using all the chromosomes for hg19, but the parameters I set were the same as for mm10. I tried running hg19 on another server, but it was slow there too. Then I ran BuildRefProcess with hg38, and it was as fast as mm10. So the problem seems to occur only with hg19. Maybe something is abnormal in the hg19 annotation files?

Best, Jiayi Yang

dg520 commented 5 years ago

Hi @yangjywhu ,

During the mapability estimation stage, we first use a 70bp sliding window with a step size of 10bp to break the whole genome down into pieces. These pieces are then treated as fake RNA-Seq reads and mapped back to the genome. We use this method to judge the complexity of each genomic region. As you can see, this is a straightforward process, and the running time should correlate with genome size only.
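The windowing step described above can be sketched roughly as follows. This is only an illustration of a 70bp window moved in 10bp steps, not IRFinder's actual code: the function name `sliding_window_reads` is hypothetical, and dropping any trailing partial window is an assumption made here.

```python
def sliding_window_reads(sequence, read_length=70, step=10):
    """Yield (start, fragment) pairs covering `sequence` with a sliding window.

    Hypothetical helper for illustration: each fragment plays the role of one
    fake read. Trailing bases that do not fill a whole window are skipped
    (an assumption, not confirmed IRFinder behaviour).
    """
    last_start = len(sequence) - read_length
    for start in range(0, last_start + 1, step):
        yield start, sequence[start:start + read_length]


# Toy example: a 100bp sequence gives windows starting at 0, 10, 20 and 30.
toy_sequence = "ACGT" * 25
toy_reads = list(sliding_window_reads(toy_sequence))
```

Because the number of windows grows linearly with sequence length, the number of fake reads to map should scale with genome size, which is why a large slowdown between similarly sized genomes is unexpected.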

If you are still suffering from low speed, you might want to:
1) Check your GTF file and FASTA file: do they use the same chromosome naming convention? For example, both of them should use either "chr1, chr2, chr3..." or "1, 2, 3, ...". Mixing the two naming styles will confuse the aligner when it searches for splice junctions during the mapping stage.
2) Use -m BuildRef instead of -m BuildRefProcess. This will download a fresh reference from Ensembl and build from there. If this approach runs at normal speed, it suggests your local reference files have some problems.
3) Use the main chromosomes only (e.g. 1-22, X and Y), so that the search space is smaller. Scaffold contigs might contain regions that are hard to map against.
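For the first suggestion, a quick way to compare naming conventions is to collect the sequence names from both files and look at the overlap. The sketch below is not part of IRFinder. The helper names are hypothetical, and it assumes an uncompressed FASTA whose sequence name is the first word after ">" and a tab-separated GTF whose first column is the sequence name.

```python
def fasta_seqnames(lines):
    """Collect sequence names from FASTA header lines (first word after '>')."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_seqnames(lines):
    """Collect sequence names from column 1 of a GTF, skipping comments and blanks."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_seqnames(fasta_names, gtf_names):
    """Report which names are shared and which appear in only one file."""
    return {
        "shared": sorted(fasta_names & gtf_names),
        "gtf_only": sorted(gtf_names - fasta_names),
        "fasta_only": sorted(fasta_names - gtf_names),
    }


# Toy input with mixed conventions: "chr1" in the FASTA, "1" in the GTF.
toy_fasta = [">chr1 example header", "ACGTACGT", ">chr2", "GGCCGGCC"]
toy_gtf = ["#!toy annotation", '1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "g1";']
report = compare_seqnames(fasta_seqnames(toy_fasta), gtf_seqnames(toy_gtf))
```

Both helpers accept any iterable of lines, so an open file handle can be passed directly. An empty "shared" list, as in the toy input, would point to the mixed naming described in suggestion 1.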

Please try the above suggestions in the order I listed. Let me know if the problem is still there.

Best, Dadi

yangjywhu commented 4 years ago

Hi @dg520, I'm so sorry. I've been too busy recently and forgot to reply. There is no problem with -m BuildRef. Thanks for your patient reply. IRFinder is a good tool for detecting intron retention in my RNA-Seq data, and it has helped me a lot.

Best wishes, Jiayi Yang.