yarden / MISO

MISO: Mixture of Isoforms model for RNA-Seq isoform quantitation
http://genes.mit.edu/burgelab/miso/index.html
132 stars 74 forks source link

check_gff_and_bam() fails when BAM file does not contain read sequences #112

Open romansloutsky opened 7 years ago

romansloutsky commented 7 years ago

I'm trying to run MISO on aligned reads that look like this (in SAM format):

... R6652946 99 1 15021 255 51M = 15078 108 R6652946 147 1 15078 255 51M = 15021 -108 ...

Sequences were intentionally left out of this data to preserve tissue donor anonymity (you can see the white paper for this study here)

MISO throws the following exception:

  File "/home/rs97a/.pyenv/versions/2.7.13/bin/miso", line 11, in <module>
    load_entry_point('misopy==0.5.4', 'console_scripts', 'miso')()
  File "/home/rs97a/.pyenv/versions/2.7.13/lib/python2.7/site-packages/misopy/miso.py", line 591, in main
    wait_on_jobs=wait_on_jobs)
  File "/home/rs97a/.pyenv/versions/2.7.13/lib/python2.7/site-packages/misopy/miso.py", line 372, in compute_all_genes_psi
    given_read_len=read_len)
  File "/home/rs97a/.pyenv/versions/2.7.13/lib/python2.7/site-packages/misopy/run_events_analysis.py", line 102, in check_gff_and_bam
    seq_lens[len(bam_read.seq)] = True
TypeError: object of type 'NoneType' has no len()

The problem is in using the seq attribute of pysam.Samfile (deprecated in favor of pysam.AlignmentFile, btw) to obtain the read's length, since this attribute appears to be set to None when the sequence is not available. From misopy/run_events_analysis.py:

 95     bam_file = pysam.Samfile(bam_filename, "rb")
 96     n = 0
 97     seq_lens = {}
 98     for bam_read in bam_file:
 99         # Check more of the reads for mixed read lengths
100         if n >= (num_reads * 10):
101             break
102         seq_lens[len(bam_read.seq)] = True
103         n += 1.

But there's no need to rely on the sequence, especially since I don't believe it's otherwise used by MISO, when alignment length is available as both pysam.AlignmentFile.alen and pysam.AlignmentFile.query_alignment_length. If it's ever necessary, the length can probably be calculated from the CIGAR string as a fallback.