zhou-lab / biscuit

BISulfite-seq CUI Toolkit
Other
63 stars 24 forks source link

biscuit asm fails on generated epiread files #13

Open ttriche opened 7 years ago

ttriche commented 7 years ago

Compiled with debug symbols and run under gdb...

gdb -q --args ./biscuit asm epiread.txt
Reading symbols from ./biscuit...done.
(gdb) b main_asm
Breakpoint 1 at 0x453853: file src/asm_pairwise.c, line 91.
(gdb) r
Starting program: /home/tim/biscuit/biscuit asm epiread.txt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, main_asm (argc=2, argv=0x7fffffffdf50) at src/asm_pairwise.c:91
91  int main_asm(int argc, char *argv[]) {
... 
(gdb) n
125     assert(strlen(in->fields[4]) == 1);
(gdb) print in->fields[4]
$1 = 0x6f0600 "10468"
(gdb) n
biscuit: src/asm_pairwise.c:125: main_asm: Assertion `strlen(in->fields[4]) == 1' failed.

Program received signal SIGABRT, Aborted.

I'm having a hard time figuring out why it is asserting that a coordinate should be length 1...

epiread.txt

ttriche commented 7 years ago

Hey I think I have an idea, maybe this is just off-by-one or off-by-two?

Here is what epiread.txt (above) looks like:

chr1    MG01HX08:278:H3VCCALXX:8:1104:27052:45892   1   -   10468   
CCCCTC  10341   CCATCACC

Unless something funny is happening, in wztsv.h, https://github.com/zwdzwd/utils/blob/3281eb6f70e431480a036d33ba587c70f6fc26c7/wztsv.h#L86, tsv_read should fill up the tsv struct "in" as follows:

(gdb) explore in
'in' is a pointer to a value of type 'tsv_t'
Continue exploring it as a pointer to a single value [y/n]: y

  The value of '*in' is of type 'tsv_t' which is a typedef of type 'struct tsv_t'
The value of '*in' is a struct/class of type 'struct tsv_t' with the following fields:

    fields = <Enter 0 to explore this field of type 'char **'>
         n = <Enter 1 to explore this field of type 'size_t'>
         m = <Enter 2 to explore this field of type 'size_t'>
        fp = <Enter 3 to explore this field of type 'gzFile'>
  finished = 0 .. (Value of type 'int')
  line_max = 10000 .. (Value of type 'int')
      line = <Enter 6 to explore this field of type 'char *'>
Enter the field number of choice: q 
(gdb) print in
$1 = (tsv_t *) 0x6e60a0
(gdb) print in->fields[0]
$2 = 0x6f0550 "chr1"
(gdb) print in->fields[1]
$3 = 0x6f0570 "MG01HX08:278:H3VCCALXX:8:1104:27052:45892"
(gdb) print in->fields[2]
$4 = 0x6f05b0 "1"
(gdb) print in->fields[3]
$5 = 0x6f0530 "-"
(gdb) print in->fields[4]
$6 = 0x6f0600 "10468"

So it looks like this is maybe an off-by-one error?

If I replace the code with

    assert(strlen(in->fields[2]) == 1);
    assert(strlen(in->fields[3]) == 1);

it looks like we are OK, although then it's going to corrupt the _snp / _cg code following. Let's see...

ttriche commented 7 years ago

nb. It's not an issue of single-end vs paired-end epiread format:

gdb -q --args ./biscuit asm epiread.paired.txt 
Reading symbols from ./biscuit...done.
(gdb) b main_asm
Breakpoint 1 at 0x453853: file src/asm_pairwise.c, line 91.
(gdb) r
Starting program: /home/tim/Dropbox/biscuit/biscuit asm epiread.paired.txt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
...
(gdb) print in->fields[3]
$1 = 0x6f0530 "CCCT"
(gdb) print in->fields[4]
$2 = 0x6f05e0 "."
(gdb) n
biscuit: src/asm_pairwise.c:124: main_asm: Assertion `strlen(in->fields[3]) == 1' failed.

epiread.paired.txt

ttriche commented 7 years ago

followup: changing

assert(strlen(in->fields[3]) == 1);
assert(strlen(in->fields[4]) == 1);

to

assert(strlen(in->fields[2]) == 1);
assert(strlen(in->fields[3]) == 1);

allows things to proceed and eventually the program exits normally, although I haven't come across an ASM-possessing region in my osteosarcoma to test it out on. So maybe it is an off-by-one error?

I need a positive control before I'm willing to send in a patch :-)