Possible bug in split_raw_dataset(): sequence.data() and header.data() w/o correspondence in the fastq

Hi,

I would need to obtain a conversion file from read idx used in MECAT and actual read name. I was looking at the function split_raw_dataset() and I found out that when printing read.header().data() and read.sequence().data() (single thread) the results do not have a correspondence in the input fastq file. They look like a mix of different reads (in bold the text that shouldn't be there) for both headers and sequences. Clear headers and sequences after each iteration seem not to affect the output. Am I missing something?

Priting headers:

[split_raw_dataset, 254] 1 m140213_230323_42129_c100520410120000001823082509281362_s1_X0/247/0_9332 [split_raw_dataset, 254] 2 m140213_230323_42129_c100520410120000001823082509281362_s1_X0/476/0_18315 [split_raw_dataset, 254] 3 m140213_230323_42129_c100520410120000001823082509281362_s1_X0/100000/0_15291 [split_raw_dataset, 254] 4 m140213_230323_42129_c100520410120000001823082509281362_s1_X0/100000/15337_17375 [split_raw_dataset, 254] 5 m140213_230323_42129_c100520410120000001823082509281362_s1_X0/153391/0_147517375 [split_raw_dataset, 254] 6 m140213_230323_42129_c100520410120000001823082509281362_s1_X0/58832/0_1388817375 [split_raw_dataset, 254] 7 m140213_230323_42129_c100520410120000001823082509281362_s1_X0/58833/792_345217375

Here's the code I'm referring to:

split_raw_dataset(const char* reads, const char* wrk_dir)
{
    [...]
    idx_t num_reads = 0, num_nucls = 0;
    while (1)
    {
        idx_t rsize = fr.read_one_seq(read);
        if (rsize == -1) break; 
        idx2read << num_reads << ' ' << read << endl;

        ++num_reads; 
        num_nucls += rsize; 

        if (v->curr + rsize + 1 > MCS)      
                {
            v->start_read_id = rid;
            rid += v->num_reads;
            generate_vol_file_name(wrk_dir, vol++, vol_file_name);
            fprintf(idx_file, "%s\n", vol_file_name);
            dump_volume(vol_file_name, v);
            clear_volume_t(v);
        }

        **LOG(stderr, "%d %s", num_reads, read.sequence().data()); // incorrect sequence
                 LOG(stderr, "%d %s", num_reads, read.header().data()); // incorrect header 
                 LOG(stderr, "%d %s", num_reads, read); // correct sequence and header **
        add_one_seq(v, read.sequence().data(), rsize); 
        ++v->curr; 
    }
[...]
    return vol;
}

Thanks, Giulia

xiaochuanle / MECAT

Possible bug in split_raw_dataset(): sequence.data() and header.data() w/o correspondence in the fastq #69