xiaochuanle / MECAT

MECAT: an ultra-fast mapping, error correction and de novo assembly tool for single-molecule sequencing reads
107 stars 26 forks source link

Possible bug in split_raw_dataset(): sequence.data() and header.data() w/o correspondence in the fastq #69

Open giuliaguidi opened 5 years ago

giuliaguidi commented 5 years ago

Hi,

I would need to obtain a conversion file from read idx used in MECAT and actual read name. I was looking at the function split_raw_dataset() and I found out that when printing read.header().data() and read.sequence().data() (single thread) the results do not have a correspondence in the input fastq file. They look like a mix of different reads (in bold the text that shouldn't be there) for both headers and sequences. Clear headers and sequences after each iteration seem not to affect the output. Am I missing something?

Priting headers:

[split_raw_dataset, 254] 1 m140213_230323_42129_c100520410120000001823082509281362_s1_X0/247/0_9332 [split_raw_dataset, 254] 2 m140213_230323_42129_c100520410120000001823082509281362_s1_X0/476/0_18315 [split_raw_dataset, 254] 3 m140213_230323_42129_c100520410120000001823082509281362_s1_X0/100000/0_15291 [split_raw_dataset, 254] 4 m140213_230323_42129_c100520410120000001823082509281362_s1_X0/100000/15337_17375 [split_raw_dataset, 254] 5 m140213_230323_42129_c100520410120000001823082509281362_s1_X0/153391/0_147517375 [split_raw_dataset, 254] 6 m140213_230323_42129_c100520410120000001823082509281362_s1_X0/58832/0_1388817375 [split_raw_dataset, 254] 7 m140213_230323_42129_c100520410120000001823082509281362_s1_X0/58833/792_345217375

Here's the code I'm referring to:

split_raw_dataset(const char* reads, const char* wrk_dir)
{
    [...]
    idx_t num_reads = 0, num_nucls = 0;
    while (1)
    {
        idx_t rsize = fr.read_one_seq(read);
        if (rsize == -1) break; 
        idx2read << num_reads << ' ' << read << endl;

        ++num_reads; 
        num_nucls += rsize; 

        if (v->curr + rsize + 1 > MCS)      
                {
            v->start_read_id = rid;
            rid += v->num_reads;
            generate_vol_file_name(wrk_dir, vol++, vol_file_name);
            fprintf(idx_file, "%s\n", vol_file_name);
            dump_volume(vol_file_name, v);
            clear_volume_t(v);
        }

        **LOG(stderr, "%d %s", num_reads, read.sequence().data()); // incorrect sequence
                 LOG(stderr, "%d %s", num_reads, read.header().data()); // incorrect header 
                 LOG(stderr, "%d %s", num_reads, read); // correct sequence and header **
        add_one_seq(v, read.sequence().data(), rsize); 
        ++v->curr; 
    }
[...]
    return vol;
}

Thanks, Giulia