zhaoyanswill / RAPSearch2

Reduced Alphabet based Protein similarity Search
40 stars 15 forks source link

Masking characters in DNA search lead to hits not found #32

Open bachev opened 7 years ago

bachev commented 7 years ago

Hi there,

I stumbled across e peculiar bug: when searching a a DNA sequence against a database, I noticed that some hits (100% identity) were not found. I traced this back to parts of the query sequence being masked with a special character ('x') in the input. Interesting thing is: the bug seems to appear only starting with a certain length of 'invalid' DNA bases.

I attached a simple example which demonstrates this: a 61 nt stretch with 'x' at end of query sequence works, 62 does not.

rs_mask_error.tar.gz

Please note I did not check masked characters at beginning or within a sequence.

Best, Bastien

PS: (tested in 2.23 & 2.24)

bachev commented 7 years ago

Identified the problem: the automatic dna/protein recognition fails as it only counts ACGTUN as nucleotides, not X.

zhaoyanswill commented 7 years ago

Thanks! Should the program consider X in query sequences?

On Oct 10, 2016, at 10:36 AM, Bastien Chevreux notifications@github.com wrote:

Identified the problem: the automatic dna/protein recognition fails as it only counts ACGTUN as nucleotides, not X.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/zhaoyanswill/RAPSearch2/issues/32#issuecomment-252658581, or mute the thread https://github.com/notifications/unsubscribe-auth/AETkp5b2t55uKsglpn9ikO501IfmWnCJks5qyluAgaJpZM4KNF11.

bachev commented 7 years ago

I know I would. Well, I did.

//----------------------------------------------------------------------
int CHashSearch::GuessQueryType(POOL& vPool)
{
  static const char valid_bases[256]=
    {
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 'A', 0, 'C', 0, 0, 0, 'G', 0, 0, 0, 0, 0, 0, 'N', 0,  // 0x41 = A
      0, 0, 0, 0, 'T', 0, 0, 0, 'X', 0, 0, 0, 0, 0, 0, 0,
      0, 'a', 0, 'c', 0, 0, 0, 'g', 0, 0, 0, 0, 0, 0, 'n', 0,   // 0x61 = a
      0, 0, 0, 0, 't', 0, 0, 0, 'x', 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
    };

  // read some reads
  std::vector<uchar> seq;
  ITER itStop = vPool.end();
  ITER itSt = find(vPool.begin(), itStop, '>');
  ITER itEd = find(itSt+1, itStop, '>');
  while (itEd!=itStop && seq.size()<10000)
  {
    ITER itBeg = find(itSt, itEd, '\n');
    while (itBeg == itEd)
    {
      itEd = find(itEd+1, itStop, '>');
      itBeg = find(itBeg, itEd, '\n');
    }
    ++itBeg;
    int nIter = seq.size();
    seq.insert(seq.end(), itBeg, itEd-1);
    seq.erase(remove(seq.begin()+nIter, seq.end(), '\n'), seq.end());

    itSt = itEd;
    itEd = find(itSt+1, itStop, '>');
  }

  unsigned int bases=0;
  for(auto c : seq) if(valid_bases[c]) ++bases;

  if(bases > seq.size() * 0.80) return QTYPE_NUCL;

  return QTYPE_PROT;
}