Open bachev opened 8 years ago
Identified the problem: the automatic dna/protein recognition fails as it only counts ACGTUN as nucleotides, not X.
Thanks! Should the program consider X in query sequences?
On Oct 10, 2016, at 10:36 AM, Bastien Chevreux notifications@github.com wrote:
Identified the problem: the automatic dna/protein recognition fails as it only counts ACGTUN as nucleotides, not X.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/zhaoyanswill/RAPSearch2/issues/32#issuecomment-252658581, or mute the thread https://github.com/notifications/unsubscribe-auth/AETkp5b2t55uKsglpn9ikO501IfmWnCJks5qyluAgaJpZM4KNF11.
I know I would. Well, I did.
//----------------------------------------------------------------------
int CHashSearch::GuessQueryType(POOL& vPool)
{
static const char valid_bases[256]=
{
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 'A', 0, 'C', 0, 0, 0, 'G', 0, 0, 0, 0, 0, 0, 'N', 0, // 0x41 = A
0, 0, 0, 0, 'T', 0, 0, 0, 'X', 0, 0, 0, 0, 0, 0, 0,
0, 'a', 0, 'c', 0, 0, 0, 'g', 0, 0, 0, 0, 0, 0, 'n', 0, // 0x61 = a
0, 0, 0, 0, 't', 0, 0, 0, 'x', 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
};
// read some reads
std::vector<uchar> seq;
ITER itStop = vPool.end();
ITER itSt = find(vPool.begin(), itStop, '>');
ITER itEd = find(itSt+1, itStop, '>');
while (itEd!=itStop && seq.size()<10000)
{
ITER itBeg = find(itSt, itEd, '\n');
while (itBeg == itEd)
{
itEd = find(itEd+1, itStop, '>');
itBeg = find(itBeg, itEd, '\n');
}
++itBeg;
int nIter = seq.size();
seq.insert(seq.end(), itBeg, itEd-1);
seq.erase(remove(seq.begin()+nIter, seq.end(), '\n'), seq.end());
itSt = itEd;
itEd = find(itSt+1, itStop, '>');
}
unsigned int bases=0;
for(auto c : seq) if(valid_bases[c]) ++bases;
if(bases > seq.size() * 0.80) return QTYPE_NUCL;
return QTYPE_PROT;
}
Hi there,
I stumbled across e peculiar bug: when searching a a DNA sequence against a database, I noticed that some hits (100% identity) were not found. I traced this back to parts of the query sequence being masked with a special character ('x') in the input. Interesting thing is: the bug seems to appear only starting with a certain length of 'invalid' DNA bases.
I attached a simple example which demonstrates this: a 61 nt stretch with 'x' at end of query sequence works, 62 does not.
rs_mask_error.tar.gz
Please note I did not check masked characters at beginning or within a sequence.
Best, Bastien
PS: (tested in 2.23 & 2.24)