Closed balajtimate closed 1 year ago
Outside of the one mentioned above, I also removed TCGTATGCCGTCTTC, which was causing similar issues, as it had the same percent found as ATCTCGTATGCCGTC in many cases.
There is also a third pair, CCGACAGGTTCAGAG and CGACAGGTTCAGAGT (as well as ACAGGTTCAGAGTTC), but so far it hasn't been inferred once. CGACAGGTTCAGAGT is the sequencing primer used for Small RNA kits, so maybe that should be included, and the other ones removed?
Yes, I would keep CGACAGGTTCAGAGT, I think that makes sense.
Did you use an automatic strategy to detect these local internal alignment issues, and if so, is it exhaustive?
Well, I just BLASTed the sequences against the whole list and checked whether there were any matches. I suppose a partial string search would be enough, though, and it could be automated to check for an existing match whenever we add new sequences to the transcripts file.
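For illustration, the partial string search could look something like the sketch below. It is only one possible approach, not what was actually run: the helper names `longest_common_substring` and `find_overlapping_pairs` and the `min_shared=12` cutoff are assumptions made for this example, and the fragment list is just the sequences mentioned in this thread rather than the full file.

```python
from itertools import combinations


def longest_common_substring(a: str, b: str) -> int:
    """Return the length of the longest contiguous run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the match that ended at the previous base of both strings.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def find_overlapping_pairs(fragments, min_shared=12):
    """Return (a, b, shared_length) for each pair sharing >= min_shared bases.

    min_shared is a hypothetical cutoff chosen for this example.
    """
    hits = []
    for a, b in combinations(fragments, 2):
        shared = longest_common_substring(a, b)
        if shared >= min_shared:
            hits.append((a, b, shared))
    return hits


# Only the fragments discussed in this thread, not the whole list.
fragments = [
    "AGATCGGAAGAGCAC",
    "GATCGGAAGAGCACA",
    "CCGACAGGTTCAGAG",
    "CGACAGGTTCAGAGT",
    "ACAGGTTCAGAGTTC",
    "TCGTATGCCGTCTTC",
    "ATCTCGTATGCCGTC",
]

overlaps = find_overlapping_pairs(fragments)
for a, b, shared in overlaps:
    print(f"{a}  {b}  shared={shared}")
```

Running the same check on the existing list plus any candidate sequence before adding it would flag shifted duplicates up front. The cutoff would need tuning: too low and unrelated fragments get flagged, too high and larger shifts are missed.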
Based on some of the results from #107, the 3'-adapter cannot be inferred because the ratio between the first and second most common adapter is too low (usually around 1). This is the case for e.g. AGATCGGAAGAGCAC and GATCGGAAGAGCACA, where the latter is the same as the first one, but shifted by 1 base.
According to the Illumina adapter guide, AGATCGGAAGAGCAC should be used for adapter trimming, as there is an A-tailing before the adapter ligation step. Finding GATCGGAAGAGCACA could also mean it was sequenced deeper into the adapter, which is not always the case.
These similar adapters should be removed from data/adapter_fragments.txt.
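To make the ratio problem concrete, here is a rough sketch using toy reads. Everything in it is an assumption for illustration: the reads are made up (random-looking inserts followed by part of a TruSeq-style adapter), and `count_fragment_hits` and `top_two_ratio` are hypothetical helpers, not the tool's actual counting or inference code.

```python
from collections import Counter


def count_fragment_hits(reads, fragments):
    """Count how many reads contain each fragment as an exact substring."""
    counts = Counter({frag: 0 for frag in fragments})
    for read in reads:
        for frag in fragments:
            if frag in read:
                counts[frag] += 1
    return counts


def top_two_ratio(counts):
    """Ratio between the most common and second most common fragment counts."""
    (_, first), (_, second) = counts.most_common(2)
    if second == 0:
        return float("inf")
    return first / second


# Toy data: made-up inserts followed by a prefix of a TruSeq-style adapter.
adapter = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
reads = [
    "ACGTTGCAAGGCTTAC" + adapter[:20],
    "TTGACCATGGCA" + adapter[:24],
    "GGCATTCAGGATCCAATG" + adapter[:18],
]

# With both shifted fragments in the list, every read that contains one
# also contains the other, so the counts tie.
with_shifted = count_fragment_hits(reads, ["AGATCGGAAGAGCAC", "GATCGGAAGAGCACA"])
print(top_two_ratio(with_shifted))

# With the shifted fragment removed, the runner-up is an unrelated fragment.
without_shifted = count_fragment_hits(reads, ["AGATCGGAAGAGCAC", "CGACAGGTTCAGAGT"])
print(top_two_ratio(without_shifted))
```

In this toy case the first ratio is exactly 1 because any read long enough to contain one 15-mer also contains its 1-base-shifted neighbour. On real data the counts would differ slightly for reads that end right at the fragment boundary, which fits the "usually around 1" observation.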