zavolanlab / htsinfer

Infer metadata for your downstream analysis straight from your RNA-seq data
Apache License 2.0

Remove similar adapters to increase frequency ratio #121

Closed balajtimate closed 1 year ago

balajtimate commented 1 year ago

Based on some of the results from #107, the 3'-adapter cannot be inferred in some cases because the ratio between the counts of the most common and second most common adapter is too low (usually around 1). This is the case for, e.g., AGATCGGAAGAGCAC and GATCGGAAGAGCACA, where the latter is the same as the former but shifted by 1 base.

According to the Illumina adapter guide, AGATCGGAAGAGCAC should be used for adapter trimming, as there is an A-tailing step before adapter ligation. Finding GATCGGAAGAGCACA could also mean that the read was sequenced deeper into the adapter, which is not always the case.

These similar adapters should be removed from data/adapter_fragments.txt.
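To illustrate why the ratio collapses, here is a minimal sketch. It is not HTSinfer's actual code: the function name `top_two_ratio` and the toy reads are made up for this example. It only assumes that each fragment is counted once per read that contains it, and that the ratio is the top count divided by the runner-up count.

```python
def top_two_ratio(reads, fragments):
    """Count reads containing each fragment and return (counts, ratio).

    The ratio is the highest count divided by the second highest.
    Hypothetical helper for illustration, not part of HTSinfer.
    """
    if len(fragments) < 2:
        raise ValueError("need at least two fragments to compute a ratio")
    # One hit per read that contains the fragment anywhere.
    counts = {frag: sum(frag in read for read in reads) for frag in fragments}
    ranked = sorted(counts.values(), reverse=True)
    first, second = ranked[0], ranked[1]
    ratio = first / second if second else float("inf")
    return counts, ratio


# Toy reads: two of them run into the same 3'-adapter.
reads = [
    "TTTTAGATCGGAAGAGCACACGTCTG",
    "ACGTACGTAGATCGGAAGAGCACACG",
    "GGGGCCCCAAAATTTT",
]
counts, ratio = top_two_ratio(reads, ["AGATCGGAAGAGCAC", "GATCGGAAGAGCACA"])
print(counts, ratio)
```

Any read that runs at least one base past the first 15-mer contains both fragments, so the two counts move together and the ratio stays close to 1.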

balajtimate commented 1 year ago

Besides the pair mentioned above, I also removed TCGTATGCCGTCTTC, which was causing similar issues, as in many cases it was found in the same percentage of reads as ATCTCGTATGCCGTC.

There is also a third pair, CCGACAGGTTCAGAG and CGACAGGTTCAGAGT (as well as ACAGGTTCAGAGTTC), but so far it hasn't been inferred once. CGACAGGTTCAGAGT is the sequencing primer used for Small RNA kits, so maybe that should be included, and the other ones removed?

uniqueg commented 1 year ago

Yes, I would keep CGACAGGTTCAGAGT, I think that makes sense.

Did you use an automatic strategy to detect these local internal alignment issues, and if so, is it exhaustive?

balajtimate commented 1 year ago

Well, I just BLASTed the sequences against the whole list and checked if there were any matches. I suppose a partial string search would be enough, though, and it could be automated to check whether there's already a match when we add new sequences to the transcripts file.
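A rough sketch of what such a partial string search could look like. The function name `shifted_overlaps` and the `max_shift` cutoff are made up for this example and nothing here is part of HTSinfer. It only flags pairs where one fragment equals the other shifted by a few bases, i.e. the kind of overlap discussed in this issue.

```python
from itertools import permutations


def shifted_overlaps(sequences, max_shift=3):
    """Return (upstream, downstream, shift) for fragments that overlap.

    A pair is reported when dropping the first `shift` bases of
    `upstream` leaves a prefix of `downstream`.
    Hypothetical helper for illustration, not part of HTSinfer.
    """
    hits = []
    for upstream, downstream in permutations(sequences, 2):
        for shift in range(1, max_shift + 1):
            overlap = upstream[shift:]
            # Skip empty overlaps (fragment shorter than the shift).
            if overlap and downstream.startswith(overlap):
                hits.append((upstream, downstream, shift))
                break  # report the smallest shift only
    return hits


fragments = [
    "AGATCGGAAGAGCAC",
    "GATCGGAAGAGCACA",
    "TCGTATGCCGTCTTC",
    "ATCTCGTATGCCGTC",
]
for upstream, downstream, shift in shifted_overlaps(fragments):
    print(f"{downstream} is {upstream} shifted by {shift}")
```

Running the same check on the existing list plus any new candidate sequence before adding it would catch these cases up front. The `max_shift` value is a judgment call: a larger value finds more distant overlaps but also shortens the required match.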