zavolanlab / htsinfer

Infer metadata for your downstream analysis straight from your RNA-seq data
Apache License 2.0
10 stars 22 forks source link

Count motifs in a set of sequences #13

Closed mzavolan closed 2 years ago

mzavolan commented 3 years ago

Given a set of sequences, calculate the number of occurrences of all motifs within a range of length that occur in the input sequences. Return the results as a dictionary.

stinea98 commented 3 years ago

I would like to work on this issue :)

mzavolan commented 3 years ago

Go ahead!

lukasheuberger commented 3 years ago

I would join @stinea98 with this issue

uniqueg commented 3 years ago

Sure, added you @lukasheuberger, thanks!

Dhryata commented 3 years ago

I would like to work on this as well, if possible

stinea98 commented 3 years ago

This would be our suggestion for a docstring:

def count_motifs(input_sequences: list, min_motif_length: int, max_motif_length: int) -> dict """ Function that calculates the occurrence of all possible motifs in one or multiple sequences and returns a dictionary with all motifs within the specified length and their occurrence

Args: input_sequences (list): list of sequences min_motif_length (int): minimal length of motif max_motif_length (int): maximal length of motif

Returns: motif frequency (dict) with paired data {"motif_seq": frequency }

Raises: ValueError: input_sequences contains non-standard (ACTG) characters TypeError: input_sequences is not a list """

And this is how we think we'll proceed:

1) create all possible motifs with lengths in range [a,b], save as list motifs idea for code: nucleotides = ["A", "T", "G", "C"] for i in range(a,b): motifs = itertools.product(nucleotides, repeat=i) already add these in a dict with all freq counters = 0

2) scan all sequences for all motifs and record frequencies fastest method might be regex: for motif in motifs: if motif in sequence: dict[motif] += 1 -> see https://stackoverflow.com/questions/4901523/whats-a-faster-operation-re-match-search-or-str-find

3) construct and output dictionary with motif sequences and corresponding frequencies

mzavolan commented 3 years ago

Hi Stine

The docstring looks good. Regarding the implementation, imagine what would happen if you got b = 20. You will have a much smaller number of motifs actually represented in the sequences than those that are possible for a given length. I would rather count the motifs that DO OCCUR in the sequences themselves. Another comment about non-[ACGT] characters, you may not want the program to quit unless the number of these characters is large. So you could either count them and give more information to the user as to what proportion of characters are not correct, or even do the counting with/without these characters. Finally, you could take into account either DNA or RNA, in which case you would have to consider U's instead of T's.

uniqueg commented 2 years ago

Closing this for now as we chose to implement a brute force algorithm for 3' adapter extraction for the meantime.