zyndagj / RNNotate

BSD 3-Clause "New" or "Revised" License

RepeatMasker to gff3 script #5

Open zyndagj opened 5 years ago

zyndagj commented 5 years ago

While RepeatMasker does have an option to write GFF output, that output drops all repeat class and family information:

##gff-version 2                                                                                              
##date 2019-04-05                                                                                            
##sequence-region Arabidopsis_thaliana.TAIR10.dna.toplevel.fa                                                
1       RepeatMasker    similarity      1       107     13.2    -       .       Target "Motif:ATREP18" 561 649
1       RepeatMasker    similarity      1066    1097    10.0    +       .       Target "Motif:(C)n" 1 32
1       RepeatMasker    similarity      1155    1187    17.1    +       .       Target "Motif:(TTTCTT)n" 1 33

To keep this information, I need to convert the standard RepeatMasker output table

    SW   perc perc perc  query     position in query              matching           repeat                 position in repeat
 score   div. del. ins.  sequence  begin    end          (left)   repeat             class/family       begin   end    (left)     ID
   282   13.2  0.0  8.1  1                1      107 (30427564) C ATREP18            DNA                 (1142)    649     561     1
    22   10.0  0.0  0.0  1             1066     1097 (30426574) + (C)n               Simple_repeat            1     32     (0)     2
    15   17.1  0.0  0.0  1             1155     1187 (30426484) + (TTTCTT)n          Simple_repeat            1     33     (0)     3
   231   10.9  0.0  0.0  1             4285     4330 (30423341) C MuDR-16_ALy        DNA/MULE-MuDR       (3083)    934     889     4

to GFF3 format, where I track the repeat class and family in additional metadata fields.