yangli557 / AnnoSINE

SINE annotation tool for plant genomes
MIT License

Running issues #2

Open amvarani opened 2 years ago

amvarani commented 2 years ago

Hi there, I would like to thank all the AnnoSINE developers! I'm trying to run this software on Ubuntu Linux, and it seems that the code was optimized for macOS: the trf and irf binary calls have to be corrected explicitly in the .py scripts. However, the A. thaliana test example is taking more than 2 hours to run on a 40-CPU server, not the 7 minutes stated in the README. Other genomes of around 500 Mb are taking more than 2 days to run (not finished yet). Is this normal?

yangli557 commented 2 years ago

Hi Amvarani,

Thanks for your questions. First, I need to confirm whether your Inverted Repeats Finder (IRF) works normally. I previously tried to run the program on Linux with the "Linux on PC compatible" build of IRF, but IRF did not work properly, as shown in the figure below: image

amvarani commented 2 years ago

Hi @yangli557, on Ubuntu-like systems you must install "lib32z1" for IRF (sudo apt-get install lib32z1); that should fix this issue. Here the A. thaliana chr4 example is working fine, but it takes 20-25 minutes to run on a 40x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 512 GB of RAM. However, for other genomes the pipeline gets stuck on step 1; please see below.

Apparently, the problem is with SINEFinder.py, right ?

image

amvarani commented 2 years ago

This is my output using the test example (A. thaliana chr4). As you can see, it took 30 minutes to run.

It only works with this example. If I provide a FASTA file from another genome, it stops at the "Merging the same hmm prediction ..." step.

python3 AnnoSINE.py 3 ../Testing/A.thaliana_Chr4.fasta test
Example: python3 AnnoSINE.py 2 ../Input_Files/test.fasta ../Output_Files
Please input the path of genomic sequence


*** AnnoSINE START! ****


====== Step 1: HMMER prediction and structure search has begun =======
Processing the hmm prediction ...
Processing the hmm prediction ...
Processing the hmm prediction ...
Merging the same hmm prediction ...

======================== Step 1 has been done ========================

================ Step 2: TSD identification has begun ================ Find 173 sequence headers in the input file: test/Step1_extend_tsd_input.fa

Start writing result to file: test/Step2_tsd.txt Succeeded!

======================== Step 2 has been done ========================

================ Step 3: MSA implementation has begun ================ BLAST againist the genome assembly ... Processing the BLAST output ...

======================== Step 3 has been done ========================

========= Step 4: RNA derived head identification has begun ==========

========================= Step 4 has been done =======================

=============== Step 5: Tandem repeat finder has begun ===============

Tandem Repeats Finder, Version 4.09 Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.

Loading sequence... Allocating Memory... Initializing data structures... Computing TR Model Statistics... Scanning... ........................................................................................

Freeing Memory... Resolving output... Done. ======================== Step 5 has been done ========================

=============== Step 6: Inverted repeat finder has begun =============

Inverted Repeats Finder, Version 3.05 Copyright (C) Dr. Gary Benson 2002-2003. All rights reserved.

Loading sequence... Allocating Memory... Initializing data structures...

Tuples tupsize index tuplesize tuplemaxdistance 1 4 154 2 5 813 3 7 14800 Scanning.................................................................................................... Resolving output... Freeing Memory... Done

========================= Step 6 has been done =======================

=============== Step 7: Sequences clustering has begun ===============

Program: CD-HIT, V4.8.1, Mar 01 2019, 14:14:47 Command: cd-hit-est -T 28 -i test/Step6_irf_output.fasta -o test/Step7_cluster_output.fasta -c 0.8

Started: Fri Jan 14 11:59:20 2022

                        Output                              

Option -T is ignored: multi-threading with OpenMP is NOT enabled! total seq: 1 longest and shortest : 170 and 170 Total letters: 170 Sequences have been sorted

Approximated minimal memory consumption: Sequence : 0M Buffer : 1 X 12M = 12M Table : 1 X 16M = 16M Miscellaneous : 4M Total : 33M

Table limit with the given memory limit: Max number of representatives: 2838769 Max number of word counting entries: 95868787

comparing sequences from 0 to 1

    1  finished          1  clusters

Approximated maximum memory consumption: 33M writing new database writing clustering information program completed !

Total CPU time 0.12

======================== Step 7 has been done ========================

================= Step 8: Genome annotation has begun ================ ../Testing/A.thaliana_Chr4.fasta test RepeatMasker version 4.1.1 Search Engine: NCBI/RMBLAST [ 2.10.0+ ] Using Custom Repeat Library: test/Seed_SINE.fa

analyzing file ../Testing/A.thaliana_Chr4.fasta identifying matches to Seed_SINE.fa sequences in batch 1 of 321 identifying matches to Seed_SINE.fa sequences in batch 4 of 321 identifying matches to Seed_SINE.fa sequences in batch 2 of 321 identifying matches to Seed_SINE.fa sequences in batch 5 of 321 identifying matches to Seed_SINE.fa sequences in batch 16 of 321 identifying matches to Seed_SINE.fa sequences in batch 8 of 321 identifying matches to Seed_SINE.fa sequences in batch 10 of 321 identifying matches to Seed_SINE.fa sequences in batch 9 of 321 identifying matches to Seed_SINE.fa sequences in batch 7 of 321 identifying matches to Seed_SINE.fa sequences in batch 3 of 321 identifying matches to Seed_SINE.fa sequences in batch 14 of 321 identifying matches to Seed_SINE.fa sequences in batch 6 of 321 identifying matches to Seed_SINE.fa sequences in batch 15 of 321 identifying matches to Seed_SINE.fa sequences in batch 12 of 321 identifying matches to Seed_SINE.fa sequences in batch 13 of 321 identifying matches to Seed_SINE.fa sequences in batch 11 of 321 identifying matches to Seed_SINE.fa sequences in batch 18 of 321 identifying matches to Seed_SINE.fa sequences in batch 17 of 321 identifying matches to Seed_SINE.fa sequences in batch 21 of 321 identifying matches to Seed_SINE.fa sequences in batch 22 of 321 identifying matches to Seed_SINE.fa sequences in batch 23 of 321 identifying matches to Seed_SINE.fa sequences in batch 19 of 321 identifying matches to Seed_SINE.fa sequences in batch 24 of 321 identifying matches to Seed_SINE.fa sequences in batch 20 of 321 identifying matches to Seed_SINE.fa sequences in batch 25 of 321 identifying matches to Seed_SINE.fa sequences in batch 29 of 321 identifying matches to Seed_SINE.fa sequences in batch 27 of 321 identifying matches to Seed_SINE.fa sequences in batch 28 of 321 identifying matches to Seed_SINE.fa sequences in batch 30 of 321 identifying matches to Seed_SINE.fa sequences in batch 31 of 321 
identifying matches to Seed_SINE.fa sequences in batch 26 of 321 identifying matches to Seed_SINE.fa sequences in batch 32 of 321 identifying matches to Seed_SINE.fa sequences in batch 34 of 321 identifying matches to Seed_SINE.fa sequences in batch 35 of 321 identifying matches to Seed_SINE.fa sequences in batch 33 of 321 identifying matches to Seed_SINE.fa sequences in batch 36 of 321 identifying matches to Seed_SINE.fa sequences in batch 37 of 321 identifying matches to Seed_SINE.fa sequences in batch 38 of 321 identifying matches to Seed_SINE.fa sequences in batch 40 of 321 identifying matches to Seed_SINE.fa sequences in batch 39 of 321 identifying matches to Seed_SINE.fa sequences in batch 42 of 321 identifying matches to Seed_SINE.fa sequences in batch 41 of 321 identifying matches to Seed_SINE.fa sequences in batch 43 of 321 identifying matches to Seed_SINE.fa sequences in batch 44 of 321 identifying matches to Seed_SINE.fa sequences in batch 46 of 321 identifying matches to Seed_SINE.fa sequences in batch 45 of 321 identifying matches to Seed_SINE.fa sequences in batch 48 of 321 identifying matches to Seed_SINE.fa sequences in batch 47 of 321 identifying matches to Seed_SINE.fa sequences in batch 50 of 321 identifying matches to Seed_SINE.fa sequences in batch 49 of 321 identifying matches to Seed_SINE.fa sequences in batch 53 of 321 identifying matches to Seed_SINE.fa sequences in batch 52 of 321 identifying matches to Seed_SINE.fa sequences in batch 54 of 321 identifying matches to Seed_SINE.fa sequences in batch 51 of 321 identifying matches to Seed_SINE.fa sequences in batch 55 of 321 identifying matches to Seed_SINE.fa sequences in batch 57 of 321 identifying matches to Seed_SINE.fa sequences in batch 56 of 321 identifying matches to Seed_SINE.fa sequences in batch 58 of 321 identifying matches to Seed_SINE.fa sequences in batch 59 of 321 identifying matches to Seed_SINE.fa sequences in batch 60 of 321 identifying matches to Seed_SINE.fa sequences in 
batch 61 of 321 identifying matches to Seed_SINE.fa sequences in batch 62 of 321 identifying matches to Seed_SINE.fa sequences in batch 63 of 321 identifying matches to Seed_SINE.fa sequences in batch 64 of 321 identifying matches to Seed_SINE.fa sequences in batch 66 of 321 identifying matches to Seed_SINE.fa sequences in batch 65 of 321 identifying matches to Seed_SINE.fa sequences in batch 67 of 321 identifying matches to Seed_SINE.fa sequences in batch 68 of 321 identifying matches to Seed_SINE.fa sequences in batch 69 of 321 identifying matches to Seed_SINE.fa sequences in batch 70 of 321 identifying matches to Seed_SINE.fa sequences in batch 71 of 321 identifying matches to Seed_SINE.fa sequences in batch 72 of 321 identifying matches to Seed_SINE.fa sequences in batch 73 of 321 identifying matches to Seed_SINE.fa sequences in batch 74 of 321 identifying matches to Seed_SINE.fa sequences in batch 75 of 321 identifying matches to Seed_SINE.fa sequences in batch 76 of 321 identifying matches to Seed_SINE.fa sequences in batch 77 of 321 identifying matches to Seed_SINE.fa sequences in batch 78 of 321 identifying matches to Seed_SINE.fa sequences in batch 79 of 321 identifying matches to Seed_SINE.fa sequences in batch 81 of 321 identifying matches to Seed_SINE.fa sequences in batch 80 of 321 identifying matches to Seed_SINE.fa sequences in batch 82 of 321 identifying matches to Seed_SINE.fa sequences in batch 83 of 321 identifying matches to Seed_SINE.fa sequences in batch 84 of 321 identifying matches to Seed_SINE.fa sequences in batch 85 of 321 identifying matches to Seed_SINE.fa sequences in batch 86 of 321 identifying matches to Seed_SINE.fa sequences in batch 87 of 321 identifying matches to Seed_SINE.fa sequences in batch 88 of 321 identifying matches to Seed_SINE.fa sequences in batch 89 of 321 identifying matches to Seed_SINE.fa sequences in batch 90 of 321 identifying matches to Seed_SINE.fa sequences in batch 92 of 321 identifying matches to 
Seed_SINE.fa sequences in batch 91 of 321 identifying matches to Seed_SINE.fa sequences in batch 93 of 321 identifying matches to Seed_SINE.fa sequences in batch 94 of 321 identifying matches to Seed_SINE.fa sequences in batch 95 of 321 identifying matches to Seed_SINE.fa sequences in batch 97 of 321 identifying matches to Seed_SINE.fa sequences in batch 96 of 321 identifying matches to Seed_SINE.fa sequences in batch 98 of 321 identifying matches to Seed_SINE.fa sequences in batch 99 of 321 identifying matches to Seed_SINE.fa sequences in batch 100 of 321 identifying matches to Seed_SINE.fa sequences in batch 101 of 321 identifying matches to Seed_SINE.fa sequences in batch 102 of 321 identifying matches to Seed_SINE.fa sequences in batch 103 of 321 identifying matches to Seed_SINE.fa sequences in batch 104 of 321 identifying matches to Seed_SINE.fa sequences in batch 105 of 321 identifying matches to Seed_SINE.fa sequences in batch 106 of 321 identifying matches to Seed_SINE.fa sequences in batch 107 of 321 identifying matches to Seed_SINE.fa sequences in batch 108 of 321 identifying matches to Seed_SINE.fa sequences in batch 109 of 321 identifying matches to Seed_SINE.fa sequences in batch 111 of 321 identifying matches to Seed_SINE.fa sequences in batch 110 of 321 identifying matches to Seed_SINE.fa sequences in batch 112 of 321 identifying matches to Seed_SINE.fa sequences in batch 113 of 321 identifying matches to Seed_SINE.fa sequences in batch 114 of 321 identifying matches to Seed_SINE.fa sequences in batch 115 of 321 identifying matches to Seed_SINE.fa sequences in batch 117 of 321 identifying matches to Seed_SINE.fa sequences in batch 116 of 321 identifying matches to Seed_SINE.fa sequences in batch 118 of 321 identifying matches to Seed_SINE.fa sequences in batch 119 of 321 identifying matches to Seed_SINE.fa sequences in batch 120 of 321 identifying matches to Seed_SINE.fa sequences in batch 121 of 321 identifying matches to Seed_SINE.fa sequences in 
batch 123 of 321 identifying matches to Seed_SINE.fa sequences in batch 122 of 321 identifying matches to Seed_SINE.fa sequences in batch 124 of 321 identifying matches to Seed_SINE.fa sequences in batch 125 of 321 identifying matches to Seed_SINE.fa sequences in batch 126 of 321 identifying matches to Seed_SINE.fa sequences in batch 128 of 321 identifying matches to Seed_SINE.fa sequences in batch 127 of 321 identifying matches to Seed_SINE.fa sequences in batch 129 of 321 identifying matches to Seed_SINE.fa sequences in batch 130 of 321 identifying matches to Seed_SINE.fa sequences in batch 131 of 321 identifying matches to Seed_SINE.fa sequences in batch 132 of 321 identifying matches to Seed_SINE.fa sequences in batch 133 of 321 identifying matches to Seed_SINE.fa sequences in batch 134 of 321 identifying matches to Seed_SINE.fa sequences in batch 135 of 321 identifying matches to Seed_SINE.fa sequences in batch 136 of 321 identifying matches to Seed_SINE.fa sequences in batch 137 of 321 identifying matches to Seed_SINE.fa sequences in batch 138 of 321 identifying matches to Seed_SINE.fa sequences in batch 140 of 321 identifying matches to Seed_SINE.fa sequences in batch 139 of 321 identifying matches to Seed_SINE.fa sequences in batch 141 of 321 identifying matches to Seed_SINE.fa sequences in batch 142 of 321 identifying matches to Seed_SINE.fa sequences in batch 144 of 321 identifying matches to Seed_SINE.fa sequences in batch 143 of 321 identifying matches to Seed_SINE.fa sequences in batch 145 of 321 identifying matches to Seed_SINE.fa sequences in batch 146 of 321 identifying matches to Seed_SINE.fa sequences in batch 147 of 321 identifying matches to Seed_SINE.fa sequences in batch 148 of 321 identifying matches to Seed_SINE.fa sequences in batch 149 of 321 identifying matches to Seed_SINE.fa sequences in batch 150 of 321 identifying matches to Seed_SINE.fa sequences in batch 151 of 321 identifying matches to Seed_SINE.fa sequences in batch 152 of 321 
identifying matches to Seed_SINE.fa sequences in batch 153 of 321 identifying matches to Seed_SINE.fa sequences in batch 154 of 321 identifying matches to Seed_SINE.fa sequences in batch 155 of 321 identifying matches to Seed_SINE.fa sequences in batch 156 of 321 identifying matches to Seed_SINE.fa sequences in batch 157 of 321 identifying matches to Seed_SINE.fa sequences in batch 158 of 321 identifying matches to Seed_SINE.fa sequences in batch 159 of 321 identifying matches to Seed_SINE.fa sequences in batch 160 of 321 identifying matches to Seed_SINE.fa sequences in batch 161 of 321 identifying matches to Seed_SINE.fa sequences in batch 163 of 321 identifying matches to Seed_SINE.fa sequences in batch 162 of 321 identifying matches to Seed_SINE.fa sequences in batch 164 of 321 identifying matches to Seed_SINE.fa sequences in batch 165 of 321 identifying matches to Seed_SINE.fa sequences in batch 166 of 321 identifying matches to Seed_SINE.fa sequences in batch 167 of 321 identifying matches to Seed_SINE.fa sequences in batch 169 of 321 identifying matches to Seed_SINE.fa sequences in batch 168 of 321 identifying matches to Seed_SINE.fa sequences in batch 170 of 321 identifying matches to Seed_SINE.fa sequences in batch 171 of 321 identifying matches to Seed_SINE.fa sequences in batch 172 of 321 identifying matches to Seed_SINE.fa sequences in batch 173 of 321 identifying matches to Seed_SINE.fa sequences in batch 174 of 321 identifying matches to Seed_SINE.fa sequences in batch 175 of 321 identifying matches to Seed_SINE.fa sequences in batch 176 of 321 identifying matches to Seed_SINE.fa sequences in batch 177 of 321 identifying matches to Seed_SINE.fa sequences in batch 178 of 321 identifying matches to Seed_SINE.fa sequences in batch 179 of 321 identifying matches to Seed_SINE.fa sequences in batch 180 of 321 identifying matches to Seed_SINE.fa sequences in batch 181 of 321 identifying matches to Seed_SINE.fa sequences in batch 182 of 321 identifying matches 
to Seed_SINE.fa sequences in batch 183 of 321 identifying matches to Seed_SINE.fa sequences in batch 184 of 321 identifying matches to Seed_SINE.fa sequences in batch 186 of 321 identifying matches to Seed_SINE.fa sequences in batch 185 of 321 identifying matches to Seed_SINE.fa sequences in batch 187 of 321 identifying matches to Seed_SINE.fa sequences in batch 188 of 321 identifying matches to Seed_SINE.fa sequences in batch 189 of 321 identifying matches to Seed_SINE.fa sequences in batch 190 of 321 identifying matches to Seed_SINE.fa sequences in batch 191 of 321 identifying matches to Seed_SINE.fa sequences in batch 192 of 321 identifying matches to Seed_SINE.fa sequences in batch 193 of 321 identifying matches to Seed_SINE.fa sequences in batch 194 of 321 identifying matches to Seed_SINE.fa sequences in batch 195 of 321 identifying matches to Seed_SINE.fa sequences in batch 196 of 321 identifying matches to Seed_SINE.fa sequences in batch 197 of 321 identifying matches to Seed_SINE.fa sequences in batch 198 of 321 identifying matches to Seed_SINE.fa sequences in batch 200 of 321 identifying matches to Seed_SINE.fa sequences in batch 199 of 321 identifying matches to Seed_SINE.fa sequences in batch 202 of 321 identifying matches to Seed_SINE.fa sequences in batch 201 of 321 identifying matches to Seed_SINE.fa sequences in batch 203 of 321 identifying matches to Seed_SINE.fa sequences in batch 204 of 321 identifying matches to Seed_SINE.fa sequences in batch 205 of 321 identifying matches to Seed_SINE.fa sequences in batch 207 of 321 identifying matches to Seed_SINE.fa sequences in batch 206 of 321 identifying matches to Seed_SINE.fa sequences in batch 208 of 321 identifying matches to Seed_SINE.fa sequences in batch 209 of 321 identifying matches to Seed_SINE.fa sequences in batch 210 of 321 identifying matches to Seed_SINE.fa sequences in batch 211 of 321 identifying matches to Seed_SINE.fa sequences in batch 212 of 321 identifying matches to Seed_SINE.fa 
sequences in batch 213 of 321 identifying matches to Seed_SINE.fa sequences in batch 214 of 321 identifying matches to Seed_SINE.fa sequences in batch 215 of 321 identifying matches to Seed_SINE.fa sequences in batch 217 of 321 identifying matches to Seed_SINE.fa sequences in batch 216 of 321 identifying matches to Seed_SINE.fa sequences in batch 218 of 321 identifying matches to Seed_SINE.fa sequences in batch 220 of 321 identifying matches to Seed_SINE.fa sequences in batch 219 of 321 identifying matches to Seed_SINE.fa sequences in batch 221 of 321 identifying matches to Seed_SINE.fa sequences in batch 222 of 321 identifying matches to Seed_SINE.fa sequences in batch 223 of 321 identifying matches to Seed_SINE.fa sequences in batch 224 of 321 identifying matches to Seed_SINE.fa sequences in batch 225 of 321 identifying matches to Seed_SINE.fa sequences in batch 226 of 321 identifying matches to Seed_SINE.fa sequences in batch 227 of 321 identifying matches to Seed_SINE.fa sequences in batch 228 of 321 identifying matches to Seed_SINE.fa sequences in batch 230 of 321 identifying matches to Seed_SINE.fa sequences in batch 229 of 321 identifying matches to Seed_SINE.fa sequences in batch 231 of 321 identifying matches to Seed_SINE.fa sequences in batch 233 of 321 identifying matches to Seed_SINE.fa sequences in batch 232 of 321 identifying matches to Seed_SINE.fa sequences in batch 234 of 321 identifying matches to Seed_SINE.fa sequences in batch 235 of 321 identifying matches to Seed_SINE.fa sequences in batch 236 of 321 identifying matches to Seed_SINE.fa sequences in batch 238 of 321 identifying matches to Seed_SINE.fa sequences in batch 237 of 321 identifying matches to Seed_SINE.fa sequences in batch 239 of 321 identifying matches to Seed_SINE.fa sequences in batch 240 of 321 identifying matches to Seed_SINE.fa sequences in batch 241 of 321 identifying matches to Seed_SINE.fa sequences in batch 242 of 321 identifying matches to Seed_SINE.fa sequences in batch 
243 of 321 identifying matches to Seed_SINE.fa sequences in batch 244 of 321 identifying matches to Seed_SINE.fa sequences in batch 245 of 321 identifying matches to Seed_SINE.fa sequences in batch 246 of 321 identifying matches to Seed_SINE.fa sequences in batch 247 of 321 identifying matches to Seed_SINE.fa sequences in batch 248 of 321 identifying matches to Seed_SINE.fa sequences in batch 249 of 321 identifying matches to Seed_SINE.fa sequences in batch 251 of 321 identifying matches to Seed_SINE.fa sequences in batch 250 of 321 identifying matches to Seed_SINE.fa sequences in batch 252 of 321 identifying matches to Seed_SINE.fa sequences in batch 253 of 321 identifying matches to Seed_SINE.fa sequences in batch 255 of 321 identifying matches to Seed_SINE.fa sequences in batch 254 of 321 identifying matches to Seed_SINE.fa sequences in batch 256 of 321 identifying matches to Seed_SINE.fa sequences in batch 257 of 321 identifying matches to Seed_SINE.fa sequences in batch 258 of 321 identifying matches to Seed_SINE.fa sequences in batch 259 of 321 identifying matches to Seed_SINE.fa sequences in batch 260 of 321 identifying matches to Seed_SINE.fa sequences in batch 261 of 321 identifying matches to Seed_SINE.fa sequences in batch 262 of 321 identifying matches to Seed_SINE.fa sequences in batch 263 of 321 identifying matches to Seed_SINE.fa sequences in batch 264 of 321 identifying matches to Seed_SINE.fa sequences in batch 266 of 321 identifying matches to Seed_SINE.fa sequences in batch 265 of 321 identifying matches to Seed_SINE.fa sequences in batch 267 of 321 identifying matches to Seed_SINE.fa sequences in batch 268 of 321 identifying matches to Seed_SINE.fa sequences in batch 269 of 321 identifying matches to Seed_SINE.fa sequences in batch 270 of 321 identifying matches to Seed_SINE.fa sequences in batch 271 of 321 identifying matches to Seed_SINE.fa sequences in batch 272 of 321 identifying matches to Seed_SINE.fa sequences in batch 273 of 321 
identifying matches to Seed_SINE.fa sequences in batch 274 of 321 identifying matches to Seed_SINE.fa sequences in batch 275 of 321 identifying matches to Seed_SINE.fa sequences in batch 276 of 321 identifying matches to Seed_SINE.fa sequences in batch 277 of 321 identifying matches to Seed_SINE.fa sequences in batch 278 of 321 identifying matches to Seed_SINE.fa sequences in batch 279 of 321 identifying matches to Seed_SINE.fa sequences in batch 280 of 321 identifying matches to Seed_SINE.fa sequences in batch 282 of 321 identifying matches to Seed_SINE.fa sequences in batch 281 of 321 identifying matches to Seed_SINE.fa sequences in batch 283 of 321 identifying matches to Seed_SINE.fa sequences in batch 284 of 321 identifying matches to Seed_SINE.fa sequences in batch 286 of 321 identifying matches to Seed_SINE.fa sequences in batch 285 of 321 identifying matches to Seed_SINE.fa sequences in batch 287 of 321 identifying matches to Seed_SINE.fa sequences in batch 289 of 321 identifying matches to Seed_SINE.fa sequences in batch 288 of 321 identifying matches to Seed_SINE.fa sequences in batch 290 of 321 identifying matches to Seed_SINE.fa sequences in batch 292 of 321 identifying matches to Seed_SINE.fa sequences in batch 291 of 321 identifying matches to Seed_SINE.fa sequences in batch 293 of 321 identifying matches to Seed_SINE.fa sequences in batch 294 of 321 identifying matches to Seed_SINE.fa sequences in batch 295 of 321 identifying matches to Seed_SINE.fa sequences in batch 296 of 321 identifying matches to Seed_SINE.fa sequences in batch 297 of 321 identifying matches to Seed_SINE.fa sequences in batch 298 of 321 identifying matches to Seed_SINE.fa sequences in batch 300 of 321 identifying matches to Seed_SINE.fa sequences in batch 299 of 321 identifying matches to Seed_SINE.fa sequences in batch 301 of 321 identifying matches to Seed_SINE.fa sequences in batch 302 of 321 identifying matches to Seed_SINE.fa sequences in batch 303 of 321 identifying matches 
to Seed_SINE.fa sequences in batch 304 of 321 identifying matches to Seed_SINE.fa sequences in batch 305 of 321 identifying matches to Seed_SINE.fa sequences in batch 306 of 321 identifying matches to Seed_SINE.fa sequences in batch 308 of 321 identifying matches to Seed_SINE.fa sequences in batch 307 of 321 identifying matches to Seed_SINE.fa sequences in batch 309 of 321 identifying matches to Seed_SINE.fa sequences in batch 310 of 321 identifying matches to Seed_SINE.fa sequences in batch 311 of 321 identifying matches to Seed_SINE.fa sequences in batch 312 of 321 identifying matches to Seed_SINE.fa sequences in batch 313 of 321 identifying matches to Seed_SINE.fa sequences in batch 314 of 321 identifying matches to Seed_SINE.fa sequences in batch 315 of 321 identifying matches to Seed_SINE.fa sequences in batch 316 of 321 identifying matches to Seed_SINE.fa sequences in batch 317 of 321 identifying matches to Seed_SINE.fa sequences in batch 318 of 321 identifying matches to Seed_SINE.fa sequences in batch 319 of 321 identifying matches to Seed_SINE.fa sequences in batch 321 of 321 identifying matches to Seed_SINE.fa sequences in batch 320 of 321 processing output: cycle 1 cycle 2 Generating output... masking done

========================= Step 8 has been done =======================

Total running time: 1822.8337078094482 s


** AnnoSINE COMPLETE! **


amvarani commented 2 years ago

OK, I have found a temporary solution. The problem seems to be in the SINEFinder.py script, in the lines of code just below:

def run(seqfile, **kwargs):
    ....
    # start processing
    while 1:
        e = next(fi)
        ...

The "e = next(fi)" call is the problem. Apparently the script takes forever to read the FASTA file. I have chromosome-length scaffolds of up to 55 Mb.

The solution is to break the chromosome-length scaffolds into smaller chunks of about 5 Mb, for example.
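The splitting workaround above could be sketched roughly as follows. This is only an illustration, not part of AnnoSINE: the function names `split_fasta` and `_pieces` and the `_chunkN` naming scheme are made up here, and the cut is a plain split with no overlap, so an element that spans a cut point could be missed.

```python
def _pieces(name, seq, chunk_size):
    """Yield (new_name, piece) pairs for one sequence (hypothetical helper)."""
    for i in range(0, len(seq), chunk_size):
        # Number the pieces from 1 so names read chr1_chunk1, chr1_chunk2, ...
        yield f"{name}_chunk{i // chunk_size + 1}", seq[i:i + chunk_size]


def split_fasta(lines, chunk_size=5_000_000):
    """Split every FASTA record in `lines` into pieces of at most chunk_size bases."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield from _pieces(name, "".join(parts), chunk_size)
            # Keep only the first word of the header as the record name.
            name = (line[1:].split() or ["unnamed"])[0]
            parts = []
        elif line:
            parts.append(line)
    if name is not None:
        yield from _pieces(name, "".join(parts), chunk_size)


if __name__ == "__main__":
    demo = [">chr1 demo", "ACGTACGT", ">chr2", "AC"]
    for header, seq in split_fasta(demo, chunk_size=3):
        print(f">{header}\n{seq}")
```

Passing an open file handle as `lines` would stream a real genome; the coordinates of any hit in a chunk then need to be shifted back by the chunk's start position.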

yuk2521 commented 2 years ago

@amvarani I also believe it's a problem with SINEFinder.py. The findall method calls itself recursively, which can cause a recursion error when the sequence is too long.

It looks like the SINEFinder.py script has not been updated since 2010. I suppose it needs some improvement to work with large data.
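To make the recursion concern concrete, here is a toy comparison. It is not the actual SINEFinder.py code: the pattern and both function names are hypothetical stand-ins. It only shows why a search that recurses once per hit fails on long inputs while an iterative scan does not.

```python
import re

# Hypothetical stand-in motif; the real script uses its own patterns.
PATTERN = re.compile(r"A{3}")


def find_recursive(seq, start=0):
    """Collect hit positions by recursing once per hit (depth grows with hit count)."""
    m = PATTERN.search(seq, start)
    if m is None:
        return []
    return [m.start()] + find_recursive(seq, m.end())


def find_iterative(seq):
    """Collect the same hit positions with a flat loop (constant stack depth)."""
    return [m.start() for m in PATTERN.finditer(seq)]


if __name__ == "__main__":
    long_seq = "AAAC" * 5000  # 5000 hits, far beyond the default recursion limit
    print(len(find_iterative(long_seq)))
    try:
        find_recursive(long_seq)
    except RecursionError as err:
        print("recursive version failed:", err)
```

Rewriting the recursive search as a loop would remove the dependence on sequence length, which is generally preferable to raising the recursion limit.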

amvarani commented 2 years ago

There is another solution

Change the lines in the CONFIGURATION section: 'RUNTYPE': 'seqwise', to 'RUNTYPE': 'chunkwise',

and also redefine the chunkwise sizes, for example:

'CHUNKSIZE': 3000000,
'OVERLAP': 8000,

In the class FastaIterator, I have also changed to:

class FastaIterator:
    config = {
        'RUNTYPE': 'chunkwise',
        'CHUNKSIZE': 3000000,
        'OVERLAP': 8000,
    }

Now it is running all my fasta files. Interestingly, the At example is also running much faster! ~10 minutes, instead of 25min!
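For readers wondering what the CHUNKSIZE and OVERLAP values control, a minimal sketch of overlapping windows is shown here. This assumes a straightforward sliding-window scheme; whether SINEFinder.py implements its chunkwise mode exactly this way is not confirmed in this thread, and `overlapping_chunks` is a hypothetical name.

```python
def overlapping_chunks(seq, chunk_size=3_000_000, overlap=8_000):
    """Yield (start_offset, chunk) windows where consecutive chunks share `overlap` bases."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    start = 0
    while True:
        yield start, seq[start:start + chunk_size]
        if start + chunk_size >= len(seq):
            break  # this window already reached the end of the sequence
        start += step


if __name__ == "__main__":
    for offset, chunk in overlapping_chunks("ABCDEFGHIJ", chunk_size=4, overlap=1):
        print(offset, chunk)
```

The overlap exists so that an element lying across a chunk boundary is still fully contained in at least one window, provided it is shorter than the overlap. Hits found in the shared region may be reported twice and would need deduplication by genome coordinate.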

To maintain compatibility with seqwise mode, you can also add the following to the header of the script:
sys.setrecursionlimit(10**7)
threading.stack_size(2**27)

But these options are not recommended in Python.

amvarani commented 2 years ago

Hi there, Finally, I got all results, and they are great! I really enjoyed the AnnoSINE!! Nice software that I hope to use a lot!

To help you guys close this issue, I will summarize what I did:

1) On Ubuntu Linux, install lib32z1: sudo apt-get install lib32z1

2) Correct the irf and trf calls in 'AnnoSINE.py'. For example:
os.system('irf ' + out_genome_assembly_path +'/Step6_irf_input.fasta '
os.system('trf ' + out_genome_assembly_path + '/Step4_rna_output.fasta '

3) Modify SINEFinder.py to run in 'chunkwise' instead of 'seqwise' mode. Script attached.

4) Activate multithreading in the nhmmer call in 'AnnoSINE.py'. For example:
os.system( 'nhmmer --cpu 28 -o ../Family_Seq/' + dir_hmm[num_dir_hmm] + '/' + dir_hmm[num_dir_hmm] + '.out '

It would also be desirable to switch from BLAST to DIAMOND in 'AnnoSINE.py' to make things faster for very large genomes (plant genomes larger than 8 Gb).
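Points 2 and 4 of the summary above could be made less platform-dependent along these lines. This is a sketch under assumptions, not the AnnoSINE code: `resolve_tool` and `nhmmer_command` are hypothetical helpers, the file names are placeholders, and only the nhmmer `--cpu` and `-o` options already shown in this thread are used.

```python
import os
import shutil
import subprocess


def resolve_tool(name):
    """Return the full path of an external tool found on PATH, or fail clearly."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(f"'{name}' was not found on PATH; install it or add it to PATH")
    return path


def nhmmer_command(hmm_file, genome_fasta, out_file, cpus=None, exe="nhmmer"):
    """Build an nhmmer argument list; cpus defaults to the number of cores detected."""
    cpus = cpus or os.cpu_count() or 1
    return [exe, "--cpu", str(cpus), "-o", out_file, hmm_file, genome_fasta]


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    cmd = nhmmer_command("family.hmm", "genome.fasta", "family.out")
    print(" ".join(cmd))
    # To actually run it (requires nhmmer to be installed):
    # subprocess.run([resolve_tool(cmd[0]), *cmd[1:]], check=True)
```

Passing an argument list to `subprocess.run` avoids the string concatenation used with `os.system`, and looking up `irf` and `trf` with the same `resolve_tool` helper would give a clear error when a binary is missing instead of a silent failure.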

I hope that it will help!

Best

SINEFinder2.zip