zstephens / neat-genreads

NEAT read simulation tools
Other
95 stars 27 forks source link

IndexError: list index out of range #65

Closed schmeing closed 4 years ago

schmeing commented 4 years ago

Hi,

I get the following error:

sampling reads...
1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44% 45% 46% 47% 48% 49% 50% 51% 52% 53% 54% 55% 56% Traceback (most recent call last):
  File "neat-genReads.py", line 743, in <module>
    main()
  File "neat-genReads.py", line 572, in main
    coverage_avg = sequences.init_coverage(tuple(coverage_dat),fragDist=FRAGLEN_DISTRIBUTION)
  File "py/SequenceContainer.py", line 90, in init_coverage
    if self.FM_pos[i][j] == None:
IndexError: list index out of range

Calling neat-genReads.py in the following way:

python2 neat-genReads.py -r athaliana.fa -R 150 -o neat -c 31 -e errors.p -M 0 --pe-model fraglen.p --gc-model gc_bias.p -p 2 -v corrected.vcf

I am trying to simulate diploid with mutations only from the vcf and no random mutations. Am I calling the simulator wrong?

Thanks

zstephens commented 4 years ago

Greetings! My first thought is that maybe the custom fragment length distribution is the cause. If you replace the --pe-model parameter with something like --pe 300 30, does it still crash?

schmeing commented 4 years ago

It seems to be a combination of the --pe-model parameter and the diploid. I ran successfully:

python2 neat-genReads.py -r athaliana.fa -R 150 -o neat -c 31 -e errors.p -M 0 --pe-model fraglen.p --gc-model gc_bias.p

and:

python2 neat-genReads.py -r athaliana.fa -R 150 -o neat -c 31 -e errors.p -M 0 --pe 300 30 --gc-model gc_bias.p -p 2 -v storage/athaliana/reference/ERR2017816/pilon-corrected.vcf
zstephens commented 4 years ago

Looking through the section of code where the error occurred, I think there are a few possibilities.

Does the input VCF file contain any large indels?

Would you be able to show my your fragment length pickle file? Or list the values of lengths it includes via: python >>import pickle >>[values, probabilities] = pickle.load(open('fraglen.p','rb')) >>print values

Thanks!

schmeing commented 4 years ago

If it helps I can provide you all my input files. Just let me know how you like to receive them.

In the VCF nothing sticks out at the position of the crash. The first two chromosomes contain longer variants and longer indels than the crashing third.

The fragment length looks like this:

[148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 601]

I tried if the read length being longer than the minimum fragment length has something to do with it, but -R 147 still crashes. Although, I get

Warning: Read length of error model (150) does not match -R value (147), rescaling model...

so I don't know if that really changed the loops around the crash.

Additionally I tried a Gaussian parametrization that is fitted to the mapped distribution and it also crashes: --pe 319 59

zstephens commented 4 years ago

hmmm. I'm still unable to replicate this. I downloaded the arabidopsis reference (tair10) from here: https://www.arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FGenes%2FTAIR10_genome_release

and ran the following command:

python genReads.py -r TAIR10.fa -p 2 -c 31 -R 150 -M 0 --pe 319 59 -o test

and it completed successfully. So it almost certainly is related to the input models or vcf.

I'd definitely like to try running the exact command as you, with the same pickle/vcf inputs. If they're small enough to attach here or send over email (to zstephe2@illinois.edu) that'd work.

Thanks for your help with this!

schmeing commented 4 years ago

I send you my input files. I strongly assume it is some combination of the wide fragment length distribution and the variation. Running it without the vcf like you did above also works for me, but I cannot pin down the cause in the vcf as nothing sticks out. Running the narrower fragment length distribution that you suggested works in combination with the vcf, but it is not the fraglen.p per se as the broader distribution also crashes.

zstephens commented 4 years ago

I pushed a quick update to the repository that simplifies all input variants to their minimal form. E.g. [GTTTTA --> GTTTA] gets collapsed down to [GT --> G]. It's not 100% clear to me why the expanded form was causing the original index error, but with this change I was able to successfully simulate reads from arabidopsis chr3 using your input variants.

Could you try out your simulation again with the updated repo and see if it completes?

schmeing commented 4 years ago

Yes, works now. Thanks for the fast fix!