Closed schmeing closed 4 years ago
Greetings! My first thought is that maybe the custom fragment length distribution is the cause. If you replace the --pe-model parameter with something like --pe 300 30, does it still crash?
It seems to be a combination of the --pe-model parameter and the diploid. I ran successfully:
python2 neat-genReads.py -r athaliana.fa -R 150 -o neat -c 31 -e errors.p -M 0 --pe-model fraglen.p --gc-model gc_bias.p
and:
python2 neat-genReads.py -r athaliana.fa -R 150 -o neat -c 31 -e errors.p -M 0 --pe 300 30 --gc-model gc_bias.p -p 2 -v storage/athaliana/reference/ERR2017816/pilon-corrected.vcf
Looking through the section of code where the error occurred, I think there are a few possibilities.
Does the input VCF file contain any large indels?
Would you be able to show my your fragment length pickle file? Or list the values of lengths it includes via:
python
>>import pickle
>>[values, probabilities] = pickle.load(open('fraglen.p','rb'))
>>print values
Thanks!
If it helps I can provide you all my input files. Just let me know how you like to receive them.
In the VCF nothing sticks out at the position of the crash. The first two chromosomes contain longer variants and longer indels than the crashing third.
The fragment length looks like this:
[148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 601]
I tried if the read length being longer than the minimum fragment length has something to do with it, but -R 147
still crashes. Although, I get
Warning: Read length of error model (150) does not match -R value (147), rescaling model...
so I don't know if that really changed the loops around the crash.
Additionally I tried a Gaussian parametrization that is fitted to the mapped distribution and it also crashes: --pe 319 59
hmmm. I'm still unable to replicate this. I downloaded the arabidopsis reference (tair10) from here: https://www.arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FGenes%2FTAIR10_genome_release
and ran the following command:
python genReads.py -r TAIR10.fa -p 2 -c 31 -R 150 -M 0 --pe 319 59 -o test
and it completed successfully. So it almost certainly is related to the input models or vcf.
I'd definitely like to try running the exact command as you, with the same pickle/vcf inputs. If they're small enough to attach here or send over email (to zstephe2@illinois.edu) that'd work.
Thanks for your help with this!
I send you my input files. I strongly assume it is some combination of the wide fragment length distribution and the variation. Running it without the vcf like you did above also works for me, but I cannot pin down the cause in the vcf as nothing sticks out. Running the narrower fragment length distribution that you suggested works in combination with the vcf, but it is not the fraglen.p per se as the broader distribution also crashes.
I pushed a quick update to the repository that simplifies all input variants to their minimal form. E.g. [GTTTTA --> GTTTA] gets collapsed down to [GT --> G]. It's not 100% clear to me why the expanded form was causing the original index error, but with this change I was able to successfully simulate reads from arabidopsis chr3 using your input variants.
Could you try out your simulation again with the updated repo and see if it completes?
Yes, works now. Thanks for the fast fix!
Hi,
I get the following error:
Calling neat-genReads.py in the following way:
I am trying to simulate diploid with mutations only from the vcf and no random mutations. Am I calling the simulator wrong?
Thanks