zstephens / neat-genreads

NEAT read simulation tools
Other
95 stars 27 forks source link

Something went wrong #7

Closed muppetjones closed 7 years ago

muppetjones commented 7 years ago

At higher coverage paramters (e.g., -c 500), I am getting the following quite often:

Error: Something went wrong!
(14487, 'TCC', 'TTT') TTC

from this command

  for i in $(seq 1 10)
  do
    ( python /neat-genreads/genReads.py \
      -r ${REFERENCE} \
      -R 126 \
      -o "${prefix}" \
      --pe 300 30 \
      -t "${REGIONS}" \
      -c 500 \
      -M 0 \
      --bam --vcf \
      --job $i 10 ) &
  done

I am unsure what the exact error is, but as the script exists when this occurs, the resulting files are incomplete.

-rw-r--r-- 1 root root 151M Oct 12 16:55 normal_golden.bam.job01of10
-rw-r--r-- 1 root root 264M Oct 12 17:06 normal_golden.bam.job03of10
-rw-r--r-- 1 root root  45M Oct 12 16:43 normal_golden.bam.job04of10
-rw-r--r-- 1 root root 172M Oct 12 16:57 normal_golden.bam.job05of10
-rw-r--r-- 1 root root 164M Oct 12 16:56 normal_golden.bam.job06of10
-rw-r--r-- 1 root root 1.1M Oct 12 16:39 normal_golden.bam.job07of10
-rw-r--r-- 1 root root  85M Oct 12 16:48 normal_golden.bam.job08of10
-rw-r--r-- 1 root root  30M Oct 12 16:42 normal_golden.bam.job09of10
-rw-r--r-- 1 root root 137M Oct 12 16:53 normal_golden.bam.job10of10
-rw-r--r-- 1 root root  821 Oct 12 16:55 normal_golden.vcf.job01of10
-rw-r--r-- 1 root root 843K Oct 12 17:06 normal_golden.vcf.job03of10
-rw-r--r-- 1 root root    0 Oct 12 16:38 normal_golden.vcf.job04of10
-rw-r--r-- 1 root root    0 Oct 12 16:38 normal_golden.vcf.job05of10
-rw-r--r-- 1 root root    0 Oct 12 16:38 normal_golden.vcf.job06of10
-rw-r--r-- 1 root root    0 Oct 12 16:38 normal_golden.vcf.job07of10
-rw-r--r-- 1 root root    0 Oct 12 16:38 normal_golden.vcf.job08of10
-rw-r--r-- 1 root root  89K Oct 12 16:42 normal_golden.vcf.job09of10
-rw-r--r-- 1 root root  16K Oct 12 16:53 normal_golden.vcf.job10of10
-rw-r--r-- 1 root root 1.7G Oct 12 17:09 normal_merged_golden.bam
-rw-r--r-- 1 root root 947K Oct 12 17:09 normal_merged_golden.vcf
-rw-r--r-- 1 root root 2.0G Oct 12 17:07 normal_merged_read1.fq
-rw-r--r-- 1 root root 2.0G Oct 12 17:08 normal_merged_read2.fq
-rw-r--r-- 1 root root 180M Oct 12 16:55 normal_read1.fq.job01of10
-rw-r--r-- 1 root root 312M Oct 12 17:06 normal_read1.fq.job03of10
-rw-r--r-- 1 root root  53M Oct 12 16:43 normal_read1.fq.job04of10
-rw-r--r-- 1 root root 203M Oct 12 16:57 normal_read1.fq.job05of10
-rw-r--r-- 1 root root 193M Oct 12 16:56 normal_read1.fq.job06of10
-rw-r--r-- 1 root root 1.3M Oct 12 16:39 normal_read1.fq.job07of10
-rw-r--r-- 1 root root 100M Oct 12 16:48 normal_read1.fq.job08of10
-rw-r--r-- 1 root root  35M Oct 12 16:42 normal_read1.fq.job09of10
-rw-r--r-- 1 root root 165M Oct 12 16:53 normal_read1.fq.job10of10
-rw-r--r-- 1 root root 180M Oct 12 16:55 normal_read2.fq.job01of10
-rw-r--r-- 1 root root 312M Oct 12 17:06 normal_read2.fq.job03of10
-rw-r--r-- 1 root root  53M Oct 12 16:43 normal_read2.fq.job04of10
-rw-r--r-- 1 root root 203M Oct 12 16:57 normal_read2.fq.job05of10
-rw-r--r-- 1 root root 193M Oct 12 16:56 normal_read2.fq.job06of10
-rw-r--r-- 1 root root 1.3M Oct 12 16:39 normal_read2.fq.job07of10
-rw-r--r-- 1 root root 100M Oct 12 16:48 normal_read2.fq.job08of10
-rw-r--r-- 1 root root  35M Oct 12 16:42 normal_read2.fq.job09of10
-rw-r--r-- 1 root root 165M Oct 12 16:53 normal_read2.fq.job10of10
zstephens commented 7 years ago

Hey there,

That error is indicating that on an attempt to modify the reference with an indel the reference nucleotide to be altered was not as expected. From the command you provided I don’t see how any mutations could’ve even been inserted (-M 0 means no random mutations, was there a -v parameter that was omitted?).

I see that the variant that was inserted (my guess is from -v) is a multi-SNP (if that terminology is standard) where 2 bases are subbed out for 2 others. At the moment I believe NEAT is only set up to accept variants in the following form:

SNP: (position=X, ref=A, alt=B) INS: (position=X, ref=A, alt=ABCDE…) DEL: (position=X, ref=ABCDE…, alt=A)

So the unexpected format caused the error. Acceptable indels are either one-to-many or many-to-one, as specified above. Multi-SNPS (e.g. 2-to-2, in your example) and similarly complex insertion-deletion combos (e.g. ref=ACGT, alt=AA) are not yet supported. Here are my ideas for how to move forward on this:

1) Ideally I’d like to add full support for arbitrarily complex events (such that they appear in the output vcf just as they do in the input)

2) Slightly-less ideal would be to break them up into multiple sub-events. E.g. if the input VCF contained:

14487, TCC —> TTT

The output VCF would contain:

14488, C —> T 14489, C —> T

3.) At a bare minimum the events could be ignored, and a warning could be printed to console.

If option 2 is acceptable I’ll take a crack at implementing that later today. If not then I’ll implement option 3 and eventually get around to fully supporting the events a la option 1.

Cheers, Zach

On Oct 12, 2016, at 12:09 PM, Stephen J Bush notifications@github.com<mailto:notifications@github.com> wrote:

At higher coverage paramters (e.g., -c 500), I am getting the following quite often:

Error: Something went wrong! (14487, 'TCC', 'TTT') TTC

from this command

for i in $(seq 1 10) do ( python /neat-genreads/genReads.py \ -r ${REFERENCE} \ -R 126 \ -o "${prefix}" \ --pe 300 30 \ -t "${REGIONS}" \ -c 500 \ -M 0 \ --bam --vcf \ --job $i 10 ) & done

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3Agithub.com_zstephens_neat-2Dgenreads_issues_7&d=DQMCaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=iSEBNMfCHfS5bLIS7AqI6wkxgRNv8-2dMwRF43ddddU&m=b2ZBMwFEiVNv3WZwMvc6CIvKfmzaihavV8Uqg3KomQM&s=PSPgbVVxCY9AxrQHE71m78zrvLY8hlYSSI3aFDcjqRg&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3Agithub.com_notifications_unsubscribe-2Dauth_AIHT1fQPmM8ojMOar-5FbmregigGXmkvWDks5qzRQ9gaJpZM4KU-5Ffq&d=DQMCaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=iSEBNMfCHfS5bLIS7AqI6wkxgRNv8-2dMwRF43ddddU&m=b2ZBMwFEiVNv3WZwMvc6CIvKfmzaihavV8Uqg3KomQM&s=ORsvAM1Us5d-KAxSQxzOkk5q0YeGE9liSn3HwuQol2c&e=.

muppetjones commented 7 years ago

Thanks for the reply.

Yep--I posted the wrong command. It's exactly the same, but giving the Cosmic VCF via -v. Turns out the Cosmic VCF is more VCF-ish, which could explain why it's breaking. It does, however, contain multiple nucleotide variants

Implementing (1) would be great, but VCFs are a mess, so that's not a trivial fix. Also, VCFs aren't normalized, so one VCF might report the same variant in a completely different way.

Similarly, (2) could work fine. Without information on the recurrence of the variant, we can't make any assumptions on whether or not the variants are linked. In most cases, this approach is likely fine.

I'd go with (3) for now, and just add a separate count for ignored multi-nucleotide variants. For my use case in particular, the VCF is so big that losing a few is not important.

found 11533965 valid variants in input vcf.
 * 175 variants skipped due to: (quality filtered / reference genotypes / invalid syntax)
 * 1640512 variants skipped due to multiple variants found per position
...
found 194678 valid variants for 5 in input VCF...
593370 variants skipped (invalid position or alt allele)

Based on the above, it looks like there is initial filtering when the VCF is read, then again for each chromosome. Is the invalid position due in part to the regions given in the bed file?

zstephens commented 7 years ago

I believe I have fixed this.

Multinucleotide variants are now supported. From some preliminary analysis it looks like everything is working, but I’ll need to run through more test cases to be sure the golden bam is still being constructed correctly, etc.

Additionally, if the inserted VCF (-v parameter) contains overlapping or conflicting variants they are now ignored.

e.g. if the input VCF had:

100 ACGT A 101 C T

The second variant will now be ignored. Previously it was not and this was the cause of the “Something went wrong!” error you encountered.

Also, I updated the variant stats printed to console so that it’s more descriptive why certain input mutations were skipped.

Hope this helps, Zach

On Oct 12, 2016, at 1:42 PM, Stephen J Bush notifications@github.com<mailto:notifications@github.com> wrote:

Thanks for the reply.

Yep--I posted the wrong command. It's exactly the same, but giving the Cosmic VCF via -v. Turns out the Cosmic VCF is more VCF-ish, which could explain why it's breaking. It does, however, contain multiple nucleotide variants

Implementing (1) would be great, but VCFs are a mess, so that's not a trivial fix. Also, VCFs aren't normalized, so one VCF might report the same variant in a completely different way.

Similarly, (2) could work fine. Without information on the recurrence of the variant, we can't make any assumptions on whether or not the variants are linked. In most cases, this approach is likely fine.

I'd go with (3) for now, and just add a separate count for ignored multi-nucleotide variants. For my use case in particular, the VCF is so big that losing a few is not important.

found 11533965 valid variants in input vcf.

Based on the above, it looks like there is initial filtering when the VCF is read, then again for each chromosome. Is the invalid position due in part to the regions given in the bed file?

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3Agithub.com_zstephens_neat-2Dgenreads_issues_7-23issuecomment-2D253301353&d=DQMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=iSEBNMfCHfS5bLIS7AqI6wkxgRNv8-2dMwRF43ddddU&m=aiAp-_tCYKVpNyRr_TfoVZ3oKo1uwFBO8AODKHrQ380&s=AXDdIe33WtVDV2xTEFVP919xTlULUgBwjPZ1lW0gBUY&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3Agithub.com_notifications_unsubscribe-2Dauth_AIHT1WJxRHJG-2Div6xp-2DYKoV5HsYxFRAaks5qzSoLgaJpZM4KU-5Ffq&d=DQMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=iSEBNMfCHfS5bLIS7AqI6wkxgRNv8-2dMwRF43ddddU&m=aiAp-_tCYKVpNyRr_TfoVZ3oKo1uwFBO8AODKHrQ380&s=Ob0yc6q_yfiykzLM4fcGoK0mi6BdAEXR1TXpixHY_KY&e=.

muppetjones commented 7 years ago

Big thanks. I really appreciate it.

muppetjones commented 7 years ago

Just came across this: (From cosmic.vcf--trying to find the issue now)

Error: Something went wrong!
(-1, 'A', 'G') T
muppetjones commented 7 years ago

Both SNP and INDEL insertion (only places that can call Something went wrong, correct?) check for an eventPos == -1, so the only place this can come from is self.snpList

Proposed fix (running now):

@@ -173,6 +173,8 @@ class SequenceContainer:
                                myAlt = inpV[2][whichAlt[i]]
                                myVar = (inpV[0]-self.x,inpV[1],myAlt)
                                inLen = max([len(inpV[1]),len(myAlt)])
+                               if myVar[0] < 0:
+                                       continue
                                #print myVar, chr(self.sequences[p][myVar[0]])
                                if len(inpV[1]) == 1 and len(myAlt) == 1:
                                        if self.blackList[p][myVar[0]]:
@@ -291,17 +293,20 @@ class SequenceContainer:
                        all_snps[p] = [n for n in all_snps[p] if self.blackList[p][n[0]] != 1]

                # modify reference sequences
+               bad_snp = []
                for i in xrange(len(all_snps)):
                        for j in xrange(len(all_snps[i])):
                                # sanity checking (for debugging purposes)
                                vPos = all_snps[i][j][0]
                                if all_snps[i][j][1] != chr(self.sequences[i][vPos]):
-                                       print '\nError: Something went wrong!\n', all_snps[i][j], chr(self.sequences[i][vPos]),'\n'
-                                       exit(1)
+                                       print '\nError: SNP went wrong!\n', all_snps[i][j], chr(self.sequences[i][vPos]),'\n'
+                                       bad_snp.append(i)
+                                       #exit(1)
                                else:
                                        self.sequences[i][vPos] = all_snps[i][j][2]

                adjToAdd = [[] for n in xrange(self.ploidy)]
+               bad_indel = []
                for i in xrange(len(all_indels)):
                        for j in xrange(len(all_indels[i])):
                                # sanity checking (for debugging purposes)
@@ -310,8 +315,9 @@ class SequenceContainer:
                                #print all_indels[i][j], str(self.sequences[i][vPos:vPos2])
                                #print len(self.sequences[i]),'-->',
                                if all_indels[i][j][1] != str(self.sequences[i][vPos:vPos2]):
-                                       print '\nError: Something went wrong!\n', all_indels[i][j], str(self.sequences[i][vPos:vPos2]),'\n'
-                                       exit(1)
+                                       print '\nError: INDEL went wrong!\n', all_indels[i][j], str(self.sequences[i][vPos:vPos2]),'\n'
+                                       bad_indel.append(i)
+                                       #exit(1)
                                else:
                                        self.sequences[i] = self.sequences[i][:vPos] + bytearray(all_indels[i][j][2]) + self.sequences[i][vPos2:]
                                        adjToAdd[i].append((all_indels[i][j][0],len(all_indels[i][j][2])-len(all_indels[i][j][1])))

Edit: As an aside, I'd fix and open up a PR, but (a) I use python3, and (b) my setup will auto-beautify to pep8 / tabs-to-spaces, which will massively change the code. I am not set up to easily change back and forth.

Edit: myVar[0] == -1 --> myVar[0] < 0

Edit:

Error: INDEL went wrong!
(16645, 'TGCT', 'T') TGGT

Error: INDEL went wrong!
(16644, 'ATGC', 'A') ATGG

Error: INDEL went wrong!
(16138, 'TGTT', 'T') TGGT
zstephens commented 7 years ago

Hmm,

The reason I had NEAT bomb out with an error instead of just skipping the variant was that I didn’t want it to be possible to ever run into that situation, letting me know there’s a roaming bug to squash if it ever comes up.

Looking at this case in particular, I’m having a hard time replicating the problem. As you observed, it looks like there’s a variant in the list provided to insert_mutations() that had a position == -1. This shouldn’t be possible, in theory, and my attempts to generate pathological cases that elicit this behavior have thus far not been successful.

Based on the error message provided, would you be able to deduce the particular variant (presumably inserted from your vcf?) that is causing the issue? With that specific variant (and the relevant command line parameters, -R and --pe) the debugging process would be a bit easier.

Cheers, Zach

On Oct 17, 2016, at 1:44 PM, Stephen J Bush notifications@github.com<mailto:notifications@github.com> wrote:

Both SNP and INDEL insertion (only places that can call Something went wrong, correct?) check for an eventPos == -1, so the only place this can come from is self.snpList

Proposed fix (running now):

@@ -173,6 +173,8 @@ class SequenceContainer: myAlt = inpV[2][whichAlt[i]] myVar = (inpV[0]-self.x,inpV[1],myAlt) inLen = max([len(inpV[1]),len(myAlt)])

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3Agithub.com_zstephens_neat-2Dgenreads_issues_7-23issuecomment-2D254296250&d=DQMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=iSEBNMfCHfS5bLIS7AqI6wkxgRNv8-2dMwRF43ddddU&m=vNIFQcMVgh4L6z_aVk885DVzwRBuu6Ds3HObk8kr5G8&s=tm_QyCWwj77VnQvtW2fcHDALDtA4Ts3fyUAM0KOzgCA&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3Agithub.com_notifications_unsubscribe-2Dauth_AIHT1VZOvZjQnlqFeuWOs6qoMgDWQYEUks5q08IGgaJpZM4KU-5Ffq&d=DQMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=iSEBNMfCHfS5bLIS7AqI6wkxgRNv8-2dMwRF43ddddU&m=vNIFQcMVgh4L6z_aVk885DVzwRBuu6Ds3HObk8kr5G8&s=Lv7g_kr9ezdO38xhBqtzUGzG0Mrxt0LqVp-ckS16Ujk&e=.

muppetjones commented 7 years ago

Unfortunately, I've not be able to figure out exactly which variant is causing the error. The position is calculated, so it's not useful, and the exact SNP is not unique. A print instead of a blanket continue in insert_mutations would help, and I should have put it in.

The parameters: -R /refs/ensembl/ensembl_build_75.fa (from ftp://ftp.ensembl.org/pub/release-75) and --pe 300 30.

For this use case, I'm not particularly concerned which variants are included, and this file contains enough. I'd rather the run complete. In the example above, the variants are stored in a list that you could write to a file after processing, etc.

Aside: Use the logging library instead of print. You don't have to worry about flushing output--it prints to the console/file in real-time. The following is a slightly hack-ish way to setup a logger in a script. If for some reason the script is not called as main, it'll keep the logger setup as the calling script set up logging, otherwise, it'll make sure we're logging to a file and the console.

import logging

def setup_logger(console=True, log_file=None):
    """Setup a console and file logger."""
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.DEBUG)

    formatter = logging.Formatter(
        '[%(asctime)s - %(name)s:%(lineno)s - %(levelname)s] %(message)s')

    # create console handler with a higher log level
    if console:
        ch = logging.StreamHandler()
        ch.setLevel(logging.DEBUG)
        ch.setFormatter(formatter)
        logger.addHandler(ch)

    # create file handler which logs even debug messages
    if log_file:
        fh = logging.FileHandler(log_file)
        fh.setLevel(logging.DEBUG)
        fh.setFormatter(formatter)
        logger.addHandler(fh)

    return logger

# get a logger object for the current file, but don't set it up
log = setup_logger(console=False, log_file=None)

if __name__ == '__main__':
    # We're running the script, so set up the logger as we intend
    log = setup_logger(
        console=True,
        log_file='script.log'
    )

    log.debug('Debug msg')
    log.error('Something went horridly wrong!')
[2016-10-18 13:36:10,680 - __main__:31 - DEBUG] Debug msg
[2016-10-18 13:36:10,691 - __main__:32 - ERROR] Something went horridly wrong!