zjshi / gt-pro

MIT License
23 stars 7 forks source link

contain error damage to the respective input file, enforce FASTQ format #22

Closed boris-dimitrov closed 4 years ago

boris-dimitrov commented 5 years ago

we used to output garbage if the input wasn't properly formatted FASTQ; we now check for '@' and '+' characters at the start of every other line and we also ensure all DNA letters are from the ACGT alphabet before we try to encode a kmer

if there is an error -- whether I/O error or parse error -- it is reported on stderr, and written to file out.X.err; suppressing the creation of the normal out.X.tsv; only the input file that has the error is affected, the remaining input files continue processing as normal

cc @zhaoc1 @zjshi on feedback about this approach to error handling, and also thoughts about naming of output files -- should we have an option to automatically name the output foo.tsv for given input foo.fastq -- or make the current out.{0, 1, 2, ...}.tsv optional?

what should we do if an output file exists? skip its generation? then add option "-f" to overwrite that behavior? skipping seems convenient if you run on a bunch of files, hen a few of the files have errors, you fix and rerun?

should we add any statistics for the files that it ran on etc -- i added some basic ones mostly for sanity checking

I am also thinking I might add an option for the program to take input from STDIN and output to STDOUT -- imagining a workflow where it is run in this way to aggregate across samples, like cat *fastq | gt_pro > all_snps.tsv

zjshi commented 5 years ago

we used to output garbage if the input wasn't properly formatted FASTQ; we now check for '@' and '+' characters at the start of every other line and we also ensure all DNA letters are from the ACGT alphabet before we try to encode a kmer

Sounds great! The DNA sequence could also have ambiguous letters, I used to treat them as gaps in the sequence. But I guess skipping the kmers with any char other than ACGT simply works as well.

if there is an error -- whether I/O error or parse error -- it is reported on stderr, and written to file out.X.err; suppressing the creation of the normal out.X.tsv; only the input file that has the error is affected, the remaining input files continue processing as normal

Sounds great!

cc @zhaoc1 @zjshi on feedback about this approach to error handling, and also thoughts about naming of output files -- should we have an option to automatically name the output foo.tsv for given input foo.fastq -- or make the current out.{0, 1, 2, ...}.tsv optional?

I would prefer the matched names. One issue is that many users starts with some compressed fastqs (fastq.gz), they may need supply input through some bash redirection e.g. <(gzip -dc foo.fastq.gz) .

what should we do if an output file exists? skip its generation? then add option "-f" to overwrite that behavior? skipping seems convenient if you run on a bunch of files, hen a few of the files have errors, you fix and rerun?

I would prefer an overwrite by default, but allow an option to skip. One less expected but possible scenario is when an output file exists and there is an error during current running, I guess we may have no choice but skip the overwrite.

should we add any statistics for the files that it ran on etc -- i added some basic ones mostly for sanity checking

I totally agree. Besides the sanity checking benefit, it may also cheer up the users by simply letting them know how much data has been processed during a short period of time ;)

I am also thinking I might add an option for the program to take input from STDIN and output to STDOUT -- imagining a workflow where it is run in this way to aggregate across samples, like cat *fastq | gt_pro > all_snps.tsv

This should be super helpful!

boris-dimitrov commented 5 years ago

Thanks for the feedback, Jason! I love the <(gzip -dc .) usage so let's continue to support that with automatic numbered output names.

I worry about accidentally overwriting pre-existing files that aren't gtpro outputs. If the overwritten files weren't produced by gtpro, the user might not have the easy ability to regenerate them.

How about we rename "out.X.tsv" to "gtpro_out.db_name.X.tsv" because that way the file name we are overwriting is certain to have been generated by gtpro for this very database, and therefore is not some precious old file we would get in trouble for deleting?

We could make that output prefix configurable, with a default derived from the DB name as above. That way our defaults at least are guaranteed to be non-destructive.

boris-dimitrov commented 5 years ago

Btw we still treat N as a gap in the sequence; the new feature here is that a letter other than ACGTN no longer silently overwrites its neighbors with T.

We could skip kmers with random letters from the alphabet, but, what I chose to do here was err out the file. Because I feel if we are parsing a string with random letters it's probably not a DNA sequence at all, but likely header or quality string -- i.e. a problem FASTQ file or a bug in our parser that should be investigated, rather than quietly ignored.

I could certainly create an option to quietly ignore, though.

boris-dimitrov commented 5 years ago

STDIO streaming implemented in https://github.com/zjshi/gt-pro2.0/pull/23

zjshi commented 5 years ago

I worry about accidentally overwriting pre-existing files that aren't gtpro outputs. If the overwritten files weren't produced by gtpro, the user might not have the easy ability to regenerate them.

Right, I could share your concern.

How about we rename "out.X.tsv" to "gtpro_out.db_name.X.tsv" because that way the file name we are overwriting is certain to have been generated by gtpro for this very database, and therefore is not some precious old file we would get in trouble for deleting?

Sounds great, it also solve the issues when multiple gtpro databases are available (future versions)

We could make that output prefix configurable, with a default derived from the DB name as above. That way our defaults at least are guaranteed to be non-destructive.

Yes, this also sounds great!

We could skip kmers with random letters from the alphabet, but, what I chose to do here was err out the file. Because I feel if we are parsing a string with random letters it's probably not a DNA sequence at all, but likely header or quality string -- i.e. a problem FASTQ file or a bug in our parser that should be investigated, rather than quietly ignored.

I see, I agree that gtpro should alert user that something might go terribly wrong e.g. fastq corruption or else. Now we may have two options beside alerting: 1. throw the exception and abort, 2. moving on as long as downstream records are fine. I guess users may be less irritated with the second option :)