wodanaz / Assembling_viruses


Adds supermetadata step and filters using spike.bed #42

Closed johnbradley closed 3 years ago

johnbradley commented 3 years ago

Adds a step that generates a supermetadata table and filters reads based on spike.bed. Adds the -m mode and -D date.tab command-line arguments and removes -s. Produces a spreadsheet that joins the genotype and supermetadata tables. Adds the new conda requirements needed to create an xlsx spreadsheet file.

Details

Supermetadata

New script supermetadata-modify-titles.sh generates a supermetadata table based on https://github.com/wodanaz/Assembling_viruses/issues/38#issue-839017413.

Spike Filtering

New scripts: intersect-spike.sh, run-spike-genotype-compiler.sh, and run-spike-depth-compiler.sh perform spike intersection filtering and genotype/depth compiling, based on https://github.com/wodanaz/Assembling_viruses/issues/38#issuecomment-807288367.

The run-bcftools-query-alt-ad.sh, run-genotype-compiler.sh, and run-depth-compiler.sh scripts have been removed since their logic is now handled by the new *spike*.sh scripts.

Spreadsheet

New script: create-spreadsheet.py creates a spreadsheet that joins the genotype and supermetadata tables.
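As a rough illustration of the kind of join involved, here is a minimal sketch using only the Python standard library. It is not the code in create-spreadsheet.py: the `sample` key column, the `date` column, and the `join_tables` helper are all hypothetical, and the sketch stops at the joined rows rather than writing an xlsx file.

```python
import csv
import io


def join_tables(genotype_tsv, metadata_tsv, key="sample"):
    """Left-join genotype rows with metadata rows on a shared key column.

    The key column name is an assumption made for this sketch.
    """
    # Index the metadata rows by key for constant-time lookup.
    metadata = {
        row[key]: row
        for row in csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    }
    joined = []
    for row in csv.DictReader(io.StringIO(genotype_tsv), delimiter="\t"):
        merged = dict(row)
        # Genotype rows without matching metadata are kept unchanged.
        merged.update(metadata.get(row[key], {}))
        joined.append(merged)
    return joined


# Hypothetical inputs: one genotype column and one metadata column.
genotype_tsv = "sample\t23403\nS1\tG\nS2\tA\n"
metadata_tsv = "sample\tdate\nS1\t2021-03-01\n"
rows = join_tables(genotype_tsv, metadata_tsv)
```

Keeping unmatched genotype rows (a left join) is a choice made for the sketch; the actual script may handle missing metadata differently.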

Note

This PR includes a commit that raises the memory requested for some steps, because those steps were killed for using too much memory while testing this code.

Fixes #38

johnbradley commented 3 years ago

One difference to note is I made the spike filtering regular expression more rigorous. The regex from https://github.com/wodanaz/Assembling_viruses/issues/38#issuecomment-807288367 looks like this:

...grep -E '22812|22813|22917|23012|23063|23403|23592|23593'...

The code in this PR anchors the match to the beginning of the line and requires a word boundary after the position: https://github.com/wodanaz/Assembling_viruses/blob/b033628da4eeab7d80198c77b5563180083bd2e7/scripts/intersect-spike.sh#L20

For example, 1228121 would have matched the original grep. The current values in spike.bed rule out such a large position, so the stricter regex only serves as a more robust check. This filtering would likely be simpler to express in Python, but I wanted to limit the new Python changes in this PR to the spreadsheet generation.
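To make the difference concrete, here is a small Python sketch comparing an unanchored alternation with an anchored, word-bounded one. The exact pattern in intersect-spike.sh is not reproduced here; the `strict` pattern is an assumed equivalent of "start of line, then a word boundary", and the sample lines (position in the first tab-separated column) are made up for illustration.

```python
import re

# Positions copied from the grep quoted in this comment.
SPIKE_POSITIONS = [
    "22812", "22813", "22917", "23012",
    "23063", "23403", "23592", "23593",
]

# Unanchored alternation: matches the digits anywhere in the line.
loose = re.compile("|".join(SPIKE_POSITIONS))

# Assumed stricter form: anchored at line start, closed by a word boundary.
strict = re.compile(r"^(?:" + "|".join(SPIKE_POSITIONS) + r")\b")


def filter_lines(lines, pattern):
    """Return the lines in which the pattern is found."""
    return [line for line in lines if pattern.search(line)]


# Hypothetical input lines with the position as the first column.
sample = [
    "22812\tA\tC",
    "1228121\tG\tT",  # contains 22812 but is a different position
    "228121\tG\tT",   # starts with 22812 but continues with another digit
    "23403\tA\tG",
    "100\tC\tT",
]

loose_hits = filter_lines(sample, loose)
strict_hits = filter_lines(sample, strict)
```

With these inputs the loose pattern keeps the 1228121 and 228121 lines, while the strict pattern keeps only the 22812 and 23403 lines.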