zstephens / exogene

A workflow for identifying viral integrations in both short and long read data
GNU General Public License v3.0
7 stars 2 forks

Custom viral fasta must also be indexed #3

Closed by tgjohnst 1 year ago

tgjohnst commented 2 years ago

Thanks so much for making this workflow available and dockerizing it!

When testing it with a custom viral genome file (-v), I noticed that the workflow would run, but early on it printed a suspicious "[E::bwa_idx_load_from_disk] fail to locate the index files" message. The rest of the run continued and ultimately failed to find any integration sites.

It turns out this was because the viral fasta I was supplying had not been indexed with bwa index. init_ref.sh indexes the joint reference but not the viral one alone, and the viral fasta is used as the target of the initial mapping step (presumably your included reference is already indexed). This is easy enough to do, but it took a while to figure out: the README doesn't mention that this file needs to be indexed, and I was trying to work out whether the joint indexing had failed.

As far as solutions, I was thinking of either:

  1. Including a note in the README that custom viral genome files must be indexed with bwa index (this wouldn't require any repackaging of the docker container)
  2. Adding a behavior to init_ref.sh that also indexes the supplied viral reference fasta with samtools and bwa if the -v flag is specified. If you'd prefer this not be the default behavior, there could be an additional command-line flag to enable it, or a check for existing bwa index files with the appropriate suffixes so the fasta isn't reindexed when those files already exist.
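
For option 2, here is a rough sketch of what the "only index if missing" check could look like. This is not taken from init_ref.sh; the function names has_bwa_index and index_viral_ref are made up for illustration. It assumes bwa and samtools are on the PATH, and that bwa index writes its usual .amb, .ann, .bwt, .pac and .sa files next to the fasta.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not the actual init_ref.sh logic.

# Suffixes of the files that `bwa index` writes next to the input fasta.
BWA_SUFFIXES="amb ann bwt pac sa"

# Return 0 if every bwa index file exists for the given fasta, 1 otherwise.
has_bwa_index() {
    local fasta="$1" suffix
    for suffix in $BWA_SUFFIXES; do
        [ -f "${fasta}.${suffix}" ] || return 1
    done
    return 0
}

# Index the viral fasta with bwa and samtools, skipping any index that
# is already present so existing files are not rebuilt.
index_viral_ref() {
    local fasta="$1"
    if [ ! -f "$fasta" ]; then
        echo "viral fasta not found: $fasta" >&2
        return 1
    fi
    if ! has_bwa_index "$fasta"; then
        bwa index "$fasta" || return 1
    fi
    if [ ! -f "${fasta}.fai" ]; then
        samtools faidx "$fasta" || return 1
    fi
}

# Example call (the path is a placeholder):
#   index_viral_ref /path/to/custom_viral.fa
```

Checking for the index files first means a pre-indexed reference is left untouched, and a missing fasta fails loudly up front instead of surfacing later as a bwa_idx_load_from_disk error.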

Cheers! Tim J

zstephens commented 1 year ago

Greetings, apologies for missing this (for nearly an entire year)!

Thanks for identifying this! I added a blurb to the README, and if there are more substantive updates in the future I'll push an update to init_ref.sh and the docker container as well.