wodanaz / Assembling_viruses

0 stars 0 forks source link

Removing fix-merged-bed-positions.sh #19

Closed wodanaz closed 3 years ago

wodanaz commented 3 years ago

Hi @johnbradley , since you pointed out the issues associated with the scrip fix-merged-bed-positions.sh to make it more general. I have been thinking a lot about it and decided that I would like to remove it from the pipeline and replaced it with a command that removes the first and last 25 nucleotides of the consensus sequence. It often has very little usable data and very lo depth that many archived sequences don't have them. So I thought on adding the following line to each consensus sequence that is generated in run-bcftools-consensus.sh

!/bin/bash

Make final consensus fasta sequence using the SNPs in the vcf file - consensus

#

SBATCH --job-name=ev-bcftools-concensus

stop if a command fails (non-zero exit status)

set -e

module load bcftools/1.10.2-fasrc01

Determine the file to process in $FILENAMES_FILE based on SLURM_ARRAY_TASK_ID

VCFFILE=$(awk NR==$SLURM_ARRAY_TASK_ID $FILENAMES_FILE)

root=basename $VCFFILE .gatk.filt.vcf bcftools view -Oz $VCFFILE > $VCFFILE.gz bcftools index $VCFFILE.gz bcftools norm -f $GENOME $VCFFILE.gz -Ob -o $root.norm.bcf bcftools consensus -m $root.merged.bed -f $GENOME -p ${root}_ -s ${root} -H A $VCFFILE.gz | seqtk trimfq -b 25 -e 25 > $root.masked.fasta

Also, I am ready to make a small modification to run-bedtools-merge.sh

in wish, I remove the need to generate an extra BED file by:

awk -v GENOME="$GENOME_BASE_NAME" '{ if ( $3 == 0 ) print GENOME "\t" $2 "\t" $2 + 1 }' ${BEDFILE} | bedtools merge | awk '{ print $1 "\t" $2 "\t" $3 - 1 }' > $root.merged.bed

Lastly, I would like to add a floating value of the genome length to replace 29871 in create-coverage-table.sh to insert a computed value that depends upon the reference genome length. by adding:

!/bin/bash

make a table with coverage information

#

SBATCH --job-name=ev-coverage-table

stop if a command fails (non-zero exit status)

set -e

module load bioawk

GENOME_LENGTH=bioawk -c fastx '{print ">" $name ORS length($seq)}' $GENOME

for i in cat depths.list; do root=basename $i .depth.bed percent=awk '{ if ( $3 == 0 ) count++ } END { print 100 - ( count*100 / $GENOME_LENGTH ) }' $i echo $root $percent >> table.tab done

sort -k1,1 -V table.tab > table.sort.tab

What do you think?

wodanaz commented 3 years ago

@johnbradley I forgot to mention that I have seqtk installed in my home directory and if used should be listed as a dependency. right?

johnbradley commented 3 years ago

What do you think?

The changes sound good to me. Do you want me to make these changes, or would you rather make the changes? If you want to make the changes I would be glad to meet and walk you through making and committing them.

I forgot to mention that I have seqtk installed in my home directory and if used should be listed as a dependency. right?

Yes. What version are you using?

wodanaz commented 3 years ago

I can make the commits for now.

the version of seqkt I am using is Version: 1.3-r116-dirty