Closed johnbradley closed 3 years ago
@wodanaz I have a few questions: How can I tell if we are running in campus or hospital mode?
For the campus version:
Devos_6871_210317A5.csv
file coming from?For the hospital version:
@wodanaz I have a few questions: How can I tell if we are running in campus or hospital mode?
For the campus version:
- Where is the
Devos_6871_210317A5.csv
file coming from?
This is coming from pangolin output. or ($EVDIR/$PROJECTNAME.csv)
- Should I create an optional command line argument to specify the week( "Week_29")?
Yes, because it varies, but will depend if there is a mix of weeks in the sample. To make it more general. I would just include the date.tab file from the beginning.
For the hospital version:
- Will the date.tab file included in the DDS project that we download?
No, I should have that file in the directory (I hope to always get this file before the sequencing is done)
- Should I create a command line argument for run-escape-variants.sh to optionally pass in the date.tab file?
Yes, I think this is a great idea. Just in case I don't have the file at the time of running
Best,
Alejo
@wodanaz How should I tell the difference between campus or hospital mode? One options is I to add a command line argument to specify either campus or hospital modes.
Great question. I was thinking that we could deprecate the surveillance mode and replacing it with -m (mode, with h (hospital surveillance) or c (campus surveillance ) or e (experimental) as options). Here, experimental doesn't have patient metadata, so we don't need to change the name in the fasta consensus.
This also means we need to change which files are saved. And, for now the reports needed for mutations are only in the spike protein region. So, after doctors changed their minds on what data is relevant. I run depth_compiler.pl and genotype_compiler.pl to get mutations and their depths, then I filter the vcf files by the spike region with a new bed file that I called spike.bed which I made as:
cat spike.bed
NC_045512 21563 25384
To generate this file for genotypes within spike I made the following change:
module load bedtools2
gunzip *gatk.filt.vcf.gz
ls *gatk.filt.vcf > vcfs.list
for i in `cat vcfs.list`; do
root=`basename $i .gatk.filt.vcf`;
echo '#!/usr/bin/env bash' > $root.vcf2tab.sh;
echo "#SBATCH -N 1" >> $root.vcf2tab.sh;
#echo "#SBATCH --mem 500" >> $root.vcf2tab.sh;
echo "intersectBed -a $i -b spike.bed > $root.spike.vcf" >> $root.vcf2tab.sh;
echo "awk '{ print \$2 \"\\t\" \$5}' $root.spike.vcf | grep -E '22812|22813|22917|23012|23063|23403|23592|23593'' > $root.spike.tab" >> $root.vcf2tab.sh;
done
for file in *vcf2tab.sh ; do sbatch $file ; done
perl genotype_compiler.pl *.spike.tab > spike_genotypes.tab
cp spike_genotypes.tab spike_genotypes.backup.tab
sed -r 's/ /\t/g' spike_genotypes.tab | sed -r 's/.spike.tab//g' | sort -k1,1n > spike_genotypes.final.tab
and for read depths, to get an idea of data quality I run the following,
for i in `cat vcfs.list`; do
root=`basename $i .gatk.filt.vcf`;
echo '#!/usr/bin/env bash' > $root.vcf2tab.sh;
echo "#SBATCH -N 1" >> $root.vcf2tab.sh;
#echo "#SBATCH --mem 500" >> $root.vcf2tab.sh;
echo "intersectBed -a $i -b spike.bed > $root.spike.vcf" >> $root.vcf2tab.sh;
echo "awk '{ print \$2 \"\\t\" \$10}' $root.spike.vcf | grep -E '22812|22813|22917|23012|23063|23403|23592|23593' | sed -r 's/:/\t/g' | awk '{ print \$1 \"\\t\" \$3}'> $root.depth.tab" >> $root.vcf2tab.sh;
done
for file in *vcf2tab.sh ; do sbatch $file ; done
perl depth_compiler.pl *.depth.tab > spike_depths.tab
cp spike_depths.tab spike_depths.backup.tab
sed -r 's/ /\t/g' spike_depths.tab | sed -r 's/.depth.tab//g' | sort -k1,1n > spike_depths.final.tab
Now, I have two tab files. Which, I wish I could merge with the metadata from coverage, date and lineage using bash but I do it in excel.
I just don't know how to transpose tables in bash. I have tried some things from R without success. Do you have any ideas?
Thanks so much John!
I just don't know how to transpose tables in bash. I have tried some things from R without success. Do you have any ideas?
I've never tried to transpose a table in bash. Bash is good for simple tasks, but can rather challenging to use for complicated tasks. This is probably the case for transposing some data. I can add a script to merge the tab files. What programming language do you have the most familiarity with R, python, perl, other?
I more familiar with R, but I can do python as well. Yea note that the table generated by the compiler has samples as columns and transposing to samples as rows would make the merged table easier to understand.
Yeah, I thought transposing was an easy task for bash but it is not lol. i found it very hard to do, and using t() funtion in R was hard to keep the rownames and colnames in shape
Since you are more familiar with R I'll see if I can create an R script to merge these tables.
Thanks, I would love to learn how to make a transposition more efficently.
;)
@wodanaz Should the spike genotype_compiler/depth_compiler code above replace the current genotype_compiler/depth_compiler code? Or should I add the spike code as a new step in addition to the current genotype_compiler/depth_compiler code?
No, they are still the same. the difference is the input tables, which have been pre-filtered and customized to only search for the mutations of concern.
@wodanaz For merging are you wanting to transpose and add in the coverage, date, and lineage from supermetadata.tab?
So if spike_genotypes.final.tab
looks something like this:
S01
22812 G
22813 T
22917 C
And supermetadata.tab
looks like this:
S01$99.92$Week_28$B.3
Instead spike_genotypes.final.tab
should look like this?
Sample Coverage Date Lineage 22812 22813 22917
S01 99.92 Week_28 B.3 G T C
And for spike_depths.final.tab
we would follow the same pattern?
@wodanaz For merging are you wanting to transpose and add in the coverage, date, and lineage from supermetadata.tab? So if
spike_genotypes.final.tab
looks something like this:S01 22812 G 22813 T 22917 C
And
supermetadata.tab
looks like this:S01$99.92$Week_28$B.3
Instead
spike_genotypes.final.tab
should look like this?
Yes, that's exactly how i make it look after excel processing.
Sample Coverage Date Lineage 22812 22813 22917 S01 99.92 Week_28 B.3 G T C
And for
spike_depths.final.tab
we would follow the same pattern?
Yes, although I don't merge depths with the supermetadata table since it is goes to another sheet in the excel document.
Thanks!
If the goal is an excel document python has a library called pandas (which works similar to R dataframes) that can save as an excel spreadsheet with multiple sheets.
That's a great idea and it will help me to save a lot of time :)
Cool. I'll use python/pandas to transform/merge the genotype data into an excel file. Once that's done we can add additional sheets.
It sounds like it works as write.xls in R. But I am looking forward to learn the python/pandas way.
Notes from @wodanaz
Here are two ways I have been changing the name of the consensus sequences manually.
To do this, I need to know if the samples are from the hospital or campus. Because in campus date is by week, in the hospital is by collection day. So I have two different methods.
Right now, every run is from the same week so, I just add it as a string. very easy but it may change in case we start sequencing earlier samples
for the hospital, it's just a bit different. I really need the date table.
date.tab is a tab file that has two columns, one with sample ID and a second column with date